Viewed   120 times

I have a MySQL database with words containing accents in Spanish (áéíóú). I'd like to know if there's any way to do a diacritic insensitive search. For instance, if I search for "lapiz" (without accent), i'd like to get results containing the word "lápiz" from my db. The way I'm currently doing the query is as follows:

$result = mysql_query("SELECT * FROM $lookuptable WHERE disabled = '0' AND name LIKE '%$q%' OR productCode LIKE '%$q%' LIMIT $sugglimit");

This is for an online store, so I don't know what people will be searching for... "lapiz" is just and example.

alt text




Character sets & collations, not my favorites, but they DO work:

mysql> SET NAMES latin1;
mysql> SELECT 'lápiz' LIKE 'lapiz';
| 'lápiz' LIKE 'lapiz' |
|                     0 | 
1 row in set (0.01 sec)

mysql> SET NAMES utf8;
mysql> SELECT 'lápiz' LIKE 'lapiz';
| 'lápiz' LIKE 'lapiz' |
|                     1 | 

mysql> SET NAMES latin1;
mysql> SELECT _utf8'lápiz' LIKE _utf8'lapiz' ;
| _utf8'lápiz' LIKE _utf8'lapiz' |
|                               1 | 

A nice chapter to read in the manual:Character Set Support

Friday, December 2, 2022

If you have made sure that both the tables, and the output encoding are UTF-8, almost the only thing left is the connection encoding.

The reason for the change in behaviour when updating servers could be a change of the default connection encoding:


However, I can't see any changes in the default encoding between versions, so if those were brand-new installs, I can't see that happening.

Anyway, what happens if you run this from within your PHP query and output the results. Any differences to the command line output?

 SHOW VARIABLES LIKE 'character_set%';
 SHOW VARIABLES LIKE 'collation%'; 
Friday, November 18, 2022

Character set issues are often really tricky to figure out. Basically, you need to make sure that all of the following are true:

  • The DB connection is using UTF-8
  • The DB tables are using UTF-8
  • The individual columns in the DB tables are using UTF-8
  • The data is actually stored properly in the UTF-8 encoding inside the database (often not the case if you've imported from bad sources, or changed table or column collations)
  • The web page is requesting UTF-8
  • Apache is serving UTF-8

Here's a good tutorial on dealing with that list, from start to finish:

It sounds like your problem is specifically that you've got double-encoded (or triple-encoded) characters, probably from changing character sets or importing already-encoded data with the wrong charset. There's a whole section on fixing that in the above tutorial.

Wednesday, September 28, 2022

MySQL can't make a fulltext (or any) index accross multiple tables. So using a single index is out.

As an alternative, you could either:

  1. Use an index on each table, and a join/union as appropriate to retrieve the rows that match your requirements.

  2. Create an aggregate table to apply the index to.

  3. Use a tool such as lucene or solr to provide your search index. (If you are going for any sort of scale, this is likely the best option)

Monday, September 19, 2022

Unfortunately you cannot do this using a MySQL full-text index. You cannot retrieve '*nited states' instantly from index because left characters are the most important part of the index. However, you can search 'United Sta*'.

// the only possible wildcard full-text search in MySQL

MySQL's full-text performs best when searching whole words in sentences - even that can suck at times. Otherwise, I'd suggest using an external full-text engine like Solr or Sphinx. I think Sphinx allows prefix and suffix wildcards, not sure about the others.

You could go back to MySQL's LIKE clause, but again, running queries like LIKE '%nited states' or LIKE '%nited Stat%', will also suffer on performance, as it can't use the index on the first few characters. 'United Sta%' and 'Unit%States' are okay as the index can be used against the first bunch of known characters.

Another quite major caveat using MySQL's full-text indexing is the stop-word list and minimum word length settings. For example, on a shared hosting environment, you will be limited to words greater than or equal to 4-characters. So searching 'Goo' to get 'Google' would fail. The stop-word list also disallows common words like 'and', 'maybe' and 'outside' - in-fact, there are 548 stop-words all together! Again, if not using shared hosting, these settings are relatively easily to modify, but if you are, then you will get annoyed with some of the default settings.

Sunday, December 11, 2022
Only authorized users can answer the search term. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :