Issue Description

After upgrading PHP on our development server from 5.2 to 5.3, Russian text that we read from our database is displayed on the web page with the wrong encoding.

Environment

  • Dev OS: Debian GNU/Linux 6.0
  • Dev PHP: 5.3.5-0.dotdeb.1
  • Live MySQL: Distrib 5.1.49

Details

In PHP 5.3, the default client library for interacting with MySQL databases changed from libmysql to mysqlnd, which would appear to be the cause of the issue we are encountering.

We are connecting to the database with the following code:

$conn = mysql_pconnect('database.hostname', 'database_user', 'database_password');
mysql_select_db('database', $conn);

The data stored in our database is encoded as UTF-8. Connecting to the database via the command-line client and running queries confirms that the data is intact and encoded properly. However, when we query the database in PHP and try to display the exact same data, it becomes garbled. In this specific case, we're attempting to display Russian characters and the result is non-English, non-Russian characters.

The response headers we receive confirm that the content-type is UTF-8.

We checked the strings before display with mb_detect_encoding in strict mode and with mb_check_encoding, and both reported that the string was UTF-8. We also checked the client encoding with mysql_client_encoding, and it also indicates that the character set is UTF-8.

In performing research, we discovered some suggestions to try to work around this issue:

header("Content-type: text/html; charset=utf-8");
mysql_set_charset('utf8');
mysql_query("SET SESSION character_set_results = 'UTF8'");
mysql_query('SET NAMES UTF8', $conn);

We even tried utf8_encode:

utf8_encode($string);

However, none of these solutions worked.

Running out of options, we upgraded MySQL on our development system to Distrib 5.1.55. After that upgrade, everything displayed correctly when we connected to our development database. Of course, it continues to display incorrectly when we connect to our live database.

Ideally, we would like to resolve this issue without upgrading MySQL on our production servers unless we can verify the exact reason why this isn't working and why the upgrade will fix it. How can we resolve this encoding issue without upgrading MySQL? Alternatively, why does the MySQL upgrade fix the issue?

 Answers

1

If you have made sure that both the tables and the output encoding are UTF-8, almost the only thing left is the connection encoding.

The reason for the change in behaviour when updating servers could be a change of the default connection encoding:

[mysql]
default-character-set=utf8

However, I can't see any changes in the default encoding between versions, so if those were brand-new installs, I can't see that happening.

Anyway, what happens if you run these statements from within your PHP script and output the results? Are there any differences from the command-line output?

 SHOW VARIABLES LIKE 'character_set%';
 SHOW VARIABLES LIKE 'collation%'; 
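
A minimal sketch of how those two statements could be run from PHP and compared with what the command-line client reports. It assumes the mysqli extension rather than the mysql_* functions used in the question. The helper names fetchCharsetVariables and diffVariables are hypothetical, and the sample values are illustrative, not taken from the question.

```php
<?php
// Hypothetical helper: collect the character set and collation variables
// as seen by the PHP connection. Assumes an open mysqli connection.
function fetchCharsetVariables(mysqli $conn): array
{
    $vars = [];
    foreach (['character_set%', 'collation%'] as $pattern) {
        $result = $conn->query("SHOW VARIABLES LIKE '" . $pattern . "'");
        while ($row = $result->fetch_assoc()) {
            $vars[$row['Variable_name']] = $row['Value'];
        }
    }
    return $vars;
}

// Return only the variables whose values differ between two snapshots,
// e.g. one copied from the command-line client and one fetched from PHP.
function diffVariables(array $cli, array $php): array
{
    $diff = [];
    foreach ($cli as $name => $value) {
        $other = $php[$name] ?? null;
        if ($other !== $value) {
            $diff[$name] = ['cli' => $value, 'php' => $other];
        }
    }
    return $diff;
}

// Illustrative values only.
$cli = ['character_set_client' => 'utf8', 'character_set_results' => 'utf8'];
$php = ['character_set_client' => 'utf8', 'character_set_results' => 'latin1'];
print_r(diffVariables($cli, $php));
```

Any variable that shows up in the diff is a candidate for where the connection encoding diverges between the two clients.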
Friday, November 18, 2022
 
2

I would do it slightly differently.

I assume you have a table battle and for each battle there can be multiple plays where two people play together.

battle_results would contain:

play_id   battle_id   play_date    play_type
      1     DeRa001   2011-01-01       multi

Now you need a new table participants that lists the participants in each play:

play_id user_id is_winner position
      1     007         0    Side1
      1     010         1    Side2

That table would have (play_id, user_id) as the primary key, so the same user can't play twice for the same play (this solves the problem that the same combination can be inserted twice with a different "direction"). Note that play_id is unique for a single play, so you'll always have two rows in there with the same play_id.

Edit: you can indicate a draw by setting is_winner to 0 for all participants in a play.
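
A small sketch of the participants table and its composite primary key. It assumes PDO with an in-memory SQLite database as a stand-in for the real server, and the column types are guesses based on the sample rows above.

```php
<?php
// In-memory SQLite stands in for the real database (an assumption).
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// Column types are assumptions inferred from the sample rows.
$db->exec(
    "CREATE TABLE participants (
        play_id   INTEGER NOT NULL,
        user_id   TEXT    NOT NULL,
        is_winner INTEGER NOT NULL,
        position  TEXT    NOT NULL,
        PRIMARY KEY (play_id, user_id)
    )"
);

$insert = $db->prepare('INSERT INTO participants VALUES (?, ?, ?, ?)');
$insert->execute([1, '007', 0, 'Side1']);
$insert->execute([1, '010', 1, 'Side2']);

// A second row for user 007 in play 1 violates the primary key.
$duplicateRejected = false;
try {
    $insert->execute([1, '007', 1, 'Side2']);
} catch (PDOException $e) {
    $duplicateRejected = true;
}

$rowCount = (int) $db->query('SELECT count(*) FROM participants')->fetchColumn();
var_dump($duplicateRejected, $rowCount);
```

The rejected insert is the point of the composite key: the same user cannot be recorded twice for one play, whichever side they are listed on.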

Finding out with whom a specific user (e.g. 007) played is then simple:

SELECT user_id 
FROM participants
WHERE play_id IN (SELECT p2.play_id
                    FROM participants p2
                   WHERE p2.user_id = '007')
  AND user_id <> '007'

To find the total number of wins for a user:

SELECT count(*)
FROM participants
WHERE user_id = '007'
AND is_winner = 1

To find the win/loss ratio:

SELECT total_wins / total_loss
FROM ( 
  SELECT user_id, 
         count(CASE WHEN is_winner = 0 THEN 1 ELSE NULL END) as total_loss, 
         count(CASE WHEN is_winner = 1 THEN 1 ELSE NULL END) as total_wins
  FROM participants
  GROUP BY user_id
) T
WHERE user_id = '007'
Sunday, October 30, 2022
 
3

You can use html_entity_decode() to convert from HTML to a (real) character encoding.

<?php echo html_entity_decode("&ntilde;", ENT_COMPAT, "UTF-8"); ?>
ñ

Please note that "HTML" isn't a character encoding in the usual sense, so isn't understood by libraries such as iconv, nor by MySQL itself.

I'd also recommend (per the example above) having the whole application use UTF-8. Single-byte character encodings such as ISO-8859-1 are effectively obsolete now that Unicode is so widely supported.

Saturday, October 29, 2022
 
murtaza
 
3

Use the ALTER DATABASE and ALTER TABLE commands.

ALTER DATABASE databasename CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

Or, if you're still on MySQL 5.5.2 or older, which didn't support 4-byte UTF-8, use utf8 instead of utf8mb4:

ALTER DATABASE databasename CHARACTER SET utf8 COLLATE utf8_unicode_ci;
ALTER TABLE tablename CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;
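
If there are many tables, the statements can be generated rather than typed by hand. A rough sketch, assuming you already have the list of table names. The helpers quoteIdentifier and buildConvertStatements are hypothetical, and the table names are placeholders.

```php
<?php
// Hypothetical helper: quote a MySQL identifier with backticks.
function quoteIdentifier(string $name): string
{
    return '`' . str_replace('`', '``', $name) . '`';
}

// Build one ALTER TABLE ... CONVERT TO statement per table name.
function buildConvertStatements(
    array $tables,
    string $charset = 'utf8mb4',
    string $collation = 'utf8mb4_unicode_ci'
): array {
    $statements = [];
    foreach ($tables as $table) {
        $statements[] = sprintf(
            'ALTER TABLE %s CONVERT TO CHARACTER SET %s COLLATE %s;',
            quoteIdentifier($table),
            $charset,
            $collation
        );
    }
    return $statements;
}

// Table names below are placeholders.
foreach (buildConvertStatements(['users', 'posts']) as $sql) {
    echo $sql, "\n";
}
```

Pass 'utf8' and 'utf8_unicode_ci' as the second and third arguments on servers that lack utf8mb4.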
Tuesday, October 11, 2022
 
guidev
 
1

Your problem is that your SET NAMES 'utf8_persian_ci' command was invalid (utf8_persian_ci is a collation, not an encoding). If you run it in a terminal you will see the error Unknown character set: 'utf8_persian_ci'. Thus your application, when it stored the data, was using the latin1 character set. MySQL interpreted your input as latin1 characters, which it then stored encoded as UTF-8. Likewise, when the data was pulled back out, MySQL converted it from UTF-8 back to latin1 and you got back (hopefully, most of the time) the original bytes you gave it.

In other words, all your data in the database is completely messed up, but it just so happened to work.

To fix this, you need to undo what you did. The most straightforward way is using PHP:

  1. SET NAMES latin1;
  2. Select every single text field from every table.
  3. SET NAMES utf8;
  4. Update the same rows using the same string unaltered.
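
The steps above can be sketched roughly as follows. The first part is a byte-level illustration of why the round trip works, using mbstring with ISO-8859-1 standing in for MySQL's latin1 (an assumption, the two are not identical). The repairColumn function is a hypothetical mysqli version of steps 1 to 4 for a single column. It uses set_charset() instead of a raw SET NAMES query and has not been run against a real server.

```php
<?php
// Byte-level illustration only; ISO-8859-1 stands in for MySQL's latin1.
$original = "\u{0633}\u{0644}\u{0627}\u{0645}"; // a short Persian word as UTF-8 bytes

// On insert, the server read each UTF-8 byte as one latin1 character and
// stored that character as UTF-8, so the text ended up encoded twice.
$stored = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Steps 1-2: reading over a latin1 connection converts back, which
// returns the bytes the application originally sent.
$recovered = mb_convert_encoding($stored, 'ISO-8859-1', 'UTF-8');

var_dump(strlen($original), strlen($stored), $recovered === $original);

// Hypothetical sketch of steps 1-4 for one column, assuming mysqli and a
// table with a primary key column. Not executed here; it needs a server.
function repairColumn(mysqli $db, string $table, string $idCol, string $textCol): void
{
    $db->set_charset('latin1');                                        // step 1
    $rows = [];
    $result = $db->query("SELECT `$idCol`, `$textCol` FROM `$table`"); // step 2
    while ($row = $result->fetch_assoc()) {
        $rows[] = $row;
    }
    $db->set_charset('utf8');                                          // step 3
    $stmt = $db->prepare("UPDATE `$table` SET `$textCol` = ? WHERE `$idCol` = ?");
    foreach ($rows as $row) {                                          // step 4
        $text = $row[$textCol];
        $id = $row[$idCol];
        $stmt->bind_param('ss', $text, $id);
        $stmt->execute();
    }
}
```

Take a backup before trying anything like this on real data, and test it on a copy of one table first.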

Alternatively you can perform these steps inside MySQL, but it's tricky because MySQL understands the data to be in a certain character set. You need to modify your text columns to a BLOB type, then modify them back to text types with a utf8 character set. See the section at the bottom of the ALTER TABLE MySQL documentation labeled "Warning" in red.

After you do either of these things, the bytes stored in your database columns will actually be in the character set they claim to be. Then make sure you always call mysql_set_charset('utf8') on any database access from PHP in the future, otherwise you will mess things up again. Note: do not use a plain mysql_query('SET NAMES utf8')! There are corner cases (such as a reset connection) where the charset can revert to latin1 without your knowledge, whereas mysql_set_charset() will set the charset whenever necessary.

It would be best if you switched away from mysql_* functions and used PDO instead with the charset=utf8 parameter in your PDO dsn.
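
A minimal sketch of what that DSN could look like. The helper buildMysqlDsn is hypothetical, and the host and database names are the placeholders from the question above. As far as I know, the charset element of the DSN is only honoured from PHP 5.3.6 onwards, so check your PHP version before relying on it.

```php
<?php
// Hypothetical helper: build a MySQL DSN that pins the connection charset.
function buildMysqlDsn(string $host, string $dbname, string $charset = 'utf8'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

// Placeholder host and database names.
$dsn = buildMysqlDsn('database.hostname', 'database');
echo $dsn, "\n";

// Connecting would then look like this (not executed here; needs a server):
// $pdo = new PDO($dsn, 'database_user', 'database_password', [
//     PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
// ]);
```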

Tuesday, August 23, 2022
 