Viewed   116 times

When using "special" Unicode characters they come out as weird garbage when encoded to JSON:

php > echo json_encode(['foo' => '?']);
{"foo":"u99ac"}

Why? Have I done something wrong with my encodings?

(This is a reference question to clarify the topic once and for all, since this comes up again and again.)

 Answers

3

First of all: There's nothing wrong here. This is how characters can be encoded in JSON. It is in the official standard. It is based on how string literals can be formed in Javascript ECMAScript (section 7.8.4 "String Literals") and is described as such:

Any code point may be represented as a hexadecimal number. The meaning of such a number is determined by ISO/IEC 10646. If the code point is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point. [...] So, for example, a string containing only a single reverse solidus character may be represented as "u005C".

In short: Any character can be encoded as u...., where .... is the Unicode code point of the character (or the code point of half of a UTF-16 surrogate pair, for characters outside the BMP).

"?"
"u99ac"

These two string literals represent the exact same character, they're absolutely equivalent. When these string literals are parsed by a compliant JSON parser, they will both result in the string "?". They don't look the same, but they mean the same thing in the JSON data encoding format.

PHP's json_encode preferably encodes non-ASCII characters using u.... escape sequences. Technically it doesn't have to, but it does. And the result is perfectly valid. If you prefer to have literal characters in your JSON instead of escape sequences, you can set the JSON_UNESCAPED_UNICODE flag in PHP 5.4 or higher:

php > echo json_encode(['foo' => '?'], JSON_UNESCAPED_UNICODE);
{"foo":"?"}

To emphasise: this is just a preference, it is not necessary in any way to transport "Unicode characters" in JSON.

Tuesday, October 18, 2022
1

Just explaining my comment:

objects in foreach loops are always passed by reference

When you use a foreach loop for an array of objects the variable that you are using inside the loop is a pointer to that object so it works as a reference, any change on the object inside the loop is a change on the object outside. This is because:

objects are always passed by reference (@user3137702 quote)

Detailed and official explanation here.


When you copy and unset your variable:

$copyThing = $thing;
unset($copyThing->property);

you are creating another pointer and unseting it, so the original value is a gone. As a matter of fact, since the foreach loop also uses a pointer the $things array is also affected.

check this ideone (notice the vardump [where the 'a' property is gone], as the output is the same as you got)


I do not know in which version it changed, if ever, as it seems like default object/pointer behavior


As a workaround (some ideas):

  1. Copy your initial array
  2. Use clone: $x = clone($obj); (As long as the default copy constructor works for your objects)
Wednesday, September 21, 2022
 
etl
 
etl
1

You also have to mysql_real_escape_string() your $json_complete in your query, otherwise you'd loose the original /uXXXX encodings.

$json_complete = $_POST['questionnaire'];
$q = json_decode($json_complete, true);

//save complete json string to database
$query_upload = "INSERT INTO questions_json (q_id, json) VALUES('".mysql_real_escape_string($q[qID])."', '".mysql_real_escape_string($json_complete)."') ON DUPLICATE KEY UPDATE json = VALUES(json)";
$result = mysql_query($query_upload) or errorReport("Error in query: $query_upload. ".mysql_error());
if (!$result)
    errorReport($result);
Sunday, September 4, 2022
 
ssell
 
2

Like jrturton mentions, ¹, ² and ³ were from a legacy character set (Latin 1) and therefore included in a different place. This also means that lots of fonts don't have support for more superscript numbers, as many only strive for Latin, Greek and Cyrillic with a few punctuation symbols thrown in. So the remaining ones are taken from a different font over which you as an author have little to no control.

As an example:

Those are the superscript numerals from 1 to 9 and 0. The run of text was formatted in Arial in Word. You see what happened to the rest of them. Contrary to what jrturton believes, there is no reshaping of existing glyphs involved. This is just font substitution.

Friday, October 7, 2022
 
agonen
 
3

First question: it depends on what exactly goes in the string.

In PHP (up to PHP5, anyway), strings are just sequences of bytes. There is no implied or explicit character set associated with them; that's something the programmer must keep track of. So, if you only put valid UTF-8 bytes between the quotes (fairly easy if the file itself is encoded as UTF-8), then the string will be UTF-8, and you can safely use mb_strlen() on it.

Also, if you're using mbstring functions, you need to explicitly tell it what character set your string is, either with mbstring.internal_encoding or as the last argument to any mbstring function.

Second question: yes, with caveats.

Two strings that are both independently valid UTF-8 can be safely byte-wise concatenated (like with PHP's . operator) and still be valid UTF-8. However, you can never be sure, without doing some work yourself, that a POSTed string is valid UTF-8. Database strings are a little easier, if you carefully set the connection character set, because most DBMSs will do any conversion for you.

Monday, November 7, 2022
Only authorized users can answer the search term. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :