Asked  2 Years ago    Answers:  5   Viewed   108 times

When using "special" Unicode characters they come out as weird garbage when encoded to JSON:

php > echo json_encode(['foo' => '馬']);
{"foo":"\u99ac"}

Why? Have I done something wrong with my encodings?

(This is a reference question to clarify the topic once and for all, since this comes up again and again.)

 Answers

3

First of all: There's nothing wrong here. This is how characters can be encoded in JSON. It is in the official standard. It is based on how string literals can be formed in JavaScript/ECMAScript (section 7.8.4 "String Literals") and is described as such:

Any code point may be represented as a hexadecimal number. The meaning of such a number is determined by ISO/IEC 10646. If the code point is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the code point. [...] So, for example, a string containing only a single reverse solidus character may be represented as "\u005C".

In short: Any character can be encoded as \u...., where .... is the Unicode code point of the character (or the code point of half of a UTF-16 surrogate pair, for characters outside the BMP).

"馬"
"\u99ac"

These two string literals represent the exact same character; they're absolutely equivalent. When these string literals are parsed by a compliant JSON parser, they will both result in the string "馬". They don't look the same, but they mean the same thing in the JSON data encoding format.
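The equivalence is easy to check. The following is a minimal sketch, assuming PHP 7 or later (for the "\u{...}" string syntax); the variable names are only illustrative. It decodes both forms and also shows the surrogate-pair case for a character outside the BMP:

```php
<?php
// Single quotes: PHP passes \u99ac through untouched, so the JSON parser sees the escape.
$fromEscape  = json_decode('"\u99ac"');
// Double quotes with \u{...}: PHP itself produces the UTF-8 bytes of U+99AC.
$fromLiteral = json_decode("\"\u{99AC}\"");

var_dump($fromEscape === $fromLiteral);  // bool(true)
echo bin2hex($fromEscape), "\n";         // e9a6ac, the UTF-8 bytes of U+99AC

// U+1F600 lies outside the BMP, so json_encode emits a UTF-16 surrogate pair.
echo json_encode("\u{1F600}"), "\n";     // "\ud83d\ude00"
```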

PHP's json_encode encodes non-ASCII characters as \u.... escape sequences by default. Technically it doesn't have to, but it does. And the result is perfectly valid. If you prefer to have literal characters in your JSON instead of escape sequences, you can set the JSON_UNESCAPED_UNICODE flag in PHP 5.4 or higher:

php > echo json_encode(['foo' => '馬'], JSON_UNESCAPED_UNICODE);
{"foo":"馬"}

To emphasise: this is just a preference; it is not necessary in any way to transport "Unicode characters" in JSON.
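As a rough illustration of why the flag is only cosmetic, the sketch below (assuming PHP 7 or later; the variable names are made up for the example) encodes the same array both ways and compares the results after decoding:

```php
<?php
$data = ['foo' => "\u{99AC}"];  // U+99AC written with PHP 7's \u{...} syntax

$escaped   = json_encode($data);                          // {"foo":"\u99ac"}
$unescaped = json_encode($data, JSON_UNESCAPED_UNICODE);  // literal character in the output

// The two JSON texts differ byte for byte...
var_dump($escaped === $unescaped);                                        // bool(false)
// ...but decode to identical data.
var_dump(json_decode($escaped, true) === json_decode($unescaped, true));  // bool(true)
```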

Tuesday, October 18, 2022
2

Going by Gumbo and Pekka's advice, I wrote curl_exec_utf8

/** The same as curl_exec except tries its best to convert the output to utf8 **/
function curl_exec_utf8($ch) {
    $data = curl_exec($ch);
    if (!is_string($data)) return $data;

    unset($charset);
    $content_type = curl_getinfo($ch, CURLINFO_CONTENT_TYPE);

    /* 1: HTTP Content-Type: header */
    preg_match( '@([\w/+]+)(;\s*charset=(\S+))?@i', $content_type, $matches );
    if ( isset( $matches[3] ) )
        $charset = $matches[3];

    /* 2: <meta> element in the page */
    if (!isset($charset)) {
        preg_match( '@<meta\s+http-equiv="Content-Type"\s+content="([\w/]+)(;\s*charset=([^\s"]+))?@i', $data, $matches );
        if ( isset( $matches[3] ) ) {
            $charset = $matches[3];
            /* In case we want to do further processing downstream: */
            $data = preg_replace('@(<meta\s+http-equiv="Content-Type"\s+content="[\w/]+\s*;\s*charset=)([^\s"]+)@i', '$1utf-8', $data, 1);
        }
    }

    /* 3: XML declaration in the page */
    if (!isset($charset)) {
        preg_match( '@<\?xml.+encoding="([^\s"]+)@si', $data, $matches );
        if ( isset( $matches[1] ) ) {
            $charset = $matches[1];
            /* In case we want to do further processing downstream: */
            $data = preg_replace('@(<\?xml.+encoding=")([^\s"]+)@si', '$1utf-8', $data, 1);
        }
    }

    /* 4: PHP's heuristic detection */
    if (!isset($charset)) {
        $encoding = mb_detect_encoding($data);
        if ($encoding)
            $charset = $encoding;
    }

    /* 5: Default for HTML */
    if (!isset($charset)) {
        /* strpos, not strstr: strstr returns a string or false, never 0 */
        if (strpos((string) $content_type, "text/html") === 0)
            $charset = "ISO-8859-1";
    }

    /* Convert it if it is anything but UTF-8 */
    /* You can change "UTF-8"  to "UTF-8//IGNORE" to 
       ignore conversion errors and still output something reasonable */
    if (isset($charset) && strtoupper($charset) != "UTF-8")
        $data = iconv($charset, 'UTF-8', $data);

    return $data;
}

The regexes are mostly from http://nadeausoftware.com/articles/2007/06/php_tip_how_get_web_page_content_type
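To see step 1 in isolation, here is a small sketch that applies the same Content-Type regex to a header value. The function name charset_from_content_type is hypothetical and not part of the answer's code; the quote trimming is an added assumption about how servers may format the value. It assumes PHP 7.1 or later for the nullable return type.

```php
<?php
/**
 * Hypothetical helper: return the charset named in a Content-Type header value,
 * or null if the header does not name one.
 */
function charset_from_content_type(string $contentType): ?string
{
    if (preg_match('@([\w/+]+)(;\s*charset=(\S+))?@i', $contentType, $matches)
        && isset($matches[3])) {
        // Servers may quote the value, e.g. charset="utf-8".
        return trim($matches[3], '"\'');
    }
    return null;
}

var_dump(charset_from_content_type('text/html; charset=ISO-8859-1'));  // string(10) "ISO-8859-1"
var_dump(charset_from_content_type('text/html'));                      // NULL
```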

Sunday, December 4, 2022
 
rings
 
2

Well, finally I solved it using:

mb_convert_encoding($text,'ISO-8859-15','utf-8');
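For reference, a minimal sketch of what that call does, assuming the mbstring extension is loaded and PHP 7 or later. Note the argument order: the target encoding comes first, then the source encoding. The euro sign is used here only because it is one of the characters that ISO-8859-15 has and ISO-8859-1 lacks.

```php
<?php
$utf8   = "\u{20AC}";  // euro sign, three bytes in UTF-8
$latin9 = mb_convert_encoding($utf8, 'ISO-8859-15', 'UTF-8');

echo bin2hex($utf8), ' -> ', bin2hex($latin9), "\n";  // e282ac -> a4
```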
Friday, August 12, 2022
 
2

Like jrturton mentions, ¹, ² and ³ were from a legacy character set (Latin 1) and therefore included in a different place. This also means that lots of fonts don't have support for more superscript numbers, as many only strive for Latin, Greek and Cyrillic with a few punctuation symbols thrown in. So the remaining ones are taken from a different font over which you as an author have little to no control.

As an example:

Those are the superscript numerals from 1 to 9 and 0. The run of text was formatted in Arial in Word. You see what happened to the rest of them. Contrary to what jrturton believes, there is no reshaping of existing glyphs involved. This is just font substitution.
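The split between the two groups is visible in the code points themselves. The sketch below lists the ten superscript digits and the Unicode block each one belongs to; the unicode_block function is a hypothetical helper written for this example and only knows the two blocks relevant here.

```php
<?php
// Code points of the superscript digits 1-9 and 0, in the order discussed above.
$superscripts = [0x00B9, 0x00B2, 0x00B3, 0x2074, 0x2075, 0x2076, 0x2077, 0x2078, 0x2079, 0x2070];

// Hypothetical helper: name the Unicode block for the two ranges that matter here.
function unicode_block(int $cp): string
{
    if ($cp >= 0x0080 && $cp <= 0x00FF) return 'Latin-1 Supplement';
    if ($cp >= 0x2070 && $cp <= 0x209F) return 'Superscripts and Subscripts';
    return 'other';
}

foreach ($superscripts as $cp) {
    // json_decode turns a \uXXXX escape into the corresponding UTF-8 character.
    $char = json_decode(sprintf('"\u%04x"', $cp));
    printf("%s  U+%04X  %s\n", $char, $cp, unicode_block($cp));
}
```

Only the first three land in the Latin-1 Supplement block, which is why fonts with basic Latin coverage tend to include them and not the others.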

Friday, October 7, 2022
 
agonen
 
3

phpMyAdmin uses SHOW TABLE STATUS to get information for your tables.

From the documentation:

Rows

The number of rows. Some storage engines, such as MyISAM, store the exact count. For other storage engines, such as InnoDB, this value is an approximation, and may vary from the actual value by as much as 40 to 50%. In such cases, use SELECT COUNT(*) to obtain an accurate count.

This is due to InnoDB being an ACID compliant storage engine. InnoDB implements MVCC using row-level locking. In short, there can be multiple copies of a given row at a given time. I suggest reading this article: Understanding InnoDB MVCC.

Saturday, November 12, 2022
 
sly
 