I have some json I need to decode, alter and then encode without messing up any characters.
If I have a Unicode character in a JSON string, it will not decode. I'm not sure why, since json.org says a string can contain:
any Unicode character except " or \ or control character. But it doesn't work in Python either.
I can use utf8_encode which will allow the string to be decoded with json_decode, however the character gets mangled into something else. This is the result from a print_r of the result array. Two characters.
[Tag] => OdÃ³metro
When I encode the array again, I get the character escaped to ASCII (as a \uXXXX sequence), which is correct according to the JSON spec.
Is there some way I can un-escape this? json_encode gives no such option, utf8_encode does not seem to work either.
Edit: I see there is a JSON_UNESCAPED_UNICODE option for json_encode. However, it's not working as expected. Oh damn, it's only available in PHP 5.4. I will have to use some regex, as I only have 5.3.
$json = json_encode($array, JSON_UNESCAPED_UNICODE);
Warning: json_encode() expects parameter 2 to be long, string ...
Judging from everything you've said, it seems like the original Odómetro string you're dealing with is encoded with ISO 8859-1, not UTF-8.
Here's why I think so:
json_encode produced parseable output after you ran the input string through utf8_encode, which converts from ISO 8859-1 to UTF-8.
You did get mangled output from print_r after using utf8_encode, but the mangled output you got is exactly what you would see when UTF-8 text is interpreted as ISO 8859-1 (ó is \xc3\xb3 in UTF-8, and that byte sequence reads as Ã³ in ISO 8859-1).
Your htmlentities hackaround solution worked. htmlentities needs to know the encoding of the input string to work correctly. If you don't specify one, it assumes ISO 8859-1. (html_entity_decode, confusingly, defaults to UTF-8, so your method had the effect of converting from ISO 8859-1 to UTF-8.)
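The byte-level reasoning in the points above can be checked with a short sketch. It is written in Python rather than PHP (the question notes that Python shows the same failure), and the helper name latin1_to_utf8 is hypothetical: it only mimics the ISO 8859-1 to UTF-8 conversion that utf8_encode is described as performing here.

```python
# Illustrative sketch: the sample word and the helper name are assumptions.
word = "Odómetro"

latin1_bytes = word.encode("iso-8859-1")  # ó is the single byte 0xF3
utf8_bytes = word.encode("utf-8")         # ó is the two bytes 0xC3 0xB3


def latin1_to_utf8(raw: bytes) -> bytes:
    """Read the bytes as ISO 8859-1 text and re-encode that text as UTF-8."""
    return raw.decode("iso-8859-1").encode("utf-8")


# ISO 8859-1 bytes are not valid UTF-8, so a strict decoder rejects them.
try:
    latin1_bytes.decode("utf-8")
    latin1_is_valid_utf8 = True
except UnicodeDecodeError:
    latin1_is_valid_utf8 = False

# Viewing correct UTF-8 bytes as ISO 8859-1 produces the two-character mangling.
mangled = utf8_bytes.decode("iso-8859-1")

print(latin1_is_valid_utf8)
print(mangled)
```

If the mangled form only shows up when printing, the bytes themselves may already be correct UTF-8 and the display is what is assuming ISO 8859-1.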
PHP will use the \uXXXX escaping, but as you noted, this is valid JSON.
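To see that the \uXXXX form loses nothing, here is a small sketch in Python, whose json module also escapes non-ASCII characters by default. Its ensure_ascii=False flag plays a role roughly comparable to JSON_UNESCAPED_UNICODE. The sample dictionary is made up for illustration.

```python
import json

# Hypothetical sample data, modelled on the Tag value from the question.
data = {"Tag": "Odómetro"}

# Default behaviour: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(data)

# With ensure_ascii=False the character is kept as a literal ó.
unescaped = json.dumps(data, ensure_ascii=False)

print(escaped)
print(unescaped)

# Any conforming parser decodes both documents to the same value.
round_trip_equal = json.loads(escaped) == json.loads(unescaped) == data
print(round_trip_equal)
```

Since both forms decode identically, un-escaping is only a cosmetic concern unless a consumer of the JSON cannot handle escapes.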
So, it seems like you need to configure your connection to Postgres so that it will give you UTF-8 strings. The PHP manual indicates you'd do this by appending options='--client_encoding=UTF8' to the connection string. There's also the possibility that the data currently stored in the database is in the wrong encoding. (You could simply use utf8_encode, but this will only support characters that are part of ISO 8859-1.)
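The limitation in that last parenthesis can be illustrated with a rough Python sketch. The helper name fits_in_latin1 and the sample strings are hypothetical. Any character outside ISO 8859-1 cannot travel through a pipeline that assumes that encoding, which is one reason to prefer having the database hand over UTF-8 directly.

```python
def fits_in_latin1(text: str) -> bool:
    """Return True if every character in text exists in ISO 8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


# ó (U+00F3) is part of ISO 8859-1, so this string survives the conversion.
print(fits_in_latin1("Odómetro"))

# The euro sign (U+20AC) is not part of ISO 8859-1, so this one would not.
print(fits_in_latin1("Precio en €"))
```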
Finally, as another answer noted, you do need to make sure that you're declaring the proper charset, with an HTTP header or otherwise (of course, this particular issue might have just been an artifact of the environment where you did your