I have some json I need to decode, alter and then encode without messing up any characters.

If I have a Unicode character in a JSON string, it will not decode. I'm not sure why, since json.org says a string can contain any-Unicode-character-except-"-or-\-or-control-character. But it doesn't work in Python either.

{"Tag":"Odómetro"}

I can use utf8_encode, which allows the string to be decoded with json_decode; however, the character gets mangled into something else. This is the print_r output of the resulting array. The ó has become two characters.

[Tag] => OdÃ³metro

When I encode the array again, I get the character escaped to ASCII, which is correct according to the JSON spec:

"Tag"=>"Od\u00f3metro"

Is there some way I can un-escape this? json_encode gives no such option, utf8_encode does not seem to work either.

Edit: I see there is a JSON_UNESCAPED_UNICODE option for json_encode. However, it's not working as expected. Oh damn, it's only available from PHP 5.4. I will have to use some regex, as I only have 5.3.

$json = json_encode($array, JSON_UNESCAPED_UNICODE);
Warning: json_encode() expects parameter 2 to be long, string ...

 Answers

5

Judging from everything you've said, it seems like the original Odómetro string you're dealing with is encoded with ISO 8859-1, not UTF-8.

Here's why I think so:

  • json_decode was able to parse the input after you ran the string through utf8_encode, which converts from ISO 8859-1 to UTF-8.
  • You did say that you got "mangled" output when using print_r after doing utf8_encode, but the mangled output you got is exactly what happens when UTF-8 text is read as ISO 8859-1 (ó is \xc3\xb3 in UTF-8, and that byte sequence reads as Ã³ in ISO 8859-1).
  • Your htmlentities workaround worked. htmlentities needs to know the encoding of the input string to work correctly. If you don't specify one, it assumes ISO 8859-1. (html_entity_decode, confusingly, defaults to UTF-8, so your method had the effect of converting from ISO 8859-1 to UTF-8.)
  • You said you had the same problem in Python, which would seem to exclude PHP from being the issue.
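
The byte-level reasoning in the bullets above can be checked with a short sketch. It is written in Python rather than PHP only because the raw bytes are easy to inspect there. The helper names (`latin1_to_utf8`, `misread_as_latin1`, `decodes_as_json`) are made up for illustration, and `latin1_to_utf8` only imitates the ISO 8859-1 to UTF-8 conversion that utf8_encode performs.

```python
import json


def latin1_to_utf8(raw: bytes) -> bytes:
    """Imitate PHP's utf8_encode: read bytes as ISO 8859-1, write UTF-8."""
    return raw.decode("latin-1").encode("utf-8")


def misread_as_latin1(utf8_bytes: bytes) -> str:
    """Show what UTF-8 bytes look like when displayed as ISO 8859-1."""
    return utf8_bytes.decode("latin-1")


def decodes_as_json(raw: bytes) -> bool:
    """Return True if the raw bytes parse as JSON (JSON text must be Unicode)."""
    try:
        json.loads(raw)
        return True
    except ValueError:  # covers UnicodeDecodeError and JSONDecodeError
        return False


# Assumption: the source data arrives as ISO 8859-1, so ó is the single byte 0xF3.
original = '{"Tag":"Odómetro"}'.encode("latin-1")
converted = latin1_to_utf8(original)     # ó is now the two bytes 0xC3 0xB3
mangled = misread_as_latin1(converted)   # those two bytes shown as Ã³

print(decodes_as_json(original))   # the Latin-1 bytes are not valid UTF-8
print(decodes_as_json(converted))  # the converted bytes parse
print(mangled)
```

If the three observations line up this way for your data, that supports the ISO 8859-1 theory: the decoder rejects the raw input, accepts the converted input, and a Latin-1 display of the converted text shows two characters where ó should be.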

PHP will use the \uXXXX escaping, but as you noted, this is valid JSON.
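
To illustrate that the escaped and unescaped forms are interchangeable, here is a small Python sketch. Python's `ensure_ascii=False` plays a role comparable to PHP 5.4's JSON_UNESCAPED_UNICODE; the variable names are just for this example.

```python
import json

data = {"Tag": "Odómetro"}

# Default behavior: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(data)

# With ensure_ascii=False the character is written literally.
unescaped = json.dumps(data, ensure_ascii=False)

print(escaped)
print(unescaped)

# Both strings are valid JSON and decode to the same value.
same = json.loads(escaped) == json.loads(unescaped) == data
print(same)
```

Any conforming JSON parser should turn the \u00f3 escape back into ó on decode, so the escaping only matters if a human or a non-JSON tool reads the encoded text directly.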

So, it seems like you need to configure your connection to Postgres so that it will give you UTF-8 strings. The PHP manual indicates you'd do this by appending options='--client_encoding=UTF8' to the connection string. There's also the possibility that the data currently stored in the database is in the wrong encoding. (You could simply use utf8_encode, but this will only support characters that are part of ISO 8859-1).

Finally, as another answer noted, you do need to make sure that you're declaring the proper charset, with an HTTP header or otherwise (of course, this particular issue might have just been an artifact of the environment where you did your print_r testing).

Monday, August 8, 2022
1

Your data is being correctly stored in the database when you write the files using your script. The proof of this is that they come out correct when you echo back out the data. If you do not see the data correctly formatted in your phpmyadmin, it means that the page isn't properly set up to display utf-8.

The easiest way to check is to Ctrl+U to view the source code of your phpmyadmin and look for:

  <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

The very first page shown when you open phpMyAdmin should also display:

MySQL charset: UTF-8 Unicode (utf8)
MySQL connection collation:    utf8_unicode_ci

If you do not see both of these, follow these steps to fix it:

A) The most immediate fix I can think of is forcing the browser's character encoding to UTF-8. In Mozilla Firefox, this can be set via View -> Character Encoding -> Unicode (UTF-8).

or B) A more elegant solution might be to change the phpMyAdmin configuration so that it sends the proper charset in the Content-Type header, as Content-Type: text/html; charset=utf-8. To do this, edit the config.inc.php file found in phpMyAdmin's root directory:

$cfg['DefaultCharset'] = 'utf-8';
$cfg['AllowAnywhereRecoding'] = true;

These configuration changes should send the proper headers. However, if the lang cookie is already set to some other charset, the changes won't take effect, so you may need to clear your cookies to see them.

Credit for above solutions: http://rajeshanbiah.blogspot.com/2004/12/storing-unicode-texts-in-mysql-with.html

Thursday, August 11, 2022
3

I think the encoding you are looking for is Windows code page 1252 (Western European). It is not the same as ISO-8859-1 (or 8859-15 for that matter); the characters in the range 0xA0-0xFF match 8859-1, but cp1252 adds an assortment of extra characters in the range 0x80-0x9F where ISO-8859-1 assigns little-used control codes.

The confusion comes about because when you serve a page as text/html;charset=iso-8859-1, for historical reasons, browsers actually use cp1252 (and will hence submit forms in cp1252 too).

iconv('cp1252', 'utf-8', "\x80 and \x95")
-> "\xe2\x82\xac and \xe2\x80\xa2"
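
The same conversion can be sketched in Python to make the difference between the two encodings visible. The variable names are illustrative; `cp1252` and `latin-1` are the standard Python codec names for Windows-1252 and ISO-8859-1.

```python
raw = b"\x80 and \x95"

# Read as cp1252, the bytes 0x80 and 0x95 are the euro sign and a bullet.
as_cp1252 = raw.decode("cp1252")

# Read as ISO-8859-1, the same bytes are C1 control characters U+0080 and U+0095.
as_latin1 = raw.decode("latin-1")

# Converting the cp1252 reading to UTF-8 matches the iconv result above.
utf8_from_cp1252 = as_cp1252.encode("utf-8")
print(utf8_from_cp1252)

# In the range 0xA0-0xFF the two encodings agree byte for byte.
upper_range_matches = all(
    bytes([b]).decode("cp1252") == bytes([b]).decode("latin-1")
    for b in range(0xA0, 0x100)
)
print(upper_range_matches)
```

This is why text that is really cp1252 usually looks fine when treated as ISO-8859-1 until a character such as a euro sign, curly quote, or bullet shows up.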
Saturday, October 15, 2022
2

Encode the data as UTF-8 before passing it to the json_encode function:

<?php
    // utf8_encode assumes its input is ISO 8859-1 and converts it to UTF-8.
    $array = array();
    $array['copyright_str'] = utf8_encode("Copyright site.com © 2011-2012");
    echo json_encode($array);
?>
Friday, September 30, 2022
 
walmik
 
5

I referred to this question:

How do I convert special UTF-8 chars to their iso-8859-1 equivalent using javascript?

The following functions helped me:

fixed_string = decodeURIComponent(escape(utf_string));

utf_string = unescape(encodeURIComponent(original_string));

The escape and unescape functions, originally used for encoding and decoding query strings, are defined for ISO 8859-1 characters, whereas the newer encodeURIComponent and decodeURIComponent, which do the same job, are defined for UTF-8.
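
To show why mixing the two families of functions converts between encodings, here is a rough Python analogue using `urllib.parse`. It is a sketch of the mechanism, not a port of the JavaScript functions: the helper names are hypothetical, and unlike JavaScript's escape, the second helper raises an error for characters above U+00FF instead of emitting %uXXXX.

```python
from urllib.parse import quote, unquote


def to_utf8_byte_string(text: str) -> str:
    """Analogue of unescape(encodeURIComponent(text)).

    Percent-encode as UTF-8, then decode each %XX as one Latin-1 character.
    """
    return unquote(quote(text, safe=""), encoding="latin-1")


def from_utf8_byte_string(byte_string: str) -> str:
    """Analogue of decodeURIComponent(escape(byte_string)).

    Percent-encode each character as one Latin-1 byte, then decode as UTF-8.
    """
    percent_encoded = quote(byte_string, safe="", encoding="latin-1")
    return unquote(percent_encoded, encoding="utf-8")


utf_string = to_utf8_byte_string("Odómetro")      # ó becomes the pair Ã³
fixed_string = from_utf8_byte_string(utf_string)  # the pair becomes ó again
print(utf_string)
print(fixed_string)
```

The percent-encoded text in the middle is the same in both directions (%C3%B3 for ó); only the interpretation of those two bytes changes, which is the whole trick.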

Thursday, October 6, 2022
 