
I have an array of strings pulled from a SQL database, all in the system default ANSI encoding. It is a single-byte encoding, so there are 256 possible byte values per character.
Is there a way I can get json_encode() to work and display these characters instead of having to use utf8_encode() on all of my strings and ending up with stuff like \u0082?

Or is that the standard for JSON?

 Answers

4

Is there a way I can get json_encode() to work and display these characters instead of having to use utf8_encode() on all of my strings and ending up with stuff like "\u0082"?

If you have an ANSI-encoded string, utf8_encode() is the wrong function for the job. You need to convert the string properly from ANSI to UTF-8 first. That will certainly reduce the number of Unicode escape sequences like \u0082 in the JSON output, but these sequences are technically valid JSON, so you need not fear them.

Converting ANSI to UTF-8 with PHP

json_encode works with UTF-8 encoded strings only. If you need to create valid json successfully from an ANSI encoded string, you need to re-encode/convert it to UTF-8 first. Then json_encode will just work as documented.

To convert an encoding from ANSI (more correctly I assume you have a Windows-1252 encoded string, which is popular but wrongly referred to as ANSI) to UTF-8 you can make use of the mb_convert_encoding() function:

$str = mb_convert_encoding($str, "UTF-8", "Windows-1252");

Another function in PHP that can convert the encoding / charset of a string is called iconv based on libiconv. You can use it as well:

$str = iconv("CP1252", "UTF-8", $str);
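To see why the conversion has to come first, here is a minimal sketch. The sample bytes are an assumption: "Café" wrapped in Windows-1252 curly quotes (0x93 and 0x94). It relies on the mbstring extension being available.

```php
<?php
// Assumed sample input: Windows-1252 bytes for “Café” (curly quotes are 0x93 / 0x94).
$ansi = "\x93Caf\xE9\x94";

// json_encode() refuses byte sequences that are not valid UTF-8.
$before    = json_encode($ansi);   // false
$errorCode = json_last_error();    // JSON_ERROR_UTF8

// Convert first, then encode.
$utf8  = mb_convert_encoding($ansi, "UTF-8", "Windows-1252");
$after = json_encode($utf8);       // "\u201cCaf\u00e9\u201d"

var_dump($before, $errorCode === JSON_ERROR_UTF8, $after);
```

The escapes in the result are the correct code points for the curly quotes and "é", so a JSON parser restores the original text.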

Note on utf8_encode()

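utf8_encode() only works for Latin-1 (ISO-8859-1), not for ANSI (Windows-1252). The two encodings differ in the byte range 0x80 to 0x9F, where Windows-1252 has printable characters such as the euro sign and curly quotes. Run through utf8_encode(), those bytes become C1 control characters, so you will destroy some of the characters in the string.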
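The difference shows up on a single byte. In this sketch, mb_convert_encoding() with "ISO-8859-1" as the source stands in for utf8_encode(), which is deprecated as of PHP 8.2; the byte 0x80 is just a sample value.

```php
<?php
// 0x80 is the euro sign in Windows-1252, but a C1 control code in Latin-1.
$byte = "\x80";

// Equivalent to what utf8_encode() does: treat the input as ISO-8859-1.
$asLatin1 = mb_convert_encoding($byte, "UTF-8", "ISO-8859-1");
// The conversion that matches the actual source encoding.
$asCp1252 = mb_convert_encoding($byte, "UTF-8", "Windows-1252");

echo json_encode($asLatin1), "\n"; // "\u0080"  (control character, data lost)
echo json_encode($asCp1252), "\n"; // "\u20ac"  (the euro sign)
```

This is where escapes like \u0082 in the question come from: they are Windows-1252 bytes that were interpreted as Latin-1.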


Related: What is ANSI format?


For more fine-grained control over what json_encode() returns, see the list of predefined constants (PHP version dependent, incl. PHP 5.4; some constants remain undocumented and are so far available in the source code only).
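One of those constants, JSON_UNESCAPED_UNICODE (PHP 5.4 and later), addresses the original question directly: it makes json_encode() emit the characters themselves instead of \uXXXX escapes. A small sketch, with the sample array as an assumption:

```php
<?php
// "Caf\xC3\xA9" is "Café" spelled out as UTF-8 bytes, so the example
// does not depend on the encoding of this source file.
$data = ["name" => "Caf\xC3\xA9"];

$escaped = json_encode($data);                         // {"name":"Caf\u00e9"}
$raw     = json_encode($data, JSON_UNESCAPED_UNICODE); // {"name":"Café"}

// Both forms decode to the same value.
$sameValue = json_decode($escaped, true) === json_decode($raw, true);

echo $escaped, "\n", $raw, "\n";
var_dump($sameValue); // bool(true)
```

The input still has to be valid UTF-8; the flag only changes how the output is written.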

Changing the encoding of an array/iteratively (PDO comment)

As you wrote in a comment that you have problems applying the function to an array, here is a code example. You always need to change the encoding before calling json_encode. That's a standard array operation; for the simpler case of PDO::fetch(), a foreach iteration does it:

while ($row = $q->fetch(PDO::FETCH_ASSOC))
{
  foreach ($row as &$value)
  {
    // Convert each column value from Windows-1252 to UTF-8.
    $value = mb_convert_encoding($value, "UTF-8", "Windows-1252");
  }
  unset($value); # safety: remove reference
  $items[] = $row; // already UTF-8; running utf8_encode here would double-encode
}
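For nested result structures, the same idea can be wrapped in a helper. This is only a sketch: to_utf8() is a hypothetical name, the rows are made-up sample data, and array keys are left untouched.

```php
<?php
// Hypothetical helper: convert every string value in a (possibly nested)
// array from $from to UTF-8. Non-string values and keys are not changed.
function to_utf8(array $data, string $from = "Windows-1252"): array
{
    array_walk_recursive($data, function (&$value) use ($from) {
        if (is_string($value)) {
            $value = mb_convert_encoding($value, "UTF-8", $from);
        }
    });
    return $data;
}

// Sample rows with Windows-1252 bytes (0xE9 = é, 0x80 = euro sign).
$rows = [
    ["id" => 1, "name" => "Ren\xE9e"],
    ["id" => 2, "name" => "\x80 5"],
];

$json = json_encode(to_utf8($rows));
echo $json, "\n";
```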
Monday, December 12, 2022
1

The browser helpfully converts the unpresentable-in-Windows-1251 characters to html entities

Well, nearly, except that it's not at all helpful. Now you can't tell the difference between a real “&#411;” that someone typed expecting it to come out as a string of text with a ‘&’ in it, and a ‘ƛ’ character.

I actually do a htmlspecialchars() on the text before displaying it

Yes. You must do that, or else you've got a security problem.
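A short sketch of that escaping step, assuming the text has already been brought to UTF-8 (for a page served in another charset, pass that charset as the third argument instead). The input string is a made-up example.

```php
<?php
// Made-up user input containing markup, a typed-out entity and quotes.
$input = "<b>&#411;</b> & \"quotes\"";

// Escape for HTML output; ENT_QUOTES also escapes single quotes.
$safe = htmlspecialchars($input, ENT_QUOTES, "UTF-8");

echo $safe, "\n";
// &lt;b&gt;&amp;#411;&lt;/b&gt; &amp; &quot;quotes&quot;
```

Note that the typed "&#411;" comes out as "&amp;#411;", so it is displayed literally rather than as a character. That is exactly the distinction that is lost when the browser has already replaced unrepresentable characters with entities.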

Okay, I serve this form in Windows-1251, but will you please just send me the input in UTF-8 and let me deal with it myself

Yeah, supposedly you send “accept-charset="UTF-8"” in the form tag. But the reality is that doesn't work in IE. To get a form in UTF-8, you must send a form (page) in UTF-8.

I know that the good idea is to switch the whole software to UTF-8,

Yup. Well, at least the encoding of the page containing the form should be UTF-8.

Monday, September 5, 2022
 
5

The array key is encoded in UTF-8 if it does indeed come from the database as a UTF-8 string. Apparently your source code file is not encoded in UTF-8; I'd guess it's encoded in Latin-1. A comparison between a UTF-8 byte sequence and a Latin-1 byte sequence therefore fails. Save your source code files in UTF-8 and it should work (consult your text editor).
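The mismatch can be reproduced with explicit bytes. In this sketch the two strings are assumptions standing in for the database key and for a literal typed into a Latin-1 source file:

```php
<?php
$utf8Key       = "\xC3\xA9"; // "é" as it arrives from a UTF-8 database (two bytes)
$latin1Literal = "\xE9";     // "é" as stored in a Latin-1 source file (one byte)

// PHP compares strings byte by byte, so these are different strings.
$same = ($utf8Key === $latin1Literal);

// Once both sides use the same encoding, the comparison succeeds.
$sameAfterConversion =
    mb_convert_encoding($latin1Literal, "UTF-8", "ISO-8859-1") === $utf8Key;

var_dump($same, $sameAfterConversion); // bool(false), bool(true)
```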

Wednesday, December 7, 2022
 
ihtcboy
 
3

Here's a solid way to do it:

$blank = array();
$collection = collect([
    ["name"=>"maroon"],
    ["name"=>"zoo"],
    ["name"=>"ábel"],
    ["name"=>"élof"]
])->toArray();

$count = count($collection);

for ($x=0; $x < $count; $x++) { 
    $blank[$x] = $collection[$x]['name'];
}

$collator = collator_create('en_US');
var_export($blank);
collator_sort( $collator, $blank );
var_export( $blank );

dd($blank);

Outputs:

array (
  0 => 'maroon',
  1 => 'zoo',
  2 => 'ábel',
  3 => 'élof',
)array (
  0 => 'ábel',
  1 => 'élof',
  2 => 'maroon',
  3 => 'zoo',
)

Laravel Pretty Output:

array:4 [
  0 => "ábel"
  1 => "élof"
  2 => "maroon"
  3 => "zoo"
]

For personal Reading and reference: http://php.net/manual/en/class.collator.php

Hope this answer helps, sorry for late response =)

Friday, August 26, 2022
 
nisus
 
2

They are exactly equal; the Unicode-escaped form just takes a bit more space. It's like writing \u004a in Java, which is exactly the same as writing J. If correctness is your concern, it doesn't matter.
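The same equivalence can be checked on the JSON side. A minimal PHP sketch, with two hand-written JSON documents as sample input:

```php
<?php
// The same JSON string value written with and without Unicode escapes.
// Single quotes keep the backslashes literal in the first document.
$escapedJson = '"\u004a \u00e9"';
$rawJson     = "\"J \xC3\xA9\""; // "J é" as UTF-8 bytes

$a = json_decode($escapedJson);
$b = json_decode($rawJson);

var_dump($a === $b);                                // bool(true)
var_dump(strlen($escapedJson), strlen($rawJson));   // int(15), int(6)
```

Only the size on the wire differs; the decoded values are identical.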

And it won't take a considerable amount of extra space either, unless most of your text is in the range 0x2000 - 0x20FF.

The following code escapes the C0 and C1 control characters, but it also escapes 0x2000 - 0x20FF:

     if (c < ' ' || (c >= '\u0080' && c < '\u00a0')
                    || (c >= '\u2000' && c < '\u2100')) {

So any character between 0x2000 - 0x20FF and control characters are represented as unicode escapes. This makes sense for control characters because those are not allowed in JSON in their unescaped form.

As for 0x2000 - 0x20FF, I have no idea, because the code is not commented. Every character in that range is valid JSON when unescaped. However, 0x2028 and 0x2029 are not valid unescaped inside JavaScript string literals (this small detail means JSON syntax is not a subset of JavaScript syntax), so it's a good idea to escape those two in JSON in case it is used as JSONP, which is really JavaScript. But it is not apparent to me why the code escapes the whole range when only 2 characters in it are illegal.
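For comparison, PHP's json_encode() takes the narrower approach since PHP 7.1: even with JSON_UNESCAPED_UNICODE it keeps escaping only U+2028 and U+2029, unless JSON_UNESCAPED_LINE_TERMINATORS is also passed. A small sketch with a made-up sample string:

```php
<?php
// Sample string containing U+2028 (LINE SEPARATOR); "\u{...}" needs PHP 7.0+.
$text = "a\u{2028}b";

$default   = json_encode($text);                         // "a\u2028b"
$unescaped = json_encode($text, JSON_UNESCAPED_UNICODE); // still "a\u2028b"
$rawSep    = json_encode(
    $text,
    JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_LINE_TERMINATORS
); // the separator is written out as raw UTF-8 bytes

var_dump($default, $unescaped, $rawSep === "\"a\xE2\x80\xA8b\"");
```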

Friday, October 21, 2022