Asked  2 Years ago    Answers:  5   Viewed   245 times

I currentyl have no clue on how to sort an array which contains UTF-8 encoded strings in PHP. The array comes from a LDAP server so sorting via a database (would be no problem) is no solution. The following does not work on my windows development machine (although I'd think that this should be at least a possible solution):

$array=array('Birnen', 'Äpfel', 'Ungetüme', 'Apfel', 'Ungetiere', 'Österreich');
$oldLocal=setlocale(LC_COLLATE, "0");
var_dump(setlocale(LC_COLLATE, 'German_Germany.65001'));
usort($array, 'strcoll');
var_dump(setlocale(LC_COLLATE, $oldLocal));
var_dump($array);

The output is:

string(20) "German_Germany.65001"
string(1) "C"
array(6) {
  [0]=>
  string(6) "Birnen"
  [1]=>
  string(9) "Ungetiere"
  [2]=>
  string(6) "Äpfel"
  [3]=>
  string(5) "Apfel"
  [4]=>
  string(9) "Ungetüme"
  [5]=>
  string(11) "Österreich"
}

This is complete nonsense. Using 1252 as the codepage for setlocale() gives another output but still a plainly wrong one:

string(19) "German_Germany.1252"
string(1) "C"
array(6) {
  [0]=>
  string(11) "Österreich"
  [1]=>
  string(6) "Äpfel"
  [2]=>
  string(5) "Apfel"
  [3]=>
  string(6) "Birnen"
  [4]=>
  string(9) "Ungetüme"
  [5]=>
  string(9) "Ungetiere"
}

Is there a way to sort an array with UTF-8 strings locale aware?

Just noted that this seems to be PHP on Windows problem, as the same snippet with de_DE.utf8 used as locale works on a Linux machine. Nevertheless a solution for this Windows-specific problem would be nice...

 Answers

2

Eventually this problem cannot be solved in a simple way without using recoded strings (UTF-8 ? Windows-1252 or ISO-8859-1) as suggested by ???????? due to an obvious PHP bug as discovered by Huppie. To summarize the problem, I created the following code snippet which clearly demonstrates that the problem is the strcoll() function when using the 65001 Windows-UTF-8-codepage.

function traceStrColl($a, $b) {
    $outValue=strcoll($a, $b);
    echo "$a $b $outValuern";
    return $outValue;
}

$locale=(defined('PHP_OS') && stristr(PHP_OS, 'win')) ? 'German_Germany.65001' : 'de_DE.utf8';

$string="ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜabcdefghijklmnopqrstuvwxyzäöüß";
$array=array();
for ($i=0; $i<mb_strlen($string, 'UTF-8'); $i++) {
    $array[]=mb_substr($string, $i, 1, 'UTF-8');
}
$oldLocale=setlocale(LC_COLLATE, "0");
var_dump(setlocale(LC_COLLATE, $locale));
usort($array, 'traceStrColl');
setlocale(LC_COLLATE, $oldLocale);
var_dump($array);

The result is:

string(20) "German_Germany.65001"
a B 2147483647
[...]
array(59) {
  [0]=>
  string(1) "c"
  [1]=>
  string(1) "B"
  [2]=>
  string(1) "s"
  [3]=>
  string(1) "C"
  [4]=>
  string(1) "k"
  [5]=>
  string(1) "D"
  [6]=>
  string(2) "ä"
  [7]=>
  string(1) "E"
  [8]=>
  string(1) "g"
  [...]

The same snippet works on a Linux machine without any problems producing the following output:

string(10) "de_DE.utf8"
a B -1
[...]
array(59) {
  [0]=>
  string(1) "a"
  [1]=>
  string(1) "A"
  [2]=>
  string(2) "ä"
  [3]=>
  string(2) "Ä"
  [4]=>
  string(1) "b"
  [5]=>
  string(1) "B"
  [6]=>
  string(1) "c"
  [7]=>
  string(1) "C"
  [...]

The snippet also works when using Windows-1252 (ISO-8859-1) encoded strings (of course the mb_* encodings and the locale must be changed then).

I filed a bug report on bugs.php.net: Bug #46165 strcoll() does not work with UTF-8 strings on Windows. If you experience the same problem, you can give your feedback to the PHP team on the bug-report page (two other, probably related, bugs have been classified as bogus - I don't think that this bug is bogus ;-).

Thanks to all of you.

Sunday, September 18, 2022
5
  • mb_internal_encoding('UTF-8') doesn't do anything by itself, it only sets the default encoding parameter for each mb_ function. If you're not using any mb_ function, it doesn't make any difference. If you are, it makes sense to set it so you don't have to pass the $encoding parameter each time individually.
  • IMO mb_detect_encoding is mostly useless since it's fundamentally impossible to accurately detect the encoding of unknown text. You should either know what encoding a blob of text is in because you have a specification about it, or you need to parse appropriate meta data like headers or meta tags where the encoding is specified.
  • Using mb_check_encoding to check if a blob of text is valid in the encoding you expect it to be in is typically sufficient. If it's not, discard it and throw an appropriate error.
  • Regarding:

    does this mean I have to use all multi byte functions instead of its core functions

    If you are manipulating strings that contain multibyte characters, then yes, you need to use the mb_ functions to avoid getting wrong results. The core string functions only work on a byte level, not a character level, which is what you typically want when working with strings.

  • utf8_general_ci vs. utf8_bin only makes a difference when collating, i.e. sorting and comparing strings. With utf8_bin data is treated in binary form, i.e. only identical data is identical. With utf8_general_ci some logic is applied, e.g. "é" sorts together with "e" and upper case is considered equal to lower case.
Monday, November 21, 2022
 
5

Use natsort

This function implements a sort algorithm that orders alphanumeric strings in the way a human being would while maintaining key/value associations. This is described as a "natural ordering".

Monday, September 19, 2022
 
laika
 
5

The array key is encoded in UTF-8 if it indeed comes as UTF-8 string from the database. Apparently your source code file is not encoded in UTF-8, I'd guess it's encoded in Latin-1. A comparison between a UTF-8 byte sequence and a Latin-1 byte sequence is therefore unsuccessful. Save you source code files in UTF-8 and it should work (consult your text editor).

Wednesday, December 7, 2022
 
ihtcboy
 
3

Here's a Solid way to do it:

$blank = array();
$collection = collect([
    ["name"=>"maroon"],
    ["name"=>"zoo"],
    ["name"=>"ábel"],
    ["name"=>"élof"]
])->toArray();

$count = count($collection);

for ($x=0; $x < $count; $x++) { 
    $blank[$x] = $collection[$x]['name'];
}

$collator = collator_create('en_US');
var_export($blank);
collator_sort( $collator, $blank );
var_export( $blank );

dd($blank);

Outputs:

array (
  0 => 'maroon',
  1 => 'zoo',
  2 => 'ábel',
  3 => 'élof',
)array (
  0 => 'ábel',
  1 => 'élof',
  2 => 'maroon',
  3 => 'zoo',
)

Laravel Pretty Output:

array:4 [
  0 => "ábel"
  1 => "élof"
  2 => "maroon"
  3 => "zoo"
]

For personal Reading and reference: http://php.net/manual/en/class.collator.php

Hope this answer helps, sorry for late response =)

Friday, August 26, 2022
 
nisus
 
Only authorized users can answer the search term. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :
 

Browse Other Code Languages