
I am able to use UTF-8 characters just fine in my scripts.

As a matter of fact, it is even possible for the names of variables and functions to contain Unicode characters.

There is also the mbstring extension, which deals with multi-byte strings, yet in countless articles PHP is criticized for its lack of Unicode support.

I don't get it; why is PHP said to not support Unicode?

 Answers

4

When PHP was started several years ago, UTF-8 was not really supported. We are talking about a time when non-Unicode operating systems like Windows 98/Me were still current and when other big languages like Delphi were also non-Unicode. Not all languages were designed with Unicode in mind from day one, and completely changing a language over to Unicode without breaking a lot of existing code is hard. Delphi only became Unicode compatible a year or two ago, for example, while other languages like Java or C# were designed around Unicode from day one.

So as PHP grew into PHP 3, PHP 4 and now PHP 5, simply no one decided to add Unicode. Why? Presumably to stay compatible with existing scripts, or because utf8_encode/utf8_decode and mbstring already existed and worked. I do not know for sure, but I strongly believe it has something to do with organic growth: features do not simply exist by default, they have to be written by someone, and that simply has not happened for PHP yet.

Edit: OK, I read the question wrong. The question is: how are strings stored internally? If I type in "Währung" or "Écriture", which encoding is used to create the bytes? In the case of PHP, it is a single-byte encoding with a codepage; the string is really just a sequence of bytes. That means: if I encode the string using ISO-8859-15 and you decode it with some Chinese codepage, you will get weird results. The alternative is languages like C# or Java, where everything is stored as Unicode internally, which means there is no codepage anymore and, in theory, you cannot mess this up.

I recommend Joel's article about Unicode and character sets, but essentially it boils down to this: how are strings stored internally? With PHP the answer is "not in Unicode", which means you have to be very careful and explicit when processing strings, making sure they stay in the proper encoding during input, storage (database) and output, which is very error-prone.
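A minimal sketch of the pitfall (assuming the mbstring extension is enabled; the string and encodings are just examples): because PHP strings are plain byte sequences, byte-oriented functions miscount multi-byte UTF-8 text, and reinterpreting the bytes under the wrong codepage mangles it.

$word = "Währung";                        // stored as UTF-8 bytes; "ä" takes two bytes

echo strlen($word);                       // 8: counts raw bytes
echo mb_strlen($word, 'UTF-8');           // 7: counts characters

// Reinterpreting the UTF-8 bytes as ISO-8859-1 produces classic mojibake:
echo mb_convert_encoding($word, 'UTF-8', 'ISO-8859-1');   // WÃ¤hrung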

Monday, September 12, 2022
5

You don't need to convert the integer to a hexadecimal string; use IntlChar::chr instead:

echo IntlChar::chr(127468);

Directly from the docs of IntlChar::chr:

Return Unicode character by code point value
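
For illustration, a quick check (this assumes PHP 7+ with the intl extension, which is what provides IntlChar; mb_chr from mbstring is an alternative on PHP 7.2+):

echo IntlChar::chr(0x1F1EC);    // same character as above, code point given in hex
echo IntlChar::chr(252);        // ü
echo mb_chr(252, 'UTF-8');      // ü, the mbstring equivalent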

Saturday, November 12, 2022
 
2

No, but you can easily make one:

<?php
/**
 * Return a Unicode character by its code point.
 *
 * @param int $u Code point, e.g. 0x20AC
 * @return string The character, encoded as UTF-8
 */
function unichr($u) {
    // Build a numeric HTML entity and let mbstring decode it to UTF-8.
    return mb_convert_encoding('&#' . intval($u) . ';', 'UTF-8', 'HTML-ENTITIES');
}
?>

Taken from: PHP Manual - chr - comments
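
A quick usage check, with the output shown in comments (note that newer PHP versions deprecate passing 'HTML-ENTITIES' to mb_convert_encoding, so on current releases IntlChar::chr or mb_chr is likely preferable; treat this as a sketch for older setups):

echo unichr(0x20AC);   // €
echo unichr(228);      // ä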

Monday, September 26, 2022
 
rxread
 
3

It is up to you to decide, on the basis of the purpose and nature of your application, whether you apply normalization when reading user input, when storing it to a database, when writing it out, or at all. To summarize the long thread mentioned in the comments to the question, also available in the official list archive at http://validator.w3.org/feedback.html:

  • The warning message comes from the experimental “HTML5 validation” (which is really a linter, applying subjective rules in addition to some formal tests).
  • The message is not based on any requirement in HTML5 drafts but on opinions on what might cause problems in some software.
  • Those opinions originally made the “HTML5 validation” issue an error message; it has since been downgraded to a warning.

It is certainly possible, though uncommon, to get unnormalized data as user input. This does not depend on normalization carried out by browsers (they don’t do such things, though they conceivably might in the future) but on input methods and habits. For example, methods of typing the letter ü (u umlaut, or u with diaeresis) tend to produce the character in precomposed form, as normalized. People can produce it as unnormalized, in decomposed form, as letter u followed by combining diaeresis, but they usually have no reason to do so, and most people wouldn’t even know how to do that.

If you do string comparisons in your software, they may or may not (depending on comparison routines used) treat e.g. a precomposed ü as equal to the decomposed presentation. Simple implementations treat them as different, as they are definitely distinct at the simple character level (Unicode code points).
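
A small sketch of that comparison pitfall (assuming PHP 7+ for the \u{...} string escape and the intl extension, which provides the Normalizer class):

$precomposed = "\u{00FC}";     // "ü" as a single code point, U+00FC
$decomposed  = "u\u{0308}";    // "u" followed by COMBINING DIAERESIS, U+0308

var_dump($precomposed === $decomposed);   // bool(false): different byte sequences

// After normalizing both to NFC they compare as equal:
var_dump(Normalizer::normalize($precomposed, Normalizer::FORM_C)
     === Normalizer::normalize($decomposed, Normalizer::FORM_C));   // bool(true)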

One reason to normalize at some point, in the writing phase at the latest, is that precomposed characters generally get displayed more reliably. To present a normalized ü, a program just has to pick up a glyph from a font. To present a decomposed ü, a program must either recognize it as canonically equivalent to the normalized ü or write the letter u with a diaeresis symbol properly placed above it, with due attention to the graphic properties of the glyph for u, and many programs fail in this.

On the other hand, in the rare cases where unnormalized data is received as user input, the user may well have a reason to have produced it. He may have the idea that normalized ü and unnormalized ü are distinct and need to be treated as such.

Thursday, December 22, 2022
 
soniiic
 