
Is there a function that converts a UTF-8 string to Unicode code points while leaving non-special characters as plain letters and numbers?

For example, the German word "tchüß" would be rendered as something like "tch20AC21AC" (please note that I am making the Unicode codes up).

EDIT: I am experimenting with the following function. It works well for ASCII 32-127, but it seems to fail for multibyte characters:

function strToHex ($string)
{
    $hex = '';
    for ($i = 0; $i < mb_strlen ($string, "utf-8"); $i++)
    {
        $id = ord (mb_substr ($string, $i, 1, "utf-8"));
        $hex .= ($id <= 128) ? mb_substr ($string, $i, 1, "utf-8") : "&#" . $id . ";";
    }

    return ($hex);
}

Any ideas?

EDIT 2: Found a solution: the PHP ord() function only looks at the first byte of its argument, so it does not work for multibyte characters. Use this instead: http://nl.php.net/manual/en/function.ord.php#78032
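
For comparison, the same idea can be sketched in Python, where ord() operates on whole code points rather than on bytes. The function name to_char_refs is made up for this illustration; it keeps ASCII characters as they are and writes every other character as a decimal numeric character reference.

```python
def to_char_refs(text: str) -> str:
    """Keep ASCII characters; replace the rest with decimal &#N; references."""
    out = []
    for ch in text:
        code_point = ord(ch)  # full code point, not just the first byte
        out.append(ch if code_point < 128 else f"&#{code_point};")
    return "".join(out)


print(to_char_refs("tchüß"))  # tch&#252;&#223;
```

Note the comparison is `< 128`, since 128 itself is already outside ASCII.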

 Answers

1

Converting one character set to another can be done with iconv:

http://php.net/manual/en/function.iconv.php

Note that UTF-8 is already a Unicode encoding.

Another way is simply using htmlentities with the right character set:

http://php.net/manual/en/function.htmlentities.php
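
As a rough Python analogue of these two suggestions (an illustration only, not what iconv or htmlentities do internally): transcoding is a decode followed by an encode, and the built-in xmlcharrefreplace error handler turns anything outside the target character set into numeric character references. PHP's htmlentities emits named entities such as &uuml; where one exists; this sketch emits numeric references instead.

```python
utf8_bytes = "tchüß".encode("utf-8")  # the UTF-8 byte sequence

# Transcode UTF-8 -> UTF-16 (big-endian, no BOM), similar in spirit to iconv.
utf16_bytes = utf8_bytes.decode("utf-8").encode("utf-16-be")

# Replace everything that does not fit into ASCII with &#N; references.
ascii_refs = (
    utf8_bytes.decode("utf-8")
    .encode("ascii", "xmlcharrefreplace")
    .decode("ascii")
)

print(utf16_bytes.hex())  # 00740063006800fc00df
print(ascii_refs)         # tch&#252;&#223;
```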

Thursday, November 10, 2022
2
$utf8string = html_entity_decode(preg_replace('/U\+([0-9A-F]{4})/', '&#x$1;', $string), ENT_NOQUOTES, 'UTF-8');

is probably the simplest solution.
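
The same substitution can be sketched in Python without the detour through HTML entities. The helper name decode_u_plus is hypothetical, and the pattern assumes exactly four uppercase hex digits after a literal "U+", as in the PHP line above.

```python
import re


def decode_u_plus(text: str) -> str:
    """Replace U+XXXX notation (four hex digits) with the actual character."""
    return re.sub(
        r"U\+([0-9A-F]{4})",                 # the plus sign must be escaped
        lambda m: chr(int(m.group(1), 16)),  # hex digits -> code point -> char
        text,
    )


print(decode_u_plus("tchU+00FCU+00DF"))  # tchüß
```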

Thursday, August 4, 2022
3

The easiest solution that I found was one that I just randomly stumbled upon: this official Unicode Properties JSP Web app. I believe this is the query I used:

[:Diacritic=No:]&[:Noncharacter_Code_Point=No:]&[:Deprecated=No:]&[:White_Space=No:]&[:General_Category=Math_Symbol:]|[:General_Category=Symbol:]|[:General_Category=Letter:]|[:General_Category=Punctuation:]|[:General_Category=Currency_Symbol:]|[:General_Category=Number:]&[:General_Category!=Modifier_Letter:]&[:General_Category!=Modifier_Symbol:]

Which yields 107,401 code points. I then filtered out the URI reserved characters and a couple of others just to be safe before storing them in my database. Here is my working prototype, in unadvertised beta.

Some other things I tried, unsuccessfully:

I tried the Perl unichars utility, which I believe has the capability to do what I need, but my version of Perl (5.10.1) is linked to a Unicode 5.x standard; I couldn't quickly find any instructions for upgrading to the Unicode 6.0.0 standard. I had considered writing a Ruby app similar to unichars, but my Ruby install is also on a Unicode 5.2 standard (Ruby 1.9.2, ActiveSupport 3.0.8). I found a way to apparently load a different Unicode table, but there is no documentation for it and the unicode_tables.dat file on my system is a binary file so no easy answer there.

I had also considered parsing the Unicode 6.0.0 standard's UnicodeData.txt file myself, but apparently there are ranges of code points missing, such as Han, which would require me to parse yet another file in its own format.
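
Python's standard unicodedata module bundles its own copy of the Unicode Character Database (the version is exposed as unicodedata.unidata_version), so the General_Category part of the query can be approximated without parsing UnicodeData.txt. This is a sketch under loud assumptions: the Diacritic, Deprecated and Noncharacter_Code_Point properties are not available in unicodedata and are ignored here, and the names KEEP_MAJOR, DROP_EXACT and is_wanted are made up, so the resulting count will not match 107,401.

```python
import sys
import unicodedata

KEEP_MAJOR = {"L", "N", "P", "S"}  # Letter, Number, Punctuation, Symbol
DROP_EXACT = {"Lm", "Sk"}          # Modifier_Letter, Modifier_Symbol


def is_wanted(ch: str) -> bool:
    """Approximate the category filter; whitespace (Z*) is excluded implicitly."""
    cat = unicodedata.category(ch)
    return cat[0] in KEEP_MAJOR and cat not in DROP_EXACT


wanted = [chr(cp) for cp in range(sys.maxunicode + 1) if is_wanted(chr(cp))]

# The count depends on the Unicode version bundled with the interpreter.
print(unicodedata.unidata_version, len(wanted))
```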

Monday, December 26, 2022
 
3

First question: it depends on what exactly goes in the string.

In PHP (up to PHP5, anyway), strings are just sequences of bytes. There is no implied or explicit character set associated with them; that's something the programmer must keep track of. So, if you only put valid UTF-8 bytes between the quotes (fairly easy if the file itself is encoded as UTF-8), then the string will be UTF-8, and you can safely use mb_strlen() on it.

Also, if you're using mbstring functions, you need to explicitly tell them what character set your string is in, either with mbstring.internal_encoding or as the last argument to any mbstring function.
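
The bytes-versus-characters distinction can be illustrated in Python, which keeps the two apart as separate types (str and bytes). This is only an analogy to PHP's strlen() versus mb_strlen(), not a description of PHP internals.

```python
text = "tchüß"               # a sequence of 5 characters
data = text.encode("utf-8")  # the same text as a sequence of UTF-8 bytes

print(len(text))  # 5 characters
print(len(data))  # 7 bytes: ü and ß take two bytes each
```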

Second question: yes, with caveats.

Two strings that are both independently valid UTF-8 can be safely byte-wise concatenated (like with PHP's . operator) and still be valid UTF-8. However, you can never be sure, without doing some work yourself, that a POSTed string is valid UTF-8. Database strings are a little easier, if you carefully set the connection character set, because most DBMSs will do any conversion for you.
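
A minimal Python sketch of both points, assuming the input arrives as raw bytes (as a POST body would); the helper name is_valid_utf8 is hypothetical. Validation here is simply a strict decode attempt.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


a = "tch".encode("utf-8")
b = "üß".encode("utf-8")

print(is_valid_utf8(a + b))  # True: concatenating two valid strings stays valid
print(is_valid_utf8(b[:1]))  # False: a multibyte sequence cut in half
```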

Monday, November 7, 2022
5

You have to encode/decode 4 times to get the desired result:

print(
  # the backslashes are literal characters in the input, hence the doubling
  "Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"

  # any encoding that supports printable ASCII would work here, for example utf-8
  .encode('ascii')

  # unescape the string
  # source: https://.com/a/1885197
  .decode('unicode-escape')

  # latin-1 also works, see https://.com/q/7048745
  .encode('iso-8859-1')

  # finally
  .decode('utf-8')
)


Also, if you can, consider telling your target program (the data source) to emit a different output format, for example a byte array or base64-encoded data.

The unsafe-but-shorter way:

st = "Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"
print(eval("b'"+st+"'").decode('utf-8'))


There is also ast.literal_eval, but it may not be worth using here.
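
For reference, a sketch of what the ast.literal_eval variant could look like. The helper name unescape_utf8 is made up, and the approach assumes the input contains no single quotes, newlines or non-ASCII characters, since it is pasted into a bytes literal.

```python
import ast


def unescape_utf8(escaped: str) -> str:
    """Interpret literal \\xNN escapes as bytes, then decode them as UTF-8."""
    # literal_eval only accepts literals, so unlike eval it cannot run code.
    literal = "b'" + escaped + "'"
    return ast.literal_eval(literal).decode("utf-8")


st = "Je-li pro za\\xc5\\x99azov\\xc3\\xa1n\\xc3\\xad"
print(unescape_utf8(st))  # Je-li pro zařazování
```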

Wednesday, November 23, 2022
 