
In PHP, what's the most elegant way to get the complete list (array of strings) of all the Unicode whitespace characters, encoded in utf8?

I need that to generate test data.



This email (archived here) contains a list of all Unicode whitespace characters encoded in UTF-8, UTF-16, and HTML.

In the archived link look for the 'utf8_whitespace_table' function.

static $whitespace = array(
    "SPACE" => "\x20",
    "NO-BREAK SPACE" => "\xc2\xa0",
    "OGHAM SPACE MARK" => "\xe1\x9a\x80",
    "EN QUAD" => "\xe2\x80\x80",
    "EM QUAD" => "\xe2\x80\x81",
    "EN SPACE" => "\xe2\x80\x82",
    "EM SPACE" => "\xe2\x80\x83",
    "THREE-PER-EM SPACE" => "\xe2\x80\x84",
    "FOUR-PER-EM SPACE" => "\xe2\x80\x85",
    "SIX-PER-EM SPACE" => "\xe2\x80\x86",
    "FIGURE SPACE" => "\xe2\x80\x87",
    "PUNCTUATION SPACE" => "\xe2\x80\x88",
    "THIN SPACE" => "\xe2\x80\x89",
    "HAIR SPACE" => "\xe2\x80\x8a",
    "ZERO WIDTH SPACE" => "\xe2\x80\x8b",
    "NARROW NO-BREAK SPACE" => "\xe2\x80\xaf",
    "MEDIUM MATHEMATICAL SPACE" => "\xe2\x81\x9f",
    "IDEOGRAPHIC SPACE" => "\xe3\x80\x80",
);
Tuesday, October 18, 2022
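Since the goal is test data, the byte strings in the whitespace table above could also be generated from code points rather than typed by hand. A minimal sketch, assuming a hand-rolled encoder: the names `codePointToUtf8` and `whitespaceTable` are made up for illustration, and only a few entries of the table are repeated here.

```php
<?php
// Encode one Unicode code point as a UTF-8 byte string (no extensions needed).
function codePointToUtf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: map character names to UTF-8 strings.
// array_map() keeps the string keys when it is given a single array.
function whitespaceTable(array $codePoints): array
{
    return array_map('codePointToUtf8', $codePoints);
}

$table = whitespaceTable([
    'SPACE'             => 0x0020,
    'NO-BREAK SPACE'    => 0x00A0,
    'OGHAM SPACE MARK'  => 0x1680,
    'EM SPACE'          => 0x2003,
    'IDEOGRAPHIC SPACE' => 0x3000,
]);

echo bin2hex($table['NO-BREAK SPACE']), "\n"; // c2a0
```

Keeping the code points in one place makes it easy to compare the list against whichever Unicode version you are testing for.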

Be careful! If you actually have UTF-8 data stored under another encoding, you could have a real mess on your hands. Back up first. Then try some of the standard methods:

for instance

I've had to resort to converting all text fields to binary, then back to varchar/text. This has saved my ass.

My data was UTF-8, stored as latin1. What I did:

Drop indexes. Convert the fields to binary. Convert them back to text with the utf8 character set and utf8_general_ci collation.
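The binary round trip can be sketched as a pair of MySQL statements per column. This is an illustration only: the helper name `buildReencodeStatements`, the table `articles`, the column `title` and the length 255 are all hypothetical, and the column types have to match your own schema.

```php
<?php
// Hypothetical helper: build the two ALTER statements for one VARCHAR column.
// Going through VARBINARY leaves the stored bytes untouched, so relabeling the
// column as utf8 afterwards does not re-encode them.
// Identifiers are interpolated as-is, so only pass trusted names.
function buildReencodeStatements(string $table, string $column, int $length): array
{
    return [
        "ALTER TABLE `$table` MODIFY `$column` VARBINARY($length)",
        "ALTER TABLE `$table` MODIFY `$column` VARCHAR($length) "
            . "CHARACTER SET utf8 COLLATE utf8_general_ci",
    ];
}

foreach (buildReencodeStatements('articles', 'title', 255) as $sql) {
    echo $sql, ";\n";
}
```

Printing the statements first, instead of executing them directly, gives you a chance to review them against a backup copy.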

If you're on LAMP, don't forget to issue a SET NAMES command before interacting with the db, and make sure you set character encoding headers.

Sunday, September 4, 2022

I resolved this issue and it ended up having nothing to do with iconv like I initially thought. The change that was required was such a small one, only one character, but it took me ages to hunt this down. It turns out that the offending statement was actually the following:

preg_replace('/\s+/', ' ', $columnvalue)

The purpose of this regular expression is to collapse runs of white space in the value, but because the data was UTF-8 it had the side effect of corrupting the à character. In UTF-8, à is the two bytes C3 A0, and without the u modifier the pattern is applied byte by byte, so depending on the PCRE build and locale \s can match the lone A0 byte (a no-break space in Latin-1) and replace it. I resolved this by adding u (the Unicode modifier) to the end of the regular expression. So the expression became:

preg_replace('/\s+/u', ' ', $columnvalue)

And then the encoding of the page was correct.
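A small sketch of the fixed call, wrapped in a function so it can be tried in isolation. The function name `collapseWhitespace` and the sample value are made up for illustration.

```php
<?php
// Hypothetical wrapper around the fixed expression from the answer.
function collapseWhitespace(string $value): string
{
    // With the u modifier the subject is treated as UTF-8, so the bytes
    // C3 A0 are read as the single character "à" and left alone.
    $result = preg_replace('/\s+/u', ' ', $value);
    if ($result === null) {
        // preg_replace() returns null on error, e.g. when the input is not valid UTF-8.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $result;
}

// "voilà" written as explicit bytes, followed by mixed ASCII whitespace.
$columnvalue = "voil\xc3\xa0 \t\n tout";
echo bin2hex(collapseWhitespace($columnvalue)), "\n"; // 766f696cc3a020746f7574
```

One consequence of the u modifier is that invalid UTF-8 input makes the call fail instead of being silently mangled, which is why the sketch checks for null.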

Tuesday, August 9, 2022

As strings in C# are stored as UTF-16 (not UTF-8), the following might do the trick:

MailItem.Body = MailItem.Body.Replace("\u200B", "");
Tuesday, November 15, 2022

First question: it depends on what exactly goes in the string.

In PHP (up to PHP5, anyway), strings are just sequences of bytes. There is no implied or explicit character set associated with them; that's something the programmer must keep track of. So, if you only put valid UTF-8 bytes between the quotes (fairly easy if the file itself is encoded as UTF-8), then the string will be UTF-8, and you can safely use mb_strlen() on it.

Also, if you're using mbstring functions, you need to explicitly tell it what character set your string is, either with mbstring.internal_encoding or as the last argument to any mbstring function.
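To make the byte versus character distinction concrete, here is a short sketch. It assumes the mbstring extension is loaded, and the sample string is made up.

```php
<?php
// "voilà" with the "à" written as explicit UTF-8 bytes, so the example
// does not depend on the encoding of the source file.
$s = "voil\xc3\xa0";

$bytes = strlen($s);             // counts bytes
$chars = mb_strlen($s, 'UTF-8'); // counts characters, encoding passed explicitly

echo $bytes, " bytes, ", $chars, " characters\n"; // 6 bytes, 5 characters
```

Passing the encoding as the last argument keeps the result independent of whatever the internal encoding happens to be configured as.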

Second question: yes, with caveats.

Two strings that are both independently valid UTF-8 can be safely byte-wise concatenated (like with PHP's . operator) and still be valid UTF-8. However, you can never be sure, without doing some work yourself, that a POSTed string is valid UTF-8. Database strings are a little easier, if you carefully set the connection character set, because most DBMSs will do any conversion for you.
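One possible way to do that work for incoming strings is to validate them before concatenating. A minimal sketch: the function name `isValidUtf8` is hypothetical, and it relies on preg_match() failing on malformed UTF-8 when the u modifier is set.

```php
<?php
// Hypothetical helper: true when $s is well-formed UTF-8.
function isValidUtf8(string $s): bool
{
    // With the u modifier, preg_match() returns false (not 0) when the
    // subject is not valid UTF-8; the empty pattern matches anything else.
    return preg_match('//u', $s) === 1;
}

$a = "caf\xc3\xa9";       // valid UTF-8
$b = " \xe2\x80\x83ok";   // valid UTF-8, contains an EM SPACE
$broken = substr($a, 0, 4); // cuts the two-byte "é" in half

var_dump(isValidUtf8($a . $b));      // bool(true)
var_dump(isValidUtf8($broken . $b)); // bool(false)
```

The second case shows where the caveat bites: byte-wise operations such as substr() can produce a string that no longer concatenates into valid UTF-8.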

Monday, November 7, 2022