In PHP, what's the most elegant way to get the complete list (array of strings) of all the Unicode whitespace characters, encoded in utf8?
I need that to generate test data.
Be careful! If you actually have UTF-8 data stored under another encoding, you could have a real mess on your hands. Back up first. Then try some of the standard methods,
for instance http://www.cesspit.net/drupal/node/898 and http://www.hackszine.com/blog/archive/2007/05/mysql_database_migration_latin.html
I've had to resort to converting all text fields to binary, then back to varchar/text. This has saved my ass.
I had UTF-8 data stored as latin1. What I did:
Drop the indexes. Convert the fields to binary. Convert them to utf8 with the utf8_general_ci collation.
If you're on LAMP, don't forget to issue a SET NAMES command before interacting with the database, and make sure you send the character encoding headers.
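The binary round trip described above can be sketched as a pair of ALTER TABLE statements per column. This is a minimal sketch, not the poster's actual script: the helper binaryRoundTripSql and the table and column names are hypothetical, and the BLOB/TEXT types are assumptions that must be matched to the real column. Note that MODIFY restates the whole column definition, so attributes such as NOT NULL would need to be repeated.

```php
<?php
// Hypothetical helper (not from the original post): builds the two ALTER TABLE
// statements that round-trip one column through a binary type.
function binaryRoundTripSql(
    string $table,
    string $column,
    string $binaryType = 'BLOB',   // assumed binary counterpart of the column
    string $textType = 'TEXT'      // assumed final text type
): array {
    return [
        // Step 1: switch to a binary type so the stored bytes are left untouched.
        "ALTER TABLE `{$table}` MODIFY `{$column}` {$binaryType}",
        // Step 2: relabel those same bytes as utf8 text.
        "ALTER TABLE `{$table}` MODIFY `{$column}` {$textType} CHARACTER SET utf8 COLLATE utf8_general_ci",
    ];
}

// 'articles' and 'body' are placeholder names for illustration only.
$statements = binaryRoundTripSql('articles', 'body');
foreach ($statements as $sql) {
    echo $sql, ";\n";
}
```

Printing the statements first, rather than executing them directly, leaves room to review them against a backup before touching real data.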
I resolved this issue, and it ended up having nothing to do with iconv as I initially thought. The required change was tiny, only one character, but it took me ages to hunt down. It turns out that the offending statement was the following:
preg_replace('/\s+/', ' ', $columnvalue)
The purpose of this regular expression is to collapse runs of whitespace in the value, but because the data was UTF-8, the expression had the side effect of corrupting the à character. I resolved this by adding u (the Unicode modifier) to the end of the regular expression. So the expression became:
preg_replace('/\s+/u', ' ', $columnvalue)
And then the encoding of the page was correct.
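A small sketch of the fixed call, with a check that the result is still valid UTF-8. A likely explanation, which the post itself does not state, is that without the u modifier the subject is matched byte by byte, and the second byte of à (0xA0) can be classified as whitespace depending on the PCRE build and locale. Because that part is environment dependent, the sketch only demonstrates the u variant; the sample string is made up.

```php
<?php
// "voilà", two spaces, "tout", with à written as its UTF-8 bytes 0xC3 0xA0
// so the example does not depend on this file's encoding.
$columnvalue = "voil\xC3\xA0  tout";

// With the u modifier the subject is treated as UTF-8, so the two bytes of à
// are handled as one character and are never split.
$clean = preg_replace('/\s+/u', ' ', $columnvalue);

// An empty pattern with the u modifier only matches when the subject is valid UTF-8.
$stillValid = preg_match('//u', $clean) === 1;

var_dump($clean === "voil\xC3\xA0 tout", $stillValid);
```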
As strings in C# are stored as UTF-16 (not UTF-8), the following might do the trick:
MailItem.Body = MailItem.Body.Replace("\u200B", "");
First question: it depends on what exactly goes in the string.
In PHP (up to PHP5, anyway), strings are just sequences of bytes. There is no implied or explicit character set associated with them; that's something the programmer must keep track of. So, if you only put valid UTF-8 bytes between the quotes (fairly easy if the file itself is encoded as UTF-8), then the string will be UTF-8, and you can safely use mb_strlen() on it.
Also, if you're using mbstring functions, you need to explicitly tell it what character set your string is, either with mbstring.internal_encoding or as the last argument to any mbstring function.
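To illustrate the difference between bytes and characters, here is a minimal sketch that assumes the mbstring extension is loaded. The sample string is made up, and the encoding is passed explicitly as the last argument instead of relying on a configured default.

```php
<?php
// "café" with é written as its two UTF-8 bytes, so this file's own encoding
// does not matter.
$s = "caf\xC3\xA9";

$bytes = strlen($s);              // counts bytes
$chars = mb_strlen($s, 'UTF-8');  // counts characters, charset passed explicitly

echo $bytes, " bytes, ", $chars, " characters\n";  // prints "5 bytes, 4 characters"
```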
Second question: yes, with caveats.
Two strings that are both independently valid UTF-8 can be safely byte-wise concatenated (like with PHP's . operator) and still be valid UTF-8. However, you can never be sure, without doing some work yourself, that a POSTed string is valid UTF-8. Database strings are a little easier, if you carefully set the connection character set, because most DBMSs will do any conversion for you.
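One way to do that work yourself is to validate untrusted input before using it. The sketch below uses a hypothetical helper, isValidUtf8, built on the fact that preg_match with the u modifier fails on malformed UTF-8; if mbstring is available, mb_check_encoding($s, 'UTF-8') is an alternative. The sample strings stand in for POSTed data.

```php
<?php
// Hypothetical helper: true when $s is a well-formed UTF-8 byte sequence.
function isValidUtf8(string $s): bool
{
    // Under the u modifier, preg_match returns false (not 0) for malformed UTF-8,
    // and 1 for any valid subject because the empty pattern always matches.
    return preg_match('//u', $s) === 1;
}

$a = "na\xC3\xAFve";     // valid: "naïve"
$b = " caf\xC3\xA9";     // valid: " café"
$truncated = "caf\xC3";  // invalid: lead byte with no continuation byte

var_dump(isValidUtf8($a . $b));     // two valid strings, concatenated
var_dump(isValidUtf8($truncated));  // malformed input, e.g. from a form
```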
This email (archived here) contains a list of all Unicode whitespace characters encoded in UTF-8, UTF-16, and HTML.
In the archived link look for the 'utf8_whitespace_table' function.
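For reference, here is a minimal sketch that builds such a list directly. It is not the utf8_whitespace_table function from the email; the function names are hypothetical. The code points are the ones carrying the Unicode White_Space property as I understand the current standard (25 in total). Depending on the Unicode version and on what you want to count as whitespace for testing, you may want to add characters such as U+180E, U+200B, or U+FEFF.

```php
<?php
// Hypothetical helper: encode one Unicode code point as a UTF-8 byte string.
function codepointToUtf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
            . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
        . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F))
        . chr(0x80 | ($cp & 0x3F));
}

// Returns the White_Space code points (assumed list, see the note above)
// as an array of UTF-8 encoded strings.
function unicodeWhitespaceUtf8(): array
{
    $codepoints = array_merge(
        range(0x0009, 0x000D),            // tab, LF, VT, FF, CR
        [0x0020, 0x0085, 0x00A0, 0x1680], // space, NEL, no-break space, ogham space mark
        range(0x2000, 0x200A),            // en quad through hair space
        [0x2028, 0x2029, 0x202F, 0x205F, 0x3000]
    );
    return array_map('codepointToUtf8', $codepoints);
}

$whitespace = unicodeWhitespaceUtf8();
echo count($whitespace), " whitespace characters\n";
```

Each element is a one-character string, so the array can be fed straight into test fixtures, for example by wrapping a known word in each character in turn.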