
I'm using PHP to handle text from a variety of sources. I don't anticipate it will be anything other than UTF-8, ISO 8859-1, or perhaps Windows-1252. If it's anything other than one of those, I just need to make sure the text gets turned into a valid UTF-8 string, even if characters are lost. Does the //TRANSLIT option of iconv solve this?

For example, would this code ensure that a string is safe to insert into a UTF-8 encoded document (or database)?

function make_safe_for_utf8_use($string) {

    $encoding = mb_detect_encoding($string, "UTF-8,ISO-8859-1,WINDOWS-1252");

    if ($encoding != 'UTF-8') {
        return iconv($encoding, 'UTF-8//TRANSLIT', $string);
    }
    else {
        return $string;
    }
}

 Answers

1

UTF-8 can store any Unicode character. If your encoding is anything else at all, including ISO-8859-1 or Windows-1252, UTF-8 can store every character in it. So you don't have to worry about losing any characters when you convert a string from any other encoding to UTF-8.

Further, both ISO-8859-1 and Windows-1252 are single-byte encodings where any byte is valid, so it is not technically possible to distinguish between them. I would choose Windows-1252 as your default match for non-UTF-8 sequences, as the only bytes that decode differently are those in the range 0x80-0x9F. These decode to various characters like smart quotes and the Euro sign in Windows-1252, whereas in ISO-8859-1 they are invisible control characters which are almost never used. Web browsers may sometimes say they are using ISO-8859-1, but often they will really be using Windows-1252.
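To make the 0x80-0x9F difference concrete, here is a small sketch that decodes the same byte both ways. It assumes the iconv extension is loaded and recognises both encoding names; the variable names are only for illustration.

```php
<?php
// Byte 0x93 is a left curly quote in Windows-1252
// but an invisible C1 control code in ISO-8859-1.
$byte = "\x93";

$asCp1252 = iconv('Windows-1252', 'UTF-8', $byte); // U+201C
$asLatin1 = iconv('ISO-8859-1', 'UTF-8', $byte);   // U+0093

echo bin2hex($asCp1252), "\n"; // e2809c
echo bin2hex($asLatin1), "\n"; // c293
```

Note that five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) have no assigned character in Windows-1252, so depending on the iconv implementation a conversion of those bytes may fail and return false.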

would this code ensure that a string is safe to insert into a UTF-8 encoded document

You would certainly want to set the optional ‘strict’ parameter of mb_detect_encoding to TRUE for this purpose. But I'm not sure this actually covers all invalid UTF-8 sequences: the function does not claim to check a byte sequence for UTF-8 validity explicitly. There have been known cases where mb_detect_encoding guessed UTF-8 incorrectly, though I don't know if that can still happen in strict mode.

If you want to be sure, do it yourself using the W3-recommended regex:

if (preg_match('%^(?:
      [\x09\x0A\x0D\x20-\x7E]            # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
    | \xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
    | \xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
    | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
)*$%xs', $string))
    return $string;
else
    return iconv('CP1252', 'UTF-8', $string);
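
As an alternative sketch, not taken from the answer itself: if the mbstring extension is loaded, mb_check_encoding() can stand in for the regex as the validity test. The helper name to_utf8_or_cp1252 is made up for illustration, and the Windows-1252 fallback is the same assumption the answer makes about non-UTF-8 input.

```php
<?php
// Hypothetical helper: return $string unchanged if it is already valid UTF-8,
// otherwise treat its bytes as Windows-1252 and convert them to UTF-8.
function to_utf8_or_cp1252(string $string): string
{
    if (mb_check_encoding($string, 'UTF-8')) {
        return $string;
    }
    return mb_convert_encoding($string, 'UTF-8', 'Windows-1252');
}

echo bin2hex(to_utf8_or_cp1252("caf\xC3\xA9")), "\n"; // already UTF-8: 636166c3a9
echo bin2hex(to_utf8_or_cp1252("caf\xE9")), "\n";     // single-byte é: 636166c3a9
```

Both calls end up with the same UTF-8 bytes: the first input passes through untouched, the second is converted.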
Tuesday, August 9, 2022
3

Solution: after $description_dom = new DOMDocument(); I placed this code:

$description_html = mb_convert_encoding($description_node, 'HTML-ENTITIES', "UTF-8");

This converts the non-ASCII UTF-8 characters in the markup to HTML entities. Instead of

$description_dom->loadHTML( (string)$description_node );

I now load the converted HTML:

$description_dom->loadHTML( (string)$description_html );
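
The same idea as a self-contained sketch. It swaps in mb_encode_numericentity() for the 'HTML-ENTITIES' conversion, since PHP 8.2 deprecates that mbstring pseudo-encoding. The sample markup and variable names are assumptions for illustration, and the dom and mbstring extensions are required.

```php
<?php
// Sample UTF-8 markup (assumed input, standing in for $description_node).
$html = '<p>café</p>';

// Turn every character above U+007F into a numeric entity, so that loadHTML()
// sees pure ASCII and cannot misread the UTF-8 bytes as another encoding.
$map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
$encoded = mb_encode_numericentity($html, $map, 'UTF-8');

$dom = new DOMDocument();
$dom->loadHTML($encoded);
$text = $dom->getElementsByTagName('p')->item(0)->textContent;

echo $encoded, "\n"; // <p>caf&#233;</p>
echo $text, "\n";    // café
```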
Tuesday, September 27, 2022
 
1

The browser helpfully converts the unpresentable-in-Windows-1251 characters to html entities

Well, nearly, except that it's not at all helpful. Now you can't tell the difference between a real “&#411;” that someone typed expecting it to come out as a string of text with a ‘&’ in it, and a ‘ƛ’ character.

I actually do a htmlspecialchars() on the text before displaying it

Yes. You must do that, or else you've got a security problem.

Okay, I serve this form in Windows-1251, but will you please just send me the input in UTF-8 and let me deal with it myself

Yeah, supposedly you send “accept-charset="UTF-8"” in the form tag. But the reality is that doesn't work in IE. To get a form in UTF-8, you must send a form (page) in UTF-8.

I know that the good idea is to switch the whole software to UTF-8,

Yup. Well, at least the encoding of the page containing the form should be UTF-8.

Monday, September 5, 2022
 
1

Try this for software #2:

iconv("UTF-8", "CP437", $this->_output);

Extended ASCII is not the same as plain ASCII. The first program may accept plain ASCII, but the second one requires extended ASCII, specifically code page 437.
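
A minimal sketch of what that conversion does to the bytes, assuming the platform's iconv recognises the name CP437 (glibc does; other implementations may not). The sample string is illustrative only.

```php
<?php
// "é" takes two bytes in UTF-8 (C3 A9) but is the single byte 0x82 in code page 437.
$utf8  = "caf\xC3\xA9";
$cp437 = iconv('UTF-8', 'CP437', $utf8);

if ($cp437 === false) {
    // iconv() returns false when it cannot perform the conversion.
    echo "conversion failed\n";
} else {
    echo bin2hex($cp437), "\n"; // 63616682
}
```

Characters that have no equivalent in code page 437 make a plain conversion fail, so check the return value before using it.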

see this link

Wednesday, December 7, 2022
 
 
4

Found that the following function works:

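function utf8_urldecode($str) {
    // Turn literal "\00XX" escapes into "%u00XX" so the next step can handle them.
    $str = str_replace('\\00', '%u00', $str);
    // Rewrite each %uXXXX escape as a hexadecimal HTML entity.
    $str = preg_replace('/%u([0-9a-f]{3,4})/i', '&#x$1;', urldecode($str));
    // ENT_NOQUOTES has the value 0, which is what the original null was coerced to.
    return html_entity_decode($str, ENT_NOQUOTES, 'UTF-8');
}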

Some parts from http://us2.php.net/manual/en/function.urldecode.php

Monday, September 26, 2022