I have a feed pulled from third-party sites. Sometimes I have to apply utf8_decode, and other times utf8_encode, to get readable output.

If the same function is applied twice by mistake, or the wrong one is used, the output becomes even more garbled. This is what I want to avoid.

How can I detect which function, if any, needs to be applied to a given string?

The feed reports its content as UTF-8, but some parts of it are not actually UTF-8.

 Answers

4

I can't say I can rely on mb_detect_encoding(). I had some freaky false positives a while back.

The most universal way I found to work well in every case was:

// With the /u modifier, PCRE validates the subject as UTF-8 before matching.
if (preg_match('!!u', $string))
{
   // Valid UTF-8 (plain ASCII also passes)
}
else
{
   // Not valid UTF-8
}
Sunday, October 2, 2022
3

Solution: right after $description_dom = new DOMDocument(); I added this code:

$description_html = mb_convert_encoding($description_node, 'HTML-ENTITIES', "UTF-8");

This converts non-ASCII UTF-8 characters into HTML entities, which loadHTML() reads correctly regardless of the encoding it assumes. Instead of

$description_dom->loadHTML( (string)$description_node );

I now load the converted HTML:

$description_dom->loadHTML( (string)$description_html );
Tuesday, September 27, 2022
 
1

Try this for software #2:

iconv("UTF-8", "CP437", $this->_output);

Extended ASCII is not the same as plain ASCII. The first program may accept plain ASCII, but the second requires extended ASCII, specifically code page 437.

see this link

Wednesday, December 7, 2022
 
shades
 
3

In Python 3, all strings are sequences of Unicode characters. There is a bytes type that holds raw bytes.

In Python 2, a string may be of type str or of type unicode. You can tell which with code like this:

def whatisthis(s):
    if isinstance(s, str):
        print "ordinary string"
    elif isinstance(s, unicode):
        print "unicode string"
    else:
        print "not a string"

This does not distinguish "Unicode or ASCII"; it only distinguishes Python types. A Unicode string may consist of purely characters in the ASCII range, and a bytestring may contain ASCII, encoded Unicode, or even non-textual data.
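
For Python 3, a comparable check might look like the sketch below. The function names describe and is_valid_utf8 are made up for illustration, and the second helper only tells you whether a byte string decodes cleanly as UTF-8, not what encoding it was actually written in.

```python
def describe(value):
    """Return a short label for the Python 3 type of `value`."""
    if isinstance(value, str):
        # In Python 3, str always holds Unicode text.
        return "text (str)"
    if isinstance(value, (bytes, bytearray)):
        # Raw bytes: may be ASCII, encoded text, or non-textual data.
        return "raw bytes"
    return "not a string"


def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(describe("café"))
    print(describe("café".encode("utf-8")))
    print(is_valid_utf8(b"caf\xc3\xa9"))  # UTF-8 encoded "café"
    print(is_valid_utf8(b"caf\xe9"))      # Latin-1 encoded "café"
```

A byte string that passes is_valid_utf8 may still be pure ASCII or text in another encoding that happens to be valid UTF-8, so treat a True result as "safe to decode", not as proof of origin.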

Tuesday, August 30, 2022
5

UTF-8 is the encoding MongoDB supports out of the box.

Wednesday, November 30, 2022
 