Viewed   67 times
<tag>
????? ?
</tag>

When I try to get the content of the following code using DOMDocument functions, it returns something like:

ÐÐ»ÐµÐºÑ Ðœ

I've tried setting DOMDocument encoding to different values (UTF-8, ISO-8859-1), using mb_convert_encoding, iconv and utf8_encode but without success.

How can I get "????? ?" instead of "ÐÐ»ÐµÐºÑ Ðœ" ?

EDIT: The input is coming from a page loaded with curl. When I output the page content to my browser, the characters are displayed correctly (so I doubt the input is the problem).

 Answers

4

Try:

$string = file_get_contents('your-xml-file.xml');
$string = mb_convert_encoding($string, 'utf-8', mb_detect_encoding($string));
// if you have not escaped entities use
$string = mb_convert_encoding($string, 'html-entities', 'utf-8'); 
$doc = new DOMDocument();
$doc->loadXML($string);
Sunday, September 25, 2022
2

If you want to get :

  • The text
  • that's inside a <div> tag with class="text"
  • that's, itself, inside a <div> with class="main"

I would say the easiest way is not to use DOMDocument::getElementsByTagName -- which will return all tags that have a specific name (while you only want some of them).

Instead, I would use an XPath query on your document, using the DOMXpath class.


For example, something like this should do, to load the HTML string into a DOM object, and instance the DOMXpath class :

$html = <<<HTML
<div class="main">
    <div class="text">
    Capture this text 1
    </div>
</div>

<div class="main">
    <div class="text">
    Capture this text 2
    </div>
</div>
HTML;

$dom = new DOMDocument();
$dom->loadHTML($html);

$xpath = new DOMXPath($dom);


And, then, you can use XPath queries, with the DOMXPath::query method, that returns the list of elements you were searching for :

$tags = $xpath->query('//div[@class="main"]/div[@class="text"]');
foreach ($tags as $tag) {
    var_dump(trim($tag->nodeValue));
}


And executing this gives me the following output :

string 'Capture this text 1' (length=19)
string 'Capture this text 2' (length=19)
Monday, December 5, 2022
5

You have probably come to mix encoding types. For example. A page that is sent as iso-8859-1, but get UTF-8 text encoding from MySQL or XML would typically fail.

To solve this problem you must keep control on input ecodings type in relation to the type of encoding you have chosen to use internal.

If you send it as an iso-8859-1, your input from the user is also iso-8859-1.

header("Content-type:text/html; charset: iso-8859-1");

And if mysql sends latin1 you do not have to do anything.

But if your input is not iso-8859-1 you must converted it, before it's sending to the user or to adapt it to Mysql before it's store.

mb_convert_encoding($text, mb_internal_encoding(), 'UTF-8'); // If it's UTF-8 to internal encoding

Short it means that you must always have input converted to fit internal encoding and convereter output to match the external encoding.


This is the internal encoding I have chosen to use.

mb_internal_encoding('iso-8859-1'); // Internal encoding

This is a code i use.

mb_language('uni'); // Mail encoding
mb_internal_encoding('iso-8859-1'); // Internal encoding
mb_http_output('pass'); // Skip

function convert_encoding($text, $from_code='', $to_code='')
{
    if (empty($from_code))
    {
        $from_code = mb_detect_encoding($text, 'auto');
        if ($from_code == 'ASCII')
        {
            $from_code = 'iso-8859-1';
        }
    }

    if (empty($to_code))
    {
        return mb_convert_encoding($text, mb_internal_encoding(), $from_code);
    }
    return mb_convert_encoding($text, $to_code, $from_code);
}

function encoding_html($text, $code='')
{
    if (empty($code))
    {
        return htmlentities($text, ENT_NOQUOTES, mb_internal_encoding());
    }

    return mb_convert_encoding(htmlentities($text, ENT_NOQUOTES, $code), mb_internal_encoding(), $code);
}
function decoding_html($text, $code='')
{
    if (empty($code))
    {
        return html_entity_decode($text, ENT_NOQUOTES, mb_internal_encoding());
    }

    return mb_convert_encoding(html_entity_decode($text, ENT_NOQUOTES, $code), mb_internal_encoding(), $code);
}
Tuesday, August 2, 2022
2

I use this:

function replaceSpecial($str){
$chunked = str_split($str,1);
$str = ""; 
foreach($chunked as $chunk){
    $num = ord($chunk);
    // Remove non-ascii & non html characters
    if ($num >= 32 && $num <= 123){
            $str.=$chunk;
    }
}   
return $str;
} 
Friday, October 14, 2022
 
5

Try using iconv to convert the charset of the scraped text to the charset you use on your page.

Signature:

string iconv ( string $in_charset , string $out_charset , string $str )

Example:

echo iconv("ISO-8859-1", "UTF-8", $text);
Saturday, October 29, 2022
Only authorized users can answer the search term. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :