
I am loading an HTML document from an external server. The markup is UTF-8 encoded and contains characters such as ľ, š, č, ť, ž etc. When I load the HTML with file_get_contents() like this:

$html = file_get_contents('http://example.com/foreign.html');

It messes up the UTF-8 characters and loads Å, ¾, ¤ and similar nonsense instead of proper UTF-8 characters.

How can I solve this?

UPDATE:

I tried both saving the HTML to a file and outputting it with UTF-8 encoding. Neither works, which means file_get_contents() is already returning broken HTML.
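
One way to check whether the bytes returned by file_get_contents() are really broken is to test them for UTF-8 validity. The following is a minimal sketch, assuming the mbstring extension is loaded; isValidUtf8 is a hypothetical helper name, and the byte strings are hand-written samples rather than data from the real server.

```php
<?php
// Hypothetical helper: reports whether a byte string is valid UTF-8.
function isValidUtf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

// "ľ" encoded as UTF-8 is the two bytes C4 BE.
var_dump(isValidUtf8("\xC4\xBE")); // bool(true)

// A lone BE byte, as found in a single-byte encoded file, is not valid UTF-8.
var_dump(isValidUtf8("\xBE"));     // bool(false)
```

In real use the argument would be the string returned by file_get_contents(). If the check fails, the server is probably not sending UTF-8 at all, whatever the markup claims.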

UPDATE2:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="sk" lang="sk">
<head>

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta http-equiv="Content-Language" content="sk" />
<title>Test</title>

</head>
<body>


<?php

$html = file_get_contents('http://example.com');
echo htmlentities($html);

?>

</body>
</html>

 Answers

4

Alright. I have found out that file_get_contents() is not causing this problem. The actual cause is different, and I describe it in another question. Silly me.

See this question: Why Does DOM Change Encoding?

Wednesday, November 23, 2022
5

As I continued my research I came up with an answer for my problem, this piece of code did it!

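private function properText($text){
    // Convert non-ASCII UTF-8 characters to HTML entities.
    // Note: the "HTML-ENTITIES" target is deprecated as of PHP 8.2.
    $text = mb_convert_encoding($text, "HTML-ENTITIES", "UTF-8");
    // This replacement substitutes a match with itself, so it leaves $text unchanged.
    $text = preg_replace('~^(&([a-zA-Z0-9]);)~',htmlentities('${1}'),$text);
    return $text;
}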

Now all the characters (and all the new ones I've seen) that troubled me are displayed correctly!
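
Since the "HTML-ENTITIES" conversion target is deprecated as of PHP 8.2, a similar effect can be had with mb_encode_numericentity(). This is only a sketch, assuming the mbstring extension is loaded; properTextNumeric is a hypothetical name, and unlike the function above it emits numeric references (such as &#353;) rather than named entities.

```php
<?php
// Sketch: replace every non-ASCII character of a UTF-8 string
// with a decimal numeric HTML entity.
// properTextNumeric is a hypothetical name, not part of the original answer.
function properTextNumeric(string $text): string
{
    // Map format: [first code point, last code point, offset, mask].
    // This covers everything from U+0080 to U+10FFFF unchanged.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($text, $map, 'UTF-8');
}

// "\u{0161}" is š and "\u{017E}" is ž, written as escapes so the
// example does not depend on the encoding of the source file.
echo properTextNumeric("\u{0161}korica, pi\u{017E}mo"), "\n";
// prints: &#353;korica, pi&#382;mo
```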

Monday, November 21, 2022
 
4

How about this one????

For this one I used header('Content-Type: text/plain; charset=Windows-1250');

bergamot, citrón, tráva, rebarbora, bazalka;levanduľa, škorica, hruška;céderové drevo, vanilka, pižmo, amberlyn



This code works for me

<?php
header('Content-Type: text/plain;charset=Windows-1250');
echo file_get_contents('http://www.parfumeriafox.sk/source_file.html');
?>


The problem is not with file_get_contents()

I saved $data to a file and the characters were correct, but my text editor still did not decode them correctly. See image below.

$data = file_get_contents('http://www.parfumeriafox.sk/source_file.html');
file_put_contents('doc.txt',$data);

UPDATE

There seems to be one problematic character. It is also visible in the HTML image below, where it renders as ¾.

Its hex value is 0xBE (190 decimal).

I tried these two character sets. Neither worked.

header('Content-Type: text/plain; charset=ISO-8859-1');
header('Content-Type: text/plain; charset=ISO-8859-2');
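
To see why neither of those two charsets helps, the single byte 0xBE can be decoded under each candidate encoding. This is a small sketch that assumes the iconv extension is available and that the underlying iconv library knows the name WINDOWS-1250.

```php
<?php
// Decode the single byte 0xBE under three candidate encodings
// and show the resulting character plus its UTF-8 bytes.
$byte = "\xBE";

foreach (['ISO-8859-1', 'ISO-8859-2', 'WINDOWS-1250'] as $charset) {
    $utf8 = iconv($charset, 'UTF-8', $byte);
    printf("%-12s => %s (UTF-8 bytes: %s)\n", $charset, $utf8, bin2hex($utf8));
}
// ISO-8859-1   => ¾ (UTF-8 bytes: c2be)
// ISO-8859-2   => ž (UTF-8 bytes: c5be)
// WINDOWS-1250 => ľ (UTF-8 bytes: c4be)
```

Only Windows-1250 maps 0xBE to ľ, which is the letter the Slovak text needs at that position.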




END OF UPDATE


It works by adding a header WITHOUT charset=utf-8.

These two headers work

header('Content-Type: text/plain');
header('Content-Type: text/html');

These two headers do NOT work

header('Content-Type: text/plain; charset=utf-8');
header('Content-Type: text/html; charset=utf-8');

This code was tested and displayed all characters correctly.

<?php
header('Content-Type: text/plain');
echo file_get_contents('http://www.parfumeriafox.sk/source_file.html');
?>

<?php
header('Content-Type: text/html');
echo file_get_contents('http://www.parfumeriafox.sk/source_file.html');
?>



These are some of the problematic characters with their hex values.
This is the saved file viewed in Notepad++ with UTF-8 encoding.

Check the hex values against these character sets.

From the above table I saw that the character set was Latin 2.

I went to the Wikipedia article on Windows code pages and found that the Windows code page for Latin 2 is Windows-1250.
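
If the source really is Windows-1250, another option is to convert the fetched bytes to UTF-8 before using them, so the page can keep its UTF-8 header. The following is a sketch under that assumption; toUtf8 is a hypothetical helper, it needs the mbstring and iconv extensions, and the sample bytes are hand-written rather than taken from the real file.

```php
<?php
// Hypothetical helper: return $bytes as UTF-8, converting from
// $sourceCharset only when the input is not already valid UTF-8.
function toUtf8(string $bytes, string $sourceCharset = 'WINDOWS-1250'): string
{
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes; // already UTF-8 (or plain ASCII), leave untouched
    }
    $converted = iconv($sourceCharset, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $sourceCharset to UTF-8");
    }
    return $converted;
}

// "levanduľa, škorica" as Windows-1250 bytes (ľ = 0xBE, š = 0x9A).
$raw = "levandu\xBEa, \x9Akorica";
echo toUtf8($raw), "\n"; // prints: levanduľa, škorica
```

In real use $raw would be the result of file_get_contents(). The validity check is a heuristic: it avoids double-encoding input that is already UTF-8, but it cannot tell which single-byte encoding the input uses.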


bergamot, citrón, tráva, rebarbora, bazalka;levanduľa, škorica, hruška;céderové drevo, vanilka, pižmo, amberlyn

Tuesday, December 13, 2022
 
4

The XML string must not (!) contain the BOM. The BOM is only allowed in byte data (e.g. streams) that is encoded with UTF-8, because the string representation is not encoded but is already a sequence of Unicode characters.

It therefore seems that you load the string incorrectly, in code that you unfortunately didn't provide.

Edit:

Thanks for posting the serialization code.

You should not write the data to a MemoryStream, but rather to a StringWriter which you can then convert to a string with ToString. Since this avoids passing through a byte representation it is not only faster but also avoids such problems.

Something like this:

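// Requires: using System.IO; using System.Xml; using System.Xml.Serialization;
private static string SerializeResponse(Response response)
{
    // Write straight to a string; no byte stream is involved, so no BOM appears.
    using (var output = new StringWriter())
    using (var writer = XmlWriter.Create(output))
    {
        new XmlSerializer(typeof(Response)).Serialize(writer, response);
        writer.Flush(); // make sure buffered XML reaches the StringWriter
        return output.ToString();
    }
}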
Monday, October 10, 2022
 
1

This seems to be a content negotiation problem, as file_get_contents probably sends a request that only accepts ISO-8859-1 as the character encoding.

You can create a custom stream context for file_get_contents using stream_context_create that explicitly states that you accept UTF-8:

$opts = array('http' => array('header' => 'Accept-Charset: UTF-8, *;q=0'));
$context = stream_context_create($opts);

$filename = "http://search.yahoo.com/search;_ylt=A0oG7lpgGp9NTSYAiQBXNyoA?p=naj%C5%A1%C5%A5astnej%C5%A1%C3%AD&fr2=sb-top&fr=yfp-t-701&type_param=&rd=pref";
echo file_get_contents($filename, false, $context);
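
To see which charset the server actually declares, the response headers can be inspected. After an HTTP request made with file_get_contents(), PHP fills the local variable $http_response_header with the raw header lines. The sketch below parses such lines; charsetFromHeaders is a hypothetical helper and the sample headers are made up for illustration.

```php
<?php
// Hypothetical helper: extract the charset declared in the last
// Content-Type line of a list of raw HTTP response header lines.
function charsetFromHeaders(array $headers): ?string
{
    $charset = null;
    foreach ($headers as $line) {
        if (stripos($line, 'Content-Type:') !== 0) {
            continue; // not a Content-Type header
        }
        // Later Content-Type lines win, so after redirects the
        // final response decides the result.
        $charset = preg_match('/charset\s*=\s*"?([^\s;"]+)/i', $line, $m)
            ? $m[1]
            : null;
    }
    return $charset;
}

// Sample header lines, made up for illustration.
$sample = ['HTTP/1.1 200 OK', 'Content-Type: text/html; charset=windows-1250'];
var_dump(charsetFromHeaders($sample)); // string(12) "windows-1250"
```

A null result means the server sent no charset parameter, in which case the encoding has to be determined from the document itself.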
Saturday, September 3, 2022
 