Viewed   105 times

I am loading a HTML from an external server. The HTML markup has UTF-8 encoding and contains characters such as ?,š,?,?,ž etc. When I load the HTML with file_get_contents() like this:

$html = file_get_contents('');

It messes up the UTF-8 characters and loads Å, ¾, ¤ and similar nonsense instead of proper UTF-8 characters.

How can I solve this?


I tried both saving the HTML to a file and outputting it with UTF-8 encoding. Both doesn't work so it means file_get_contents() is already returning broken HTML.


<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "">
<html xmlns="" xml:lang="sk" lang="sk">

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta http-equiv="Content-Language" content="sk" />



$html = file_get_contents('');
echo htmlentities($html);





Alright. I have found out the file_get_contents() is not causing this problem. There's a different reason which I talk about in another question. Silly me.

See this question: Why Does DOM Change Encoding?

Wednesday, November 23, 2022

As I continued my research I came up with an answer for my problem, this piece of code did it!

private function properText($text){
    $text = mb_convert_encoding($text, "HTML-ENTITIES", "UTF-8");
    $text = preg_replace('~^(&([a-zA-Z0-9]);)~',htmlentities('${1}'),$text);

Now all the characters (and all the new ones I've seen) that troubled me are displayed correctly!

Monday, November 21, 2022

How about this one????

For this one I used header('Content-Type: text/plain;; charset=Windows-1250');

bergamot, citrón, tráva, rebarbora, bazalka;levandu?a, škorica, hruška;céderové drevo, vanilka, pižmo, amberlyn

This code works for me

header('Content-Type: text/plain;charset=Windows-1250');
echo file_get_contents('');

The problem is not with file_get_contents()

I save the $data to a file and the characters were correct but still not encoded correctly by my text editor. See image below.

$data = file_get_contents('');


Seems to be one problematic character as shown here. It also is seen on the HTML image below. Renders as ¾

Its Hex value is xBE (190 decimal)

I tried these two character sets. Neither worked.

header('Content-Type: text/plain; charset=ISO 8859-1');
header('Content-Type: text/plain; charset=ISO 8859-2');


It works by adding a header WITHOUT charset=utf-8.

These two headers work

header('Content-Type: text/plain');
header('Content-Type: text/html');

These two headers do NOT work

header('Content-Type: text/plain; charset=utf-8');
header('Content-Type: text/html; charset=utf-8');

This code is tested and displayed all characters.

header('Content-Type: text/plain');
echo file_get_contents('');

header('Content-Type: text/html');
echo file_get_contents('');

These are some of the problematic characters with their Hex values.
This is the saved file viewed in Notepad++ with UTF-8 Encoding.

Check the Hex values against these character sets.

From the above table I saw the character set was Latin2.

I went to Wikipedia Windows code page and found that Latin2 is Windows-1250

bergamot, citrón, tráva, rebarbora, bazalka;levandu?a, škorica, hruška;céderové drevo, vanilka, pižmo, amberlyn

Tuesday, December 13, 2022

The xml string must not (!) contain the BOM, the BOM is only allowed in byte data (e.g. streams) which is encoded with UTF-8. This is because the string representation is not encoded, but already a sequence of unicode characters.

It therefore seems that you load the string wrong, which is in code you unfortunatley didn't provide.


Thanks for posting the serialization code.

You should not write the data to a MemoryStream, but rather to a StringWriter which you can then convert to a string with ToString. Since this avoids passing through a byte representation it is not only faster but also avoids such problems.

Something like this:

private static string SerializeResponse(Response response)
    var output = new StringWriter();
    var writer = XmlWriter.Create(output);
    new XmlSerializer(typeof(Response)).Serialize(writer, response);
    return output.ToString();
Monday, October 10, 2022

This seems to be a content negotiation problem as file_get_contents probably sends a request that only accepts ISO 8859-1 as character encoding.

You can create a custom stream context for file_get_contents using stream_context_create that explicitly states that you accept UTF-8:

$opts = array('http' => array('header' => 'Accept-Charset: UTF-8, *;q=0'));
$context = stream_context_create($opts);

$filename = ";_ylt=A0oG7lpgGp9NTSYAiQBXNyoA?p=naj%C5%A1%C5%A5astnej%C5%A1%C3%AD&fr2=sb-top&fr=yfp-t-701&type_param=&rd=pref";
echo file_get_contents($filename, false, $context);
Saturday, September 3, 2022
Only authorized users can answer the search term. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :