
When I fetch a page with file_get_contents($url) and echo the result, the output contains garbled characters.

This happens only with some websites; others display correctly.

My code is:

<?php
header ( "Content-Type: text/html;charset=utf-8" );
$url ="http://www.varzesh3.com/news/1307290";
echo $go_to = file_get_contents($url);
?>

 Answers

4

According to the PHP manual, you can use this code if file_get_contents() returns text in the wrong encoding:

<?php
function file_get_contents_utf8($fn) {
    $content = file_get_contents($fn);
    // Detect the source encoding (strict mode), then convert to UTF-8.
    return mb_convert_encoding($content, 'UTF-8',
        mb_detect_encoding($content, 'UTF-8, ISO-8859-1', true));
}
?>
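
As far as I can tell, with only UTF-8 and ISO-8859-1 as candidates, any page that is not valid UTF-8 gets treated as ISO-8859-1, whatever its real charset is. A possible alternative is to read the charset the server declares. The sketch below parses it from response header lines; charset_from_headers() is a hypothetical helper name and the sample headers are made up.

```php
<?php
// Hypothetical helper: return the charset declared in the last Content-Type
// header line, lowercased, or null if none of the lines declares one.
function charset_from_headers(array $headers) {
    $charset = null;
    foreach ($headers as $line) {
        // Matches e.g. "Content-Type: text/html; charset=Windows-1250"
        if (preg_match('/^Content-Type:.*charset\s*=\s*"?([A-Za-z0-9._:-]+)/i', $line, $m)) {
            $charset = strtolower($m[1]);
        }
    }
    return $charset;
}

// Made-up sample headers, in the shape PHP stores in $http_response_header.
$sample = array(
    'HTTP/1.1 200 OK',
    'Content-Type: text/html; charset=Windows-1250',
);
echo charset_from_headers($sample), PHP_EOL; // prints "windows-1250"
```

After a real file_get_contents($url) call over HTTP, PHP fills $http_response_header in the calling scope, so you could pass that array instead of $sample. Pages that declare their charset only in a meta tag would need a separate check.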
Wednesday, December 7, 2022
4

Alright. I have found out that file_get_contents() is not causing this problem. The real cause is something else, which I discuss in another question. Silly me.

See this question: Why Does DOM Change Encoding?

Wednesday, November 23, 2022
4

How about this one?

For this one I used header('Content-Type: text/plain; charset=Windows-1250');

bergamot, citrón, tráva, rebarbora, bazalka;levanduľa, škorica, hruška;céderové drevo, vanilka, pižmo, amberlyn



This code works for me

<?php
header('Content-Type: text/plain;charset=Windows-1250');
echo file_get_contents('http://www.parfumeriafox.sk/source_file.html');
?>


The problem is not with file_get_contents()

I saved $data to a file. The bytes were correct, but my text editor still did not decode them correctly.

$data = file_get_contents('http://www.parfumeriafox.sk/source_file.html');
file_put_contents('doc.txt',$data);
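
To see which bytes are actually in the download, independent of any editor, you can list the non-ASCII bytes as hex. A minimal sketch: non_ascii_bytes() is a hypothetical helper, and the sample string is a made-up stand-in for the fetched $data, written as raw Windows-1250 bytes (the charset I arrive at further down).

```php
<?php
// Hypothetical helper: list the distinct bytes >= 0x80 in $data as hex strings.
function non_ascii_bytes($data) {
    // No /u flag here, so the pattern matches single bytes, not code points.
    preg_match_all('/[\x80-\xFF]/', $data, $m);
    $hex = array_map('strtoupper', array_map('bin2hex', array_unique($m[0])));
    sort($hex, SORT_STRING);
    return $hex;
}

// Stand-in for the fetched data: a few words as raw Windows-1250 bytes.
$data = "levandu\xBEa, \x9Akorica, hru\x9Aka";
echo implode(' ', non_ascii_bytes($data)), PHP_EOL; // prints "9A BE"
```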

UPDATE

There seems to be one problematic character. It renders as ¾.

Its hex value is 0xBE (190 decimal).

I tried these two character sets. Neither worked.

header('Content-Type: text/plain; charset=ISO-8859-1');
header('Content-Type: text/plain; charset=ISO-8859-2');
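
To check what a single byte means in each candidate character set, you can decode it with iconv(). A small sketch, assuming the iconv extension is available and recognises these three charset labels:

```php
<?php
$byte = "\xBE"; // the problematic byte
$decoded = array();
foreach (array('ISO-8859-1', 'ISO-8859-2', 'WINDOWS-1250') as $charset) {
    // Interpret the byte in $charset and re-encode the character as UTF-8.
    $decoded[$charset] = iconv($charset, 'UTF-8', $byte);
    echo $charset, ' => ', $decoded[$charset], PHP_EOL;
}
```

On a UTF-8 terminal this should show ¾ for ISO-8859-1, ž for ISO-8859-2 and ľ for Windows-1250. Only the last one gives the ľ that the Slovak word levanduľa needs.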




END OF UPDATE


It works if you send a header WITHOUT charset=utf-8. The source page is not UTF-8, so declaring UTF-8 makes the browser misread its bytes.

These two headers work

header('Content-Type: text/plain');
header('Content-Type: text/html');

These two headers do NOT work

header('Content-Type: text/plain; charset=utf-8');
header('Content-Type: text/html; charset=utf-8');

This code is tested and displays all characters.

<?php
header('Content-Type: text/plain');
echo file_get_contents('http://www.parfumeriafox.sk/source_file.html');
?>

<?php
header('Content-Type: text/html');
echo file_get_contents('http://www.parfumeriafox.sk/source_file.html');
?>



These are some of the problematic characters with their hex values, taken from the saved file viewed in Notepad++ with UTF-8 encoding.

Checking the hex values against the common single-byte character sets showed that the character set was Latin2.

The Wikipedia article on Windows code pages lists Latin2 as Windows-1250.


bergamot, citrón, tráva, rebarbora, bazalka;levanduľa, škorica, hruška;céderové drevo, vanilka, pižmo, amberlyn
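
If the rest of your site is UTF-8, an alternative to sending a Windows-1250 header is to convert the fetched bytes first. A minimal sketch, assuming the iconv extension is available and that the source really is Windows-1250. windows1250_to_utf8() is a hypothetical helper name, and the sample bytes stand in for the result of file_get_contents():

```php
<?php
// Hypothetical helper: convert Windows-1250 bytes to UTF-8.
// Returns the input unchanged if iconv() reports a failure.
function windows1250_to_utf8($bytes) {
    $utf8 = @iconv('WINDOWS-1250', 'UTF-8', $bytes);
    return $utf8 === false ? $bytes : $utf8;
}

// Stand-in for the fetched page: three words as raw Windows-1250 bytes.
$raw = "levandu\xBEa, \x9Akorica, pi\x9Emo";
$utf8 = windows1250_to_utf8($raw);
echo $utf8, PHP_EOL;
```

In a real script you would send header('Content-Type: text/plain; charset=utf-8') before echoing the converted text.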

Tuesday, December 13, 2022
 
5

NOTE: Do not just strip these characters. Replace them with the replacement character U+FFFD to avoid Unicode attacks, mostly XSS:

http://unicode.org/reports/tr36/#Deletion_of_Noncharacters

preg_replace('/[\x{10000}-\x{10FFFF}]/u', "\xEF\xBF\xBD", $value);
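
A small sketch of how that call behaves. With the u flag, preg_replace() returns null when the subject is not valid UTF-8, so the hypothetical wrapper replace_astral() below checks for that. The sample string is made up.

```php
<?php
// Hypothetical wrapper: replace every code point above U+FFFF with U+FFFD.
function replace_astral($value) {
    $clean = preg_replace('/[\x{10000}-\x{10FFFF}]/u', "\xEF\xBF\xBD", $value);
    // preg_replace() returns null on error, e.g. invalid UTF-8 input with /u.
    return $clean === null ? '' : $clean;
}

$input = "a\xF0\x9F\x98\x80b"; // "a", U+1F600, "b"
echo bin2hex(replace_astral($input)), PHP_EOL; // prints "61efbfbd62"
```

Returning an empty string for invalid input is only one possible choice; you may prefer to reject the value outright.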
Sunday, October 9, 2022
 
1

It seems that I just need to add the u flag to the regex, so it becomes:

$s = preg_replace('/[\x00-\x1F\x7F-\x9F]/u', '', $s);
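
For illustration, here is that pattern applied to a made-up string. With the u flag the range \x7F-\x9F is matched as code points U+007F to U+009F rather than as raw bytes, so multibyte UTF-8 characters are left intact:

```php
<?php
// "a", BEL (U+0007), "b", NEL (U+0085 as UTF-8), then "ž" as UTF-8.
$s = "a\x07b\xC2\x85\xC5\xBE";
$s = preg_replace('/[\x00-\x1F\x7F-\x9F]/u', '', $s);
echo bin2hex($s), PHP_EOL; // prints "6162c5be"
```

Keep in mind that the range \x00-\x1F also removes tabs and newlines.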
Saturday, August 6, 2022