
When I try to write UTF-8 strings into an XML file using DOMDocument, it writes the hexadecimal notation of the string instead of the string itself.

For example:


instead of: ???????

Any ideas how to resolve the issue?



Ok, here you go:

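$dom = new DOMDocument('1.0', 'utf-8');
// a freshly constructed document has no root element yet, so create one first
$dom->appendChild($dom->createElement('root'));
$dom->documentElement->appendChild(new DOMText('???????'));
echo $dom->saveXml();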

will work fine, because in this case, the document you constructed will retain the encoding specified as the second argument:

<?xml version="1.0" encoding="utf-8"?>

However, once you load XML that does not specify an encoding into the document, you will lose anything you declared in the constructor, which means:

$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadXml('<root/>'); // missing prolog
$dom->documentElement->appendChild(new DOMText('???????'));
echo $dom->saveXml();

will not have an encoding of utf-8:

<?xml version="1.0"?>

So if you loadXML something, make sure it declares its encoding in the prolog:

$dom = new DOMDocument();
$dom->loadXml('<?xml version="1.0" encoding="utf-8"?><root/>');
$dom->documentElement->appendChild(new DOMText('???????'));
echo $dom->saveXml();

and it will work as expected.

As an alternative, you can also specify the encoding after loading the document.
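A minimal sketch of that alternative, assuming the loaded XML has no prolog. The sample text "café" is only an illustration and is not taken from the question. It saves the document once before and once after setting the `encoding` property, so the difference is visible:

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<root/>'); // no prolog, so no encoding is recorded
$dom->documentElement->appendChild(new DOMText("caf\xC3\xA9")); // "café" as UTF-8 bytes

$before = $dom->saveXML(); // non-ASCII text comes out as a character reference

$dom->encoding = 'utf-8';  // declare the encoding after loading
$after = $dom->saveXML();  // prolog now carries encoding="utf-8", text stays as UTF-8

echo $before, $after;
```

Setting the property after `loadXML()` has the same effect on the output as loading a string that already carries the prolog.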

Tuesday, August 2, 2022

Your "hack" doesn't make sense.

You are converting a Windows-1250 HTML file into UTF-8 and then prepending <?xml encoding="UTF-8">. This won't work. The DOM extension, for HTML files:

  • Takes the charset specified in a meta http-equiv for "content-type".
  • Otherwise assumes ISO-8859-1

I suggest you instead convert from Windows-1250 into ISO-8859-1 and prepend nothing.

EDIT: The suggestion above is not very good, because Windows-1250 has characters that are not in ISO-8859-1. Since you're dealing with fragments without meta elements for content-type, you can add your own to force interpretation as UTF-8:

//script and output are in UTF-8

/* Simulate HTML fragment in Windows-1250 */
$html = <<<XML
<p>??? ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)</p>
XML;
$htmlInterm = iconv("UTF-8", "Windows-1250", $html); //convert

/* Prepend meta header to force UTF-8 interpretation and convert into UTF-8 */
$htmlInterm =
    "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" />" .
    iconv("Windows-1250", "UTF-8", $htmlInterm);

/* Omit libxml warnings */
libxml_use_internal_errors(true);

/* Build DOM */
$d = new domdocument;
$d->loadHTML($htmlInterm);
var_dump($d->getElementsByTagName("body")->item(0)->textContent); //correct UTF-8


string(79) "??? ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)"
Thursday, December 1, 2022

I think there are a couple of things going on here. For one, you need:

$dom->encoding = 'utf-8';

But also, I think we should try creating the DOMDocument manually specifying the proper encoding. So:


$XMLNS = "";
$rootNode = new SimpleXMLElement("<?xml version='1.0' encoding='UTF-8'?><urlset></urlset>");
$rootNode->addAttribute('xmlns', $XMLNS);

$url = $rootNode->addChild('url');
$url->addChild('loc', "Somewhere over the rainbow");

// Turning it into an indented file needs a DOMDocument...
$domSxe = dom_import_simplexml($rootNode)->ownerDocument;

// Set DOM encoding to UTF-8.
$domSxe->encoding = 'UTF-8';

$dom = new DOMDocument('1.0', 'UTF-8');
// importNode() cannot import a whole document, so import its root element
$node = $dom->importNode($domSxe->documentElement, true);
$dom->appendChild($node);

// single quotes, so that \t is not read as a tab
$path = 'C:\temp';

$dom->formatOutput = true;
$dom->save($path . '/sitemap.xml');

Also ensure that any elements or CDATA you're adding are actually UTF-8 (see utf8_encode(), which converts ISO-8859-1 to UTF-8 and is deprecated as of PHP 8.2).
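One way to make that check explicit is sketched here, assuming the mbstring extension is available. The helper name `to_utf8` and its ISO-8859-1 fallback are hypothetical and not part of the answer; choose the fallback to match where your data really comes from.

```php
<?php
// Hypothetical helper: return $value as valid UTF-8.
// $fallback is the assumed source encoding for input that is not valid UTF-8.
function to_utf8(string $value, string $fallback = 'ISO-8859-1'): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($value, 'UTF-8', $fallback);
}

$latin1 = "caf\xE9";      // "café" encoded as ISO-8859-1
$utf8 = to_utf8($latin1); // now "café" encoded as UTF-8
```

Checking first avoids double-encoding strings that are already UTF-8, which an unconditional conversion would do.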

Using the sitemap example above, this works for me:

php > var_dump($utf8);
string(11) "????"

php > $XMLNS = "";
php > $rootNode = new SimpleXMLElement("<?xml version='1.0' encoding='UTF-8'?><urlset></urlset>");
php > $rootNode->addAttribute('xmlns', $XMLNS);
php > $url = $rootNode->addChild('url');

php > $url->addChild('loc', "Somewhere over the rainbow $utf8");

php > $domSxe = dom_import_simplexml($rootNode);
php > $domSxe->encoding = 'UTF-8';
php > $dom = new DOMDocument('1.0', 'UTF-8');
php > $domSxe = $dom->importNode($domSxe, true);
php > $domSxe = $dom->appendChild($domSxe);
php > $dom->save('./sitemap.xml');

$ cat ./sitemap.xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns=""><url><loc>Somewhere over the rainbow ????</loc></url></urlset>
Saturday, October 22, 2022

Your code works perfectly fine on my machine:

$im = imagecreatetruecolor(120, 20);
$text_color = imagecolorallocate($im, 233, 14, 91);
imagestring($im, 1, 5, 5, 'A Simple Text String', $text_color);
header('Content-type: image/jpeg');
imagejpeg($im); // send the image data after the header


Are you sure you are not outputting anything before or after that code? Even a single whitespace character can cause trouble.

Or maybe your script is doing something else somewhere?

If it still doesn't work, trying imagettftext might help, since it lets you use a "better", more complete font than the ones used by imagestring.

Using something like this, for instance:

$font = '/usr/share/fonts/truetype/msttcorefonts/arial.ttf';
imagettftext($im, 20, 0, 10, 20, $text_color, $font, 'A Simple éléphant String');

BTW, did you try without this line:

header('Content-type: image/jpeg');

If there is an error, warning, or notice, removing that line might help you see it.

And, as a side note: using JPEG for images that contain text generally doesn't give great results, as JPEG is a lossy compression format. Using PNG in that kind of situation might get you better results ;-)
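A minimal sketch of the PNG variant, assuming the GD extension is loaded. It captures the image into a string through output buffering so the bytes can be inspected before anything is sent. The variable `$png` and the 200-pixel width are illustrative choices.

```php
<?php
// Render the same text, but encode it as PNG instead of JPEG.
$im = imagecreatetruecolor(200, 20);
$text_color = imagecolorallocate($im, 233, 14, 91);
imagestring($im, 1, 5, 5, 'A Simple Text String', $text_color);

ob_start();            // capture the encoder output instead of sending it
imagepng($im);
$png = ob_get_clean(); // raw PNG bytes

// In a web script, send the header first and then the bytes:
// header('Content-Type: image/png');
// echo $png;
```

Holding the bytes in a variable also makes it easy to confirm that nothing else, such as stray whitespace, is emitted alongside the image.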

Monday, August 22, 2022

You can't. You should first use an N-type data field, convert your file to UTF-16, and then import it. The database does not support UTF-8.
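The conversion step could look roughly like this in PHP, assuming the iconv extension. The answer does not say which UTF-16 byte order the import expects, so the little-endian form with a byte-order mark used here is an assumption, and the sample text is illustrative.

```php
<?php
// UTF-8 input; in practice this would come from file_get_contents() on your file.
$utf8 = "caf\xC3\xA9"; // "café" as UTF-8 bytes

// Convert to UTF-16LE and prepend the little-endian byte-order mark (FF FE).
// Assumption: the import tool expects UTF-16LE with a BOM.
$utf16 = "\xFF\xFE" . iconv('UTF-8', 'UTF-16LE', $utf8);

// Write $utf16 out with file_put_contents() before importing it.
```

Naming the byte order explicitly keeps the result independent of the platform default that a plain `UTF-16` target would use.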

Sunday, September 4, 2022