Viewed   103 times

When I try to write UTF-8 Strings into an XML file using DomDocument it actually writes the hexadecimal notation of the string instead of the string itself.

for example:

ירושלים

instead of: ???????

any ideas how to resolve the issue?

 Answers

1

Ok, here you go:

$dom = new DOMDocument('1.0', 'utf-8');
$dom->appendChild($dom->createElement('root'));
$dom->documentElement->appendChild(new DOMText('???????'));
echo $dom->saveXml();

will work fine, because in this case, the document you constructed will retain the encoding specified as the second argument:

<?xml version="1.0" encoding="utf-8"?>
<root>???????</root>

However, once you load XML into a Document that does not specify an encoding, you will lose anything you declared in the constructor, which means:

$dom = new DOMDocument('1.0', 'utf-8');
$dom->loadXml('<root/>'); // missing prolog
$dom->documentElement->appendChild(new DOMText('???????'));
echo $dom->saveXml();

will not have an encoding of utf-8:

<?xml version="1.0"?>
<root>&#x5D9;&#x5E8;&#x5D5;&#x5E9;&#x5DC;&#x5D9;&#x5DD;</root>

So if you loadXML something, make sure it is

$dom = new DOMDocument();
$dom->loadXml('<?xml version="1.0" encoding="utf-8"?><root/>');
$dom->documentElement->appendChild(new DOMText('???????'));
echo $dom->saveXml();

and it will work as expected.

As an alternative, you can also specify the encoding after loading the document.

Tuesday, August 2, 2022
5

Your "hack" doesn't make sense.

You are converting a Windows-1250 HTML file into UTF-8 and then prepending <?xml encoding="UTF-8">. This won't work. The DOM extension, for HTML files:

  • Takes the charset specified in a meta http-equiv for "content-type".
  • Otherwise assumes ISO-8859-1

I suggest you instead convert from Windows-1250 into ISO-8859-1 and prepend nothing.

EDIT The suggestion is not very good because Windows-1250 has characters that are not in ISO-8859-1. Since you're dealing with fragments without meta elements for content-type, you can add your own to force interpretation as UTF-8:

<?php
//script and output are in UTF-8

/* Simulate HTML fragment in Windows-1250 */
$html = <<<XML
<p>??? ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)</p>
XML;
$htmlInterm = iconv("UTF-8", "Windows-1250", $html); //convert

/* Append meta header to force UTF-8 interpretation and convert into UTF-8 */
$htmlInterm =
    "<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />" .
    iconv("Windows-1250", "UTF-8", $htmlInterm);

/* Omit libxml warnings */
libxml_use_internal_errors(true);

/* Build DOM */
$d = new domdocument;
$d->loadHTML($htmlInterm);
var_dump($d->getElementsByTagName("body")->item(0)->textContent); //correct UTF-8

gives:

string(79) "??? ‰ ‡ … á (some exist on win-1250, but not LATIN1 or even win-1252)"
Thursday, December 1, 2022
 
4

I think there are a couple of things going on here. For one, you need:

$dom->encoding = 'utf-8';

But also, I think we should try creating the DOMDocument manually specifying the proper encoding. So:

<?php

$XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9";
$rootNode = new SimpleXMLElement("<?xml version='1.0' encoding='UTF-8'?><urlset></urlset>");
$rootNode->addAttribute('xmlns', $XMLNS);

$url = $rootNode->addChild('url');
$url->addChild('loc', "Somewhere over the rainbow");

// Turn it into an indented file needs a DOMDocument...
$domSxe = dom_import_simplexml($rootNode)->ownerDocument;

// Set DOM encoding to UTF-8.
$domSxe->encoding = 'UTF-8';

$dom = new DOMDocument('1.0', 'UTF-8');
$domSxe = $dom->importNode($domSxe, true);
$domSxe = $dom->appendChild($domSxe);

$path = "C:\temp";

$dom->formatOutput = true;
$dom->save($path.'/sitemap.xml');

Also ensure that any elements or CData you're adding are actually UTF-8 (see utf8_encode()).

Using the example above, this works for me:

php > var_dump($utf8);
string(11) "????"

php > $XMLNS = "http://www.sitemaps.org/schemas/sitemap/0.9";
php > $rootNode = new SimpleXMLElement("<?xml version='1.0' encoding='UTF-8'?><urlset></urlset>");
php > $rootNode->addAttribute('xmlns', $XMLNS);
php > $url = $rootNode->addChild('url');

php > $url->addChild('loc', "Somewhere over the rainbow $utf8");

php > $domSxe = dom_import_simplexml($rootNode);
php > $domSxe->encoding = 'UTF-8';
php > $dom = new DOMDocument('1.0', 'UTF-8');
php > $domSxe = $dom->importNode($domSxe, true);
php > $domSxe = $dom->appendChild($domSxe);
php > $dom->save('./sitemap.xml');


$ cat ./sitemap.xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"><url><loc>Somewhere over the rainbow ????</loc></url></urlset>
Saturday, October 22, 2022
3

Your code works perfectly fine on my machine :

<?php
$im = imagecreatetruecolor(120, 20);
$text_color = imagecolorallocate($im, 233, 14, 91);
imagestring($im, 1, 5, 5, 'A Simple Text String', $text_color);
header('Content-type: image/jpeg');
imagejpeg($im);
imagedestroy($im);

die;
?>

Are you sure you are not outputing anything before or after that code ? Even any kind of whitespace would be a source of troubles.

Or maybe your script is doing something else somewhere ?


If it still doesn't work, maybe trying with imagettftext, to use a "better" / more complete font than the ones used by imagestring might help ?

Using something like this, for instance :

$font = '/usr/share/fonts/truetype/msttcorefonts/arial.ttf';
imagettftext($im, 20, 0, 10, 20, $text_color, $font, 'A Simple éléphant String');


BTW, did you try without those line :

header('Content-type: image/jpeg');
imagejpeg($im);
imagedestroy($im);

If there is an error/warning/notice, removing those lines might help you seeing those.


And, as a sidenote : using JPEG for images that contain some text generally doesn't give great results, as JPEG is a destructive compression mechanism. Using PNG, in that kind of situation, might get you better results ;-)

Monday, August 22, 2022
 
vyrx
 
4

You can't. You should first use a N type data field, convert your file to UTF-16 and then import it. The database does not support UTF-8.

Sunday, September 4, 2022
Only authorized users can answer the search term. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :