
I have this code to decode numeric HTML entities to their UTF-8 equivalent characters.

I'm trying to convert this character:

&#146;

which should output:

’

However, it just disappears (no output). I've checked the source of the page, and it has the correct UTF-8 charset headers and meta tags.

Does anyone know what is wrong with the code?

function entity_decode($string, $quote_style = ENT_COMPAT, $charset = "UTF-8") {    
     $string = html_entity_decode($string, $quote_style, $charset);

     $string = preg_replace_callback('~&#x([0-9a-fA-F]+);~i', "chr_utf8_callback", $string);
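     // The /e modifier was removed in PHP 7, so decimal references go through a callback as well.
     $string = preg_replace_callback('~&#([0-9]+);~', function ($m) { return chr_utf8((int) $m[1]); }, $string);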

    //this is another method, which also doesn't work.. 
     //$string = preg_replace_callback("/(&#[0-9]+;)/", "entity_decode_callback", $string);

     return $string; 
}




function chr_utf8_callback($matches) { 
     return chr_utf8(hexdec($matches[1])); 
}

function chr_utf8($num) {   
     if ($num < 128) return chr($num);
     if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
     if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
     if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
     return '';
}

function entity_decode_callback($m) { 
     return mb_convert_encoding($m[1], "UTF-8", "HTML-ENTITIES"); 
} 

 echo '=' . entity_decode('&#146;');

 Answers

2

html_entity_decode already does what you're looking for:

$string = '&#146;';

echo html_entity_decode($string, ENT_COMPAT, 'UTF-8');

It will return the character:

’   binary hex: c292

Which is PRIVATE USE TWO (U+0092), a C1 control character with no visible glyph, so it can look as if nothing was output. Depending on your PHP version, configuration, or build, it might not be returned at all.
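To check which bytes a decoder actually produced, it can help to dump them as hex. A minimal sketch, assuming PHP 7.0+ for the `\u{...}` string escape; the variable names are only illustrative:

```php
<?php
// U+0092 (PRIVATE USE TWO) versus U+2019 (RIGHT SINGLE QUOTATION MARK),
// written with Unicode escapes so the source file encoding does not matter.
$pu2   = "\u{92}";
$quote = "\u{2019}";

echo bin2hex($pu2), "\n";   // c292
echo bin2hex($quote), "\n"; // e28099
```

If the hex dump shows c292, the reference was decoded literally as U+0092 rather than as the curly quote.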

Also there are some more quirks:

But in HTML (other than XHTML, which uses XML rules), it's a long-standing browser quirk that character references in the range &#128; to &#159; are misinterpreted to mean the characters associated with bytes 128 to 159 in the Windows Western code page (cp1252) instead of the Unicode characters with those code points. The HTML5 standard finally documents this behaviour.

See: &#146; is getting converted as “u0092” by nokogiri in ruby on rails
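The browser behaviour quoted above could be imitated in PHP roughly as follows. This is only a sketch, not the asker's code: `decode_numeric_refs_cp1252` is a hypothetical helper name, and it assumes PHP 7.2+ with the mbstring extension for `mb_chr()`. The lookup table lists the replacements that HTML5 defines for references in the range 128 to 159; anything not in the table is decoded as the code point it names.

```php
<?php
// Hypothetical helper: decode decimal and hexadecimal numeric character
// references, remapping 128-159 through Windows-1252 as browsers do.
function decode_numeric_refs_cp1252(string $html): string
{
    // Replacement code points for references 0x80-0x9F (HTML5 parsing rules).
    $map = [
        0x80 => 0x20AC, 0x82 => 0x201A, 0x83 => 0x0192, 0x84 => 0x201E,
        0x85 => 0x2026, 0x86 => 0x2020, 0x87 => 0x2021, 0x88 => 0x02C6,
        0x89 => 0x2030, 0x8A => 0x0160, 0x8B => 0x2039, 0x8C => 0x0152,
        0x8E => 0x017D, 0x91 => 0x2018, 0x92 => 0x2019, 0x93 => 0x201C,
        0x94 => 0x201D, 0x95 => 0x2022, 0x96 => 0x2013, 0x97 => 0x2014,
        0x98 => 0x02DC, 0x99 => 0x2122, 0x9A => 0x0161, 0x9B => 0x203A,
        0x9C => 0x0153, 0x9E => 0x017E, 0x9F => 0x0178,
    ];

    return preg_replace_callback(
        '~&#(?:x([0-9a-f]{1,6})|([0-9]{1,7}));~i',
        function (array $m) use ($map): string {
            // Group 1 holds hex digits, group 2 holds decimal digits.
            $cp = ($m[1] !== '') ? hexdec($m[1]) : (int) $m[2];
            $cp = $map[$cp] ?? $cp;

            // Leave the reference untouched if it is not an encodable code point.
            $char = ($cp > 0 && $cp <= 0x10FFFF) ? mb_chr($cp, 'UTF-8') : false;
            return $char === false ? $m[0] : $char;
        },
        $html
    );
}

echo bin2hex(decode_numeric_refs_cp1252('&#146;')), "\n"; // e28099
```

Named entities such as `&amp;` are deliberately left alone here; they could still be passed through html_entity_decode() afterwards.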

Sunday, October 16, 2022
 
3

Disable Apache's default charset (in the server config or .htaccess):

AddDefaultCharset Off
Monday, October 3, 2022
1

Add this header at the top of your PHP file, before any output, and let me know whether it solves the problem:

header('Content-Type: text/html; charset=UTF-8');
Saturday, October 22, 2022
4

Then maybe you need HttpUtility.HtmlDecode. It should work; you just need to add a reference to System.Web. At least, this was the way in .NET Framework versions before 4.

For example the following code:

MessageBox.Show(HttpUtility.HtmlDecode("&amp;&copy;"));

Worked and the output was as expected (ampersand and copyright symbol). Are you sure the problem is within HtmlDecode and not something else?

UPDATE: Another class that can do the job, WebUtility (again via its HtmlDecode method), was added in newer versions of .NET. However, there seem to be some problems with it. See the HttpUtility vs. WebUtility question.

Sunday, August 14, 2022
 
4

In PHP, this can be done using htmlentities(). Example below.

<?php
  $content = "This string contains the TM symbol: &trade;";
  print "<textarea>". htmlentities($content) ."</textarea>";
?>

Without htmlentities(), the textarea would interpret and display the TM symbol (™) instead of "&trade;".

http://php.net/manual/en/function.htmlentities.php

Tuesday, August 23, 2022
 