I am using jqueryFileTree to show a directory listing on the server with download links to the files in the directory. Recently I've run into an issue with files which contain special characters:

  • test.pdf : works fine
  • tést.pdf : does not work (notice the é - acute accent - in the filename)

When debugging the PHP connector of jqueryFileTree, I see that it does a scandir() of the directory passed via $_GET and then loops over each file/dir in that directory. Before putting the file name into the URL, the script runs htmlentities() over it, which seems correct. The problem is that this htmlentities($file) call just returns an empty string. According to the PHP docs, that happens when the input string contains an invalid code unit sequence for the given encoding. However, I tried passing the charset explicitly by calling:

$file = htmlentities($file,ENT_QUOTES,'UTF-8');

But this also returns an empty string.

If I call $file = htmlentities($file,ENT_IGNORE,'UTF-8'); instead, the é character is just dropped (so tést.pdf becomes tst.pdf).
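Both symptoms can be reproduced without the file server. The following is only a minimal sketch: it assumes the name reaches PHP as single-byte Windows-1252 data (é as the byte 0xE9), which is a guess at this point and not something confirmed above.

```php
<?php
// Assumption: scandir() hands back "tést.pdf" as Windows-1252 bytes,
// where é is the single byte 0xE9. That byte sequence is not valid UTF-8.
$file = "t\xE9st.pdf";

// With the default error handling, invalid UTF-8 input yields an empty string.
$strict = htmlentities($file, ENT_QUOTES, 'UTF-8');

// ENT_IGNORE silently discards the invalid byte instead of failing.
$ignored = htmlentities($file, ENT_IGNORE, 'UTF-8');

var_dump($strict);
var_dump($ignored);
```

If the two dumps show an empty string and tst.pdf, the input bytes are the problem and not htmlentities() itself.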

When debugging my PHP script with xdebug, I can see that the source string contains an unknown character where the é should be.

So I'm quite at my wits' end trying to find a solution for this. Any help would be welcome.

FYI:

  • The charset of my page is UTF-8 (specified in metadata)
  • The file is stored on a Windows 2003 file server and scandir() is executed with the UNC path (e.g. //fileserver/sharename/sourcedir)
  • The default encoding in my php.ini is set to UTF-8
  • The web server and PHP 5.4.26 are running on a Windows 2008 R2 server

 Answers

3

My best guess is that the filename itself isn't using UTF-8. Or at least scandir() isn't picking it up like that.

Maybe mb_detect_encoding() can shed some light?

var_dump(mb_detect_encoding($filename));
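Since mb_detect_encoding() only guesses, it can also help to look at the raw bytes and ask directly whether they form valid UTF-8. A small sketch; describe_filename_bytes() is a hypothetical helper name, and the two sample strings are assumed inputs, not values taken from your server:

```php
<?php
// Hypothetical helper: returns the raw bytes as hex plus a strict UTF-8 validity check.
// Type declarations are left out so the sketch also runs on older PHP 5 versions.
function describe_filename_bytes($filename)
{
    return array(
        'hex'        => bin2hex($filename),
        'valid_utf8' => mb_check_encoding($filename, 'UTF-8'),
    );
}

// Assumed sample inputs: the same name once as Windows-1252 bytes, once as UTF-8 bytes.
var_dump(describe_filename_bytes("t\xE9st.pdf"));
var_dump(describe_filename_bytes("t\xC3\xA9st.pdf"));
```

A lone e9 in the hex dump points at a single-byte encoding, whereas c3a9 would mean the name is already UTF-8.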

If not, try to guess the encoding (CP1252 or ISO-8859-1 would be my first guess) and convert it to UTF-8, see if the output is valid:

var_dump(mb_convert_encoding($filename, 'UTF-8', 'Windows-1252'));
var_dump(mb_convert_encoding($filename, 'UTF-8', 'ISO-8859-1'));
var_dump(mb_convert_encoding($filename, 'UTF-8', 'ISO-8859-15'));

Or using iconv():

var_dump(iconv('WINDOWS-1252', 'UTF-8', $filename));
var_dump(iconv('ISO-8859-1',   'UTF-8', $filename));
var_dump(iconv('ISO-8859-15',  'UTF-8', $filename));

Then when you've figured out which encoding is actually used, your code should look somewhat like this (assuming CP1252):

$filename = htmlentities(mb_convert_encoding($filename, 'UTF-8', 'Windows-1252'), ENT_QUOTES, 'UTF-8');
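If some names might already be UTF-8, the conversion can be guarded so they are not converted twice. This is only a sketch: filename_to_html() is a hypothetical helper, and the Windows-1252 fallback is the same assumption as above, to be replaced with whatever encoding you actually find.

```php
<?php
// Hypothetical helper: convert to UTF-8 only when the bytes are not already valid UTF-8,
// then escape for HTML. The default fallback encoding is an assumption.
function filename_to_html($filename, $fallbackEncoding = 'Windows-1252')
{
    if (!mb_check_encoding($filename, 'UTF-8')) {
        $filename = mb_convert_encoding($filename, 'UTF-8', $fallbackEncoding);
    }

    return htmlentities($filename, ENT_QUOTES, 'UTF-8');
}

// Assumed sample inputs: Windows-1252 bytes and UTF-8 bytes for the same name.
echo filename_to_html("t\xE9st.pdf"), "\n";
echo filename_to_html("t\xC3\xA9st.pdf"), "\n";
```

The validity check is a heuristic: a few single-byte sequences happen to be valid UTF-8 as well, so if every name comes from the same share it is safer to convert unconditionally.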
Sunday, August 21, 2022
 
appx
 
1

Thanks to @XzKto and this comment on PHP.net I changed my slug function to the following:

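static function slug($input){

    // Turn any HTML entities back into real UTF-8 characters first.
    $string = html_entity_decode($input,ENT_COMPAT,"UTF-8");

    // Passing '0' returns the current locale without changing it.
    $oldLocale = setlocale(LC_CTYPE, '0');

    // //TRANSLIT depends on LC_CTYPE, so switch to a UTF-8 locale temporarily.
    setlocale(LC_CTYPE, 'en_US.UTF-8');
    $string = iconv("UTF-8","ASCII//TRANSLIT",$string);

    // Restore the original locale.
    setlocale(LC_CTYPE, $oldLocale);

    // Collapse every run of non-alphanumeric characters into a single dash.
    return strtolower(preg_replace('/[^a-zA-Z0-9]+/','-',$string));

}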

I feel like the setlocale part is a bit dirty but this works perfectly for translating special characters to their 'normal' equivalents.

Input a áñö ïß éèé returns a-ano-iss-eee

Saturday, September 17, 2022
 
3

I think the encoding you are looking for is Windows code page 1252 (Western European). It is not the same as ISO-8859-1 (or 8859-15 for that matter); the characters in the range 0xA0-0xFF match 8859-1, but cp1252 adds an assortment of extra characters in the range 0x80-0x9F where ISO-8859-1 assigns little-used control codes.

The confusion comes about because when you serve a page as text/html;charset=iso-8859-1, for historical reasons, browsers actually use cp1252 (and will hence submit forms in cp1252 too).

iconv('cp1252', 'utf-8', "\x80 and \x95")
-> "\xe2\x82\xac and \xe2\x80\xa2"
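To see the difference between the two encodings side by side, the same byte can be decoded both ways. A minimal sketch using mb_convert_encoding() in place of iconv() (my choice here, to avoid differences between iconv builds):

```php
<?php
// 0x80 is the euro sign in Windows-1252 but a C1 control code in ISO-8859-1.
$byte = "\x80";
$fromCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');
$fromLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

echo bin2hex($fromCp1252), "\n"; // e282ac (U+20AC, euro sign)
echo bin2hex($fromLatin1), "\n"; // c280 (U+0080, control code)

// In the 0xA0-0xFF range both encodings agree, e.g. 0xE9 (é).
$eCp1252 = mb_convert_encoding("\xE9", 'UTF-8', 'Windows-1252');
$eLatin1 = mb_convert_encoding("\xE9", 'UTF-8', 'ISO-8859-1');

echo bin2hex($eCp1252), "\n"; // c3a9
echo bin2hex($eLatin1), "\n"; // c3a9
```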
Saturday, October 15, 2022
5

DOMDocument::loadHTML will treat your string as being in ISO-8859-1 unless you tell it otherwise. This results in UTF-8 strings being interpreted incorrectly.

If your string doesn't contain an XML encoding declaration, you can prepend one to cause the string to be treated as UTF-8:

$profile = '<p>???????????????????????9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();

If you cannot know if the string will contain such a declaration already, there's a workaround in SmartDOMDocument which should help you:

$profile = '<p>???????????????????????9</p>';
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
echo $dom->saveHTML();

This is not a great workaround, but since not all characters can be represented in ISO-8859-1 (like these katakana), it's the safest alternative.
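Note that passing 'HTML-ENTITIES' to mb_convert_encoding() is deprecated as of PHP 8.2. One possible substitute, shown here only as a sketch and not as what SmartDOMDocument does, is to turn every non-ASCII character into a numeric entity with mb_encode_numericentity(). The sample markup and the variable names are made up for illustration:

```php
<?php
// Assumed sample input: "café" in UTF-8, written with explicit bytes.
$profile = "<p>caf\xC3\xA9</p>";

// Map: encode every code point from U+0080 to U+10FFFF, no offset, full mask.
$map = array(0x80, 0x10FFFF, 0, 0x1FFFFF);
$encoded = mb_encode_numericentity($profile, $map, 'UTF-8');

$dom = new DOMDocument();
$dom->loadHTML($encoded);

// DOM always hands text back as UTF-8.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;

echo $encoded, "\n";
echo bin2hex($text), "\n";
```

Because the markup handed to loadHTML() is then pure ASCII, the ISO-8859-1 default no longer matters.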

Thursday, October 20, 2022
 
rrs
 
5

Apparently, I forgot to set the StringEntity's charset to UTF-8. These lines did the trick:

    httpPut.setEntity(new StringEntity(body, HTTP.UTF_8));
    httpPost.setEntity(new StringEntity(body, HTTP.UTF_8));

So, there are at least two levels at which the charset has to be set in the Android client when sending an HTTP POST with non-ASCII characters:

  1. The REST client itself
  2. The StringEntity

UPDATE: As Samuel pointed out in the comments, the modern way to do it is to use a ContentType, like so:

    final StringEntity se = new StringEntity(body, ContentType.APPLICATION_JSON);
    httpPut.setEntity(se);
Sunday, November 6, 2022