
I have the standard XAMPP installation on Windows 7 (x64). I had my share of encoding troubles in a past project, where the MySQL encoding did not match the PHP encoding, which in turn sometimes output HTML in yet other encodings, so I decided to encode everything consistently in UTF-8.

I'm just getting started with the HTML markup and am already running into trouble.

  • My page is saved using utf-8 (no BOM, I think)
    //update: It turns out this was NOT the case. The file was actually saved as ISO-8859-1. I found this out later thanks to Sherm Pendley's answer. I had to go back and change my project settings (which were set to "ISO-8859-1") to the desired "UTF-8".
  • php is set per .htaccess to serve .php-pages in utf-8 with: AddCharset UTF-8 .php
  • html has a meta tag specifying: <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
  • To test, I also used PHP's header('Content-Type:text/html; charset=UTF-8');

The page is evidently served as UTF-8 (Firefox and Chrome recognize it as such), but any special characters such as é, á or ¡ just show up as ?, even when viewing the page source.

When dropping the encoding settings mentioned above all characters are rendered correctly but the encoding that is detected shows either windows-1252 or ISO-8859-1 depending on the browser.

How come? I'm very puzzled. I would have expected the exact opposite behavior.
Any advice is welcome, thanks!

edit: Hopefully this helps a bit more. This is the response header (as per firebug)

HTTP/1.1 200 OK
Date: Sat, 26 Mar 2011 20:49:44 GMT
Server: Apache/2.2.14 (Win32) DAV/2 mod_ssl/2.2.14 OpenSSL/0.9.8l mod_autoindex_color PHP/5.3.1 mod_apreq2-20090110/2.7.1 mod_perl/2.0.4 Perl/v5.10.1
X-Powered-By: PHP/5.3.1
Content-Length: 91
Keep-Alive: timeout=5, max=99
Connection: Keep-Alive
Content-Type: text/html; charset=utf-8

 Answers

2

When [dropping] the encoding settings mentioned above all characters [are rendered] correctly but the encoding that is detected shows either windows-1252 or ISO-8859-1 depending on the browser.

Then that's what you're really sending. None of the encoding settings in your bullet list will actually modify your output in any way; all they do is tell the browser what encoding to assume when interpreting what you send. That's why you're getting those ?s - you're telling the browser that what you're sending is UTF-8, but it's really ISO-8859-1.
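To see the mismatch outside the browser, here is a minimal Python sketch. It assumes the file was saved as ISO-8859-1 (as the question's update confirms) and mimics a browser that trusts the charset=utf-8 declaration. The sample string is only an illustration.

```python
# The same text, saved by an editor in two different encodings.
text = "é, á or ¡"
latin1_bytes = text.encode("iso-8859-1")  # what the file really contains
utf8_bytes = text.encode("utf-8")         # what the headers promise

# A browser told "charset=utf-8" decodes the bytes as UTF-8. The lone
# high bytes of ISO-8859-1 are not valid UTF-8, so each one is shown
# as a replacement character (U+FFFD, often rendered as "?").
as_browser_sees_it = latin1_bytes.decode("utf-8", errors="replace")

print(latin1_bytes)        # one byte per accented character
print(utf8_bytes)          # two bytes per accented character
print(as_browser_sees_it)
```

The declarations never touch the bytes. Re-saving the file as UTF-8 is what actually changes what is sent.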

Tuesday, December 20, 2022
2

As justhalf points out above, my question here is essentially a duplicate of this question.

The HTML content reported itself as UTF-8 encoded and, for the most part it was, except for one or two rogue invalid UTF-8 characters.

This apparently confuses BeautifulSoup about which encoding is in use. When I first decoded the content as UTF-8 before passing it to BeautifulSoup, like this:

soup = BeautifulSoup(response.read().decode('utf-8'))

I would get the error:

UnicodeDecodeError: 'utf8' codec can't decode bytes in position 186812-186813: 
                    invalid continuation byte

Looking more closely at the output, there was an instance of the character Ü which was wrongly encoded as the invalid byte sequence 0xe3 0x9c, rather than the correct 0xc3 0x9c.

As the currently highest-rated answer on that question suggests, the invalid UTF-8 characters can be removed while parsing, so that only valid data is passed to BeautifulSoup:

soup = BeautifulSoup(response.read().decode('utf-8', 'ignore'))
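
As a small illustration of what the 'ignore' handler does, the sketch below decodes a made-up snippet containing the same bad sequence (0xe3 0x9c). The helper name decode_html and the sample bytes are hypothetical and not part of the original code.

```python
def decode_html(raw: bytes, errors: str = "ignore") -> str:
    """Decode bytes as UTF-8 using the given error handler."""
    return raw.decode("utf-8", errors)


# Hypothetical sample: "Ü" mis-encoded as 0xe3 0x9c instead of 0xc3 0x9c.
raw = b"<p>\xe3\x9cber</p>"
good = b"<p>\xc3\x9cber</p>"

print(decode_html(good))            # the correctly encoded variant decodes cleanly
print(decode_html(raw, "ignore"))   # the bad sequence is silently dropped
print(decode_html(raw, "replace"))  # the bad sequence becomes U+FFFD instead

try:
    decode_html(raw, "strict")
except UnicodeDecodeError as exc:
    print("strict decoding fails:", exc.reason)
```

Note that 'ignore' loses the character entirely. 'replace' at least leaves a visible marker where data was dropped, which may be preferable when the text is shown to users.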
Friday, October 28, 2022
 
5

What's the server OS?

If it's Windows, you'll not be able to access files under a UTF-8-encoded filename, because the Windows implementation of the C IO libraries used by PHP will only talk in the system default code page. For Western European installs, that's code page 1252. You can convert a UTF-8 string to cp1252 using iconv:

$winfilename= iconv('utf-8', 'cp1252', $utffilename);

(utf8_decode could also be used, but it would give the wrong results for Windows's extension characters that map to the range 0x80-0x9F in cp1252.)

Files whose names include characters outside the repertoire of the system codepage (eg. Greek on a Western box) cannot be accessed at all by PHP and other programs using the stdio. There are scripting languages that can use native-Unicode filenames through Win32 APIs, but PHP5 isn't one of them.
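
The encoding side of this can be sketched in Python. This only illustrates the byte conversions, since Python itself talks to Windows through the Unicode APIs. The helper to_cp1252 and the sample filenames are hypothetical, and the ISO-8859-1 line is a rough model of utf8_decode, which targets ISO-8859-1 rather than cp1252.

```python
from typing import Optional


def to_cp1252(name: str) -> Optional[bytes]:
    """Return the cp1252 bytes for name, or None if it cannot be represented."""
    try:
        return name.encode("cp1252")
    except UnicodeEncodeError:
        return None


euro = "€uro.txt"

# cp1252 maps the euro sign to the single byte 0x80.
print(to_cp1252(euro))

# ISO-8859-1 has no euro sign, so a conversion that targets it
# has to substitute something else (here "?").
print(euro.encode("iso-8859-1", errors="replace"))

# Greek letters are outside the cp1252 repertoire altogether.
print(to_cp1252("αβγ.txt"))
```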

And of course the step above shouldn't be used when deployed on a different OS where the filesystem is UTF-8-encoded. (ie. modern Linux.)

If you need to be seamlessly cross-server-compatible with PHP, you'll have to refrain from using non-ASCII characters in filenames. Sorry.

Thursday, September 8, 2022
 
sana
 
1

Everything looks fine. Only the System.out.println() console also needs to be configured to interpret the byte stream as UTF-8.

If you're sitting in an IDE like Eclipse, then you can do that by setting Window > Preferences > General > Workspace > Text File Encoding to UTF-8. For other environments, you should be more specific about that so that we can tell how to configure it.

Tuesday, December 27, 2022
5

If you truly want to see everything, use for example this hex dump function. It's good for finding odd UTF-8 sequences (a non-breaking space, for example, is not the same byte sequence as an ASCII space), a stray BOM, and so on.

It outputs like this

0000  00 01 02 03 04 05 06 07  08 09 0a 0b 0c 0d 0e 0f   ........ ........
0010  10 11 12 13 14 15 16 17  18 19 1a 1b 1c 1d 1e 1f   ........ ........
0020  20 21 22 23 24 25 26 27  28 29 2a 2b 2c 2d 2e 2f    !"#$%&' ()*+,-./
0030  30 31 32 33 34 35 36 37  38 39 3a 3b 3c 3d 3e 3f   01234567 89:;<=>?
0040  40 41 42 43 44 45 46 47  48 49 4a 4b 4c 4d 4e 4f   @ABCDEFG HIJKLMNO
0050  50 51 52 53 54 55 56 57  58 59 5a 5b 5c 5d 5e 5f   PQRSTUVW XYZ[\]^_
0060  60 61 62 63 64 65 66 67  68 69 6a 6b 6c 6d 6e 6f   `abcdefg hijklmno
0070  70 71 72 73 74 75 76 77  78 79 7a 7b 7c 7d 7e 7f   pqrstuvw xyz{|}~
0080  80 81 82 83 84 85 86 87  88 89 8a 8b 8c 8d 8e 8f   €‚ƒ„…†‡ ˆ‰Š‹ŒŽ
0090  90 91 92 93 94 95 96 97  98 99 9a 9b 9c 9d 9e 9f   ‘’“”•–— ˜™š›œžŸ
00a0  a0 a1 a2 a3 a4 a5 a6 a7  a8 a9 aa ab ac ad ae af    ¡¢£¤¥¦§ ¨©ª«¬­®¯
00b0  b0 b1 b2 b3 b4 b5 b6 b7  b8 b9 ba bb bc bd be bf   °±²³´µ¶· ¸¹º»¼½¾¿
00c0  c0 c1 c2 c3 c4 c5 c6 c7  c8 c9 ca cb cc cd ce cf   ÀÁÂÃÄÅÆÇ ÈÉÊËÌÍÎÏ
00d0  d0 d1 d2 d3 d4 d5 d6 d7  d8 d9 da db dc dd de df   ÐÑÒÓÔÕÖ× ØÙÚÛÜÝÞß
00e0  e0 e1 e2 e3 e4 e5 e6 e7  e8 e9 ea eb ec ed ee ef   àáâãäåæç èéêëìíîï
00f0  f0 f1 f2 f3 f4 f5 f6 f7  f8 f9 fa fb fc fd fe      ðñòóôõö÷ øùúûüýþ
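
The function referred to above is not reproduced here. As a rough stand-in, this is a minimal Python sketch that produces the same column layout. The name hex_dump is hypothetical, and unlike the output above it prints every non-ASCII byte as a dot rather than as a cp1252 character.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Format data as offset, two groups of hex bytes, and printable ASCII."""
    half = width // 2
    col = half * 3 - 1  # printed width of one full hex group

    def printable(chunk: bytes) -> str:
        # Show printable ASCII as-is and everything else as a dot.
        return "".join(chr(b) if 0x20 <= b < 0x7F else "." for b in chunk)

    lines = []
    for offset in range(0, len(data), width):
        row = data[offset:offset + width]
        left, right = row[:half], row[half:]
        hex_left = " ".join(f"{b:02x}" for b in left)
        hex_right = " ".join(f"{b:02x}" for b in right)
        lines.append(
            f"{offset:04x}  {hex_left:<{col}}  {hex_right:<{col}}   "
            f"{printable(left)} {printable(right)}"
        )
    return "\n".join(lines)


# A UTF-8 BOM (ef bb bf) and a non-breaking space (c2 a0) both stand out.
print(hex_dump("\ufeffa\u00a0b".encode("utf-8")))
```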
Monday, October 24, 2022
 