"replacing invalid utf-8 characters by question marks, mbstring.substitute_character seems ignored" Code Answer
You can use mb_convert_encoding(), or htmlspecialchars() with the ENT_SUBSTITUTE option (available since PHP 5.4). Of course, you can use preg_match() too. If you use intl, you can use UConverter (available since PHP 5.5).
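As a minimal sketch of the htmlspecialchars() route: escape with ENT_SUBSTITUTE so that invalid sequences become U+FFFD, then undo the escaping. The helper name `scrub_with_htmlspecialchars` is hypothetical, and ENT_NOQUOTES is an assumption made here so that the decode step restores the input exactly.

```php
<?php
// Hypothetical helper; uses only htmlspecialchars() and htmlspecialchars_decode().
function scrub_with_htmlspecialchars(string $str): string
{
    // ENT_SUBSTITUTE replaces each invalid sequence with U+FFFD
    // instead of making the call return an empty string.
    // ENT_NOQUOTES leaves quotes alone, so decoding below is an exact inverse.
    $escaped = htmlspecialchars($str, ENT_NOQUOTES | ENT_SUBSTITUTE, 'UTF-8');

    // Undo the &amp; / &lt; / &gt; escaping; the U+FFFD characters stay.
    return htmlspecialchars_decode($escaped, ENT_NOQUOTES);
}

// A lone 0x80 byte can never appear on its own in UTF-8.
echo bin2hex(scrub_with_htmlspecialchars("a\x80b")), "\n";
```

Text that already contains markup or entities should pass through unchanged, because the decode step reverses exactly one level of escaping.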
The recommended substitute character for an invalid byte sequence is U+FFFD. See "3.1.2 Substituting for Ill-Formed Subsequences" in UTR #36: Unicode Security Considerations for details.
When using mb_convert_encoding(), you can specify the substitute character by passing a Unicode code point to mb_substitute_character() or by setting the mbstring.substitute_character directive. The default substitute character is ? (QUESTION MARK, U+003F).
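A minimal sketch of that approach, assuming the mbstring extension is loaded. The helper name `scrub_with_mbstring` is hypothetical; it sets the substitute character, converts from UTF-8 to UTF-8, and restores the previous setting afterwards.

```php
<?php
// Hypothetical helper; requires ext-mbstring.
function scrub_with_mbstring(string $str, int $codePoint = 0xFFFD): string
{
    // Remember the current substitute character so it can be restored.
    $previous = mb_substitute_character();
    mb_substitute_character($codePoint);

    try {
        // Converting UTF-8 to UTF-8 forces mbstring to validate the input
        // and substitute whatever it cannot decode.
        return mb_convert_encoding($str, 'UTF-8', 'UTF-8');
    } finally {
        mb_substitute_character($previous);
    }
}

echo bin2hex(scrub_with_mbstring("a\x80b")), "\n";       // uses U+FFFD
echo scrub_with_mbstring("a\x80b", 0x3F), "\n";          // uses "?"
```

Exactly how many substitute characters a longer ill-formed run produces can differ between PHP versions, so treat this as a sketch rather than a guaranteed result.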
UConverter offers both a procedural and an object-oriented API.
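The two styles might look like the following sketch, which assumes the intl extension is installed. The function names are hypothetical; `UConverter::transcode()` is the static one-shot call and `UConverter::convert()` is the instance method.

```php
<?php
// Hypothetical helpers; both require ext-intl.

// One-shot static call: transcode($string, $toEncoding, $fromEncoding).
function scrub_with_transcode(string $str)
{
    return UConverter::transcode($str, 'UTF-8', 'UTF-8');
}

// Object-oriented form: the constructor takes (destination, source).
function scrub_with_converter(string $str)
{
    $converter = new UConverter('UTF-8', 'UTF-8');
    return $converter->convert($str);
}

if (class_exists('UConverter')) {
    echo bin2hex(scrub_with_transcode("a\x80b")), "\n";
    echo bin2hex(scrub_with_converter("a\x80b")), "\n";
}
```

Both calls can return false on failure, which is why the sketch does not declare a string return type.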
When using preg_match(), you need to pay attention to the byte ranges in order to avoid the UTF-8 non-shortest-form vulnerability. The valid range of trail bytes changes depending on the lead byte.
You can refer to the following resources to check the byte ranges.
The byte range table is below.
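As a sketch, the well-formed byte ranges from Table 3-7 ("Well-Formed UTF-8 Byte Sequences") of The Unicode Standard can be written as a byte-oriented PCRE pattern. The function name `is_well_formed_utf8` is hypothetical. The pattern is used without the `u` modifier, so each `\xNN` matches a single byte.

```php
<?php
// Hypothetical helper: true when $str consists only of well-formed UTF-8 sequences.
function is_well_formed_utf8(string $str): bool
{
    // One alternative per row of the byte range table.
    // The possessive quantifier (*+) avoids needless backtracking.
    $pattern = '/\A(?:
          [\x00-\x7F]                          # U+0000..U+007F
        | [\xC2-\xDF][\x80-\xBF]               # U+0080..U+07FF   (C0, C1 excluded)
        | \xE0[\xA0-\xBF][\x80-\xBF]           # U+0800..U+0FFF   (no overlong forms)
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # U+1000..U+CFFF, U+E000..U+FFFF
        | \xED[\x80-\x9F][\x80-\xBF]           # U+D000..U+D7FF   (no surrogates)
        | \xF0[\x90-\xBF][\x80-\xBF]{2}        # U+10000..U+3FFFF (no overlong forms)
        | [\xF1-\xF3][\x80-\xBF]{3}            # U+40000..U+FFFFF
        | \xF4[\x80-\x8F][\x80-\xBF]{2}        # U+100000..U+10FFFF
    )*+\z/x';

    return preg_match($pattern, $str) === 1;
}

var_dump(is_well_formed_utf8("\xE3\x81\x82")); // valid 3-byte sequence
var_dump(is_well_formed_utf8("\xC0\xAF"));     // non-shortest form of "/"
```

Note how the second byte is restricted after E0, ED, F0 and F4: that is what rejects non-shortest forms, surrogates, and code points above U+10FFFF.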
How to replace an invalid byte sequence without breaking valid characters is described in "3.1.1 Ill-Formed Subsequences" in UTR #36: Unicode Security Considerations and in "Table 3-8. Use of U+FFFD in UTF-8 Conversion" in The Unicode Standard.
The Unicode Standard shows an example:
Here is an implementation using preg_replace_callback() that follows the above rule.
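The following is a sketch of one way to do this, not necessarily the original code. It uses preg_replace_callback() only to isolate runs of non-ASCII bytes, then walks each run byte by byte and replaces every maximal ill-formed subpart with a single U+FFFD. The function name `replace_invalid_byte_sequence` and its internal layout are assumptions.

```php
<?php
// Hypothetical helper: replace each ill-formed subsequence with U+FFFD.
function replace_invalid_byte_sequence(string $str): string
{
    // ASCII bytes are always valid, so only non-ASCII runs need checking.
    return preg_replace_callback('/[\x80-\xFF]+/', function (array $m): string {
        $bytes = $m[0];
        $len = strlen($bytes);
        $out = '';
        $i = 0;

        while ($i < $len) {
            $lead = ord($bytes[$i]);

            // $need: number of trail bytes; $lo..$hi: range allowed for the FIRST trail byte.
            if ($lead >= 0xC2 && $lead <= 0xDF)     { $need = 1; $lo = 0x80; $hi = 0xBF; }
            elseif ($lead === 0xE0)                 { $need = 2; $lo = 0xA0; $hi = 0xBF; }
            elseif ($lead === 0xED)                 { $need = 2; $lo = 0x80; $hi = 0x9F; }
            elseif ($lead >= 0xE1 && $lead <= 0xEF) { $need = 2; $lo = 0x80; $hi = 0xBF; }
            elseif ($lead === 0xF0)                 { $need = 3; $lo = 0x90; $hi = 0xBF; }
            elseif ($lead >= 0xF1 && $lead <= 0xF3) { $need = 3; $lo = 0x80; $hi = 0xBF; }
            elseif ($lead === 0xF4)                 { $need = 3; $lo = 0x80; $hi = 0x8F; }
            else {
                // 80..C1 and F5..FF can never start a sequence.
                $out .= "\xEF\xBF\xBD";
                $i++;
                continue;
            }

            // Consume as many valid trail bytes as possible.
            $j = $i + 1;
            $got = 0;
            while ($got < $need && $j < $len) {
                $b = ord($bytes[$j]);
                if ($b < $lo || $b > $hi) {
                    break;
                }
                $j++;
                $got++;
                $lo = 0x80; // later trail bytes always use 80..BF
                $hi = 0xBF;
            }

            // Complete sequence: keep it. Otherwise the consumed prefix
            // is one maximal subpart and becomes a single U+FFFD.
            $out .= ($got === $need) ? substr($bytes, $i, $need + 1) : "\xEF\xBF\xBD";
            $i = $j;
        }

        return $out;
    }, $str);
}

// Byte sequence used in the Unicode Standard's U+FFFD conversion example.
$input = "\x61\xF1\x80\x80\xE1\x80\xC2\x62\x80\x63\x80\xBF\x64";
echo bin2hex(replace_invalid_byte_sequence($input)), "\n";
```

For that input the expected code points are 0061 FFFD FFFD FFFD 0062 FFFD 0063 FFFD FFFD 0064: the truncated F1 80 80, E1 80 and C2 each collapse to one U+FFFD, while each stray trail byte gets its own.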
This way, you can compare bytes directly and avoid preg_match()'s restriction on byte size.
The test case is here.
As a note, mb_convert_encoding() has a bug: it breaks a valid character that comes just after an invalid byte sequence, or it removes an invalid byte sequence that follows valid characters without adding U+FFFD.
Although preg_match() can be used instead of preg_replace_callback(), it has a limitation on byte size. See bug report #36463 for details. You can confirm it with the following test case.
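Whether a given input actually hits the limit depends on the PHP and PCRE versions and settings, so the sketch below does not try to reproduce the failure. It only shows how to tell "no match" apart from "the PCRE engine gave up", which is the distinction that matters when checking for this limitation. The helper name `match_or_report` is hypothetical.

```php
<?php
// Hypothetical helper: run preg_match() and report engine errors separately.
function match_or_report(string $pattern, string $subject): array
{
    $result = preg_match($pattern, $subject);

    if ($result === false) {
        // preg_last_error() returns a PREG_*_ERROR constant, for example
        // PREG_BACKTRACK_LIMIT_ERROR or PREG_BAD_UTF8_ERROR.
        return ['ok' => false, 'error' => preg_last_error()];
    }

    return ['ok' => true, 'matched' => $result === 1];
}

// A normal match on a short subject.
var_dump(match_or_report('/\A[\x00-\x7F]*\z/', 'abc'));

// With the u modifier, an invalid subject makes preg_match() return false.
var_dump(match_or_report('//u', "\x80"));
```

Treating a false return value as "invalid input" without checking preg_last_error() would hide a limit error on a long but perfectly valid string.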
Finally, the results of my benchmark are as follows.
The benchmark code is here.