
I'm having a problem removing non-UTF-8 characters from a string; they are not displaying properly. The characters look like this: 0x97 0x61 0x6C 0x6F (hex representation).

What is the best way to remove them? A regular expression, or something else?

 Answers


Using a regex approach:

$regex = <<<'END'
/
  (
    (?: [\x00-\x7F]                 # single-byte sequences   0xxxxxxx
    |   [\xC0-\xDF][\x80-\xBF]      # double-byte sequences   110xxxxx 10xxxxxx
    |   [\xE0-\xEF][\x80-\xBF]{2}   # triple-byte sequences   1110xxxx 10xxxxxx * 2
    |   [\xF0-\xF7][\x80-\xBF]{3}   # quadruple-byte sequence 11110xxx 10xxxxxx * 3
    ){1,100}                        # ...one or more times (at most 100)
  )
| .                                 # anything else
/x
END;
preg_replace($regex, '$1', $text);

It searches for UTF-8 sequences, and captures those into group 1. It also matches single bytes that could not be identified as part of a UTF-8 sequence, but does not capture those. Replacement is whatever was captured into group 1. This effectively removes all invalid bytes.
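
Applied to the bytes from the question, a minimal usage sketch might look like this (the variable names are mine):

$text  = "\x97\x61\x6C\x6F";                // 0x97 followed by "alo", as in the question
$clean = preg_replace($regex, '$1', $text);
var_dump($clean);                           // string(3) "alo" -- the stray 0x97 byte is gone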

It is possible to repair the string, by encoding the invalid bytes as UTF-8 characters. But if the errors are random, this could leave some strange symbols.

$regex = <<<'END'
/
  (
    (?: [\x00-\x7F]               # single-byte sequences   0xxxxxxx
    |   [\xC0-\xDF][\x80-\xBF]    # double-byte sequences   110xxxxx 10xxxxxx
    |   [\xE0-\xEF][\x80-\xBF]{2} # triple-byte sequences   1110xxxx 10xxxxxx * 2
    |   [\xF0-\xF7][\x80-\xBF]{3} # quadruple-byte sequence 11110xxx 10xxxxxx * 3
    ){1,100}                      # ...one or more times (at most 100)
  )
| ( [\x80-\xBF] )                 # invalid byte in range 10000000 - 10111111
| ( [\xC0-\xFF] )                 # invalid byte in range 11000000 - 11111111
/x
END;
function utf8replacer($captures) {
  if ($captures[1] != "") {
    // Valid byte sequence. Return unmodified.
    return $captures[1];
  }
  elseif ($captures[2] != "") {
    // Invalid byte of the form 10xxxxxx.
    // Encode as 11000010 10xxxxxx.
    return "xC2".$captures[2];
  }
  else {
    // Invalid byte of the form 11xxxxxx.
    // Encode as 11000011 10xxxxxx.
    return "xC3".chr(ord($captures[3])-64);
  }
}
preg_replace_callback($regex, "utf8replacer", $text);
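
A rough sketch of the repair variant on the same input (the expected result bytes are shown in the comments):

$text  = "\x97\x61\x6C\x6F";                                   // 0x97 "alo"
$fixed = preg_replace_callback($regex, "utf8replacer", $text);
var_dump(bin2hex($fixed));                                     // string(10) "c297616c6f"
// 0x97 is re-encoded as 0xC2 0x97 (U+0097), so the result is now valid UTF-8,
// although U+0097 is an invisible control character.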

EDIT:

  • !empty(x) will match non-empty values ("0" is considered empty).
  • x != "" will match non-empty values, including "0".
  • x !== "" will match anything except "".

x != "" seem the best one to use in this case.

I have also sped up the match a little. Instead of matching each character separately, it matches sequences of valid UTF-8 characters.

Monday, September 19, 2022

The Arabic regex is:

[\u0600-\u06FF]

Actually, the Arabic characters you listed explicitly are a subset of this Arabic range, so I think you can remove them from the pattern.

So, in JS it will be

/^[a-z0-9+,()\/'\s\u0600-\u06FF-]+$/i


Tuesday, October 11, 2022

Your requirements are not clear. All characters in a Java String are Unicode characters, so if you remove them, you'll be left with an empty string. I assume what you mean is that you want to remove any non-ASCII, non-printable characters.

String clean = str.replaceAll("\\P{Print}", "");

Here, \p{Print} is a POSIX character class for printable ASCII characters, and \P{Print} is the complement of that class. With this expression, every character that is not printable ASCII is replaced with the empty string. (The extra backslash is needed because \ starts an escape sequence in Java string literals.)


Apparently, all the input characters are actually ASCII characters that represent a printable encoding of non-printable or non-ASCII characters. Mongo shouldn't have any trouble with these strings, because they contain only plain printable ASCII characters.

This all sounds a little fishy to me. What I believe is happening is that the data really do contain non-printable and non-ASCII characters, and another component (like a logging framework) is replacing these with a printable representation. In your simple tests, you are failing to translate the printable representation back to the original string, so you mistakenly believe the first regular expression is not working.

That's my guess, but if I've misread the situation and you really do need to strip out literal \xHH escapes, you can do it with the following regular expression.

String clean = str.replaceAll("\\\\x\\p{XDigit}{2}", "");

The API documentation for the Pattern class does a good job of listing all of the syntax supported by Java's regex library. For more elaboration on what all of the syntax means, I have found the Regular-Expressions.info site very helpful.

Thursday, September 29, 2022
 
rzcoder
 

For this PHP regex:

$str = preg_replace ( '{(.)\1+}', '$1', $str );
$str = preg_replace ( '{[ \'-_()]}', '', $str );

In Java:

str = str.replaceAll("(.)\\1+", "$1");
str = str.replaceAll("[ '-_\\(\\)]", "");
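
To show what the two replacements actually do, here is a PHP sketch with a made-up input (the question did not include one):

$str = "heelloo  (world)";
$str = preg_replace('{(.)\1+}', '$1', $str);   // collapse repeated characters: "helo (world)"
$str = preg_replace('{[ \'-_()]}', '', $str);  // strip spaces and chars in the '-_ range: "heloworld"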

I suggest you provide your input and expected output; then you will get better answers on how it can be done in PHP and/or Java.

Sunday, October 9, 2022
 
haodong
 

In general, to remove non-ASCII characters, use str.encode with errors='ignore':

df['col'] = df['col'].str.encode('ascii', 'ignore').str.decode('ascii')

To perform this on multiple string columns, use

u = df.select_dtypes(object)
df[u.columns] = u.apply(
    lambda x: x.str.encode('ascii', 'ignore').str.decode('ascii'))

That still won't handle the null characters in your columns, though. For those, you can replace them using a regex:

df2 = df.replace(r'\W+', '', regex=True)

Saturday, August 13, 2022
 