
I'm trying to strip all characters from a string except:

  • Alphanumeric characters
  • Dollar sign ($)
  • Underscore (_)
  • Unicode characters between code points U+0080 and U+FFFF

I've got the first three conditions by doing this:

preg_replace('/[^a-zA-Z\d$_]+/', '', $foo);

How do I go about matching the fourth condition? I looked at using \X, but there has to be a better way than listing out 65,000+ characters.

 Answers

5

You can use:

$foo = preg_replace('/[^\w$\x{0080}-\x{FFFF}]+/u', '', $foo);
  • \w is equivalent to [a-zA-Z0-9_]
  • \x{0080}-\x{FFFF} matches characters between code points U+0080 and U+FFFF
  • /u enables Unicode support in the regex
Tuesday, September 13, 2022
3

I finally found a working solution thanks to Tibor's answer to "Regex to ignore accents? PHP".

My function highlights text ignoring diacritics, spaces, apostrophes and dashes:

  function highlight($pattern, $string)
  {
    $array = str_split($pattern);

    //add or remove characters to be ignored
    $pattern = implode('[\s\'-]*', $array);

    //list of letters with diacritics
    $replacements = Array("a" => "[áa]", "e"=>"[ée]", "i"=>"[íi]", "o"=>"[óo]", "u"=>"[úu]", "A" => "[ÁA]", "E"=>"[ÉE]", "I"=>"[ÍI]", "O"=>"[ÓO]", "U"=>"[ÚU]");

    $pattern=str_replace(array_keys($replacements), $replacements, $pattern);  

    //instead of <u> you can use <b>, <i> or even <div> or <span> with css class
    return preg_replace("/(" . $pattern . ")/ui", "<u>\\1</u>", $string);
  }
Tuesday, August 23, 2022
 
2

The syntax of your Unicode range will not do what you expect.

  1. The raw r'' string prevents \u escapes from being parsed, and the regex engine will not parse them either. The only range in this set is [0-\u], that is, 0 through a literal u:

    >>> re.compile(r'[\u0020-\u00d7ff]', re.DEBUG)
    in
      literal 117
      literal 48
      literal 48
      literal 50
      range (48, 117)
      literal 48
      literal 48
      literal 100
      literal 55
      literal 102
      literal 102
    
  2. Making it a Unicode literal causes \u parsing while leaving other backslashes alone (although that's not a concern here), but the leading zeroes are messing it up. The syntax is \uxxxx or \Uxxxxxxxx, so it's parsed as "\u00d7, f, f".

    >>> re.compile(ur'[\u0020-\u00d7ff]', re.DEBUG)
    in
      range (32, 215)
      literal 102
      literal 102
    
  3. Removing the leading zeroes or switching to \U0000d7ff will fix it:

    >>> re.compile(ur'[\u0020-\ud7ff]', re.DEBUG)
    in
      range (32, 55295)
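
The transcripts above rely on the ur'' prefix, which exists only in Python 2. As a minimal sketch for Python 3 (assuming version 3.3 or later, where the re module parses \uXXXX and \UXXXXXXXX escapes itself), a plain raw string is enough. The names fixed, buggy, and sample are illustrative only and not part of the original answer:

```python
import re

# Python 3: the re module understands \uXXXX itself, so a raw string works.
fixed = re.compile(r'[\u0020-\ud7ff]')

# Six hex digits after \u: parsed as the range \u0020-\u00d7
# followed by two literal "f" characters, as in step 2.
buggy = re.compile(r'[\u0020-\u00d7ff]')

sample = '\u0100'  # a character above U+00D7 but below U+D7FF

print(bool(fixed.match(sample)))  # True
print(bool(buggy.match(sample)))  # False
```

The second pattern compiles without complaint, which is why the mistake is easy to miss: it only shows up when a character above U+00D7 fails to match.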
    
Monday, August 1, 2022
4

Try:

[\u00D8-\u00F6]
Monday, December 5, 2022
 
5

A regex cannot transform characters by itself; it can only reorder them, add characters, or delete some of them.

There is preg_replace_callback or the /e flag, but they can only call existing functions, and therefore can't do better than strtolower.

If you can't rely on the existence of the mb_strtolower function, you will have to implement it yourself.

Saturday, August 6, 2022
 