Viewed   69 times

I can't solve my problem with regexp.

Ok, when i type:

$string = preg_replace("#[name=([a-zA-Z0-9 .-]+)*]#","$name_start $1 $name_end",$string);

everything is ok, except situation with Russian language.

so, i try to re-type this reg-exp:

$string = preg_replace("#[name=([a-zA-Z0-9**а-яА-Я** .-]+)*]#","$name_start $1 $name_end",$string);

but this not working,

i know some idea, just write:

$string = preg_replace("#[name=([a-zA-Z0-9йцукенгшщзхъфывапролджэячсмитьбю .-]+)*]#","$name_start $1 $name_end",$string);

but this is crazy :D

please, give me simple variant

 Answers

1

Try a Unicode range:

'/[x{0410}-x{042F}]/u'  // matches a capital cyrillic letter in the range A to Ya

Don't forget the /u flag for Unicode.

In your case:

"#[name=([a-zA-Z0-9x{0430}-x{044F}x{0410}-x{042F} .-]+)*]#u"

Note that the STAR in your regex is redundant. Everything already gets "eaten" by the PLUS. This would do the same:

"#[name=([a-zA-Z0-9x{0430}-x{044F}x{0410}-x{042F} .-]+)]#u"
Friday, September 16, 2022
 
nom
 
nom
3

I finally found working solution thanks to this Tibor's answer here: Regex to ignore accents? PHP

My function highlights text ignoring diacritics, spaces, apostrophes and dashes:

  function highlight($pattern, $string)
  {
    $array = str_split($pattern);

    //add or remove characters to be ignored
    $pattern=implode('[s'-]*', $array);  

    //list of letters with diacritics
    $replacements = Array("a" => "[áa]", "e"=>"[ée]", "i"=>"[íi]", "o"=>"[óo]", "u"=>"[úu]", "A" => "[ÁA]", "E"=>"[ÉE]", "I"=>"[ÍI]", "O"=>"[ÓO]", "U"=>"[ÚU]");

    $pattern=str_replace(array_keys($replacements), $replacements, $pattern);  

    //instead of <u> you can use <b>, <i> or even <div> or <span> with css class
    return preg_replace("/(" . $pattern . ")/ui", "<u>\1</u>", $string);
  }
Tuesday, August 23, 2022
 
5

A list of Unicode properties can be found in http://www.unicode.org/Public/UNIDATA/PropList.txt.

The properties for each character can be found in http://www.unicode.org/Public/UNIDATA/UnicodeData.txt (1.2 MB).

In your case,

  • + (PLUS SIGN) is Sm,
  • - (HYPHEN-MINUS) is Pd,
  • * (ASTERISK) is Po,
  • / (SOLIDUS) is also Po, and
  • ^ (CIRCUMFLEX ACCENT) is Sk.

You're better off matching them with [-+*/^].

Tuesday, November 22, 2022
 
mrbyte
 
3

Your Regex is being "compiled" as ASCII-8BIT.

Just add the encoding declaration at the top of the file where the Regex is declared:

# encoding: utf-8

And you're done. Now, when Ruby is parsing your code, it will assume every literal you use (Regex, String, etc) is specified in UTF-8 encoding.

UPDATE: UTF-8 is now the default encoding for Ruby 2.0 and beyond.

Friday, August 12, 2022
1

Per Damian, the answer was actually in the "Manual result distillation" part of the docs

The correct answer is to tell your <pair> token
to pass the result of each <literal> subrule through as its own
result, using the MATCH=
alias (see: "Manual result distillation" in the module documentation)  like so:

   <token: pair>        '<MATCH=literal>' | "<MATCH=literal>" |
$$<MATCH=literal>$$

Here is what the docs say:

Regexp::Grammars also offers full manual control over the distillation process. If you use the reserved word MATCH as the alias for a subrule call [...] Note that, in this second case, even though and are captured to the result-hash, they are not returned, because the MATCH alias overrides the normal "return the result-hash" semantics and returns only what its associated subrule (i.e. ) produces.

Friday, October 21, 2022
 
Only authorized users can answer the search term. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :