
I can't solve my problem with a regexp.

When I use:

$string = preg_replace("#\[name=([a-zA-Z0-9 .-]+)*\]#","$name_start $1 $name_end",$string);

everything is OK, except for text in Russian.

So I tried rewriting the regexp:

$string = preg_replace("#\[name=([a-zA-Z0-9а-яА-Я .-]+)*\]#","$name_start $1 $name_end",$string);

but this doesn't work.

One idea I know of is to list every letter:

$string = preg_replace("#\[name=([a-zA-Z0-9йцукенгшщзхъфывапролджэячсмитьбю .-]+)*\]#","$name_start $1 $name_end",$string);

but this is crazy :D

Please give me a simpler variant.



Try a Unicode range:

'/[\x{0410}-\x{042F}]/u'  // matches a capital Cyrillic letter in the range А to Я

Don't forget the /u flag for Unicode.
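
To make the range concrete, here is a minimal sketch that collects the capital Cyrillic letters from a sample string. The sample text and variable names are made up for illustration. It also shows one caveat that is easy to miss: Ё (U+0401) and ё (U+0451) lie outside the А-Я and а-я ranges, so they have to be listed separately if you need them.

```php
<?php
// Sample input, invented for this illustration.
$text = 'Привет, Мир! Hello, World!';

// With the u flag, \x{0410}-\x{042F} is the code point range А..Я.
// Single quotes keep the backslashes intact for the regex engine.
preg_match_all('/[\x{0410}-\x{042F}]/u', $text, $m);
$capitals = $m[0]; // capital Cyrillic letters found in $text

// Ё and ё sit outside А-Я / а-я, so the plain ranges reject "Ёлка".
$plain    = preg_match('/^[\x{0410}-\x{042F}\x{0430}-\x{044F}]+$/u', 'Ёлка');
// Adding \x{0401} and \x{0451} to the class makes it match.
$extended = preg_match('/^[\x{0410}-\x{042F}\x{0430}-\x{044F}\x{0401}\x{0451}]+$/u', 'Ёлка');

var_dump($capitals, $plain, $extended);
```

Here $capitals should hold П and М, $plain should be 0 and $extended should be 1.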

In your case:

"#\[name=([a-zA-Z0-9\x{0430}-\x{044F}\x{0410}-\x{042F} .-]+)\]#u"

Note that the STAR in your regex is redundant: the PLUS already "eats" everything the group can match. This would do the same:

"#\[name=([a-zA-Z0-9\x{0430}-\x{044F}\x{0410}-\x{042F} .-]+)\]#u"
Friday, September 16, 2022

I finally found a working solution thanks to Tibor's answer here: Regex to ignore accents? PHP

My function highlights text ignoring diacritics, spaces, apostrophes and dashes:

  function highlight($pattern, $string)
  {
    //split into UTF-8 characters and escape regex metacharacters
    $array = preg_split('//u', $pattern, -1, PREG_SPLIT_NO_EMPTY);
    $array = array_map(function ($c) { return preg_quote($c, '/'); }, $array);

    //add or remove characters to be ignored
    $pattern = implode("[\\s'-]*", $array);

    //list of letters with diacritics
    $replacements = Array("a" => "[áa]", "e"=>"[ée]", "i"=>"[íi]", "o"=>"[óo]", "u"=>"[úu]", "A" => "[ÁA]", "E"=>"[ÉE]", "I"=>"[ÍI]", "O"=>"[ÓO]", "U"=>"[ÚU]");

    $pattern = str_replace(array_keys($replacements), $replacements, $pattern);

    //instead of <u> you can use <b>, <i> or even <div> or <span> with css class
    //single quotes here: "\1" in double quotes is an octal escape, not a backreference
    return preg_replace("/(" . $pattern . ")/ui", '<u>$1</u>', $string);
  }
Tuesday, August 23, 2022

A list of Unicode properties can be found in

The properties for each character can be found in (1.2 MB).

In your case,

  • + (PLUS SIGN) is Sm,
  • - (HYPHEN-MINUS) is Pd,
  • * (ASTERISK) is Po,
  • / (SOLIDUS) is also Po, and

You're better off matching them with [-+*/^].
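
The answer does not say which regex engine is in use, so the following is only a sketch assuming PHP's PCRE functions. It checks the general categories listed above for the four operators, then shows why the explicit class is the safer choice: a class built from those properties also accepts unrelated characters such as "!" (Po) and "=" (Sm). All variable names are illustrative.

```php
<?php
// General categories quoted in the answer for the four listed operators.
$expected = ['+' => 'Sm', '-' => 'Pd', '*' => 'Po', '/' => 'Po'];

$confirmed = [];
foreach ($expected as $char => $category) {
    // \p{Xx} tests a character's Unicode general category.
    $confirmed[$char] = preg_match('/^\p{' . $category . '}$/u', $char) === 1;
}

// A property-based class is broad: "!" is also Po and "=" is also Sm.
$propertyClass = '/^[\p{Sm}\p{Pd}\p{Po}]$/u';
$matchesBang   = preg_match($propertyClass, '!') === 1;
$matchesEquals = preg_match($propertyClass, '=') === 1;

// The explicit class accepts only the listed operators.
$explicitClass = '/^[-+*\/^]$/';
$explicitBang  = preg_match($explicitClass, '!') === 1;
$explicitPlus  = preg_match($explicitClass, '+') === 1;

var_dump($confirmed, $matchesBang, $matchesEquals, $explicitBang, $explicitPlus);
```

Inside the explicit class the hyphen comes first and the caret does not, so both are taken literally.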

Tuesday, November 22, 2022

Your Regex is being "compiled" as ASCII-8BIT.

Just add the encoding declaration at the top of the file where the Regex is declared:

# encoding: utf-8

And you're done. Now, when Ruby is parsing your code, it will assume every literal you use (Regex, String, etc) is specified in UTF-8 encoding.

UPDATE: UTF-8 has been the default source encoding since Ruby 2.0, so the encoding comment is only needed on older versions.

Friday, August 12, 2022

Per Damian, the answer was actually in the "Manual result distillation" part of the docs.

The correct answer is to tell your <pair> token to pass the result of each <literal> subrule through as its own result, using the MATCH= alias (see "Manual result distillation" in the module documentation), like so:

   <token: pair>        '<MATCH=literal>' | "<MATCH=literal>" |

Here is what the docs say:

Regexp::Grammars also offers full manual control over the distillation process. If you use the reserved word MATCH as the alias for a subrule call [...] Note that, in this second case, even though and are captured to the result-hash, they are not returned, because the MATCH alias overrides the normal "return the result-hash" semantics and returns only what its associated subrule (i.e. ) produces.

Friday, October 21, 2022