My string contains a UTF-8 non-breaking space (bytes 0xC2 0xA0), and I want to replace it with something else.

When I use

$str = preg_replace('~\xc2\xa0~', 'X', $str);

it works OK.

But when I use

$str = preg_replace('~\x{C2A0}~siu', 'W', $str);

the non-breaking space is not found (and so not replaced).

Why? What is wrong with the second regexp?

The \x{C2A0} format is correct, and I also used the u flag.

 Answers

3

The PHP documentation on escape sequences is misleading here. In a pattern without the u flag, \xc2\xa0 matches the two raw bytes C2 A0, which is how the character is stored in UTF-8. With the u flag, \x{c2a0} is read as a Unicode code point (U+C2A0), which is then converted to its own UTF-8 encoding, so it looks for a different character.

A non-breaking space is U+00A0 in Unicode; C2 A0 is only its UTF-8 encoding. So if you use the pattern ~\x{00a0}~siu, it will work as expected.
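
To see the difference concretely, here is a minimal sketch. The sample string and variable names are made up for illustration; only the three patterns come from the discussion above.

```php
<?php
// Sample input (assumed): "a", a UTF-8 no-break space (bytes C2 A0), then "b".
$str = "a\xc2\xa0b";

// No u flag: the pattern matches the two raw bytes C2 A0.
$bytes = preg_replace('~\xc2\xa0~', 'X', $str);

// u flag: \x{C2A0} is the code point U+C2A0, which does not occur in $str.
$wrong = preg_replace('~\x{C2A0}~u', 'W', $str);

// u flag: \x{00A0} is the code point of the no-break space itself.
$right = preg_replace('~\x{00A0}~u', 'W', $str);

var_dump($bytes === 'aXb');  // byte pattern replaced the character
var_dump($wrong === $str);   // nothing matched, string unchanged
var_dump($right === 'aWb');  // code-point pattern replaced the character
```

All three comparisons should come out true if the explanation above holds.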

Friday, November 4, 2022
4

This email (archived here) contains a list of all Unicode whitespace characters encoded in UTF-8, UTF-16, and HTML.

In the archived link look for the 'utf8_whitespace_table' function.

static $whitespace = array(
    "SPACE" => "\x20",
    "NO-BREAK SPACE" => "\xc2\xa0",
    "OGHAM SPACE MARK" => "\xe1\x9a\x80",
    "EN QUAD" => "\xe2\x80\x80",
    "EM QUAD" => "\xe2\x80\x81",
    "EN SPACE" => "\xe2\x80\x82",
    "EM SPACE" => "\xe2\x80\x83",
    "THREE-PER-EM SPACE" => "\xe2\x80\x84",
    "FOUR-PER-EM SPACE" => "\xe2\x80\x85",
    "SIX-PER-EM SPACE" => "\xe2\x80\x86",
    "FIGURE SPACE" => "\xe2\x80\x87",
    "PUNCTUATION SPACE" => "\xe2\x80\x88",
    "THIN SPACE" => "\xe2\x80\x89",
    "HAIR SPACE" => "\xe2\x80\x8a",
    "ZERO WIDTH SPACE" => "\xe2\x80\x8b",
    "NARROW NO-BREAK SPACE" => "\xe2\x80\xaf",
    "MEDIUM MATHEMATICAL SPACE" => "\xe2\x81\x9f",
    "IDEOGRAPHIC SPACE" => "\xe3\x80\x80",
);
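
One possible way to use such a table, as a rough sketch: replace every listed byte sequence with a plain ASCII space. The helper name normalize_spaces is hypothetical (it is not part of the archived code), and only a subset of the table is repeated to keep the example short.

```php
<?php
// Subset of the table above: Unicode name => UTF-8 byte sequence.
$whitespace = array(
    "NO-BREAK SPACE" => "\xc2\xa0",
    "EM SPACE" => "\xe2\x80\x83",
    "IDEOGRAPHIC SPACE" => "\xe3\x80\x80",
);

// Hypothetical helper: swap each listed sequence for an ASCII space.
// str_replace() works on bytes, which is fine here because the
// search strings are complete UTF-8 sequences.
function normalize_spaces($text, array $table) {
    return str_replace(array_values($table), ' ', $text);
}

$clean = normalize_spaces("a\xc2\xa0b\xe3\x80\x80c", $whitespace);
var_dump($clean === 'a b c');
```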
Tuesday, October 18, 2022
 
5

The Arabic regex is:

[\u0600-\u06FF]

Actually, ?-? is a subset of this Arabic range, so I think you can remove them from the pattern.

So, in JS it will be

/^[a-z0-9+,()\/'\s\u0600-\u06FF-]+$/i
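
For comparison, a rough PHP equivalent is sketched below. It assumes PHP's preg functions, where a code point is written as \x{...} together with the u flag rather than as \uXXXX; the test strings are made up for illustration.

```php
<?php
// Same character class as the JS pattern, with the Arabic block written
// as \x{0600}-\x{06FF}. The ~ delimiter avoids having to escape "/".
$pattern = '~^[a-z0-9+,()/\'\s\x{0600}-\x{06FF}-]+$~iu';

// Latin letters, digits, parentheses, spaces and two Arabic letters.
$ok = preg_match($pattern, "abc (12) \u{0645}\u{0631}");

// "#" is not in the class, so the whole string is rejected.
$bad = preg_match($pattern, "abc#");

var_dump($ok === 1);
var_dump($bad === 0);
```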


Tuesday, October 11, 2022
3

Thanks guys, this is how I managed to meet the cross-platform (Windows and Linux) requirement:

  1. Downloaded and installed MinGW and MSYS.
  2. Downloaded the libiconv source package.
  3. Compiled libiconv via MSYS.

That's about it.

Thursday, November 24, 2022
 
g_san
 
3

For this PHP regex:

$str = preg_replace('{(.)\1+}', '$1', $str);
$str = preg_replace('{[ \'-_\(\)]}', '', $str);

In Java:

str = str.replaceAll("(.)\\1+", "$1");
str = str.replaceAll("[ '-_\\(\\)]", "");

I suggest you provide your input and expected output; then you will get better answers on how it can be done in PHP and/or Java.

Sunday, October 9, 2022
 
haodong
 