Asked  2 Years ago    Answers:  5   Viewed   125 times

I found it in the following regex:


It matches square brackets (with their content) together with nested square brackets.



[^][] is a character class that means all characters except [ and ].

You can avoid escaping [ and ] special characters since it is not ambiguous for the PCRE, the regex engine used in preg_ functions.

Since [^] is incorrect in PCRE, the only way for the regex to parse is that ] is inside the character class which will be closed later. The same with the [ that follows. It can not reopen a character class (except a POSIX character class [:alnum:]) inside a character class. Then the last ] is clear; it is the end of the character class. However, a [ outside a character class must be escaped since it is parsed as the beginning of a character class.

In the same way, you can write []] or [[] or [^[] without escaping the [ or ] in the character class.

Note: since PHP 7.3, you can use the inline xx modifier that allows blank characters to be ignored even inside character classes. This way you can write these classes in a less ambigous way like that: (?xx) [^ ][ ] [ ] ] [ [ ] [^ [ ].

You can use this syntax with several regex flavour: PCRE (PHP, R), Perl, Python, Java, .NET, GO, awk, Tcl (if you delimit your pattern with curly brackets, thanks Donal Fellows), ...

But not with: Ruby, JavaScript (except for IE < 9), ...

As m.buettner noted, [^]] is not ambiguous because ] is the first character, [^a]] is seen as all that is not a a followed by a ]. To have a and ], you must write: [^a]] or [^]a]

In particular case of JavaScript, the specification allow [] as a regex token that never matches (in other words, [] will always fail) and [^] as a regex that matches any character. Then [^]] is seen as any character followed by a ]. The actual implementation varies, but modern browser generally sticks to the definition in the specification.

Pattern details:

[          # literal [
(?:         # open a non capturing group
    [^][]   # a character that is not a ] or a [
  |         # OR
    (?R)    # the whole pattern (here is the recursion)
)*          # repeat zero or more time
]          # a literal ]

In your pattern example, you don't need to escape the last ]

But you can do the same with this pattern a little bit optimized, and more useful cause reusable as subpattern (with the (?-1)): ([(?:[^][]+|(?-1))*+])

(                     # open the capturing group
    [                # a literal [
        (?:           # open a non-capturing group
            [^][]+    # all characters but ] or [ one or more time
          |           # OR
            (?-1)     # the last opened capturing group (recursion)
                      # (the capture group where you are)
        )*+           # repeat the group zero or more time (possessive)
    ]                 # literal ] (no need to escape)
)                     # close the capturing group

or better: ([[^][]*(?:(?-1)[^][]*)*+]) that avoids the cost of an alternation.

Friday, October 7, 2022

This will work only for non-nested parentheses:

    $regex = <<<HERE
    /  "  ( (?:[^"\\]++|\\.)*+ ) "
     | '  ( (?:[^'\\]++|\\.)*+ ) '
     | ( ( [^)]*                  ) )
     | [s,]+

    $tags = preg_split($regex, $str, -1,
                       | PREG_SPLIT_DELIM_CAPTURE);

The ++ and *+ will consume as much as they can and give nothing back for backtracking. This technique is described in perlre(1) as the most efficient way to do this kind of matching.

Saturday, October 29, 2022

The standard disclaimer applies: Parsing HTML with regular expressions is not ideal. Success depends on the well-formedness of the input on a character-by-character level. If you cannot guarantee this, the regex will fail to do the Right Thing at some point.

Having said that:

<ab[^>]*>(.*?)</a>   // match group one will contain the link text
Tuesday, September 6, 2022

\pL is a Unicode property shortcut. It can also be written as asp{L} or p{Letter}. It matches any kind of letter from any language.

Saturday, December 17, 2022

These are considered Unicode properties.

The Unicode property p{L} — shorthand for p{Letter} will match any kind of letter from any language. Therefore, p{Lu} will match an uppercase letter that has a lowercase variant. And, the opposite p{Ll} will match a lowercase letter that has an uppercase variant.

Concisely, this would match any lowercase/uppercase that has a variant from any language:

Friday, October 14, 2022
Only authorized users can answer the search term. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :

Browse Other Code Languages