I found it in the following regex:
\[(?:[^][]|(?R))*\]
It matches square brackets (with their content) together with nested square brackets.
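To see the recursion at work, here is a minimal sketch that runs the pattern over a made-up string (the sample text and variable names are only for illustration):

```php
<?php
// Hypothetical sample input, chosen only for illustration.
$text = 'a [b [c] d] e [f]';

// \[ and \] are literal brackets; (?R) recurses into the whole pattern,
// which is how the nested [c] is consumed as part of the outer match.
preg_match_all('/\[(?:[^][]|(?R))*\]/', $text, $matches);

print_r($matches[0]);
```

Only the outermost pairs are reported; the nested pair is swallowed by the outer match.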
This will work only for non-nested parentheses:
// Nowdoc (quoted HERE) keeps backslashes literal, so PCRE sees \\ as one escaped backslash.
$regex = <<<'HERE'
/ " ( (?:[^"\\]++|\\.)*+ ) "
| ' ( (?:[^'\\]++|\\.)*+ ) '
| \( ( [^)]* ) \)
| [\s,]+
/x
HERE;
$tags = preg_split($regex, $str, -1,
PREG_SPLIT_NO_EMPTY
| PREG_SPLIT_DELIM_CAPTURE);
The ++ and *+ quantifiers are possessive: they consume as much as they can and give nothing back for backtracking. This technique is described in perlre(1) as the most efficient way to do this kind of matching.
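A rough sketch of both points, assuming a made-up input string: first a greedy versus possessive contrast, then the split itself. The pattern is written in a nowdoc here so that the backslashes reach PCRE unchanged; the variable names and the sample are not from the original.

```php
<?php
// Greedy \d+ gives one digit back so the trailing \d can match.
$greedy = preg_match('/\d+\d/', '123');

// Possessive \d++ keeps all three digits, so the trailing \d finds nothing.
$possessive = preg_match('/\d++\d/', '123');

// Nowdoc: backslashes are literal, so \\ below is one escaped backslash for PCRE.
$regex = <<<'RE'
/ " ( (?:[^"\\]++|\\.)*+ ) "
| ' ( (?:[^'\\]++|\\.)*+ ) '
| \( ( [^)]* ) \)
| [\s,]+
/x
RE;

// Hypothetical input, made up for this sketch.
$str = 'foo, "bar baz" \'qux\' (a b)';

// Quoted and parenthesized chunks come back through the capture groups;
// whitespace and commas are plain delimiters and are dropped.
$tags = preg_split($regex, $str, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

print_r($tags);
```

With PREG_SPLIT_NO_EMPTY, the capture groups of the alternatives that did not take part in a match are skipped, which is why each delimiter contributes at most one element.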
The standard disclaimer applies: Parsing HTML with regular expressions is not ideal. Success depends on the well-formedness of the input on a character-by-character level. If you cannot guarantee this, the regex will fail to do the Right Thing at some point.
Having said that:
<a\b[^>]*>(.*?)</a> // match group one will contain the link text
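As a quick illustration, and only for input you know to be well-formed, the pattern could be used like this (the markup below is invented for the example, and ~ is used as the delimiter so that </a> needs no escaping):

```php
<?php
// Hypothetical, well-formed sample markup.
$html = '<abbr>n/a</abbr> <a href="/one">First</a> and <a class="x" href="/two">Second</a>';

// \b keeps <a from matching the start of <abbr>.
// (.*?) is lazy, so each match stops at the nearest </a>.
preg_match_all('~<a\b[^>]*>(.*?)</a>~', $html, $m);

print_r($m[1]);
```

If link text may span several lines, the s modifier would be needed so that the dot also matches newlines.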
\pL is a Unicode property shortcut. It can also be written as \p{L} or \p{Letter}. It matches any kind of letter from any language.
These are considered Unicode properties.
The Unicode property \p{L}, shorthand for \p{Letter}, will match any kind of letter from any language. Therefore, \p{Lu} will match an uppercase letter that has a lowercase variant. And, the opposite, \p{Ll} will match a lowercase letter that has an uppercase variant.
Concisely, this would match any lowercase or uppercase letter that has a variant, from any language:
AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZz
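A small sketch of these properties in PHP, assuming PHP 7.0 or later for the "\u{...}" string escapes (the sample string is made up). Note the u modifier, which makes PCRE treat the pattern and the subject as UTF-8:

```php
<?php
// "\u{...}" escapes keep the sample independent of the file encoding.
// The string reads: E-acute "lan", space, capital Omega "mega", space, "42".
$text = "\u{C9}lan \u{3A9}mega 42";

preg_match_all('/\p{Lu}/u', $text, $upper);   // single uppercase letters
preg_match_all('/\p{Ll}+/u', $text, $lower);  // runs of lowercase letters
preg_match_all('/\pL+/u', $text, $words);     // runs of letters of any case

print_r($upper[0]);
print_r($lower[0]);
print_r($words[0]);
```

The digits are never matched, since they belong to the number categories rather than to L.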
[^][] is a character class that means all characters except [ and ].
You can avoid escaping the [ and ] special characters since it is not ambiguous for PCRE, the regex engine used in the preg_ functions.
Since [^] is incorrect in PCRE, the only way for the regex to parse is that ] is inside the character class, which will be closed later. The same goes for the [ that follows: it cannot reopen a character class (except a POSIX character class such as [:alnum:]) inside a character class. Then the last ] is clear; it is the end of the character class. However, a [ outside a character class must be escaped, since it is parsed as the beginning of a character class.
In the same way, you can write []] or [[] or [^[] without escaping the [ or ] in the character class.
Note: since PHP 7.3, you can use the inline xx modifier, which allows blank characters to be ignored even inside character classes. This way you can write these classes in a less ambiguous way:
(?xx) [^ ][ ] [ ] ] [ [ ] [^ [ ]
You can use this syntax with several regex flavours: PCRE (PHP, R), Perl, Python, Java, .NET, Go, awk, Tcl (if you delimit your pattern with curly brackets, thanks Donal Fellows), ...
But not with: Ruby, JavaScript (except for IE < 9), ...
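The unescaped forms can be checked quickly in PHP. This is just a sketch over an invented subject string:

```php
<?php
// Hypothetical subject containing three [ and three ].
$subject = 'a[b]c[[d]]';

// [^][] : any character except [ and ]; removing those leaves only brackets.
$onlyBrackets = preg_replace('/[^][]/', '', $subject);

// []] : a class containing only ]
$closing = preg_match_all('/[]]/', $subject);

// [[] : a class containing only [
$opening = preg_match_all('/[[]/', $subject);

// [^[] : any character except [
$withoutOpening = preg_replace('/[^[]/', '', $subject);

var_dump($onlyBrackets, $closing, $opening, $withoutOpening);
```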
As m.buettner noted, [^]] is not ambiguous because ] is the first character; [^a]] is seen as any character that is not an a, followed by a ]. To negate both a and ], you must write [^a\]] or [^]a].
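The difference between the three spellings shows up on a short test string (made up for this sketch):

```php
<?php
$subject = 'a]b]';

// [^a]] : one character that is not a, then a literal ]
preg_match_all('/[^a]]/', $subject, $nonAThenBracket);

// [^]a] : one character that is neither ] nor a (] comes first, so it is literal)
preg_match_all('/[^]a]/', $subject, $leadingBracket);

// [^a\]] : same class, with the ] escaped instead of moved to the front
preg_match_all('/[^a\]]/', $subject, $escapedBracket);

print_r($nonAThenBracket[0]);
print_r($leadingBracket[0]);
print_r($escapedBracket[0]);
```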
In the particular case of JavaScript, the specification allows [] as a regex token that never matches (in other words, [] will always fail) and [^] as a regex that matches any character. Then [^]] is seen as any character followed by a ]. Actual implementations vary, but modern browsers generally stick to the definition in the specification.
Pattern details:
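One way to read the recursive bracket pattern piece by piece is to write it in extended mode with comments. This is a sketch: the layout and the comments are mine, and the test string is invented.

```php
<?php
$pattern = <<<'RE'
/
\[            # a literal opening bracket
(?:           # start of a non-capturing group
    [^][]     #   one character that is neither [ nor ]
  |           #   or
    (?R)      #   the whole pattern again, for a nested pair
)*            # the group repeats zero or more times
\]            # a literal closing bracket
/x
RE;

preg_match($pattern, 'x [a [b] c] y', $m);
echo $m[0], "\n";
```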
In your pattern example, you don't need to escape the last ].
But you can do the same with this pattern, which is slightly optimized and more useful because it is reusable as a subpattern (with the (?-1)):
(\[(?:[^][]+|(?-1))*+])
or better:
(\[[^][]*(?:(?-1)[^][]*)*+])
that avoids the cost of an alternation.
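A short sketch comparing the two variants, and then reusing the second one as a subpattern inside a larger regex. The sample strings and the (\w+) prefix are assumptions made for the example:

```php
<?php
$alternation = '/(\[(?:[^][]+|(?-1))*+])/';
$unrolled    = '/(\[[^][]*(?:(?-1)[^][]*)*+])/';

// Hypothetical input with flat and nested brackets.
$text = 'x[1][a[b]] y[2]';

preg_match_all($alternation, $text, $m1);
preg_match_all($unrolled, $text, $m2);

// Reuse as a subpattern: (?-1) is relative, so it still points at the
// bracket group even though that group is now number 2.
preg_match('/(\w+)(\[[^][]*(?:(?-1)[^][]*)*+])/', 'arr[a[b]] = 1', $m3);

print_r($m1[0]);
print_r($m2[0]);
print_r($m3);
```

Because the reference is relative, the bracket group can be pasted into other patterns without renumbering, which is not possible with (?R).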