
I just wrote a regex for use with the PHP function preg_match that contains the following part:

[\w-.]

It is meant to match any word character, as well as a minus sign and a dot. While it seems to work in preg_match, I tried to put it into a utility called Reggy and it complains about "Empty range in char class". Trial and error taught me that this issue was solved by escaping the minus sign, turning the regex into

[\w\-.]

Since the original appears to work in PHP, I am wondering why I should or should not be escaping the minus sign, and - since the dot is also a character with a meaning in PHP - why I would not need to escape the dot. Is the utility I am using just being silly, is it working with another regex dialect or is my regex really incorrect and am I just lucky that preg_match lets me get away with it?

 Answers

1

In many regex implementations, the following rules apply:

Meta characters inside a character class are:

  • ^ (negation)
  • - (range)
  • ] (end of the class)
  • \ (escape char)

So these should all be escaped. There are some corner cases though:

  • - needs no escaping if placed at the very start or end of the class ([abc-] or [-abc]). In quite a few regex implementations, it also needs no escaping when placed directly after a range ([a-c-abc]) or a shorthand character class ([\w-abc]). This is what you observed.
  • ^ needs no escaping when it's not at the start of the class: [^a] means any char except a, and [a^] matches either a or ^, which equals [\^a].
  • ] needs no escaping if it's the only character in the class: []] matches the char ]
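Since not every engine accepts a hyphen directly after a shorthand class (Reggy did not), a minimal sketch of two spellings that avoid the ambiguity may help. The sample strings and variable names are made up for illustration.

```php
<?php
// Two spellings of "word character, hyphen, or dot" with no range ambiguity.
$escaped = '/^[\w\-.]+$/'; // hyphen escaped
$atEnd   = '/^[\w.-]+$/';  // hyphen placed last, so it cannot start a range

$results = [];
foreach (['file-name.txt', 'a_b.c', 'bad name', 'x+y'] as $sample) {
    $results[$sample] = [preg_match($escaped, $sample), preg_match($atEnd, $sample)];
    printf("%-14s escaped=%d atEnd=%d\n", $sample, $results[$sample][0], $results[$sample][1]);
}

// A caret that is not first in the class is an ordinary character.
$caretIsLiteral = preg_match('/^[a^]+$/', 'a^^a');
```

Both patterns should accept the first two samples and reject the last two, which is a quick way to confirm that the two spellings are interchangeable.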
Tuesday, August 2, 2022
4
preg_match_all(
    '/(                    # Match and capture...
     (?:                   # either:
      \\\\.                # an escaped character
     |                     # or:
      [^\\\\:]             # any character except : or \
     )+                    # one or more times
    )                      # End of capturing group 1
    :                      # Match a colon
    ((?:\\\\.|[^\\\\;])+); # Same for 2nd part with semicolons
    /x', 
    $inside, $pairs);

does this. It doesn't remove the backslashes, though. You can't do that in a regex itself; for this, you'd need a callback function.

To match the final element even if it doesn't end with a delimiter change the ; to (?:;|$) (same for the :). And to return empty elements as well change the + to a *.
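As a rough sketch of the two suggestions above (the optional final delimiter and the callback that strips backslashes), something like the following could work. The input string, the $unescape helper and the $result array are hypothetical names chosen for this example.

```php
<?php
// Hypothetical sample input: key:value pairs where \ escapes a delimiter.
$inside = 'a\:b:c\;d;e:f';

// Same idea as the pattern above, with (?:;|$) so the last pair needs no trailing ;
$pattern = '/((?:\\\\.|[^\\\\:])+):((?:\\\\.|[^\\\\;])+)(?:;|$)/';
preg_match_all($pattern, $inside, $pairs, PREG_SET_ORDER);

// Callback that drops the escaping backslash and keeps the escaped character.
$unescape = function ($text) {
    return preg_replace_callback('/\\\\(.)/s', function ($m) {
        return $m[1];
    }, $text);
};

$result = [];
foreach ($pairs as $pair) {
    $result[$unescape($pair[1])] = $unescape($pair[2]);
}
print_r($result);
```

With this input the pairs come out as a:b => c;d and e => f. The backslashes are doubled twice because the pattern lives in a single-quoted PHP string.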

Monday, September 19, 2022
 
3

Which characters you must and which you mustn't escape indeed depends on the regex flavor you're working with.

For PCRE, and most other so-called Perl-compatible flavors, escape these outside character classes:

.^$*+?()[{\|

and these inside character classes:

^-]\
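In PHP you rarely have to apply these lists by hand, because preg_quote() adds the backslashes for you. A small sketch, with a made-up literal and subject strings:

```php
<?php
// Hypothetical literal text that happens to be full of metacharacters.
$literal = 'price: $5.00 (approx.)';

// preg_quote() adds the backslashes; the second argument also escapes the delimiter.
$pattern = '/' . preg_quote($literal, '/') . '/';

$hit  = preg_match($pattern, 'The price: $5.00 (approx.) today');
$miss = preg_match($pattern, 'price: $5x00 (approx.)'); // the dot is literal now

// The same function also covers literals placed inside a character class.
$class    = '/^[' . preg_quote('^-]\\', '/') . ']+$/';
$classHit = preg_match($class, ']-^\\');
```

Here $hit should be 1 and $miss 0, since the quoted dot no longer matches an arbitrary character.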

For POSIX extended regexes (ERE), escape these outside character classes (same as PCRE):

.^$*+?()[{\|

Escaping any other characters is an error with POSIX ERE.

Inside character classes, the backslash is a literal character in POSIX regular expressions. You cannot use it to escape anything. You have to use "clever placement" if you want to include character class metacharacters as literals. Put the ^ anywhere except at the start, the ] at the start, and the - at the start or the end of the character class to match these literally, e.g.:

[]^-]
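PHP no longer ships a POSIX regex engine, so the sketch below runs the same class through PCRE instead. That is an assumption of this example, not part of the POSIX rules: PCRE accepts the same "clever placement", although it also treats the backslash as an escape inside the class.

```php
<?php
// Assumption: evaluated by PCRE (preg_*), not by a POSIX ERE/BRE engine.
// ] comes first, ^ is not first, - comes last: all three are literals.
$class = '/^[]^-]+$/';

$onlyThose = preg_match($class, ']^-]'); // subject consists of ], ^ and - only
$other     = preg_match($class, 'a-]');  // 'a' is not in the class
```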

In POSIX basic regular expressions (BRE), these are metacharacters that you need to escape to suppress their meaning:

.^$*[\

Escaping parentheses and curly brackets in BREs gives them the special meaning their unescaped versions have in EREs. Some implementations (e.g. GNU) also give special meaning to other characters when escaped, such as ? and +. Escaping a character other than .^$*(){} is normally an error with BREs.

Inside character classes, BREs follow the same rule as EREs.

If all this makes your head spin, grab a copy of RegexBuddy. On the Create tab, click Insert Token, and then Literal. RegexBuddy will add escapes as needed.

Tuesday, November 1, 2022
 
4

PCRE and newlines

PCRE has a superfluity of newline related escape sequences and alternatives.

Well, a nifty escape sequence that you can use here is \R. By default, \R matches Unicode newline sequences, but it can be configured using different alternatives.

To match any Unicode newline sequence that is in the ASCII range:

preg_match('~\R~', $string);

This is equivalent to the following group:

(?>\r\n|\n|\r|\f|\x0b|\x85)

To match any Unicode newline sequence, including newline characters outside the ASCII range and both the line separator (U+2028) and paragraph separator (U+2029), turn on the u (unicode) flag.

preg_match('~\R~u', $string);

The u (unicode) modifier turns on additional PCRE functionality, and pattern strings are treated as UTF-8.

This is equivalent to the following group:

(?>\r\n|\n|\r|\f|\x0b|\x85|\x{2028}|\x{2029})

It is possible to restrict \R to match CR, LF, or CRLF only:

preg_match('~(*BSR_ANYCRLF)\R~', $string);

This is equivalent to the following group:

(?>\r\n|\n|\r)
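To see the three variants side by side, here is a small sketch using preg_split. The sample strings are made up, and (*BSR_UNICODE) is spelled out explicitly because the default meaning of \R can depend on how the PCRE library was built.

```php
<?php
// CRLF is consumed as a single break, so no empty elements appear.
$lines = preg_split('/\R/', "one\r\ntwo\nthree\rfour");

// A vertical tab (0x0B) counts as a newline for \R unless restricted.
$vt       = "a\x0Bb";
$unicode  = preg_split('/(*BSR_UNICODE)\R/', $vt);
$crlfOnly = preg_split('/(*BSR_ANYCRLF)\R/', $vt);

// U+2028 (UTF-8 bytes E2 80 A8) is only recognised with the u modifier.
$ls       = "a\xE2\x80\xA8b";
$withU    = preg_split('/(*BSR_UNICODE)\R/u', $ls);
$withoutU = preg_split('/(*BSR_UNICODE)\R/', $ls);
```

$lines should hold four elements, $unicode and $withU two each, while $crlfOnly and $withoutU leave their input unsplit.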

Additional

Five different conventions for indicating line breaks in strings are supported:

(*CR)        carriage return
(*LF)        linefeed
(*CRLF)      carriage return, followed by linefeed
(*ANYCRLF)   any of the three above
(*ANY)       all Unicode newline sequences
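These verbs go at the very start of the pattern and change what ^, $ and . regard as a line break. A minimal sketch, assuming a subject whose lines are separated by bare carriage returns:

```php
<?php
// Lines separated by bare carriage returns (hypothetical sample data).
$subject = "a\rb\rc";

// With /m, ^ and $ match at line breaks as defined by the newline convention.
$asLf      = preg_match('/(*LF)^b$/m', $subject);      // \r is not a line break here
$asCr      = preg_match('/(*CR)^b$/m', $subject);      // \r is the line break
$asAnyCrlf = preg_match('/(*ANYCRLF)^b$/m', $subject); // \r, \n and \r\n all count
```

Only the (*LF) variant should fail to find the middle line, because it does not treat \r as a line break.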

Note: \R does not have special meaning inside of a character class. Like other unrecognized escape sequences, it is treated as the literal character "R" by default.

Monday, October 3, 2022
 
4

You can look at the javadoc of the Pattern class: http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html

You need to escape any char listed there if you want the regular char and not the special meaning.

As a maybe simpler solution, you can put the template between \Q and \E - everything between them is considered as escaped.
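The answer above is about Java, but PCRE understands the same \Q...\E quoting, so the idea can be sketched with preg_match as well. The literal a.b*c is a made-up example; note that in PHP the pattern delimiter still must not appear unescaped between \Q and \E.

```php
<?php
// Everything between \Q and \E is taken literally.
$quoted   = '/^\Qa.b*c\E$/';
$unquoted = '/^a.b*c$/';

$literalHit  = preg_match($quoted, 'a.b*c');   // matches the text as written
$literalMiss = preg_match($quoted, 'aXbbc');   // . and * are no longer special
$regexHit    = preg_match($unquoted, 'aXbbc'); // without quoting they are
```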

Saturday, December 10, 2022
 