I'm currently writing a library for matching specific words in content.
Essentially the way it works is by compiling words into regular expressions, and running content through said regular expressions.
A feature I want to add is specifying whether a given word to match must start and/or end a word. For example, I have the word `cat`. I specify that it must start a word, so `catering` will match as `cat` is at the start, but `ducat` won't match as `cat` doesn't start the word.
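To make the intent concrete, here is a rough sketch of the kind of compilation step I have in mind. The function name `compileWord` and its two flags are hypothetical and only for illustration; they are not the library's actual API. It assumes `\b` is the right tool, which is exactly what the rest of this question calls into doubt.

```php
<?php
// Hypothetical helper: build a case-insensitive pattern for $word,
// optionally requiring a word boundary (\b) before and/or after it.
function compileWord(string $word, bool $mustStart = false, bool $mustEnd = false): string
{
    // Escape regex metacharacters and the "/" delimiter in the word itself.
    $pattern = preg_quote($word, '/');
    if ($mustStart) {
        $pattern = '\b' . $pattern;
    }
    if ($mustEnd) {
        $pattern .= '\b';
    }
    return '/' . $pattern . '/i';
}

$startsWord = compileWord('cat', true);
var_dump(preg_match($startsWord, 'catering')); // int(1)
var_dump(preg_match($startsWord, 'ducat'));    // int(0)
```

This behaves as intended for a plain word such as `cat`, where every character is a word character.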
I wanted to do this using word boundaries, but during some testing I found it doesn't work as I'd expect it to.
Take the following,
    preg_match("/(^|\b)@nimal/i", "something@nimal", $match);
    preg_match("/(^|\b)@nimal/i", "something!@nimal", $match);
In the statements above I would expect the following results,
> false
> 1 (@nimal)
But the result is instead the opposite,
> 1 (@nimal)
> false
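For anyone who wants to reproduce this, a small self-contained script along these lines shows the same behaviour. The helper name `tryMatch` is made up for this example.

```php
<?php
// Hypothetical helper: return the matched text, or false when nothing matches.
function tryMatch(string $pattern, string $subject)
{
    return preg_match($pattern, $subject, $match) === 1 ? $match[0] : false;
}

// Single quotes keep the backslash in \b literal.
$pattern = '/(^|\b)@nimal/i';

var_dump(tryMatch($pattern, 'something@nimal'));  // string(6) "@nimal"
var_dump(tryMatch($pattern, 'something!@nimal')); // bool(false)
```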
In the first, I would expect it to fail, as the group would eat the `@`, leaving `nimal` to match against `@nimal`, which obviously it doesn't. Instead, the group matches an empty string, so `@nimal` is matched, meaning `@` is considered to be part of the word.
In the second, I would expect the group to eat the `!`, leaving `@nimal` to match the rest (which it should). Instead, it appears to combine the `!` and `@` together to form a word, which is confirmed by the following matching,
    preg_match("/g\b!@\bn/i", "something!@nimal", $match);
Any ideas why the regular expression does this?
I'd just love a page that clearly documents how word boundaries are determined; I just can't find one for the life of me.
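The word boundary `\b` matches on a change from a `\w` (a word character) to a `\W` (a non-word character), or the other way round. It is zero-width, so it never consumes a character. You want to match if there is a `\b` before the `@`, which is a `\W` character. So to match, you need a word character before your `@`.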
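One way to see this for yourself is to list every position where `\b` matches in a string. The sketch below does that with `preg_match_all` and `PREG_OFFSET_CAPTURE`; the function name `wordBoundaryOffsets` is just an illustrative choice.

```php
<?php
// Return the byte offsets at which \b matches in $subject.
function wordBoundaryOffsets(string $subject): array
{
    preg_match_all('/\b/', $subject, $matches, PREG_OFFSET_CAPTURE);
    // Each entry is [matchedText, offset]; \b is zero-width, so only the offset matters.
    return array_column($matches[0], 1);
}

echo implode(', ', wordBoundaryOffsets('something@nimal')), "\n";  // 0, 9, 10, 15
echo implode(', ', wordBoundaryOffsets('something!@nimal')), "\n"; // 0, 9, 11, 16
```

In the first string there is a boundary at offset 9, directly before the `@`. In the second string the `@` sits at offset 10, and no boundary is reported there.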
`something@nimal` ==> Match, because of the word boundary between `g` and `@`.
`something!@nimal` ==> NO match, because between `!` and `@` there is no word boundary; both characters are `\W`.
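If what you actually mean by "starts a word" is "is not preceded by a word character", one option that might fit is a negative lookbehind instead of `\b`. That reading of your requirement is my assumption, so treat this as a sketch rather than a drop-in fix.

```php
<?php
// Assumption: "starts a word" means "not preceded by a word character".
// (?<!\w) is a zero-width negative lookbehind; it also succeeds at the start of the string.
$pattern = '/(?<!\w)@nimal/i';

var_dump(preg_match($pattern, 'something@nimal'));  // int(0)
var_dump(preg_match($pattern, 'something!@nimal')); // int(1)
var_dump(preg_match($pattern, '@nimal'));           // int(1)
```

Unlike `\b`, this does not depend on whether the first character of your own word is a word character, and it gives the results you originally expected. The mirror image for "ends a word" under the same assumption would be the negative lookahead `(?!\w)`.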