I'm currently writing a library for matching specific words in content.

Essentially the way it works is by compiling words into regular expressions, and running content through said regular expressions.

A feature I want to add is specifying whether a given word to match must start and/or end a word. For example, I have the word cat. I specify that it must start a word, so catering will match as cat is at the start, but ducat won't match as cat doesn't start the word.

I wanted to do this using word boundaries, but during some testing I found it doesn't work as I'd expect it to.

Take the following,

preg_match("/(^|\b)@nimal/i", "something@nimal", $match);
preg_match("/(^|\b)@nimal/i", "something!@nimal", $match);

In the statements above I would expect the following results,

> false
> 1 (@nimal)

But the result is instead the opposite,

> 1 (@nimal)
> false

In the first, I would expect it to fail as the group will eat the @, leaving nimal to match against @nimal, which obviously it doesn't. Instead, the group matches an empty string, so @nimal is matched, meaning @ is considered to be part of the word.

In the second, I would expect the group to eat the ! leaving @nimal to match the rest (which it should). Instead, it appears to combine the ! and @ together to form a word, which is confirmed by the following matching,

preg_match("/g\b!@\bn/i", "something!@nimal", $match);

Any ideas why the regular expression does this?

I'd just love a page that clearly documents how word boundaries are determined; I just can't find one for the life of me.



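The word boundary \b is a zero-width assertion: it consumes no characters and matches at a position where a \w (a word character: a letter, digit or underscore) sits on one side and a \W (a non-word character) on the other. It also matches at the start or end of the string when the adjacent character is a \w. You are asking for a \b directly before your @, which is a \W character, so the match only succeeds if there is a word character before the @.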


"something@nimal"
==> Match, because there is a word boundary between g (\w) and @ (\W).


"something!@nimal"
==> NO match, because there is no word boundary between ! and @; both characters are \W.
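
If the goal is "the matched word must not be glued to a preceding or following word character", one possible approach is to replace \b with negative lookarounds, which only look at the neighbouring character and do not care whether the word itself begins with a \w or a \W. The following is a minimal sketch under that assumption; buildWordPattern and its parameters are hypothetical names made up for illustration, not part of any existing library.

```php
<?php
// Hypothetical helper: wraps a literal word in lookarounds instead of \b.
function buildWordPattern(string $word, bool $mustStart, bool $mustEnd): string
{
    $prefix = $mustStart ? '(?<!\w)' : ''; // no word character directly before
    $suffix = $mustEnd ? '(?!\w)' : '';    // no word character directly after
    return '/' . $prefix . preg_quote($word, '/') . $suffix . '/i';
}

// \b needs a \w on exactly one side, so the result depends on what precedes the @.
$boundaryAfterLetter = preg_match('/\b@nimal/i', 'something@nimal');  // 1: g|@ is \w|\W
$boundaryAfterBang   = preg_match('/\b@nimal/i', 'something!@nimal'); // 0: !|@ is \W|\W

// The lookbehind only checks that the previous character is not a \w.
$pattern = buildWordPattern('@nimal', true, false);
$lookAfterLetter = preg_match($pattern, 'something@nimal');  // 0: preceded by g
$lookAfterBang   = preg_match($pattern, 'something!@nimal'); // 1: preceded by !
$lookAtStart     = preg_match($pattern, '@nimal');           // 1: nothing precedes it
```

With this sketch, buildWordPattern('cat', true, false) matches catering but not ducat, which is the behaviour described in the question. Note that preg_match returns 0, not false, when the pattern simply does not match; false is reserved for errors.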

