
I'm currently writing a library for matching specific words in content.

Essentially the way it works is by compiling words into regular expressions, and running content through said regular expressions.

A feature I want to add is specifying whether a given word to match must start and/or end a word. For example, I have the word cat. I specify that it must start a word, so catering will match as cat is at the start, but ducat won't match as cat doesn't start the word.

I wanted to do this using word boundaries, but during some testing I found it doesn't work as I'd expect it to.

Take the following,

preg_match("/(^|\b)@nimal/i", "something@nimal", $match);
preg_match("/(^|\b)@nimal/i", "something!@nimal", $match);

In the statements above I would expect the following results,

> false
> 1 (@nimal)

But the result is instead the opposite,

> 1 (@nimal)
> false

In the first, I would expect it to fail, as the group will eat the @, leaving nimal to match against @nimal, which it obviously doesn't. Instead, the group matches an empty string, so @nimal is matched, meaning @ is considered to be part of the word.

In the second, I would expect the group to eat the ! leaving @nimal to match the rest (which it should). Instead, it appears to combine the ! and @ together to form a word, which is confirmed by the following matching,

preg_match("/g\b!@\bn/i", "something!@nimal", $match);

Any ideas why the regular expression engine does this?

I'd love a page that clearly documents how word boundaries are determined; I can't find one for the life of me.



The word boundary \b matches on a change from a \w (a word character) to a \W (a non-word character), or the other way round. You want to match if there is a \b before your @, which is a \W character. So to get a match you need a word character directly before your @.


something@nimal

==> Match, because there is a word boundary between g and @.

something!@nimal

==> No match, because there is no word boundary between ! and @; both characters are \W.
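To make the rule concrete, here is a minimal PHP sketch that runs the pattern from the question against both subjects. The lookbehind variant at the end is only an assumption about what the asker wants (a term that must not be preceded by a word character); it is not part of the original answer.

```php
<?php
// Pattern from the question: start of string or a word boundary, then "@nimal".
$pattern = '/(^|\b)@nimal/i';

// "g" is a word character (\w) and "@" is not (\W), so \b matches between them.
$first = preg_match($pattern, 'something@nimal', $m1);

// "!" and "@" are both non-word characters, so there is no boundary between them.
$second = preg_match($pattern, 'something!@nimal', $m2);

// Assumed alternative: only require that no word character precedes the term.
$lookbehind = '/(?<!\w)@nimal/i';
$third  = preg_match($lookbehind, 'something@nimal');
$fourth = preg_match($lookbehind, 'something!@nimal');

var_dump($first, $second, $third, $fourth);
```

With the negative lookbehind the outcome flips to what the question expected: the term is rejected after "something" and accepted after "!". Note that preg_match returns 0 for "no match"; false is reserved for errors.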

Thursday, August 11, 2022

The Arabic regex is:


Actually, ?-? is a subset of this Arabic range, so I think you can remove them from the pattern.

So, in JS it will be


See regex demo

Tuesday, October 11, 2022

It's very hard to do this by analysing the regex (short of actually parsing the regex itself).

I suggest you rather use conservative settings for pcre.backtrack_limit and pcre.recursion_limit.
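As a rough sketch of that suggestion, the limits can be lowered at runtime with ini_set and the outcome checked with preg_last_error. The value 10000 and the sample pattern are assumptions chosen for illustration, and whether a given limit is actually hit depends on the PCRE build and JIT settings.

```php
<?php
// Assumed conservative values; tune them for your own workload.
ini_set('pcre.backtrack_limit', '10000');
ini_set('pcre.recursion_limit', '10000');

// A pattern that is prone to heavy backtracking on input that cannot match.
$subject = str_repeat('a', 30) . 'b';
$result = preg_match('/^(a+)+$/', $subject);

if ($result === false) {
    // preg_match() returns false on error; preg_last_error() says which one.
    $hitBacktrackLimit = preg_last_error() === PREG_BACKTRACK_LIMIT_ERROR;
    echo $hitBacktrackLimit ? "backtrack limit hit\n" : "other PCRE error\n";
} else {
    echo "finished within the limits, result: $result\n";
}
```

Treating false as "pattern too expensive" lets the library reject a pathological expression without having to analyse it up front.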

Tuesday, August 2, 2022

There is a function for reading CSV files: fgetcsv
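A minimal sketch of how fgetcsv is typically used, assuming a comma-separated file; the sample data and the temporary file are made up so the snippet runs on its own.

```php
<?php
// Hypothetical sample data written to a temporary file.
$path = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents($path, "name,age\nAda,36\nLinus,54\n");

$rows = [];
$handle = fopen($path, 'r');
if ($handle !== false) {
    // fgetcsv() parses one line per call and returns false at end of file.
    // Length 0 means "no line length limit"; the rest are the usual defaults.
    while (($row = fgetcsv($handle, 0, ',', '"', '\\')) !== false) {
        $rows[] = $row;
    }
    fclose($handle);
}
unlink($path);

print_r($rows);
```

Every field comes back as a string, so numeric columns need an explicit cast if they are used in arithmetic.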

Friday, December 16, 2022

For this PHP regex:

$str = preg_replace('{(.)\1+}', '$1', $str);
$str = preg_replace('{[ \'-_()]}', '', $str);

In Java:

str = str.replaceAll("(.)\\1+", "$1");
str = str.replaceAll("[ '-_\\(\\)]", "");

I suggest you provide your input and expected output; then you will get better answers on how it can be done in PHP and/or Java.

Sunday, October 9, 2022