Viewed   63 times

I'm currently writing a library for matching specific words in content.

Essentially the way it works is by compiling words into regular expressions, and running content through said regular expressions.

A feature I want to add is specifying whether a given word to match must start and/or end a word. For example, I have the word cat. I specify that it must start a word, so catering will match as cat is at the start, but ducat won't match as cat doesn't start the word.

I wanted to do this using word boundaries, but during some testing I found it doesn't work as I'd expect it to.

Take the following,

preg_match("/(^|b)@nimal/i", "something@nimal", $match);
preg_match("/(^|b)@nimal/i", "something!@nimal", $match);

In the statements above I would expect the following results,

> false
> 1 (@nimal)

But the result is instead the opposite,

> 1 (@nimal)
> false

In the first, I would expect it to fail as the group will eat the @, leaving nimal to match against @nimal, which obviously it doesn't. Instead, the group matchs an empty string, so @nimal is matched, meaning @ is considered to be part of the word.

In the second, I would expect the group to eat the ! leaving @nimal to match the rest (which it should). Instead, it appears to combine the ! and @ together to form a word, which is confirmed by the following matching,

preg_match("/gb!@bn/i", "something!@nimal", $match);

Any ideas why regular expression does this?

I'd just love a page that clearly documents how word boundaries are determined, I just can't find one for the life of me.

 Answers

1

The word boundary b matches on a change from a w (a word character) to a W a non word character. You want to match if there is a b before your @ which is a W character. So to match you need a word character before your @

something@nimal
        ^^

==> Match because of the word boundary between g and @.

something!@nimal
         ^^ 

==> NO match because between ! and @ there is no word boundary, both characters are W

Thursday, August 11, 2022
5

The Arabic regex is:

[u0600-u06FF]

Actually, ?-? is a subset of this Arabic range, so I think you can remove them from the pattern.

So, in JS it will be

/^[a-z0-9+,()/'su0600-u06FF-]+$/i

See regex demo

Tuesday, October 11, 2022
2

It's very hard to do this by analysing the regex (short of actually parsing the regex itself.

I suggest you rather use conservative settings for pcre.backtrack-limit and pcre.recursion_limit.

Tuesday, August 2, 2022
 
ryley
 
4

There is function for reading csv files: fgetcsv

Friday, December 16, 2022
 
3

For this PHP regex:

$str = preg_replace ( '{(.)1+}', '$1', $str );
$str = preg_replace ( '{[ '-_()]}', '', $str )

In Java:

str = str.replaceAll("(.)\1+", "$1");
str = str.replaceAll("[ '-_\(\)]", "");

I suggest you to provide your input and expected output then you will get better answers on how it can be done in PHP and/or Java.

Sunday, October 9, 2022
 
haodong
 
Only authorized users can answer the search term. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :