Viewed   52 times

Note: This is a question about possibilities of modern regex flavors. It's not about the best way to solve this using other methods. It's inspired by an earlier question, but that one is not restricted to regex.

The Problem

In an ASCII "image"/art/map/string like:

....X.......
..X..X...X....
X.X...X..X.....
X....XXXXXX.....
X..XXX...........
.....X..........
..............X
..X...........X....
..X...........X....X...
....X.....

I'd like to find a simple vertical line formation of three Xs:

X
X
X

The number of lines is variable in the image, and the width of each line is variable too.

The Question(s)

With regex (PCRE/PHP, Perl, .NET or similar) is it possible to:

  1. Determine if such a formation exists
  2. Count the number of such formations/match the starting point of them all (4 in the above example)

 Answers

5

Answer to question 1

To answer the first question one could use:

(?xm)                    # ignore comments and whitespace, ^ matches beginning of line
^                        # beginning of line
(?:
    .                    # any character except n
    (?=                  # lookahead
        .*+n            # go to next line
        ( 1?+ . )       # add a character to the 1st capturing group
        .*+n            # next line
        ( 2?+ . )       # add a character to the 2nd capturing group
    )
)*?                      # repeat as few times as needed
X .*+n                  # X on the first line and advance to next line
1?+                     # if 1st capturing group is defined, use it, consuming exactly the same number of characters as on the first line
X .*+n                  # X on the 2nd line and advance to next line
2?+                     # if 2st capturing group is defined, use it, consuming exactly the same number of characters as on the first line
X                        # X on the 3rd line

Online demo

This expression works in Perl, PCRE, Java and should work in .NET.

The expression uses lookaheads with self referencing capturing groups to add a character for every repetition of the lookahead (this is used to "count").

1?+ means if 1 matches (or is defined) consume it, and don't give it back (don't backtrack). In this case it's equivalent to (?(1) 1 ). Which means match 1 if 1 is defined.

polygenelubricants explains this kinds of lookaheads with backreferences very nicely in his answer for How can we match a^n b^n with Java regex?. (He has also written about other impressive tricks for Java regex involving backreferences and lookarounds.)

Answer to question 2

Plain matching

When just using matching and requiring the answer (count) in the number of matches, then the question 2 answer would be:

It can not be directly solved in regex flavors that have a limited lookbehind. While other flavors like Java and .NET could (as for example in m.buettner's .NET solution).

Thus plain regex matches in Perl and PCRE (PHP, etc) cannot directly answer this question in this case.

(Semi?)proof

Assume that no variable length lookbehinds are available.

You have to in some way count the number of characters on a line before an X.
Only way to do that is to match them, and since no variable length lookbehinds are available you have to start the match (at least) at the beginning of the line.
If you start the match at the beginning of a line you can only get at most one match per line.

Since there can be multiple occurrences per line, this would not count them all and would not give a correct answer.

Length/indirect solution

On the other hand if we accept the answer as the length of a match or substitution result, then the 2nd question can be answered in PCRE and Perl (and other flavors).

This solution is based on/inspired by m.buettner's nice "partial PCRE solution".

One could simply replace all matches of the following expression with $3, getting the answer to question two (the number of patterns of interests) as the length of the resulting string.

^
(?:
    (?:                   # match .+? characters
        .
        (?=               # counting the same number on the following two lines
            .*+n
            ( 1?+ . )
            .*+n
            ( 2?+ . )
        )
    )+?
    (?<= X )              # till the above consumes an X
    (?=                   # that matches the following conditions
        .*+n
        1?+
        (?<= X )
        .*+n
        2?+
        (?<= X )
    )
    (?=                   # count the number of matches
        .*+n
        ( 3?+ . )        # the number of matches = length of $3
    )
)*                        # repeat as long as there are matches on this line
.*n?                     # remove the rest of the line

Which in Perl could be written as:

$in =~ s/regex/$3/gmx;
$count = length $in;

Online demo

This expression is similar to the solution to question 1 above, with some modifications to include X in the characters matched in the first lookahead, wrapped with a quantifier and counting number of matches of the quantifier.

Except for direct matches this is as close as it gets (extra code wise besides regex), and could be an acceptable answer to question 2.

Test cases

Some test cases and results for the above solution. Result showing the numerical answer (length of the resulting string) and in parenthesis the resulting string after the substitution(s).

Test #0:
--------------------
X
X
X

result: 1 (X)


Test #1:
--------------------
..X....
..X....
..X....

result: 1 (.)


Test #2:
--------------------
..X.X..
..X.X..
....X..

result: 1 (.)


Test #3:
--------------------
..X....
..X....
...X...

result: 0 ()


Test #4:
--------------------
..X....
...X...
..X....

result: 0 ()


Test #5:
--------------------
....X..
.X..X..
.X.....

result: 0 ()


Test #6:
--------------------
.X..X..
.X.X...
.X.X...

result: 1 (.)


Test #7:
--------------------
.X..X..
.X..X..
.X..X..

result: 2 (.X)


Test #8:
--------------------
XXX
XXX
XXX

result: 3 (XXX)


Test #9:
--------------------
X.X.X
XXXXX
XXXXX
.X.X.

result: 5 (XXXXX)


Test #10:
--------------------
1....X.......
2..X..X...X....
3X.X...X..X.....
4X....XXXXXX.....
5X..XXX...........
6.....X..........
7.........X....X
8..X......X....X....
9..X......X....X....X...
A....X.....
B.X..X..
C.....
XXX
XXX
XXX
.

result: 8 (3458.XXX)
Tuesday, September 6, 2022
3

Neither basic nor extended Posix/GNU regex recognizes the non-greedy quantifier; you need a later regex. Fortunately, Perl regex for this context is pretty easy to get:

perl -pe 's|(http://.*?/).*|1|'
Tuesday, November 15, 2022
 
gsuoyb
 
4

There are a couple of important things to know about bash's [[ ]] construction. The first:

Word splitting and pathname expansion are not performed on the words between the [[ and ]]; tilde expansion, parameter and variable expansion, arithmetic expansion, command substitution, process substitution, and quote removal are performed.

The second thing:

An additional binary operator, ‘=~’, is available,... the string to the right of the operator is considered an extended regular expression and matched accordingly... Any part of the pattern may be quoted to force it to be matched as a string.

Consequently, $v on either side of the =~ will be expanded to the value of that variable, but the result will not be word-split or pathname-expanded. In other words, it's perfectly safe to leave variable expansions unquoted on the left-hand side, but you need to know that variable expansions will happen on the right-hand side.

So if you write: [[ $x =~ [$0-9a-zA-Z] ]], the $0 inside the regex on the right will be expanded before the regex is interpreted, which will probably cause the regex to fail to compile (unless the expansion of $0 ends with a digit or punctuation symbol whose ascii value is less than a digit). If you quote the right-hand side like-so [[ $x =~ "[$0-9a-zA-Z]" ]], then the right-hand side will be treated as an ordinary string, not a regex (and $0 will still be expanded). What you really want in this case is [[ $x =~ [$0-9a-zA-Z] ]]

Similarly, the expression between the [[ and ]] is split into words before the regex is interpreted. So spaces in the regex need to be escaped or quoted. If you wanted to match letters, digits or spaces you could use: [[ $x =~ [0-9a-zA-Z ] ]]. Other characters similarly need to be escaped, like #, which would start a comment if not quoted. Of course, you can put the pattern into a variable:

pat="[0-9a-zA-Z ]"
if [[ $x =~ $pat ]]; then ...

For regexes which contain lots of characters which would need to be escaped or quoted to pass through bash's lexer, many people prefer this style. But beware: In this case, you cannot quote the variable expansion:

# This doesn't work:
if [[ $x =~ "$pat" ]]; then ...

Finally, I think what you are trying to do is to verify that the variable only contains valid characters. The easiest way to do this check is to make sure that it does not contain an invalid character. In other words, an expression like this:

valid='0-9a-zA-Z $%&#' # add almost whatever else you want to allow to the list
if [[ ! $x =~ [^$valid] ]]; then ...

! negates the test, turning it into a "does not match" operator, and a [^...] regex character class means "any character other than ...".

The combination of parameter expansion and regex operators can make bash regular expression syntax "almost readable", but there are still some gotchas. (Aren't there always?) One is that you could not put ] into $valid, even if $valid were quoted, except at the very beginning. (That's a Posix regex rule: if you want to include ] in a character class, it needs to go at the beginning. - can go at the beginning or the end, so if you need both ] and -, you need to start with ] and end with -, leading to the regex "I know what I'm doing" emoticon: [][-])

Wednesday, September 28, 2022
1

With PCRE and Perl (and probably Java) you could use:

^(?:.(?=.*?(?(1)(?=.1$))(.1?$)))*(.)

which would capture the middle character of odd length strings in the 2nd capturing group.

Explained:

^ # beginning of the string
(?: # loop
  . # match a single character
  (?=
    # non-greedy lookahead to towards the end of string
    .*?
    # if we already have captured the end of the string (skip the first iteration)
    (?(1)
      # make sure we do not go past the correct position
      (?= .1$ )
    )
    # capture the end of the string +1 character, adding to 1 every iteration
    ( .1?$ )
  )
)* # repeat
# the middle character follows, capture it
(.)
Monday, December 26, 2022
 
3

[ This answer is Perl-specific. The information within may not apply to PCRE or the engine used by the other languages tagged. ]

/w/aa (the actual equivalent of /[a-zA-Z0-9_]/) is usually faster, but not always. That said, the difference is so minimal (less than 1 nanosecond per check) that it shouldn't be a concern. To put it in to context, it takes far, far longer to call a sub or start the regex engine.

What follows covers this in detail.


First of all, w isn't the same as [a-zA-Z0-9_] by default. w matches every alphabetic, numeric, mark and connector punctuation Unicode Code Point. There are 119,821 of these![1] Determining which is the fastest of non-equivalent code makes no sense.

However, using w with /aa ensures that w only matches [a-zA-Z0-9_]. So that's what we're going to be using for our benchmarks. (Actually, we'll use both.)

(Note that each test performs 10 million checks, so a rate of 10.0/s actually means 10.0 million checks per second.)


ASCII-only positive match
               Rate [a-zA-Z0-9_]      (?u:w)     (?aa:w)
[a-zA-Z0-9_] 39.1/s           --         -26%         -36%
(?u:w)      52.9/s          35%           --         -13%
(?aa:w)     60.9/s          56%          15%           --

When finding a match in ASCII characters, ASCII-only w and Unicode w both beat the explicit class.

/w/aa is ( 1/39.1 - 1/60.9 ) / 10,000,000 = 0.000,000,000,916 s faster on my machine


ASCII-only negative match
               Rate      (?u:w)     (?aa:w) [a-zA-Z0-9_]
(?u:w)      27.2/s           --          -0%         -12%
(?aa:w)     27.2/s           0%           --         -12%
[a-zA-Z0-9_] 31.1/s          14%          14%           --

When failing to find a match in ASCII characters, the explicit class beats ASCII-only w.

/[a-zA-Z0-9_]/ is ( 1/27.2 - 1/31.1 ) / 10,000,000 = 0.000,000,000,461 s faster on my machine


Non-ASCII positive match
               Rate      (?u:w) [a-zA-Z0-9_]     (?aa:w)
(?u:w)      2.97/s           --        -100%        -100%
[a-zA-Z0-9_] 3349/s      112641%           --          -9%
(?aa:w)     3664/s      123268%           9%           --

Whoa. This tests appears to be running into some optimization. That said, running the test multiple times yields extremely consistent results. (Same goes for the other tests.)

When finding a match in non-ASCII characters, ASCII-only w beats the explicit class.

/w/aa is ( 1/3349 - 1/3664 ) / 10,000,000 = 0.000,000,000,002,57 s faster on my machine


Non-ASCII negative match
               Rate      (?u:w) [a-zA-Z0-9_]     (?aa:w)
(?u:w)      2.66/s           --          -9%         -71%
[a-zA-Z0-9_] 2.91/s          10%           --         -68%
(?aa:w)     9.09/s         242%         212%           --

When failing to find a match in non-ASCII characters, ASCII-only w beats the explicit class.

/[a-zA-Z0-9_]/ is ( 1/2.91 - 1/9.09 ) / 10,000,000 = 0.000,000,002,34 s faster on my machine


Conclusions

  • I'm surprised there's any difference between /w/aa and /[a-zA-Z0-9_]/.
  • In some situation, /w/aa is faster; in others, /[a-zA-Z0-9_]/.
  • The difference between /w/aa and /[a-zA-Z0-9_]/ is very minimal (less than 1 nanosecond).
  • The difference is so minimal that you shouldn't be concerned about it.
  • Even the difference between /w/aa and /w/u is quite small despite the latter matching 4 orders of magnitude more characters than the former.

use strict;
use warnings;
use feature qw( say );

use Benchmarks qw( cmpthese );

my %pos_tests = (
   '(?u:\w)'     => '/^\w*\z/u',
   '(?aa:\w)'    => '/^\w*\z/aa',
   '[a-zA-Z0-9_]' => '/^[a-zA-Z0-9_]*\z/',
);

my %neg_tests = (
   '(?u:\w)'     => '/\w/u',
   '(?aa:\w)'    => '/\w/aa',
   '[a-zA-Z0-9_]' => '/[a-zA-Z0-9_]/',
);

$_ = sprintf( 'use strict; use warnings; our $s; for (1..1000) { $s =~ %s }', $_)
   for
      values(%pos_tests),
      values(%neg_tests);

local our $s;

say "ASCII-only positive match";
$s = "J" x 10_000;
cmpthese(-3, %pos_tests);

say "";

say "ASCII-only negative match";
$s = "!" x 10_000;
cmpthese(-3, %neg_tests);

say "";

say "Non-ASCII positive match";
$s = "N{U+0100}" x 10_000;
cmpthese(-3, %pos_tests);

say "";

say "Non-ASCII negative match";
$s = "N{U+2660}" x 10_000;
cmpthese(-3, %neg_tests);

  1. Unicode version 11.
Tuesday, November 22, 2022
Only authorized users can answer the search term. Please sign in first, or register a free account.
Not the answer you're looking for? Browse other questions tagged :