Locating Benjamin Button

Given ...

Born,Died
1852,1891
1862,1862
1902,1785

... is there a syntax in Perl-compatible regular expressions (PCRE) that will match the fourth line, where the first value is greater than the second value?

My guess is something combining ...

(\d+),(\d+)

... and ...

(??{$1>$2})

... but maybe it is not possible, because regex matching is lexical and the comparison is arithmetic.

Edit: This is constrained to pcre-regex because the environment accepts pcre patterns but forbids perl programs.

ThisSuitIsBlackNot
Thomas L Holaday
  • Just to clarify, are you asking for a Perl regex or a PCRE regex? [PCRE doesn't support the `??{code}` syntax](http://www.pcre.org/current/doc/html/pcre2compat.html). – ThisSuitIsBlackNot Apr 22 '16 at 17:46
  • Also, this is trivial to do in Perl without a regex (and easier to read): `while (<>) { chomp; my ($born, $died) = split /,/; print if $died < $born; }` Is there some reason you have to use a regex? – ThisSuitIsBlackNot Apr 22 '16 at 18:34
  • @ThisSuitIsBlackNot, I am personally curious about the syntax for ??{code} , but the problem at hand requires PCRE. Based on the link you provided, the problem at hand is hopeless; unless the pcre-callout feature might be a bridge. – Thomas L Holaday Apr 22 '16 at 19:21
  • Not an answer at all, but cool regex magic: `perl -le '$x = "123,456"; print $x =~ /(\d),([\1-9])/; print $1; print $2;'` - as long as you're dealing with 1-digit numbers, a regex could do your job. – Sebastian Apr 22 '16 at 19:30
  • What about [this regex](https://regex101.com/r/qG7qD6/1)? – Wiktor Stribiżew Apr 22 '16 at 21:35
  • @WiktorStribiżew Thanks for this. I now understand that arithmetic less-than `is` lexical. I am attempting to locate a version of pcregrep which understands named capture groups. – Thomas L Holaday Apr 23 '16 at 02:07

2 Answers

Summary

This regex assumes that your source numbers are four-digit strings. It finds lines where the first comma-delimited number is numerically larger than the second.
As written, the regex requires the "x" flag, which makes the engine ignore whitespace and line breaks in the pattern.

Regex

^(?=\d{4},\d{4}(?:\D|\Z))(?:(?:
[9]\d*,[012345678]\d*|
[89]\d*,[01234567]\d*|
[789]\d*,[0123456]\d*|
[6789]\d*,[012345]\d*|
[56789]\d*,[01234]\d*|
[456789]\d*,[0123]\d*|
[3456789]\d*,[012]\d*|
[23456789]\d*,[01]\d*|
[123456789]\d*,[0]\d*
)|
(?<a>\d{1})(?:
[9]\d*,\k<a>[012345678]\d*|
[89]\d*,\k<a>[01234567]\d*|
[789]\d*,\k<a>[0123456]\d*|
[6789]\d*,\k<a>[012345]\d*|
[56789]\d*,\k<a>[01234]\d*|
[456789]\d*,\k<a>[0123]\d*|
[3456789]\d*,\k<a>[012]\d*|
[23456789]\d*,\k<a>[01]\d*|
[123456789]\d*,\k<a>[0]\d*
)|
(?<b>\d{2})(?:
[9]\d*,\k<b>[012345678]\d*|
[89]\d*,\k<b>[01234567]\d*|
[789]\d*,\k<b>[0123456]\d*|
[6789]\d*,\k<b>[012345]\d*|
[56789]\d*,\k<b>[01234]\d*|
[456789]\d*,\k<b>[0123]\d*|
[3456789]\d*,\k<b>[012]\d*|
[23456789]\d*,\k<b>[01]\d*|
[123456789]\d*,\k<b>[0]\d*
)|
(?<c>\d{3})(?:
[9]\d*,\k<c>[012345678]\d*|
[89]\d*,\k<c>[01234567]\d*|
[789]\d*,\k<c>[0123456]\d*|
[6789]\d*,\k<c>[012345]\d*|
[56789]\d*,\k<c>[01234]\d*|
[456789]\d*,\k<c>[0123]\d*|
[3456789]\d*,\k<c>[012]\d*|
[23456789]\d*,\k<c>[01]\d*|
[123456789]\d*,\k<c>[0]\d*
))

Example

http://www.rubular.com/r/XjBNBQIzGP

Sample text

Born,Died
1852,1891
1862,1862
1902,1785
1111,1111
1111,1110
2222,2202
3333,3033
4444,0444
123,456
1234,567
123,4567
456,123
4567,123
456,1234
4567,1234

Sample capture

[0][0] = 1902,1785
[0][a] = 1
[0][b] = 
[0][c] = 
[1][0] = 1111,1110
[1][a] = 
[1][b] = 
[1][c] = 111
[2][0] = 2222,2202
[2][a] = 
[2][b] = 22
[2][c] = 
[3][0] = 3333,3033
[3][a] = 3
[3][b] = 
[3][c] = 
[4][0] = 4444,0444
[4][a] = 
[4][b] = 
[4][c] = 
[5][0] = 4567,1234
[5][a] = 
[5][b] = 
[5][c] = 

Short Explanation

The regex begins with a lookahead that validates that the line really does contain two four-digit numbers.

The four blocks each test one digit position to check whether the digit in the first number is larger than the corresponding digit in the second. The second, third, and fourth blocks contain a named backreference (a, b, and c respectively). This backreference ensures that the leading digits of the two numbers are identical.
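The rule the four blocks encode (for equal-length digit strings, the first differing digit decides which number is larger) can be sketched outside of regex. This is only an illustration in Python; the function name `first_is_greater` is hypothetical and does not appear in the answer.

```python
def first_is_greater(a: str, b: str) -> bool:
    """Compare two equal-length digit strings by their first differing digit.

    Assumes both arguments contain only ASCII digits; this mirrors the
    regex's requirement that both numbers have the same width.
    """
    if len(a) != len(b):
        raise ValueError("expected two digit strings of equal length")
    for x, y in zip(a, b):
        if x != y:
            # The first position where the digits differ decides the result.
            return x > y
    return False  # identical strings: the first is not greater


if __name__ == "__main__":
    for born, died in [("1852", "1891"), ("1862", "1862"), ("1902", "1785")]:
        print(born, died, first_is_greater(born, died))
```

Each block of the regex corresponds to one iteration of the loop: the backreference plays the role of the digits that have already compared equal.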

More Detailed Explanation

NODE                     EXPLANATION
----------------------------------------------------------------------
  ^                        the beginning of a "line"
----------------------------------------------------------------------
  (?=                      look ahead to see if there is:
----------------------------------------------------------------------
    \d{4}                    digits (0-9) (4 times)
----------------------------------------------------------------------
    ,                        ','
----------------------------------------------------------------------
    \d{4}                    digits (0-9) (4 times)
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      \D                       non-digits (all but 0-9)
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      \Z                       before an optional \n, and the end of
                               the string
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
  )                        end of look-ahead
----------------------------------------------------------------------
  (?:                      group, but do not capture:
----------------------------------------------------------------------
    (?:                      group, but do not capture:
----------------------------------------------------------------------
      [9]                      any character of: '9'
----------------------------------------------------------------------
      \d*                      digits (0-9) (0 or more times
                               (matching the most amount possible))
----------------------------------------------------------------------
      ,                        ','
----------------------------------------------------------------------
      [012345678]              any character of: '0', '1', '2', '3',
                               '4', '5', '6', '7', '8'
----------------------------------------------------------------------
      \d*                      digits (0-9) (0 or more times
                               (matching the most amount possible))
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      [89]                     any character of: '8', '9'
----------------------------------------------------------------------
      \d*                      digits (0-9) (0 or more times
                               (matching the most amount possible))
----------------------------------------------------------------------
      ,                        ','
----------------------------------------------------------------------
      [01234567]               any character of: '0', '1', '2', '3',
                               '4', '5', '6', '7'
----------------------------------------------------------------------
      \d*                      digits (0-9) (0 or more times
                               (matching the most amount possible))
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      [789]                    any character of: '7', '8', '9'
----------------------------------------------------------------------
      \d*                      digits (0-9) (0 or more times
                               (matching the most amount possible))
----------------------------------------------------------------------
      ,                        ','
----------------------------------------------------------------------
      [0123456]                any character of: '0', '1', '2', '3',
                               '4', '5', '6'
----------------------------------------------------------------------
      \d*                      digits (0-9) (0 or more times
                               (matching the most amount possible))
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      [6789]                   any character of: '6', '7', '8', '9'
----------------------------------------------------------------------
      \d*                      digits (0-9) (0 or more times
                               (matching the most amount possible))
----------------------------------------------------------------------
      ,                        ','
----------------------------------------------------------------------
      [012345]                 any character of: '0', '1', '2', '3',
                               '4', '5'
----------------------------------------------------------------------
      \d*                      digits (0-9) (0 or more times
                               (matching the most amount possible))
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      [56789]                  any character of: '5', '6', '7', '8',
                               '9'
----------------------------------------------------------------------
      \d*                      digits (0-9) (0 or more times
                               (matching the most amount possible))
----------------------------------------------------------------------
      ,                        ','
----------------------------------------------------------------------
      [01234]                  any character of: '0', '1', '2', '3',
                               '4'
----------------------------------------------------------------------
      \d*                      digits (0-9) (0 or more times
                               (matching the most amount possible))
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      [456789]                 any character of: '4', '5', '6', '7',
                               '8', '9'
----------------------------------------------------------------------
      \d*                      digits (0-9) (0 or more times
                               (matching the most amount possible))
----------------------------------------------------------------------
      ,                        ','
----------------------------------------------------------------------
      [0123]                   any character of: '0', '1', '2', '3'
----------------------------------------------------------------------
      \d*                      digits (0-9) (0 or more times
                               (matching the most amount possible))
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      [3456789]                any character of: '3', '4', '5', '6',
                               '7', '8', '9'
----------------------------------------------------------------------
      \d*                      digits (0-9) (0 or more times
                               (matching the most amount possible))
----------------------------------------------------------------------
      ,                        ','
----------------------------------------------------------------------
      [012]                    any character of: '0', '1', '2'
----------------------------------------------------------------------
      \d*                      digits (0-9) (0 or more times
                               (matching the most amount possible))
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      [23456789]               any character of: '2', '3', '4', '5',
                               '6', '7', '8', '9'
----------------------------------------------------------------------
      \d*                      digits (0-9) (0 or more times
                               (matching the most amount possible))
----------------------------------------------------------------------
      ,                        ','
----------------------------------------------------------------------
      [01]                     any character of: '0', '1'
----------------------------------------------------------------------
      \d*                      digits (0-9) (0 or more times
                               (matching the most amount possible))
----------------------------------------------------------------------
     |                        OR
----------------------------------------------------------------------
      [123456789]              any character of: '1', '2', '3', '4',
                               '5', '6', '7', '8', '9'
----------------------------------------------------------------------
      \d*                      digits (0-9) (0 or more times
                               (matching the most amount possible))
----------------------------------------------------------------------
      ,                        ','
----------------------------------------------------------------------
      [0]                      any character of: '0'
----------------------------------------------------------------------
      \d*                      digits (0-9) (0 or more times
                               (matching the most amount possible))
----------------------------------------------------------------------
    )                        end of grouping
----------------------------------------------------------------------
   |                        OR
----------------------------------------------------------------------
  )                        end of grouping
----------------------------------------------------------------------
)                        end of grouping
Ro Yo Mi

A working pattern using the Perl "delayed execution assertion", (??{code}), is:

^(\d{4}),(\d{4})(??{ $1 > $2 ? "" : "(?!)"})$

The delayed execution assertion inserts the value returned by the code into the regular expression. The effect in this case is for the pattern to become ^(\d{4}),(\d{4})$ if the first number is greater than the second (so it matches), and ^(\d{4}),(\d{4})(?!)$ otherwise. (?!) is a negative lookahead assertion that never matches, because the empty pattern it contains always matches.

Another option in Perl is to use the "conditional expression", (?(condition)yes-pattern), and the "code evaluation assertion", (?{code}):

^(\d{4}),(\d{4})(?(?{ $1 <= $2 })(?!))$

This has the effect of adding the never-matching (?!) to the pattern if the first number is less than or equal to the second number. My testing shows this to be considerably faster than the first pattern above.

See the perlre man page for detailed information about all the Perl regular expression features.

See the Regex Arcana article, by Jeff Pinyan, for an excellent tutorial on the (??{code}) and (?{code}) patterns.

However, the patterns above don't work with the PCRE library. The comment by @Sebastian suggests a possible solution (WHICH DOES NOT WORK):

^(\d*)(\d)\d*,\1[^\D\2-9]\d*$

This attempts to find number pairs where the numbers share a prefix and the first differing character in the second number is not a non-digit (i.e. it is a digit) and is not equal to or greater than the corresponding digit in the first number (i.e. it is less than that digit). Unfortunately, it doesn't work. The reason is explained in "General approach for (equivalent of) “backreferences within character class”?". Basically, backreferences don't work in character classes.

The idea can be made to work by using the delayed execution assertion (^(\d*)(\d)\d*,\1(??{"[^\D${2}-9]"})\d*$), but that's still no good for PCRE.
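The failure of the backreference-in-class pattern can be reproduced in other engines too. The sketch below uses Python's re module, which is an assumption made for illustration (the answer itself is about Perl and PCRE). In Python, \2 inside brackets is parsed as an octal character escape, not as a reference to group 2, so the class excludes every ASCII digit. The variable names `broken` and `expanded` are hypothetical.

```python
import re

# The pattern from the answer: inside [...], \2 is NOT a backreference.
broken = re.compile(r"^(\d*)(\d)\d*,\1[^\D\2-9]\d*$")

# What the pattern was meant to do for the line "1902,1785", written out
# by hand: prefix "1", differing digit "9", so the class becomes [^\D9-9].
expanded = re.compile(r"^(1)(9)\d*,\1[^\D9-9]\d*$")

line = "1902,1785"
print(broken.search(line) is not None)
print(expanded.search(line) is not None)
```

The hand-expanded version only covers one prefix and one digit, which is why a general solution has to enumerate the digits explicitly.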

One PCRE-compatible option is to use a brute-force version of the check-the-first-differing-digit idea. Look for a 1 followed by a 0, or a 2 followed by 0 or 1, or a 3 followed by 0 or 1 or 2, ... . This snippet of Bash code produces the regular expression:

regex='^(?=\d{4},\d{4}$)'   # Match only lines of the form 'dddd,dddd'
regex+='(\d*)'              # Prefix of both numbers
regex+='(1\d*,\1[0]'        # 1 (followed by digits+','+prefix) followed by 0
for (( i=2 ; i<=9 ; i++ )) ; do
    regex+="|$i\d*,\1[0-$((i-1))]"  # or $i (...) followed by a lesser digit
done
regex+=')\d*$'
printf '%s\n' "$regex"

It also adds the same positive look-ahead assertion at the start of the regular expression that @Denomales used in the first posted answer. The resulting regular expression is:

^(?=\d{4},\d{4}$)(\d*)(1\d*,\1[0]|2\d*,\1[0-1]|3\d*,\1[0-2]|4\d*,\1[0-3]|5\d*,\1[0-4]|6\d*,\1[0-5]|7\d*,\1[0-6]|8\d*,\1[0-7]|9\d*,\1[0-8])\d*$
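As a rough cross-check, the same pattern can be rebuilt and run against the sample text from the first answer. The sketch below uses Python's re module rather than a PCRE tool, which is an assumption; it relies only on lookahead, alternation, and a numbered backreference, which both engines support. The names `build_pattern`, `SAMPLE`, and `matches` are hypothetical.

```python
import re


def build_pattern() -> str:
    """Rebuild the brute-force pattern that the Bash loop prints."""
    alternatives = ["1\\d*,\\1[0]"]  # 1 followed (after the prefix) by 0
    for i in range(2, 10):
        # digit i followed (after the prefix) by any lesser digit
        alternatives.append(f"{i}\\d*,\\1[0-{i - 1}]")
    return r"^(?=\d{4},\d{4}$)(\d*)(" + "|".join(alternatives) + r")\d*$"


SAMPLE = [
    "Born,Died", "1852,1891", "1862,1862", "1902,1785",
    "1111,1111", "1111,1110", "2222,2202", "3333,3033",
    "4444,0444", "123,456", "1234,567", "123,4567",
    "456,123", "4567,123", "456,1234", "4567,1234",
]

pattern = re.compile(build_pattern())
matches = [line for line in SAMPLE if pattern.search(line)]

if __name__ == "__main__":
    print(build_pattern())
    for line in matches:
        print(line)
```

Lines whose numbers are not both four digits long are rejected by the lookahead before any digit comparison happens.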

As @ThisSuitIsBlackNot pointed out in a comment, regular expressions are not the best way to do this. Also see "What is meant by “Now you have two problems”?".

Community
pjh