I am looking for a regex that selects only the strings in which the part before the underscore is neither made up entirely of zeroes nor made up entirely of letters.

For example:

ABC_DE-001 is invalid
abc is invalid (only alphabets)
0_DE-001 is invalid (1 zero before underscore)
000_DE-001 is invalid (sequence of 3 consecutive zeroes)
00_DE-001 is invalid (sequence of 2 consecutive zeroes)
01_DE-001 is valid (0 followed by some other number is valid)
10_DE-001 is valid (starts with 1)
100_DE-001 is valid (starts with 1)

One approach I tried was: (0[1-9]+|[1-9][0-9]+|0[0*$][1-9])_[A-Z0-9]+[-][0-9]{3}

I am not sure whether this misses any scenario. Also, how can the same thing be achieved using a negative or positive lookaround?
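
One way to look for missed scenarios is to run the attempted pattern against the listed inputs plus a few edge cases. The following is a minimal sketch in Python (the language is an assumption, since the question does not name one). The function name `accepts` and the four extra inputs at the end of the list are hypothetical and not part of the original examples.

```python
import re

# The attempted pattern from the question, copied unchanged.
ATTEMPT = re.compile(r"(0[1-9]+|[1-9][0-9]+|0[0*$][1-9])_[A-Z0-9]+[-][0-9]{3}")


def accepts(s: str) -> bool:
    """Return True when the whole string matches the attempted pattern."""
    return ATTEMPT.fullmatch(s) is not None


samples = [
    # Inputs listed in the question.
    "ABC_DE-001", "abc", "0_DE-001", "000_DE-001", "00_DE-001",
    "01_DE-001", "10_DE-001", "100_DE-001",
    # Hypothetical edge cases (assumptions, not from the question).
    "1_DE-001",    # single non-zero digit: rejected, [1-9][0-9]+ needs two digits
    "010_DE-001",  # zero after a non-zero digit: rejected, 0[1-9]+ forbids it
    "001_DE-001",  # two leading zeroes: accepted via 0[0*$][1-9]
    "0*1_DE-001",  # literal '*': accepted, [0*$] is a character class
]

for s in samples:
    print(f"{s!r}: {accepts(s)}")
```

Inside square brackets, `*` and `$` are literal characters, so `[0*$]` matches one of `0`, `*`, or `$` rather than "any number of zeroes up to the end". Whether inputs such as `1_DE-001` or `001_DE-001` should be valid is not stated in the question, so treat those rows as things to decide rather than as confirmed bugs.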

Dwarrior
  • Why lookarounds? I think `^0*[1-9][0-9]*_[A-Z]+-[0-9]{3}$` will do. See [this demo](https://regex101.com/r/BaI48B/1). BTW, is `A0B1C_DE-001` a valid string? – Wiktor Stribiżew Feb 05 '19 at 11:48
  • Looks like [my solution](https://regex101.com/r/BaI48B/2) still works for the updated test cases. – Wiktor Stribiżew Feb 05 '19 at 12:41
  • I wanted to add this in the beginning: please think of the "positive" rules for the pattern. What must there be at the start and later in the string? If you know what to match in the first place, it will be easier to formulate any exceptions later. – Wiktor Stribiżew Feb 05 '19 at 12:52
  • @WiktorStribiżew - Thank you for responding. I get your point. When I was writing the regex, I was unsure which scenarios might fail with it. A0B1C_DE-001 is an invalid string. What regex should be used if it were valid? Also, I wanted to understand the negative lookaround concept better, because I found it a little confusing and thought this was a good use case. Thank you again for responding. – Dwarrior Feb 05 '19 at 22:04
  • @Amessihel: Please check my above comment. Thank you. – Dwarrior Feb 05 '19 at 22:04
  • @WiktorStribiżew If the regex from your comment solves the OP's problem, you should post it and get the credit for it. There was no copy-paste intent on my side: the last part of the regex is the same as in the originally posted regex, and my reasoning from the example data was that the beginning of that regex, `(0[1-9]+|[1-9][0-9]+)`, could be written by making just the zero optional, matching a digit, and making the rest of the digits optional. – The fourth bird Feb 06 '19 at 08:12

2 Answers

You can try a negative lookahead:

grep -Pi '^(?![a-z]+(?:_|$|\s)|0+(?:_|$|\s))' test.txt

Explanation:

  • -Pi - use PCRE (-P) and ignore case (-i). These options are grep-specific; adapt them to your tool. If your regex engine cannot ignore case, replace [a-z] with [a-zA-Z]. An engine with lookahead support, such as PCRE, is required.
  • ^ - beginning of the line
  • (?!rgx) - negative lookahead: checks, without moving the cursor, that the text ahead does not match the enclosed regular expression rgx.
  • [a-z]+(?:_|$|\s)|0+(?:_|$|\s) :
    • don't keep lines starting with consecutive letters ([a-z]+) followed by an underscore, the end of the line, or a whitespace character ((?:_|$|\s))
    • don't keep lines starting with consecutive zeroes (0+) followed by an underscore, the end of the line, or a whitespace character ((?:_|$|\s))
    • (?:) is a non-capturing group (its content is not stored; use it when you do not need the captured text, which can improve performance)

Output:

01_DE-001 is valid (0 followed by some other number is valid)
10_DE-001 is valid (starts with 1)
100_DE-001 is valid (starts with 1)

Since grep prints only matching lines by default, the lines that are not displayed were treated as invalid.
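
For readers without grep -P at hand, the same lookahead can be tried in Python. This is a minimal sketch under a few assumptions: `re.IGNORECASE` stands in for grep's -i, the function name `keep` is hypothetical, and the lines are hardcoded instead of being read from test.txt.

```python
import re

# Same negative lookahead as in the grep command above.
# re.IGNORECASE plays the role of grep's -i option.
REJECT_PREFIX = re.compile(r"^(?![a-z]+(?:_|$|\s)|0+(?:_|$|\s))", re.IGNORECASE)


def keep(line: str) -> bool:
    """Return True when the line does NOT start with an all-letter or all-zero prefix."""
    return REJECT_PREFIX.search(line) is not None


lines = [
    "ABC_DE-001 is invalid",
    "abc is invalid (only alphabets)",
    "0_DE-001 is invalid (1 zero before underscore)",
    "000_DE-001 is invalid (sequence of 3 consecutive zeroes)",
    "00_DE-001 is invalid (sequence of 2 consecutive zeroes)",
    "01_DE-001 is valid (0 followed by some other number is valid)",
    "10_DE-001 is valid (starts with 1)",
    "100_DE-001 is valid (starts with 1)",
]

# Print only the lines that survive the filter, as grep would.
for line in lines:
    if keep(line):
        print(line)
```

Note that the lookahead only rejects the two unwanted prefixes and does not validate the rest of the string. For instance, `keep("A0B1C_DE-001")` returns True, although the comments say that string is invalid, so a stricter check would need a positive pattern for the remainder as well.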

Amessihel

For your example data, you can match an optional zero at the start with ^0?, since one leading zero may occur but not more than one.

^0?[1-9][0-9]*_[A-Z]+-[0-9]{3}$

Regex demo

That will match

  • ^0? An optional zero at the start of the string
  • [1-9][0-9]* Match a digit 1-9 followed by 0+ digits
  • _[A-Z]+ Match an _ followed by 1+ times A-Z
  • -[0-9]{3} Match a - followed by 3 digits
  • $ Assert the end of the string
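
The breakdown above can be checked with a short Python sketch. The names `OPTIONAL_ZERO`, `ANY_ZEROES`, and `is_valid` are hypothetical. The second pattern is the `0*` variant from the comments under the question, included only to show where the two differ.

```python
import re

# Pattern from this answer: at most one leading zero.
OPTIONAL_ZERO = re.compile(r"^0?[1-9][0-9]*_[A-Z]+-[0-9]{3}$")
# Variant from the comments: any number of leading zeroes.
ANY_ZEROES = re.compile(r"^0*[1-9][0-9]*_[A-Z]+-[0-9]{3}$")


def is_valid(s: str, pattern: re.Pattern = OPTIONAL_ZERO) -> bool:
    """Return True when the string matches the given anchored pattern."""
    return pattern.search(s) is not None


samples = [
    "ABC_DE-001", "abc", "0_DE-001", "000_DE-001", "00_DE-001",
    "01_DE-001", "10_DE-001", "100_DE-001",
    "A0B1C_DE-001",  # stated as invalid in the comments
    "001_DE-001",    # hypothetical input: the two patterns disagree here
]

for s in samples:
    print(f"{s!r}: 0? -> {is_valid(s)}, 0* -> {is_valid(s, ANY_ZEROES)}")
```

Both patterns agree on every example from the question. They differ only on inputs such as `001_DE-001`, which `0?` rejects and `0*` accepts. The question does not say which behaviour is wanted.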
The fourth bird