Regular expressions: Ensuring b doesn't come between a and c

Question

Here's something I'm trying to do with regular expressions, and I can't figure out how. I have a big file, and strings abc, 123 and xyz that appear multiple times throughout the file.

I want a regular expression to match a substring of the big file that begins with abc, contains 123 somewhere in the middle, ends with xyz, and there are no other instances of abc or xyz in the substring besides the start and the end.

Is this possible with regular expressions?

[Since regular expressions are not fully standardized, all questions with this tag should also include a tag specifying the applicable programming language or tool.](http://stackoverflow.com/tags/regex/info) That said, is there any particular reason you want to use regular expressions here? It's possible, but in most environments, it's more complicated than not using regexes. — , May 15 '16 at 15:56
Should line breaks be considered or not? The big file will be read line by line or as one big string? — Jorge Campos, May 15 '16 at 15:59

Wiktor Stribiżew · Accepted Answer · 2020-04-23T19:33:40.843

37

When your left- and right-hand delimiters are single characters, this is easy to solve with negated character classes. So, if your match is between a and c and should not contain b (literally), you may use (demo):

a[^abc]*c

This is the same technique you use when you want to make sure there is a b in between the closest a and c (demo):

a[^abc]*b[^ac]*c
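As a quick check of both single-character patterns, here is a minimal Python sketch. The sample string is made up purely for illustration.

```python
import re

# Hypothetical sample text, chosen only to exercise the two patterns.
text = "a-c a-b-c a-a-c"

# a ... c with no a, b, or c in between
no_b = re.findall(r'a[^abc]*c', text)

# closest a ... c pair with at least one b in between
with_b = re.findall(r'a[^abc]*b[^ac]*c', text)

print(no_b)    # ['a-c', 'a-c']
print(with_b)  # ['a-b-c']
```

Note that in "a-a-c" only the second a starts a match, because the negated class refuses to step over another a.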

When your left- and right-hand delimiters are multi-character strings, you need a tempered greedy token:

abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz

See the regex demo

To make sure it matches across lines, use the re.DOTALL flag when compiling the regex.

Note that to get better performance from such a heavy pattern, you should consider unrolling it. This can be done with negated character classes and negative lookaheads.
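One possible unrolled form is sketched below. It is an illustration rather than the only way to unroll the pattern: the idea is to consume runs of characters that cannot start a forbidden string with a plain negated class, and to run the lookahead only at a, x, or 1. The sample string is the one from the Python demo in this answer.

```python
import re

# Tempered greedy token version (needs DOTALL so that . matches newlines).
tempered = re.compile(r'abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz', re.DOTALL)

# Unrolled sketch: [^ax1]* and [^ax]* skip "safe" characters in bulk;
# the lookahead is only evaluated at characters that could start abc, xyz, or 123.
# Negated classes match newlines, so DOTALL is not needed here.
unrolled = re.compile(
    r'abc[^ax1]*(?:(?!abc|xyz|123)[ax1][^ax1]*)*'
    r'123[^ax]*(?:(?!abc|xyz)[ax][^ax]*)*xyz'
)

s = "abc 123 xyz\nabc abc 123 xyz\nabc text 123 xyz\nabc text xyz xyz"
print(unrolled.findall(s))
print(unrolled.findall(s) == tempered.findall(s))  # True on this sample
```

Whether the unrolled form is actually faster depends on the engine and the input, so it is worth timing both on real data.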

Pattern details:

abc - match abc
(?:(?!abc|xyz|123).)* - match any character, zero or more times, that is not the starting point of an abc, xyz, or 123 character sequence
123 - a literal string 123
(?:(?!abc|xyz).)* - match any character, zero or more times, that is not the starting point of an abc or xyz character sequence
xyz - a trailing substring xyz
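The breakdown above can also be kept next to the pattern itself. The following sketch uses re.VERBOSE so that each part carries its own comment; the input string is hypothetical.

```python
import re

# Same pattern, spelled out with re.VERBOSE. re.DOTALL lets . match newlines too.
pattern = re.compile(r'''
    abc                       # opening delimiter
    (?:(?!abc|xyz|123).)*     # any chars, none of which starts abc, xyz, or 123
    123                       # the required middle string
    (?:(?!abc|xyz).)*         # any chars, none of which starts abc or xyz
    xyz                       # closing delimiter
''', re.VERBOSE | re.DOTALL)

# Hypothetical input: the first abc cannot reach 123 without crossing the
# second abc, so the match starts at the second abc.
m = pattern.search("abc junk abc 123 more xyz")
print(m.group(0))  # abc 123 more xyz
```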

In the diagram of this pattern (image not reproduced here), . means AnyChar when re.S is used.

See the Python demo:

import re

# Compile with re.DOTALL so that . also matches newlines.
p = re.compile(r'abc(?:(?!abc|xyz|123).)*123(?:(?!abc|xyz).)*xyz', re.DOTALL)
s = "abc 123 xyz\nabc abc 123 xyz\nabc text 123 xyz\nabc text xyz xyz"
print(p.findall(s))
# => ['abc 123 xyz', 'abc 123 xyz', 'abc text 123 xyz']

edited Apr 23 '20 at 19:33

answered May 15 '16 at 16:20

Wiktor Stribiżew

Can you please link the site from where you generated that state machine? I know a site exists with similar UI but can't find it though. Sorry for the irrelevant comment. I'll delete it soon :) – denvercoder9 Mar 18 '17 at 18:24
Why the `|123`? – Stefan Pochmann Jan 24 '18 at 20:22
@StefanPochmann Well, you may get rid of that if you use a lazy quantifier in the first case: `r'abc(?:(?!abc|xyz).)*?123(?:(?!abc|xyz).)*xyz'`. It will work similarly. – Wiktor Stribiżew Jan 24 '18 at 20:26
Both are just optimizations, though, right? Not necessary? – Stefan Pochmann Jan 24 '18 at 20:26
That is optimization. – Wiktor Stribiżew Jan 24 '18 at 20:27
@WiktorStribiżew Which software was used to draw the diagram? – Er. Amit Joshi Jun 03 '18 at 10:13
Hey @WiktorStribiżew, I think you meant "between a and c" in your first paragraph. I tried to edit it, but I didn't want to add noise just to meet the 6-character rule. – Mehmet Karadeniz Apr 23 '20 at 16:52

score 3 · Answer 2 · answered May 15 '16 at 16:15

Using PCRE, a solution would be the pattern below.

This uses the m flag. If you want to match only from the start to the end of a line, add ^ and $ at the beginning and end, respectively:

abc(?!.*(abc|xyz).*123).*123(?!.*(abc|xyz).*xyz).*xyz

Regular expression visualization

Debuggex Demo
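The answer targets PCRE, but the pattern uses only constructs that Python's re module also supports, so it can be tried there as a rough sketch. The first sample string is borrowed from the accepted answer's demo; the second is a hypothetical edge case.

```python
import re

# Pattern from this answer. Without re.DOTALL, . stops at newlines,
# so each match stays on a single line.
p = re.compile(r'abc(?!.*(abc|xyz).*123).*123(?!.*(abc|xyz).*xyz).*xyz')

s = "abc 123 xyz\nabc abc 123 xyz\nabc text 123 xyz\nabc text xyz xyz"

# The pattern contains capture groups, so findall would return group
# contents; finditer with group(0) yields the full matches instead.
matches = [m.group(0) for m in p.finditer(s)]
print(matches)   # ['abc 123 xyz', 'abc 123 xyz', 'abc text 123 xyz']

# Hypothetical edge case: two valid substrings on one line. The first
# lookahead sees "xyz ... 123" further along and rejects the first abc.
one_line = [m.group(0) for m in p.finditer("abc 123 xyz abc 123 xyz")]
print(one_line)  # ['abc 123 xyz']
```

On the second input only one of the two candidate substrings is reported, so this pattern seems best suited to lines that hold at most one abc ... xyz span.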

answered May 15 '16 at 16:15

Jorge Campos

Which software was used to draw the diagram? – Er. Amit Joshi Jun 03 '18 at 10:15
@Er.AmitJoshi It is not software; it is the [Debuggex site](https://www.debuggex.com). There is a link in the answer. – Jorge Campos Jun 04 '18 at 01:50

score 2 · Answer 3 · edited May 15 '16 at 16:11

The comment by hvd is quite appropriate, and this just provides an example. In SQL, for instance, I think it would be clearer to do:

where val like 'abc%123%xyz' and
      val not like 'abc%abc%' and
      val not like '%xyz%xyz'

I imagine something quite similar is simple to do in other environments.
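As one illustration of what "something quite similar" might look like outside SQL, here is a sketch in Python using plain string methods. The function name is_valid and the candidate values are assumptions made for the example. Like the SQL above, it validates a whole value rather than searching for substrings inside a larger file.

```python
def is_valid(val: str) -> bool:
    """Mirror the three LIKE conditions with plain string methods."""
    return (
        val.startswith('abc') and val.endswith('xyz')
        and '123' in val[3:-3]      # like 'abc%123%xyz'
        and 'abc' not in val[3:]    # not like 'abc%abc%'
        and 'xyz' not in val[:-3]   # not like '%xyz%xyz'
    )

# Hypothetical candidate values
for val in ["abc 123 xyz", "abc abc 123 xyz", "abc 123 xyz xyz", "abc text xyz"]:
    print(val, is_valid(val))
```

Only the first candidate passes; the others contain a second abc, a second xyz, or no 123.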

edited May 15 '16 at 16:11

Jonathan Leffler

answered May 15 '16 at 16:01

Gordon Linoff

Kenny Lau · Answer 4 · 2016-05-15T16:13:33.890

1

You could use lookaround.

/^abc(?!.*abc).*123.*(?<!xyz.*)xyz$/g

(I've not tested it.)
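One caveat when trying this in Python: the re module only accepts fixed-width lookbehind, so (?<!xyz.*) does not compile there, although some other engines allow it. The sketch below shows the error and an assumed lookahead-only rewrite for single-line input; the rewrite is an illustration, not the answerer's pattern.

```python
import re

# Python's re rejects variable-width lookbehind such as (?<!xyz.*).
try:
    re.compile(r'^abc(?!.*abc).*123.*(?<!xyz.*)xyz$')
    compiled = True
except re.error:
    compiled = False
print(compiled)  # False

# Assumed rewrite: (?!.*xyz.) rejects any xyz that is followed by more
# text, i.e. any xyz other than the one at the very end of the line.
p = re.compile(r'^abc(?!.*abc)(?!.*xyz.).*123.*xyz$')

for line in ["abc 123 xyz", "abc abc 123 xyz", "abc 123 xyz xyz"]:
    print(line, bool(p.search(line)))
```

Only the first line matches. Because of the ^ and $ anchors, this form checks a whole line rather than extracting substrings from a larger text.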

edited May 15 '16 at 16:13

answered May 15 '16 at 15:56

Kenny Lau
