
Say I had the line

"The quick brown fox jumps over the lazy dog"

and I wanted to grab everything between "brown" and "over", where the boundary words may also be substrings of other words. So I am trying to tell the RegEx something like

"grab everything in this line beginning at the string brown until you find the string over"

So I did

brown[^("over")]*

but the result is brown f, because "fox" contains an "o" which is contained in "over".

I couldn't find a solution to this, so I hope you can help.

jaySon
  • Until you find the first or last `over`? What about `"grab everything in this line beginning at the string brown until you brown find the string over"`? *brown until you brown find the string over* or *brown find the string over*. What about newlines? – Wiktor Stribiżew Nov 16 '15 at 13:14
  • @stribizhev, the first "over". – jaySon Nov 16 '15 at 13:15

1 Answer


Alright, matching anything between two substrings (where the trailing substring must be the left-most match, i.e. the one closest to the leading substring) is best achieved with the unrolling-the-loop technique, which involves negated character classes (sometimes combined with a look-ahead).

Here is one for your case:

\bbrown\b[^o]*(?:o(?!ver\b)[^o]*)*\bover\b

See the regex demo

Note that this expression is essentially equivalent to (?s)\bbrown\b.*?\bover\b, where .*? matches 0 or more characters of any kind, but as few as possible to return a valid match. However, the unrolled version involves much less backtracking since it is linear.
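To make the equivalence concrete, here is a minimal Python sketch that runs both patterns on the sentence from the question. The use of Python's `re` module, the variable names, and the second multi-line input are assumptions for illustration; `re.DOTALL` stands in for the inline `(?s)` flag.

```python
import re

text = "The quick brown fox jumps over the lazy dog"

# Unrolled pattern from this answer.
unrolled = re.compile(r"\bbrown\b[^o]*(?:o(?!ver\b)[^o]*)*\bover\b")
# Lazy-dot counterpart; re.DOTALL lets "." match newlines like (?s).
lazy = re.compile(r"\bbrown\b.*?\bover\b", re.DOTALL)

unrolled_match = unrolled.search(text).group(0)
lazy_match = lazy.search(text).group(0)
print(unrolled_match)  # brown fox jumps over
print(lazy_match)      # brown fox jumps over

# Hypothetical input with a newline and two occurrences of "over":
# both patterns stop at the first one.
multiline = "brown\nfox over and over"
unrolled_multi = unrolled.search(multiline).group(0)
lazy_multi = lazy.search(multiline).group(0)
print(repr(unrolled_multi))
print(repr(lazy_multi))
```

This only shows that the two patterns return the same match on these inputs; it does not measure the difference in backtracking.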

The unrolled lazy matching is turned into [^o]*(?:o(?!ver\b)[^o]*)* here. The negated character class [^o] matches any character but o, newlines included, so we do not have to worry about matching newlines.

The \b word boundaries help match whole words only. If you need no whole word matching, just remove all \b from the pattern.
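As a rough sketch of the variant without word boundaries, the Python snippet below strips every `\b` so that `brown` and `over` may sit inside other words or quotes. The sample sentence and the capture group around the middle part are assumptions added for illustration, not part of the original pattern.

```python
import re

# Same pattern with all \b removed. The capturing group is an addition
# so that the text between the two boundary strings is easy to extract.
substring_pattern = re.compile(r"brown([^o]*(?:o(?!ver)[^o]*)*)over")

# Hypothetical input: "brown" inside "brownish", "over" inside "hovers".
text = 'A "brownish" fox hovers nearby'

m = substring_pattern.search(text)
whole = m.group(0)    # from "brown" up to and including the first "over"
between = m.group(1)  # only the part between the two boundary strings
print(whole)    # brownish" fox hover
print(between)  # ish" fox h
```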

Here is my regex breakdown:

  • \bbrown\b - matches brown as a whole word
  • [^o]* - 0 or more characters other than o
  • (?:o(?!ver\b)[^o]*)* - 0 or more sequences of an o that is not followed by ver at the end of a word ((?!ver\b)), followed by 0 or more characters other than o ([^o]*)
  • \bover\b - matches a whole word over.
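
The breakdown above can also be written as an annotated pattern. The following is a sketch using Python's `re.VERBOSE` flag, which ignores whitespace outside character classes and allows `#` comments; the variable names and the second test sentence are assumptions for illustration.

```python
import re

annotated = re.compile(r"""
    \bbrown\b                # "brown" as a whole word
    [^o]*                    # 0 or more characters other than "o"
    (?: o(?!ver\b) [^o]* )*  # an "o" not followed by "ver" + word end, then more non-"o" characters
    \bover\b                 # the first whole word "over"
""", re.VERBOSE)

first = annotated.search("The quick brown fox jumps over the lazy dog").group(0)
print(first)  # brown fox jumps over

# Hypothetical input with two candidates: the match ends at the first "over".
second = annotated.search("brown cow over there, over here").group(0)
print(second)  # brown cow over
```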
Wiktor Stribiżew
  • Probably I should have mentioned that in my case `brown` and `over` may also be substrings of other words or be enclosed by quotes, so using word boundaries here would be kind of restrictive to the allowed patterns of the analyzed string. – jaySon Nov 16 '15 at 13:28
  • Yes, you can just remove them, and you will be able to match them as part of other words. This regex is sometimes 100 times faster than the one with `.*?` (depending on the input string length). Also, this technique is universal and portable to most other platforms. With `.*?`, you can run into various headaches, such as singleline mode being unavailable (in JS) or the backtracking buffer limit getting exhausted quickly (with very lengthy inputs). – Wiktor Stribiżew Nov 16 '15 at 13:30
  • How is the above pattern more efficient than `brown\b((?!over).*)\bover`? – hjpotter92 Nov 16 '15 at 13:36
  • @hjpotter92: [Your regex](https://regex101.com/r/tY7lI4/1) finds a match in 58 steps. [My regex](https://regex101.com/r/tY7lI4/2) does it in 23-25 steps. You can check the regex debugger and see how backtracking works in both cases. Dot matching is always less efficient than character class matching. Although that is for PCRE, the .NET regex engine will work similarly with these patterns. – Wiktor Stribiżew Nov 16 '15 at 13:41