
I want to allow any 0 to 2 characters between each group in the regex `(this is)?.??.??(an)?.??.??(example sentence)`. It should match the bolded text in the strings below:

blah blah. **An example sentence**
blah blah. **This is an example sentence**
Something something **Example sentence**

However, in the first example the match is `ah. example sentence`. I thought that adding a second question mark after `.?` (giving `.??`) would make the regex engine prefer to match 0 characters.

I'm using regex within VBA in MS Word via `CreateObject("vbscript.regexp")`. As I understand it, this uses the VBScript regex flavor, which in turn is the same as the JavaScript flavor.

Wiktor Stribiżew
Some_Guy
  • @wiktor-stribiżew why is this a duplicate? As far as I can see I'm using non greedy on purpose, but getting a greedy match. This isn't addressed in the linked question. – Some_Guy Jan 17 '17 at 11:54
  • You seem to misunderstand how greedy and lazy quantifiers work. The linked thread deals with that. Quantifiers do not affect the place where a match is found. A regex engine parses text from left to right. Once it can match a part of the text with a pattern, it will. – Wiktor Stribiżew Jan 17 '17 at 11:57
  • (this is).*?(an).*?(example sentence) – Lonnie Best Jan 17 '17 at 12:01
  • @LonnieBest Thank you, but I want to match a maximum of two characters between each subexpression, and match even if the subexpression is absent – Some_Guy Jan 17 '17 at 12:03
  • I think you need to modify the question to accentuate the real problem. It seems to me you just need to use the `.{0,2}` (or even `.{1,2}`) inside the optional groups, `(this is.{0,2})?(an.{0,2})?(example sentence)`, see [this demo](https://regex101.com/r/ONhZRD/2). – Wiktor Stribiżew Jan 17 '17 at 12:06
  • @WiktorStribiżew That solves the problem thanks. Why does placing it outside the groups matter though? When searching `0020002101` should `2.??.??.??101` not prefer `2101` to `20002101`? – Some_Guy Jan 17 '17 at 12:11
  • A regex engine cannot "prefer" anything. It matches from left to right. Once the `2` is found (the first `2`) it starts matching the subsequent subpatterns, and when a match is found, it is returned. – Wiktor Stribiżew Jan 17 '17 at 12:13
  • I do think [Non-greedy regex quantifier gives greedy result](http://stackoverflow.com/questions/16633315/non-greedy-regex-quantifier-gives-greedy-result) is a better dupe for this question. – Sebastian Proske Jan 17 '17 at 12:42
  • https://msdn.microsoft.com/en-us/library/az24scfc(v=vs.110).aspx was the reference I was using by the way. Very misleading: | ?? | Matches the previous element zero or one time, but as few times as possible. | – Some_Guy Jan 22 '17 at 15:34

1 Answer


> When searching `0020002101` should `2.??.??.??101` not prefer `2101` to `20002101`?

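A regex engine cannot "prefer" anything. It matches from left to right. Once the `2` is found (the first `2`), it starts matching the subsequent subpatterns, and when a match is found, it is returned. A lazy quantifier only changes the order in which the engine tries the alternatives from a given start position; it never moves the start position further to the right.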
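The left-to-right behavior can be checked with a short script. This is only a sketch: it uses Python's `re` module rather than the VBScript engine (the assumption being that both are backtracking engines that treat `.??` the same way), and the input `002002101` is a hypothetical variant of the string from the question, chosen so that the first `2` can reach `101` through at most three characters.

```python
import re

# Hypothetical input: a variant of the string from the question, chosen so
# that the first "2" is at most three characters away from "101".
text = "002002101"
pattern = re.compile(r"2.??.??.??101")

# search() scans start positions from left to right and returns the first
# position at which the whole pattern can succeed.
match = pattern.search(text)
print(match.group(0), match.start())

# Forcing the attempt to begin at the second "2" (index 5) shows that the
# shorter match exists; it is just never reached by a normal search.
shorter = pattern.match(text, 5)
print(shorter.group(0))
```

The lazy dots are expanded one at a time only because the attempt that begins at the first `2` would otherwise fail, so the longer match wins simply by starting earlier.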

In your case, you need to use `.{0,2}` inside the optional groups:

(this is.{0,2})?(an.{0,2})?(example sentence)
        ^^^^^^     ^^^^^^

See the regex demo.
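As a rough check, the pattern can be run against the three sample strings from the question. The sketch below uses Python's `re` module instead of VBScript and assumes case-insensitive matching (the counterpart of `IgnoreCase = True` on the VBScript object), since the samples start with capital letters.

```python
import re

# Original pattern from the question: optional dots outside the groups.
ORIGINAL = re.compile(
    r"(this is)?.??.??(an)?.??.??(example sentence)", re.IGNORECASE
)
# Suggested pattern: the 0-2 filler characters live inside the optional groups.
FIXED = re.compile(
    r"(this is.{0,2})?(an.{0,2})?(example sentence)", re.IGNORECASE
)

samples = [
    "blah blah. An example sentence",
    "blah blah. This is an example sentence",
    "Something something Example sentence",
]

fixed_matches = [FIXED.search(s).group(0) for s in samples]
for s, m in zip(samples, fixed_matches):
    print(repr(s), "->", repr(m))

# With the original pattern the dots can succeed on their own, so the match
# may begin before the wanted text.
original_first = ORIGINAL.search(samples[0]).group(0)
print(repr(original_first))
```

Because the filler characters can only be consumed after `this is` or `an` has matched, an attempt can no longer begin on arbitrary preceding characters.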

If the order of the optional strings is important, make them nested:

(this is.{0,2}(an.{0,2})?)?(example sentence)

See another regex demo. This regex matches `an` (plus 0 to 2 chars after it) only if `this is` (plus 0 to 2 chars) is found right before it.
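The difference between the flat and the nested variant shows up on a string that contains `an` without `this is`. The following sketch again uses Python's `re` module with case-insensitive matching as a stand-in for the VBScript engine; the sample strings are illustrative.

```python
import re

FLAT = re.compile(
    r"(this is.{0,2})?(an.{0,2})?(example sentence)", re.IGNORECASE
)
NESTED = re.compile(
    r"(this is.{0,2}(an.{0,2})?)?(example sentence)", re.IGNORECASE
)

with_both = "blah blah. This is an example sentence"
only_an = "blah blah. An example sentence"

# Both variants agree when "this is" precedes "an".
flat_both = FLAT.search(with_both).group(0)
nested_both = NESTED.search(with_both).group(0)

# Without "this is", the nested variant can no longer pick up "an".
flat_only_an = FLAT.search(only_an).group(0)
nested_only_an = NESTED.search(only_an).group(0)

print(repr(flat_both), repr(nested_both))
print(repr(flat_only_an), repr(nested_only_an))
```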

Wiktor Stribiżew
  • Thank you. This clarifies things a lot. – Some_Guy Jan 17 '17 at 12:31
  • Was experimenting a bit with this one and modifying the original regex to `(this is)??.??.??(an)??.??.??(example sentence)` I'd expect only **example sentence** to be matched but... [regex101](https://regex101.com/r/ONhZRD/5). The first two groups get matched any way. Any idea why? – SamWhan Jan 17 '17 at 12:47
  • A hint: switch to PCRE regex flavor at regex101.com and click *regex debugger*. You will see the internals of matching that is quite common to these NFA regex engines. Your matches are expected. If you need to avoid returning a match at all, you would have to use lookarounds/zero-width assertions that restrict a match context and do not return the matched texts with the match value. – Wiktor Stribiżew Jan 17 '17 at 12:51