I am working on some regex crosswords. I decided to take one of the expressions and apply it to some real life text (using Notepad++) to see exactly what happens. It's left me baffled!

The regex I am matching is:

(.)+\1

The text I applied it to is:

Business Parks - Research business parks in the Reading area with conference capabilities

Green Park - expensive and no advertising allowed except via their newsletter

Arlington Business Park - no facility Thames Valley Science Park (TVSP) -

Sleep

The matches I get are (notice how matches can be mid-word):

Business Parks - Research business
Green Park - expensive and no advertising allowed except via their newslett
Arlington Business
Thames Vall
Slee

I'd be very grateful if someone could walk me through what is going on here. I expected some sort of result where repeated characters get matched because of the '\1'. However, I am particularly stumped as to why 'Green' gets evaluated and the match still continues up to 'newslett'.

Profplum
  • You should put the regex in your question, not just the title. Anyway, `(.)+\1` matches every character in a string up to the last duplicate character. So `Millennium` would match `Millenn` – ctwheels May 08 '18 at 15:33
  • Check out this website: https://regexr.com/ It's got an `Explain` feature that should help you – Dan Crews May 08 '18 at 15:34
  • `.` means match any character and the `(...)` around the `.` mean it's a grouping referenced later (matched with `\1`). The `+` means one or more of the prior character. So `(.)+` means one or more of any character, and `(.)+\1` means one or more of a character followed by the last one identified. In short, it matches everything up to and including a repeated character. – lurker May 08 '18 at 15:34
  • [Regex101](https://regex101.com/r/aR4huJ/1/) does all the step by step explanation. See [the debugger](https://regex101.com/r/aR4huJ/1/debugger). – Wiktor Stribiżew May 08 '18 at 17:17

3 Answers

It appears that (.)+ matches one or more characters, as expected, but with each repetition the captured text is overwritten, so group 1 ends up holding only the last character it matched. The \1 then has to match that same character again, which is always the character immediately before it. The overall result is a match running from the beginning of the line to the last doubled character.
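As a rough illustration of the point about the capture being overwritten, here is a minimal sketch using Python's re module. Python is an assumption made purely for illustration (the question uses Notepad++), and the helper name last_capture is hypothetical. It returns the whole match together with whatever group 1 holds once the match has finished.

```python
import re

# The pattern from the question: a repeated capturing group followed by a backreference.
PATTERN = re.compile(r"(.)+\1")


def last_capture(text):
    """Return (whole match, final content of group 1), or None if nothing matches."""
    m = PATTERN.search(text)
    if m is None:
        return None
    # group(1) holds only the character captured by the last repetition of (.)
    return m.group(0), m.group(1)


if __name__ == "__main__":
    for sample in ["Sleep", "Millennium", "abc"]:
        print(sample, "->", last_capture(sample))
```

For "Sleep" the whole match is "Slee" while group 1 is just "e": the group does not accumulate "Sle", it only remembers its final repetition.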

Tanktalus
(.)+\1 matches everything from the beginning of the line up to and including the last pair of identical adjacent characters.

However, I am particularly stumped why 'Green' gets evaluated and still continues up to 'newslett'.

This is because + on its own is greedy: it consumes as much as it can while still allowing the rest of the pattern to match.

If you wanted the match to stop at 'Gree', you could use (.)+?\1 instead. The ? makes + lazy rather than greedy, so the line yields several short matches instead of one long one.
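To compare the two behaviours side by side, here is a small sketch in Python (assumed only for illustration; the helper all_matches is a hypothetical name). It runs the greedy and the lazy pattern over the 'Green Park' line from the question and collects every match.

```python
import re

LINE = "Green Park - expensive and no advertising allowed except via their newsletter"


def all_matches(pattern, text):
    """Collect the full text of every non-overlapping match of pattern in text."""
    # finditer + group(0) is used because findall would return only group 1
    return [m.group(0) for m in re.finditer(pattern, text)]


greedy = all_matches(r"(.)+\1", LINE)   # one long match ending at 'newslett'
lazy = all_matches(r"(.)+?\1", LINE)    # several shorter matches

if __name__ == "__main__":
    print("greedy:", greedy)
    print("lazy:  ", lazy)
```

The lazy version stops at the first doubled character it can reach ('Gree'), then the search resumes after that match and stops again at 'all', and finally at 'newslett'.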

Demo

Yassin Hajaj
  • Thanks to everyone for their input! Really helpful. I'll also be sure to make use of regex101.com. That looks great for understanding what exactly is happening. – Profplum May 08 '18 at 19:05
The other answers cover some of the details, but there is more going on behind the scenes. Your regex splits into two parts, (.)+ and \1. Both parts must succeed for the engine to report a match; if no combination lets both succeed, the match fails as a whole.

(.)+ consumes one character at a time, but because it is greedy it keeps going to the end of the line and only then backtracks. In other words, it doesn't pause to try \1 until it has consumed everything it can.

Once the end of the line is reached, (.)+ backtracks one character at a time, and at each step the next part of the pattern, \1, is tried.

The effect is like searching backwards from the end of the line, so the doubled character closest to the end of the line is the first one to satisfy the engine.
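The backtracking described above can be sketched as a hand-written loop. This is a simplified model in Python, not how any real engine is implemented: the function name greedy_backtrack is hypothetical, and it only models an attempt anchored at the start of the line (a real engine would go on to retry from later starting positions if that attempt failed).

```python
def greedy_backtrack(line):
    """Simplified model of one match attempt from the start of `line`.

    Returns (matched text or None, number of backtracking positions tried).
    """
    tries = 0
    # The greedy group first holds the whole line, then gives back one
    # character per step; `end` is how many characters it still holds.
    for end in range(len(line), 0, -1):
        tries += 1
        captured = line[end - 1]            # group 1 = last character still held
        if line[end:end + 1] == captured:   # the backreference needs the same character next
            return line[:end + 1], tries
    return None, tries


if __name__ == "__main__":
    for sample in ["Sleep", "Millennium", "abc"]:
        print(sample, "->", greedy_backtrack(sample))
```

For "Sleep" the loop succeeds on its third try: holding "Sleep" fails (nothing follows), holding "Slee" fails ("p" is not "e"), and holding "Sle" succeeds because the next character is another "e".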

revo