I'm using Python's re library to parse some strings, and I've found that either my regex is too complicated or the string I'm searching is too long.

Here's an example of the hang:

>>> import re
>>> reg = r"(\w+'?\s*)+[-|~]\s*((\d+\.?\d+\$?)|(\$?\d+\.?\d+))"
>>> re.search(reg, "**LOOKING FOR PAYPAL OFFERS ON THESE PAINTED UNCOMMONS**") #Hangs here...

I'm not sure what's going on. Any help appreciated!

EDIT: Here's a link with examples of what I'm trying to match: Regxr

Gawndy
  • What are you trying to search for? – Chuck Apr 09 '17 at 19:01
  • You have catastrophic backtracking. https://regex101.com/r/0vq6T7/1 – Keatinge Apr 09 '17 at 19:02
  • As usual, catastrophic backtracking due to one obligatory and optional patterns inside a quantified group `(\w+'?\s*)+`. What do you want to match with it? Try changing it to `(\w+(?:'\s*\w+)*)` – Wiktor Stribiżew Apr 09 '17 at 19:02
  • I updated my question with an example of what I'm trying to match. @Keatinge thank you. I'll look into catastrophic backtracking – Gawndy Apr 09 '17 at 19:21

1 Answer

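The code hangs because of catastrophic backtracking. The quantified group (\w+'?\s*)+ contains one obligatory pattern, \w+, followed by optional patterns that can match an empty string, '? and \s*. When the optional parts match nothing, the group behaves like (\w+)+, so the regex engine can split a run of word characters between iterations in exponentially many ways. On a string that cannot match, it tries all of those paths, and that takes too long to complete.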
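To make the growth visible, here is a minimal sketch that times the pattern from the question on short inputs that cannot match. The helper name time_search and the input lengths are only illustrative choices, and the actual timings depend on the machine; the point is that the time should climb steeply as the input grows by just a couple of characters.

```python
import re
import time

# The pattern from the question, written as a raw string.
SLOW = re.compile(r"(\w+'?\s*)+[-|~]\s*((\d+\.?\d+\$?)|(\$?\d+\.?\d+))")

def time_search(pattern, text):
    """Run pattern.search(text) once and return (match, elapsed_seconds)."""
    start = time.perf_counter()
    match = pattern.search(text)
    return match, time.perf_counter() - start

# The inputs contain no "-", "|" or "~", so every search has to fail
# after the engine has exhausted all ways of splitting the run of letters.
for n in (12, 14, 16, 18):
    match, seconds = time_search(SLOW, "A" * n)
    print(f"n={n:2d}  matched={match is not None}  {seconds:.4f}s")
```

The lengths are kept small on purpose: the string in the question is far longer, which is why the search there appears to hang.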

I suggest unwrapping the problematic group so that the ' or \s part becomes obligatory, and placing that part inside an optional, repeatable group:

(\w+(?:['\s]+\w+)*)\s*[-~]\s*(\$?\d+(?:\.\d+)?\$?)
^^^^^^^^^^^^^^^^^^^***

See the regex demo

Here, (\w+(?:['\s]+\w+)*) will match 1+ word chars, and then 0+ sequences of 1+ ' or whitespace chars followed by 1+ word chars. This way, a run of word chars can only be matched in one way, the exponential backtracking disappears, and the regex engine fails the match much quicker when a non-matching string occurs.

The rest of the pattern:

  • \s*[-~]\s* - either - or ~ wrapped with 0+ whitespaces
  • (\$?\d+(?:\.\d+)?\$?) - Group 2, capturing:
    • \$? - 1 or 0 $ symbols
    • \d+ - 1+ digits
    • (?:\.\d+)? - 1 or 0 sequences of:
      • \. - a dot
      • \d+ - 1+ digits
    • \$? - 1 or 0 $ symbols
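As a quick check of the breakdown above, the following sketch runs the suggested pattern with Python's re module. The function name parse_offer and the two priced sample strings are hypothetical, made up for illustration; only the first string comes from the question.

```python
import re

# The unwrapped pattern suggested in this answer.
PRICE_RE = re.compile(r"(\w+(?:['\s]+\w+)*)\s*[-~]\s*(\$?\d+(?:\.\d+)?\$?)")

def parse_offer(text):
    """Return (name, price) from the first match in text, or None."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    # Group 1 is the item name, Group 2 is the price.
    return match.group(1), match.group(2)

samples = [
    "**LOOKING FOR PAYPAL OFFERS ON THESE PAINTED UNCOMMONS**",  # from the question
    "Painted Octane - $15.50",  # hypothetical sample
    "Dom's wheels ~ 3$",        # hypothetical sample
]
for sample in samples:
    print(repr(sample), "->", parse_offer(sample))
```

The string from the question now returns None right away instead of hanging, while the two priced samples yield a name and a price.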
Wiktor Stribiżew
  • Thank you! I built this regex from scratch so I was afraid it wasn't going to be robust. I'm still a bit confused on lookarounds though. I see you added one. What's the difference between that and what I had before? – Gawndy Apr 09 '17 at 19:38
  • There are no lookarounds in the pattern I suggested. Just capturing and non-capturing groups and quantifiers. There should be no big difference, just `\$?\d+(?:\.\d+)?\$?` can match `$15$`, but I doubt it will happen. You may use your `(?:(\d+\.?\d+\$?)|(\$?\d+\.?\d+))` if you find that "shortcut" matches "too much". – Wiktor Stribiżew Apr 09 '17 at 19:42
  • Yes, that is a [non-capturing group](http://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-what-does-a-question-mark-followed-by-a-colon) used only to group subpatterns, not to create `.group()`s. – Wiktor Stribiżew Apr 09 '17 at 20:12
  • After reading that thread, I realize that I've been using the words match and capture interchangeably. From what I've read, matching will not "return" something and capture does. – Gawndy Apr 09 '17 at 20:35
  • This answer opened my eyes to the fact that some logic is running behind the scenes, and that writing any valid regular expression is not the end of the story. The simple unwrapping suggested here got my re.search() up and running :) – ggaurav Jan 24 '20 at 08:19