Why does Python's re.search method hang?

Question

I'm using Python's regex library (re) to parse some strings, and I've found that either my regex is too complicated or the string I'm searching is too long.

Here's an example of the hang up:

>>> import re
>>> reg = r"(\w+'?\s*)+[-|~]\s*((\d+\.?\d+\$?)|(\$?\d+\.?\d+))"
>>> re.search(reg, "**LOOKING FOR PAYPAL OFFERS ON THESE PAINTED UNCOMMONS**") #Hangs here...

I'm not sure what's going on. Any help appreciated!

EDIT: Here's a link with examples of what I'm trying to match: Regxr

You have catastrophic backtracking. https://regex101.com/r/0vq6T7/1 — Keatinge, Apr 09 '17 at 19:02
As usual, catastrophic backtracking due to one obligatory and optional patterns inside a quantified group `(\w+'?\s*)+`. What do you want to match with it? Try changing it to `(\w+(?:'\s*\w+)*)` — Wiktor Stribiżew, Apr 09 '17 at 19:02
I updated my question with an example of what I'm trying to match. @Keatinge thank you. I'll look into catastrophic backtracking — Gawndy, Apr 09 '17 at 19:21

Wiktor Stribiżew · Accepted Answer · 2017-04-09T19:34:48.410

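The code hangs because of catastrophic backtracking. The quantified group (\w+'?\s*)+ contains one obligatory pattern (\w+) followed by optional patterns ('? and \s*, both of which can match an empty string). A run of word chars can therefore be split between iterations of the group in many different ways, and when the rest of the pattern fails, the regex engine tests all of them. There are so many matching paths that the search takes too long to complete.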
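To see the growth for yourself, here is a minimal sketch that times the pattern from the question on short runs of letters containing no - or ~. The helper name time_failed_search and the test lengths are my own choices for illustration. The lengths are kept small on purpose so the script finishes; exact timings depend on your machine.

```python
import re
import time

# The pattern from the question, written as a raw string.
SLOW = re.compile(r"(\w+'?\s*)+[-|~]\s*((\d+\.?\d+\$?)|(\$?\d+\.?\d+))")


def time_failed_search(n):
    """Search a run of n letters (no '-' or '~') and return (match, seconds)."""
    text = "A" * n
    start = time.perf_counter()
    match = SLOW.search(text)
    return match, time.perf_counter() - start


# Hypothetical lengths, small enough that every search still completes.
results = {n: time_failed_search(n) for n in (8, 12, 16)}

for n, (match, seconds) in results.items():
    print(f"n={n:2d}  match={match}  {seconds:.4f}s")
```

Every search returns None, but the time should climb steeply as n grows, which is why the much longer string in the question appears to hang.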

I suggest unrolling the problematic group so that ' or \s becomes obligatory between words, and wrapping that separator plus the following word in an optional, repeatable group:

(\w+(?:['\s]+\w+)*)\s*[-~]\s*(\$?\d+(?:\.\d+)?\$?)
^^^^^^^^^^^^^^^^^^^***

See the regex demo

Here, (\w+(?:['\s]+\w+)*) will match 1+ word chars, and then 0+ sequences of 1+ ' or whitespace chars followed by 1+ word chars. Consecutive words must now be separated by at least one ' or whitespace, so there is only one way to split a run of word chars, and the regex engine fails the match quickly if a non-matching string occurs.

The rest of the pattern:

\s*[-~]\s* - either - or ~ wrapped with 0+ whitespaces
(\$?\d+(?:\.\d+)?\$?) - Group 2, capturing:
- \$? - 1 or 0 $ symbols
- \d+ - 1+ digits
- (?:\.\d+)? - 1 or 0 sequences of:
  - \. - a dot
  - \d+ - 1+ digits
- \$? - 1 or 0 $ symbols
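A quick way to check the rewritten pattern is to run it against the string from the question and against a couple of matching inputs. The sample offer strings below are hypothetical (the examples linked in the question are not reproduced here), so treat this as a sketch rather than a full test suite.

```python
import re

# The rewritten pattern from this answer.
FIXED = re.compile(r"(\w+(?:['\s]+\w+)*)\s*[-~]\s*(\$?\d+(?:\.\d+)?\$?)")

# The string from the question: no '-' or '~', so there is nothing to match.
question_text = "**LOOKING FOR PAYPAL OFFERS ON THESE PAINTED UNCOMMONS**"
no_match = FIXED.search(question_text)
print(no_match)  # None, returned without a noticeable delay

# Hypothetical inputs, made up for illustration.
samples = ["Painted Hat - $15.50", "Bill's Hat ~ 20$"]
parsed = [FIXED.search(s).groups() for s in samples]

for name, price in parsed:
    print(f"{name!r} -> {price!r}")
```

Group 1 holds the item name and Group 2 the price, with the $ sign accepted on either side of the number.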

edited Apr 09 '17 at 19:34

answered Apr 09 '17 at 19:30

Wiktor Stribiżew

Thank you! I built this regex from scratch so I was afraid it wasn't going to be robust. I'm still a bit confused on lookarounds though. I see you added one. What's the difference between that and what I had before? – Gawndy Apr 09 '17 at 19:38
There are no lookarounds in the pattern I suggested. Just capturing and non-capturing groups and quantifiers. There should be no big difference, just `\$?\d+(?:\.\d+)?\$?` can match `$15$`, but I doubt it will happen. You may use your `(?:(\d+\.?\d+\$?)|(\$?\d+\.?\d+))` if you find that "shortcut" matches "too much". – Wiktor Stribiżew Apr 09 '17 at 19:42
Yes, that is a [non-capturing group](http://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group-what-does-a-question-mark-followed-by-a-colon) used only to group subpatterns, not to create `.group()`s. – Wiktor Stribiżew Apr 09 '17 at 20:12
After reading that thread, I realize that I've been using the words match and capture interchangeably. From what I've read, matching will not "return" something and capture does. – Gawndy Apr 09 '17 at 20:35
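For anyone else mixing up the two terms, a small sketch may help: both kinds of group take part in the match, but only a capturing group shows up in .groups(). The sample text is hypothetical.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Hat - 15.50"

# Same overall match, but the fractional part is captured only in the first pattern.
capturing = re.search(r"(\d+)(\.\d+)?", text)
non_capturing = re.search(r"(\d+)(?:\.\d+)?", text)

print(capturing.group(0), capturing.groups())          # 15.50 ('15', '.50')
print(non_capturing.group(0), non_capturing.groups())  # 15.50 ('15',)
```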
This answer opened my eyes that behind the scene some logic is running and it is not like you write any valid regular expression and that is it. A simple unwrapping suggested here helped my re.search() to be up and running :) – ggaurav Jan 24 '20 at 08:19
