
I'm trying to understand how a regex behaves when \d and \w are used consecutively to match words and numbers in a sentence. I searched for similar questions but couldn't find a good match (please let me know if this is a duplicate).

# Example sentence
"Adam has 100 friends. Bill has 23 friends. Cindy has 5 friends."

When I use regex [A-Za-z]+\s\w+\s\d+\w, it returns matches for:

  • Adam has 100
  • Bill has 23

BUT NOT FOR

  • Cindy has 5

I would have expected no matches at all, since the greedily matched digits (\d+) are not followed by a word character (\w); they are followed by whitespace instead. I think \w is somehow matching a digit that follows the first digit. I thought \d+ would have exhausted the whole run of digits. Can you help me understand what is going on here?

Thanks

Atakan
  • Could you clarify what kind of answer you expect? Explanation of how the pattern works? Then see https://regex101.com/r/LWD5hM/1/debugger – Wiktor Stribiżew Oct 10 '20 at 15:50
  • Hi Wiktor. I didn't understand the behavior initially because I didn't know that a greedy quantifier backtracks so the rest of the pattern can match. The answer below clarifies it. Thanks for the link! – Atakan Oct 11 '20 at 04:13

1 Answer


I thought \d+ would have exhausted the stretch of digits in the search

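No, that is not the case. \d+ first matches as many digits as it can, but the \w that follows (which also matches digits, since it is equivalent to [a-zA-Z_0-9]) forces the regex engine to backtrack one position so that \w can match one word character. In "Adam has 100", \d+ ends up matching 10 and \w matches the final 0. "Cindy has 5" does not match because \d+ needs at least one digit, so it cannot give up the only digit to \w.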
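To see the backtracking directly, here is a minimal sketch in Python's re module (the language is my choice; the question does not name one). The two capture groups are added only for illustration, so that the share taken by \d+ and by \w can be printed separately.

```python
import re

sentence = "Adam has 100 friends. Bill has 23 friends. Cindy has 5 friends."

# Same pattern as in the question, with capture groups added (for
# illustration only) around \d+ and \w to show what each one matched.
pattern = re.compile(r"[A-Za-z]+\s\w+\s(\d+)(\w)")

# Each entry: (whole match, digits kept by \d+, character taken by \w)
results = [(m.group(0), m.group(1), m.group(2)) for m in pattern.finditer(sentence)]

for whole, digits, word_char in results:
    print(f"{whole!r}: \\d+ kept {digits!r}, \\w took {word_char!r}")
# 'Adam has 100': \d+ kept '10', \w took '0'
# 'Bill has 23': \d+ kept '2', \w took '3'
```

"Cindy has 5" is absent because \d+ has only one digit and cannot hand it over to \w.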

If you don't want this backtracking to happen then use possessive quantifier ++:

[A-Za-z]+\s\w+\s\d++\w

However, note that the \d++\w pattern will fail in all 3 cases, because \d++ won't backtrack and \w will therefore never get a digit to match.

This pattern will succeed only if there is a non-digit word character right after the digits, as in "Chapter is 23A".
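The no-backtracking behavior can be checked with a short Python sketch. As far as I know, Python's re accepts \d++ directly only from version 3.11 on, so this sketch uses a swapped-in, portable emulation instead: a lookahead captures the whole digit run and a backreference consumes it. Because a lookahead is atomic, the captured run cannot be shortened afterwards, which is the same effect as the possessive quantifier. The helper name full_matches is hypothetical.

```python
import re

# Emulation of \d++ (not the possessive syntax itself):
# (?=(\d+)) captures the full digit run without consuming it, then \1
# consumes exactly that run. The engine cannot backtrack into the lookahead.
pattern = re.compile(r"[A-Za-z]+\s\w+\s(?=(\d+))\1\w")

sentence = "Adam has 100 friends. Bill has 23 friends. Cindy has 5 friends."
sample = "Chapter is 23A"


def full_matches(text):
    """Return the full text of every match (hypothetical helper)."""
    return [m.group(0) for m in pattern.finditer(text)]


sentence_matches = full_matches(sentence)
sample_matches = full_matches(sample)

print(sentence_matches)  # []
print(sample_matches)    # ['Chapter is 23A']
```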


anubhava