
I have the following piece of code. Note: the line variable comes from a line in a text file I'm reading, and the pattern variable is saved in a config file, which I pick up and apply in the code.

import re

line = "[u'INVOICE# SMR/0038 f\"', u'', u'', u'']"
pattern = r'(?<=(invoice#)\s)[A-z]{3}/\d{1,5}'

regex = re.compile(pattern, re.IGNORECASE)
invNum = re.findall(pattern, str(line), re.IGNORECASE)[0]
      ........

I'm expecting to get invNum = SMR/0038, but instead I get invoice#. What's the issue? If I try this pattern on https://regexr.com/, I see that the lookbehind is working, but transferring it to Python code doesn't work. See the image below from https://regexr.com/

[Image: sample from regexr]

Ani M

1 Answer


When the pattern contains a capturing group, re.findall returns only the captured substrings rather than the whole match. You wrapped invoice# in a capturing group inside the lookbehind, so that is what you get back.
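To make the difference concrete, here is a minimal sketch that runs the question's lookbehind twice on the question's sample string, once with the capturing group and once without. The variable names with_group and without_group are only illustrative, and [A-Za-z] is used in place of [A-z].

```python
import re

line = "[u'INVOICE# SMR/0038 f\"', u'', u'', u'']"

# Capturing group inside the lookbehind: findall returns group 1, not the match.
with_group = re.findall(r'(?<=(invoice#)\s)[A-Za-z]{3}/\d{1,5}', line, re.I)

# Same lookbehind without the capturing group: findall returns the whole match.
without_group = re.findall(r'(?<=invoice#\s)[A-Za-z]{3}/\d{1,5}', line, re.I)

print(with_group)     # ['INVOICE#']
print(without_group)  # ['SMR/0038']
```

So the lookbehind itself does work in Python; it is the extra pair of parentheses that changes what re.findall hands back.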

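Also, note that [A-z] matches more than just ASCII letters: the range also covers the six characters that sit between Z and a in the ASCII table ([, \, ], ^, _ and `). It is one of the most confusing patterns in the regex world. Use [A-Za-z].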
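A quick way to check this yourself is to build the string of characters between Z and a and run both character classes over it. This is only an illustrative sketch; the names between, loose and strict are made up for the example.

```python
import re

# Characters that sit between 'Z' (0x5A) and 'a' (0x61) in ASCII.
between = ''.join(chr(c) for c in range(ord('Z') + 1, ord('a')))
print(between)  # [\]^_`

loose = re.findall(r'[A-z]', between)      # the range spans these characters too
strict = re.findall(r'[A-Za-z]', between)  # letters only

print(loose)   # ['[', '\\', ']', '^', '_', '`']
print(strict)  # []
```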

You need to capture the part you want to extract, you do not even need a lookbehind:

import re
line = "[u'INVOICE# SMR/0038 f\"', u'', u'', u'']"
# Raw string so that \s and \d reach the regex engine unchanged
pattern = re.compile(r'invoice#\s+([A-Za-z]{3}/\d{1,5})', re.I)
print(pattern.findall(line))  # => ['SMR/0038']

See the online demo

Actually, as you need to get the first match only, use re.search (re.findall returns all matches):

m = pattern.search(line)
if m:
    print(m.group(1)) # => SMR/0038
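
If the pattern is applied to many lines, the search-and-check step can be wrapped in a small function. This is only a sketch: extract_invoice_number and INVOICE_RE are hypothetical names, and returning None for a line without the label is an assumption about what the calling code wants.

```python
import re

# Hypothetical names; the pattern is the capturing-group pattern from this answer.
INVOICE_RE = re.compile(r'invoice#\s+([A-Za-z]{3}/\d{1,5})', re.I)

def extract_invoice_number(text):
    """Return the first invoice number after an 'invoice#' label, or None."""
    m = INVOICE_RE.search(text)
    return m.group(1) if m else None

print(extract_invoice_number("JnJ LLC Invoice# SMT/778"))  # SMT/778
print(extract_invoice_number("no label here"))             # None
```

Unlike indexing the re.findall result with [0], this does not raise IndexError when a line has no invoice label.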
Wiktor Stribiżew
  • Interesting. I always test the pattern on regexr.com, and when I use your pattern on that site it returns the whole thing, Invoice# SMR/0038, instead of just the SMR/0038 string. So what's the difference? Python seems to be doing a lookbehind even when no lookbehind is specified in your pattern – Ani M Sep 16 '19 at 10:37
  • I want to capture the invoice number only if it is preceded by the "Invoice#" label. There could be other words before the label, so if the string "JnJ LLC Invoice# SMT/778" is passed, I should get SMT/778 after the regex is applied – Ani M Sep 16 '19 at 10:39
  • @AniM You do not need a lookbehind at all when you need a capturing group. The solution works with `JnJ LLC Invoice# SMT/778`, see [demo](https://ideone.com/IbaAef). You only need a positive lookbehind when you expect overlapping matches. In other cases, a capturing group is enough. – Wiktor Stribiżew Sep 16 '19 at 10:46
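
The point about overlapping matches in the last comment can be seen with a made-up comma-separated string (not taken from the question). In this sketch the consuming pattern uses up each delimiter, so neighbouring values that share a comma are skipped, while the lookaround version leaves the commas unconsumed. The names text, consuming and lookaround are illustrative only.

```python
import re

text = ",1,2,3,"

# The commas are consumed by each match, so the comma after "1" is no longer
# available as the leading comma for "2".
consuming = re.findall(r',(\d+),', text)

# Lookbehind and lookahead only assert that the commas are there; nothing is
# consumed, so every value is found.
lookaround = re.findall(r'(?<=,)\d+(?=,)', text)

print(consuming)   # ['1', '3']
print(lookaround)  # ['1', '2', '3']
```

For the invoice case there is nothing shared between consecutive matches, which is why the capturing group is sufficient there.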