
I need to extract the real issue number from my file names. There are 2 patterns:

  1. If there is no leading number in the file name, then the first number we read is the issue number. For example:
asdasd 213.pdf             ---> 213
abcd123efg456.pdf          ---> 123
  2. However, sometimes there is a leading number in the file name, which is just the index of the file, so I have to skip it first. For example:
123abcd 4567sdds.pdf    ---> 4567, since 123 is ignored
890abcd 123efg456.pdf   ---> 123, since 890 is ignored

Is it possible to write a single regular expression that does this? Currently, my solution involves 2 steps:

  1. if there is a leading number, remove it
  2. find the number in the remaining string

or, in Python code:


import re

reNumHeading = re.compile(r'^\d+')  # to find a leading number (raw string avoids invalid-escape warnings)
reNum = re.compile(r'\d+')  # to find any number


lstTest = '''123abcd 4567sdds.pdf
asdasd 213.pdf
abcd 123efg456.pdf
890abcd 123efg456.pdf'''.split('\n')

for test in lstTest:
    heading = reNumHeading.match(test)  # match once and reuse the result
    if heading:
        stripTest = test[heading.end():]  # drop the leading index
    else:
        stripTest = test

    result = reNum.findall(stripTest)
    if result:
        print(result[0])


thanks

oyster

2 Answers


You can use the ? quantifier to make a pattern optional:

>>> import re
>>> s = '''asdasd 213.pdf
... abcd123efg456.pdf
... 123abcd 4567sdds.pdf
... 890abcd 123efg456.pdf'''
>>> for line in s.split('\n'):
...     print(re.search(r'(?:^\d+)?.*?(\d+)', line)[1])
... 
213
123
4567
123
  • (?:^\d+)? is a non-capturing group made optional by the ? quantifier, so it matches the digits at the start of the line if there are any
    • since + is greedy, all the leading digits are consumed
  • .*? matches as few characters as possible (because we need the first run of digits after the optional prefix)
  • (\d+) captures the required digits
  • re.search returns a re.Match object, from which you can get various details
  • [1] on the re.Match object gives you the string captured by the first capturing group
    • use .group(1) if you are on a version of Python older than 3.6, which doesn't support the [1] syntax
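
If you want to reuse this, the pattern above could be wrapped in a small helper. This is only a sketch: the names ISSUE_RE and extract_issue_number are made up for illustration, and returning None when a name contains no digits at all is my assumption about what you would want in that case.

```python
import re

# Optional leading index, lazy skip, then capture the first remaining run of digits.
ISSUE_RE = re.compile(r'(?:^\d+)?.*?(\d+)')

def extract_issue_number(file_name):
    """Return the issue number as a string, or None if the name has no digits.

    Hypothetical helper name; the None fallback is an assumption.
    """
    match = ISSUE_RE.search(file_name)
    return match.group(1) if match else None

samples = [
    'asdasd 213.pdf',
    'abcd123efg456.pdf',
    '123abcd 4567sdds.pdf',
    '890abcd 123efg456.pdf',
]
for name in samples:
    print(name, '->', extract_issue_number(name))
```

Compiling the pattern once keeps the loop body short, and the explicit None check avoids a TypeError from subscripting a failed search.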

See also: Reference - What does this regex mean?

Sundeep

Just match digits \d+ that follow a non-digit \D:

import re

lstTest = '''123abcd 4567sdds.pdf
asdasd 213.pdf
abcd 123efg456.pdf
890abcd 123efg456.pdf'''.split('\n')

for test in lstTest:
    res = re.search(r'\D(\d+)', test)
    print(res.group(1))

Output:

4567
213
123
123
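
One thing to keep in mind: re.search returns None when nothing matches, so res.group(1) would raise an AttributeError for such a name. A guarded variant might look like the sketch below. The helper name issue_after_non_digit and the file name '123.pdf' are hypothetical, chosen only to show the no-match case (a name whose only digits are the leading index).

```python
import re

# A run of digits that is immediately preceded by a non-digit character.
AFTER_NON_DIGIT_RE = re.compile(r'\D(\d+)')

def issue_after_non_digit(file_name):
    """Return the first digit run that follows a non-digit, or None.

    Hypothetical helper name; returning None on no match is an assumption.
    """
    res = AFTER_NON_DIGIT_RE.search(file_name)
    return res.group(1) if res else None

for test in ['123abcd 4567sdds.pdf', 'asdasd 213.pdf', '123.pdf']:
    # '123.pdf' has no digits after a non-digit, so the helper returns None
    print(test, '->', issue_after_non_digit(test))
```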
Toto