skip leading number in regular expression?

Question

I need to extract the real issue number from my file names. There are two patterns:

If there is no leading number in the file name, the first number encountered is the issue number. For example:

asdasd 213.pdf             ---> 213
abcd123efg456.pdf          ---> 123

However, sometimes the file name starts with a number that is just the index of the file, so I have to skip it first. For example:

123abcd 4567sdds.pdf    ---> 4567, since 123 is ignored

890abcd 123efg456.pdf   ---> 123, since 890 is ignored

Is it possible to do this with a single regular expression? Currently, my solution involves two steps:

If there is a leading number, remove it.
Find the first number in the remaining string.

Or, in Python code:


import re

reNumHeading = re.compile(r'^\d+')  # leading number (the file index); raw string avoids an invalid-escape warning
reNum = re.compile(r'\d+')  # any run of digits


lstTest = '''123abcd 4567sdds.pdf
asdasd 213.pdf
abcd 123efg456.pdf
890abcd 123efg456.pdf'''.split('\n')

for test in lstTest:
    if reNumHeading.match(test):
        span =  reNumHeading.match(test).span()
        stripTest = test[span[1]:]
    else:
        stripTest = test

    result = reNum.findall(stripTest)
    if result:
        print(result[0])

thanks

score 3 · Answer 1 · answered Nov 01 '19 at 14:18

You can use the ? quantifier to make part of the pattern optional:

>>> import re
>>> s = '''asdasd 213.pdf
... abcd123efg456.pdf
... 123abcd 4567sdds.pdf
... 890abcd 123efg456.pdf'''
>>> for line in s.split('\n'):
...     print(re.search(r'(?:^\d+)?.*?(\d+)', line)[1])
... 
213
123
4567
123

(?:^\d+)? is a non-capturing group made optional by the ? quantifier; it matches digits at the start of the line if there are any
- since + is greedy, all of the leading digits are matched
.*? matches as few characters as possible (because we want the first run of digits after that)
(\d+) captures the required digits
re.search returns a re.Match object, from which you can get various details
[1] on the re.Match object gives the string captured by the first capturing group
- use .group(1) if you are on an older version of Python (before 3.6) that doesn't support the [1] syntax
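
As a rough sketch, the same pattern can be written with re.VERBOSE so that each part carries its own comment. The names ISSUE_RE and issue_number are made up for this illustration, and "12 abc.pdf" is an extra test name that is not in the question. It shows a case worth knowing about: when nothing but the leading number contains digits, the regex backtracks and gives up part of that leading number to the capturing group.

```python
import re

# The pattern from this answer, split across lines with re.VERBOSE.
# ISSUE_RE and issue_number are hypothetical names used only for this sketch.
ISSUE_RE = re.compile(r"""
    (?:^\d+)?   # optional run of digits at the very start (the file index)
    .*?         # as few characters as possible
    (\d+)       # the next run of digits: the issue number
""", re.VERBOSE)


def issue_number(name):
    """Return the captured digits as a string, or None if nothing matches."""
    m = ISSUE_RE.search(name)
    return m.group(1) if m else None


for name in ["asdasd 213.pdf",
             "abcd123efg456.pdf",
             "123abcd 4567sdds.pdf",
             "890abcd 123efg456.pdf",
             "12 abc.pdf"]:   # extra name (assumption): digits only at the start
    print(name, "->", issue_number(name))
```

For the four names from the question this prints 213, 123, 4567 and 123. For "12 abc.pdf" it prints 2, because the optional group backtracks from "12" to "1" so that (\d+) can still match. If such names can occur, that result probably needs separate handling.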

See also: Reference - What does this regex mean?

score 3 · Answer 2 · answered Nov 01 '19 at 14:40

Just match digits \d+ that follow a non-digit \D:

import re

lstTest = '''123abcd 4567sdds.pdf
asdasd 213.pdf
abcd 123efg456.pdf
890abcd 123efg456.pdf'''.split('\n')

for test in lstTest:
    res = re.search(r'\D(\d+)', test)
    print(res.group(1))

Output:

4567
213
123
123
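
One thing to watch with the \D(\d+) approach: re.search returns None when no digits follow a non-digit, so calling .group(1) directly raises AttributeError for such names. A minimal sketch with a guard is below. The function name issue_number_after_nondigit and the last two test names are assumptions added for illustration, not part of the question.

```python
import re


def issue_number_after_nondigit(name):
    """Return the first digit run that directly follows a non-digit, or None."""
    m = re.search(r"\D(\d+)", name)
    # re.search returns None when there is no match, so guard before .group()
    return m.group(1) if m else None


names = ["123abcd 4567sdds.pdf",
         "asdasd 213.pdf",
         "abcd 123efg456.pdf",
         "890abcd 123efg456.pdf",
         "12 abc.pdf",   # hypothetical: only a leading index, no issue number
         "abc.pdf"]      # hypothetical: no digits at all

for name in names:
    print(name, "->", issue_number_after_nondigit(name))
```

With this pattern a name that has only a leading number gives None instead of a fragment of that number, which is usually easier to detect downstream.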