I used the following regular expression on pythex to test it:

(\d|t)(_\d+){1}\.

It works fine there, and I am primarily interested in group 2. The demo below shows it matching successfully:

pythex demo

However, I can't get Python to actually show me the correct results. Here's an MWE:

import re

fn_list = ['IMG_0064.png',
           'IMG_0064.JPG',
           'IMG_0064_1.JPG',
           'IMG_0064_2.JPG',
           'IMG_0064_2.PNG',
           'IMG_0064_2.BMP',
           'IMG_0064_3.JPEG',
           'IMG_0065.JPG',
           'IMG_0065.JPEG',
           'IMG-20150623-00176-preview-left.jpg',
           'IMG-20150623-00176-preview-left_2.jpg',
           'thumb_2595.bmp',
           'thumb_2595_1.bmp',
           'thumb_2595_15.bmp']

pattern = re.compile(r'(\d|t)(_\d+){1}\.', re.IGNORECASE)

for line in fn_list:
    search_obj = re.match(pattern, line)
    if search_obj:
        matching_group = search_obj.groups()
        print(matching_group)

The output is nothing.

However, the pythex demo above clearly shows two groups returned for each match, so the second group should be present for many of these file names. What am I doing wrong?

Bob Dylan

2 Answers

You need to use re.search(), not re.match(). re.search() matches anywhere in the string, whereas re.match() matches only at the beginning.

import re

fn_list = ['IMG_0064.png',
           'IMG_0064.JPG',
           'IMG_0064_1.JPG',
           'IMG_0064_2.JPG',
           'IMG_0064_2.PNG',
           'IMG_0064_2.BMP',
           'IMG_0064_3.JPEG',
           'IMG_0065.JPG',
           'IMG_0065.JPEG',
           'IMG-20150623-00176-preview-left.jpg',
           'IMG-20150623-00176-preview-left_2.jpg',
           'thumb_2595.bmp',
           'thumb_2595_1.bmp',
           'thumb_2595_15.bmp']

pattern = re.compile(r'(\d|t)(_\d+){1}\.', re.IGNORECASE)

for line in fn_list:
    search_obj = re.search(pattern, line)  # CHANGED HERE
    if search_obj:
        matching_group = search_obj.groups()
        print(matching_group)

Result:

('4', '_1')
('4', '_2')
('4', '_2')
('4', '_2')
('4', '_3')
('t', '_2')
('5', '_1')
('5', '_15')

Since you are compiling the regular expression, you can do search_obj = pattern.search(line) instead of search_obj = re.search(pattern, line). As for your regular expression itself, r'([\dt])(_\d+)\.' is equivalent to the one you're using, and a bit cleaner.
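
To make the difference concrete, here is a minimal sketch comparing the two calls on a single name from the list, using the compiled pattern's own methods. It is written with Python 3 style print() calls, and the variable names are chosen only for illustration.

```python
import re

# Same pattern as in the question, written in the cleaner form.
pattern = re.compile(r'([\dt])(_\d+)\.', re.IGNORECASE)

name = 'IMG_0064_1.JPG'  # one sample taken from fn_list

# match() only tries at position 0; 'I' is neither a digit nor 't'.
match_result = pattern.match(name)

# search() scans forward until the pattern fits somewhere in the string.
search_result = pattern.search(name)

print(match_result)            # None
print(search_result.groups())  # ('4', '_1')
```

Calling the methods on the compiled object also avoids passing the pattern as an argument on every iteration.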

Cyphase
  • @BobDylan, that code right there gives that output right there. Paste it directly into a script and run it. – Cyphase Aug 12 '15 at 21:52
  • @BobDylan, I see you accepted the answer.. did you get it working? What was the issue? – Cyphase Aug 12 '15 at 22:02
  • @BobDylan, what was it, just out of curiosity? Plus someone else could run into the same issue. – Cyphase Aug 12 '15 at 22:04
  • In the actual code, I pass a file containing a file name on each line to a function that does this. I simply passed the filename rather than a handle to the open file, so it just searched the filename string and of course found nothing. I am an idiot. Sorry for wasting your time. I do appreciate the help, though – Bob Dylan Aug 12 '15 at 22:05
  • In that case it probably would have tried each character in the filename separately. No problem, it's not a waste of time; the `re.match()` versus `re.search()` issue was real :). But as an aside, that's why it's good to show your _actual_ code as much as you can, and not just something that is "equivalent" :P. – Cyphase Aug 12 '15 at 22:08
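
The mix-up described in the comments above can be reproduced with a short sketch. The asker's real function is not shown in the thread, so find_suffixes, listing_name, and the sample contents are hypothetical, and io.StringIO stands in for an open file handle. It assumes the function loops over whatever it is given, as the last comment guesses.

```python
import io
import re

pattern = re.compile(r'([\dt])(_\d+)\.', re.IGNORECASE)

def find_suffixes(lines):
    """Collect group 2 for every item produced by iterating over `lines`."""
    return [m.group(2) for m in map(pattern.search, lines) if m]

listing_name = 'file_list.txt'  # hypothetical name of the listing file
# Stand-in for open(listing_name): one file name per line.
listing = io.StringIO('IMG_0064_1.JPG\nthumb_2595_15.bmp\n')

from_string = find_suffixes(listing_name)  # iterates 'f', 'i', 'l', ...
from_handle = find_suffixes(listing)       # iterates whole lines

print(from_string)  # []
print(from_handle)  # ['_1', '_15']
```

Iterating over a str yields single characters, none of which can satisfy a pattern that needs at least four, so the first call silently returns nothing.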

You need to use the following code:

import re
fn_list = ['IMG_0064.png',
           'IMG_0064.JPG',
           'IMG_0064_1.JPG',
           'IMG_0064_2.JPG',
           'IMG_0064_2.PNG',
           'IMG_0064_2.BMP',
           'IMG_0064_3.JPEG',
           'IMG_0065.JPG',
           'IMG_0065.JPEG',
           'IMG-20150623-00176-preview-left.jpg',
           'IMG-20150623-00176-preview-left_2.jpg',
           'thumb_2595.bmp',
           'thumb_2595_1.bmp',
           'thumb_2595_15.bmp']

pattern = re.compile(r'([\dt])(_\d+)\.', re.IGNORECASE)  # slightly simplified regex

for line in fn_list:
    search_obj = pattern.search(line)  # call search() on the compiled regex
    if search_obj:
        matching_group = search_obj.group(2)  # group(2) returns only the second group
        print(matching_group)

See IDEONE demo

As for the regex, (\d|t) is the same as ([\dt]), but the latter is more efficient. Also, {1} is redundant in regex.
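
As a quick sanity check of that claim, the sketch below runs both spellings over a few names from the question and compares group 2. The helper second_group and the shortened samples list are introduced here only for illustration.

```python
import re

original = re.compile(r'(\d|t)(_\d+){1}\.', re.IGNORECASE)
simplified = re.compile(r'([\dt])(_\d+)\.', re.IGNORECASE)

# A few names taken from fn_list in the question.
samples = ['IMG_0064.png',
           'IMG_0064_2.PNG',
           'IMG-20150623-00176-preview-left_2.jpg',
           'thumb_2595_15.bmp']

def second_group(regex, name):
    """Return group 2 of the first match, or None when nothing matches."""
    found = regex.search(name)
    return found.group(2) if found else None

# Both spellings should agree on every sample.
for name in samples:
    assert second_group(original, name) == second_group(simplified, name)

suffixes = [second_group(simplified, name) for name in samples]
print(suffixes)  # [None, '_2', '_2', '_15']
```

This only shows agreement on these inputs; it does not measure the efficiency difference.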

Wiktor Stribiżew