Regex works fine on Pythex, but not in Python

Question

I used the following regular expression on pythex to test it:

(\d|t)(_\d+){1}\.

It works fine there, and I am primarily interested in group 2. The pythex results showed it matching successfully (screenshot not reproduced here).

However, I can't get Python to actually show me the correct results. Here's a MWE:

import re

fn_list = ['IMG_0064.png',
           'IMG_0064.JPG',
           'IMG_0064_1.JPG',
           'IMG_0064_2.JPG',
           'IMG_0064_2.PNG',
           'IMG_0064_2.BMP',
           'IMG_0064_3.JPEG',
           'IMG_0065.JPG',
           'IMG_0065.JPEG',
           'IMG-20150623-00176-preview-left.jpg',
           'IMG-20150623-00176-preview-left_2.jpg',
           'thumb_2595.bmp',
           'thumb_2595_1.bmp',
           'thumb_2595_15.bmp']

pattern = re.compile(r'(\d|t)(_\d+){1}\.', re.IGNORECASE)

for line in fn_list:
    search_obj = re.match(pattern, line)
    if search_obj:
        matching_group = search_obj.groups()
        print(matching_group)

The output is nothing.

However, the pythex output above clearly shows two groups returned for each match, and the second group should be present for many more of these files. What am I doing wrong?

But you used a different regex. You removed the 1st group. That is the reason. [Revert it](https://regex101.com/r/mS8zN4/3). Is `([\dt])(_\d+)\.` what you need? Also, you need `search`. — Wiktor Stribiżew, Aug 12 '15 at 21:45
From the Pythex output, it looks like the matches occur in the middle of the strings. `re.match()` only returns a result if it occurs _at the beginning of the string_. — TigerhawkT3, Aug 12 '15 at 21:48
@stribizhev fixed. I had tried several variations. This variation returns nothing in the Python program but still works in pythex — Bob Dylan, Aug 12 '15 at 21:50
@TigerhawkT3 I also tried `search` but that doesn't work either. Nothing is returned — Bob Dylan, Aug 12 '15 at 21:50
@BobDylan, then you did not post your actual code. Your code works as expected: http://ideone.com/u4QFuW — kay, Aug 12 '15 at 21:56
Try this regex `([\dt])+(_\d+)\.` to match the ones that have the character `t` in them. (as shown in that screenshot) — Renae Lider, Aug 12 '15 at 22:01
Turns out my real error had nothing to do with regex, but got it working now. Arg... -- thank you everyone — Bob Dylan, Aug 12 '15 at 22:06

Cyphase · Accepted Answer · 2015-08-12T21:58:55.027

You need to use re.search(), not re.match(). re.search() matches anywhere in the string, whereas re.match() matches only at the beginning.

import re

fn_list = ['IMG_0064.png',
           'IMG_0064.JPG',
           'IMG_0064_1.JPG',
           'IMG_0064_2.JPG',
           'IMG_0064_2.PNG',
           'IMG_0064_2.BMP',
           'IMG_0064_3.JPEG',
           'IMG_0065.JPG',
           'IMG_0065.JPEG',
           'IMG-20150623-00176-preview-left.jpg',
           'IMG-20150623-00176-preview-left_2.jpg',
           'thumb_2595.bmp',
           'thumb_2595_1.bmp',
           'thumb_2595_15.bmp']

pattern = re.compile(r'(\d|t)(_\d+){1}\.', re.IGNORECASE)

for line in fn_list:
    search_obj = re.search(pattern, line)  # CHANGED HERE
    if search_obj:
        matching_group = search_obj.groups()
        print(matching_group)

Result:

('4', '_1')
('4', '_2')
('4', '_2')
('4', '_2')
('4', '_3')
('t', '_2')
('5', '_1')
('5', '_15')

Since you are compiling the regular expression, you can do search_obj = pattern.search(line) instead of search_obj = re.search(pattern, line). As for your regular expression itself, r'([\dt])(_\d+)\.' is equivalent to the one you're using, and a bit cleaner.
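
As a minimal sketch of the difference on a single file name (written with print() so it also runs on Python 3; the variable names at_start and anywhere are only illustrative), the compiled pattern can be called both ways:

```python
import re

pattern = re.compile(r'([\dt])(_\d+)\.', re.IGNORECASE)
name = 'IMG_0064_1.JPG'

# match() only tries position 0; 'I' is neither a digit nor 't', so it fails.
at_start = pattern.match(name)

# search() scans the whole string and finds '4_1.' in the middle.
anywhere = pattern.search(name)

print(at_start)           # None
print(anywhere.groups())  # ('4', '_1')
```

Calling the methods on the compiled object avoids passing the pattern back through the module-level functions.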

edited Aug 12 '15 at 21:58

answered Aug 12 '15 at 21:50


@BobDylan, that code right there gives that output right there. Paste it directly into a script and run it. – Cyphase Aug 12 '15 at 21:52
@BobDylan, I see you accepted the answer.. did you get it working? What was the issue? – Cyphase Aug 12 '15 at 22:02
@BobDylan, what was it, just out of curiosity? Plus someone else could run into the same issue. – Cyphase Aug 12 '15 at 22:04
In the actual code, I pass a file containing a file name on each line to a function that does this. I simply passed the filename rather than a handle to the open file, so it just searched the filename string and of course found nothing. I am an idiot. Sorry for wasting your time. I do appreciate the help, though – Bob Dylan Aug 12 '15 at 22:05
In that case it probably would have tried each character in the filename separately. No problem, it's not a waste of time; the `re.match()` versus `re.search()` issue was real :). But as an aside, that's why it's good to show your _actual_ code as much as you can, and not just something that is "equivalent" :P. – Cyphase Aug 12 '15 at 22:08
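
The bug described in these comments can be reproduced with a small sketch. Everything here is hypothetical and only for illustration: the function name find_suffixes, the path 'list_2.txt', and the io.StringIO object standing in for an open file are assumptions, not the asker's actual code.

```python
import io
import re

pattern = re.compile(r'([\dt])(_\d+)\.', re.IGNORECASE)

def find_suffixes(lines):
    """Return group 2 for every item of `lines` that the pattern matches."""
    found = []
    for line in lines:
        m = pattern.search(line)
        if m:
            found.append(m.group(2))
    return found

# Stand-in for an open file containing one file name per line.
handle = io.StringIO('IMG_0064_1.JPG\nthumb_2595.bmp\nthumb_2595_15.bmp\n')

# Bug: passing the path string. Iterating a str yields single characters,
# so nothing can match, even though the whole string 'list_2.txt' would.
from_path = find_suffixes('list_2.txt')

# Intended: passing the file object. Iterating it yields whole lines.
from_handle = find_suffixes(handle)

print(from_path)    # []
print(from_handle)  # ['_1', '_15']
```

Because a string is itself iterable, the loop raises no error in the buggy call; it just silently finds nothing.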

Wiktor Stribiżew · Answer 2 · 2015-08-12T22:01:13.757

You need to use the following code:

import re
fn_list = ['IMG_0064.png',
           'IMG_0064.JPG',
           'IMG_0064_1.JPG',
           'IMG_0064_2.JPG',
           'IMG_0064_2.PNG',
           'IMG_0064_2.BMP',
           'IMG_0064_3.JPEG',
           'IMG_0065.JPG',
           'IMG_0065.JPEG',
           'IMG-20150623-00176-preview-left.jpg',
           'IMG-20150623-00176-preview-left_2.jpg',
           'thumb_2595.bmp',
           'thumb_2595_1.bmp',
           'thumb_2595_15.bmp']

pattern = re.compile(r'([\dt])(_\d+)\.', re.IGNORECASE) # OPTIMIZED REGEX A BIT

for line in fn_list:
    search_obj = pattern.search(line)  # YOU NEED SEARCH WITH THE COMPILED REGEX
    if search_obj:
        matching_group = search_obj.group(2) # YOU NEED TO ACCESS GROUP 2 IF YOU ARE INTERESTED JUST IN GROUP 2
        print(matching_group)

See IDEONE demo

As for the regex, (\d|t) is the same as ([\dt]), but the latter is more efficient. Also, {1} is redundant in regex.
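
A quick way to check that the two patterns behave the same on the file names from the question is to compare their captured groups. This is a spot check on a few samples, not a proof of equivalence, and it says nothing about speed; the helper groups_or_none is a hypothetical name introduced for this sketch.

```python
import re

original = re.compile(r'(\d|t)(_\d+){1}\.', re.IGNORECASE)
simplified = re.compile(r'([\dt])(_\d+)\.', re.IGNORECASE)

samples = ['IMG_0064.png',
           'IMG_0064_3.JPEG',
           'IMG-20150623-00176-preview-left_2.jpg',
           'thumb_2595.bmp',
           'thumb_2595_15.bmp']

def groups_or_none(regex, text):
    """Return the captured groups of the first match, or None if there is none."""
    m = regex.search(text)
    return m.groups() if m else None

for name in samples:
    # Both patterns should capture exactly the same groups (or both fail).
    assert groups_or_none(original, name) == groups_or_none(simplified, name)
    print(name, groups_or_none(simplified, name))
```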

thank you, I upvoted but because I'm a peon it doesn't show my vote — Bob Dylan, Aug 12 '15 at 22:10
:) I hope my regex explanation was of help from the very beginning. When you get more rep, you can always come back and upvote. — Wiktor Stribiżew, Aug 12 '15 at 22:12
