Clarification on Python regexes and findall()

Question

I came across this problem as I was working on the Python Challenge. Number 10 to be exact. I decided to try and solve it using regexes - pulling out the repeating sequences, counting their length, and building the next item in the sequence off of that.

So the regex I developed was: '(\d)\1*'

It worked well on the online regex tester, but when using it in my script it didn't perform the same:

import re

regex = re.compile('(\d)\1*')
text = '111122223333'
re.findall(regex, text)

> ['1', '1', '1', '1', '2', '2', '2',...]

And so on and so forth. So I learned about raw string notation from the re module documentation. Which brings me to my first question: can someone please explain what exactly it does? The docs describe it as reducing the need to escape backslashes, but it doesn't appear to be required for simpler regexes such as \d+, and I don't understand why.

So I change my regex to r'(\d)\1*' and now try and use findall() to make a list of the sequences. And I get

> ['1', '2', '3']

Very confused again. I still don't understand this. Help please?

I decided to do this to get around this:

[m.group() for m in regex.finditer(text)]
> ['1111', '2222', '3333']

And get what I've been looking for. Then, based off of this thread, I try doing findall() adding a group to the whole regex -> r'((\d)\2*)'. I end up getting:

> [('1111', '1'), ('2222', '2'), ('3333', '3')]

At this point I'm all kinds of confused. I know that this result has something to do with multiple groups, but I'm just not sure.

Also, this is my first time posting so I apologize if my etiquette isn't correct. Please feel free to correct me on that as well. Thanks!

You should avoid glomming together multiple questions like this, it makes them harder to follow. It's better to post separate simple questions. You should also point out what result you were expecting instead of merely saying that you're confused by what you got. — millimoose, Jul 23 '12 at 16:11
"I came across this problem... I decided to solve it using regexes..." How many problems do you have now? — Daniel Roseman, Jul 23 '12 at 16:26
@DanielRoseman: for the purposes of the Python Challenge, problem 10, regular expressions are a good way to approach the problem. Provided you understand what the `re` module actually gives you. — Martijn Pieters, Jul 23 '12 at 16:29
**State the question. State it in the first line.** Don't force us to sit through an essay plus reference the question off http://www.pythonchallenge.com/ so we have to go solve problems 0..9, **just to get to the statement of your question**. Grrr. If you want to write essays about code, put it on a blog. This site is for Q&A. Don't write "I did X. Then I read Z. So I tried Y. I'm confused." Instead write "I'm trying to do A, the result should look like B, why does code C produce D instead?" — smci, Jan 15 '15 at 21:20

Martijn Pieters · Accepted Answer · 2012-07-23T18:37:13.747

Since this is the challenge I won't give you a complete answer. You are on the right track however.

The finditer method returns MatchObject instances. You want to look at the .group() method on these and read the documentation carefully. Think about what the difference is between .group(0) and .group(1) there; plain .group() is the same as .group(0).
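
A minimal sketch of that difference, reusing the sample text and the raw-string pattern from the question (the variable names are only for illustration):

```python
import re

text = '111122223333'
pattern = re.compile(r'(\d)\1*')

matches = list(pattern.finditer(text))

# group(0) (same as group()) is everything the whole pattern matched.
whole = [m.group(0) for m in matches]
# group(1) is only what the first parenthesised group captured.
first = [m.group(1) for m in matches]

print(whole)  # ['1111', '2222', '3333']
print(first)  # ['1', '2', '3']
```

The second list is the same thing findall() gave for this pattern, which hints at which of the two it reports when a group is present.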

As for the \d escape sequence: because that particular combination has no meaning as a Python string escape, Python leaves it alone as a backslash followed by the letter d, so the regex engine still sees \d. The \1 in your first attempt is different: in a normal string literal it is an octal escape, so Python turns it into the single character with code 1 before the re module ever sees it, and the backreference is lost. That is why it is better to use the r'' raw string literal format; it prevents nasty surprises when a regular expression escape also happens to be an escape sequence Python does recognize. See the Python documentation on string literals for more information.
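
To make the raw versus non-raw difference concrete, here is a small sketch using the sample text from the question. The non-raw pattern is written with an explicit '\\d' so that only the '\1' part depends on Python's own escape handling; the variable names are illustrative.

```python
import re

text = '111122223333'

# In a normal string literal, '\1' is an octal escape: one character, chr(1).
plain = '\1'
# In a raw literal, the backslash is kept: two characters, backslash and '1'.
raw = r'\1'
print(len(plain), len(raw))  # 1 2

# Without the r prefix the regex engine receives (\d) followed by chr(1)*,
# so there is no backreference and every digit matches on its own.
no_backref = re.findall('(\\d)\1*', text)
# With the r prefix the engine receives (\d)\1*, a real backreference.
with_backref = re.findall(r'(\d)\1*', text)

print(no_backref)    # ['1', '1', '1', '1', '2', '2', '2', '2', '3', '3', '3', '3']
print(with_backref)  # ['1', '2', '3']
```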

Your .findall() with the r'((\d)\2*)' expression returns 2 elements per match as you have 2 groups in your pattern; the outer, whole group matching (\d)\2* and the inner group matching \d. From the .findall() documentation:

If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group.
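
The quoted rule can be checked against the question's sample text with zero, one, and two groups. This is only a sketch; the variable names are made up for illustration.

```python
import re

text = '111122223333'

# No groups: findall returns the whole matches.
no_groups = re.findall(r'\d+', text)
# One group: findall returns what that group captured, not the whole match.
one_group = re.findall(r'(\d)\1*', text)
# Two groups: findall returns one tuple per match, one item per group.
two_groups = re.findall(r'((\d)\2*)', text)

print(no_groups)   # ['111122223333']
print(one_group)   # ['1', '2', '3']
print(two_groups)  # [('1111', '1'), ('2222', '2'), ('3333', '3')]

# The outer group is the first item of each tuple.
runs = [whole for whole, digit in two_groups]
print(runs)        # ['1111', '2222', '3333']
```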

edited Jul 23 '12 at 18:37

answered Jul 23 '12 at 16:21


What do you mean when you say that that escape combination has no meaning? I'm still not grasping that. Is it recognized as the digit character class or is not? (when you don't use the raw format) – Louis Jul 23 '12 at 16:35
@Louis: It has meaning as a regular expression class, but not as a python escape sequence (like `\n` would be, that's a newline). – Martijn Pieters Jul 23 '12 at 16:38
@Martin: I see, thank you. The only thing that escapes me now is why `findall()` produced the second results I mentioned. – Louis Jul 23 '12 at 16:59
@Louis: Because you end up having two match groups, `(\d)\2*` and `\d`. – JAB Jul 23 '12 at 17:10
