I came across this problem while working on the Python Challenge, number 10 to be exact. I decided to try solving it with regexes: pulling out the runs of repeated digits, counting their lengths, and building the next item in the sequence from that.
So the regex I developed was: '(\d)\1*'
It worked well in an online regex tester, but it behaved differently in my script:
import re
regex = re.compile('(\d)\1*')  # plain (non-raw) string literal
text = '111122223333'
re.findall(regex, text)
> ['1', '1', '1', '1', '2', '2', '2',...]
And so on and so forth. Then I learned about raw string notation (the r'' prefix) from the re module docs. That leads to my first question: can someone please explain what exactly it does? The docs describe it as reducing the need to escape backslashes, but it doesn't appear to be required for simpler regexes such as \d+, and I don't understand why.
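To narrow this down, here is a small check of what Python does with the backslashes before `re` ever sees the pattern. It is only a sketch, assuming Python 3 and the standard library; the variable names are mine.

```python
import re

# In a normal string literal, '\1' is an octal escape: one character, chr(1).
plain_backref = '\1'
# In a raw string literal the backslash is kept, so re receives backslash + '1'.
raw_backref = r'\1'

print(len(plain_backref), plain_backref == chr(1))  # 1 True
print(len(raw_backref), raw_backref == '\\1')       # 2 True

# '\d' is not a recognized string escape, so Python leaves the backslash in
# place (newer versions warn about it). The raw and escaped spellings agree:
print(r'\d+' == '\\d+')  # True

# So the non-raw pattern really means "a digit, then any number of chr(1)".
non_raw_result = re.findall('(\\d)\1*', '1122')
print(non_raw_result)  # ['1', '1', '2', '2']
```

If that reading is right, `\d+` only works without the `r` prefix because `\d` happens not to be a string escape, while `\1` is one.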
So I changed my regex to r'(\d)\1*' and tried findall() again to make a list of the sequences. This time I get:
> ['1', '2', '3']
Very confused again. I still don't understand this. Help please?
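To isolate it, this is a minimal comparison of the whole match against the captured group for the first run only. It is a sketch assuming Python 3 and the standard `re` module; `first` and `found` are names I picked.

```python
import re

text = '111122223333'
pattern = re.compile(r'(\d)\1*')

first = pattern.search(text)
print(first.group(0))  # 1111  (the whole match)
print(first.group(1))  # 1     (only what the parentheses captured)

found = pattern.findall(text)
print(found)           # ['1', '2', '3']
```

So the full runs are being matched; `findall()` just seems to hand back group 1 rather than group 0 when the pattern contains a single group.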
To get around this, I used finditer() instead:
[m.group() for m in regex.finditer(text)]
> ['1111', '2222', '3333']
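With whole runs available, the "count the length and build the next item" step I mentioned at the top could look roughly like this. It is only a sketch: it assumes each run is written out as its length followed by its digit, which is my reading of the puzzle, and `RUN` and `next_item` are hypothetical names.

```python
import re

# One run of a repeated digit: a digit, then zero or more copies of it.
RUN = re.compile(r'(\d)\1*')

def next_item(current: str) -> str:
    """Describe each run in `current` as <run length><digit>."""
    return ''.join(
        f'{len(m.group(0))}{m.group(1)}' for m in RUN.finditer(current)
    )

print(next_item('111122223333'))  # 414243
print(next_item('1'))             # 11
print(next_item('21'))            # 1211
```

Using `finditer()` here keeps both pieces in reach: `m.group(0)` for the run length and `m.group(1)` for the digit.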
The finditer() version gives me what I've been looking for. Then, based off of this thread, I tried findall() again after wrapping the whole regex in a group, r'((\d)\2*)', and I end up getting:
> [('1111', '1'), ('2222', '2'), ('3333', '3')]
At this point I'm all kinds of confused. I know that this result has something to do with multiple groups, but I'm just not sure.
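From what I can tell in the `re` docs, `findall()` returns a list of tuples once the pattern has more than one group, with one entry per group in the order the parentheses open. A minimal sketch of pulling just the runs back out, assuming Python 3; `pairs` and `runs` are names I made up:

```python
import re

text = '111122223333'

# Group 1 is the whole run, group 2 is the single digit it repeats.
pairs = re.findall(r'((\d)\2*)', text)
print(pairs)  # [('1111', '1'), ('2222', '2'), ('3333', '3')]

# Keep only the first element of each tuple to get the runs themselves.
runs = [whole for whole, digit in pairs]
print(runs)   # ['1111', '2222', '3333']
```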
Also, this is my first time posting so I apologize if my etiquette isn't correct. Please feel free to correct me on that as well. Thanks!