I'm trying to create a regex that matches consecutive instances of an expression, but only if the text begins with that expression.

Let's say I want to find a number followed by a word: \d \w+.
For the text:

1 word 2 letters some more words 3 groups

I want to get two groups: "1 word" and "2 letters", because the line starts with a match (a number and a word - "1 word"), and another one follows right after ("2 letters"). But I don't want it to match "3 groups".

For the text:

abc 1 word 3 letters

no groups should match because it starts with "abc".

Thanks in advance!

John L
  • It looks like you want us to write some code for you. While many users are willing to produce code for a coder in distress, they usually only help when the poster has already tried to solve the problem on their own. A good way to demonstrate this effort is to include the code you've written so far (forming a [mcve]), example input (if there is any), the expected output, and the output you actually get (output, tracebacks, etc.). The more detail you provide, the more answers you are likely to receive. Check the [tour] and [ask]. – TigerhawkT3 Mar 27 '17 at 17:48
  • Check the linked question's top answer, and look under quantifiers (for `+`) and groups (for `(?: ... )`). – TigerhawkT3 Mar 27 '17 at 17:51
  • Oh! Are you sleeping or what? I made an incredible answer to your question, prevent it to be closed and you are asleep? No comment, no upvote, no downvote? – Casimir et Hippolyte Mar 27 '17 at 21:17

1 Answer

You can't do this in "pure regex" with the re module, but you can iterate over the matches with re.finditer and keep only those that start exactly where the previous one ended (the first one must start at index 0):

import re

s = '1 word 2 letters some more words 3 groups'

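def getFromStart(p, s):
    # Yield matches only while each one starts where the previous one ended.
    index = 0
    for m in re.finditer(p, s):
        if m.start() != index:
            # Gap found: stop. Use return here, because raising StopIteration
            # inside a generator is a RuntimeError since Python 3.7 (PEP 479).
            return
        yield m
        index = m.end()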

print([m.group(1) for m in getFromStart(r'(\d \w+)\s*', s)])

Alternatively, skip the re module and install the third-party regex module, which provides the \G anchor. This anchor matches at the position where the previous match ended (and, for the first match, at the start of the string). Starting the pattern with it ensures that successive matches are contiguous from the beginning of the string:

import regex

s = '1 word 2 letters some more words 3 groups'

print(regex.findall(r'\G(\d \w+)\s*', s))
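If installing the regex module isn't an option, the \G behaviour described above can be approximated with the standard re module. This is only a sketch and not part of the original answer: it relies on the fact that a compiled pattern's match(string, pos) succeeds only when the match begins exactly at pos. The function name contiguous_from_start is made up for illustration.

```python
import re

def contiguous_from_start(pattern, text):
    """Return group(1) of each match in an unbroken run starting at index 0.

    Hypothetical helper: emulates a leading \\G by calling
    Pattern.match(text, pos), which only matches starting at pos.
    """
    compiled = re.compile(pattern)
    results = []
    pos = 0
    while pos < len(text):
        m = compiled.match(text, pos)
        # Stop on the first gap, or on an empty match (avoids looping forever).
        if m is None or m.end() == pos:
            break
        results.append(m.group(1))
        pos = m.end()
    return results

print(contiguous_from_start(r'(\d \w+)\s*', '1 word 2 letters some more words 3 groups'))
print(contiguous_from_start(r'(\d \w+)\s*', 'abc 1 word 3 letters'))
```

With the two sample inputs from the question, the first call should give the two leading groups and the second an empty list, since "abc" prevents a match at index 0.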
Casimir et Hippolyte