2

I am trying to use regex in Python to find and print all matching lines from a multiline search. The text that I am searching through may have the below example structure:

AAA
ABC1
ABC2
ABC3
AAA
ABC1
ABC2
ABC3
ABC4
ABC
AAA
ABC1
AAA

From this I want to retrieve the ABC*s that occur at least once and are preceded by an AAA.

The problem is that, despite the match catching what I want:

match = <_sre.SRE_Match object; span=(19, 38), match='AAA\nABC2\nABC3\nABC4\n'>

... I can access only the last match of the group:

match groups = ('AAA\n', 'ABC4\n')

Below is the example code that I use for this problem.

#! python
import sys
import re
import os

string = "AAA\nABC1\nABC2\nABC3\nAAA\nABC1\nABC2\nABC3\nABC4\nABC\nAAA\nABC1\nAAA\n"
print(string)

p_MATCHES = []
p_MATCHES.append( (re.compile('(AAA\n)(ABC[0-9]\n){1,}')) ) #   
matches = re.finditer(p_MATCHES[0],string)

for match in matches:
    strout = ''
    gr_iter = 0
    print("match = "+str(match))
    print("match groups = "+str(match.groups()))
    for group in match.groups():
        gr_iter += 1
        sys.stdout.write("TEST GROUP:"+str(gr_iter)+"\t"+group) # test output
        if group is not None:
            if group != '':
                strout += '"'+group.replace("\n","",1)+'"'+'\n'
    sys.stdout.write("\nCOMPLETE RESULT:\n"+strout+"====\n")
glamredhel

2 Answers

6

Here is your regular expression:

(AAA\r\n)(ABC[0-9]\r\n){1,}

Debuggex Demo

Your goal is to capture all ABC#s that immediately follow AAA. As you can see in this Debuggex demo, all ABC#s are indeed being matched (they're highlighted in yellow). However, since only the "what is being repeated" part

ABC[0-9]\r\n

is being captured (is inside the parentheses), and its quantifier,

{1,}

is not being captured, each repetition overwrites the previous capture, so all matches except the final one are discarded. To keep them all, the quantifier must also be inside the capturing group:

AAA\r\n((?:ABC[0-9]\r\n){1,})

Debuggex Demo

I've placed the "what is being repeated" part (ABC[0-9]\r\n) into a non-capturing group. (I've also stopped capturing AAA, as you don't seem to need it.)
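To see the difference in Python, here is a minimal sketch that runs both patterns over the sample string from the question. It uses \n rather than \r\n because that is what the question's string contains; the variable names are mine, chosen for illustration.

```python
import re

# Sample text from the question (plain \n line endings).
text = "AAA\nABC1\nABC2\nABC3\nAAA\nABC1\nABC2\nABC3\nABC4\nABC\nAAA\nABC1\nAAA\n"

# Quantifier outside the capturing group: each repetition overwrites the capture.
repeated_group = re.compile(r"(AAA\n)(ABC[0-9]\n){1,}")
# Quantifier inside the capturing group: the whole run is captured as one string.
whole_run = re.compile(r"AAA\n((?:ABC[0-9]\n){1,})")

last_only = [m.group(2) for m in repeated_group.finditer(text)]
full_runs = [m.group(1) for m in whole_run.finditer(text)]

print(last_only)  # only the final ABC line of each run
print(full_runs)  # every ABC line of each run, still joined by newlines
```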

The captured text can be split on the newline, and will give you all the pieces as you wish.

(Note that \n by itself doesn't work in Debuggex. It requires \r\n.)
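The splitting step might look like the sketch below. The helper name abc_runs and the shortened sample text are assumptions made for illustration, and str.splitlines() is used instead of split("\n") so that the trailing newline does not leave an empty string at the end of each list.

```python
import re

# Shortened sample in the same shape as the question's text.
text = "AAA\nABC1\nABC2\nABC3\nAAA\nABC1\nAAA\n"

def abc_runs(s):
    """Return one list of ABC lines for each AAA that is followed by at least one."""
    pattern = re.compile(r"AAA\n((?:ABC[0-9]\n){1,})")
    # group(1) is the whole run; splitlines() breaks it into individual lines.
    return [m.group(1).splitlines() for m in pattern.finditer(s)]

runs = abc_runs(text)
print(runs)
```

Each inner list can then be iterated like the set of groups you were hoping to get from the match object.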


This is a workaround. Not many regular expression flavors offer the capability of iterating through repeating captures (which ones...?). A more common approach is to loop through the matches and process each one as it is found. Here's an example from Java:

import java.util.regex.*;

public class RepeatingCaptureGroupsDemo {
   public static void main(String[] args) {
      String input = "I have a cat, but I like my dog better.";

      Pattern p = Pattern.compile("(mouse|cat|dog|wolf|bear|human)");
      Matcher m = p.matcher(input);

      while (m.find()) {
         System.out.println(m.group());
      }
   }
}

Output:

cat
dog

(From http://ocpsoft.org/opensource/guide-to-regular-expressions-in-java-part-1/, about a quarter of the way down)
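Since the question is about Python, a rough equivalent of that loop is sketched here with re.finditer, which yields one match object per hit. The variable names are mine; the input sentence and the alternation are the ones from the Java example.

```python
import re

sentence = "I have a cat, but I like my dog better."
animal = re.compile(r"mouse|cat|dog|wolf|bear|human")

found = []
# finditer() yields match objects lazily, one per non-overlapping hit.
for m in animal.finditer(sentence):
    found.append(m.group())
    print(m.group())
```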


Please consider bookmarking the Stack Overflow Regular Expressions FAQ for future reference. The links in this answer come from it.

aliteralmind
  • That is a worthy workaround. What I aimed at is to get an iterable set of groups like: match groups = ('AAA', 'ABC1', 'ABC2', 'ABC3',...) What I get with this solution is match groups = ('AAA', 'ABC1\nABC2\nABC3\n') – glamredhel Apr 14 '14 at 14:45
  • 2
    sorry, I've tried upvoting but didn't have enough reputation. The answer helped indeed. I've built-in another set of loops as a workaround to get what I wanted. Not exactly how I wanted to solve it, but still a solution. – glamredhel Apr 14 '14 at 18:34
0

You want the run of consecutive ABC\n lines that follows an AAA\n, matched as greedily as possible. You also want only that whole run as a group, not a tuple of the run plus the most recent ABC\n. So, in your regex, stop the subgroup inside the group from capturing. Start by writing the pattern that represents the whole match:

AAA\n(ABC[0-9]\n)+

Then capture the part you are interested in with (), remembering to make the inner subgroup(s) non-capturing with (?:...):

AAA\n((?:ABC[0-9]\n)+)

You can then use either findall() or finditer(). I find finditer() easier, especially when you are dealing with more than one capture. With finditer():

import re
# `string` is the sample text defined in the question
matches_iter = re.finditer(r'AAA\n((?:ABC[0-9]\n)+)', string)

for m in matches_iter:
    print(m.group(1))

With findall(), using the original {1,}, which is just a more verbose form of +:

matches_all = re.findall(r'AAA\n((?:ABC[0-9]\n){1,})', string)

# splitlines() avoids the empty trailing item that split("\n") would produce
for block in matches_all:
    for line in block.splitlines():
        print(line)