
I have the text below from a file. I have also tried it from within the program as a string (using '''mytext''').

RECORD1  Sed similique nostrum quibusdam minus. Rerum repudiandae et ipsum numquam commodi repellendus. Aut minima ratione vel 
beatae minima reprehenderit provident neque. Earum quam temporibus repudiandae quidem officiis
RECORD2 Sed similique nostrum quibusdam minus. Rerum repudiandae et ipsum numquam commodi repellendus. Aut minima ratione vel 
beatae minima reprehenderit provident neque. Earum quam temporibus repudiandae quidem officiis
RECORD3   It is a long established fact that a reader will be distracted by the readable content of a page when looking at its 
layout. 
RECORD4 

If I use Notepad++'s Find with this pattern,

(RECORD.*?\s).*?(?=(RECORD.*?\s)) (with the ". matches newline" option checked)

I can match from one RECORDx up to just before the next RECORDx. In other words, the lookahead gives me the match below.

RECORD1  Sed similique nostrum quibusdam minus. Rerum repudiandae et ipsum numquam commodi repellendus. Aut minima ratione vel 
beatae minima reprehenderit provident neque. Earum quam temporibus repudiandae quidem officiis

So I get only the record, which is what I need. Notepad++ does this with the positive lookahead (?=(RECORD.*?\s)) and the "match newline" option. The same pattern does not seem to work in Python, and I do not know how to write it correctly. How do I do a lookahead in Python like I did in Notepad++?

I have looked at this: https://markantoniou.blogspot.com/2008/06/notepad-how-to-use-regular-expressions.html

But I am not sure what to do.

This is my Python code. It prints nothing and returns to the prompt. I know re is working because a pattern like .* works fine, and (RECORD.*?\s) on its own returns just the RECORDx label.

import re

# Same pattern as in Notepad++; no flags are passed to re here.
regex = r"(RECORD.*?\s).*?(?=(RECORD.*?\s))"
filepath = 'test.txt'
with open(filepath) as fp:
    data = fp.read()

matches = re.finditer(regex, data)
for matchNum, match in enumerate(matches, start=1):
    print("Match {matchNum} was found at {start}-{end}: {match}".format(matchNum=matchNum, start=match.start(), end=match.end(), match=match.group()))

    # Groups are numbered from 1; group 0 is the whole match.
    for groupNum in range(1, len(match.groups()) + 1):
        print("Group {groupNum} found at {start}-{end}: {group}".format(groupNum=groupNum, start=match.start(groupNum), end=match.end(groupNum), group=match.group(groupNum)))

(I wrote the regex and the file-reading part myself; most of the rest was generated by https://regex101.com/.) Among other places, I have looked at the question "Python regex positive look ahead", and I have tried various pattern combinations in both Python and Notepad++.
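
One thing that may explain the difference: in Python, "." does not match a newline unless the re.DOTALL flag is set, so the lazy .*? can never reach the next RECORD on a later line. The sketch below is only a guess at the fix. It assumes re.DOTALL plays the same role as Notepad++'s ". matches newline" option, and it uses a shortened, made-up stand-in for test.txt (the variable names are mine, not from the original script).

```python
import re

# Shortened stand-in for the contents of test.txt (an assumption for this sketch).
data = (
    "RECORD1  Sed similique nostrum quibusdam minus. Aut minima ratione vel \n"
    "beatae minima reprehenderit provident neque.\n"
    "RECORD2 Sed similique nostrum quibusdam minus.\n"
    "beatae minima reprehenderit provident neque.\n"
    "RECORD3   It is a long established fact.\n"
    "layout. \n"
    "RECORD4 \n"
)

# Same pattern as above. re.DOTALL makes "." match "\n" too, so the lazy
# ".*?" can run across line breaks until the lookahead sees the next RECORD.
pattern = re.compile(r"(RECORD.*?\s).*?(?=(RECORD.*?\s))", re.DOTALL)

# group(0) is the whole match: one record, up to just before the next one.
records = [m.group(0) for m in pattern.finditer(data)]

for number, record in enumerate(records, start=1):
    print("Match {}: {!r}".format(number, record))
```

With this pattern the last record (RECORD4 here) is not returned, because the lookahead requires another RECORD after it. If the final record is needed too, the lookahead would have to also accept the end of the text, for example (?=RECORD|\Z).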

johnny