Regex replace anything but multiple occurrence of group

Question

I am trying to find a proper regex to replace everything in a string except a group preceded by a certain pattern.

Suppose I have records like these:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr. Lorem ipsum duo dolores, tempor et ea rebum.
L. i. sed diam; duo dolores. Lorem ipsum tempor et ea. Duo dolores
L.i. nonumy eirmod tempor et ea rebum. L. i. consetetur sadipscing.

I want to replace everything in the strings except what is preceded by a variant of lorem ipsum. I wish to have the following outcome:

dolor sit amet; duo dolores
sed diam; tempor et ea
nonumy eirmod tempor et ea rebum; consetetur sadipscing

I tried the following code to capture the group but am not able to capture the second occurrence of the group.

'.*((Lorem ipsum)|(L\. *i\.)) ([0-9A-Za-z]+)+.*','\4; '

I suspect it has to do with the second .* among other reasons. I'm trying to do this in Oracle 11g but am not opposed to doing this with Python.
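
A minimal sketch of what the attempt does, run with Python's `re` module rather than Oracle (so Oracle's `REGEXP_REPLACE` may differ in details). The variable names `record`, `attempt` and `collapsed` are illustrative only. It suggests the leading `.*` matters at least as much as the trailing one: being greedy, it consumes everything up to the last variant of lorem ipsum, so earlier occurrences never reach group 4.

```python
import re

# First sample record from the question.
record = ("Lorem ipsum dolor sit amet, consetetur sadipscing elitr. "
          "Lorem ipsum duo dolores, tempor et ea rebum.")

# Pattern from the attempt above, unchanged.
attempt = r'.*((Lorem ipsum)|(L\. *i\.)) ([0-9A-Za-z]+)+.*'

# The leading greedy .* swallows everything up to the LAST "Lorem ipsum",
# group 4 holds a single word (the class excludes spaces), and the
# trailing .* consumes the rest, so the record collapses into one replacement.
collapsed = re.sub(attempt, r'\4; ', record)
print(repr(collapsed))  # 'duo; '
```

Only the first word after the last occurrence survives, which matches the symptom described above.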

What exactly do you want to replace the things with? How many words after `Lorem Ipsum` do you want to capture? What do you want to do with that captured group afterwards? Could you please elaborate, step by step — Chase, Mar 27 '20 at 09:53
Is there a reason why you have pairs on each line separated by a semi-colon such as `dolor sit amet; duo dolores` rather than just separate strings? — DarrylG, Mar 27 '20 at 10:15
`amet,` ends on a comma, while the expected is `amet;`. Also for `rebum.` and `rebum;` Is there a rule for the punctuations? — The fourth bird, Mar 27 '20 at 10:23
@Nilsic Take a look at this [expression](https://regex101.com/r/HBuB3A/1). I don't know how to do it without the trailing `;` unless you add a programmatic element. — oriberu, Mar 27 '20 at 12:58
@Chase I want to replace anything but the desired string with an empty string, extracting what I have described in my post. The number of words after `Lorem Ipsum` is dynamic; anything after `Lorem Ipsum` until any punctuation is to be captured. The captured group should be extracted/printed and if the captured group occurs multiple times, I want it to be separated by `;`. — Nilsic, Mar 30 '20 at 05:23
@DarrylG There is no specific reason in having the semi-colon as a separator instead of separate strings. — Nilsic, Mar 30 '20 at 05:33
@WiktorStribiżew Thank you for your solution in Python! I will try to adapt this to my actual problem. — Nilsic, Mar 30 '20 at 05:36
@Thefourthbird I used different punctuation just to give an example that I am dealing with exactly that. — Nilsic, Mar 30 '20 at 05:40
@oriberu I like your solution a lot! The negative lookbehind you used seems to be what I was struggling to find. — Nilsic, Mar 30 '20 at 05:49

DarrylG · Answer 1 · 2020-03-27T11:09:45.703

2

To detect the individual strings:

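import re

# Sample records from the question
s = """Lorem ipsum dolor sit amet, consetetur sadipscing elitr. Lorem ipsum duo dolores, tempor et ea rebum.
L. i. sed diam; duo dolores. Lorem ipsum tempor et ea. Duo dolores
L.i. nonumy eirmod tempor et ea rebum. L. i. consetetur sadipscing."""

# Regex pattern
pattern = r'(?:(Lorem ipsum )|(L\.\s?i\. ))(.*?)(?=[^\w\s])'

# findall returns one tuple per match; index 2 is the third group (the wanted text)
result = [m[2] for m in re.findall(pattern, s, re.I)]

# Pattern matches
print('Matching Patterns\n')
print('\n'.join(result))

print('\nFormatted into pairs\n')

# Group consecutive matches into pairs (a slice avoids an IndexError on an odd count)
m = ['; '.join(result[i:i+2]) for i in range(0, len(result), 2)]

# Print pairs
print('\n'.join(m))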

Output

Matching Patterns

dolor sit amet
duo dolores
sed diam
tempor et ea
nonumy eirmod tempor et ea rebum
consetetur sadipscing

Formatted into pairs

dolor sit amet; duo dolores
sed diam; tempor et ea
nonumy eirmod tempor et ea rebum; consetetur sadipscing

Explanation

Using pattern:

pattern = r'(?:(Lorem ipsum )|(L\.\s?i\. ))(.*?)(?=[^\w\s])'

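(?:(Lorem ipsum )|(L\.\s?i\. )) - non-capturing group wrapping the two variants of Lorem ipsum (each variant sits in its own capturing group, which is why the wanted text is the third group, m[2])
(.*?) - 'non-greedy' match of any characters
(?=[^\w\s]) - lookahead that stops the match just before the first character that is neither a word character nor whitespace (i.e. punctuation)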
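
The code above pairs up matches across the whole text. A possible per-record variant is sketched below. It is an adaptation, not the original answer: it assumes each record is a separate string, drops the inner capturing groups so `findall` returns plain strings, and adds `|$` to the lookahead so a capture may also end at the end of the record. The names `records`, `PATTERN` and `extract` are hypothetical.

```python
import re

# Sample records from the question, one string per record.
records = [
    "Lorem ipsum dolor sit amet, consetetur sadipscing elitr. "
    "Lorem ipsum duo dolores, tempor et ea rebum.",
    "L. i. sed diam; duo dolores. Lorem ipsum tempor et ea. Duo dolores",
    "L.i. nonumy eirmod tempor et ea rebum. L. i. consetetur sadipscing.",
]

# Assumed adaptation of the pattern above: a single capturing group, and
# "|$" so that a capture may also stop at the end of the record.
PATTERN = re.compile(r'(?:Lorem ipsum|L\.\s?i\.) (.*?)(?=[^\w\s]|$)', re.I)

def extract(record):
    """Join every capture found in one record with '; '."""
    return '; '.join(PATTERN.findall(record))

for record in records:
    print(extract(record))
```

Joining per record keeps the result independent of how many occurrences a record contains, so records with one or three matches would not shift the pairing of later records.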

edited Mar 27 '20 at 11:09

answered Mar 27 '20 at 10:27

DarrylG


Note that the `(.*?)(?=[\.,;])` means the only "stop" punctuation supported is `.`, `,` and `;`, and it must be present even at the end of the string for it to work. – Wiktor Stribiżew Mar 27 '20 at 10:29
@WiktorStribiżew--good point. I suppose I should limit it to what he has in his example. I'll update my answer. – DarrylG Mar 27 '20 at 10:33
Now, you are basically using [my solution](https://stackoverflow.com/questions/60883407/regex-replace-anything-but-multiple-occurence-of-group/60884132?noredirect=1#comment107717993_60883407). Please do not use others' solutions. – Wiktor Stribiżew Mar 27 '20 at 10:52
@WiktorStribiżew--I object when you say I was using your solution since I have seen `(\w[\w\s]*)` in other posts. But, to be non-controversial I switched to another stop pattern, namely: (?=[^\w\s]) which I got from [this post](https://stackoverflow.com/questions/56209963/combining-regular-expressions-in-python-w-and-s). – DarrylG Mar 27 '20 at 11:08
@WiktorStribiżew are you sure *the my solution* link addresses the right location ? – Barbaros Özhan Mar 27 '20 at 12:06
@DarrylG Thanks for your solution. I will try to adapt this to my problem. As already stated in another comment the use of a lookbehind/lookahead seems promising. – Nilsic Mar 30 '20 at 06:02
