I am trying to find a proper regex to replace anything in a string but a group preceded by a certain pattern.

Suppose I have records like these:

Lorem ipsum dolor sit amet, consetetur sadipscing elitr. Lorem ipsum duo dolores, tempor et ea rebum.
L. i. sed diam; duo dolores. Lorem ipsum tempor et ea. Duo dolores
L.i. nonumy eirmod tempor et ea rebum. L. i. consetetur sadipscing.

I want to replace anything in the strings but what is preceded by a variant of lorem ipsum. I wish to have the following outcome:

dolor sit amet; duo dolores
sed diam; tempor et ea
nonumy eirmod tempor et ea rebum; consetetur sadipscing

I tried the following code to capture the group but am not able to capture the second occurrence of the group.

'.*((Lorem ipsum)|(L\. *i\.)) ([0-9A-Za-z]+)+.*','\4; '

I suspect it has to do with the second .*, among other reasons. I'm trying to do this in Oracle 11g but am not opposed to doing this with Python.
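For illustration, the attempt can be reproduced in Python. This is only a sketch: it assumes Python's `re` engine, and Oracle's `REGEXP_REPLACE` may differ in detail. The names `ATTEMPT`, `REPLACEMENT`, `record` and `result` are made up for the example. It suggests that the leading greedy `.*` already consumes the first occurrence, and that `([0-9A-Za-z]+)+` stops at the first space, so only one word of the last occurrence survives.

```python
import re

# Pattern and replacement from the attempt above, as Python raw strings.
ATTEMPT = r'.*((Lorem ipsum)|(L\. *i\.)) ([0-9A-Za-z]+)+.*'
REPLACEMENT = r'\4; '

# First sample record from the question.
record = ('Lorem ipsum dolor sit amet, consetetur sadipscing elitr. '
          'Lorem ipsum duo dolores, tempor et ea rebum.')

# The greedy leading .* backtracks only as far as the LAST "Lorem ipsum",
# and the trailing .* swallows the rest, so the whole record is one match.
result = re.sub(ATTEMPT, REPLACEMENT, record)
print(repr(result))
```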

halfer
Nilsic
  • What exactly do you want to replace the things with? How many words after `Lorem Ipsum` do you want to capture? What do you want to do with that captured group afterwards? Could you please elaborate, step by step – Chase Mar 27 '20 at 09:53
  • Is there a reason why you have pairs on each line separated by a semi-colon such as `dolor sit amet; duo dolores` rather than just separate strings? – DarrylG Mar 27 '20 at 10:15
  • Python: https://ideone.com/qWybdW. Does it work for you? – Wiktor Stribiżew Mar 27 '20 at 10:23
  • `amet,` ends on a comma, while the expected is `amet;`. Also for `rebum.` and `rebum;` Is there a rule for the punctuations? – The fourth bird Mar 27 '20 at 10:23
  • @Nilsic Take a look at this [expression](https://regex101.com/r/HBuB3A/1). I don't know how to do it without the trailing `;` unless you add a programmatic element. – oriberu Mar 27 '20 at 12:58
  • @Chase I want to replace anything but the desired string with an empty string, extracting what I have described in my post. The number of words after `Lorem Ipsum` is dynamic; anything after `Lorem Ipsum` until any punctuation is to be captured. The captured group should be extracted/printed, and if the captured group occurs multiple times, I want the occurrences to be separated by `;`. – Nilsic Mar 30 '20 at 05:23
  • @DarrylG There is no specific reason in having the semi-colon as a separator instead of separate strings. – Nilsic Mar 30 '20 at 05:33
  • @WiktorStribiżew Thank you for your solution in Python! I will try to adapt this to my actual problem. – Nilsic Mar 30 '20 at 05:36
  • @Thefourthbird I used different punctuation just to give an example that I am dealing with exactly that. – Nilsic Mar 30 '20 at 05:40
  • @oriberu I like your solution a lot! The negative lookbehind you used seems to be what I was struggling to find. – Nilsic Mar 30 '20 at 05:49

1 Answer

To detect the individual strings:

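import re

# Sample records from the question, one per line
s = """Lorem ipsum dolor sit amet, consetetur sadipscing elitr. Lorem ipsum duo dolores, tempor et ea rebum.
L. i. sed diam; duo dolores. Lorem ipsum tempor et ea. Duo dolores
L.i. nonumy eirmod tempor et ea rebum. L. i. consetetur sadipscing."""

# Regex pattern
pattern = r'(?:(Lorem ipsum )|(L\.\s?i\. ))(.*?)(?=[^\w\s])'

# Find matching strings; the wanted phrase is the third group of each match
result = [m[2] for m in re.findall(pattern, s, re.I)]

# Pattern matches
print('Matching Patterns\n')
print('\n'.join(result))

print('\nFormatted into pairs\n')

# Display as pairs
#    Group into pairs (assumes each record yields exactly two matches)
m = ['; '.join(result[i:i + 2]) for i in range(0, len(result), 2)]

#    Print pairs
print('\n'.join(m))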

Output

Matching Patterns

dolor sit amet
duo dolores
sed diam
tempor et ea
nonumy eirmod tempor et ea rebum
consetetur sadipscing

Formatted into pairs

dolor sit amet; duo dolores
sed diam; tempor et ea
nonumy eirmod tempor et ea rebum; consetetur sadipscing

Explanation

Using pattern:

pattern = r'(?:(Lorem ipsum )|(L\.\s?i\. ))(.*?)(?=[^\w\s])'

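(?:(Lorem ipsum )|(L\.\s?i\. )) - non-capturing group wrapping two capturing groups, one per variant of Lorem ipsum (which is why the wanted phrase is the third group, m[2])
(.*?) - non-greedy match of any characters
(?=[^\w\s]) - lookahead that stops at the first character that is neither a word character nor whitespace, such as punctuation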
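Since the question's comments state that the number of occurrences per record is dynamic, a per-record variant may be closer to what is wanted than fixed pairs. The following is a minimal sketch under that assumption; `PATTERN`, `extract` and `records` are names made up for this example, and the pattern is the same as above.

```python
import re

# Same pattern as in the answer, compiled once with case-insensitive matching.
PATTERN = re.compile(r'(?:(Lorem ipsum )|(L\.\s?i\. ))(.*?)(?=[^\w\s])', re.I)

def extract(record):
    """Join every phrase found in one record with '; '."""
    # Group 3 is the phrase that follows a "Lorem ipsum" variant.
    return '; '.join(m.group(3) for m in PATTERN.finditer(record))

# The three sample records from the question.
records = [
    'Lorem ipsum dolor sit amet, consetetur sadipscing elitr. '
    'Lorem ipsum duo dolores, tempor et ea rebum.',
    'L. i. sed diam; duo dolores. Lorem ipsum tempor et ea. Duo dolores',
    'L.i. nonumy eirmod tempor et ea rebum. L. i. consetetur sadipscing.',
]

for record in records:
    print(extract(record))
```

Processing each record on its own keeps matches from different records apart, so a record with one or three occurrences should not shift the grouping of the others.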
DarrylG
  • Note that the `(.*?)(?=[\.,;])` means the only "stop" punctuation supported is `.`, `,` and `;`, and it must be present even at the end of the string for it to work. – Wiktor Stribiżew Mar 27 '20 at 10:29
  • @WiktorStribiżew--good point. I suppose I should limit it to what he has in his example. I'll update my answer. – DarrylG Mar 27 '20 at 10:33
  • Now, you are basically using [my solution](https://stackoverflow.com/questions/60883407/regex-replace-anything-but-multiple-occurence-of-group/60884132?noredirect=1#comment107717993_60883407). Please do not use others' solutions. – Wiktor Stribiżew Mar 27 '20 at 10:52
  • @WiktorStribiżew--I object when you say I was using your solution since I have seen `(\w[\w\s]*)` in other posts. But, to be non-controversial I switched to another stop pattern, namely: (?=[^\w\s]) which I got from [this post](https://stackoverflow.com/questions/56209963/combining-regular-expressions-in-python-w-and-s). – DarrylG Mar 27 '20 at 11:08
  • @WiktorStribiżew are you sure *the my solution* link addresses the right location ? – Barbaros Özhan Mar 27 '20 at 12:06
  • @DarrylG Thanks for your solution. I will try to adapt this to my problem. As already stated in another comment the use of a lookbehind/lookahead seems promising. – Nilsic Mar 30 '20 at 06:02
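One point from the comments is that a stop character must be present even at the end of the string. A possible way around this, shown here only as a sketch, is to let the lookahead also accept the end of the string with `$`. The names `PATTERN_EOS` and `phrases` are hypothetical, and the pattern is simplified to a single capturing group rather than copied from the answer.

```python
import re

# Variant with one capturing group; the lookahead accepts either a
# character that is neither word nor whitespace, or the end of the string.
PATTERN_EOS = re.compile(r'(?:Lorem ipsum|L\.\s?i\.) (.*?)(?=[^\w\s]|$)', re.I)

def phrases(record):
    """Return the phrases that follow a "Lorem ipsum" variant in one record."""
    # With exactly one capturing group, findall returns a list of strings.
    return PATTERN_EOS.findall(record)

# The second phrase has no trailing punctuation and is still captured.
print(phrases('L. i. sed diam; Lorem ipsum tempor et ea'))
```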