How do I remove a block of text containing a specific phrase via RegEx

Question

I have some text, which looks like this:

12 12 obj
<<
Some content here
>>
endobj
12 13 obj
<<
Some content here with a email address that contains @mail.
>>
endobj
11 12 obj
<<
Some more content here
>>
endobj

I want to remove every text block that starts with \d+ \d+ obj (e.g. 12 13 obj), ends at endobj, and contains a specific string, which in this case is @mail. I'm having some trouble finding the right RegEx for this, though.

I'm able to successfully select each block with (\d+\ \d+\ obj[\s\S]+?endobj). See test here: https://regex101.com/r/V4WAMl/5

But I am unable to get this one to work as I want: (\d+\ \d+\ obj[\s\S]+?@mail[\s\S]+?endobj). See test here: https://regex101.com/r/V4WAMl/4

I have an idea as to why this is happening, but I'm not really sure how to work around it. My theory is that the lazy quantifier ends up behaving greedily: the first block does not contain @mail, so the match keeps expanding past its endobj until it reaches the next block that does. I've tried a combination of various excludes ^(?:*****), but those just seem to not match anything when I try.

Wiktor Stribiżew · Accepted Answer · 2018-11-08T13:29:02.367

Use the following solution:

\d+ \d+ obj(?:(?!\d+ \d+ obj)[\s\S])*?@mail[\s\S]*?endobj
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^

See the regex demo

The point here is that you need to match the starting delimiter, then any chars, 0+ occurrences, as few as possible, none of which starts another match of the starting delimiter pattern, then the required pattern, and then any 0+ chars, as few as possible, up to the trailing delimiter:

<START>(?:(?!<START>)[\s\S])*?<WORD>[\s\S]*?<END>

Details:

\d+ \d+ obj - 1+ digits, a space, 1+ digits, a space, and a literal obj
(?:(?!\d+ \d+ obj)[\s\S])*? - any char ([\s\S]) that is not a starting point for the \d+ \d+ obj sequence, 0+ occurrences, as few as possible. Thus, the regex engine won't be able to overflow into the next \d+ \d+ obj block. You may also add a |@mail alternative to the negative lookahead, but since the lazy quantifier is used, it is not necessary (for more details about this construct, see this post)
@mail - a literal substring @mail
[\s\S]*? - any 0+ chars, as few as possible
endobj - a literal substring.
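
As a rough illustration of the breakdown above, here is a minimal Python sketch that runs both the pattern from the question and the tempered pattern against the sample text. Python's `re` module is only an assumption for the demonstration (the question does not name a regex flavor), and the names `SAMPLE`, `NAIVE` and `TEMPERED` are made up for this example.

```python
import re

SAMPLE = """\
12 12 obj
<<
Some content here
>>
endobj
12 13 obj
<<
Some content here with a email address that contains @mail.
>>
endobj
11 12 obj
<<
Some more content here
>>
endobj
"""

# Pattern from the question: the lazy [\s\S]+? is free to run past the
# first endobj while it searches for @mail.
NAIVE = re.compile(r"\d+ \d+ obj[\s\S]+?@mail[\s\S]+?endobj")

# Pattern from the answer: the lookahead stops the scan as soon as
# another "\d+ \d+ obj" header begins.
TEMPERED = re.compile(
    r"\d+ \d+ obj(?:(?!\d+ \d+ obj)[\s\S])*?@mail[\s\S]*?endobj"
)

naive_match = NAIVE.search(SAMPLE).group(0)
tempered_match = TEMPERED.search(SAMPLE).group(0)

print(naive_match.startswith("12 12 obj"))     # True: two blocks swallowed
print(tempered_match.startswith("12 13 obj"))  # True: only the @mail block

# Removing the match leaves the other two blocks in place.
cleaned = TEMPERED.sub("", SAMPLE)
print(cleaned)
```

The naive pattern starts at the first header and spans two blocks, while the tempered one can only start at the header of the block that actually contains @mail.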

Note that you may add a multiline modifier and add ^ (start of a line) and $ (end of a line) anchors where necessary to make matching safer and more precise (demo).
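
One possible way to add those anchors, sketched in Python, is shown here. The exact anchored pattern from the linked demo is not reproduced, so `ANCHORED` is an assumption about what such a variant could look like, and the sample text is hypothetical: its first block mentions "endobj" inside its content, which is the kind of input where anchoring helps.

```python
import re

# Hypothetical sample: the first block contains the word "endobj"
# in the middle of a content line.
SAMPLE = (
    "1 0 obj\n"
    "<<\n"
    "Contact: someone@mail.example (not an endobj marker)\n"
    ">>\n"
    "endobj\n"
    "2 0 obj\n"
    "<<\n"
    "Other\n"
    ">>\n"
    "endobj\n"
)

# Unanchored pattern from the answer.
UNANCHORED = re.compile(
    r"\d+ \d+ obj(?:(?!\d+ \d+ obj)[\s\S])*?@mail[\s\S]*?endobj"
)

# Assumed anchored variant: with re.MULTILINE, ^ and $ match at line
# boundaries, so the delimiters must occupy whole lines.
ANCHORED = re.compile(
    r"^\d+ \d+ obj$(?:(?!^\d+ \d+ obj$)[\s\S])*?@mail[\s\S]*?^endobj$",
    re.MULTILINE,
)

unanchored_result = UNANCHORED.sub("", SAMPLE)
anchored_result = ANCHORED.sub("", SAMPLE)

# The unanchored pattern stops at the inline "endobj" and leaves the
# tail of the block behind; the anchored one removes the whole block.
print(repr(unanchored_result))
print(repr(anchored_result))
```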

I had something very similar and came up with the (much simpler) regex: `^\d+ \d+ obj .*searchstring.*endobj $`. However, yours is more foolproof, for sure. But I see that your regex, just like mine, leaves an empty line where the match was removed. I would like the found string to be removed with no empty line left behind. How can this be achieved? — GeertVc, May 17 '20 at 04:59
@GeertVc The regex does not replace, it only matches. What you need is out of the OP question scope, I believe. If you need to remove a line break, match it. — Wiktor Stribiżew, May 17 '20 at 10:57
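
A minimal sketch of "match the line break", assuming Python's `re.sub` does the replacing: an optional `(?:\r?\n)?` is appended to the pattern so the line break after endobj is consumed together with the block. The suffix and the name `PATTERN` are illustrative choices, not part of the answer above.

```python
import re

SAMPLE = (
    "12 12 obj\n<<\nSome content here\n>>\nendobj\n"
    "12 13 obj\n<<\nSome content with @mail.\n>>\nendobj\n"
    "11 12 obj\n<<\nSome more content here\n>>\nendobj\n"
)

# Same pattern as in the answer, plus an optional trailing line break
# (\r\n or \n) so that removing the block leaves no empty line.
PATTERN = re.compile(
    r"\d+ \d+ obj(?:(?!\d+ \d+ obj)[\s\S])*?@mail[\s\S]*?endobj(?:\r?\n)?"
)

cleaned = PATTERN.sub("", SAMPLE)
print(cleaned)
```

The line break is optional here so that a matching block at the very end of the input, with no newline after it, is still removed.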
