
What is the problem:

I have a multiline text, for example:

1: This is test string for my app. d
2: This is test string for my app.
3: This is test string for my app. abcd
4: This is test string for my app.
5: This is test string for my app.
6: This is test string for my app.
7: This is test string for my app. d
8: This is test string for my app.
9: This is test string for my app.
10: This is another string.

The line numbers are only here for better visualization; they are not part of the text itself.

What I have tried:

I have tried two different regexes (the flags are always i, g and m):

^([^\r\n]*)$(.*?)(?:(?:\r?\n|\r)\1)+$

see here: regexr.com/5nklg

and

^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$)

see here: regexr.com/5nkla

They produce different outputs. Both come close, but neither is perfect.

What I would like to achieve:

Remove all duplicate phrases in the text, but keep one. So here, for example, keep the first "This is test string for my app." from line 1, match the same phrase on lines 2 - 9, and keep line 10.

It would also work for me if I could keep the last matching phrase instead of the first. In that case the regex would match lines 1 - 8 and keep lines 9 and 10.

Is there a way to do this with Regex?

FYI: I will use the regex in Python later to sub the duplicates out:

re.sub(r"^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$)", "", my_text, flags=re.MULTILINE)
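To make the "good, but not perfect" part concrete, here is a minimal, self-contained sketch of that call on the sample text. The names `sample_lines` and `result` are only illustrative. As far as I can tell, this pattern only removes a line when an identical line appears later, so lines with extra text after the phrase survive:

```python
import re

# Sample text from above, without the illustrative line numbers.
sample_lines = [
    "This is test string for my app. d",
    "This is test string for my app.",
    "This is test string for my app. abcd",
    "This is test string for my app.",
    "This is test string for my app.",
    "This is test string for my app.",
    "This is test string for my app. d",
    "This is test string for my app.",
    "This is test string for my app.",
    "This is another string.",
]
my_text = "\n".join(sample_lines)

# A line (plus its line break) is removed only if the exact same
# line occurs again somewhere later in the text.
pattern = r"^(.*)(?:\r?\n|\r)(?=[\s\S]*^\1$)"
result = re.sub(pattern, "", my_text, flags=re.MULTILINE)

print(result)
# This is test string for my app. abcd
# This is test string for my app. d
# This is test string for my app.
# This is another string.
```

So the last occurrence of each distinct line is kept, but the shared phrase still shows up three times.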

EDIT: A 'phrase' means, let's say, 3 or more words, so the regex should match any duplication that is longer than 2 words. The expected output after the first sub would be:

This is test string for my app. d  //from line 1
This is test string for my app.    //from line 2
abcd                               //from line 3
This is another string.            //from line 10

Thanks in advance!

G43beli
  • Is `This is test string for my app. abcd` also a duplicate? – anubhava Mar 02 '21 at 10:44
  • Do you mean you want to identify dupe lines up to the first period on a line? `^([^\n\r.]*)\..*(?:\r?\n|\r)(?=[\s\S]*^\1\..*$)`? See [demo](https://regex101.com/r/Frn10D/1). (Or, if the dot with the rest of the line is optional, `^([^\n\r.]*)(?:\..*)?(?:\r?\n|\r)(?=[\s\S]*^\1(?:\..*)?$)`) – Wiktor Stribiżew Mar 02 '21 at 10:45
  • You could match until the first dot and capture all after it in group 2 if a dot there is allowed. Then repeat all following lines that start with group 1, and replace with group 1 and 2. `^([^.\r\n]+\.)(.*)(?:\n\1.*)*` https://regex101.com/r/EmzMKm/1 – The fourth bird Mar 02 '21 at 10:51
  • Matching all the lines but last two - [`^([^.\r\n]+\.).*$(?=[\s\S]*\1)`](https://regex101.com/r/OBInqg/1) – Gurmanjot Singh Mar 02 '21 at 11:05
  • @anubhava only the duplicate phrase in it: "This is test string for my app." abcd can stay. Just duplicate phrases inside of this string. No matter if there is a line break or a period at the end – G43beli Mar 02 '21 at 13:12
  • Ok, try `re.sub(r'^(([^\n\r.]*).*)(?:(?:\r?\n|\r)\2.*)*', r'\1', my_text, flags=re.M)`, see [this regex demo](https://regex101.com/r/Frn10D/2). – Wiktor Stribiżew Mar 02 '21 at 13:29
  • @G43beli: Can you please show your expected output in question? – anubhava Mar 02 '21 at 15:24
  • What do you mean by `abcd can stay` ? What defines a phrase, is a single word on a line also ok? – The fourth bird Mar 02 '21 at 20:25
  • @Thefourthbird good question. I was not sure myself. I would say a phrase is 3 or more words. So if there is any duplication it should get matched. I will update the question – G43beli Mar 02 '21 at 20:53
  • @WiktorStribiżew that works for my use case! Can you maybe post an answer below so I can accept it? Maybe with a short explanation? – G43beli Mar 02 '21 at 21:00

1 Answer

You can use

re.sub(r'^(([^\n\r.]*).*)(?:(?:\r?\n|\r)\2.*)*', r'\1', my_text, flags=re.M)

See the regex demo.

Details:

  • ^ - start of a line (since the re.M option is used, ^ now matches line start positions)
  • (([^\n\r.]*).*) - Group 1: zero or more chars other than a dot, CR and LF captured into Group 2, and then the rest of the line
  • (?:(?:\r?\n|\r)\2.*)* - zero or more sequences of
    • (?:\r?\n|\r) - a CRLF, CR or LF line ending
    • \2 - same text as in Group 2
    • .* - the rest of the line.

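The replacement is the Group 1 value, so each run of consecutive lines that start with the Group 2 text collapses to the first line of that run.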
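As a quick check, here is a minimal sketch that runs this substitution on the sample text from the question (the names `sample_lines` and `deduped` are only illustrative):

```python
import re

# Sample text from the question, without the illustrative line numbers.
sample_lines = [
    "This is test string for my app. d",
    "This is test string for my app.",
    "This is test string for my app. abcd",
    "This is test string for my app.",
    "This is test string for my app.",
    "This is test string for my app.",
    "This is test string for my app. d",
    "This is test string for my app.",
    "This is test string for my app.",
    "This is another string.",
]
my_text = "\n".join(sample_lines)

# Group 2 is the line text up to the first dot; every directly
# following line that starts with the same text joins the match.
pattern = r'^(([^\n\r.]*).*)(?:(?:\r?\n|\r)\2.*)*'
deduped = re.sub(pattern, r'\1', my_text, flags=re.M)

print(deduped)
# This is test string for my app. d
# This is another string.
```

Note that only consecutive lines are merged, and anything after the first dot on the merged lines (such as `abcd` on line 3) is dropped along with them.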

Wiktor Stribiżew