
My goal is to delete all matches from an input using a regular expression with Java 7:

input.replaceAll([regex], "");

Given this example input with a target string abc-:

<TAG>test-test-abc-abc-test-abc-test-</TAG>test-abc-test-abc-<TAG>test-abc-test-abc-abc-</TAG>

What regex could I use in the code above to match abc- only when it is between the <TAG> and </TAG> delimiters? Here is the desired matching behaviour, with <--> for a match:

               <--><-->     <-->                                       <-->     <--><-->
<TAG>test-test-abc-abc-test-abc-test-</TAG>test-abc-test-abc-<TAG>test-abc-test-abc-abc-</TAG>

Expected result:

<TAG>test-test-test-test-</TAG>test-abc-test-abc-<TAG>test-test-</TAG>

The left and right delimiters are always different. I am not particularly looking for a recursive solution (nested delimiters).

I think this might be doable with lookaheads and/or lookbehinds but I didn't get anywhere with them.

alecigne
  • What programming language are you using? – MonkeyZeus Jan 26 '21 at 19:19
  • I edited my post to specify my end goal which is to remove all occurrences of a string using Java 7. – alecigne Jan 26 '21 at 19:25
  • You're not gonna believe this but there exist XML/HTML parsers in Java which will let you achieve your goal more easily using XPath – MonkeyZeus Jan 26 '21 at 19:27
  • It will be as simple as `//tag[text()[contains(.,'abc-')]]` – MonkeyZeus Jan 26 '21 at 19:28
  • @MonkeyZeus I wanted to use a parser (`jsoup`) at the beginning. However, without going into too much detail, I am working on a legacy Java application and I can't introduce a parser at the moment. – alecigne Jan 26 '21 at 22:00

1 Answer


You can use a regex like

(?s)(\G(?!^)|<TAG>(?=.*?</TAG>))((?:(?!<TAG>|</TAG>).)*?)abc-

See the regex demo. Replace each match with $1$2, which puts back everything that was matched except abc-. Details:

  • (?s) - a Pattern.DOTALL embedded flag option
  • (\G(?!^)|<TAG>(?=.*?</TAG>)) - Group 1 ($1): either of the two:
    • \G(?!^) - the end of the previous successful match ((?!^) excludes the start of the string, where \G would also match)
    • | - or
    • <TAG>(?=.*?</TAG>) - <TAG> that is followed by any zero or more chars, as few as possible, and then </TAG> (thus, we make sure the closing, right-hand delimiter actually exists further in the string)
  • ((?:(?!<TAG>|</TAG>).)*?) - Group 2 ($2): any one char (.), zero or more repetitions but as few as possible (*?), where no char starts a <TAG> or </TAG> char sequence (aka a tempered greedy token)
  • abc- - the pattern to be removed, abc-.
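
To make the role of the two groups concrete, here is a small sketch that lists what Group 1 and Group 2 capture on each successive match. The class name MatchTrace and the method trace are made up for illustration; only the pattern comes from the answer. It sticks to Java 7 syntax.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original answer.
final class MatchTrace {
    static final Pattern P = Pattern.compile(
        "(?s)(\\G(?!^)|<TAG>(?=.*?</TAG>))((?:(?!<TAG>|</TAG>).)*?)abc-");

    // Returns one "[group1|group2]" entry per match, in match order.
    static List<String> trace(String input) {
        List<String> out = new ArrayList<String>();
        Matcher m = P.matcher(input);
        while (m.find()) {
            out.add("[" + m.group(1) + "|" + m.group(2) + "]");
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "<TAG>test-test-abc-abc-test-abc-test-</TAG>test-abc-";
        for (String entry : trace(input)) {
            System.out.println(entry);
        }
    }
}
```

The first match is anchored by <TAG>, so Group 1 holds the opening delimiter. The following matches start at \G, so Group 1 is empty and Group 2 holds only the text between the previous abc- and the next one. The abc- after </TAG> is never reached because Group 2 cannot cross a delimiter.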

In Java:

String pattern = "(?s)(\\G(?!^)|<TAG>(?=.*?</TAG>))((?:(?!<TAG>|</TAG>).)*?)abc-";
String result = text.replaceAll(pattern, "$1$2");
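
For a self-contained check, the two lines above can be wrapped in a small class and run against the example input from the question. The class name TagScopedRemoval and the method removeInsideTags are placeholders chosen for this sketch; it uses only String.replaceAll, so it should compile on Java 7.

```java
// Hypothetical wrapper around the snippet above, for illustration only.
final class TagScopedRemoval {
    private static final String PATTERN =
        "(?s)(\\G(?!^)|<TAG>(?=.*?</TAG>))((?:(?!<TAG>|</TAG>).)*?)abc-";

    // Removes every "abc-" that sits between <TAG> and a later </TAG>.
    static String removeInsideTags(String text) {
        // $1$2 re-inserts the delimiter (if any) and the text before abc-.
        return text.replaceAll(PATTERN, "$1$2");
    }

    public static void main(String[] args) {
        String input = "<TAG>test-test-abc-abc-test-abc-test-</TAG>"
                     + "test-abc-test-abc-"
                     + "<TAG>test-abc-test-abc-abc-</TAG>";
        System.out.println(removeInsideTags(input));
    }
}
```

On the question's input this should print the expected result given in the question, with the abc- occurrences outside the delimiters left untouched. An opening <TAG> with no closing </TAG> after it is left alone, because the lookahead in Group 1 fails.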
Wiktor Stribiżew