What is this Java statement containing regex supposed to do?

Question

I am working on some legacy Java code and I see this statement:

Pattern lineWithCommentP2 = Pattern.compile("//(.[^<>]+?)(\\R|$)", Pattern.CASE_INSENSITIVE);
Matcher m = lineWithCommentP2.matcher(s);
s = m.replaceAll("<span class=\"cip\">//$1</span>$2");

As per the comment in the code, it is supposed to replace any line of text in the format

text1//text2
text3//text4

with

text1<span class="cip">//text2</span>
text3<span class="cip">//text4</span>

However, while testing it, I see that it is replacing the original line with

text1<span class="cip">//text2
</span>
text3<span class="cip">//text4
</span>

(It is adding a new line after text2 and text4).

I have not been able to tweak the regex to avoid that extra line break. Any idea why this happens and how I can fix it?

Thank you.

Added: To reproduce, create a text file with this data:

<p>test statement </p>
<pre class="code">public class TestClass{   
   public static void main(String[] args){
       statement1; //1
       stement2(); //2
   }
}
</pre>
<p>test stmt</p>

Then run the following code:

  byte[] ba = Files.readAllBytes(Paths.get("c:\\temp\\test.txt"));
  String s = new String(ba);
  Pattern lineWithCommentP2 = Pattern.compile("//(.[^<>]+?)(\\R|$)", Pattern.CASE_INSENSITIVE);
  Matcher m = lineWithCommentP2.matcher(s);
  s = m.replaceAll("<span class=\"cip\">//$1</span>$2");
  Files.write(Paths.get("c:\\temp\\test2.txt"), s.getBytes(), StandardOpenOption.CREATE);

This generates the following content in test2.txt:

<p>test statement </p>
<pre class="code">public class TestClass{   
   public static void main(String[] args){
       statement1; <span class="cip">//1
</span>
       stement2(); <span class="cip">//2
</span>
   }
}
</pre>
<p>test stmt</p>

What happens when you remove the `$2` in the `replaceAll` call? — Felix Jassler, Jun 25 '20 at 10:21
One of the best places to understand a particular regex is [Regex101](https://regex101.com/r/SLwU7c/1) — Arvind Kumar Avinash, Jun 25 '20 at 10:22
Removing $2 removes the new line after `</span>`. I want to remove the newline that it added on its own before `</span>`. — Priyshrm, Jun 25 '20 at 10:25
@Priyshrm what is the line separator inserted before `</span>`? `\r`, `\n`, `\r\n`? — Vladimir Shefer, Jun 25 '20 at 10:27
@VladimirShefer Ah, my bad. I couldn't reproduce this issue either. You could try adding a `\n` inside the square brackets of `Pattern.compile` (for example `"//(.[^<>\n]+?)(\\R|$)"`) — Felix Jassler, Jun 25 '20 at 10:28
Can't repro either, but as a final check, try to exclude any Unicode line breaks, `Pattern lineWithCommentP2 = Pattern.compile("//([^<>\n\\u000B\f\r\\u0085\\u2028\\u2029]+)(\\R|$)");` — Wiktor Stribiżew, Jun 25 '20 at 10:30
Adding \n didn't work. I have added exact code to reproduce this issue in the original post. thanks a lot for your time. Sincerely appreciated. — Priyshrm, Jun 25 '20 at 10:55

score 2 · Accepted Answer · answered Jun 25 '20 at 11:24

The regex is as follows:

//            Match '//'
(             Start capture group 1
  .             Match any character, except linebreaks
  [^<>]+?       Match any character, except `<` and `>`, one or more times, reluctantly
)             End capture group 1
(             Start capture group 2
  \\R           Match linebreak, e.g. `\r`, `\n`, or `\r\n`
  |             OR
  $             Match end of input
)             End capture group 2

You have the following text:

...\r\n
       statement1; //1\r\n
       stement2(); //2\r\n
...

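Capture group 1 is one character (`.`) followed by one or more characters (`[^<>]+?`), so it always matches at least 2 characters. Because the quantifier is reluctant, it stops matching as soon as the rest of the pattern can be satisfied.

Since `[^<>]` does not exclude linebreak characters, the mandatory second character is the `\r`, and the rest of the pattern is then satisfied immediately by the remaining `\n`, so you get: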

Group 0: "//1\r\n"
Group 1: "1\r", with . matching "1" and [^<>]+? matching "\r"
Group 2: "\n", with \\R matching "\n"
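
As a rough check of the breakdown above, a small sketch like the following shows what the original pattern captures for a line ending in `\r\n`. The class name `GroupDemo` and the helper `firstMatchGroups` are made up for illustration, and the sample string is a shortened version of the text from the question.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    // Original pattern from the question (flag omitted, it has no effect here).
    static final Pattern ORIGINAL = Pattern.compile("//(.[^<>]+?)(\\R|$)");

    // Hypothetical helper: returns groups 1 and 2 of the first match,
    // or null if the pattern does not match at all.
    static String[] firstMatchGroups(String input) {
        Matcher m = ORIGINAL.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    // Makes line break characters visible when printed.
    static String visible(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        String[] g = firstMatchGroups("statement1; //1\r\nstement2(); //2\r\n");
        System.out.println("group 1 = " + visible(g[0]));
        System.out.println("group 2 = " + visible(g[1]));
    }
}
```

With that input, this should print `group 1 = 1\r` and `group 2 = \n`, which is why the `\r` ends up inside the span and the `\n` after it.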

Solution

To fix it, remove the `.` and make sure group 1 cannot match linebreak characters, by adding `\v` (vertical whitespace) to the list of excluded characters:

"//([^<>\\v]+?)(\\R|$)"

FYI: Since the regex contains no letters, the CASE_INSENSITIVE flag has no effect and is misleading, so get rid of it.
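
Putting the solution together, a minimal sketch of the corrected replacement might look like this. It assumes Java 8 or later, where `\v` means vertical whitespace; the class name `FixedCommentHighlighter` and the method `highlight` are hypothetical names, not part of the original code.

```java
import java.util.regex.Pattern;

public class FixedCommentHighlighter {
    // Fixed pattern: no leading '.', and group 1 excludes '<', '>' and
    // all vertical whitespace, so it can no longer swallow the '\r'.
    static final Pattern FIXED = Pattern.compile("//([^<>\\v]+?)(\\R|$)");

    // Hypothetical helper: wraps each "//comment" in a span and puts the
    // original line break (group 2) back after the closing tag.
    static String highlight(String s) {
        return FIXED.matcher(s).replaceAll("<span class=\"cip\">//$1</span>$2");
    }

    public static void main(String[] args) {
        String input = "statement1; //1\r\nstement2(); //2\r\n";
        System.out.print(highlight(input));
    }
}
```

For the sample input, each comment should come out as `<span class="cip">//1</span>` followed by the untouched `\r\n`, with no line break inside the span.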
