6

I am learning about Java regexes, and I noticed the following operator:

\\1*

I'm having a hard time figuring out what it means (searching the web didn't help). For example, what is the difference between these two options:

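    Pattern p1 = Pattern.compile("(a)\\1*"); // option1
    Pattern p2 = Pattern.compile("(a)"); // option2

    Matcher m1 = p1.matcher("a");
    Matcher m2 = p2.matcher("a");

    // find() must succeed before group() is called; otherwise group() throws IllegalStateException
    m1.find();
    m2.find();

    System.out.println(m1.group(0));
    System.out.println(m2.group(0));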

Result:

a
a

Thanks!

dimo414
Friedman

3 Answers

6

\\1 is a back reference to the first capturing group, which is (a) here.

So (a)\\1* is equivalent to (a)a* in this particular case.
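To see why the equivalence holds only "in this particular case", here is a minimal sketch in which the group can capture more than one thing. The class name, the helper `firstMatch`, and the patterns `([ab])\\1*` and `([ab])[ab]*` are made up for illustration and are not from the question. The back reference repeats the text that was captured, not the subpattern:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackrefVsSubpattern {
    // Hypothetical helper: returns the first match of regex in input, or null if none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Group 1 captures "a", so \\1* can only repeat "a" and stops before "b".
        System.out.println(firstMatch("([ab])\\1*", "aab"));  // aa
        // Repeating the subpattern instead accepts either letter each time.
        System.out.println(firstMatch("([ab])[ab]*", "aab")); // aab
    }
}
```

With a group like (a) that can only ever capture "a", the two forms behave the same, which is why (a)\\1* and (a)a* match the same strings.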

Here is an example that shows the difference:

Pattern p1 = Pattern.compile("(a)\\1*");
Pattern p2 = Pattern.compile("(a)");

Matcher m1 = p1.matcher("aa");
Matcher m2 = p2.matcher("aa");

m1.find();
System.out.println(m1.group());
m2.find();
System.out.println(m2.group());

Output:

aa
a

As you can see, when the input contains several consecutive a characters, the first regular expression matches the whole run, while the second one matches only the first a.
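One detail worth checking yourself: even though the whole match grows, capturing group 1 still holds a single a. A small sketch (the class and the helper `wholeAndGroup1` are hypothetical names used only here):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WholeMatchVsGroup {
    // Hypothetical helper: returns { whole match, group 1 } for the first match, or null if none.
    static String[] wholeAndGroup1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(), m.group(1) };
    }

    public static void main(String[] args) {
        String[] r = wholeAndGroup1("(a)\\1*", "aaa");
        System.out.println(r[0]); // aaa  (group 0: the entire match)
        System.out.println(r[1]); // a    (group 1: only what (a) captured)
    }
}
```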

Nicolas Filotto
3

\\1* looks for a again, 0 or more times. An example that may be easier to understand uses (a)\\1+, which looks for at least two consecutive a's:

Pattern p1 = Pattern.compile("(a)\\1+");
Matcher m1 = p1.matcher("aaaaabbaaabbba");
while (m1.find()) System.out.println(m1.group());

the output will be:

aaaaa
aaa

The last a is not matched, because it is not repeated.
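For comparison, here is a sketch that runs both quantifiers over the same input. The class name and the helper `allMatches` are assumptions for illustration; the input string is the one used above:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StarVsPlus {
    // Hypothetical helper: collects every match of regex in input, in order.
    static List<String> allMatches(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "aaaaabbaaabbba";
        // + needs at least one repetition, so the lone trailing a is skipped.
        System.out.println(allMatches("(a)\\1+", input)); // [aaaaa, aaa]
        // * allows zero repetitions, so the lone trailing a is matched too.
        System.out.println(allMatches("(a)\\1*", input)); // [aaaaa, aaa, a]
    }
}
```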

assylias
1

In Perl, \1 through \9 are always interpreted as back references; a backslash-escaped number greater than 9 is treated as a back reference if at least that many subexpressions exist, otherwise it is interpreted, if possible, as an octal escape. In this class octal escapes must always begin with a zero. In this class, \1 through \9 are always interpreted as back references, and a larger number is accepted as a back reference if at least that many subexpressions exist at that point in the regular expression, otherwise the parser will drop digits until the number is smaller or equal to the existing number of groups or it is one digit.

From the Pattern docs.
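The "drop digits" rule in that quote can be tried out with a small sketch. The pattern `(a)\\11` and the names below are my own illustration, not from the docs. With only one group defined, \11 cannot refer to group 11, so it should be read as \1 followed by a literal 1:

```java
import java.util.regex.Pattern;

class BackrefDigits {
    // Hypothetical helper: true if the whole input matches regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Read as (a), then back reference \1, then the literal character '1'.
        System.out.println(fullMatch("(a)\\11", "aa1")); // true
        // Without the trailing literal '1' the input does not match.
        System.out.println(fullMatch("(a)\\11", "aa"));  // false
    }
}
```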

So it looks like p2 is only good for a single "a", while p1 is good for any number of "a" as long as there is at least one. The star is the quantifier X*, which the docs describe as "X, zero or more times". It is called a Kleene star.
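A quick way to check this is to require the whole input to match, using matches() instead of find(). This is only a sketch; the class name and the helper `fullMatch` are made up for the example:

```java
import java.util.regex.Pattern;

class FullMatchDemo {
    // Hypothetical helper: true only if the entire input matches regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] { "", "a", "aa", "aaa" }) {
            // p2-style pattern accepts exactly one a; p1-style accepts one or more.
            System.out.println("\"" + s + "\": (a) -> " + fullMatch("(a)", s)
                    + ", (a)\\1* -> " + fullMatch("(a)\\1*", s));
        }
    }
}
```

Only "a" matches (a) in full, while "a", "aa" and "aaa" all match (a)\\1*; the empty string matches neither.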

Imposter