27

I've identified some unexpected behavior in Java's regex implementation. When using java.util.regex.Pattern and java.util.regex.Matcher, the following regular expression does not match correctly on the input "Merlot" when using Matcher's find() method:

((?:White )?Zinfandel|Merlot)

If I change the order of the expressions inside the outermost matching group, Matcher's find() method does match.

(Merlot|(?:White )?Zinfandel)

Here is some test code that illustrates the problem.

RegexTest.java

import java.util.regex.*;

public class RegexTest {
    public static void main(String[] args) {
        Pattern pattern1 = Pattern.compile("((?:White )?Zinfandel|Merlot)");
        Matcher matcher1 = pattern1.matcher("Merlot");
        // prints "No Match :("
        if (matcher1.find()) {
            System.out.println(matcher1.group(0));
        } else {
            System.out.println("No match :(");
        }

        Pattern pattern2 = Pattern.compile("(Merlot|(?:White )?Zinfandel)");
        Matcher matcher2 = pattern2.matcher("Merlot");
        // prints "Merlot"
        if (matcher2.find()) {
            System.out.println(matcher2.group(0));
        } else {
            System.out.println("No match :(");
        }
    }
}

The expected output is:

Merlot
Merlot

But the actual output is:

No Match :(
Merlot

I've verified this unexpected behavior exists in Java version 1.7.0_11 on Ubuntu linux and also Java version 1.6.0_37 on OSX 10.8.2. I reported this behavior as a bug to Oracle yesterday and got back an automated email telling me my bug report has been received and has an internal review ID of 2441589. I can't find my bug report when I search for that id in their bug database. (Can you hear the crickets?)

Have I uncovered a bug in Java's presumably thoroughly tested and used regex implementation (hard to believe in 2013), or am I doing something wrong?

Asaph
  • 147,774
  • 24
  • 184
  • 187

4 Answers4

8

The following:

import java.util.regex.*;

public class T {
  public static void main( String args[] ) {
    System.out.println( Pattern.compile("(a)?bb|c").matcher("c").find() );
    System.out.println( Pattern.compile("(a)?b|c").matcher("c").find() );
  }
}

prints

false
true

on:

  • JDK 1.7.0_13
  • JDK 1.6.0_24

The following:

import java.util.regex.*;

public class T {
  public static void main( String args[] ) {
    System.out.println( Pattern.compile("((a)?bb)|c").matcher("c").find() );
    System.out.println( Pattern.compile("((a)?b)|c").matcher("c").find() );
  }
}

prints:

true
true
Dave Jarvis
  • 28,853
  • 37
  • 164
  • 291
Mikhail Vladimirov
  • 12,571
  • 1
  • 31
  • 34
  • That seems to confirm that this is a real bug. I would expect it to output `true true`. – Asaph Feb 05 '13 at 17:57
  • 1
    Pattern.compile("(a)?bb|c").matcher("c").matches() is true – jdb Feb 05 '13 at 18:14
  • @jdb: As you stated above, `matches()` works as expected. `find()` is where the problem is. Unfortunately, if you need to extract matched text, you must use `find()`. – Asaph Feb 05 '13 at 18:17
  • I mean it looks like a bug. – jdb Feb 05 '13 at 18:27
  • 1
    Here you go: Pattern.compile("((?:White )?Zin|Merlot)").matcher("Merlot").find() is true – jdb Feb 05 '13 at 18:30
4

It seems to be fixed in Java 1.8.

Welcome to Scala version 2.11.0-20130930-063927-2bba779702 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0-ea).
Type in expressions to have them evaluated.
Type :help for more information.

scala> import java.util.regex._
import java.util.regex._

scala> Pattern.compile("((?:White )?Zinfandel|Merlot)")
res0: java.util.regex.Pattern = ((?:White )?Zinfandel|Merlot)

scala> .matcher("Merlot")
res1: java.util.regex.Matcher = java.util.regex.Matcher[pattern=((?:White )?Zinfandel|Merlot) region=0,6 lastmatch=]

scala> .find()
res2: Boolean = true
som-snytt
  • 38,672
  • 2
  • 41
  • 120
2

I don't understand all that's going on, but I've been playing with your example to try to extract some diagnostic information you might be able to add to your bug report.

First, if you use a possessive quantifier, it works but I don't know why:

Pattern pattern1 = Pattern.compile("((?:White )?+Zinfandel|Merlot)");

Also, if the first group in the choice is shorter than the second one, then it works either way:

Pattern pattern1 = Pattern.compile("((?:White )?Zinf|Merlot)");

Like I said, I don't really understand how this could be. None of these two findings make any sense to me but just thought I'd share...

mprivat
  • 20,572
  • 4
  • 51
  • 62
  • 3
    In my test, it appears that the length of "Zinf"/"Zinfandel" is relevant to the size of the pattern space. If you run the full, original pattern against "Merlot blahblah", it maches for me. Perhaps reaching the end of pattern space causes it to prematurely exit. – cmonkey Feb 05 '13 at 18:59
2

The bug was obviously fixed in Java 8, and was resolved 'Won't Fix' as a backport to Java 7. However, as a workaround you can either use an independent (atomic) grouping for "White" or you can isolate the test case for "White Zinfandel" by wrapping it into a separate alternating test group.

In your example have a non-capturing group inside the first capturing group with the following.

Non-capturing group modifier (?:White)

((?:White )?Zinfandel|Merlot)

As a work around using an independent capturing group will succeed.

Independent non-capturing group modifier (?>White)

((?>White )?Zinfandel|Merlot)

Recreating the test case for either an independent non-capturing group or group alternation in Java 1.7.0_71 works.

java version "1.7.0_71"
Java(TM) SE Runtime Environment (build 1.7.0_71-b14)
Java HotSpot(TM) 64-Bit Server VM (build 24.71-b01, mixed mode)

Independent Non-Capturing Group Or Group Alternation

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTest {

    public static void main( String[] args ) {

        Pattern independentNCG = Pattern.compile( "((?>White )?Zinfandel|Merlot)" );
        Matcher independentNCGMatcher = independentNCG.matcher( "Merlot" );

        Pattern alternateGroupPattern = Pattern.compile( "(((?:White )?Zinfandel)|Merlot)" );
        Matcher alternateGroupMatcher = alternateGroupPattern.matcher( "Merlot" );

        System.out.println( independentNCGMatcher.find() ? independentNCGMatcher.group( 0 ) : "No match found for Merlot" );
        System.out.println( alternateGroupMatcher.find() ? alternateGroupMatcher.group( 0 ) : "No match found for Merlot" );

    }
}

RETURNS

Merlot
Merlot
nhahtdh
  • 52,949
  • 15
  • 113
  • 149
Edward J Beckett
  • 4,701
  • 1
  • 38
  • 39
  • 1
    I think you need to remove the word "named" everywhere it appears in this answer. – Asaph Dec 19 '14 at 22:51
  • 1
    relevant link: [What is a regex “independent capturing group”?](http://stackoverflow.com/q/50524/1048572). Not that I understand *why* it would work, and how the original regex's behaviour would be expected though. – Bergi Dec 20 '14 at 19:01