
I have the following regular expression, which should match text that starts with any characters and ends with text in parentheses, e.g. "Hi (Stackoverflow)".

When I try to match the second text below, the program just keeps running.

String pattern = "^[a-zA-Z]+([\\s]*[\\w]*)*\\([\\w]+\\)";
String text = "Asdadasdasd sadsdsad sdasd (s)";
String text2 = "Asdadasdasd sadsdsad sdasd (s) sdsd";

System.out.println(text.matches(pattern));  // it works
System.out.println(text2.matches(pattern)); // never-ending story

What is wrong?

Bernhard Barker
Luke
  • add `\\s*` before the escaped opening round bracket. – Casimir et Hippolyte Jan 12 '18 at 09:49
  • Also, remove all the useless square brackets around character class shorthands and change `([\\s]*[\\w]*)*` to `(?>\\s+\\w+)*` or better `(?:\\s+\\w+)*+` (the idea is to be more restrictive to avoid catastrophic backtracking). – Casimir et Hippolyte Jan 12 '18 at 09:52
  • What is the expected result for `Asdadasdasd sadsdsad sdasd (s) sdsd`? No match or `Asdadasdasd sadsdsad sdasd (s)`? If no match, you need to use `matches("[a-zA-Z]+(?:\\s+\\w+)*+\\(\\w+\\)")`. – Wiktor Stribiżew Jan 12 '18 at 10:17
  • What does `?:` mean? However, it doesn't work. When I enter ^[a-zA-Z]+(\s*\w*)*\(\w+\), I can match text that does not end with a right bracket. I tried adding $ at the end of the regex and it works now. But it is very strange... – Luke Jan 14 '18 at 18:43
  • I need the text to be only in this format "Any string ending with text inside brackets. Nothing else." – Luke Jan 14 '18 at 18:46
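
To make the comments above concrete, here is a minimal sketch that combines two of the suggestions (the possessive `(?:\\s+\\w+)*+` group plus a `\\s*` before the opening bracket). The combined pattern and the class and method names are assumptions made for illustration, not something posted in the thread. It also shows one possible reason for needing `$`: `matches()` requires the whole input to match, while `find()` accepts a match anywhere in the input.

```java
import java.util.regex.Pattern;

final class CommentRegexDemo {

    // Assumed combination of the two suggestions from the comments
    private static final Pattern FORMAT =
            Pattern.compile("[a-zA-Z]+(?:\\s+\\w+)*+\\s*\\(\\w+\\)");

    // True only if the entire input has the expected format
    static boolean wholeMatch(String input) {
        return FORMAT.matcher(input).matches();
    }

    // True if some part of the input has the expected format
    static boolean partialMatch(String input) {
        return FORMAT.matcher(input).find();
    }

    public static void main(String[] args) {
        String text = "Asdadasdasd sadsdsad sdasd (s)";
        String text2 = "Asdadasdasd sadsdsad sdasd (s) sdsd";

        System.out.println(wholeMatch(text));    // true
        System.out.println(wholeMatch(text2));   // false
        System.out.println(partialMatch(text2)); // true
    }
}
```

If the regex is run in a way that only searches for a match (as `find()` does), a trailing `$` is what rejects input such as text2.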

2 Answers


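The second one takes a long time (or at least can, depending on the implementation) because of the nested *'s in your regex: the group (\s*\w*)* repeats a pattern that can itself match in many different ways, including matching nothing.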

Your regex starts off trying to match like this:

[a-zA-Z]+   \s* \w*      \s* \w*   \s* \w* \( \w+ \) [unmatched]
Asdadasdasd     sadsdsad     sdasd     X   (  s   )  sdsd

At this point you might expect it to say "okay, doesn't match, we're done".

But this is not what it does.

Instead, it will backtrack in an attempt to find a match that would work (since it's not all that easy for a computer to figure out that backtracking will be a waste of time in this case).

Where it previously matched the second \w* to sdasd, it will now try one character fewer, i.e. sdas, and then it will add another \s*\w*, which will match 0 characters for \s* and d for \w*.

[a-zA-Z]+   \s* \w*      \s* \w*  \s* \w* \s* \w* \( \w+ \) [unmatched]
Asdadasdasd     sadsdsad     sdas X   d       X   (  s   )  sdsd

This won't work either, so it will instead try sda followed by sd. That fails too, which leads it to split things up further into sda, s and d.

[a-zA-Z]+   \s* \w*      \s* \w*  \s* \w* \s* \w* \( \w+ \) [unmatched]
Asdadasdasd     sadsdsad     sda  X   sd      X   (  s   )  sdsd

[a-zA-Z]+   \s* \w*      \s* \w*  \s* \w* \s* \w* \s* \w* \( \w+ \) [unmatched]
Asdadasdasd     sadsdsad     sda  X   s   X   d       X   (  s   )  sdsd

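And so on, until each \w* is matching just one character. The same splitting is then tried for the other words as well, so the number of attempts grows very quickly with the length of the input.

PS: The above is not necessarily exactly what it does; it's more intended to give a basic idea of what happens.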

PPS: Used \ instead of \\ for brevity.
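
As a rough way to see why this blows up, the sketch below counts in how many ways a run of n word characters can be cut into consecutive non-empty chunks, which is roughly what repeating \w* inside the group allows. This is a simplified model with hypothetical names, not what the regex engine actually executes.

```java
final class SplitCount {

    // Number of ways to cut a run of n word characters into
    // consecutive non-empty chunks (simplified model of the backtracking)
    static long countSplits(int n) {
        if (n == 0) {
            return 1; // nothing left to split: exactly one way
        }
        long total = 0;
        // Choose the length of the first chunk, then split the rest
        for (int first = 1; first <= n; first++) {
            total += countSplits(n - first);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(countSplits(5)); // sdasd: 16
        System.out.println(countSplits(8)); // sadsdsad: 128
    }
}
```

In this model the count doubles with every extra character (2^(n-1) splits for n characters), and the counts for separate words multiply.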

How do you fix it?

There are a few ways to fix it.

The fix requiring the fewest changes is perhaps to use (\\s*\\w*)*+ instead: *+ makes the * possessive, which prevents the engine from backtracking into the group at all (which is what we want here).

^[a-zA-Z]+(\\s*\\w*)*+\\(\\w+\\)
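
A minimal sketch of the possessive version applied to the two strings from the question (the class and method names are hypothetical):

```java
import java.util.regex.Pattern;

final class PossessiveDemo {

    // Regex from above; *+ keeps the engine from backtracking into the group
    private static final Pattern FORMAT =
            Pattern.compile("^[a-zA-Z]+(\\s*\\w*)*+\\(\\w+\\)");

    // True only if the entire input matches the pattern
    static boolean isValid(String input) {
        return FORMAT.matcher(input).matches();
    }

    public static void main(String[] args) {
        String text = "Asdadasdasd sadsdsad sdasd (s)";
        String text2 = "Asdadasdasd sadsdsad sdasd (s) sdsd";

        System.out.println(isValid(text));  // true
        System.out.println(isValid(text2)); // false, and it returns quickly
    }
}
```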

What would also work is to use \\s+ instead of \\s*, although this leads to slightly different behaviour: digits and underscores can no longer appear before the first space. That can be fixed by adding \\w* before the group.

This fixes it because we can no longer match 0 characters for \\s, which prevents a lot of work we would've otherwise done while backtracking.

   ^[a-zA-Z]+(\\s+\\w*)*\\(\\w+\\)
OR ^[a-zA-Z]+\\w*(\\s+\\w*)*\\(\\w+\\)
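
To illustrate the difference between these two variants, here is a small sketch (hypothetical class and method names; the input "Abc1 def (x)" is made up for the example):

```java
import java.util.regex.Pattern;

final class WhitespaceDemo {

    // First variant: only letters are allowed before the first space
    private static final Pattern STRICT =
            Pattern.compile("^[a-zA-Z]+(\\s+\\w*)*\\(\\w+\\)");

    // Second variant: the extra \w* allows digits and underscores there too
    private static final Pattern RELAXED =
            Pattern.compile("^[a-zA-Z]+\\w*(\\s+\\w*)*\\(\\w+\\)");

    static boolean strictMatch(String input) {
        return STRICT.matcher(input).matches();
    }

    static boolean relaxedMatch(String input) {
        return RELAXED.matcher(input).matches();
    }

    public static void main(String[] args) {
        String text2 = "Asdadasdasd sadsdsad sdasd (s) sdsd";

        System.out.println(strictMatch(text2));           // false
        System.out.println(relaxedMatch(text2));          // false
        System.out.println(strictMatch("Abc1 def (x)"));  // false: digit before the first space
        System.out.println(relaxedMatch("Abc1 def (x)")); // true
    }
}
```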

I'd also recommend removing the + from the [a-zA-Z] wherever it is directly followed by \\w* (the possessive version and the second \\s+ version), since the additional letters are already covered by the \\w*. This doesn't change what the regex matches and (in my opinion) makes the desired behaviour of the regex clearer when looking at it.

PS: [\\s]* is equivalent to \\s*.

Bernhard Barker
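import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexFindDemo {

    // Same regex as in the question, but without the leading ^ anchor
    private static final Pattern pattern = Pattern.compile("[a-zA-Z]+([\\s]*[\\w]*)*\\([\\w]+\\)");

    public static void main(String[] args) {
        String text = "Asdadasdasd sadsdsad sdasd (s)";
        String text2 = "Asdadasdasd sadsdsad sdasd (s) sdsd (k) ssdd";

        match(text);
        match(text2);
    }

    // Prints every part of the text that the pattern matches
    private static void match(String text) {
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            System.out.println(matcher.group(0));
        }
    }
}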

and the output is:

Asdadasdasd sadsdsad sdasd (s)
Asdadasdasd sadsdsad sdasd (s)
sdsd (k)
dbl