2

I want to match strings enclosed in triple "-quotes which may contain line breaks, and which don't contain any """-substrings except at the very beginning and in the very end.

Valid example:

"""foo
bar "baz" blah"""

Invalid example:

"""foo bar """ baz"""

I tried using the following regex (as Java String literal):

"(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*\"\"\""

and it seems to work on short examples. However, on longer examples, like on a string consisting of thousand lines with hello world, it gives me a StackOverflowError.

Scala snippet to reproduce the error

import java.util.regex.{Pattern, Matcher}

val text = "\"" * 3 + "hello world \n" * 1000 + "\"" * 3
val p = Pattern.compile("(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*\"\"\"")
println(p.matcher("\"\"\" foo bar baz \n baz bar foo \"\"\"").lookingAt())
println(p.matcher(text).lookingAt())

(note: test locally, Scastie times out; or maybe reduce 1000 to smaller number?).

Java snippet that produces the same error

import java.util.regex.Pattern;
import java.util.regex.Matcher;

class RegexOverflowMain {
  public static void main(String[] args) {
    StringBuilder bldr = new StringBuilder();
    bldr.append("\"\"\"");
    for (int i = 0; i < 1000; i++) {
      bldr.append("hello world \n");
    }
    bldr.append("\"\"\"");
    String text = bldr.toString();
    Pattern p = Pattern.compile("(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*\"\"\"");
    System.out.println(p.matcher("\"\"\" foo bar baz \n baz bar foo \"\"\"").lookingAt());
    System.out.println(p.matcher(text).lookingAt());
  }
}

Question

Any idea how to make this "stack safe", i.e. can someone find a regex that accepts the same language, but does not produce a StackOverflowError when fed to the Java regex API?

I don't care whether the solution is in Scala or Java (or whatever), as long the same underlying Java regex library is used.

Andrey Tyukin
  • 38,712
  • 4
  • 38
  • 75
  • This here definitely looks somewhat useful: [why `(A|B)*` results in recursive calls](https://stackoverflow.com/a/7510006/2707792). Note however that for example `"(?m)\"\"\"(?:\"?\"?[^\"])*\"\"\""` suffers from the same errors, even though it doesn't contain alternatives `|`. – Andrey Tyukin Jun 05 '18 at 23:04
  • Must the `"""` be at the start and the end of the string? And why not just write it manually? It would just take a few lines more and would be much faster – x4rf41 Jun 05 '18 at 23:17
  • @x4rf41 Yes, three double-quotes, followed by whatever characters, as long as there are no triple-double-quotes, then again three double-quotes in the very end. I'd prefer to have a regex that does this, because it would fit nicely into the existing framework. Otherwise, I'd have to invent some "generalized regular language accepting interface" -- this is what I want to avoid, if possible. – Andrey Tyukin Jun 05 '18 at 23:19
  • ... and maybe I should explicitly exclude the corner-cases like `"""""""` (seven double-quotes), representing a string literal with a single double quote. Assume that all double quotes inside the string are separated from the starting `"""` and ending `"""` by some other characters. I don't care whether those strange corner-cases are accepted or rejected. – Andrey Tyukin Jun 05 '18 at 23:28
  • try this one `^"""((?!""")[\s\S])*"""$` (without the multiline flag). It works here: https://regex101.com/ with both of your examples as expected. it uses a negative look ahead instead. that should prevent the stack overflow i hope (not sure though). escaped: `^\"\"\"((?!\"\"\")[\\s\\S])*\"\"\"$` – x4rf41 Jun 05 '18 at 23:31

2 Answers2

2

Solution using a negative look-ahead to basically find a string that starts with """ and end with """ and contains content that does not include """

As Plain regex: ^"""((?!""")[\s\S])*"""$

As Java escaped regex: "^\"\"\"((?!\"\"\")[\\s\\S])*\"\"\"$"

\s\S includes line-break (its basically . + line-break or . with single line flag)

This should be used without the multiline flag so that ^ and $ match the start and end of the string and not the start and end of the line

otherwise this:

""" ab """abc""" abc """

would match

Also i used this as reference for how to exclude the """: Regular expression to match a line that doesn't contain a word?

x4rf41
  • 4,783
  • 1
  • 18
  • 30
2

The full answer below optimizes the regex performance, but to prevent stack overflow, as a simple solution, just make the repeating group possessive.

Non-possessive repeating groups with choices need recursive calls to allow backtracking. Making it possessive fixes the problem, so simply add a + after the *:

"(?m)\"\"\"(?:[^\"]|(?:\"[^\"])|(?:\"\"[^\"]))*+\"\"\""


Also note that if you want to match entire input, you need to call matches(), not lookingAt().


Performance boost

Note: A quick performance test showed this to be more than 6 times faster than regex in answer by x4rf41.

Instead of matching one of

  • Not a quote: [^\"]
  • Exactly one quote: (?:\"[^\"])
  • Exactly two quotes: (?:\"\"[^\"])

in a loop, first match everything up to a quote. If that is a single- or double-quote, but not a triple-quote, match the 1-2 quotes then everything up to next quote, repeat as needed. Finally match the ending triple-quote.

That matching is definitive, so make the repeats possessive. This also prevent stack overflow in case input has many embedded quotes.

"{3}          match 3 leading quotes
[^"]*+        match as many non-quotes as possible (if any) {possesive}
(?:           start optional repeating group
   "{1,2}       match 1-2 quotes
   [^"]++       match one or more non-quotes (at least one) {possesive}
)*+           end optional repeating group                  {possesive}
"{3}          match 3 trailing quotes

Since you don't use ^ or $, there is no need for (?m) (MULTILINE)

As Java string:

"\"{3}[^\"]*+(?:\"{1,2}[^\"]++)*+\"{3}"

Andreas
  • 138,167
  • 8
  • 112
  • 195