
I have been trying to study up on regex; however, I cannot seem to grasp what these two regex statements are saying.

private static final Pattern BALANCED_TEXT =
    Pattern.compile("(?s)((?:\\\\.|[^\\\\{}]"
                    + "|[{](?:\\\\.|[^\\\\{}])*[}])*)"
                    + "\\}"
                    + "|.");

private static final Pattern INPUT_PATTERN =
    Pattern.compile("(?s)(\\p{Blank}+)"
                    + "|(\\r?\\n((?:\\r?\\n)+)?)"
                    + "|\\\\([\\p{Blank}{}\\\\])"
                    + "|\\\\(\\p{Alpha}+)([{]?)"
                    + "|((?:[^\\p{Blank}\\r\\n\\\\{}]+))"
                    + "|(.)");

I would appreciate it if someone could explain these two regex statements to me in depth. Thanks in advance!

Jisoo Han
  • According to the [documentation](http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html), `(?s)` is the DOTALL flag. It lets the dot `.` also match line terminators. `\\p{Blank}` is the same as a space or tab, `[ \t]`, so you can replace it in your regex. Also, `\\p{Alpha}` is the same as `[\p{Lower}\p{Upper}]`, which is the same as `[a-zA-Z]`. Now just print `BALANCED_TEXT` and `INPUT_PATTERN` and try them on https://www.debuggex.com/ and http://regex101.com/ to see how they work. Remove `(?s)` there if needed. – Pshemo Oct 21 '13 at 01:54
  • I am confused on what ?: and [^\\\\{}] mean. Also what is the point of the + sign? – Jisoo Han Oct 21 '13 at 02:02
  • You said that you have been trying to study regex :/ `+` means that the element or group before it can repeat one or more times. `(?:xxx)` is a [non-capturing group](http://stackoverflow.com/q/3512471/1393766). `[^abc]` means any character except those inside `[^...]`, so `[^\\\\{}]` means any character except \, {, or }. – Pshemo Oct 21 '13 at 02:08

1 Answer

The whole first regex is:

(?s)((?:\\\\.|[^\\\\{}]|[{](?:\\\\.|[^\\\\{}])*[}])*)\\}|.

First, strip the Java string escapes (in a Java string literal, \\ stands for a single \). That leaves the regex:

(?s)((?:\\.|[^\\{}]|[{](?:\\.|[^\\{}])*[}])*)\}|.
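A quick way to check the unescaping step is to let Java print the compiled pattern. This is a minimal sketch; the class name `ShowPattern` is made up for illustration, and the pattern literal is copied from the question.

```java
import java.util.regex.Pattern;

class ShowPattern {
    // Same literal as in the question. The compiler turns each "\\\\" into two
    // backslash characters, which the regex engine reads as one literal backslash.
    static final Pattern BALANCED_TEXT =
        Pattern.compile("(?s)((?:\\\\.|[^\\\\{}]"
                        + "|[{](?:\\\\.|[^\\\\{}])*[}])*)"
                        + "\\}"
                        + "|.");

    public static void main(String[] args) {
        // Prints the regex as the engine sees it:
        // (?s)((?:\\.|[^\\{}]|[{](?:\\.|[^\\{}])*[}])*)\}|.
        System.out.println(BALANCED_TEXT.pattern());
    }
}
```

The printed text is what you would paste into a regex tester, since those tools expect the regex itself and not a Java string literal.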

The first thing is (?s), the DOTALL flag, which makes . match newlines as well. The second thing to look at is the top-level structure. Since | is the OR operator and has the lowest precedence, the regex reads:

(something)\} OR any single character (the dot)

So it will first try to match something ending with } (since } is a special character in regex, it is escaped with \). The part before the } is captured as group 1 because of the () around it.
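The two top-level branches can be told apart at run time by looking at group 1. The sketch below assumes whole-input matching with `Matcher.matches()`; the class `TopLevelDemo` and its helper `describe` are hypothetical names introduced only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TopLevelDemo {
    static final Pattern BALANCED_TEXT = Pattern.compile(
        "(?s)((?:\\\\.|[^\\\\{}]|[{](?:\\\\.|[^\\\\{}])*[}])*)\\}|.");

    // Hypothetical helper: reports which top-level alternative matched the whole input.
    static String describe(String input) {
        Matcher m = BALANCED_TEXT.matcher(input);
        if (!m.matches()) {
            return "no match";
        }
        // group(1) is null when the "any single character" branch matched.
        return m.group(1) != null ? "group1=[" + m.group(1) + "]" : "single character";
    }

    public static void main(String[] args) {
        System.out.println(describe("abc}")); // group1=[abc]
        System.out.println(describe("}"));    // group1=[]
        System.out.println(describe("x"));    // single character
    }
}
```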

Let's look at what's inside the outermost ().

The outermost form is (?: something)*. It will match 0 or more repetitions of something.

The (?: ) means that what's inside is a non-capturing group; that is, it doesn't produce a numbered group in the match like ( ) would. It is there only for grouping: it lets the * apply to the whole set of | alternatives and keeps those alternatives separate from the outermost |.
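To see the difference between capturing and non-capturing groups, you can count the groups a pattern defines. This is a small sketch using toy patterns; `GroupCountDemo` and `groups` are illustrative names, not part of the original code.

```java
import java.util.regex.Pattern;

class GroupCountDemo {
    // Number of capturing groups a regex defines (group 0, the whole match, is not counted).
    static int groups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(groups("(a|b)*c"));   // 1
        System.out.println(groups("(?:a|b)*c")); // 0

        // Grouping also controls what | and * apply to.
        System.out.println("aac".matches("(?:a|b)*c")); // true
        System.out.println("aac".matches("a|b*c"));     // false: this means "a" OR "b*c"

        // BALANCED_TEXT has exactly one capturing group, the outer ( ).
        System.out.println(groups(
            "(?s)((?:\\\\.|[^\\\\{}]|[{](?:\\\\.|[^\\\\{}])*[}])*)\\}|.")); // 1
    }
}
```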

Let's look at what that something is. It's a series of OR expressions, which are tried from left to right.

The first one is \\. which matches \ followed by any character (notice that \\ is an escaped \, while the . is not escaped).

The second one is the negated character class [^\\{}], which matches any single character that is not \, { or }.

The third one matches the character {, followed by 0 or more matches of the inner (?: ), followed by }. The inner (?: ) matches either \ followed by any character, or any character that is not \, { or }.
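Each of the three inner alternatives can be tried on its own. The following is a sketch only: the constant names `ESCAPED`, `PLAIN` and `BRACED` and the helper `whole` are made up here, and each piece is tested against the whole input with `matches()`.

```java
import java.util.regex.Pattern;

class InnerAlternatives {
    // The three inner alternatives as Java string literals.
    static final String ESCAPED = "\\\\.";     // a backslash followed by any character
    static final String PLAIN   = "[^\\\\{}]"; // one character that is not a backslash, { or }
    static final String BRACED  =
        "[{](?:" + ESCAPED + "|" + PLAIN + ")*[}]"; // one non-nested { ... } pair

    static boolean whole(String regex, String input) {
        return Pattern.compile("(?s)" + regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(whole(ESCAPED, "\\}"));      // true: escaped brace
        System.out.println(whole(PLAIN, "a"));          // true
        System.out.println(whole(PLAIN, "{"));          // false: { is excluded
        System.out.println(whole(BRACED, "{ab\\}c}"));  // true: escaped } inside the pair
        System.out.println(whole(BRACED, "{a{b}c}"));   // false: nested braces
    }
}
```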

Putting this together: the first part matches anything that ends with } (group 1 does not include that final }, while the whole match does). Before that final }, it will match:

  • Empty string
  • Ordinary characters other than \, { and }
  • Any characters escaped by \
  • Sequences of characters between { }

Put more simply: it matches pretty much anything except a lone \ or an unpaired { or }, and it does not match nested { } pairs. Any of those characters can be included by escaping it with \.

The second alternative (the final .) matches any single character at all, but in that case group 1 takes no part in the match, so in Java Matcher.group(1) returns null.

Sample strings that match (shown without Java string escaping):

a}, h{ello}}, h{\{ello}}, x, h{\\ello}}, {}}
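The samples above can be checked mechanically. In this sketch the strings are written as Java literals, so every backslash in a sample is doubled; `SampleCheck` and `matchesWhole` are illustrative names, and the check assumes whole-input matching with `matches()`.

```java
import java.util.regex.Pattern;

class SampleCheck {
    static final Pattern BALANCED_TEXT = Pattern.compile(
        "(?s)((?:\\\\.|[^\\\\{}]|[{](?:\\\\.|[^\\\\{}])*[}])*)\\}|.");

    // The six samples from the answer, re-escaped as Java string literals.
    static final String[] SAMPLES = {
        "a}", "h{ello}}", "h{\\{ello}}", "x", "h{\\\\ello}}", "{}}"
    };

    static boolean matchesWhole(String s) {
        return BALANCED_TEXT.matcher(s).matches();
    }

    public static void main(String[] args) {
        for (String s : SAMPLES) {
            // Every sample prints "true".
            System.out.println(s + " -> " + matchesWhole(s));
        }
    }
}
```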

The regex seems wrong for something named BALANCED_TEXT: it won't match {}, yet it will match } and {}}.
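The cases behind that observation can be reproduced with a short check. This is a sketch under the assumption that "match" means the whole input must match (`Matcher.matches()`); `NamingCheck` and `matchesWhole` are hypothetical names.

```java
import java.util.regex.Pattern;

class NamingCheck {
    static final Pattern BALANCED_TEXT = Pattern.compile(
        "(?s)((?:\\\\.|[^\\\\{}]|[{](?:\\\\.|[^\\\\{}])*[}])*)\\}|.");

    static boolean matchesWhole(String s) {
        return BALANCED_TEXT.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("{}"));      // false: balanced, but no trailing }
        System.out.println(matchesWhole("}"));       // true: unbalanced
        System.out.println(matchesWhole("{}}"));     // true: one } too many
        System.out.println(matchesWhole("{a{b}}}")); // false: nested pairs are not handled
    }
}
```

Note that with `find()` instead of `matches()`, the trailing `.` alternative lets the pattern match a single character of almost any input, so the results depend on which method the calling code uses.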

RokL