
I am trying to fetch thread names from the thread dumps file. The thread names are usually contained within "double quotes" in the first line of each thread dump. It may look as simple as follows:

"THREAD1" daemon prio=10 tid=0x00007ff6a8007000 nid=0xd4b6 runnable [0x00007ff7f8aa0000]

Or as big as follows:

"[STANDBY] ExecuteThread: '43' for queue: 'weblogic.kernel.Default (self-tuning)'" daemon prio=10 tid=0x00007ff71803a000 nid=0xd3e7 in Object.wait() [0x00007ff7f8ae1000]

The regular expression I wrote is a simple one: "(.*)". Verbally, it can be described as "capture anything that is enclosed inside double quotes as a group". However, it causes heavy backtracking and therefore requires a lot of steps, as can be seen here.

So I came up with another regex that does the same job: "([^\"]*)". Verbally, it can be described as "capture any number of non-double-quote characters that are enclosed inside double quotes". I did not find any regex faster than this. It does not perform any backtracking and hence requires the minimum number of steps, as can be seen here.

I told my colleague about this. He came up with yet another one: "(.*?)". I didn't get how it works. It performs considerably less backtracking than the first one but is a bit slower than the second one, as can be seen here. However:

  • I don't get why the backtracking stops early.
  • I understand that ? is a quantifier which means "once or not at all". However, I don't understand how "once or not at all" is being used here.
  • In fact I am not able to guess how can we describe this regex verbally.

My colleague tried explaining it to me, but I am still not able to understand it completely. Can anyone explain?

yAnTar
Mahesha999
  • Do you need to match substrings like `"` + `substring having no quote` + `"`? – Wiktor Stribiżew Nov 23 '15 at 11:10
  • I think you should use `.*?` which will make the search lazy. I think your current regex has a flaw. If there is a `"some text here"` in the line after your thread name, then the last `"` will be mapped. – TheLostMind Nov 23 '15 at 11:10
  • @VinodMadyalkar: You are suggesting one of the least efficient solutions. Lazy matching has some very important drawbacks. A negated character class solution is best then. – Wiktor Stribiżew Nov 23 '15 at 11:11
  • @stribizhev - But looking at the String, being greedy will involve backtracking a lot. And what if there is another `""` after the thread name?. The Thread name starts at the beginning of the String, is it worth going back? – TheLostMind Nov 23 '15 at 11:12
  • @VinodMadyalkar: `"([^"]*+)"` is the best regex for matching `"no-quotes-here"`-like strings. – Wiktor Stribiżew Nov 23 '15 at 11:13
  • `+?` and `*?` are two _eager_ postfix operators on their own. They differ from `+` and `*` in that they accept the shortest matching sequence. (`"..."..."`) – Joop Eggen Nov 23 '15 at 11:13
  • @stribizhev - Agreed. That does look like it is the fastest – TheLostMind Nov 23 '15 at 11:14
  • @Mahesha999: What do you mean by *I don't get why the backtracking stops early.*? With what regex? – Wiktor Stribiżew Nov 23 '15 at 11:14
  • @VinodMadyalkar yes you are right if there are three double quotes in line, then `"(.*)"` will capture the outer group. – Mahesha999 Nov 23 '15 at 11:17
  • @stribizhev seems that I dont understand what `*?` means. I thought `?` here too means *once or not at all*. Was confused how was that working. But just now I read online its something called *reluctant* quantifier. Need to read more about it then only I will understand how `"(.*?)"` is working. – Mahesha999 Nov 23 '15 at 11:19
  • And also there is always only two double quotes on that line, always enclosing the thread name. – Mahesha999 Nov 23 '15 at 11:21
  • @Mahesha999: *reluctant* = *lazy* quantifier. You can read about it in my answer and at rexegg, and at regular-expressions.info. It just matches as few characters as possible to return a valid match. Also, a standalone `?` quantifier is a greedy quantifier as it really matches 1 or 0 characters. `??` is a *lazy*/*reluctant* quantifier as it will match 0 or 1 characters. I.e. if it can match 0 symbols to return a valid match, it will return a zero-length substring. – Wiktor Stribiżew Nov 23 '15 at 19:33
  • Using (or recommending use of) the `.*` and `.*?` dot star expressions in cases like this is a sure sign that the author has NOT read/studied Friedl's classic work: [Mastering Regular Expressions](https://www.amazon.com/Mastering-Regular-Expressions-Friedl/dp/0596528124). The dot star is rarely needed or warranted - *Say what you mean and mean what you say!* – ridgerunner Aug 18 '16 at 14:09

1 Answer


Brief explanation and a solution

The "(.*)" regex involves a lot of backtracking because it finds the first " and then grabs the whole string and backtracks looking for the " that is closest to the end of string. Since you have a quoted substring closer to the start, there's more backtracking than with "(.*?)" as this lazy quantifier *? makes the regex engine look for the closest " after the first " found.

The negated character class solution "([^"]*)" is the best of the three because it does not have to grab everything, just all characters other than ". However, to stop any backtracking and make the expression as efficient as possible, you can use possessive quantifiers.
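The difference in which closing quote each pattern settles on can be seen in a small Java sketch. The input line here is a hypothetical one with an extra quoted value added after the thread name (the `note="extra"` part is not from the original dump), and the class and method names are illustrative only.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuoteCaptureDemo {
    // Returns group 1 of the first match, or null when the pattern does not match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical line: a thread name followed by a second quoted string.
        String line = "\"THREAD1\" daemon prio=10 note=\"extra\"";

        // Greedy: runs to the end of the line, then backtracks to the LAST quote.
        System.out.println(firstGroup("\"(.*)\"", line));
        // prints: THREAD1" daemon prio=10 note="extra

        // Lazy: expands one character at a time and stops at the FIRST closing quote.
        System.out.println(firstGroup("\"(.*?)\"", line));
        // prints: THREAD1

        // Negated class: cannot step over a quote at all, so it also stops at the first one.
        System.out.println(firstGroup("\"([^\"]*)\"", line));
        // prints: THREAD1
    }
}
```

When a line contains exactly two quotes, all three patterns capture the same text and differ only in how much work the engine does to get there.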

If you need to match strings like " + no quotes here + ", use

"([^"]*+)"

or, since you do not even need to match the trailing quote in this situation:

"([^"]*+)

See regex demo
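As a rough illustration of how the first pattern might be applied to a dump, here is a minimal Java sketch. It assumes the dump has already been read into a list of lines and that only thread header lines contain a double quote; the class name, the method name, and the extra `java.lang.Thread.State` line are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ThreadNameExtractor {
    // A quote, then possessively as many non-quote characters as possible, then a quote.
    private static final Pattern THREAD_NAME = Pattern.compile("\"([^\"]*+)\"");

    // Collects the first quoted substring of every line that has one.
    static List<String> extract(List<String> lines) {
        List<String> names = new ArrayList<>();
        for (String line : lines) {
            Matcher m = THREAD_NAME.matcher(line);
            if (m.find()) {
                names.add(m.group(1));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> dump = Arrays.asList(
            "\"THREAD1\" daemon prio=10 tid=0x00007ff6a8007000 nid=0xd4b6 runnable [0x00007ff7f8aa0000]",
            "   java.lang.Thread.State: RUNNABLE", // hypothetical non-header line
            "\"[STANDBY] ExecuteThread: '43' for queue: 'weblogic.kernel.Default (self-tuning)'\" daemon prio=10 tid=0x00007ff71803a000 nid=0xd3e7 in Object.wait() [0x00007ff7f8ae1000]"
        );
        for (String name : extract(dump)) {
            System.out.println(name);
        }
    }
}
```

Compiling the pattern once and reusing it for every line avoids recompiling it per line, which matters more for large dumps than the choice between `*` and `*+` here.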

> In fact I am not able to guess how can we describe this regex verbally.

The latter "([^"]*+) regex can be described as

  • " - find the first " symbol from the left of the string
  • ([^"]*+) - match and capture into Group 1 zero or more symbols other than ", as many as possible, and once the engine finds a double quote, the match is returned immediately, without backtracking.

Quantifiers

More information on quantifiers from Rexegg.com:

A* Zero or more As, as many as possible (greedy), giving up characters if the engine needs to backtrack (docile)
A*? Zero or more As, as few as needed to allow the overall pattern to match (lazy)
A*+ Zero or more As, as many as possible (greedy), not giving up characters if the engine tries to backtrack (possessive)

As you see, the ? in *? is not a separate quantifier; it is a part of another quantifier and changes how that quantifier behaves.
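The three flavours, and the difference between a standalone `?` and a `?` that modifies another quantifier, can be checked with a short Java sketch. The input strings and the helper name are made up for the illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierFlavours {
    // Returns the text of the first match, or null if there is none.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String input = "AAAB";

        System.out.println(firstMatch("A*", input));   // AAA   (greedy: as many as possible)
        System.out.println(firstMatch("A*?", input));  // empty (lazy: as few as needed, here zero)
        System.out.println(firstMatch("A*?B", input)); // AAAB  (lazy: expands only until B can match)

        // Greedy gives one A back so the trailing A can match; possessive refuses to.
        System.out.println(firstMatch("A*A", input));  // AAA
        System.out.println(firstMatch("A*+A", input)); // null

        // A standalone ? means "once or not at all" and is greedy; ?? is its lazy form.
        System.out.println(firstMatch("A?", "A"));     // A
        System.out.println(firstMatch("A??", "A"));    // empty
    }
}
```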

I advise reading more about why lazy quantifiers are expensive and why the negated class solution is really safe and fast for dealing with your input string (where you just match a quote followed by non-quotes and then a final quote).

Difference between .*?, .* and [^"]*+ quantifiers

  • The greedy "(.*)" solution works like this: the engine checks each symbol from left to right looking for ", and once it is found, grabs the whole string up to the end and then steps back, checking each symbol to see whether it is ". Thus, in your input string, it backtracks 160 times.


  • The lazy "(.*?)" solution looks for the closest " after the first " found. Since the next " is not far, the number of backtracking steps is much smaller than with greedy matching.


  • The possessive quantifier solution with a negated character class, "([^"]*+)", works like this: the engine finds the leftmost ", and then grabs all characters that are not " up to the next ". The negated character class [^"]*+ greedily matches zero or more characters that are not a double quote. Therefore, we are guaranteed that the match will never jump over the first encountered ". This is a more direct and efficient way of matching between delimiters. Note that in this solution, we can fully trust the * that quantifies the [^"]. Even though it is greedy, there is no risk that [^"] will match too much, as it is mutually exclusive with the ". This is the contrast principle from the regex style guide [see source].

Note that the possessive quantifier does not let the regex engine backtrack into the subexpression. Once matched, the symbols between the quotes become one hard block that cannot be "re-sorted" due to some "inconveniences" met by the regex engine: the engine is unable to shift any characters out of or into this block of text.

For the current expression, it does not make a big difference though.


Wiktor Stribiżew
  • *lazy quantifier* is a poor name for *non-greedy quantifiers*. They are not "lazy" in the same sense that the word is applied anywhere else in computing – Borodin Nov 23 '15 at 12:24
  • @Borodin - I disagree. The term: "Lazy" aptly and succinctly describes the behavior of this modifier. Also, this term is used extensively in Jeffrey Friedl's classic work: [Mastering Regular Expressions](https://www.amazon.com/Mastering-Regular-Expressions-Friedl/dp/0596528124). Hands down, the most useful book I've ever read. – ridgerunner Aug 18 '16 at 13:50
  • @ridgerunner: Of course you are welcome to your opinion, but I have to disagree. ***A*** In programming, a *lazy* operation is generally one that is delayed until its effect is required, which is nothing to do with what a non-greedy regex quantifier does. ***B*** In English, *lazy* is not the opposite of *greedy*. ***C*** However highly you may think of Friedl's book, it is strange to imagine that everything it says is true, irrefutable, and cannot be bettered. It is also very out of date now, although the basic principles still apply. – Borodin Aug 18 '16 at 14:37
  • I'm curious if [this question](https://stackoverflow.com/questions/57325260/can-we-search-regexp-from-the-middle-of-a-text-back-to-beginning) can be answered with a single regex rather than my two-regex solution...? – T.J. Crowder Aug 04 '19 at 09:19