Tcl regexp match is too slow when capturing a submatch

Question

I have the following test setup for Tcl:

set v z[string repeat t 10000]b[string repeat t 10000]g[string repeat t 10000]z

If I use regexp only to test for a match, the timing is fine:

time {regexp {z.*?b.*?g(.+?)z} $v} 20
[TCL_OK] 340.4 microseconds per iteration

But if I also ask for the submatch, regexp becomes dramatically slower:

time {regexp {z.*?b.*?g(.+?)z} $v -> asd} 5
[TCL_OK] 157007.4 microseconds per iteration

What is wrong with my regexp, and why is it so slow only when the submatch is returned?

I am using the following environment:

parray tcl_platform
tcl_platform(byteOrder)     = littleEndian
tcl_platform(machine)       = intel
tcl_platform(os)            = Windows NT
tcl_platform(osVersion)     = 6.1
tcl_platform(pathSeparator) = ;
tcl_platform(platform)      = windows
tcl_platform(pointerSize)   = 4
tcl_platform(threaded)      = 1
tcl_platform(user)          = kot
tcl_platform(wordSize)      = 4
[TCL_OK]
puts $tcl_patchLevel
8.6.0
[TCL_OK]

Update. Additional tests:

Non-capturing match, best time:

time {regexp {z.*?b.*?g(.+?)z} $v} 5
[TCL_OK] 1178.2 microseconds per iteration

Capturing match, non-greedy, bad time:

time {regexp {z.*?b.*?g(.+?)z} $v -> asd} 5
[TCL_OK] 13796072.4 microseconds per iteration

Capturing match, greedy, acceptable time:

time {regexp {z.*b.*g(.+)z} $v -> asd} 5
[TCL_OK] 7097.4 microseconds per iteration
string length $asd
[TCL_OK] 100007

Capturing match, non-greedy+greedy+greedy, very bad time:

time {regexp {z.*?b.*g(.+)z} $v -> asd} 5
[TCL_OK] 38177041.6 microseconds per iteration
string length $asd
[TCL_OK] 100000

And finally, capturing match, non-greedy+non-greedy+greedy: the match is non-greedy and the time is acceptable:

time {regexp {z.*?b.*?g(.+)z} $v -> asd} 5
[TCL_OK] 4157.0 microseconds per iteration
string length $asd
[TCL_OK] 100000

Tcl's RE engine behaves very unpredictably for me.
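
The differing capture lengths in the tests above (100007 versus 100000) look consistent with the matching rule described in Tcl's re_syntax manual page: a branch takes the greedy or non-greedy preference of its first quantified atom, and that preference applies to the RE as a whole. The following is a minimal sketch of that rule on a small, made-up subject string (XY1234Z is not from the question). It only illustrates which substring is chosen and says nothing about speed.

```tcl
# First quantifier (Y*) is greedy, so the whole RE prefers the longest match.
regexp {Y*([0-9]{1,3})} XY1234Z -> greedy
puts $greedy   ;# expected: 123

# First quantifier (Y*?) is non-greedy, so the whole RE prefers the
# shortest match, even though {1,3} on its own is a greedy quantifier.
regexp {Y*?([0-9]{1,3})} XY1234Z -> lazy
puts $lazy     ;# expected: 1
```

Under this rule, starting a pattern with `.*?` makes every later quantifier in that branch behave as if it were non-greedy with respect to the overall match.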

I'd imagine that it's smart enough that it can ignore the "capture" part of "capture group" unless that's requested, which would save time on memory allocation and the like, but that wouldn't be much time. — Fund Monica's Lawsuit, May 04 '16 at 12:31
@QPaysTaxes It's exactly that, and it lets a different implementation of the RE engine be used internally. Basically, without capturing the code can use a much simpler automaton to do the matching. (There are cases when it can go even faster, but this isn't one of them.) — Donal Fellows, May 04 '16 at 12:37
Your regex simplified and improved `z[^b]*b[^g]*g([^z]+)z`. Think you'll see gains in speed. — SamWhan, May 04 '16 at 12:51
This is just a test case. In my real application 'z'/'b'/'g'/'z' are much more complicated strings/matches. — Chpock, May 04 '16 at 13:03
@Chpock: What is your real pattern? The only technique to speed up lazy matching is *unroll-the-loop* technique. However, to show how it works, the real pattern is required. — Wiktor Stribiżew, May 04 '16 at 13:27
@Wiktor Stribiżew: one of my real RE looks like `regexp -nocase {\s*?\s*?(.+?)} $ReqRAW -> msgblock` — Chpock, May 04 '16 at 14:03
Ok, the [original](https://regex101.com/r/gI6kE1/1) takes 88 steps here (no need to take this number as a real measurement, just as a starting point, since I am using PCRE here). Changing all to greedy matching with negated character classes: [`]*href=([\"'])([^\"'<>]*)\1[^<>]*>([^<>]*)`](https://regex101.com/r/gI6kE1/2) - 53 steps (as I doubt there are `'` or `"` inside the href values and assuming there are no `` in the contents). Unfortunately, I see no way to unroll this efficiently. — Wiktor Stribiżew, May 04 '16 at 14:12
Using [regular expressions to parse XML/HTML](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454) is always going to be **inefficient** as well as **error-prone**. Use an XML/HTML parser such as tDOM instead. See [FAQ](http://stackoverflow.com/questions/22937618/reference-what-does-this-regex-mean/22944075#22944075) (Under **General information** / "Do not use regex to parse HTML"). — Peter Lewerin, May 04 '16 at 18:17
@Peter Lewerin: I agree, using REs to parse HTML is horrible practice. But I need this "by design" to keep some code as light, clear, simple and short as possible, so DOM/XML parsers are not an option. My question is an attempt to understand why only the fifth RE in my "additional tests" works fast and as expected while the others don't. — Chpock, May 04 '16 at 18:43
When I use tDOM and XPath to retrieve data the resulting code typically is a lot lighter, clearer, simpler, and shorter than the corresponding code using regular expressions would have been. But it's of course your choice. — Peter Lewerin, May 04 '16 at 18:51

Donal Fellows · Accepted Answer · 2016-05-05T08:36:49.637

Non-capturing parentheses are faster than capturing parentheses (since they let a more optimal compilation strategy be used), so when possible, Tcl's RE engine internally uses the non-capturing form. When is that possible? When there are no back-references (\1) in the regular expression and the captured substrings are not used from outside (information that Tcl passes in to the engine). By adding the extra arguments to capture the substrings, you're forcing a less efficient route in the RE compiler (though you're getting more information in return, of course).
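
As a rough way to observe this, the sketch below times the same RE with and without a match variable. The subject string is deliberately shorter than the one in the question (the value 2000 is an arbitrary choice) so the slow case finishes quickly. Absolute timings will differ between machines and Tcl versions.

```tcl
# Smaller subject than in the question; 2000 is an arbitrary size.
set n 2000
set v z[string repeat t $n]b[string repeat t $n]g[string repeat t $n]z
set re {z.*?b.*?g(.+?)z}

# 1. Match only: no match variables, so no submatch has to be reported.
puts "match only:   [time {regexp $re $v} 20]"

# 2. Same RE, but the submatch is requested through a match variable.
puts "with capture: [time {regexp $re $v -> sub} 5]"
puts "captured [string length $sub] characters"
```

The only difference between the two calls is the presence of the `-> sub` arguments, which is the "use from outside" that the paragraph above refers to.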

[EDIT] It turns out that non-greedy REs don't work well with capturing parentheses with Tcl's current RE engine. (No idea why; the code's a bit complicated. Well, a lot complicated!) But it's possible to write this particular regular expression in a way that can be matched quickly.

Firstly, time scaling for my machine:

% time {regexp {z.*?b.*?g(.+?)z} $v} 2000
98.98675999999999 microseconds per iteration

For comparison, here's a greedy version without parens (slightly faster, but not by much):

% time {regexp {z.*b.*g.+z} $v} 2000
96.954045 microseconds per iteration

Next, the original slow match:

% time {regexp {z.*?b.*?g(.+?)z} $v -> asd} 50
163337.53884 microseconds per iteration
% string length $asd
10000

Now, the faster version:

% time {regexp {z[^b]*b[^g]*g([^z]*)z} $v -> asd} 5000
341.0937716 microseconds per iteration
% string length $asd
10000

This uses greedy matching and instead makes it so that far less backtracking is required by replacing (for example) .*?b with [^b]*b. Note that you can still see the cost of using capturing at all, but at least this works pretty quickly and captures the same range of characters.
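
A minimal sketch to check, on this kind of input, that the two forms capture the same text before comparing their timings. The subject size of 2000 is an arbitrary, smaller value than in the question. The equivalence is only checked for this particular input, where each delimiter character occurs exactly where the pattern expects it.

```tcl
set n 2000
set v z[string repeat t $n]b[string repeat t $n]g[string repeat t $n]z

set lazy    {z.*?b.*?g(.+?)z}         ;# original non-greedy form
set negated {z[^b]*b[^g]*g([^z]*)z}   ;# greedy form with negated classes

# Both patterns should capture the same run of characters here.
regexp $lazy    $v -> a
regexp $negated $v -> b
puts "same capture: [expr {$a eq $b}]"

# Compare the cost of each capturing match.
puts "lazy:    [time {regexp $lazy $v -> a} 5]"
puts "negated: [time {regexp $negated $v -> b} 50]"
```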

I'm guessing that the very slow matching you were experiencing is because the engine is backtracking a lot.

Thanks for the explanation. It looks like this is a problem with the non-greedy match mode in the RE engine. Is there a way to write an RE that executes in acceptable time? — Chpock, May 04 '16 at 13:12
@Chpock Try the update I've written. It started as a comment, but grew far too long so I've added it in there. — Donal Fellows, May 05 '16 at 08:37
It seems the best answer is "non-greedy REs don't work well", and there is no regex implementation for Tcl other than the one in the core. — Chpock, May 05 '16 at 15:02
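
When the delimiters are longer literal strings, as mentioned in the comments, a single negated character class no longer applies directly. One possible workaround, which is not part of the answer above, is to locate literal delimiters with `string first` and avoid the RE engine entirely. The sketch below assumes the delimiters are fixed strings. The proc name `captureAfter` and the delimiters START, MID, GO and END are hypothetical, chosen only for illustration.

```tcl
# Hypothetical helper: find each literal delimiter in $delims in order,
# then return the text between the last one and the next $close.
# Returns an empty string if any delimiter is missing. Unlike (.+?),
# an empty capture is also possible, so the two cases are not told apart.
proc captureAfter {text delims close} {
    set pos 0
    foreach d $delims {
        set i [string first $d $text $pos]
        if {$i < 0} { return "" }
        set pos [expr {$i + [string length $d]}]
    }
    set j [string first $close $text $pos]
    if {$j < 0} { return "" }
    return [string range $text $pos [expr {$j - 1}]]
}

set t [string repeat t 10000]
set v START${t}MID${t}GO${t}END
set got [captureAfter $v {START MID GO} END]
puts [string length $got]
```

This only covers fixed delimiters. Patterns that need case-insensitive or variable delimiters would still require a regular expression for the delimiter search itself.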
