
I am working with std::regex, and whilst reading about the various constants defined in std::regex_constants, I came across std::regex_constants::optimize. From what I've read, it sounds useful in my application: I only need one instance of the regex, initialised at the beginning, but it is used many times throughout the loading process.

According to the working paper N3126 (p. 1077), std::regex_constants::optimize:

Specifies that the regular expression engine should pay more attention to the speed with which regular expressions are matched, and less to the speed with which regular expression objects are constructed. Otherwise it has no detectable effect on the program output.

I was curious as to what type of optimization would be performed, but there doesn't seem to be much literature about it (indeed, it seems to be implementation-defined). One of the only things I found was at cppreference.com, which states that std::regex_constants::optimize:

Instructs the regular expression engine to make matching faster, with the potential cost of making construction slower. For example, this might mean converting a non-deterministic FSA to a deterministic FSA.

However, I have no formal background in computer science. Whilst I'm aware of the basics of what an FSA is, and understand the basic difference between a deterministic FSA (each state has exactly one next state for a given input) and a non-deterministic FSA (where a state may have multiple potential next states for the same input), I do not understand how this improves matching time. I would also be interested to know if there are any other optimizations in various C++ Standard Library implementations.

Yu Hao
Thomas Russell
  • For comparison, regarding Perl's `/o` regex optimization, there was a **[discussion here on SO](http://stackoverflow.com/q/550258/170194)**. For Perl, these optimizations don't really do much anymore. Better optimization techniques (regarding the structure of the regular expression itself) are discussed [in the Friedl book](http://shop.oreilly.com/product/9780596528126.do) in detail. – rubber boots Jul 21 '12 at 13:52

2 Answers


There's some useful information on the topic of regex engines and performance trade-offs (far more than can fit in a Stack Overflow answer) in Mastering Regular Expressions by Jeffrey Friedl.

It's worth noting that Boost.Regex, which was the source for N3126, documents optimize as "This currently has no effect for Boost.Regex."

P.S.

indeed, it seems to be implementation-defined

No, it's unspecified. Implementation-defined means an implementation is required to define the choice of behaviour. Implementations are not required to document how their regex engines are implemented or what (if any) difference the optimize flag makes.

P.S. 2

in various STL implementations

std::regex is not part of the STL; the C++ Standard Library is not the same thing as the STL.

Jonathan Wakely

See http://swtch.com/~rsc/regexp/regexp1.html for a nice explanation of how automaton-based regex implementations (simulating the NFA without backtracking) can avoid the exponential blowup that backtracking matchers exhibit in certain circumstances.

JohannesD
  • That article suggests that using an NFA is faster than a DFA regex machine implemented with recursive backtracking. Yet the "optimisation" described by cppreference.com suggests that converting from an NFA to a DFA yields a performance advantage? How does this work? Thanks for the link, by the way, it is an interesting read! – Thomas Russell Jul 21 '12 at 14:16
  • @Shaktal - Oh, right. I wonder if they just accidentally switched the terms in the cppreference article. Or they might imply that the regex compiler could do some analysis and determine, on a case-by-case basis, whether a DFA or an NFA is faster. – JohannesD Jul 21 '12 at 14:22
  • 2
    @Shaktal The article is about how a traditional automaton is much faster than the implementations necessary to perform backreferences and co. The rest of the paper is basically a simple explanation of how you'd implement such a finite automaton. It's the usual tradeoff: Power vs. Speed. Perl "regexes" are more powerful than the traditional ones, but you give up the guaranteed O(N) runtime and easy optimizations. One possibility for `optimize` would be to check whether the additional features of the expensive implementation are necessary and if not fall back to the faster NFA. – Voo Jul 21 '12 at 21:03
  • When captures are limited then backtracking isn't required and a DFA will be faster. – Nathan Phillips Mar 06 '18 at 12:59