Increase Boost regex speed or use PCRE in C++

Question

Hello, this is my string:

key{info('1'),details('1'),others('{"1": "2test data1", "2": "2test data2"}')}

I made more than 1 million copies of this line and put them in a file, like this (all lines are the same):

key{info('1'),details('1'),others('{"1": "2test data1", "2": "2test data2"}')}
key{info('1'),details('1'),others('{"1": "2test data1", "2": "2test data2"}')}
key{info('1'),details('1'),others('{"1": "2test data1", "2": "2test data2"}')}
key{info('1'),details('1'),others('{"1": "2test data1", "2": "2test data2"}')}
key{info('1'),details('1'),others('{"1": "2test data1", "2": "2test data2"}')}
..
..
..

Now, I want to use this regex:

key[{]info[(][']1['][)],details[(][']1['][)],others[(]['][{](.*?)[}]['][)][}]

(to get the content inside others(...) on each line)

I tested it in PHP with the preg_match_all function, and I was surprised that PHP matched all of the 1 million lines in just 3 seconds. But my real program is in C++, so I tried this regex in C++:

regex RegexString(R"~(key[{]info[(][']1['][)],details[(][']1['][)],others[(]['][{](.*?)[}]['][)][}])~", regex_constants::optimize);

I was surprised again, but this time the result was bad: it took 10 minutes for the regex to match all the lines.

I then used Boost and got a better result (2 minutes), but what I saw in PHP (PCRE) (3 seconds) drove me crazy. Now, what should I do?

Is there any way in Boost or standard C++ regex to increase the speed (so that it finishes in 3-10 seconds), or do I have to use PCRE in my C++ project?

Results

std::regex : 10 minutes
Boost      : 2 minutes
PCRE (PHP) : 3 seconds

Some explanation :- http://stackoverflow.com/questions/33163365/regex-works-very-slow http://stackoverflow.com/questions/14205096/c11-regex-slower-than-python — rock321987, Apr 17 '16 at 19:12
Not helped !!! I just want to know if I can optimize this regex in Boost. Please tell me what the best optimization is, or if there is no way to increase the speed I will use PCRE — Elh48, Apr 17 '16 at 19:16
Are you sure that the time difference is a result of the regex and not something else? Are you compiling the regex inside a loop or something? — Laurel, Apr 17 '16 at 19:27
Can you show me the exact thing you want to extract from that sample input? I'll show you an approach that is likely faster. — sehe, Apr 17 '16 at 19:53
I want to get $2 from each line ... and I have more than 1 million lines with different $2 content — Elh48, Apr 17 '16 at 19:58
You don't need to put all characters that aren't letters between square brackets, it's useless. You only need to escape (with a backslash for example) characters that have a special meaning in a pattern (ie:`) ( [ + * ? . \ | ^ $` and `{` eventually) — Casimir et Hippolyte, Apr 17 '16 at 21:10
@CasimiretHippolyte, that `~` is part of the [C++ raw string](http://en.cppreference.com/w/cpp/language/string_literal) delimiter, not part of the pattern itself: in C++ `R"foo(bar)foo"` is equal to `"bar"`. — Lucas Trzesniewski, Apr 18 '16 at 08:30
@LucasTrzesniewski: Thanks, yes I have seen that yesterday, but I was too tired to retrieve the comment, *(and I have totally forgotten this morning...)*. You saved me. — Casimir et Hippolyte, Apr 18 '16 at 08:33
Taking out my calculator to put things in perspective: `1,000,000 lines per 10 minutes = 1.6 lines per millisecond`; `1,000,000 lines per 2 minutes = 8.3 lines per millisecond`; `1,000,000 lines per 3 seconds = 333.3 lines per millisecond`; Since the regex isn't pathological, we can only assume this post is bull-puky. In _all of regex land_ a line such as this using _any regex engine_ is matched in the very low microseconds. I vote to close this thread.. — , Apr 18 '16 at 17:01

score 1 · Answer 1 · 2016-04-18T02:00:17.643

i used the boost and got better result (2 Min)

You'd have to show me that to believe it !!

Using the benchmark software from the RegexFormat app, which uses Boost, I get less than 3 seconds.

The thing with that benchmark software is that you can use a single test line
and run it a million times, and it's the same as running a million lines once.

Here are the results, you can try it out for yourself.
Basically, it runs in 2.5 seconds across the board.

Two regexes are tested, one with the extra capture group and one without,
to represent the two regexes in your text above.

The target line :

key{info('1'),details('1'),others('{"1": "2test data1", "2": "2test data2"}')}

1 Line run 1,000,000 times:

Regex1:   key[{]info[(][']1['][)],details[(][']1['][)],others[(]['][{](.*?)[}]['][)][}]
Options:  < none >
Completed iterations:   1000  /  1000     ( x 1000 )
Matches found per iteration:   1
Elapsed Time:    2.78 s,   2777.70 ms,   2777696 µs


Regex2:   (key[{]info[(][']1['][)],details[(][']1['][)],others[(]['][{](.*?)[}]['][)][}])
Options:  < none >
Completed iterations:   1000  /  1000     ( x 1000 )
Matches found per iteration:   1
Elapsed Time:    2.89 s,   2893.58 ms,   2893576 µs

1,000 Lines run 1,000 times:

Regex1:   key[{]info[(][']1['][)],details[(][']1['][)],others[(]['][{](.*?)[}]['][)][}]
Options:  < none >
Completed iterations:   1  /  1     ( x 1000 )
Matches found per iteration:   1000
Elapsed Time:    2.38 s,   2381.16 ms,   2381163 µs


Regex2:   (key[{]info[(][']1['][)],details[(][']1['][)],others[(]['][{](.*?)[}]['][)][}])
Options:  < none >
Completed iterations:   1  /  1     ( x 1000 )
Matches found per iteration:   1000
Elapsed Time:    2.50 s,   2495.65 ms,   2495649 µs

10,000 Lines run 100 times:

Regex1:   key[{]info[(][']1['][)],details[(][']1['][)],others[(]['][{](.*?)[}]['][)][}]
Options:  < none >
Completed iterations:   100  /  100     ( x 1 )
Matches found per iteration:   10000
Elapsed Time:    2.38 s,   2384.73 ms,   2384729 µs


Regex2:   (key[{]info[(][']1['][)],details[(][']1['][)],others[(]['][{](.*?)[}]['][)][}])
Options:  < none >
Completed iterations:   100  /  100     ( x 1 )
Matches found per iteration:   10000
Elapsed Time:    2.50 s,   2497.35 ms,   2497349 µs

Finally, an overboard test. 1 Line run 9,999,000 times:

Regex1:   key[{]info[(][']1['][)],details[(][']1['][)],others[(]['][{](.*?)[}]['][)][}]
Options:  < none >
Completed iterations:   9999  /  9999     ( x 1000 )
Matches found per iteration:   1
Elapsed Time:    27.54 s,   27536.56 ms,   27536560 µs


Regex2:   (key[{]info[(][']1['][)],details[(][']1['][)],others[(]['][{](.*?)[}]['][)][}])
Options:  < none >
Completed iterations:   9999  /  9999     ( x 1000 )
Matches found per iteration:   1
Elapsed Time:    28.73 s,   28726.18 ms,   28726182 µs
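
The repeated-match measurement above can be approximated without that tool. What follows is only a minimal sketch: it uses std::regex to stay dependency-free (the numbers above come from Boost, so timings will differ), and the helper names count_matches and time_matches_ms are made up for illustration. The one point it is meant to show is that the pattern is compiled once, outside the timed loop.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper: match one line `iterations` times and count the hits.
// The regex is compiled once by the caller, never inside the loop.
std::size_t count_matches(const std::regex& re, const std::string& line,
                          std::size_t iterations) {
    std::size_t hits = 0;
    std::smatch m;
    for (std::size_t i = 0; i < iterations; ++i) {
        if (std::regex_search(line, m, re)) ++hits;
    }
    return hits;
}

// Hypothetical helper: wall-clock time of the loop above, in milliseconds.
double time_matches_ms(const std::regex& re, const std::string& line,
                       std::size_t iterations, std::size_t& hits) {
    assert(iterations > 0);  // timing an empty loop tells us nothing
    const auto start = std::chrono::steady_clock::now();
    hits = count_matches(re, line, iterations);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling time_matches_ms with the regex and target line from the question and an iteration count of 1,000,000 gives a comparable kind of figure, though the absolute time depends on the compiler, the optimization flags, and the standard library in use.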

@Elh48 - The boost::regex benchmark to match your line using your regex, 1 million times, taking 2 seconds, is staring you right in the face. I'm guessing you don't know what you're doing. — , Apr 18 '16 at 16:24
@Elh48 Stop the harassment "!!!!!!!?????". You failed to show your code, after repeatedly being asked. You yell all the time in disbelief. I've reached the conclusion you don't WANT to be helped, and you're a troll. — sehe, Apr 18 '16 at 20:33

Laurel · Answer 2 · 2016-04-17T20:44:50.750

0

According to regex101.com, it takes 610 steps to match your regex against 5 lines. That's a lot.

It takes 230 steps if you change (.*?) to ([^}]*). This should cut the time to less than 5 minutes.
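
As a rough illustration of that change, here is a small sketch using std::regex (chosen only so it compiles without extra libraries; the same pattern text applies to Boost). The helper capture_others and the names kLazy and kNegated are hypothetical. It shows the two patterns side by side; on the sample line both capture the same text.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: return the text captured by group 1,
// or an empty string if the line does not match.
std::string capture_others(const std::regex& re, const std::string& line) {
    std::smatch m;
    if (!std::regex_search(line, m, re)) return "";
    assert(m.size() == 2);  // whole match plus one capture group
    return m[1].str();
}

// Original pattern from the question: lazy dot.
const std::regex kLazy(
    R"~(key[{]info[(][']1['][)],details[(][']1['][)],others[(]['][{](.*?)[}]['][)][}])~",
    std::regex::optimize);

// Variant suggested here: negated class. Only valid when the captured
// text never contains '}'.
const std::regex kNegated(
    R"~(key[{]info[(][']1['][)],details[(][']1['][)],others[(]['][{]([^}]*)[}]['][)][}])~",
    std::regex::optimize);
```

Note that kNegated simply fails to match a line whose captured part contains a }, so it is not a drop-in replacement for every input.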

If your data may contain }s, which ([^}]*) will fail to match, try ((?:[^}]*}??)*?) instead. It adds 25-40 steps, but may not be as slow as your original.

I can shave off 10 steps if you remove the capture group around the entire expression. You don't need it, $0 is equivalent.

The thing you need to understand is that C++ uses a different regex engine than PCRE. PCRE is very advanced, and likely includes more optimizations.

Without a better idea of what your data can contain, it will be hard to know what can be optimized.

The other thing you could consider is moving some of the work done in the regex to C++.

For example, you could try finding all instances of key{info(' and remove key[{]info[(]['] from the start of your regex. You could then try seeing if a match can be made directly after each key{info(' occurrence.
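
One way this could look, as a sketch only: std::string::find locates each literal key{info(' and the shortened regex is then tried exactly at that position using std::regex_constants::match_continuous. The function name extract_after_prefix is made up for illustration, and std::regex is used just to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: find each literal "key{info('" with std::string::find,
// then let the regex match only the remainder, anchored at that position.
std::vector<std::string> extract_after_prefix(const std::string& text) {
    static const std::string kPrefix = "key{info('";
    // The question's pattern with the leading key[{]info[(]['] removed.
    static const std::regex kRest(
        R"~(1['][)],details[(][']1['][)],others[(]['][{](.*?)[}]['][)][}])~",
        std::regex::optimize);

    std::vector<std::string> out;
    std::smatch m;
    std::size_t pos = 0;
    while ((pos = text.find(kPrefix, pos)) != std::string::npos) {
        pos += kPrefix.size();
        const auto first = text.cbegin() + static_cast<std::ptrdiff_t>(pos);
        // match_continuous: the match must begin exactly at `first`.
        if (std::regex_search(first, text.cend(), m, kRest,
                              std::regex_constants::match_continuous)) {
            assert(m.size() == 2);  // whole match plus one capture group
            out.push_back(m[1].str());
        }
    }
    return out;
}
```

Whether this is actually faster than the single regex would have to be measured; the sketch only shows the mechanics.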

Better yet, why not replace every }')} with a single character c that will never occur anywhere else in the line, then use ([^c]*) instead of (.*?)[}]['][)][}]?
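
A sketch of that idea for a single line, under a loud assumption: the control character '\x01' is taken as the character that never occurs in the data, which has to be verified for real input. The names kSentinel, mark_terminators, and kSentinelRe are hypothetical, and std::regex is used only to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Assumption: this control character never occurs in the real data.
constexpr char kSentinel = '\x01';

// Hypothetical helper: replace every literal "}')}" in one line
// with the sentinel character.
std::string mark_terminators(std::string line) {
    static const std::string kTerminator = "}')}";
    assert(line.find(kSentinel) == std::string::npos);  // sentinel must be unused
    std::size_t pos = 0;
    while ((pos = line.find(kTerminator, pos)) != std::string::npos) {
        line.replace(pos, kTerminator.size(), 1, kSentinel);
        pos += 1;  // continue after the sentinel just written
    }
    return line;
}

// The question's pattern with (.*?)[}]['][)][}] swapped for ([^c]*)c,
// where c is the sentinel (written as \x01 in the pattern).
const std::regex kSentinelRe(
    R"~(key[{]info[(][']1['][)],details[(][']1['][)],others[(]['][{]([^\x01]*)\x01)~",
    std::regex::optimize);
```

The extra pass over each line costs time too, so this only pays off if the simpler character class saves more than the replacement step adds.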

edited Apr 17 '16 at 20:44

answered Apr 17 '16 at 19:44

Laurel


@Elh48 Can your data contain `}`s inside `$2`? If the answer is no, adding this will not break the regex. – Laurel Apr 17 '16 at 19:49
yes may contain } !!! and i cant use starter at all .. i have lines, and .. i have to get all .. this regex work in 3 seconds in php !!! i dont know why i have to wait 2-3 min in c++ boost !!!!!!! – Elh48 Apr 17 '16 at 19:51
@Elh48 I can barely understand what you're saying. What do you mean by "starter"? – Laurel Apr 17 '16 at 20:03
@Elh48 you haven't shown us any code, so we can't possibly know why it runs so slow ("!!!") – sehe Apr 17 '16 at 23:14
I'm going to use PCRE because it is faster overall, and I got 1 second (wow) with this ((?:[^}]*}??)*?) ... Thank you, man, but if you know anything better please tell me – Elh48 Apr 18 '16 at 07:50
@Laurel - I see this over and over. `Steps` is _not a linear_ indicator of performance. – Apr 18 '16 at 17:06
@sln Yes, I know that. But it gives a good indication of when there might be a problem. – Laurel Apr 18 '16 at 21:28
