11

Consider this regex.

a*b

This will fail in case of aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac

This takes 67 steps in debugger to fail.

Now consider this regex.

(?>a*)b

This will fail in case of aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac

This takes 133 steps in debugger to fail.

And lastly this regex:

a*+b  (a variant of atomic group)

This will fail in case of aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac

This takes 67 steps in debugger to fail.

When I check the benchmark atomic group (?>a*)b performs 179% faster.

Now atomic groups disable backtracking. So performance in match is good.

  1. But why are the number of steps more? Can somebody explain on this?

  2. Why is there a diff. in steps between two atomic groups (?>a*)b and a*+b.

Do they work differently?

Unihedron
  • 10,251
  • 13
  • 53
  • 66
vks
  • 63,206
  • 9
  • 78
  • 110
  • You mean this `a*+b` an atomic group? – Avinash Raj Sep 29 '14 at 06:13
  • 2
    __A few points:__ 1. `*+` in `a*+b` is __possessive__, nothing about it makes it atomic. 2. The number of steps are more because the engine backtracks to match. That is a direct need of regex optimisation to prevent abundant backtracking and potential ReDoS. 3. `(?>a*)` =/= `(?>a)*` – Unihedron Sep 29 '14 at 06:42
  • @AvinashRaj `a*+b` is not atomic but prevent backtracking.so i trieed that – vks Sep 29 '14 at 06:45
  • @Unihedron engine backtracks because i have disabled optimisation? are atomic groups a part of engine optimisation? – vks Sep 29 '14 at 06:46

3 Answers3

8

So what's backtracking?

The engine comes to quantifiers that are greedy by default. Greedy modifiers matches all possible and backtracks by demand, allowing efficient matches,

as referenced by Greedy vs. Reluctant vs. Possessive Quantifiers:

A greedy quantifier first matches as much as possible. So the .* matches the entire string. Then the matcher tries to match the f following, but there are no characters left. So it "backtracks", making the greedy quantifier match one less thing (leaving the "o" at the end of the string unmatched). That still doesn't match the f in the regex, so it "backtracks" one more step, making the greedy quantifier match one less thing again (leaving the "oo" at the end of the string unmatched). That still doesn't match the f in the regex, so it backtracks one more step (leaving the "foo" at the end of the string unmatched). Now, the matcher finally matches the f in the regex, and the o and the next o are matched too. Success! [...]

What does this have to do with a*+b?

In /a*+b/:

  • a The literal character "a".
  •     *+ Zero or more, possessive.
  • b The literal character "b".

As referenced by Greedy vs. Reluctant vs. Possessive Quantifiers:

A possessive quantifier is just like the greedy quantifier, but it doesn't backtrack. So it starts out with .* matching the entire string, leaving nothing unmatched. Then there is nothing left for it to match with the f in the regex. Since the possessive quantifier doesn't backtrack, the match fails there.

Why does it matter?

The machine won't realize if it's doing an (in)efficient match on its own. See here for a decent example: Program run forever when matching regex. In many scenarios, regexes written quickly may not be efficient and may well easily be problematic in deployment.

So what's an atomic group?

After the pattern within the atomic group finishes matching, it will not let go, ever. Study this example:

Pattern: (?>\d\w{2}|12)c
Matching string: 12c

Looks perfectly legitimate, but this match fails. The steps are simple: The first alternation of the atomic group matches perfectly - \d\w{2} consumes 12c. The group then completes its match - now here is our pointer location:

Pattern: (?>\d\w{2}|12)c
                   ^
Matching string: 12c
                    ^

The pattern advances. Now we try to match c, but there is no c. Instead of trying to backtrack (releasing \d\w{2} and consuming 12), the match fails.

Well that's a bad idea then! Why would we prevent backtracking, Unihedron?

Now imagine we're manipulating with a JSON object. This file is not small. Backtracking from the end is going to be a bad idea.

       "2597401":[{"jobID":"2597401",
                 "account":"TG-CCR120014",
                 "user":"charngda",
                 "pkgT":{"pgi/7.2-  5":{"libA":["libpgc.so"],
                 "flavor":["default"]}},          
                 "startEpoch":"1338497979",
                 "runTime":"1022",
                 "execType":"user:binary",              
                 "exec":"ft.D.64",
                 "numNodes":"4",
                 "sha1":"5a79879235aa31b6a46e73b43879428e2a175db5",
                 "execEpoch":1336766742,
                 "execModify":"Fri May 11 15:05:42 2012",
                 "startTime":"Thu May 31 15:59:39 2012",
                 "numCores":"64",
                 "sizeT":{"bss":"1881400168","text":"239574","data":"22504"}},  
                 {"jobID":"2597401",
                 "account":"TG-CCR120014",
                 "user":"charngda",
                 "pkgT":{"pgi/7.2-5":{"libA":["libpgc.so"],
                 "flavor":["default"]}},
                 "startEpoch":"1338497946",
                 "runTime":"33"  "execType":"user:binary",
                 "exec":"cg.C.64",
                 "numNodes":"4",
                 "sha1":"caf415e011e28b7e4e5b050fb61cbf71a62a9789",
                 "execEpoch":1336766735,
                "execModify":"Fri May 11 15:05:35 2012",
                "startTime":"Thu May 31 15:59:06 2012",
                "numCores":"64",
                "sizeT":{"bss":"29630984","text":"225749","data":"20360"}},
                {"jobID":"2597401",
                "account":"TG-CCR120014",
                "user":"charngda",
                "pkgT":{"pgi/7.2-5":  {"libA":["libpgc.so"],
                "flavor":["default"]}},
                "startEpoch":"1338500447",
                "runTime":"145",
                "execType":"user:binary",
                "exec":"mg.D.64",
                "numNodes":"4",
                "sha1":"173de32e1514ad097b1c051ec49c4eb240f2001f",
                "execEpoch":1336766756,
                "execModify":"Fri May 11 15:05:56 2012",
                "startTime":"Thu May 31 16:40:47 2012",
                "numCores":"64",
                "sizeT":{"bss":"456954120","text":"426186","data":"22184"}},{"jobID":"2597401",
                "account":"TG-CCR120014",
                "user":"charngda",
                "pkgT":{"pgi/7.2-5":{"libA":["libpgc.so"],
                "flavor":["default"]}},
                "startEpoch":"1338499002",
                "runTime":"1444",
                "execType":"user:binary",
                "exec":"lu.D.64",
                "numNodes":"4",
                "sha1":"c6dc16d25c2f23d2a3321d4feed16ab7e10c2cc1",
                "execEpoch":1336766748,
                "execModify":"Fri May 11 15:05:48 2012",
                "startTime":"Thu May 31 16:16:42 2012",
                "numCores":"64",
                "sizeT":{"bss":"199850984","text":"474218","data":"27064"}}],

Uh oh...

Do you get what I mean now? :P

I'll leave you to figure out the rest, and try to find out more about possessive quantifiers and atomic groups; I'm not writing anything else into this post. Here is where the JSON came from, I saw the answer a few days ago, very inspiring: REGEX reformatting.

Read also

Community
  • 1
  • 1
Unihedron
  • 10,251
  • 13
  • 53
  • 66
  • Dats actually great.clears a thing or two.but the question remains.Why does it take debugger more steps to fail when atomic group is defined? – vks Sep 29 '14 at 07:14
8

Author note:

    This answer targets question 1 as delivered by the bounty text "I am looking forward to the exact reason why more steps are being needed by the debugger.I dont need answers explaining how atomic groups work.";
    Jerry's answer addresses the other concerns very well, while my other answer takes a ride through the mentioned constructs, how they work, and why they are important. For full knowledge, simply reading this post is not enough!

Every group in a regular expression takes a step to step into and out of the group.

    WHAT?!
Yeah, I'm serious, read on...

Firstly, I would like to present you with quantified non-capturing groups, over without the group:

Pattern 1: (?:c)at
Pattern 2: cat

So what exactly happens here? We'll match the patterns with the test string "concat" on a regex engine with optimizations disabled:

(?:c)atcat

While we're at it, I present you some more groups:

groups

    Oh no! I'm going to avoid using groups!

But wait! Please note that the number of steps taken to match has no correlation with the performance of the match. engines optimizes away most of the "unnecessary steps" as I've mentioned. Atomic groups are still the most efficient, despite more steps taken on an engine with optimizations disabled.

Perhaps relevant:

Community
  • 1
  • 1
Unihedron
  • 10,251
  • 13
  • 53
  • 66
4

I think something is wrong...

I don't know how you did your benchmarking but both a*+b and (?>a*)b should be the same. To quote regular-expressions.info (emphasis mine):

Basically, instead of X*+, write (?>X*). It is important to notice that both the quantified token X and the quantifier are inside the atomic group. Even if X is a group, you still need to put an extra atomic group around it to achieve the same effect. (?:a|b)*+ is equivalent to (?>(?:a|b)*) but not to  (?>a|b)*. The latter is a valid regular expression, but it won't have the same effect when used as part of a larger regular expression.

And just to confirm the above, I ran the following on ideone:

$tests = 1000000;

$start = microtime( TRUE );
for( $i = 1; $i <= $tests; $i += 1 ) {
    preg_match('/a*b/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac');
}
$stop = microtime( TRUE );

printf( "For /a*b/    : %1.15f per iteration for %s iterations\n", ($stop - $start)/$tests, $tests );
unset( $stop, $start );

$start = microtime( TRUE );
for( $i = 1; $i <= $tests; $i += 1 ) {
    preg_match('/(?>a*)b/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac');
}
$stop = microtime( TRUE );

printf( "For /(?>a*)b/: %1.15f per iteration for %s iterations\n", ($stop - $start)/$tests, $tests );
unset( $stop, $start );

$start = microtime( TRUE );
for( $i = 1; $i <= $tests; $i += 1 ) {
    preg_match('/a*+b/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac');
}
$stop = microtime( TRUE );

printf( "For /a*+b/   : %1.15f per iteration for %s iterations\n", ($stop - $start)/$tests, $tests );
unset( $stop, $start );

$start = microtime( TRUE );
for( $i = 1; $i <= $tests; $i += 1 ) {
    preg_match('/(?>a)*b/','aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac');
}
$stop = microtime( TRUE );

printf( "For /(?>a)*b/: %1.15f per iteration for %s iterations\n", ($stop - $start)/$tests, $tests );
unset( $stop, $start );

Getting this as the output:

For /a*b/    : 0.000000879034996 per iteration for 1000000 iterations
For /(?>a*)b/: 0.000000876362085 per iteration for 1000000 iterations
For /a*+b/   : 0.000000880002022 per iteration for 1000000 iterations
For /(?>a)*b/: 0.000000883045912 per iteration for 1000000 iterations

Now, I am by no means a PHP expert so I don't know if this is the right way to benchmark stuff in there, but that's they all have about the same performance, which is kind of expected given the simplicity of the task.

Still, couple of things I note from the above:

Neither (?>a)*b nor (?>a*)b are 179% faster than another regex; the above all are within 7% from each other.

Back to the actual question

But why are the number of steps more? Can somebody explain on this?

It is to be noted that the number of steps is not a direct representation of the performance of a regex. It is a factor, but not the ultimate determinant. There are more steps because the breakdown has steps before entering the group, and after entering the group...

1   / (?> a* ) b/x    aaaaaaaaaaaaaaaaaaaa...
     ^
2   / (?> a* ) b/x    aaaaaaaaaaaaaaaaaaaa...
      ^^^^^^^^
3   / (?> a* ) b/x    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
          ^^
4   / (?> a* ) b/x    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
            ^
5   / (?> a* ) b/x    aaaaaaaaaaaaaaaaaaaaa...
               ^

That's 2 more steps because of the group than the 3 steps...

1   / a*+ b /x    aaaaaaaaaaaaaaaaaaaa...
     ^
2   / a*+ b /x    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
      ^^^
3   / a*+ b /x    aaaaaaaaaaaaaaaaaaaaa...
          ^

Where you can say that a*b is the same as (?:a*)b but the latter having more steps:

1   / (?: a* ) b/x    aaaaaaaaaaaaaaaaaaaa...
     ^
2   / (?: a* ) b/x    aaaaaaaaaaaaaaaaaaaa...
      ^^^^^^^^
3   / (?: a* ) b/x    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
          ^^
4   / (?: a* ) b/x    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaac
            ^
5   / (?: a* ) b/x    aaaaaaaaaaaaaaaaaaaaa...

Note: Even there, you see that regex101 has the steps optimised a bit for the number of steps in a*b.


Conclusion

Possessive quantifiers and atomic groups work differently depending on how you use them. If I take regular-expression's example and tweaking it a little:

(?>a|b)*ac match aabaac

But

(?:a|b)*+ac and (?>(?:a|b)*)ac don't match aabaac.

Unihedron
  • 10,251
  • 13
  • 53
  • 66
Jerry
  • 67,172
  • 12
  • 92
  • 128
  • I checked the performance thing at `regexhero.net`.and the steps difference i m talking about is between `a*b` and `(?>a*)b`.At `regex101.com` there is the differnece present when we disable optimisation.I m trying to find out the reason behind it.67 vs 133 respectively – vks Oct 01 '14 at 08:32
  • @vks Well, ok, but I don't see how that changes the content of my answer in terms of steps. In terms of performance, [`(?>a*)b` is the fastest](http://i.stack.imgur.com/Nz4BX.png) but *not* `(?>a)*b`. – Jerry Oct 01 '14 at 09:10
  • Oh thanx for pointing out.Somebody had edited that.edited back to original.the question is 67 vs 133.doesthe presence of atomic group increases the step?why does it take long to fail thant the normal one....the defintion of atomic suggests it should be opposite right? – vks Oct 01 '14 at 09:23
  • @vks The one with the (correct) atomic group is *fastest*; not slowest. – Jerry Oct 01 '14 at 09:26
  • That's what is d doubt.The performance of atomic group is fastest but the no of steps to fail is more.That is what i am after :( – vks Oct 01 '14 at 09:52
  • @vks Please read my answer. I said "It is to be noted that the number of steps is not a direct representation of the performance of a regex. It is a factor, but not the ultimate determinant. There are more steps because the breakdown has steps before entering the group, and after entering the group..." – Jerry Oct 01 '14 at 09:58
  • @vks Again from my answer, `a*b` and `(?:a*)b` match exactly the same thing, have the same performance but `(?:a*)b` has more steps because of the group. When you compare things, you have to compare on the same bases (here, the two regex match exactly the same thing). – Jerry Oct 01 '14 at 10:01