26

I have gone through the docs for Atomic Grouping and rubyinfo and some questions came into my mind:

  1. Why the name "Atomic grouping"? What "atomicity" does it have that general grouping doesn't?
  2. How does atomic grouping differ to general grouping?
  3. Why are atomic groups called non-capturing groups?

I tried the below code to understand but had confusion about the output and how differently they work on the same string as well?

irb(main):001:0> /a(?>bc|b)c/ =~ "abbcdabcc"
=> 5
irb(main):004:0> $~
=> #<MatchData "abcc">
irb(main):005:0> /a(bc|b)c/ =~ "abcdabcc"
=> 0
irb(main):006:0> $~
=> #<MatchData "abc" 1:"b">
Wai Ha Lee
  • 7,664
  • 52
  • 54
  • 80
Arup Rakshit
  • 109,389
  • 25
  • 234
  • 293
  • Did you read the [atomic grouping docs](http://ruby-doc.org/core-1.9.3/Regexp.html#label-Atomic+Grouping) or the [`named_captures` docs](http://ruby-doc.org/core-1.9.3/Regexp.html#method-i-named_captures) that you linked to? – mu is too short Jan 19 '13 at 06:44
  • @muistooshort Yes,I read. But asked fundamentals I have to be cleared out first,which in turn can give me more exposure to start. As I would like to know when `grouping` I have,then how and why we need such `atomic grouping` what it can do that general `grouping` can't. Could you help me to have such basic understanding to be cleared out? – Arup Rakshit Jan 19 '13 at 06:49
  • 1
    This question has been added to the [Stack Overflow Regular Expression FAQ](http://stackoverflow.com/a/22944075/2736496), under "Groups". – aliteralmind Apr 10 '14 at 00:26

3 Answers3

55

A () has some properties (include those such as (?!pattern), (?=pattern), etc. and the plain (pattern)), but the common property between all of them is grouping, which makes the arbitrary pattern a single unit (unit is my own terminology), which is useful in repetition.

The normal capturing (pattern) has the property of capturing and group. Capturing means that the text matches the pattern inside will be captured so that you can use it with back-reference, in matching or replacement. The non-capturing group (?:pattern) doesn't have the capturing property, so it will save a bit of space and speed up a bit compared to (pattern) since it doesn't store the start and end index of the string matching the pattern inside.

Atomic grouping (?>pattern) also has the non-capturing property, so the position of the text matched inside will not be captured.

Atomic grouping adds property of atomic compared to capturing or non-capturing group. Atomic here means: at the current position, find the first sequence (first is defined by how the engine matches according to the pattern given) that matches the pattern inside atomic grouping and hold on to it (so backtracking is disallowed).

A group without atomicity will allow backtracking - it will still find the first sequence, then if the matching ahead fails, it will backtrack and find the next sequence, until a match for the entire regex expression is found or all possibilities are exhausted.

Example

Input string: bbabbbabbbbc
Pattern: /(?>.*)c/

The first match by .* is bbabbbabbbbc due to the greedy quantifier *. It will hold on to this match, disallowing c from matching. The matcher will retry at the next position to the end of the string, and the same thing happens. So nothing matches the regex at all.


Input string: bbabbbabbbbc
Pattern: /((?>.*)|b*)[ac]/, for testing /(((?>.*))|(b*))[ac]/

There are 3 matches to this regex, which are bba, bbba, bbbbc. If you use the 2nd regex, which is the same but with capturing groups added for debugging purpose, you can see that all the matches are result of matching b* inside.

You can see the backtracking behavior here.

  • Without the atomic grouping /(.*|b*)[ac]/, the string will have a single match which is the whole string, due to backtracking at the end to match [ac]. Note that the engine will go back to .* to backtrack by 1 character since it still have other possibilities.

    Pattern: /(.*|b*)[ac]/
    bbabbbabbbbc
    ^             -- Start matching. Look at first item in alternation: .*
    bbabbbabbbbc
                ^ -- First match of .*, due to greedy quantifier
    bbabbbabbbbc
                X -- [ac] cannot match
                  -- Backtrack to ()      
    bbabbbabbbbc
               ^  -- Continue explore other possibility with .*
                  -- Step back 1 character
    bbabbbabbbbc
                ^ -- [ac] matches, end of regex, a match is found
    
  • With the atomic grouping, all possibilities of .* is cut off and limited to the first match. So after greedily eating the whole string and fail to match, the engine have to go for the b* pattern, where it successfully finds a match to the regex.

    Pattern: /((?>.*)|b*)[ac]/
    bbabbbabbbbc
    ^             -- Start matching. Look at first item in alternation: (?>.*)
    bbabbbabbbbc
                ^ -- First match of .*, due to greedy quantifier
                  -- The atomic grouping will disallow .* to be backtracked and rematched
    bbabbbabbbbc
                X -- [ac] cannot match
                  -- Backtrack to ()
                  -- (?>.*) is atomic, check the next possibility by alternation: b*
    bbabbbabbbbc
    ^             -- Starting to rematch with b*
    bbabbbabbbbc
      ^           -- First match with b*, due to greedy quantifier
    bbabbbabbbbc
       ^          -- [ac] matches, end of regex, a match is found
    

    The subsequent matches will continue on from here.

nhahtdh
  • 52,949
  • 15
  • 113
  • 149
  • perfect example you have given,now just for clarity - could you add with that same example,what if it is `non-atomic` matching? – Arup Rakshit Jan 19 '13 at 08:12
  • @RubyTheBang: I actually was writing it. Check the 2nd example to see if it clears things up. – nhahtdh Jan 19 '13 at 08:14
  • Yes,I just finished. Can your last part be a little bit visualized - to see and clear view how regex taking the pattern with input string produces the output - `` . just asking to see during `backtracking` how does `regex` changing its decisions(means how it decides which one to take and which one to reject and when rjects how it covers and looking for the next correct pattern.) – Arup Rakshit Jan 19 '13 at 08:27
11

I recently had to explain Atomic Groups to someone else and I thought I'd tweak and share the example here.

Consider the (big|small|biggest) (cat|dog|bird).

Matches in bold

  • the big dog
  • the small bird
  • the biggest dog
  • the small cat

For the first line, a regex engine would find the . It would then proceed on to our adjectives (big, small, biggest), it finds big. Having matched "big", it proceeds and finds the space. It then looks at our pets (cat, dog, bird) and finds cat, skips it, and finds dog.

For the second line, our regex would find the . It would proceed and look at big, skip it, look at and find small. It then finds " ". It looks at "cat", skips it, looks at "dog", skips it, and finds "bird".

For the third line, our regex would find the , It continues on and finds big which matches the immediate requirement, and proceeds. It can't find the space, so it backtracks (rewinds the position to the last choice it made). It skips big, looks at small and skips it. It finds biggest which also matches the immediate requirement. It then finds " ". It looks at cat and skips it, and matches dog.

For the fourth line, our regex would find the . It would proceed to look at big, skip it, look at and find small. It then finds " ". It looks at and matches cat.

Consider the (?>big|small|biggest) (cat|dog|bird)

Note the ?> atomic group on adjectives.

Matches in bold

  • the big dog
  • the small bird
  • the biggest dog
  • the small cat

For the first line, second line, and fourth line, our engine functions the same way.

For the third line, our regex would find the , It continues on and find "big" which matches the immediate requirement, and proceeds. It can't find the space, but the atomic group, being the last choice the engine made, won't allow that choice to be re-examined (prohibits backtracking). Since it can't make a new choice, the match has to fail, since our simple expression has no other choices.

This is only a basic summary. An engine wouldn't need to look at the entirety of cat to know that it doesn't match dog, merely looking at the c is enough. When trying to match bird, the c in cat and the d in dog are enough to tell the engine to examine other options.

However if you had ...((cat|snake)|dog|bird), the engine would also, of course, need to examine snake before it dropped to the previous group and examined dog and bird.

There are also plenty of choices an engine can't decide without going past what may not seem like a match. If you have ((red)?cat|dog|bird), The engine will look at "r", back out, notice the ? quantifier, ignore the subgroup (red), and look for a match.

Regular Jo
  • 4,348
  • 3
  • 18
  • 36
4

An "atomic group" is one where the regular expression will never backtrack past. So in your first example /a(?>bc|b)c/ if the bc alternation in the group matches, then it will never backtrack out of that and try the b alternation. If you slightly alter your first example to match against "abcdabcc" then you'll see it still matches the "abcc" at the end of the string instead of the "abc" at the start. If you don't use an atomic group, then it can backtrack past the bc and try the b alternation and end up matching the "abc" at the start.

As for question two, how it's different, that's just a rephrasing of your first question.

And lastly, atomic groups are not "called" non-capturing groups. That's not an alternate name for them. Non-capturing groups are groups that do not capture their content. Typically when you match a regular expression against a string, you can retrieve all the matched groups, and if you use a substitution, you can use backreferences in the substitution like \1 to insert the captured groups there. But a non-capturing group does not provide this. The classic non-capturing group is (?:pattern). An atomic group happens to also have the non-capturing property, hence why it's called a non-capturing group.

Lily Ballard
  • 169,315
  • 25
  • 364
  • 333
  • 1
    thanks to show your interest on my pain. Here is my confusion - `An "atomic group" is one where the regular expression will never backtrack past` - what backtracking is? How does it being performed by regular expressions without `atomic grouping`? --- This is my whole confusion area! Can you help me to visualize the same? – Arup Rakshit Jan 19 '13 at 07:39
  • 1
    @RubyTheBang: The regular expression engine works by matching each component part of the expression (called an "atom") in turn against the string. When it fails a match, it backtracks to a previous atom that was either a group with an alternation, or one modified by some form of repetition operator (`?`, `*`, `+`, or `{n[,m]}`). It then alters the match it chose for that atom and tries again. – Lily Ballard Jan 19 '13 at 07:42
  • @RubyTheBang: So when matching your expression against `"abcdabcc"` it first matches the `a` against `a`. It then matches the `(?>bc|c)` and picks the first alternation, `bc`. This also matches. It then tries to match `c`, but the input string has `d` instead. This fails. It starts backtracking, but when it hits the atomic group, it stops backtracking (due to atomicity). So it throws out the whole match and tries again starting from the `b`. Eventually it finds the `abcc` and finishes the match. – Lily Ballard Jan 19 '13 at 07:43
  • @RubyTheBang: But with `/a(bc|b)c/` as the pattern, when it fails to match that final `c`, it can backtrack and pick the second alternation in the group, the `b`, and match that. Now it can successfully match the final `c` in the pattern, to get the `"abc"` at the start of the string. – Lily Ballard Jan 19 '13 at 07:44
  • Can you help me to show the same in your post, being a newbie still fully confused on that part - eager to see how internally engine works on the same? – Arup Rakshit Jan 19 '13 at 07:45
  • @RubyTheBang: I haven't the foggiest idea what you mean by that last comment. – Lily Ballard Jan 19 '13 at 07:45
  • The link I pasted in the description mentioned that - `atomic groups are "called" non-capturing groups`.Thus I asked `why?` – Arup Rakshit Jan 19 '13 at 08:03
  • 1
    Hmm. Saying that the engine won't backtrack *past* an atomic group seems wrong, to me. That makes it sound like `/(aab|a)(?>a)ba/` won't match `aaba`, which is false; the engine will in fact happily backtrack *past* the atomic group to try changing the first group's match from `aab` to `a`. The engine won't try different matches for an atomic group while backtracking, but *will* backtrack *past* it. – Mark Amery Oct 31 '18 at 10:44