11

I'm trying to write a regex that matches xa?b?c? but not x. In reality, 'x', 'a', 'b', and 'c' are not single characters, they are moderately complex sub-expressions, so I'm trying to avoid something like x(abc|ab|ac|bc|a|b|c). Is there a simple way to match "at least one of a, b, and c, in that order" in a regex, or am I out of luck?

So8res
  • How are you using the regex: matching a whole string, or plucking matches out of some larger text? – Alan Moore Nov 12 '10 at 20:33
  • Plucking matches out of a larger text. Fortunately, there's a bunch of great answers below that deal with both possibilities. Many thanks, everybody! – So8res Nov 15 '10 at 01:31

7 Answers

11

Here’s the shortest version:

(a)?(b)?(c)?(?(1)|(?(2)|(?(3)|(*FAIL))))

If you need to keep around the match in a separate group, write this:

((a)?(b)?(c)?)(?(2)|(?(3)|(?(4)|(*FAIL))))

But that isn’t very robust in case a, b, or c contain capture groups. So instead write this:

(?<A>a)?(?<B>b)?(?<C>c)?(?(<A>)|(?(<B>)|(?(<C>)|(*FAIL))))

And if you need a group for the whole match, then write this:

(?<M>(?<A>a)?(?<B>b)?(?<C>c)?(?(<A>)|(?(<B>)|(?(<C>)|(*FAIL)))))

And if like me you prefer multi-lettered identifiers and also think this sort of thing is insane without being in /x mode, write this:

(?x)
(?<Whole_Match>
    (?<Group_A> a) ?
    (?<Group_B> b) ?  
    (?<Group_C> c) ?

    (?(<Group_A>)           # Succeed 
      | (?(<Group_B>)       # Succeed
          | (?(<Group_C>)   # Succeed
              |             (*FAIL)
            )
        )
    )
 )

And here is the full testing program to prove that those all work:

#!/usr/bin/perl
use 5.010_000;

my @pats = (
    qr/(a)?(b)?(c)?(?(1)|(?(2)|(?(3)|(*FAIL))))/,
    qr/((a)?(b)?(c)?)(?(2)|(?(3)|(?(4)|(*FAIL))))/,
    qr/(?<A>a)?(?<B>b)?(?<C>c)?(?(<A>)|(?(<B>)|(?(<C>)|(*FAIL))))/,
    qr/(?<M>(?<A>a)?(?<B>b)?(?<C>c)?(?(<A>)|(?(<B>)|(?(<C>)|(*FAIL)))))/,
    qr{
        (?<Whole_Match>

            (?<Group_A> a) ?
            (?<Group_B> b) ?
            (?<Group_C> c) ?

            (?(<Group_A>)               # Succeed
              | (?(<Group_B>)           # Succeed
                  | (?(<Group_C>)       # Succeed
                      |                 (*FAIL)
                    )
                )
            )

        )
    }x,
);

for my $pat (@pats) {
    say "\nTESTING $pat";
    $_ = "i can match bad crabcatchers from 34 bc and call a cab";
    while (/$pat/g) {
        say "$`<$&>$'";
    }
}

All five versions produce this output:

i <c>an match bad crabcatchers from 34 bc and call a cab
i c<a>n match bad crabcatchers from 34 bc and call a cab
i can m<a>tch bad crabcatchers from 34 bc and call a cab
i can mat<c>h bad crabcatchers from 34 bc and call a cab
i can match <b>ad crabcatchers from 34 bc and call a cab
i can match b<a>d crabcatchers from 34 bc and call a cab
i can match bad <c>rabcatchers from 34 bc and call a cab
i can match bad cr<abc>atchers from 34 bc and call a cab
i can match bad crabc<a>tchers from 34 bc and call a cab
i can match bad crabcat<c>hers from 34 bc and call a cab
i can match bad crabcatchers from 34 <bc> and call a cab
i can match bad crabcatchers from 34 bc <a>nd call a cab
i can match bad crabcatchers from 34 bc and <c>all a cab
i can match bad crabcatchers from 34 bc and c<a>ll a cab
i can match bad crabcatchers from 34 bc and call <a> cab
i can match bad crabcatchers from 34 bc and call a <c>ab
i can match bad crabcatchers from 34 bc and call a c<ab>

Sweet, eh?

EDIT: For the x in the beginning part, just put whatever x you want at the start of the match, before the very first optional capture group for the a part, so like this:

x(a)?(b)?(c)?(?(1)|(?(2)|(?(3)|(*FAIL))))

or like this

(?x)                        # enable non-insane mode

(?<Whole_Match>
    x                       # first match some leader string

    # now match a, b, and c, in that order, and each optional
    (?<Group_A> a ) ?
    (?<Group_B> b ) ?  
    (?<Group_C> c ) ?

    # now make sure we got at least one of a, b, or c
    (?(<Group_A>)           # SUCCEED!
      | (?(<Group_B>)       # SUCCEED!
          | (?(<Group_C>)   # SUCCEED!
              |             (*FAIL)
            )
        )
    )
)

The test sentence was constructed without the x part, so it won’t work for that, but I think I’ve shown how I mean to go at this. Note that all of x, a, b, and c can be arbitrarily complex patterns (yes, even recursive), not merely single letters, and it doesn’t matter if they use numbered capture groups of their own, even.
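For readers outside Perl and PCRE, here is a rough Python sketch of the same conditional idea. Python's `re` module supports conditional groups but has no `(*FAIL)` verb, so an empty negative lookahead `(?!)`, which can never match, is swapped in. The helper name `at_least_one` and the sample strings are made up for illustration.

```python
import re

# Python port of x(a)?(b)?(c)?(?(1)|(?(2)|(?(3)|(*FAIL)))).
# Assumption: (?!) stands in for Perl's (*FAIL), which Python's re lacks.
PATTERN = re.compile(r"x(a)?(b)?(c)?(?(1)|(?(2)|(?(3)|(?!))))")


def at_least_one(text):
    """Return the matched text, or None if no x followed by a, b, or c is found."""
    m = PATTERN.search(text)
    return m.group(0) if m else None


for sample in ("x", "xa", "xbc", "xabc", "xcb"):
    print(sample, "->", at_least_one(sample))
```

A bare `x` is rejected because all three conditionals fall through to the always-failing branch; `xcb` matches only `xc`, since `b` after `c` is out of order.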

If you want to go at this with lookaheads, you can do this:

(?x)

(?(DEFINE)
       (?<Group_A> a)
       (?<Group_B> b)
       (?<Group_C> c)
)

x

(?= (?&Group_A)
  | (?&Group_B)
  | (?&Group_C)
)

(?&Group_A) ?
(?&Group_B) ?
(?&Group_C) ?

And here is what to add to the @pats array in the test program to show that this approach also works:

qr{
    (?(DEFINE)
        (?<Group_A> a)
        (?<Group_B> b)
        (?<Group_C> c)
    )

    (?= (?&Group_A)
      | (?&Group_B)
      | (?&Group_C)
    )

    (?&Group_A) ?
    (?&Group_B) ?
    (?&Group_C) ?
}x

You’ll notice please that I still manage never to repeat any of a, b, or c, even with the lookahead technique.
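Python's `re` module has neither `(?(DEFINE)...)` nor `(?&name)` calls, so the following is only an approximation of the lookahead approach: plain string interpolation is swapped in so that each sub-expression is written once in the source, even though the compiled pattern text contains it twice. The names `X`, `A`, `B`, `C`, and `PATTERN` are placeholders.

```python
import re

# Hypothetical stand-ins for the real sub-expressions; each is written once.
X, A, B, C = r"x", r"a", r"b", r"c"

# The lookahead requires one of A, B, C right after X; the optional groups
# then consume as many of them as appear, in order.
PATTERN = re.compile(
    rf"{X}(?=(?:{A})|(?:{B})|(?:{C}))(?:{A})?(?:{B})?(?:{C})?"
)

for sample in ("x", "xa", "xbc", "xabc", "xcb"):
    m = PATTERN.search(sample)
    print(sample, "->", m.group(0) if m else None)
```

Wrapping each interpolated piece in `(?:...)` keeps alternations inside a sub-expression from leaking into the surrounding pattern.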

Do I win? ☺

tchrist
  • +1 for nice example sentence and for a functioning implementation. interesting. BTW what do we do with the `x` from the original question? – LarsH Nov 12 '10 at 21:59
  • @LarsH: The `x` in the original question is just leading crud in the string. It’s not part of the match. I didn’t put in any anchors, which is why I can loop through a progressive match to get all the matches one set at a time. – tchrist Nov 12 '10 at 22:07
  • @LarsH, @Nate: I just saw what the `x` is there. Let me put it back in. – tchrist Nov 12 '10 at 22:17
  • @tchrist: How do you know the OP is looking for a Perl-specific solution? I don't see anything to indicate that. – Alan Moore Nov 13 '10 at 05:58
  • @Alan Moore: I don't. He asked for a regex. If they don’t specify a language, that’s not **my** fault. And of course I’ll write in Perl; what would you expect? Sure, the test program is in Perl, but that's because it's a lot easier to put that together than a C program for PCRE — in which my solution also works perfectly well. – tchrist Nov 13 '10 at 10:50
  • @Alan Moore: It *may* actually work elsewhere, too, like PHP using its `preg` stuff. The problem is that I can’t seem to divine which version of PCRE PHP supports! Have you any idea? PCRE’s `pcre_version()` function returns `"8.10 2010-06-25"` on my system. I don’t have PHP installed, but perhaps it’s just a matter of linking to a properly built `libpcre.so`. By “properly built”, I minimally mean one for which `PCRE_CONFIG_UTF8 = 1` and `PCRE_CONFIG_UNICODE_PROPERTIES = 1` according to its `pcre_config()` function. – tchrist Nov 13 '10 at 15:39
  • In my experience, if someone posts a regex question here on SO and doesn't mention a flavor, it usually means they aren't aware there **are** different flavors. Your regexes may work in PHP or Flex/ActionScript, which happen to use the PCRE library, but the OP could just as easily be targeting Python, .NET, Java, JavaScript or Ruby. They could even be talking about the search/replace feature in an editor; Notepad++ and Visual Studio come up pretty often--but that's usually people who *do* know some regex, wondering why nothing works. :-/ – Alan Moore Nov 13 '10 at 16:27
  • @Alan Moore: Then should we just forbid `regex` without another tag? That seems wrong, because no tag is supposed to *not* be able to be standalone. And yet, what you’re saying sounds like we cannot answer anything at all, since we don’t even know simple things like BRE vs ERE dialects. Why shouldn’t people just give it their best shot? If they do not specify, I’ll give a Perl solution every single time. I think I made it pretty clear that I was providing Perl code, too. – tchrist Nov 13 '10 at 16:40
  • No, you do just have to make your best guess much of the time. Without any other clues, I think it's safe to assume a Perl-derivative flavor with the most common features, like lookaheads and reluctant quantifiers (e.g., JavaScript). If the solution requires more esoteric features (as this one does), I try to get more info from the OP, suggest a non-regex alternative, or make it very clear which flavors the solution will work in. But I *definitely* don't want to discourage you from posting answers like this one--this is great stuff! – Alan Moore Nov 13 '10 at 19:55
  • As for your question about PCRE versions: I have no idea. I don't do PHP myself, I just use online testers and pastebins to test my answers. (I didn't see that comment when I posted my next one.) – Alan Moore Nov 13 '10 at 20:04
5

Not trivially, if you don't have lookahead.

x(ab?c?|bc?|c)
Ignacio Vazquez-Abrams
  • Vazquez: You are kind of right, in that it is not “trivial” without lookahead, although it is still *possible* using a couple of different approaches. I give two solutions, one with lookahead and one without, while @Tim Pietzcker gives one using backrefs “creatively”. The problem with your answer is that the OP requested something that didn’t repeat `a`, `b`, or `c`, and your answer did that. So it does not seem to answer the question as posed! – tchrist Nov 13 '10 at 15:43
5

How about this:

x(?:a())?(?:b())?(?:c())?(\1|\2|\3)

The empty capturing groups after a, b and c will always match (an empty string) if a, b or c match, in that order.

The (\1|\2|\3) part will only match if at least one of the preceding groups participated in the match. So if you just have x, the regex fails.

Every part of the regex will be evaluated just once.

Of course, if x, a, b and c are more complex subexpressions that contain capturing groups themselves, you have to adjust the numbers of the backreferences accordingly*.

Since this regex does look a bit strange, here's the verbose version:

x          # Match x
(?:a())?   # Try to match a. If this succeeds, \1 will contain an empty string.
(?:b())?   # Same with b and \2.
(?:c())?   # Same with c and \3.
(\1|\2|\3) # Now try to match the content of one of the backreferences. 
           # This works if one of the empty parentheses participated in the match.
           # If so, the backref contains an empty string which always matches. 
           # Bingo!

You might need to surround this with anchors (^ and $) unless you don't mind it matching xb within the string cxba etc.

For example, in Python:

>>> r=re.compile(r"x(?:a())?(?:b())?(?:c())?(\1|\2|\3)$")
>>> for test in ("x", "xa", "xabc", "xba"):
...     m = r.match(test)
...     if m:
...         print("{} --> {}".format(test, m.group(0)))
...     else:
...         print("{} --> no match".format(test))
...
x --> no match
xa --> xa
xabc --> xabc
xba --> no match

*or, if your regex flavor knows named capturing groups, you can use those, for example

x(?:a(?P<a>))?(?:b(?P<b>))?(?:c(?P<c>))?((?P=a)|(?P=b)|(?P=c))

in Python/PCRE. In .NET (and possibly other flavors), it's even legal to have several capturing groups that use the same name, making another simplification possible:

x(?:a(?<m>))?(?:b(?<m>))?(?:c(?<m>))?\k<m>
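To see why the named form helps, here is a small Python sketch in which the `a` sub-expression carries a capturing group of its own, which would shift the numbers in the `\1|\2|\3` version. The sub-expressions and the group names `got_a`, `got_b`, and `got_c` are invented for the example, and the trailing `$` follows the anchoring advice above.

```python
import re

# Hypothetical sub-expressions; A contains a capturing group of its own.
X, A, B, C = r"x", r"(a+)", r"b", r"c"

# Each empty named group is only set when its sub-expression matched, so the
# final alternation succeeds only if at least one of A, B, C participated.
PATTERN = re.compile(
    rf"{X}(?:{A}(?P<got_a>))?(?:{B}(?P<got_b>))?(?:{C}(?P<got_c>))?"
    r"(?:(?P=got_a)|(?P=got_b)|(?P=got_c))$"
)

for sample in ("x", "xaa", "xaab", "xc", "xba"):
    m = PATTERN.match(sample)
    print(sample, "->", m.group(0) if m else "no match")
```

The inner group in `A` still gets number 1, but nothing in the pattern refers to groups by number, so it no longer matters.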
Tim Pietzcker
  • @Tim, wow, looks pretty interesting. You might need to explain that `(?:regex)` is a non-capturing group. Also, are you sure the regex fails if you just have `x`? "If a backreference was not used in a particular match attempt ..., it is simply empty. Using an empty backreference in the regex is perfectly fine. It will simply be replaced with nothingness." (http://www.regular-expressions.info/brackets.html) – LarsH Nov 12 '10 at 21:54
  • @LarsH: Yes, the regex fails if there is just an `x`. This works (at least) in .NET, Java, Perl, Python, Ruby, and PCRE (PHP). From the Regular Expressions Cookbook, p. 304: "Since any attempt to match a backreference such as `\1` will fail if the corresponding capturing group has not yet participated in the match, backreferences to empty groups can be used to control the path a regex engine takes through a pattern." – Tim Pietzcker Nov 12 '10 at 22:20
  • @LarsH: The quote you linked to appears to be wrong (even though it's by the same author). `a(b)?\1` matches `abb` but not `a` or `ab`. – Tim Pietzcker Nov 12 '10 at 22:32
  • @Tim: thanks for the clarification. Weird that he would contradict himself like that. Are you going to send him feedback? (http://www.regular-expressions.info/about.html) – LarsH Nov 12 '10 at 22:40
  • @LarsH, @Tim: There's no contradiction. `\1` matches whatever the first capturing group matched. In `a(b)\1` that's 'b', but in Tim's answer and the examples from the book, it's an empty string. I started to suggest something like this, but I was waiting for Nate's answer to my comment. There's no "might" about it: you have to have *some* way to anchor the match for this to work. – Alan Moore Nov 12 '10 at 23:34
  • @Alan Moore: Jan Goyvaerts writes in the tutorial linked above that if the regex `Set(Value)?` is applied to `Set`, backreference 1 will simply be empty (this is true), so referring to it again in the regex is OK - but that isn't the case: `Set(Value)?\1` does not match `Set`. It doesn't matter if the parentheses contain something or not - if a backreference is used (within a regex) that refers to a part of the regex that did not participate in the match at all, the entire regex fails. – Tim Pietzcker Nov 13 '10 at 06:30
  • @Tim Pietzcker: Your solution is Quite Interesting. I’ve never relied on unset backrefs failing entirely instead of counting as the empty string and therefore always succeeding. Might I please trouble you to also provide a named-capture version so that it isn’t sensitive to embedded capture groups in `x`, `a`, `b`, and/or `c` throwing off the numbering scheme? Thanks. – tchrist Nov 13 '10 at 13:20
  • @TimPietzcker: Thanks! For named backrefs, PCRE supports `\k<NAME>`, `\k{NAME}`, `\g{NAME}`, and `(?P=NAME)`; Perl supports `\k<NAME>`, `\k'NAME'`, `\g{NAME}`, and `(?P=NAME)`. Me, I like `\k<NAME>` myself because it pairs well with `(?<NAME>...)` and `(?(<NAME>)...|...)`. I rather dislike the lengthy Python version. After the match in Perl, you access named captures in for example `$+{NAME}` for the left-most capture named NAME, and via `$-{NAME}` for an array ref to all captures named NAME. Not sure what Python does with multiples though. Have you tried the `(?|...)` branch-reset for this? – tchrist Nov 13 '10 at 15:13
  • @tchrist: Interesting. It's becoming more and more apparent that regular-expressions.info needs to be updated. Python doesn't allow multiples; I've added a version for .NET that uses them. – Tim Pietzcker Nov 13 '10 at 15:21
  • @Tim Pietzcker: I agree completely regarding the *regular-expressions.info* site. I’ve attempted to contact its author, but get no response. People who hide their email addresses make this quite frustrating (mine is a matter of public record). BTW, is there any reason you don’t use more readable patterns in Python via `r"""(?x)…"""`? True, that does add more characters to a syntax that’s already too punctuation-heavy, but I think it’s worth it. Of course, in Perl you just go from `/…/` to `/…/x`, so it’s easier there. In perl6, `/x` mode is the *default* for patterns, **THANK GOODNESS**! ☺ – tchrist Nov 13 '10 at 15:49
  • @LarsH, @tchrist: I just got a reply from Jan Goyvaerts, and he has removed the incorrect statements LarsH quoted in his first comment from his website. The statement was correct for JavaScript but wrong for most other regex flavors. – Tim Pietzcker Nov 25 '10 at 08:07
  • @Tim: glad to know I wasn't just misunderstanding the statements on his site. Thanks for following up on that. – LarsH Nov 26 '10 at 03:53
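
The backreference behaviour debated in the comments above can be checked directly. This is a small sketch for Python's `re` module only (other flavors, notably JavaScript, differ, as noted above); the helper name `matches` is made up.

```python
import re


def matches(pattern, text):
    """True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


# A backreference to a group that did not participate fails outright;
# it is not treated as an empty string.
print(matches(r"a(b)?\1", "abb"))         # group 1 is "b", and \1 matches the second "b"
print(matches(r"a(b)?\1", "ab"))          # no second "b" for \1, and skipping (b) leaves \1 unset
print(matches(r"a(b)?\1", "a"))           # group 1 never participated
print(matches(r"Set(Value)?\1", "Set"))   # same situation as the tutorial example

# An empty group that did participate yields an empty backreference, which matches.
print(matches(r"a(?:b())?\1", "ab"))
print(matches(r"a(?:b())?\1", "a"))
```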
3

How about something like

x(?=[abc])a?b?c?
Blindy
  • Pretty slick, but requires repeating a, b, and c once: is there a way to do it without repeating a, b, and c, or is that too much to ask for? – So8res Nov 12 '10 at 20:20
  • Nah, you need to repeat it because they're two different conditions. You need at least one a, b or c AND the specific pattern. – Blindy Nov 12 '10 at 20:21
  • @Blindy, the q says "In reality, 'x', 'a', 'b', and 'c' are not single characters, they are moderately complex sub-expressions". So instead of `[abc]` you need `(a|b|c)`. – LarsH Nov 12 '10 at 20:23
  • @LarsH, perhaps, but I'm pretty certain the OP understands that and how to fix it if he needs it, so I answered his specific question in the most performant way. – Blindy Nov 12 '10 at 20:26
  • @Nate, @Blindy: No, you **don’t** need to repeat `a`, `b`, or `c` to solve this. There are even a couple of different ways to go about that. – tchrist Nov 13 '10 at 15:21
2

If you absolutely must not repeat a, b, or c, then this is the shortest, simplest regex--provided that x represents a fixed-length expression, or that the implementation you are using supports a variable-length one. It uses a negative look-behind, and Perl, for example, will die on a variable length look-behind.

Basically, it's what you are saying, rephrased:

/(x)a?b?c?(?<!x)/;

Here's what it says: I want to match xa?b?c? but when I consider it I don't want the last expression to have been x.

In addition, it will not work if the match for a, b, or c ends with x. (hat-tip: tchrist)
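
As a rough illustration in Python (which, like Perl, only accepts fixed-width look-behind), the sketch below shows both the normal behaviour and the caveat. The second pattern uses a made-up `c` sub-expression, `cx`, that ends with the same text as `x`.

```python
import re

# The look-behind rejects a match whose last consumed text was x itself.
PATTERN = re.compile(r"xa?b?c?(?<!x)")

for sample in ("x", "xa", "xbc", "xabc"):
    m = PATTERN.search(sample)
    print(sample, "->", m.group(0) if m else None)

# Caveat: here the hypothetical c sub-expression is "cx", which ends with x,
# so the look-behind wrongly rejects a string that does contain x followed by c.
CAVEAT = re.compile(r"xa?b?(?:cx)?(?<!x)")
print("xcx", "->", CAVEAT.search("xcx"))
```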

Axeman
1

Here's the shortest I could come up with:

x(ab?c?|bc?|c)

I believe it matches the criteria while minimising repetition (although there is some). It also avoids using any look-aheads or other processor-intensive expressions, which is probably more valuable than saving regex string length.

This version repeats c three times. You could adapt it so that either a or b is the one repeated most often, so you could choose the shortest of a, b and c to be the one to be repeated three times.
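
One way to gain confidence in this is to compare it against the brute-force alternation on every short input. The Python sketch below does that for single-letter stand-ins; `MIRRORED` is one possible adaptation (an assumption on my part, not the only one) that writes `a` three times and `c` once instead.

```python
import re
from itertools import product

COMPACT = re.compile(r"x(ab?c?|bc?|c)")
BRUTE = re.compile(r"x(abc|ab|ac|bc|a|b|c)")
# Variant that repeats a most often and c least often.
MIRRORED = re.compile(r"x(a?b?c|a?b|a)")


def disagreements(candidate):
    """Return inputs on which candidate and BRUTE differ as whole-string matches."""
    bad = []
    for n in range(0, 4):
        for letters in product("abc", repeat=n):
            text = "x" + "".join(letters)
            if bool(candidate.fullmatch(text)) != bool(BRUTE.fullmatch(text)):
                bad.append(text)
    return bad


print(disagreements(COMPACT))
print(disagreements(MIRRORED))
```

Both calls should report an empty list, meaning each variant accepts exactly the same whole strings as the brute-force pattern over these inputs.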

Spudley
  • That answer is poor because it repeats `b` once and `c` twice, something the OP said he was trying to avoid. At least three solutions solve this without the repeats, but yours is not one of these. ☹ – tchrist Nov 13 '10 at 15:46
  • @tchrist: the OP said he was trying to avoid repeating every possible combination; he didn't say he wouldn't consider any repetition at all. Mine is considerably shorter than his original solution. Granted there are others who manage to avoid the repetition, but I believe these may end up being more complex than they appear since they require back-references. – Spudley Nov 13 '10 at 17:12
  • I don’t use backreferences in mine. – tchrist Nov 15 '10 at 01:25
0

If you don't need to find a maximal (greedy) match, you can drop the "in that order", because if you match x(a|b|c) and ignore any following text you have already matched "at least one of a, b, and c, in that order". In other words, if all you need is a true/false answer (does it match or not), then x(a|b|c) is sufficient. (Another assumption: that you are trying to determine whether the input string contains a match, not whether the whole string matches the regexp. I.e. see @Alan Moore's question.)

However if you want to identify a maximal match, or match against the entire input string, you can use lookahead: x(?=(a|b|c))a?b?c?

There is some redundancy there but a lot less than the combinatorial approach you were trying to avoid.
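
The difference between the two patterns shows up in what they capture rather than in whether they match. A minimal Python sketch, with single letters standing in for the real sub-expressions and a made-up helper `found`:

```python
import re

EXISTS = re.compile(r"x(a|b|c)")              # enough for a yes/no answer
MAXIMAL = re.compile(r"x(?=(a|b|c))a?b?c?")   # lookahead, then greedy match


def found(pattern, text):
    """Return the matched text, or None when the pattern is absent."""
    m = pattern.search(text)
    return m.group(0) if m else None


text = "say xabc now"
print(found(EXISTS, text))    # stops after the first of a, b, c
print(found(MAXIMAL, text))   # takes the whole run in order
print(found(EXISTS, "lone x"), found(MAXIMAL, "lone x"))
```

Both patterns agree on whether a match exists; only the second returns the full `xabc`.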

LarsH