What is the difference between the three regex patterns below?

pattern1 = "^ab|cd$"
pattern2 = "^(ab|cd)$"
pattern3 = "^(ab)|(cd)$"

I am trying to write a regular expression that matches Roman numerals (0 to 3999). I wrote the following pattern:

pattern = "^M{1,3}|(CM|C?D|D?C{1,3})|(X?L|XC|L?X{1,3})|(I?V|IX|V?I{1,3})$"

This pattern matches strings such as "DIIII" or "XIIII", but I expected at most three consecutive I characters to be accepted.

Why does this happen?

slee

2 Answers

r"^ab|cd$"

Matches ab at the start or cd at the end. Note that this won't match an ab that appears in the middle or at the end of a line. Likewise, it won't match a cd that appears at the start or in the middle.

r"^(ab|cd)$"

Matches only when the whole line is exactly ab or cd. The matched text (ab or cd) is captured by a single group.

r"^(ab)|(cd)$"

Same as the first one, but it captures ab into group 1 or cd into group 2.
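The three descriptions above can be checked with a short Python sketch. The dictionary keys p1 to p3 and the helper show are names made up for this illustration; only re.search and the match-object methods come from the standard library.

```python
import re

# The three patterns from the question (keys are illustrative names).
patterns = {
    "p1": r"^ab|cd$",
    "p2": r"^(ab|cd)$",
    "p3": r"^(ab)|(cd)$",
}

def show(name, text):
    """Return (matched text, captured groups), or None if nothing matches."""
    m = re.search(patterns[name], text)
    return (m.group(0), m.groups()) if m else None

print(show("p1", "ab123"))  # ('ab', ())
print(show("p1", "123cd"))  # ('cd', ())
print(show("p1", "1ab2"))   # None: ab is neither at the start nor is cd at the end
print(show("p2", "ab123"))  # None: the whole string must be ab or cd
print(show("p2", "cd"))     # ('cd', ('cd',))
print(show("p3", "ab123"))  # ('ab', ('ab', None))
print(show("p3", "123cd"))  # ('cd', (None, 'cd'))
```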

Avinash Raj

^ab|cd$ means

  1. beginning of string, followed by ab or
  2. cd followed by end of string.

That is, ab123 matches because its leading ab satisfies alternative 1, and 123cd matches because its trailing cd satisfies alternative 2. This happens because the | symbol has the lowest precedence of all regex operators.

Used with .search, it is roughly the same as s.startswith('ab') or s.endswith('cd'). If you use .match instead of .search, the pattern must match at the beginning of the string, so you get s.startswith('ab') or s == 'cd'. (Only roughly, because $ also matches just before a trailing newline.)
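As a quick sanity check of that equivalence, here is a minimal sketch. The helper names via_search and via_match and the sample strings are invented for the example, and the samples deliberately contain no trailing newline.

```python
import re

pattern = re.compile(r"^ab|cd$")

def via_search(s):
    # .search may start matching at any position in s.
    return pattern.search(s) is not None

def via_match(s):
    # .match only tries to match starting at position 0.
    return pattern.match(s) is not None

samples = ["ab123", "123cd", "cd", "1ab2", "xcd"]
for s in samples:
    # Assumes s has no trailing newline ($ also matches just before one).
    assert via_search(s) == (s.startswith("ab") or s.endswith("cd"))
    assert via_match(s) == (s.startswith("ab") or s == "cd")
    print(s, via_search(s), via_match(s))
```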

^(ab|cd)$ means

  • beginning of string, followed by either ab or cd followed by end of string
  • whatever is matched with ab|cd is available as match.group(1)

^(ab)|(cd)$ means the same as the first one, except that if ab is matched, it is available as match.group(1), and if cd is matched, it is available as match.group(2).

Note that parentheses (...) serve two purposes in regular expressions: they group atoms into a single atom, and they make the matched text available in the match object. If you only need grouping, you should use (?:...) instead, as generating submatch strings can be expensive.
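A small sketch of the difference between the two kinds of group. The variable names are chosen for this example only.

```python
import re

capturing = re.match(r"^(ab|cd)$", "cd")
non_capturing = re.match(r"^(?:ab|cd)$", "cd")

# Both patterns accept the same strings...
print(capturing.group(0))       # cd
print(non_capturing.group(0))   # cd

# ...but only the capturing group records a submatch.
print(capturing.groups())       # ('cd',)
print(non_capturing.groups())   # ()
```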


As for your Roman numeral regular expression, the problem is that you have used | alternation incorrectly at the top level.

^M{1,3}|(CM|C?D|D?C{1,3})|(X?L|XC|L?X{1,3})|(I?V|IX|V?I{1,3})$

When used with .match (with .search, alternatives 2-4 are not even bound to the beginning of the string), it stands for

  1. beginning of string followed by M{1,3} and anything after that or
  2. beginning of string followed by CM|C?D|D?C{1,3} and anything after that or
  3. beginning of string followed by X?L|XC|L?X{1,3} and anything after that or
  4. beginning of string followed by I?V|IX|V?I{1,3} followed by end of string.
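The breakdown above can be observed directly. In this sketch the variable name original and the test string "zzIII" are picked for illustration; the pattern itself is the one from the question.

```python
import re

original = r"^M{1,3}|(CM|C?D|D?C{1,3})|(X?L|XC|L?X{1,3})|(I?V|IX|V?I{1,3})$"

for text in ["DIIII", "XIIII"]:
    m = re.match(original, text)
    # Only the first character is consumed; the trailing IIII is ignored.
    print(text, "->", repr(m.group(0)), m.groups())
# DIIII -> 'D' ('D', None, None)
# XIIII -> 'X' (None, 'X', None)

# With .search, the last alternative is tied only to the end of the string.
print(re.search(original, "zzIII").group(0))  # III
```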

You do not want to use | at the top level. Instead, make each of parts 1-4 optional with ?, and group them with non-capturing groups, (?:...), rather than capturing ones. Thus we get:

^(?:M{1,3})?(?:CM|C?D|D?C{1,3})?(?:X?L|XC|L?X{1,3})?(?:I?V|IX|V?I{1,3})?$

except that it still matches the empty string. To make it reject the empty string, you can use a zero-width positive lookahead to require that the string contains at least one character.

From Python docs,

(?=...) Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
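The example from the docs can be tried as follows. This is only a sketch to show that the lookahead checks the following text without including it in the match.

```python
import re

m = re.search(r"Isaac (?=Asimov)", "Isaac Asimov")
print(repr(m.group(0)))  # 'Isaac '  (Asimov is checked but not consumed)
print(m.end())           # 6

# No match when the text after 'Isaac ' is something else.
print(re.search(r"Isaac (?=Asimov)", "Isaac Newton"))  # None
```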

Thus we can put (?=.) right after the beginning-of-string anchor ^, to ensure that the string at least matches the pattern ^. (that is, the beginning of the string followed by any one character):

^(?=.)(?:M{1,3})?(?:CM|C?D|D?C{1,3})?(?:X?L|XC|L?X{1,3})?(?:I?V|IX|V?I{1,3})?$

which means:

  • beginning of string (^)
  • there is at least 1 character before end of string ((?=.))
  • M{1,3} (optional), followed by
  • CM|C?D|D?C{1,3} (optional), followed by
  • X?L|XC|L?X{1,3} (optional), followed by
  • I?V|IX|V?I{1,3} (optional), followed by
  • end of string $

which should be what you wanted.
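A minimal check of the final pattern might look like this. The names ROMAN and is_roman and the sample lists are assumptions made for the example, not part of the original answer.

```python
import re

ROMAN = re.compile(
    r"^(?=.)(?:M{1,3})?(?:CM|C?D|D?C{1,3})?(?:X?L|XC|L?X{1,3})?(?:I?V|IX|V?I{1,3})?$"
)

def is_roman(s):
    """True if s is accepted by the pattern above."""
    return ROMAN.match(s) is not None

valid = ["I", "IV", "IX", "XIV", "XL", "XC", "CD", "CM", "MCMXCIV", "MMMCMXCIX"]
invalid = ["", "DIIII", "XIIII", "IIII", "VV", "MMMM", "IC"]

print(all(is_roman(s) for s in valid))        # True
print(any(is_roman(s) for s in invalid))      # False
```

Note that $ also matches just before a trailing newline, so re.fullmatch (without the anchors) may be preferable if such input must be rejected.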

Antti Haapala
  • Your answer clearly explained my problem. But I still don't get the usage of (?=). and what's the difference between "(?=.)a" and "a(?=.)"? – slee Feb 12 '15 at 08:10
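One way to see the difference the comment asks about is to run both patterns on the same inputs. This is a sketch for illustration, not part of the original thread: in "(?=.)a" the lookahead inspects the same character that a then consumes, while in "a(?=.)" it inspects the character after a.

```python
import re

# (?=.)a : "some character is here" and "that character is a" -> same as plain a.
print(re.search(r"(?=.)a", "a").group(0))   # a

# a(?=.) : a must be followed by at least one more character.
print(re.search(r"a(?=.)", "a"))            # None
print(re.search(r"a(?=.)", "ab").group(0))  # a  (b is checked, not consumed)
```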