^ab|cd$
means
- beginning of string, followed by ab
or
- cd, followed by end of string.
That is, ab123 matches because its leading ab matches the first alternative (^ab), and 123cd matches because its trailing cd matches the second alternative (cd$). In other words, the | operator has the lowest precedence of all the regex operators.
Used with .search, it is roughly the same as s.startswith('ab') or s.endswith('cd') (roughly, because $ in Python also matches just before a trailing newline). If you use .match instead of .search, the pattern must match at the beginning of the string, so you get s.startswith('ab') or s == 'cd' instead.
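As a quick check of the precedence rule, here is a minimal sketch using the standard re module; the sample strings are made up for illustration.

```python
import re

pattern = re.compile(r"^ab|cd$")

# .search may start anywhere, so either alternative can match on its own.
print(bool(pattern.search("ab123")))  # True: ^ab matches at the start
print(bool(pattern.search("123cd")))  # True: cd$ matches at the end
print(bool(pattern.search("123ab")))  # False: ab is not at the start, no cd at the end

# .match must start at index 0, so cd$ can only succeed on the string 'cd'.
print(bool(pattern.match("123cd")))   # False
print(bool(pattern.match("cd")))      # True
```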
^(ab|cd)$
means
- beginning of string, followed by either ab or cd, followed by end of string
- whatever is matched by ab|cd is available as match.group(1)
^(ab)|(cd)$
means the same as the first one, except that if ab is matched, it is available as match.group(1); likewise, if cd is matched, it is available as match.group(2). The group that did not take part in the match is None.
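The difference between the two parenthesised forms can be seen in a short sketch; the variable names here are only illustrative.

```python
import re

# One group around the whole alternation: anchors apply to both branches.
grouped = re.compile(r"^(ab|cd)$")
m = grouped.match("cd")
print(m.group(1))                # cd
print(grouped.match("ab123"))    # None: the whole string must be ab or cd

# Two separate groups: still parsed as (^ab) or (cd$).
split = re.compile(r"^(ab)|(cd)$")
m1 = split.search("ab123")
print(m1.group(1), m1.group(2))  # ab None
m2 = split.search("123cd")
print(m2.group(1), m2.group(2))  # None cd
```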
Note that (...) serve two purposes in regular expressions: they group atoms into a single atom, and they also make the matched text available in the match object. If you just need grouping, you should use (?:...) instead, because capturing submatches adds overhead you do not need.
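To illustrate the two kinds of parentheses, this small sketch compares how many groups each compiled pattern reports; both patterns accept the same strings.

```python
import re

capturing = re.compile(r"^(ab|cd)$")
non_capturing = re.compile(r"^(?:ab|cd)$")

# Pattern.groups is the number of capturing groups in the pattern.
print(capturing.groups)                    # 1
print(non_capturing.groups)                # 0

# Both match 'cd', but only the capturing form records a submatch.
print(capturing.match("cd").groups())      # ('cd',)
print(non_capturing.match("cd").groups())  # ()
```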
As for the problem with your Roman numeral regular expression: you have used | branching incorrectly at the top level.
^M{1,3}|(CM|C?D|D?C{1,3})|(X?L|XC|L?X{1,3})|(I?V|IX|V?I{1,3})$
If used with .match (with .search, alternatives 2-4 are not even bound to the beginning of the string), it stands for
- beginning of string followed by
M{1,3}
and anything after that or
- beginning of string followed by
CM|C?D|D?C{1,3}
and anything after that or
- beginning of string followed by
X?L|XC|L?X{1,3}
and anything after that or
- beginning of string followed by
I?V|IX|V?I{1,3}
followed by end of string.
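The reading above can be confirmed with a short sketch that feeds the original pattern some strings that are clearly not Roman numerals; the inputs are invented for illustration.

```python
import re

broken = re.compile(
    r"^M{1,3}|(CM|C?D|D?C{1,3})|(X?L|XC|L?X{1,3})|(I?V|IX|V?I{1,3})$"
)

# With .match, the first alternatives accept any trailing garbage.
m1 = broken.match("Mxyz")
print(m1.group(0))   # M
m2 = broken.match("CMzzz")
print(m2.group(0))   # CM

# With .search, the middle alternatives are not anchored at all.
m3 = broken.search("zzzXLzzz")
print(m3.group(0))   # XL
```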
You do not want to use | at the top level; instead, make each of these four parts optional with ?. You would also generally want to group with non-capturing groups, (?:...), instead. Thus we get:
^(?:M{1,3})?(?:CM|C?D|D?C{1,3})?(?:X?L|XC|L?X{1,3})?(?:I?V|IX|V?I{1,3})?$
except that it still matches the empty string. To make it not match the empty string, you can use a zero-width positive lookahead to require that the whole construct matches at least one (any) character.
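A brief sketch of the all-optional pattern shows both the improvement and the remaining empty-string problem; the test strings are illustrative.

```python
import re

optional = re.compile(
    r"^(?:M{1,3})?(?:CM|C?D|D?C{1,3})?(?:X?L|XC|L?X{1,3})?(?:I?V|IX|V?I{1,3})?$"
)

print(bool(optional.match("MCMXCIV")))  # True: a valid numeral
print(bool(optional.match("IIII")))     # False: at most three I in a row
print(bool(optional.match("")))         # True: every part is optional, so '' slips through
```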
From Python docs,
(?=...)
Matches if ...
matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov)
will match 'Isaac '
only if it’s followed by 'Asimov'
.
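The example from the docs can be tried directly; this sketch only shows that the lookahead checks the following text without including it in the match.

```python
import re

lookahead = re.compile(r"Isaac (?=Asimov)")

hit = lookahead.search("Isaac Asimov")
print(repr(hit.group(0)))                # 'Isaac ' (Asimov is checked, not consumed)
print(lookahead.search("Isaac Newton"))  # None
```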
Thus we can put (?=.) right after the beginning-of-string anchor ^, to ensure that the whole string at least matches the pattern ^. (that is, beginning of string followed by one character):
^(?=.)(?:M{1,3})?(?:CM|C?D|D?C{1,3})?(?:X?L|XC|L?X{1,3})?(?:I?V|IX|V?I{1,3})?$
which means:
- beginning of string (^)
- there is at least 1 character before end of string ((?=.))
- M{1,3} (optional), followed by
- CM|C?D|D?C{1,3} (optional), followed by
- X?L|XC|L?X{1,3} (optional), followed by
- I?V|IX|V?I{1,3} (optional), followed by
- end of string ($)
which should be what you wanted.
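Putting it together, here is a minimal sketch that wraps the final pattern in a helper; the name is_roman is hypothetical and chosen only for this example.

```python
import re

ROMAN = re.compile(
    r"^(?=.)"                  # lookahead: at least one character
    r"(?:M{1,3})?"             # thousands
    r"(?:CM|C?D|D?C{1,3})?"    # hundreds
    r"(?:X?L|XC|L?X{1,3})?"    # tens
    r"(?:I?V|IX|V?I{1,3})?$"   # ones, then end of string
)

def is_roman(s):
    """Hypothetical helper: True if s matches the pattern above."""
    return ROMAN.match(s) is not None

print(is_roman("MCMXCIV"))  # True
print(is_roman("XLII"))     # True
print(is_roman(""))         # False: rejected by the lookahead
print(is_roman("IIII"))     # False
print(is_roman("MMMM"))     # False: at most three M
```

Note that $ still accepts a single trailing newline; if that matters, \Z in place of $ would be stricter.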