Difference between the [a-z][a-z]* and [a-z]+ regular expressions

Question

What is the difference between the following regular expressions? To me they both look the same.

[a-z][a-z]* Vs [a-z]+
[a-z][a-z]* Vs [a-z]*[a-z]

Please put the regexes between ` characters. Right now it's hard to read your question because you probably didn't intend the italics. — Tom van der Woerdt, Dec 25 '12 at 23:00
@Abraham FYI, you've asked 7 questions and only accepted one of them. You really should accept more answers if you want people to answer your questions. — jdotjdot, Dec 26 '12 at 21:32

jdotjdot · Answer 1 · 2014-07-10T07:00:20.417

These regexes are identical, as you thought.

#1:

[a-zA-Z]  # exactly one alphabetic char
[a-zA-Z]* # 0 to infinite alphabetic chars

versus

[a-zA-Z]+ # 1 to infinite alphabetic chars

One is just 1 + [0, ∞) = [1, ∞) repetitions, the other is [1, ∞) directly.

Further comments

#2 works similarly. In each case, all you're doing is taking one instance of the repeated element (in your case, [a-zA-Z]) out of the repetition operator, * or +.

The answer below that points out that the more readable version is preferred is right on target. There is absolutely no reason to do something like [a-zA-Z]*[a-zA-Z] or [a-zA-Z][a-zA-Z]*, since ultimately they're both just [a-zA-Z]+.
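
If you want to convince yourself, you can run all three spellings over a handful of inputs and compare the matches. This is a minimal sketch using Python's re module; the sample strings and the helper name spans are made up for illustration, and a finite set of samples is a sanity check rather than a proof.

```python
import re

# The three spellings discussed above.
PATTERNS = [r'[a-zA-Z][a-zA-Z]*', r'[a-zA-Z]+', r'[a-zA-Z]*[a-zA-Z]']

# Arbitrary sample inputs (an assumption, chosen only for illustration).
SAMPLES = ['', 'a', 'abc', '2323hfjfkf 23023493', '123', 'x1y22zz']

def spans(pattern, text):
    """Return the (start, end) span of every non-overlapping match."""
    return [m.span() for m in re.finditer(pattern, text)]

for text in SAMPLES:
    results = [spans(p, text) for p in PATTERNS]
    # Every pattern should report exactly the same matches.
    assert results[0] == results[1] == results[2], (text, results)

print(spans(PATTERNS[1], '2323hfjfkf 23023493'))  # [(4, 10)]
```

The loop raises an AssertionError with the offending input if any two patterns ever disagree.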

TL;DR

All are the same, and any time you're repeating two identical elements in a row in a regex, you're doing something wrong.

Update:

$ python -m timeit -s "import re" "re.search(r'[a-zA-Z]*[a-zA-Z]', '2323hfjfkf 23023493')"
1000000 loops, best of 3: 1.14 usec per loop

$ python -m timeit -s "import re" "re.search(r'[a-zA-Z]+', '2323hfjfkf 23023493')"
1000000 loops, best of 3: 1 usec per loop

$ python -m timeit -s "import re" "re.search(r'[a-zA-Z][a-zA-Z]*', '2323hfjfkf 23023493')"
1000000 loops, best of 3: 0.956 usec per loop

Turns out that [a-zA-Z][a-zA-Z]* is marginally faster than using [a-zA-Z]+. I'm a little surprised, but frankly I don't think the loss in readability is worth the .05 microsecond gain in efficiency.
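
If you'd rather repeat the measurement from a script, something like the following should work. It is only a sketch: best_usec, number and repeat are names and values I picked for illustration, and unlike the command-line runs it compiles each pattern up front, so the numbers will not be directly comparable with the ones above. The timings also include the small overhead of the lambda call.

```python
import re
import timeit

TEXT = '2323hfjfkf 23023493'  # same input as the command-line runs
PATTERNS = [r'[a-zA-Z]*[a-zA-Z]', r'[a-zA-Z]+', r'[a-zA-Z][a-zA-Z]*']

def best_usec(pattern, number=20_000, repeat=5):
    """Best observed time per search call, in microseconds."""
    compiled = re.compile(pattern)  # compile once so only matching is timed
    totals = timeit.repeat(lambda: compiled.search(TEXT),
                           number=number, repeat=repeat)
    # Each total covers `number` calls; take the fastest repeat.
    return min(totals) / number * 1e6

timings = {p: best_usec(p) for p in PATTERNS}
for pattern, usec in timings.items():
    print(f'{pattern:22} {usec:.3f} usec per call')
```

Differences of a few hundredths of a microsecond can easily vary between runs, machines and Python versions, so treat any ranking as indicative only.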

It would be interesting to view this from the engine's point of view. In other words, how does the engine perform when given regexes that look different but produce the very same result? Is one alternative heavier on resources than the other, etc.? — Firas Dib, Dec 25 '12 at 23:30
+1 for performance comparison. I'm curious what the source of the difference in performance is. — MC Emperor, Dec 26 '12 at 09:49

score 1 · Answer 2 · answered Dec 25 '12 at 23:10


Functionally all these regular expressions are identical.

Using the + quantifier, though, may be problematic in some cases, because depending on the parser and its settings it may or may not need to be escaped (\+) in order to retain its special meaning. In POSIX basic regular expressions, for example, an unescaped + is a literal plus sign, and some tools such as GNU grep accept \+ as the quantifier instead. That is why some people avoid + and prefer the more explicit XX* form, which keeps their regular expressions more portable.

As far as Java is concerned, though, + always retains its special meaning, unless escaped.
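
Python's re module follows the same convention as Java in this respect, which is easy to check. A small sketch (the helper name matches and the test strings are arbitrary choices for illustration):

```python
import re

def matches(pattern, text):
    """True if the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# Unescaped: + is the one-or-more quantifier.
assert matches(r'[a-z]+', 'abc')
assert not matches(r'[a-z]+', '')

# Escaped: \+ is a literal plus sign, so this is one letter followed by '+'.
assert matches(r'[a-z]\+', 'a+')
assert not matches(r'[a-z]\+', 'abc')

# The XX* spelling avoids + entirely, so its meaning does not depend on
# how a given dialect treats that character.
assert matches(r'[a-z][a-z]*', 'abc')
assert not matches(r'[a-z][a-z]*', '')
```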

answered Dec 25 '12 at 23:10

thkala


Another reason to know about the equivalence is if you want to prove something about a given regex. It is used when translating a regular expression to a state machine with Thompson's construction, which doesn't need to deal with `+` at all because the regex can be rewritten with `*`. That's all academic though :-) – Will Dec 25 '12 at 23:34
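
The rewrite mentioned in the comment above can be sketched in a few lines of Python. This is a toy under stated assumptions, not part of any real regex engine: desugar_plus is a hypothetical helper, and it only understands atoms that are a bracket class such as [a-z] or a single letter or digit. It does not handle groups, escapes, or a + appearing inside a bracket class.

```python
import re

# Assumption: an atom is either a bracket class or one alphanumeric character.
ATOM_PLUS = re.compile(r'(\[[^\]]+\]|[A-Za-z0-9])\+')

def desugar_plus(pattern):
    """Rewrite every `X+` as `XX*` for the simple atoms described above."""
    return ATOM_PLUS.sub(lambda m: m.group(1) + m.group(1) + '*', pattern)

rewritten = desugar_plus('[a-z]+')
print(rewritten)  # [a-z][a-z]*

# The rewritten pattern accepts and rejects the same sample strings.
for text in ['', 'a', 'abc', 'ab1']:
    before = re.fullmatch('[a-z]+', text) is not None
    after = re.fullmatch(rewritten, text) is not None
    assert before == after
```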

score 0 · Answer 3 · answered Dec 25 '12 at 23:02


Yes, all four regular expressions are equivalent. [a-z]+ is the simplest one and should be chosen for readability.

answered Dec 25 '12 at 23:02

Phil Rykoff


score 0 · Answer 4 · answered Dec 25 '12 at 23:02


You're right that [a-zA-Z][a-zA-Z]* and [a-zA-Z]+ match exactly the same strings, so in that respect there's no difference. The main advantage [a-zA-Z]+ has over the other is that it's more readable (readability counts!).

answered Dec 25 '12 at 23:02

Danny Roberts


score 0 · Answer 5 · answered Dec 25 '12 at 23:06


Both are the same. Check out the reluctant quantifiers in the Java Pattern documentation. [a-zA-Z]+ is more readable for yourself and others.

answered Dec 25 '12 at 23:06

Srujan Kumar Gulla


score 0 · Answer 6 · answered Dec 25 '12 at 23:14

[a-zA-Z][a-zA-Z]* Vs [a-zA-Z]*[a-zA-Z]

I think the main difference between these regular expressions is that the first one will finish earlier than the second, because the tree walk needed to match [a-zA-Z][a-zA-Z]* consists of fewer steps than the one for [a-zA-Z]*[a-zA-Z].
