-1

What is the difference between the following regular expressions? To me, the two in each pair look the same.

  1. [a-z][a-z]* Vs [a-z]+
  2. [a-z][a-z]* Vs [a-z]*[a-z]
jdotjdot
Abraham Guchi

6 Answers

6

These regexes are equivalent (they match exactly the same strings), as you thought.

#1:

[a-zA-Z]  # exactly one alphabetic char
[a-zA-Z]* # 0 to infinite alphabetic chars

versus

[a-zA-Z]+ # 1 to infinite alphabetic chars

In terms of repetition counts, the first is 1 + [0, infinity) = [1, infinity), and the second is [1, infinity) directly.

Further comments

#2 works similarly: all you're doing in each case is pulling one instance of the repeated character class (in your case, [a-zA-Z]) out of the quantified part, whether the quantifier is * or +.

The answer below pointing out that the more readable version is preferred is right on target. There is absolutely no reason to write something like [a-zA-Z]*[a-zA-Z] or [a-zA-Z][a-zA-Z]*, since ultimately they're both just [a-zA-Z]+.

TL;DR

All are the same, and any time you're repeating two identical tokens in a row in a regex, you're probably doing something wrong.
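As a quick sanity check of the equivalence claimed above, here is a minimal sketch that compares the three patterns from the question on every string up to a small length. The alphabet, the length limit, and the helper names `accepts` and `all_agree` are my own choices for illustration, and a brute-force check like this is evidence, not a proof.

```python
import itertools
import re

# The three distinct patterns from the question.
PATTERNS = [r"[a-z][a-z]*", r"[a-z]+", r"[a-z]*[a-z]"]

# Hypothetical tiny alphabet: two letters and one non-letter.
ALPHABET = "ab1"


def accepts(pattern, text):
    """Return True if the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


def all_agree(max_len=4):
    """Check that every pattern gives the same verdict on every short string."""
    for length in range(max_len + 1):
        for chars in itertools.product(ALPHABET, repeat=length):
            text = "".join(chars)
            verdicts = {accepts(pattern, text) for pattern in PATTERNS}
            if len(verdicts) != 1:
                return False
    return True


print(all_agree())
```

The empty string is included on purpose: all three patterns require at least one letter, so all three reject it.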

Update:

$ python -m timeit -s "import re" "re.search(r'[a-zA-Z]*[a-zA-Z]', '2323hfjfkf 23023493')"
1000000 loops, best of 3: 1.14 usec per loop

$ python -m timeit -s "import re" "re.search(r'[a-zA-Z]+', '2323hfjfkf 23023493')"
1000000 loops, best of 3: 1 usec per loop

$ python -m timeit -s "import re" "re.search(r'[a-zA-Z][a-zA-Z]*', '2323hfjfkf 23023493')"
1000000 loops, best of 3: 0.956 usec per loop

Turns out that, in this run, [a-zA-Z][a-zA-Z]* is marginally faster than [a-zA-Z]+ (0.956 vs. 1 usec per loop). I'm a little surprised, but frankly I don't think the loss in readability is worth a gain of roughly 0.05 microseconds.
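If you want to repeat the measurement from inside a script, something like the sketch below should work. It is not identical to the commands above: it precompiles each pattern and takes the best of several repeats, and the helper name `best_usec` and the loop counts are my own assumptions. Numbers will vary by machine and Python version, and differences this small can easily be noise.

```python
import re
import timeit

TEXT = "2323hfjfkf 23023493"
PATTERNS = [r"[a-zA-Z]*[a-zA-Z]", r"[a-zA-Z]+", r"[a-zA-Z][a-zA-Z]*"]


def best_usec(pattern, number=20_000, repeat=5):
    """Best-of-`repeat` time per search, in microseconds."""
    compiled = re.compile(pattern)  # assumption: compile once, outside the timed loop
    timings = timeit.repeat(lambda: compiled.search(TEXT), number=number, repeat=repeat)
    return min(timings) / number * 1e6


for pattern in PATTERNS:
    print(f"{pattern:22} {best_usec(pattern):.3f} usec per loop")
```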

jdotjdot
  • It would be interesting to view this from the engine's point of view. In other words, how does the engine perform when given regexes that produce the very same result but look different? Is one alternative heavier on resources than the other, etc.? – Firas Dib Dec 25 '12 at 23:30
  • 1
    +1 for performance comparison. I'm curious what the source of the difference in performance is. – MC Emperor Dec 26 '12 at 09:49
1

Functionally all these regular expressions are identical.

Using the + quantifier, though, may be problematic in some cases, because depending on the parser and its settings it may or may not need to be escaped (\+) in order to retain its special meaning (in POSIX basic regular expressions, for instance, a bare + is a literal character, and GNU grep and sed accept \+ as the quantifier). That is why some people avoid + and prefer the more explicit XX* form, to keep their regular expressions more portable.

As far as Java is concerned, though, + always retains its special meaning, unless escaped.
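For comparison, Python's re module behaves the same way as described for Java here: a bare + is a quantifier and \+ is a literal plus sign. The short sketch below only illustrates that behaviour; it does not reproduce the parsers where the roles are reversed, and the helper name `matches` is just for illustration.

```python
import re


def matches(pattern, text):
    """Return True if the whole string matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Unescaped +: quantifier, one or more lowercase letters.
print(matches(r"[a-z]+", "abc"))
# The quantifier does not match a literal plus sign in the input.
print(matches(r"[a-z]+", "a+"))
# Escaped \+: one letter followed by a literal plus sign.
print(matches(r"[a-z]\+", "a+"))
# The XX* form expresses "one or more" without using + at all.
print(matches(r"[a-z][a-z]*", "abc"))
```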

thkala
  • Another reason to know of the equivalence is if you want to prove something about a given regex. It is used when translating a regular expression to a state machine with Thompson's construction, which doesn't bother to deal with `+` because the regex can be rewritten with `*`. That's all academic though :-) – Will Dec 25 '12 at 23:34
0

Yes, all of these regular expressions are equivalent. [a-z]+ is the simplest one and should be chosen for readability.

Phil Rykoff
0

You're right that [a-zA-Z][a-zA-Z]* and [a-zA-Z]+ match all of the same strings, so in that respect there's no difference. The main advantage of [a-zA-Z]+ over the other is that it's more readable (readability counts!).

Danny Roberts
0

Both are the same. Also check out reluctant quantifiers in the Pattern documentation. [a-zA-Z]+ is more readable for yourself and others.
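Reluctant quantifiers are a separate topic from the equivalence in the question, but a small sketch shows what they change. This uses Python's re module, not Java's Pattern, on the assumption that the `+?` syntax behaves the same way in both for this simple case; the sample text is made up.

```python
import re

TEXT = "abc 123"  # hypothetical sample input

# Greedy: takes as many letters as it can.
greedy = re.search(r"[a-zA-Z]+", TEXT).group()

# Reluctant: stops as soon as the overall match can succeed.
reluctant = re.search(r"[a-zA-Z]+?", TEXT).group()

# When the whole string must match, both have to consume everything.
whole = re.fullmatch(r"[a-zA-Z]+?", "abc").group()

print(greedy, reluctant, whole)
```

So greedy and reluctant forms accept the same strings as a whole-string test, but they can return different substrings when searching.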

Srujan Kumar Gulla
0
[a-zA-Z][a-zA-Z]* Vs [a-zA-Z]*[a-zA-Z]

I think the main difference between these regular expressions is that the first will finish earlier than the second, because the tree walk needed to match [a-zA-Z][a-zA-Z]* consists of fewer steps than the one for [a-zA-Z]*[a-zA-Z].
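To make the "fewer steps" idea concrete, here is a toy greedy backtracking matcher that counts how many times it tests a character. This is only a model I wrote for illustration: it is not how Python's or Java's regex engine is actually implemented, and the names `match_here`, `count_tests`, `"one"` and `"star"` are hypothetical. In this model the star-first form does a little extra work because the greedy star consumes every letter and then has to give one back.

```python
def is_letter(ch):
    """Stand-in for the character class [a-zA-Z]."""
    return ch.isascii() and ch.isalpha()


def match_here(items, text, pos, counter):
    """Match a sequence of "one" / "star" letter items at pos.

    Returns the end position of the match, or None. counter[0] counts
    every character test, including tests that hit the end of the text.
    """
    if not items:
        return pos
    quant, rest = items[0], items[1:]
    if quant == "one":
        counter[0] += 1
        if pos < len(text) and is_letter(text[pos]):
            return match_here(rest, text, pos + 1, counter)
        return None
    # quant == "star": consume greedily, then back off one character at a time.
    end = pos
    while True:
        counter[0] += 1
        if end < len(text) and is_letter(text[end]):
            end += 1
        else:
            break
    while end >= pos:
        result = match_here(rest, text, end, counter)
        if result is not None:
            return result
        end -= 1
    return None


def count_tests(items, text):
    """Return (end_position, number_of_character_tests) for a match at 0."""
    counter = [0]
    end = match_here(items, text, 0, counter)
    return end, counter[0]


TEXT = "hfjfkf 23023493"
print(count_tests(("one", "star"), TEXT))  # models [a-zA-Z][a-zA-Z]*
print(count_tests(("star", "one"), TEXT))  # models [a-zA-Z]*[a-zA-Z]
```

In this toy model the gap is a constant couple of tests per attempted match, not something that grows with the input length, so it would not by itself explain a large difference.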

edem