3

I'm playing the regexcrossword game and am slightly confused as to the difference between (T|E|N)* and [TEN]*.

The first, to me, reads as: Either T, E, or N zero to unlimited times.

The second, to me, reads as: From the list T, E, or N zero to unlimited times.

I don't see the difference. Surely there is one. Thanks for the help!

piratemurray
  • first one capture once but not the latter. – YOU May 16 '15 at 12:18
  • @gaussblurinc How about suggesting search terms or providing a link to a good guide? The reason I posted here is because I exhausted what I was able to search for on Google without knowing more about the problem space. You'll help other people if you posted some links or suggested search terms and it will be less arrogant of you. Otherwise your comment is just... lazy. – piratemurray May 16 '15 at 13:24
  • thousand questions asked about regular expressions. only nearly 0.1% of them really interesting for regex cowboys, but nearly all of them useless. Why so? Because encyclopedia tells you: [everything you need to know](http://en.wikipedia.org/wiki/Regular_expression) and [search](https://www.google.com/search?q=search+engine+expression&ie=utf-8&oe=utf-8#q=regular+expression+tutorial) can highlight your problem as a x-ray – gaussblurinc May 16 '15 at 16:51

3 Answers

10

If you are matching single letters only, there is no difference in what gets matched between joining the letters with a pipe | and putting them in a character set [ ]. That is not the case with words or other multi-character alternatives.

Example:

(batman|superman|ironman) is different from [batmansupermanironman]

  • (batman|superman|ironman) will match any of the words batman, superman or ironman

  • [batmansupermanironman] is equivalent to [abeimnoprstu] and matches any character in this set
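
The two bullets above can be checked with a short sketch. It uses Python's `re` module purely for illustration (the question is not tied to any language), and the variable names are made up for the example.

```python
import re

words = re.compile(r"(batman|superman|ironman)")
letters = re.compile(r"[batmansupermanironman]")

# The alternation only accepts one of the three whole words.
assert words.fullmatch("superman") is not None
assert words.fullmatch("bat") is None

# The character class accepts exactly one character from the set.
assert letters.fullmatch("b") is not None
assert letters.fullmatch("batman") is None

# Repeated letters inside a class are redundant, so it collapses to 12 letters.
unique_letters = "".join(sorted(set("batmansupermanironman")))
print(unique_letters)  # abeimnoprstu
```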

A character set can also take a range, such as [a-z], which would be tedious to write out with pipes.
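
To give a feel for how long the piped version of a range gets, here is a small sketch, again assuming Python's `re`. The `by_pipe` pattern is a hypothetical hand-built equivalent of `[a-z]`, generated from `string.ascii_lowercase` rather than typed out.

```python
import re
import string

by_range = re.compile(r"[a-z]")
# Hypothetical hand-built equivalent: (?:a|b|c|...|z)
by_pipe = re.compile("(?:" + "|".join(string.ascii_lowercase) + ")")

# Both patterns should accept and reject the same single characters.
agree = all(
    (by_range.fullmatch(ch) is None) == (by_pipe.fullmatch(ch) is None)
    for ch in string.printable
)
print(agree)  # True
print(len(by_range.pattern), len(by_pipe.pattern))  # 5 55
```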

Of course, one difference is the capture group created by (T|E|N), but I don't think that is what you were asking about. :)

karthik manchala
  • Right, right, right. Thanks so for my case because I'm only checking single letters there is no difference but were I to be checking words that's where I'd see a difference. Cheers. I will accept your answer as soon as SO lets me. – piratemurray May 16 '15 at 12:24
  • Happy to help.. and thanks :) – karthik manchala May 16 '15 at 12:27
5

They both match the same strings, but in terms of differences in output, (T|E|N)* also returns a capture group containing the last matched character.

For example, given the string TENTEN, (T|E|N)* will match and will have N in the first capture group. [TEN]* on the other hand will not have any capture group.
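
As a quick check of the capture-group behaviour described above, here is a minimal sketch using Python's `re` (an assumption; other engines expose groups through different APIs). In Python, a repeated group keeps only its last iteration.

```python
import re

text = "TENTEN"

grouped = re.fullmatch(r"(T|E|N)*", text)
classed = re.fullmatch(r"[TEN]*", text)

print(grouped.group(0))  # TENTEN
print(grouped.group(1))  # N
print(classed.group(0))  # TENTEN
print(classed.groups())  # ()
```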

In terms of performance, (T|E|N)* will tend to be slower because most backtracking regex engines try the branches in order, testing the first branch before the second one.

For instance with TENTEN, this is what happens (spaces added for the sake of clarification):

Attempts to match T
 T E N T E N
^
Matches T, moves on
 T E N T E N
  ^
Attempts to match T
 T E N T E N
  ^
Fails, attempts to match the next branch, E
 T E N T E N
  ^
Matches E, moves on
 T E N T E N
    ^
Attempts to match T
 T E N T E N
    ^
Fails, attempts to match the next branch, E
 T E N T E N
    ^
Fails, attempts to match the next branch, N
 T E N T E N
    ^
Matches N, moves on
 T E N T E N
      ^

And so on, but with the character class, you could say that everything is tested at the same time:

Attempts to match T, E or N
 T E N T E N
^
Matches T, moves on
 T E N T E N
  ^
Attempts to match T, E or N
 T E N T E N
  ^
Matches E, moves on
 T E N T E N
    ^
Attempts to match T, E or N
 T E N T E N
    ^
Matches N, moves on
 T E N T E N
      ^

This means that ( ... | ... ) will always try to match the first branch before attempting to match the next, while [ ... ] does not and just 'mixes everything together'.
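
The two traces above can be turned into a toy counting model. This is only a sketch of the idea, not how any real engine is implemented: `alternation_attempts` and `class_attempts` are hypothetical helpers, and some engines optimize single-character alternations, so treat the numbers as a model rather than a measurement.

```python
def alternation_attempts(text, branches):
    """Count single-character comparisons for (b1|b2|...)* in a toy model."""
    attempts = 0
    for ch in text:
        for branch in branches:
            attempts += 1          # one comparison per branch tried
            if ch == branch:
                break              # branch matched, move to the next character
        else:
            break                  # no branch matched, the repetition stops
    return attempts


def class_attempts(text, members):
    """Count one set-membership test per character for [..]* in a toy model."""
    attempts = 0
    for ch in text:
        attempts += 1
        if ch not in members:
            break
    return attempts


print(alternation_attempts("TENTEN", "TEN"))  # 12
print(class_attempts("TENTEN", set("TEN")))   # 6
```

In this model, every N costs three comparisons under alternation (T, then E, then N) but a single membership test under the character class.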

So for simple patterns (single characters), it is best to use a character class, i.e. [TEN]* instead of (T|E|N)* (or (?:T|E|N)*).
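
For reference, the three spellings mentioned above differ in how many capture groups they define. A minimal sketch, assuming Python's `re` and its compiled-pattern `groups` attribute:

```python
import re

capturing = re.compile(r"(T|E|N)*")
non_capturing = re.compile(r"(?:T|E|N)*")
char_class = re.compile(r"[TEN]*")

# Number of capture groups each pattern defines.
print(capturing.groups, non_capturing.groups, char_class.groups)  # 1 0 0

# All three accept the same input.
all_match = all(
    p.fullmatch("TENTEN") is not None
    for p in (capturing, non_capturing, char_class)
)
print(all_match)  # True
```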

Jerry
2

There is no difference in the result.

However, the number of phases required to process the pattern may differ.


(T|E|N)* is expanded into parallel branches. As a tree, it looks like this:
(T|E|N)* -> (T|E|N) -> T|E|N -> parallel branches T, E, N
This means the engine passes through 4 phases to process the input text for a match.
[TEN]* is processed as follows: [TEN]* -> [TEN]
That is only 2 phases to process the input text for a match.


Therefore [TEN]* is preferable to (T|E|N)*.
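
The claim that the result is the same can be spot-checked by brute force. The sketch below assumes Python's `re` and compares both patterns on every string up to length 4 over a small alphabet; the extra letter `X` is an arbitrary choice so that some strings are rejected. It is a sanity check on a finite sample, not a proof.

```python
import re
from itertools import product

alternation = re.compile(r"(T|E|N)*")
char_class = re.compile(r"[TEN]*")

alphabet = "TENX"  # "X" is an arbitrary letter outside the set
candidates = [
    "".join(chars)
    for length in range(5)
    for chars in product(alphabet, repeat=length)
]

# Collect any string that one pattern accepts and the other rejects.
mismatches = [
    s
    for s in candidates
    if (alternation.fullmatch(s) is None) != (char_class.fullmatch(s) is None)
]
print(len(candidates))  # 341
print(mismatches)       # []
```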
G.Y