118

Here is a regular expression I created to use in JavaScript:

var reg_num = /^(7|8|9)\d{9}$/

Here is another one suggested by my team member.

var reg_num = /^[7|8|9][\d]{9}$/

The rule is to validate a phone number:

  • It should contain exactly ten digits.
  • The first digit must be 7, 8 or 9.
Tim Pietzcker
Jayapal Chandran
  • If you came here from a duplicate, perhaps note that some details in the answers here are specific to Javascript, but most of the advice you get here applies to any regex implementation. Some regular expression dialects like POSIX `grep` require backslashes like `\(7\|8\|9\)` and/or don't support the `\d` shorthand to match a digit. See also [the Stack Overflow `regex` tag info page](/tags/regex/info) which covers this as well as a number of other common beginner problems. – tripleee Mar 18 '21 at 10:41

3 Answers

139

These regexes are equivalent (for matching purposes):

  • /^(7|8|9)\d{9}$/
  • /^[789]\d{9}$/
  • /^[7-9]\d{9}$/

The explanation:

  • (a|b|c) is a regex "OR" (an alternation) and means "a or b or c". The parentheses, which are needed to limit the scope of the OR, also capture the matched digit. To be strictly equivalent to the other two, you would write (?:7|8|9), which makes it a non-capturing group.

  • [abc] is a "character class" that means "any character from a,b or c" (a character class may use ranges, e.g. [a-d] = [abcd])

The reason these regexes are similar is that a character class is a shorthand for an "or" (but only for single characters). In an alternation, you can also do something like (abc|def) which does not translate to a character class.
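To make the comparison concrete, here is a minimal sketch. The sample strings and variable names are made up for illustration. It runs the three patterns against the same inputs and then shows the one observable difference, the capture group:

```javascript
// Three spellings of "first digit is 7, 8 or 9, followed by nine more digits".
const patterns = [/^(7|8|9)\d{9}$/, /^[789]\d{9}$/, /^[7-9]\d{9}$/];

// Hypothetical inputs: valid, wrong first digit, too short, too long.
const samples = ["9876543210", "6876543210", "987654321", "98765432100"];

// For each sample, record whether each pattern accepts it.
const results = samples.map((s) => patterns.map((re) => re.test(s)));
console.log(results);
// [[true, true, true], [false, false, false], [false, false, false], [false, false, false]]

// The parentheses add a capture group to the match result; the class does not.
const withGroup = "9876543210".match(/^(7|8|9)\d{9}$/);
const withClass = "9876543210".match(/^[789]\d{9}$/);
const withNonCapturing = "9876543210".match(/^(?:7|8|9)\d{9}$/);

console.log(withGroup.length, withGroup[1]); // 2 "9"
console.log(withClass.length); // 1
console.log(withNonCapturing.length); // 1
```

All three patterns agree on every sample; only the shape of the match result differs.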

B--rian
Bohemian
  • `(7|8|9)` and `[789]` are not equivalent, because the first is capturing, the latter not. `(?:7|8|9)` would be equivalent on the other hand (I guess you know that of course ...). – hochl Mar 21 '12 at 09:55
  • I'm seeing this regex: `[<>|\]\]|\[\[]`. Because of the context, I know that regex is trying to match `<>` or `[[` or `]]`. But from what you've said, it should be matching `` or `[` or `]`. If you use `|` between `[]`, do the brackets behave differently? – Daniel Kaplan Nov 14 '17 at 20:39
  • @DanielKaplan don't use `|` within a character class `[...]`, unless you want to match the pipe character itself. Also duplicating chars in a character class has no effect - a character class is a list of characters and will match exactly one of them. My guess is you want a *group*, which uses normal round brackets: `(<>|\]\]|\[\[)` – Bohemian Nov 14 '17 at 21:35
64

Your team's advice is almost right, except for the mistake that was made. Once you find out why, you will never forget it. Take a look at this mistake.

/^(7|8|9)\d{9}$/

What this does:

  • ^ and $ are anchors. They assert that the subpattern between them is the entire match: the string only matches if the subpattern matches all of it, not just a section.
  • () denotes a capturing group.
  • 7|8|9 matches any one of 7, 8, or 9. It does this with alternation, which is what the pipe operator | does: it offers the engine several alternatives to try in turn. This involves backtracking: if the first alternative does not match, the engine has to return to the position it was at before attempting that alternative and then try the next one, whereas a character class can simply advance through the input. See this match on a regex engine with optimizations disabled:
Pattern: (r|f)at
Match string: carat
[Figure: trace of the alternation match]

Pattern: [rf]at
Match string: carat
[Figure: trace of the character class match]


  • \d{9} matches nine digits. \d is a shorthand metacharacter that matches any digit.

/^[7|8|9][\d]{9}$/

Look at what it does:

  • ^ and $ are anchors here as well.
  • [7|8|9] is a character class. It matches any one character from the list 7, |, 8, |, or 9, so the | was added incorrectly. This matches without backtracking.
  • [\d] is a character class that contains the metacharacter \d. Wrapping a single metacharacter in a character class is a bad idea, by the way, since the extra layer of abstraction can slow down the match. This is only an implementation detail, though, and applies to only a few regex implementations. JavaScript is not one of them, but it does make the subpattern slightly longer.
  • {9} indicates that the preceding single construct is repeated nine times in total.
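The stray pipe is easy to demonstrate. In this short sketch (the variable names and the test string are illustrative, not from the question), the character-class version accepts a "phone number" that starts with a literal |:

```javascript
// The teammate's pattern: | inside [...] is a literal pipe, not an OR.
const teammate = /^[7|8|9][\d]{9}$/;
// The corrected character class.
const fixed = /^[789]\d{9}$/;

// A pipe followed by nine digits: ten characters, but not a valid number.
const bogus = "|123456789";

const teammateAccepts = teammate.test(bogus);
const fixedAccepts = fixed.test(bogus);

console.log(teammateAccepts); // true
console.log(fixedAccepts); // false
```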

The optimal regex is /^[789]\d{9}$/, because /^(7|8|9)\d{9}$/ captures unnecessarily, which imposes a performance decrease on most regex implementations (JavaScript happens to be one; since the question uses the keyword var in its code, this is probably JavaScript). PHP, which runs on PCRE for preg matching, will optimize the backtracking away, but we're not in PHP here, so using a character class [] instead of an alternation | gives a performance bonus: the match does not backtrack, and therefore both matches and fails faster than your previous regular expression.
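As a usage sketch, the recommended pattern could be wrapped in a small validator. The names PHONE_RE and isValidPhone are hypothetical, and the type check is an added assumption rather than part of the original rule:

```javascript
// Recommended pattern: one of 7, 8, 9, then exactly nine more digits.
const PHONE_RE = /^[789]\d{9}$/;

// Hypothetical helper: returns true only for strings that satisfy the rule.
function isValidPhone(value) {
  // Reject non-strings explicitly instead of relying on implicit coercion.
  return typeof value === "string" && PHONE_RE.test(value);
}

console.log(isValidPhone("9876543210")); // true
console.log(isValidPhone("1234567890")); // false (first digit is not 7, 8 or 9)
console.log(isValidPhone("98765 43210")); // false (contains a space)
console.log(isValidPhone(9876543210)); // false (a number, not a string)
```

Because the pattern has no g flag, calling test repeatedly on the same regex object carries no state between calls.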

Unihedron
13

The first two examples behave very differently if you use them to REPLACE text. If you match on this:

str = str.replace(/^(7|8|9)/ig,''); 

you would replace 7 or 8 or 9 by the empty string.

If you match on this

str = str.replace(/^[7|8|9]/ig,''); 

you will replace 7 or 8 or 9 OR THE VERTICAL BAR!!!! by the empty string.

I just found this out the hard way.
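A quick way to see the difference, using a made-up input string that starts with a pipe (the i and g flags are dropped here, since they make no difference for a pattern anchored with ^ that contains only digits and a pipe):

```javascript
// Hypothetical input that begins with a vertical bar.
const input = "|7abc";

// Alternation: only a leading 7, 8 or 9 is removed, so nothing changes here.
const viaGroup = input.replace(/^(7|8|9)/, "");

// Character class with pipes: the leading | is treated as a valid character and removed.
const viaClass = input.replace(/^[7|8|9]/, "");

console.log(viaGroup); // "|7abc"
console.log(viaClass); // "7abc"
```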

Alan Moore
Sheila
  • Welcome to SO! Replacing or matching, it's just plain wrong. A lot of people make that mistake, and they usually get away with it--for years, sometimes--because their input strings never happen to contain a pipe (`|`). – Alan Moore Jun 20 '13 at 19:15