Regex: what exactly does [ ] do?

Question

Regular expressions that use [] have always confused me a bit. Below are some common patterns that use []:

/[0-9]/ matches any single digit
/[A-Z]/ matches any of the 26 uppercase letters
/[a-z]/ matches any of the 26 lowercase letters

But what about

/[A-Za-z0-9]/ matches any digit, uppercase letter, or lowercase letter

Which could also be written as

/[0-z]/ which also matches any digit, uppercase letter, or lowercase letter. But it also matches ^ and _ as well, among other characters.

Why is this?

So basically what is that ? You asked it 9 mins ago, and answered it 9 mins ago. Is it a tutorial/blog in your opinion ? I doubt if this is what SO is for! — Rizwan M.Tuman, Apr 25 '18 at 04:01
@RizwanM.Tuman Self-answering questions is definitely a Stack Overflow practice - the "Ask a Question" form even has a specific option to include a self-answer at the same time. — RJHunter, Apr 25 '18 at 04:27
Jeff Atwood, cofounder of stackoverflow encourages it :) https://stackoverflow.blog/2011/07/01/its-ok-to-ask-and-answer-your-own-questions/. — Vincent Tang, Apr 25 '18 at 14:02

score 5 · Accepted Answer · answered Apr 25 '18 at 03:50

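It's because of the ASCII table.

/[0-z]/ matches any character whose code lies between 48 ('0') and 122 ('z'), inclusive. That span contains the digits and both sets of letters, but also the punctuation that sits between them: : ; < = > ? @ between '9' and 'A', and [ \ ] ^ _ ` between 'Z' and 'a'.

/[A-Za-z0-9]/ lists three separate ranges, so it skips those in-between characters.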
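
To see the difference concretely, here is a small sketch that walks the character codes from 48 to 122 and lists which characters only the wider range accepts. The variable names (inRange, extras) are just illustrative.

```javascript
// Collect every character from code 48 ('0') through 122 ('z').
const inRange = [];
for (let code = '0'.charCodeAt(0); code <= 'z'.charCodeAt(0); code++) {
  inRange.push(String.fromCharCode(code));
}

// Keep only the characters that /[0-z]/ matches but /[A-Za-z0-9]/ does not.
const extras = inRange.filter(
  (ch) => /[0-z]/.test(ch) && !/[A-Za-z0-9]/.test(ch)
);

console.log(inRange.length);     // 75 characters in the range
console.log(extras.join(''));    // :;<=>?@[\]^_`
console.log(/[0-z]/.test('|'));  // false, because '|' is code 124, past 'z'
```

The 13 extra characters are exactly the gap between the 75 codes in the range and the 62 alphanumerics.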

Vincent Tang

2,648
3
28
49

2

Not just ASCII. [**Unicode too**](https://stackoverflow.com/a/280762/6647153). – ibrahim mahrir Apr 25 '18 at 04:05

score 2 · Answer 2 · answered Apr 25 '18 at 04:23

The [] in a regular expression denotes a character set. It tells the pattern matcher to match any character that appears inside the brackets. So, for instance,

/[abc]/

will match any one of 'a', 'b', or 'c'.

Inside the brackets, however, the hyphen ('-') has a special meaning: it denotes the entire range of characters between the character just before and just after the hyphen (inclusive). That is, the above regex could have been written:

/[a-c]/

If you want to include a literal hyphen in the list of characters in the set, you need to escape it. That is:

/[a\-c]/

will match any one of 'a', '-', or 'c' (and not 'b'). You can also suppress the special meaning of the hyphen by making it the first or last character in the set, so:

/[-ac]/

will also match any one of 'a', '-', or 'c'.
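
As a quick check of the hyphen rules above, the following sketch runs each pattern against the same four characters. The helper name matched and the sample array are made up for illustration.

```javascript
// Characters to probe each character set with.
const sample = ['a', 'b', 'c', '-'];

// Return the sample characters that a given regex matches, joined into a string.
const matched = (re) => sample.filter((ch) => re.test(ch)).join('');

console.log(matched(/[a-c]/));   // "abc"  (hyphen forms a range)
console.log(matched(/[a\-c]/));  // "ac-"  (escaped hyphen is literal)
console.log(matched(/[-ac]/));   // "ac-"  (leading hyphen is literal)
console.log(matched(/[ac-]/));   // "ac-"  (trailing hyphen is literal)
```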

This explains why /[A-Za-z0-9]/ is not the same thing as /[0-z]/: the range of characters between '0' and 'z' simply includes additional characters, as you noted in your question. That's all there is to it.

As a technical detail, JavaScript uses the Unicode standard to define which characters fall within a range. If you're sticking with the 7-bit ASCII character set, you'll get the same results using an ASCII chart. But don't use an ASCII chart for character codes above 0x7F; consult the Unicode charts instead.
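
For a range outside ASCII, here is a minimal sketch using the Greek lowercase block as an example (my choice, not something from the question). It uses \u escapes so it does not depend on the source file's encoding.

```javascript
// Range from U+03B1 (alpha) to U+03C9 (omega), ordered by Unicode code point.
const greekLower = /[\u03B1-\u03C9]/;

console.log('\u03B1'.codePointAt(0).toString(16)); // "3b1"
console.log('\u03C9'.codePointAt(0).toString(16)); // "3c9"

console.log(greekLower.test('\u03BB')); // true, lambda (U+03BB) is inside the range
console.log(greekLower.test('z'));      // false, 'z' is U+007A, far below the range
```

As with /[0-z]/, the range includes everything between its endpoints, so it is worth checking the Unicode chart for characters you did not intend to match.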
