Searching for UUIDs in text with regex

Question

I'm searching for UUIDs in blocks of text using a regex. Currently I'm relying on the assumption that all UUIDs will follow a pattern of 8-4-4-4-12 hexadecimal digits.

Can anyone think of a use case where this assumption would be invalid and would cause me to miss some UUIDs?

This question from 6 years ago was to help me with a project to find credit cards in a block of text. I've subsequently open sourced the code which is linked from my blog post which explains the nuance that the UUIDs were causing when searching for credit cards http://www.guyellisrocks.com/2013/11/parsing-text-for-credit-card-number.html — Guy, Apr 17 '14 at 14:15
A search for UUID regular expression pattern matching brought me to this Stack Overflow post, but the accepted answer actually isn't an answer. Additionally, the link you provided in the comment below your question also doesn't have the pattern (unless I'm missing something). Is one of these answers something you ended up using? — Tass, Feb 03 '16 at 21:19
If you follow the rabbit warren of links starting with the one that I posted you might come across this line in GitHub which has the regex that I finally used. (Understandable that it is difficult to find.) That code and that file might help you: https://github.com/guyellis/CreditCard/blob/master/Company.CreditCard/CreditCard.cs#L98 — Guy, Feb 04 '16 at 14:20
None of these answers seem to give a single regex for all variants of only valid RFC 4122 UUIDs. But it looks like such an answer was given here: http://stackoverflow.com/a/13653180/421049 — Garret Wilson, Feb 23 '17 at 00:49

Ivelin · Answer 1 · 2018-11-28T15:16:45.077

528

The regex for uuid is:

\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b
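As a rough illustration of how the pattern above could be applied to a block of text, here is a minimal Python sketch. The sample text and the names `UUID_RE` and `find_uuids` are made up for this example, and `re.IGNORECASE` is an added assumption so that upper-case UUIDs are found as well.

```python
import re

# The pattern from this answer. IGNORECASE is an assumption added here so
# that upper-case hex digits are accepted too.
UUID_RE = re.compile(
    r"\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b",
    re.IGNORECASE,
)


def find_uuids(text):
    """Return every 8-4-4-4-12 hex string found in text, in order."""
    return UUID_RE.findall(text)


# Hypothetical input mixing UUIDs with a credit-card-like number.
sample = (
    "order 82b1510f-d735-4952-8a6d-0f7d6bfe7960 shipped, "
    "card 4111-1111-1111-1111 declined, "
    "session F47AC10B-58CC-4372-A567-0E02B2C3D479 closed"
)
print(find_uuids(sample))
```

The pattern contains no capturing groups, so `findall` returns the full matched strings rather than tuples.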

edited Nov 28 '18 at 15:16

answered Jul 10 '11 at 11:39

Ivelin

25

make that `[a-f0-9]`! As it's hex! Your regex (as it is) could return false positives. – exhuma Sep 25 '11 at 09:21
13

In some cases you might even want to make that [a-fA-F0-9] or [A-F0-9]. – Hans-Peter Störr Nov 23 '11 at 12:53
2

+1 for pattern, but I'm wondering [0-9a-f] might perform better as more random hex digits will be a number instead of alphabetic character? – cyber-monk Apr 02 '12 at 15:46
29

@cyber-monk: [0-9a-f] is identical to [a-f0-9] and [0123456789abcdef] in meaning and in speed, since the regex is turned into a state machine anyway, with each hex digit turned into an entry in a state-table. For an entry point into how this works, see http://en.wikipedia.org/wiki/Nondeterministic_finite_automaton – JesperSM Jul 03 '12 at 12:07
1

@JesperSM indeed [0-9a-f] ~ [a-f0-9] but [0123456789abcdef] is ~1% slower probably because there's more "string" to get parsed. The setup: `timeit.timeit(stmt="re.match('[0123456789abcdef]{8}-[0123456789abcdef]{4}-[0123456789abcdef]{4}-[0123456789abcdef]{4}-[0123456789abcdef]{12}$','82b1510f-d735-4952-8a6d-0f7d6bfe7960')",setup='import re', number=100000)/timeit.timeit(stmt="re.match('[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{4}-[a-f0-9]{12}$','82b1510f-d735-4952-8a6d-0f7d6bfe7960')",setup='import re', number=100000)` – estani Jan 23 '13 at 08:49
11

This solution is not quite correct. It matches IDs that have invalid version and variant characters per RFC4122. @Gajus' solution is more correct in that regard. Also, the RFC allows upper-case characters on input, so adding [A-F] would be appropriate. – broofa Feb 06 '13 at 18:35
5

@broofa, I see that you are really set on everyone matching only UUIDs that are consistent with the RFC. However, I think the fact that you have had to point this out so many times is a solid indicator that not all UUIDs will use the RFC version and variant indicators. The UUID definition http://en.wikipedia.org/wiki/Uuid#Definition states a simple 8-4-4-4-12 pattern and 2^128 possibilities. The RFC represents only a subset of that. So what do you want to match? The subset, or all of them? – Bruno Bronosky Feb 25 '13 at 22:57
1

@RichardBronosky - A fair point. I guess it's not really clear from the OP's question whether or not RFC-compliance is an important distinction. (although his concern is more with false negatives so perhaps it's not.) Pick your poison, I suppose. :/ – broofa Feb 26 '13 at 02:54
I posted a shorter version of this below. – iGEL Jun 24 '14 at 13:21
3

You can compress this regex quite a bit: `[0-9a-f]{8}-(?:[0-9a-f]{4}-){3}[0-9a-f]{12}`. – Alix Axel Oct 14 '16 at 16:18
1

You can compress it further (using the `\h` hexadecimal character class) if you're using Ruby: `\h{8}-(?:\h{4}-){3}\h{12}` – aidan Nov 22 '18 at 00:15
1

That regex will match if the source string has greater than 8 characters in the first group and/or greater than 12 characters in the last group. You need to put word boundaries in there to prevent that - e.g. (also including capitals): \b[0-9a-fA-F]{8}\b-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-\b[0-9a-fA-F]{12}\b Ideally, also put word boundaries around the internal groups as well: \b[0-9a-fA-F]{8}\b-\b[0-9a-fA-F]{4}\b-\b[0-9a-fA-F]{4}\b-\b[0-9a-fA-F]{4}\b-\b[0-9a-fA-F]{12}\b – Andrew Coad Nov 28 '18 at 09:07
1

@AndrewCoad The internal `\b`'s are unnecessary, and if you care about boundaries at the ends of the UUID then the outer `\b`'s should probably be replaced with `^..$` (or `\A..\z` if you're in Ruby). Depending on language, the `/i` switch removes the need for specifying both `a-f` and `A-F`. In summary: `/^[0-9a-f]{8}-(?:[0-9a-f]{4}-){3}[0-9a-f]{12}$/i`. Even this is incorrect though, because it allows invalid UUIDs through. See answer from @Gajus below. – aidan Dec 10 '18 at 06:05
It's important to check `\h` before using it. It seems non-standard, in some engines like Ruby it matches hex characters while in other languages it matches horizontal whitespace characters. If POSIX bracket expressions are available you could use `[[:xdigit:]]` as a more widely standardized alternative. – 3limin4t0r Nov 16 '20 at 11:47
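The boundary discussion in the comments above can be checked with a small Python sketch. The test strings and variable names are invented for illustration; `search` stands in for scanning free text, while `fullmatch` corresponds to anchoring the pattern with `^...$`.

```python
import re

HEX = r"[0-9a-f]"
CORE = rf"{HEX}{{8}}-(?:{HEX}{{4}}-){{3}}{HEX}{{12}}"

unbounded = re.compile(CORE, re.IGNORECASE)
bounded = re.compile(rf"\b{CORE}\b", re.IGNORECASE)

good = "82b1510f-d735-4952-8a6d-0f7d6bfe7960"
# Same string with two extra hex digits glued to each end.
too_long = "00" + good + "00"

# Without boundaries a UUID-shaped substring is found inside the longer run.
found_without_boundaries = bool(unbounded.search(too_long))
# With \b at both ends the longer run is rejected.
found_with_boundaries = bool(bounded.search(too_long))
# \b still allows a UUID embedded in surrounding punctuation.
found_in_text = bool(bounded.search(f"id={good};"))
# fullmatch only accepts a string that is exactly one UUID.
exact = bool(unbounded.fullmatch(good))
exact_in_text = bool(unbounded.fullmatch(f"id={good};"))

print(found_without_boundaries, found_with_boundaries, found_in_text, exact, exact_in_text)
```

Which variant fits depends on the task: `\b` suits searching inside text, and anchoring suits validating a single field.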

score 143 · Answer 2 · answered Oct 11 '12 at 15:32

@ivelin: UUIDs can contain capitals, so you'll either need to toLowerCase() the string or use:

[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}

Would have just commented this but not enough rep :)

answered Oct 11 '12 at 15:32

Matthew F. Robben

25

Usually you can handle this by defining the pattern as case insensitive with an i after the pattern, this makes a cleaner pattern: /[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}/i – Thomas Bindzus Feb 27 '16 at 09:07
@ThomasBindzus That option isn't available in all languages. The original pattern in this answer worked for me in Go. The `/.../i` version didn't. – Chris Redford May 01 '20 at 23:03
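Where a trailing `/i` is not available, an inline `(?i)` flag is another option in engines that support it. The sketch below uses Python's `re`, which does; whether your own engine accepts `(?i)` is something to verify. The variable names and test strings are only for illustration.

```python
import re

# Inline (?i) at the start makes the whole pattern case-insensitive
# without a separate flag argument or a /i suffix.
inline = re.compile(
    r"(?i)[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)

# Spelling out both cases in the character class, as in this answer.
explicit = re.compile(
    r"[a-fA-F0-9]{8}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{4}-[a-fA-F0-9]{12}"
)

candidates = (
    "ca761232-ed42-11ce-bacd-00aa0057b223",
    "CA761232-ED42-11CE-BACD-00AA0057B223",
    "Ca761232-eD42-11cE-BaCd-00Aa0057b223",
)

# Both spellings should agree on every candidate.
results = [
    (bool(inline.fullmatch(c)), bool(explicit.fullmatch(c))) for c in candidates
]
print(results)
```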

score 126 · Answer 3 · edited Nov 27 '13 at 10:38

Version 4 UUIDs have the form xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx where x is any hexadecimal digit and y is one of 8, 9, A, or B. e.g. f47ac10b-58cc-4372-a567-0e02b2c3d479.

source: http://en.wikipedia.org/wiki/Uuid#Definition

Therefore, this is technically more correct:

/[a-f0-9]{8}-[a-f0-9]{4}-4[a-f0-9]{3}-[89aAbB][a-f0-9]{3}-[a-f0-9]{12}/
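To see what the version 4 pattern accepts and rejects, here is a hedged Python sketch that compares it against the standard `uuid` module. `V4_RE` and `looks_like_v4` are names chosen for this example, and `re.IGNORECASE` is added on the assumption that upper-case input should be accepted.

```python
import re
import uuid

# Version 4 pattern from this answer, made case-insensitive (an assumption).
V4_RE = re.compile(
    r"[a-f0-9]{8}-[a-f0-9]{4}-4[a-f0-9]{3}-[89ab][a-f0-9]{3}-[a-f0-9]{12}",
    re.IGNORECASE,
)


def looks_like_v4(text):
    """True if text is exactly one version 4 UUID in 8-4-4-4-12 form."""
    return V4_RE.fullmatch(text) is not None


# Every UUID produced by uuid.uuid4() should satisfy the pattern.
all_generated_match = all(looks_like_v4(str(uuid.uuid4())) for _ in range(1000))

# A version 1 UUID has the right shape but a different version digit.
v1_example = "123e4567-e89b-12d3-a456-426655440001"
v1_matches = looks_like_v4(v1_example)
v1_version = uuid.UUID(v1_example).version

print(all_generated_match, v1_matches, v1_version)
```

So the stricter pattern is a good fit only when the input is known to contain version 4 UUIDs.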

edited Nov 27 '13 at 10:38

Matt Keeble

answered Jan 04 '13 at 22:42

Gajus

1

I don't think you mean a-z. – Bruno Bronosky Feb 05 '13 at 16:06
9

Need to accept [A-F], too. Per section 3 of RFC4122: 'The hexadecimal values "a" through "f" are output as lower case characters **and are case insensitive on input**'. Also `(:?8|9|A|B)` is probably slightly more readable as `[89aAbB]` – broofa Feb 06 '13 at 18:26
1

Need to copy @broofa's modification; as yours excludes lower-case A or B. – ELLIOTTCABLE May 18 '13 at 22:26
7

@elliottcable Depending on your environment, just use `i` (case-insensitive) flag. – Gajus Jan 14 '14 at 23:11
20

You're rejecting Version 1 to 3 and 5. Why? – iGEL Jun 24 '14 at 13:20
This regex fails for 123e4567-e89b-12d3-a456-426655440001, even though it is a valid UUID. – prostý člověk Jun 03 '19 at 10:05
1

@ThangavelLoganathan right this is only for version 4 which iGEL mentioned, but you've got a v1 UUID. I think the only difference between UUIDs are the version numbers in the third group (i.e. `4[a-f0-9]{3}`). I got that from Ivan's answer. – acw Sep 15 '20 at 15:58

score 109 · Answer 4 · answered Jul 04 '16 at 19:20

If you want to check or validate a specific UUID version, here are the corresponding regexes.

Note that the only difference is the version number, which is explained in section 4.1.3 (Version) of RFC 4122.

The version number is the first character of the third group : [VERSION_NUMBER][0-9A-F]{3} :

UUID v1 :

/^[0-9A-F]{8}-[0-9A-F]{4}-[1][0-9A-F]{3}-[89AB][0-9A-F]{3}-[0-9A-F]{12}$/i

UUID v2 :

/^[0-9A-F]{8}-[0-9A-F]{4}-[2][0-9A-F]{3}-[89AB][0-9A-F]{3}-[0-9A-F]{12}$/i

UUID v3 :

/^[0-9A-F]{8}-[0-9A-F]{4}-[3][0-9A-F]{3}-[89AB][0-9A-F]{3}-[0-9A-F]{12}$/i

UUID v4 :

/^[0-9A-F]{8}-[0-9A-F]{4}-[4][0-9A-F]{3}-[89AB][0-9A-F]{3}-[0-9A-F]{12}$/i

UUID v5 :

/^[0-9A-F]{8}-[0-9A-F]{4}-[5][0-9A-F]{3}-[89AB][0-9A-F]{3}-[0-9A-F]{12}$/i
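Since the five regexes differ only in one digit, they can be generated from a single template. The following Python sketch does that and checks the result against UUIDs produced by the standard `uuid` module. `version_pattern` is a hypothetical helper name, and version 2 is left out of the samples because the `uuid` module has no generator for it.

```python
import re
import uuid


def version_pattern(version):
    """Build the regex for one UUID version (1 to 5), following the list above."""
    if version not in range(1, 6):
        raise ValueError("only versions 1-5 are covered here")
    return re.compile(
        rf"^[0-9A-F]{{8}}-[0-9A-F]{{4}}-{version}[0-9A-F]{{3}}"
        rf"-[89AB][0-9A-F]{{3}}-[0-9A-F]{{12}}$",
        re.IGNORECASE,
    )


samples = {
    1: uuid.uuid1(),
    3: uuid.uuid3(uuid.NAMESPACE_DNS, "example.com"),
    4: uuid.uuid4(),
    5: uuid.uuid5(uuid.NAMESPACE_DNS, "example.com"),
}

# For each sample, list the versions whose pattern accepts it.
accepted_by = {
    version: [v for v in range(1, 6) if version_pattern(v).match(str(value))]
    for version, value in samples.items()
}
print(accepted_by)
```

Each sample should be accepted only by the pattern for its own version.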

answered Jul 04 '16 at 19:20

Ivan Gabriele

The patterns do not include lower case letters. It should also contain `a-f` next to each `A-F` scope. – Pawel Psztyc Jun 26 '17 at 22:21
31

The `i` at the end of the regex marks it as case insensitive. – johnhaley81 Jun 30 '17 at 03:00
A pattern modifier cannot always be used. For example, in an OpenAPI definition, the pattern is case sensitive. – Stephane Janicaud Mar 25 '20 at 13:15
1

@StephaneJanicaud In OpenAPI, you should rather use the `format` modifier by setting it to "uuid" instead of using a regex to test UUIDs: https://swagger.io/docs/specification/data-models/data-types/#format – Ivan Gabriele Mar 27 '20 at 12:03
Thank you @IvanGabriele for the tip. It was just an example; it's the same problem whenever you want to check any case-insensitive pattern. – Stephane Janicaud Mar 27 '20 at 12:46

score 42 · Accepted Answer · answered Sep 25 '08 at 22:27

I agree that by definition your regex does not miss any UUID. However, it may be useful to note that if you are searching specifically for Microsoft's Globally Unique Identifiers (GUIDs), there are five equivalent string representations for a GUID:

"ca761232ed4211cebacd00aa0057b223" 

"CA761232-ED42-11CE-BACD-00AA0057B223" 

"{CA761232-ED42-11CE-BACD-00AA0057B223}" 

"(CA761232-ED42-11CE-BACD-00AA0057B223)" 

"{0xCA761232, 0xED42, 0x11CE, {0xBA, 0xCD, 0x00, 0xAA, 0x00, 0x57, 0xB2, 0x23}}"

answered Sep 25 '08 at 22:27

Panos

3

Under what situations would the first pattern be found? i.e. Is there a .Net function that would strip the hyphens or return the GUID without hyphens? – Guy Sep 25 '08 at 22:32
1

You can get it with myGuid.ToString("N"). – Panos Sep 25 '08 at 22:38
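Once a candidate string has been found, the first four representations listed in this answer can be normalized to a single value. The sketch below is one way to do that in Python, relying on `uuid.UUID`, which tolerates braces and missing hyphens; the parentheses are stripped by hand. `parse_guid` is a hypothetical helper, and the fifth (`0x...`) form is not handled here.

```python
import uuid


def parse_guid(text):
    """Parse the plain, hyphenated, braced or parenthesized form into a uuid.UUID."""
    # uuid.UUID copes with braces and missing hyphens itself; parentheses
    # are removed here because it does not strip those.
    return uuid.UUID(text.strip().strip("()"))


forms = [
    "ca761232ed4211cebacd00aa0057b223",
    "CA761232-ED42-11CE-BACD-00AA0057B223",
    "{CA761232-ED42-11CE-BACD-00AA0057B223}",
    "(CA761232-ED42-11CE-BACD-00AA0057B223)",
]

# All four spellings should collapse to one value.
parsed = {parse_guid(form) for form in forms}
print(parsed)
```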

iGEL · Answer 6 · 2018-02-28T11:13:45.620

40

/^[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89AB][0-9a-f]{3}-[0-9a-f]{12}$/i

Gajus' regexp rejects UUID V1-3 and 5, even though they are valid.

edited Feb 28 '18 at 11:13

answered Jun 24 '14 at 13:19

iGEL

1

But it allows invalid versions (like 8 or A) and invalid variants. – Brice Feb 13 '18 at 10:33
1

Note that AB in [89AB][0-9a-f] is upper case and the rest of allowed characters are lower case. It has caught me out in Python – Tony Sepia Jul 19 '18 at 13:21

score 18 · Answer 7 · edited Nov 25 '14 at 01:50

[\w]{8}(-[\w]{4}){3}-[\w]{12} has worked for me in most cases.

Or if you want to be really specific [\w]{8}-[\w]{4}-[\w]{4}-[\w]{4}-[\w]{12}.

edited Nov 25 '14 at 01:50

Whymarrh

answered Oct 22 '10 at 16:45

JimP

3

It is worth noting that \w, in Java at least, matches _ as well as hexadecimal digits. Replacing the \w with \p{XDigit} may be more appropriate as that is the POSIX class defined for matching hexadecimal digits. This may break when using other Unicode charsets tho. – oconnor0 Mar 07 '11 at 21:41
1

@oconnor `\w` usually means "word characters" It will match much more than hex-digits. Your solution is much better. Or, for compatibility/readability you could use `[a-f0-9]` – exhuma Sep 25 '11 at 09:23
1

Here is a string that looks like a UUID and matches those patterns, but is an invalid UUID: 2wtu37k5-q174-4418-2cu2-276e4j82sv19 – Travis Stevens Dec 01 '16 at 19:37
@OleTraveler not true, works like a charm:

import re

def valid_uuid(uuid):
    regex = re.compile('[\w]{8}-[\w]{4}-[\w]{4}-[\w]{4}-[\w]{12}', re.I)
    match = regex.match(uuid)
    return bool(match)

valid_uuid('2wtu37k5-q174-4418-2cu2-276e4j82sv19')

– Tom Wojcik Dec 01 '17 at 09:25
3

@tom That string (2wt...) is an invalid UUID, but the pattern given in this answer matches that string indicating falsely that it is a valid UUID. It's too bad I don't remember why that UUID is invalid. – Travis Stevens Dec 02 '17 at 15:01
@OleTraveler That's interesting. I don't know much about UUIDs in general but my UUIDs were generated by the [UUID 4 generator](https://www.uuidgenerator.net/) and it matches what [wikipedia says](https://en.wikipedia.org/wiki/Universally_unique_identifier#Format). EDIT: I read again what you wrote. I may understand what you mean, this code just counts the number of characters but UUID also consists of version and variant within itself. For me this code is sufficient, but indeed there are cases where invalid UUID will match this pattern. Thanks for contributing to the discussion. – Tom Wojcik Dec 04 '17 at 13:01
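One concrete reason the string discussed in these comments cannot be a UUID is that it contains letters outside a-f. The Python sketch below makes that visible; the names `WORD_RE` and `HEX_RE` are chosen for this example, and `uuid.UUID` is used only as an independent check.

```python
import re
import uuid

# The \w-based pattern from this answer and a hex-only alternative.
WORD_RE = re.compile(r"[\w]{8}-[\w]{4}-[\w]{4}-[\w]{4}-[\w]{12}")
HEX_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)

candidate = "2wtu37k5-q174-4418-2cu2-276e4j82sv19"

# Characters that are neither hex digits nor hyphens.
non_hex = sorted(set(candidate) - set("0123456789abcdefABCDEF-"))
print(non_hex)

word_accepts = bool(WORD_RE.fullmatch(candidate))
hex_accepts = bool(HEX_RE.fullmatch(candidate))
print(word_accepts, hex_accepts)

# The standard library refuses to parse it as well.
try:
    uuid.UUID(candidate)
    parsed_ok = True
except ValueError:
    parsed_ok = False
print(parsed_ok)
```

The `\w` pattern is therefore fine for a quick shape check but too permissive for validation.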

Bruno Bronosky · Answer 8 · 2017-10-30T14:04:46.820

In Python's re, you can span from numeric to upper-case alpha. So..

import re
test = "01234ABCDEFGHIJKabcdefghijk01234abcdefghijkABCDEFGHIJK"
re.compile(r'[0-f]+').findall(test) # Bad: matches all uppercase alpha chars
## ['01234ABCDEFGHIJKabcdef', '01234abcdef', 'ABCDEFGHIJK']
re.compile(r'[0-F]+').findall(test) # Partial: does not match lowercase hex chars
## ['01234ABCDEF', '01234', 'ABCDEF']
re.compile(r'[0-F]+', re.I).findall(test) # Good
## ['01234ABCDEF', 'abcdef', '01234abcdef', 'ABCDEF']
re.compile(r'[0-f]+', re.I).findall(test) # Bad: with re.I, matches every letter
## ['01234ABCDEFGHIJKabcdefghijk01234abcdefghijkABCDEFGHIJK']
re.compile(r'[0-Fa-f]+').findall(test) # Good (with uppercase-only magic)
## ['01234ABCDEF', 'abcdef', '01234abcdef', 'ABCDEF']
re.compile(r'[0-9a-fA-F]+').findall(test) # Good (with no magic)
## ['01234ABCDEF', 'abcdef', '01234abcdef', 'ABCDEF']

That makes the simplest Python UUID regex:

re_uuid = re.compile("[0-F]{8}-([0-F]{4}-){3}[0-F]{12}", re.I)

I'll leave it as an exercise to the reader to use timeit to compare the performance of these.

Enjoy. Keep it Pythonic™!

NOTE: Those spans will also match the characters :;<=>?@ so, if you suspect that could give you false positives, don't take the shortcut. (Thank you Olivier Aubert for pointing that out in the comments.)

[0-F] will indeed match 0-9 and A-F, but also any character whose ASCII code is between 57 (for 9) and 65 (for A), that is to say any of :;<=>?@. — Olivier Aubert, Oct 19 '15 at 08:40
So do not use the above-mentioned code unless you want to consider :=>;?==@?>=:?=@; as a valid UUID :-) — Olivier Aubert, Oct 19 '15 at 08:48
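The caveat about the `[0-F]` range can be reproduced directly. This is a small Python sketch with made-up variable names; the junk string is built only from the punctuation characters that sit between `9` and `A` in ASCII.

```python
import re

# Every character covered by the range 0-F, in code point order.
in_range = "".join(chr(code) for code in range(ord("0"), ord("F") + 1))
print(in_range)

shortcut = re.compile(r"[0-F]{8}-([0-F]{4}-){3}[0-F]{12}", re.I)
strict = re.compile(r"[0-9a-f]{8}-(?:[0-9a-f]{4}-){3}[0-9a-f]{12}", re.I)

# An 8-4-4-4-12 string containing no hex digits at all.
junk = "-".join([":;<=>?@:", ";<=>", "?@:;", "<=>?", "@:;<=>?@:;<="])

shortcut_accepts = bool(shortcut.fullmatch(junk))
strict_accepts = bool(strict.fullmatch(junk))
print(shortcut_accepts, strict_accepts)
```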

score 9 · Answer 9 · answered Sep 25 '08 at 22:14

By definition, a UUID is 32 hexadecimal digits, separated in 5 groups by hyphens, just as you have described. You shouldn't miss any with your regular expression.

http://en.wikipedia.org/wiki/Uuid#Definition

answered Sep 25 '08 at 22:14

pix0r

2

Not correct. RFC4122 only allows [1-5] for the version digit, and [89aAbB] for the variant digit. – broofa Feb 06 '13 at 18:36

score 7 · Answer 10 · edited Sep 18 '13 at 16:06

So, I think Richard Bronosky actually has the best answer to date, but I think you can do a bit to make it somewhat simpler (or at least terser):

re_uuid = re.compile(r'[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}', re.I)

edited Sep 18 '13 at 16:06

Wolkenarchitekt

answered Apr 15 '13 at 23:09

Christopher Smith

1

Even terser: `re_uuid = re.compile(r'[0-9a-f]{8}(?:-[0-9a-f]{4}){4}[0-9a-f]{8}', re.I)` – Pedro Gimeno May 12 '14 at 11:01
If you're looking to use capture groups to actually capture data out of a string, using this is NOT a great idea. It looks a little simpler, but complicates some usages. – std''OrgnlDave Dec 04 '20 at 16:07
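The point about capture groups in the last comment can be illustrated with a short Python sketch. The group names below follow the RFC 4122 field names but are otherwise an arbitrary choice made for this example, as are the variable names.

```python
import re

# Explicit version: one named group per UUID field.
EXPLICIT_RE = re.compile(
    r"(?P<time_low>[0-9a-f]{8})-"
    r"(?P<time_mid>[0-9a-f]{4})-"
    r"(?P<time_hi_and_version>[0-9a-f]{4})-"
    r"(?P<clock_seq>[0-9a-f]{4})-"
    r"(?P<node>[0-9a-f]{12})",
    re.I,
)

# Terse version with a repeated capturing group.
TERSE_RE = re.compile(r"([0-9a-f]{8})(-[0-9a-f]{4}){3}-([0-9a-f]{12})", re.I)

text = "id: f47ac10b-58cc-4372-a567-0e02b2c3d479."

explicit_parts = EXPLICIT_RE.search(text).groupdict()
terse_parts = TERSE_RE.search(text).groups()

print(explicit_parts)
# The repeated group keeps only its last repetition, so two of the
# middle fields are lost.
print(terse_parts)
```

If only the whole UUID is needed, the terse form with a non-capturing group is enough; the explicit form pays off when the individual fields matter.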

score 6 · Answer 11 · answered Apr 16 '14 at 18:23

Variant for C++:

#include <iostream>
#include <regex>
#include <string>

int main() {
    // Source string
    std::wstring srcStr = L"String with GUID: {4d36e96e-e325-11ce-bfc1-08002be10318} any text";

    // Case-insensitive pattern for a brace-wrapped GUID; group 1 captures the whole GUID
    std::wsmatch match;
    std::wregex rx(L"(\\{[A-F0-9]{8}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{4}-[A-F0-9]{12}\\})",
                   std::regex_constants::icase);

    // Search; match[1] is only meaningful when the search succeeds
    if (std::regex_search(srcStr, match, rx)) {
        std::wstring strGUID = match[1];
        std::wcout << strGUID << L'\n';
    }
    return 0;
}

score 5 · Answer 12 · answered Jul 02 '16 at 17:23

For UUIDs generated on OS X with uuidgen, the regex pattern is

[A-F0-9]{8}-[A-F0-9]{4}-4[A-F0-9]{3}-[89AB][A-F0-9]{3}-[A-F0-9]{12}

Verify with

uuidgen | grep -E "[A-F0-9]{8}-[A-F0-9]{4}-4[A-F0-9]{3}-[89AB][A-F0-9]{3}-[A-F0-9]{12}"

answered Jul 02 '16 at 17:23

Quanlong

score 3 · Answer 13 · edited Nov 09 '18 at 11:20

$UUID_RE = join '-', map { "[0-9a-f]{$_}" } 8, 4, 4, 4, 12;

BTW, allowing only 4 on one of the positions is only valid for UUIDv4. But v4 is not the only UUID version that exists. I have met v1 in my practice as well.
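For readers more at home in Python, roughly the same construction as the Perl one-liner above might look like this. It is a sketch, not a translation the author provided; `UUID_PATTERN` and `UUID_RE` are names invented here, and `re.IGNORECASE` is an added assumption.

```python
import re

# Build "[0-9a-f]{8}-[0-9a-f]{4}-..." from the group lengths.
UUID_PATTERN = "-".join(f"[0-9a-f]{{{length}}}" for length in (8, 4, 4, 4, 12))
print(UUID_PATTERN)

# IGNORECASE is an assumption; drop it to accept lower-case input only.
UUID_RE = re.compile(UUID_PATTERN, re.IGNORECASE)

# A version 1 UUID is accepted because no version digit is enforced.
accepts_v1 = bool(UUID_RE.fullmatch("123e4567-e89b-12d3-a456-426655440001"))
print(accepts_v1)
```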

edited Nov 09 '18 at 11:20

rjh

answered Jan 17 '16 at 17:04

abufct

score 3 · Answer 14 · answered Jul 13 '20 at 08:34

Here is the working REGEX: https://www.regextester.com/99148

const regex = /[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}/;

answered Jul 13 '20 at 08:34

gildniy

Walf · Answer 15 · 2020-07-02T00:07:34.440

2

If using POSIX regex (grep -E, MySQL, etc.), this may be easier to read & remember:

[[:xdigit:]]{8}(-[[:xdigit:]]{4}){3}-[[:xdigit:]]{12}

Edit: Perl & PCRE flavours also support POSIX character classes, so this will work with them. For those, change the (…) to a non-capturing subgroup (?:…).

edited Jul 02 '20 at 00:07

answered Apr 03 '20 at 23:57

Walf

asherbar · Answer 16 · 2019-11-13T14:07:13.780

1

For bash:

grep -E "[a-f0-9]{8}-[a-f0-9]{4}-4[a-f0-9]{3}-[89aAbB][a-f0-9]{3}-[a-f0-9]{12}"

For example:

$> echo "f2575e6a-9bce-49e7-ae7c-bff6b555bda4" | grep -E "[a-f0-9]{8}-[a-f0-9]{4}-4[a-f0-9]{3}-[89aAbB][a-f0-9]{3}-[a-f0-9]{12}"
f2575e6a-9bce-49e7-ae7c-bff6b555bda4

edited Nov 13 '19 at 14:07

answered Nov 13 '19 at 08:57

asherbar

You need to include grep's `-i` option for case-insensitive matching. – Alastair Irvine Jun 30 '20 at 11:37

score 0 · Answer 17 · answered Dec 15 '20 at 18:55

I wanted to contribute my regex, as it covers all the cases from the OP and captures each part of the UUID in its own group, so you don't need to post-process the string to get the individual parts; the regex already extracts them for you:

([\d\w]{8})-?([\d\w]{4})-?([\d\w]{4})-?([\d\w]{4})-?([\d\w]{12})|[{0x]*([\d\w]{8})[0x, ]{4}([\d\w]{4})[0x, ]{4}([\d\w]{4})[0x, {]{5}([\d\w]{2})[0x, ]{4}([\d\w]{2})[0x, ]{4}([\d\w]{2})[0x, ]{4}([\d\w]{2})[0x, ]{4}([\d\w]{2})[0x, ]{4}([\d\w]{2})[0x, ]{4}([\d\w]{2})[0x, ]{4}([\d\w]{2})
