109

Is it possible to use a RegEx to validate, or sanitize Base64 data? That's the simple question, but the factors that drive this question are what make it difficult.

I have a Base64 decoder that cannot fully rely on the input data to follow the RFC specs. So the issues I face are things like Base64 data that may not be broken up into 78-character lines (I think it's 78; I'd have to double-check the RFC, so don't ding me if the exact number is wrong), or lines that may not end in CRLF: they may end in only a CR, or an LF, or maybe neither.

So, I've had a hell of a time parsing Base64 data formatted as such. Due to this, examples like the following become impossible to decode reliably. I will only display partial MIME headers for brevity.

Content-Transfer-Encoding: base64

VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu

Ok, so parsing that is no problem, and the result is exactly what we would expect. In 99% of cases, any code that at least verifies that each char in the buffer is a valid Base64 char works perfectly. But the next example throws a wrench into the mix.

Content-Transfer-Encoding: base64

http://www.stackoverflow.com
VGhpcyBpcyBzaW1wbGUgQVNDSUkgQmFzZTY0IGZvciBTdGFja092ZXJmbG93IGV4YW1wbGUu

This is a version of Base64 encoding that I have seen in some viruses and other things that attempt to take advantage of some mail readers' desire to parse MIME at all costs, versus ones that go strictly by the book, or rather by the RFC, if you will.

My Base64 decoder decodes the second example to the following data stream. And keep in mind here, the original stream is all ASCII data!

[0x]86DB69FFFC30C2CB5A724A2F7AB7E5A307289951A1A5CC81A5CC81CDA5B5C1B19481054D0D
2524810985CD94D8D08199BDC8814DD1858DAD3DD995C999B1BDDC8195E1B585C1B194B8

Anyone have a good way to solve both problems at once? I'm not sure it's even possible, outside of doing two transforms on the data with different rules applied, and comparing the results. However if you took that approach, which output do you trust? It seems that ASCII heuristics is about the best solution, but how much more code, execution time, and complexity would that add to something as complicated as a virus scanner, which this code is actually involved in? How would you train the heuristics engine to learn what is acceptable Base64, and what isn't?


UPDATE:

Due to the number of views this question continues to get, I've decided to post the simple RegEx that I've been using in a C# application for 3 years now, with hundreds of thousands of transactions. Honestly, I like the answer given by Gumbo the best, which is why I picked it as the selected answer. But to anyone using C#, and looking for a very quick way to at least detect whether a string, or byte[] contains valid Base64 data or not, I've found the following to work very well for me.

[^-A-Za-z0-9+/=]|=[^=]|={3,}$

And yes, this is just for a STRING of Base64 data, NOT a properly formatted RFC1341 message. So, if you are dealing with data of this type, please take that into account before attempting to use the above RegEx. If you are dealing with Base16, Base32, Radix or even Base64 for other purposes (URLs, file names, XML Encoding, etc.), then it is highly recommended that you read RFC4648 that Gumbo mentioned in his answer, as you need to be well aware of the charset and terminators used by the implementation before attempting to use the suggestions in this question/answer set.
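As written, the pattern above seems to match when something is wrong (a character outside the allowed set, an `=` followed by something other than `=`, or three or more trailing `=`), so a match would mean the string is not Base64. Here is a minimal Python sketch under that reading. The function name `looks_like_base64` is hypothetical, and note that the character class also lets `-` through, which is not part of the standard alphabet.

```python
import re

# The pattern from the update, read here as a detector of *invalid* input
# (an assumption about how it is meant to be used).
INVALID = re.compile(r"[^-A-Za-z0-9+/=]|=[^=]|={3,}$")


def looks_like_base64(text: str) -> bool:
    """Return True when the pattern finds nothing to object to."""
    return INVALID.search(text) is None
```

This only checks characters and padding placement; it does not check that the length is a multiple of four.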

LarryF
  • I guess that you have to define the task better. It is completely unclear what is your aim: be strict? parse 100% of the samples? ... – ADEpt Jan 23 '09 at 23:55
  • Your first example should be 'VGhpcyBpcyBhIHNpbXBsZSBBU0NJSSBCYXNlNjQgZXhhbXBsZSBmb3IgU3RhY2tPdmVyZmxvdy4=' – jfs Jan 24 '09 at 01:01
  • Why not use a standard solution in your language? Why do you need a hand-written parser based on regexes? – jfs Jan 24 '09 at 01:05
  • @JF - Well, I don't. I have looked at other methods, and didn't have a lot of luck, so I thought I'd give RegEx a try. This is all C/C++, if it matters. And I already do the pre-parsing of ANYTHING non-b64, toss it, and decode the rest. – LarryF Jan 24 '09 at 01:59
  • @ADEpt - The aim is to be able to parse 100% of the time regardless of how badly formatted, or damaged the source is. (I've even dealt with viruses that put random BINARY data inside the b64 data)... – LarryF Jan 24 '09 at 02:01
  • How can I replace non Base64 chars with empty strings? – Sapphire Feb 07 '13 at 21:19
  • @Sapphire - That depends. What you are asking is worthy of a whole new question. There are three ways you can do it, that I see. 1) Eat the bad chars as you are decoding. 2) Use a RegEx replace, to replace any non Base64 char with "", or 3) Use a function in code to walk your buffer, and test each char against a Base64 table, and if the char isn't there, simply replace the instance with char(32), or " "... Contact me off SO, and I'd be happy to share some C code to do what you are trying to do. – LarryF Jul 16 '14 at 20:22
  • Note: According to RFC 2045 the new line is added after 76 characters: "The encoded output stream must be represented in lines of no more than 76 characters each. All line breaks or other characters not found in Base64 alphabet must be ignored by decoding software". – Benny Neugebauer Jan 13 '17 at 22:32
  • 1
    Great question. Though I tried the **UPDATE** regex by running it against a base64-encoded SHA returned by NPM and [it failed](https://regexr.com/3uc7a) whereas the regex in selected answer [works just fine](https://regexr.com/3uc7a). – Josh Habdas Aug 23 '18 at 13:18
  • 1
    Not sure how the **UPDATE** regex is still posted without correction, but it looks like the author _meant_ to put the `^` outside the brackets, as a start-anchor. However, a much better regex, without getting as complicated as the accepted answer, would be `^[-A-Za-z0-9+/]*={0,3}$` – kael Nov 01 '19 at 22:23

7 Answers

158

From the RFC 4648:

Base encoding of data is used in many situations to store or transfer data in environments that, perhaps for legacy reasons, are restricted to US-ASCII data.

So whether the data should be considered dangerous depends on what the encoded data is going to be used for.

But if you’re just looking for a regular expression to match Base64 encoded words, you can use the following:

^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$
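For what it's worth, here is a minimal Python sketch of how this pattern could be applied. It assumes all whitespace is stripped before matching, so that line breaks in the input don't cause a rejection; the function name `is_base64` is made up for the example.

```python
import re

# Same pattern as above; fullmatch() supplies the ^...$ anchoring.
BASE64_RE = re.compile(
    r"(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
)


def is_base64(text: str) -> bool:
    # Assumption: whitespace (CR, LF, spaces) is not significant, so drop it.
    compact = re.sub(r"\s+", "", text)
    return BASE64_RE.fullmatch(compact) is not None
```

With the question's two samples, the first body passes and the one with the URL line in front fails because of the `:` and `.` characters. A word like `name` passes too, since it happens to be four alphabet characters.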
Gumbo
  • 1
    That doesn't deal with the 'white space' at line ends. I'm not sure whether program always put the newline at a 4-byte boundary - it would be reasonable, but then following the standard would be reasonable too, but people don't do it. – Jonathan Leffler Jan 24 '09 at 01:09
  • 11
    The simplest solution would be to strip out all whitespace (which is ignored as per the RFC) before validation. – Ben Blank Jan 24 '09 at 01:35
  • What about words that do *not* end with paddings, eg encoded data where the length is a multiple of 4? – Nico Haase Apr 17 '13 at 09:18
  • 2
    The last non-capturing group for the padding is optional. – Gumbo Apr 17 '13 at 10:08
  • 6
    At first I was skeptical of the complexity, but it validates quite well. If you'd just like to match base64-ish I'd come up with doing ^[a-zA-Z0-9+/]={0,3}$, this is better! – Lodewijk Sep 10 '14 at 00:59
  • This regexp doesn't work on word *name*, it thinks that it is Base64 encoded – Bogdan Nechyporenko Mar 26 '15 at 09:32
  • 3
    @BogdanNechyporenko That's because `name` is a valid Base64 encoding of the (hex) byte sequence `9d a9 9e`. – Marten Jun 09 '16 at 11:50
  • @Lodewijk's comment sparked an idea of sanity. If we fix his broken regex (incorrect padding-length and missing quantifier `+` after first character-class) to `^[+/0-9A-Za-z]+={0,2}$` and perform a simple multiple-of-4 check (`strBase64.length%4===0` or bitwise mask: `(strBase64.length&3)===0`) *before* this regex, we discard 75% of invalid string-lengths (not a multiple of 4) by a simple computation instead of 'abusing' regex to also perform a simple calculation. The regex then catches the edge-case of the empty string (and UTF-16 surrogate pairs) that slipped through the multiple-of-4 check. – GitaarLAB Dec 20 '17 at 05:21
  • Don't forget about the '-' and '_' characters – Игорь Демянюк Jun 08 '18 at 06:38
  • 2
    Can I ask a question that's driving me nuts? How is "Paul" valid base64? – The Bearded Llama Jun 27 '18 at 17:23
  • 1
    This expression matches any string which has length multiples of 4 – Nithin Feb 18 '19 at 10:35
  • ^([A-Za-z0-9+\\/]{4})*([A-Za-z0-9+\\/]{3}(=){0,1}|[A-Za-z0-9+\\/]{2}(==){0,1})?$, little bit modification, "=", "==" is optional while providing. – P Satish Patro Nov 25 '19 at 13:14
  • 3
    `^(?:[A-Za-z0-9+\/]{4})*(?:[A-Za-z0-9+\/]{2}==|[A-Za-z0-9+\/]{3}=|[A-Za-z0-9+\/]{4})$` (the forward slashes must be escaped with a backslash) – Syed Khizaruddin Jan 27 '20 at 07:47
38
^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$

This one is good, but it will match an empty string.

This one does not match an empty string:

^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{4})$
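A quick Python sketch to compare the two patterns side by side. The names `ALLOWS_EMPTY`, `REJECTS_EMPTY` and `check` are just for illustration.

```python
import re

# First pattern: the padded tail is optional, so "" matches.
ALLOWS_EMPTY = re.compile(
    r"(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?"
)

# Second pattern: the tail is mandatory but may be a plain 4-character
# block, so at least one block is always required.
REJECTS_EMPTY = re.compile(
    r"(?:[A-Za-z0-9+/]{4})*"
    r"(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{4})"
)


def check(text: str) -> tuple:
    """Return (first pattern matches, second pattern matches)."""
    return (
        ALLOWS_EMPTY.fullmatch(text) is not None,
        REJECTS_EMPTY.fullmatch(text) is not None,
    )
```

The two only disagree on the empty string; short values such as `MQ==` are accepted by both.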
njzk2
  • 2
    Why is an empty string invalid? – Josh Lee Aug 19 '11 at 20:21
  • 8
    it is not. but if you are using a regex to find out if a given string is or is not base64, chances are you are not interested in empty strings. At least i know i am not. – njzk2 Aug 22 '11 at 13:19
  • Just replace the star/asterisk operator `*` with `+` like so: `^(?:[A-Za-z0-9+/]{4})+(?:[A-Za-z0-9+/]{2}==|[A-Za-z0-9+/]{3}=)?$` to prevent empty strings from matching the pattern. – Lars Gyrup Brink Nielsen Oct 22 '13 at 07:55
  • 4
    @LayZee : if you do so, you force the base64 string to contain at least a 4-size block, rendering valid values such as `MQ==` not a match to your expression – njzk2 Oct 22 '13 at 13:47
  • It doesn't match `AQENVg688MSGlEgdOJpjIUC` – expert Oct 02 '15 at 17:32
  • 5
    @ruslan nor should it. this is not a valid base 64 string. (size is 23, which is not a multiple of 4). `AQENVg688MSGlEgdOJpjIUC=` is the valid form. – njzk2 Oct 02 '15 at 17:43
  • @njzk2 How about replacing the last `?` with `{1}`? – Jin Kwon Feb 12 '19 at 04:56
  • 1
    @JinKwon base64 ends with 0, 1 or 2 `=`. The last `?` allows for 0 `=`. Replacing it with `{1}` requires 1 or 2 ending `=` – njzk2 Feb 16 '19 at 01:21
4

The best regexp I have found so far is here: https://www.npmjs.com/package/base64-regex

In the current version it looks like this:

module.exports = function (opts) {
  opts = opts || {};
  var regex = '(?:[A-Za-z0-9+\/]{4}\\n?)*(?:[A-Za-z0-9+\/]{2}==|[A-Za-z0-9+\/]{3}=)';

  return opts.exact ? new RegExp('(?:^' + regex + '$)') :
                    new RegExp('(?:^|\\s)' + regex, 'g');
};
Bogdan Nechyporenko
4

Neither a ":" nor a "." will show up in valid Base64, so I think you can unambiguously throw away the http://www.stackoverflow.com line. In Perl, say, something like

use feature 'say';
use MIME::Base64;

my $sanitized_str = join q{}, grep {!/[^A-Za-z0-9+\/=]/} split /\n/, $str;

say decode_base64($sanitized_str);

might be what you want. It produces

This is simple ASCII Base64 for StackOverflow example.
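Roughly the same idea as a Python sketch, in case Perl isn't your thing. It drops any line that contains a character outside the Base64 alphabet and decodes what is left; `decode_clean_lines` is a made-up name, and whether throwing away whole lines is the right policy depends on your data.

```python
import base64
import re

# Anything that is not a Base64 alphabet character or padding.
NON_BASE64 = re.compile(r"[^A-Za-z0-9+/=]")


def decode_clean_lines(body: str) -> bytes:
    # splitlines() copes with CR, LF and CRLF line endings alike.
    kept = [line for line in body.splitlines() if not NON_BASE64.search(line)]
    return base64.b64decode("".join(kept))
```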

oylenshpeegul
  • I can agree there, but all the OTHER letters in the URL do happen to be valid base64... So, where do you draw the line? Just at line breaks? (I have seen ones where there is just a couple random chars in the middle of the line. Can't toss the rest of the line just because of that, IMHO)... – LarryF Jan 24 '09 at 02:05
  • @LarryF: unless there's integrity checking on the base-64 encoded data, you can't tell what to do with any base-64 block of data containing incorrect characters. Which is the best heuristic: ignore the incorrect characters (allowing any and all correct ones) or reject the lines, or reject the lot? – Jonathan Leffler Jan 24 '09 at 04:08
  • (continued): the short answer is "it depends" - on where the data comes from and the sorts of mess you find in it. – Jonathan Leffler Jan 24 '09 at 04:09
  • (resumed): I see from comments to the question that you want to accept anything that might be base-64. So simply map each and every character that's not in your base-64 alphabet (note that there are URL-safe and other such variant encodings) including the newlines and colons, and take what's left. – Jonathan Leffler Jan 24 '09 at 04:11
4

To validate a Base64 image data URI, we can use this regex:

/^data:image\/(?:gif|png|jpeg|bmp|webp|svg\+xml)(?:;charset=utf-8)?;base64,(?:[A-Za-z0-9]|[+/])+={0,2}/

  private validBase64Image(base64Image: string): boolean {
    const regex = /^data:image\/(?:gif|png|jpeg|bmp|webp|svg\+xml)(?:;charset=utf-8)?;base64,(?:[A-Za-z0-9]|[+/])+={0,2}/;
    // Coerce to a boolean so an empty string yields false instead of ''.
    return !!base64Image && regex.test(base64Image);
  }
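One thing to keep in mind: the pattern has no end anchor, so anything that follows the Base64 part is not checked. Below is a Python sketch of the same check that requires the whole string to match. The name `is_base64_image` is hypothetical, and the list of image types is simply the one from this answer, not an exhaustive list.

```python
import re

# Same structure as the regex above; fullmatch() anchors both ends.
DATA_URI_IMAGE = re.compile(
    r"data:image/(?:gif|png|jpeg|bmp|webp|svg\+xml)"
    r"(?:;charset=utf-8)?;base64,"
    r"[A-Za-z0-9+/]+={0,2}"
)


def is_base64_image(data_uri: str) -> bool:
    return bool(data_uri) and DATA_URI_IMAGE.fullmatch(data_uri) is not None
```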
Jayani Sumudini
  • 1
    Thank you! Very helpful regarding the meta properties at the beginning of a base64 image string. One suggestion: There is (at least) one mime type missing, `svg+xml`, so the first capturing group should probably be extended to `(?:gif|png|jpeg|bmp|webp|svg\+xml)`. – HynekS Apr 03 '21 at 12:58
  • @HynekS. Yes. I updated my answer. Thank you :-) – Jayani Sumudini Apr 05 '21 at 04:43
3

The answers presented so far fail to check that the Base64 string has all pad bits set to 0, as required for it to be the canonical representation of Base64 (which is important in some environments, see https://tools.ietf.org/html/rfc4648#section-3.5) and therefore, they allow aliases that are different encodings for the same binary string. This could be a security problem in some applications.

Here is the regexp that verifies that the given string is not just valid base64, but also the canonical base64 string for the binary data:

^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)?$

The cited RFC considers the empty string as valid (see https://tools.ietf.org/html/rfc4648#section-10) therefore the above regex also does.

The equivalent regular expression for base64url (again, refer to the above RFC) is:

^(?:[A-Za-z0-9_-]{4})*(?:[A-Za-z0-9_-][AQgw]==|[A-Za-z0-9_-]{2}[AEIMQUYcgkosw048]=)?$
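To make the aliasing concrete, here is a small Python sketch using the standard-alphabet regex. `MQ==` and `MR==` both decode to the same single byte, but only the first has its pad bits set to zero. The helper names are made up for the example, and the round-trip check is just an independent way to confirm what the regex says.

```python
import base64
import re

# The canonical-form pattern for the standard alphabet, as given above.
CANONICAL = re.compile(
    r"(?:[A-Za-z0-9+/]{4})*"
    r"(?:[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)?"
)


def is_canonical(text: str) -> bool:
    return CANONICAL.fullmatch(text) is not None


def round_trips(text: str) -> bool:
    # A canonical string survives decode + re-encode unchanged.
    return base64.b64encode(base64.b64decode(text)).decode("ascii") == text
```

With Python's default (lenient) decoder, `base64.b64decode("MR==")` gives the same byte as `base64.b64decode("MQ==")`, which is exactly the kind of alias the regex rejects.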
Pedro Gimeno
2

Here's an alternative regular expression:

^(?=(.{4})*$)[A-Za-z0-9+/]*={0,2}$

It satisfies the following conditions:

  • The string length must be a multiple of four - (?=(.{4})*$)
  • The content must be alphanumeric characters or + or / - [A-Za-z0-9+/]*
  • It can have up to two padding (=) characters on the end - ={0,2}
  • It accepts empty strings
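
The same four conditions can also be checked without a lookahead by testing the length separately. A minimal Python sketch, not a drop-in replacement for the regex above; `has_base64_shape` is a hypothetical name.

```python
import re

# Alphabet characters followed by at most two '=' characters.
SHAPE = re.compile(r"[A-Za-z0-9+/]*={0,2}")


def has_base64_shape(text: str) -> bool:
    # Length must be a multiple of four; the empty string is accepted.
    return len(text) % 4 == 0 and SHAPE.fullmatch(text) is not None
```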
Paul