
I went through several discussions to find out how to do this, but did not find an exact solution. I have used the following regular expression to check whether a string is Base64 encoded:

^([A-Za-z0-9+/]{4})*([A-Za-z0-9+/]{4}|[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)$  

But this is not accurate every time. I know I can use some try-catch method, but that is an expensive operation in Java. Is there an exact way of doing this? I am using Java 7.

Mohammed Javad
  • Checking if the regular expression is matched is probably as expensive as, if not more expensive than, decoding the Base64 text in the first place. – JB Nizet Oct 22 '18 at 06:04
  • Please take a look at this answer: https://stackoverflow.com/questions/475074/regex-to-parse-or-validate-base64-data/475217 – Vikrant Kashyap Oct 22 '18 at 06:05
  • Thanks Vikrant Kashyap. It's working better. But it would be a problem if there is no = sign at the end of an encoded string. – Mohammed Javad Oct 23 '18 at 06:12

2 Answers


I would advise caution on this. There are two problems:

The first problem is that regexes like the one you have shown us can suffer from performance problems when the string is not a match. In particular, you get a lot of unnecessary backtracking before match failure.

(It is possible to avoid the backtracking by using "reluctant" or "possessive" quantifiers rather than "greedy" quantifiers, but you need to understand what you are doing.)
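To illustrate why care is needed, here is a minimal sketch of a possessive variant of the pattern from the question. The class and method names are hypothetical. Note that simply appending `+` to the first quantifier would break the original pattern: the possessive group would swallow the last four characters and never give them back to the final alternation. In this sketch the final group is therefore reduced to an optional padded tail, which also means that, unlike the original, it accepts the empty string.

```java
import java.util.regex.Pattern;

public class PossessiveBase64Regex {
    // Hypothetical rewrite of the question's pattern using possessive quantifiers.
    // (?:...)*+ consumes as many complete 4-character groups as possible and
    // never backtracks into them; the optional tail handles the padded group.
    private static final Pattern BASE64 = Pattern.compile(
            "^(?:[A-Za-z0-9+/]{4})*+(?:[A-Za-z0-9+/]{3}=|[A-Za-z0-9+/]{2}==)?+$");

    public static boolean looksLikeBase64(String s) {
        return BASE64.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeBase64("aGVsbG8=")); // padded input
        System.out.println(looksLikeBase64("aGVsbG8"));  // padding missing
    }
}
```

This only checks the syntax (alphabet, length, padding); it says nothing about whether the input was really produced by a Base64 encoder.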

Even so, unless the string is short, it is likely to be more efficient to attempt a Base64 decode using a Base64.Decoder::decode method (in java.util.Base64, available from Java 8) and catch a possible exception than to validate with a regex. And you get the potential bonus of having the decoded data.

(Maybe as a speedup you could check the first 4 and last 4 characters before attempting a full base64 decode.)
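A rough sketch of the decode-and-catch approach might look like the following. It assumes Java 8 or later, since java.util.Base64 does not exist in Java 7 (there you would need a decoder from a library instead). The class name and the `tryDecode` helper are made up for illustration.

```java
import java.util.Base64;

public class Base64Check {
    // Hypothetical helper: returns the decoded bytes, or null when the input
    // is not valid Base64 according to the basic (RFC 4648) decoder.
    public static byte[] tryDecode(String s) {
        try {
            return Base64.getDecoder().decode(s);
        } catch (IllegalArgumentException e) {
            // Thrown for characters outside the alphabet or a malformed final unit.
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(tryDecode("aGVsbG8=") != null);    // true
        System.out.println(tryDecode("not base64!") != null); // false
    }
}
```

Returning the bytes rather than a boolean means a successful validation does not have to be followed by a second decode.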


The second problem is that (in theory) a string may be syntactically valid as Base64 but have been produced by some other "process". Thus, when you decode the string, you may get garbage. Therefore, it may be worth decoding the string and checking what is inside as part of your validation.
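What "checking what is inside" means depends entirely on what you expect the payload to be. As one hedged example, if the payload is assumed to be UTF-8 text, you could decode strictly and reject anything that is not well-formed UTF-8. The class and method names below are hypothetical, and java.util.Base64 again requires Java 8 or later.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64ContentCheck {
    // Hypothetical helper: returns the decoded text, or null if the input is
    // not valid Base64 or the decoded bytes are not well-formed UTF-8.
    // Assumption: the expected payload is UTF-8 text.
    public static String decodeUtf8OrNull(String s) {
        byte[] bytes;
        try {
            bytes = Base64.getDecoder().decode(s);
        } catch (IllegalArgumentException e) {
            return null; // not syntactically valid Base64
        }
        try {
            // REPORT makes the decoder throw instead of substituting U+FFFD.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null; // valid Base64, but the content is not UTF-8 text
        }
    }

    public static void main(String[] args) {
        System.out.println(decodeUtf8OrNull("aGVsbG8=")); // hello
        System.out.println(decodeUtf8OrNull("test"));     // null
    }
}
```

The word "test" is syntactically valid Base64, but it decodes to the bytes B5 EB 2D, which are not valid UTF-8, so the content check rejects it.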


I know I can use some try-catch method, but that is an expensive operation in Java.

It is all relative. Furthermore, newer JVMs can throw and handle exceptions more efficiently due to some optimizations introduced in (I think) Java 8.

Stephen C
  • +1 for the second point more than the first. As Base64 is just upper and lower alphanums plus two symbols (which ones depend on type) spurious decodings are likely. – Boris the Spider Oct 22 '18 at 06:34

A Base64 rendering of any given string is just another string over an alphabet of 64 tokens. Can a string be regex-checked for consisting only of tokens from that alphabet? Yes. Does that imply that such a string is indeed the result of an intentional Base64 encoding? No. Also note that consisting only of tokens from that alphabet does not imply being a legitimate Base64 encoding of some other string. Because of the constraints on string length and padding, a string such as "a" is not a valid Base64 encoding of anything, even though its alphabet might suggest otherwise.
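The gap between "uses the right alphabet" and "is decodable" can be seen with a small sketch. It assumes Java 8 or later for java.util.Base64, and the class and method names are invented for the example.

```java
import java.util.Base64;
import java.util.regex.Pattern;

public class AlphabetVsDecode {
    // Alphabet-only check: standard Base64 characters plus up to two '=' at the end.
    private static final Pattern ALPHABET_ONLY =
            Pattern.compile("[A-Za-z0-9+/]*={0,2}");

    public static boolean usesBase64Alphabet(String s) {
        return ALPHABET_ONLY.matcher(s).matches();
    }

    // True when the JDK's basic decoder accepts the input.
    public static boolean decodes(String s) {
        try {
            Base64.getDecoder().decode(s);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "a" passes the alphabet check but cannot be decoded:
        // one character carries only 6 bits, less than a single byte.
        System.out.println(usesBase64Alphabet("a") + " " + decodes("a"));       // true false
        // "test" passes both checks, although it is just an English word.
        System.out.println(usesBase64Alphabet("test") + " " + decodes("test")); // true true
    }
}
```

Neither check can tell you whether the sender actually intended the string to be Base64.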

"Try to detect from actual content" is in general a very poor (because utterly error-prone) strategy. Avoid it whenever possible.

Erwin Smout