862

I've heard people talking about "base 64 encoding" here and there. What is it used for?

Slothworks
  • 1,003
  • 13
  • 17
MrDatabase
  • 39,447
  • 40
  • 105
  • 151
  • 4
    From the manual for [base64_encode()](http://php.net/manual/en/function.base64-encode.php): "This encoding is designed to make binary data survive transport through transport layers that are not 8-bit clean, such as mail bodies." – still_dreaming_1 Feb 28 '19 at 16:36

18 Answers

1038

When you have some binary data that you want to ship across a network, you generally don't do it by just streaming the bits and bytes over the wire in a raw format. Why? Because some media are made for streaming text. You never know: some protocols may interpret your binary data as control characters (like a modem), or your binary data could be corrupted because the underlying protocol might think that you've entered a special character combination (like how FTP translates line endings).

So to get around this, people encode the binary data into characters. Base64 is one of these types of encodings.

Why 64?
Because you can generally rely on the same 64 characters being present in many character sets, and you can be reasonably confident that your data's going to end up on the other side of the wire uncorrupted.

illusionist
  • 7,996
  • 1
  • 46
  • 63
Dave Markle
  • 88,065
  • 20
  • 140
  • 165
  • 112
    (In theory you could do base-80 encoding or something similar, but it would be significantly harder. Powers of two are natural bases for binary.) – Jon Skeet Oct 14 '08 at 15:08
  • 13
    @yokees: There is no guarantee, they're just characters that are *almost always* safe. This is why there are multiple forms of Base-64 (http://en.wikipedia.org/wiki/Base-64). –  Jan 11 '13 at 21:28
  • 2
    @Jon - which variant does the browser use when I put Base64 in an image tag? – employee-0 Sep 19 '13 at 12:22
  • That's the best discussion of the topic I see; there's no absolute answer, but that's where I'd go to ask it. –  Sep 19 '13 at 15:02
  • 9
    Does that mean that all network type data passing should use some kind of encoding? – Tanner Summers Aug 09 '16 at 03:27
  • 6
    But why is the base64 method used to encode string data, e.g. in the JavaScript atob function? Is there any point in the server encoding a JSON file to base64 format? Special characters could be a use case, but why not UTF-8 in that case; are they equivalent? Any further resource regarding that would be greatly appreciated, thank you. – partizanos Sep 30 '16 at 15:33
  • 1
    I assumed `base 64` refers to the number system rather than the count of characters in the set. No? – Tom Russell May 04 '17 at 04:11
  • 4
    @TomRussell: Base64 refers to the number of characters used for encoding. In theory, you could represent a single number this way using these 64 characters as a base-64 number instead of the 10 characters we normally use to represent base-10 number. – Dave Markle May 04 '17 at 12:14
  • @DaveMarkle Nice. Thanks for clearing that up for me. I would rephrase it as "using a *subset* of these 64 characters..." – Tom Russell May 04 '17 at 17:05
  • 4
    @TomRussell - I'm not sure where you are getting the "subset" idea from. The term "base-64" **does** refer to the number system. In ordinary decimal (base-10), we have 10 distinct symbols. In hexadecimal (base-16), we have 16 distinct symbols. In binary, we have 2 distinct symbols. Well, in base-64, we have 64 distinct symbols. So, it really is just the number system, in exactly the same way as those other number systems. Now, WHICH 64 symbols to use is a completely different matter, and in some cases, you need to use a different set of 64 symbols than in other cases. – John Y Aug 03 '17 at 20:57
  • @John Y: Yes, I'm kind of fuzzy on number systems, so my comment possibly just didn't make sense. I think I was picking a nit about a number nearly always being represented by something less than the full set of digits in the number system. E.g., 2334 consisting of the subset (2, 3, 4). LOL. – Tom Russell Aug 04 '17 at 09:20
  • 1
    Base-85 is used on some systems since it can encode 32-bit chunks directly as five characters. Not only is this more compact than base-64, but it keeps 32-bit chunks together. By contrast, base-64 requires that data which originates as 32-bit chunks be split into groups of four bytes. and then grabbed in groups of 3 bytes. – supercat Sep 13 '17 at 21:25
  • 5
    A list of at least some protocols which would fail would be nice to have if someone knows. – Tadej Jan 26 '18 at 12:41
  • Does it make sense to base-64 encode a plain ASCII 7-bit text file? – stephanmg Apr 18 '19 at 05:39
  • 1
    @stephanmg. It might in some circumstances, yes. ASCII contains a bunch of characters which some protocols might interpret as control codes (e.g., NUL, DEL, BEL, LF). Other encodings (such as quoted-printable) might be more efficient here, but Base64 is certainly an option. – TRiG Nov 28 '19 at 11:15
  • FYI this answer quoted in css-tricks article ["Probably don't base64 svg"](https://css-tricks.com/probably-dont-base64-svg/) – ashleedawg Oct 06 '20 at 12:04
225

It's basically a way of encoding arbitrary binary data in ASCII text. It takes 4 characters per 3 bytes of data, plus potentially a bit of padding at the end.

Essentially each 6 bits of the input is encoded in a 64-character alphabet. The "standard" alphabet uses A-Z, a-z, 0-9 and + and /, with = as a padding character. There are URL-safe variants.
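As a small illustration of the alphabet and padding described above, here is a sketch using Python's standard `base64` module. The sample bytes are chosen only because they happen to map to the two symbols that differ between the standard and URL-safe alphabets.

```python
import base64

# Three bytes whose 6-bit groups are 62, 62, 63, 63: the two non-alphanumeric symbols.
data = bytes([0xFB, 0xEF, 0xFF])

standard = base64.b64encode(data)          # uses "+" and "/"
url_safe = base64.urlsafe_b64encode(data)  # uses "-" and "_" instead

# Padding depends on how many bytes are left over after the last full group of 3.
one_byte = base64.b64encode(b"a")       # 1 leftover byte  -> two "=" signs
two_bytes = base64.b64encode(b"ab")     # 2 leftover bytes -> one "=" sign
three_bytes = base64.b64encode(b"abc")  # exact multiple of 3 -> no padding

print(standard, url_safe, one_byte, two_bytes, three_bytes)
```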

Wikipedia is a reasonably good source of more information.

Jon Skeet
  • 1,261,211
  • 792
  • 8,724
  • 8,929
  • In a language like PHP, where will binary data come from? We almost always work with string data, which is text. – Cholthi Paul Ttiopic Sep 05 '16 at 14:02
  • 5
    @CholthiPaulTtiopic: The results of encryption or compression, or sound/images/video. – Jon Skeet Sep 05 '16 at 14:07
  • what about storage, php doesn't seem to have binary data type – Cholthi Paul Ttiopic Sep 06 '16 at 03:26
  • 1
    @CholthiPaulTtiopic: I'm afraid I have no idea what you mean by "what about storage" but at this point I think we're somewhat off-topic. – Jon Skeet Sep 06 '16 at 05:39
  • Sure we are. I guess what I wanted was "string binary" which effectively was still binary data as far internal representation was concerned. thanks for your time. – Cholthi Paul Ttiopic Sep 06 '16 at 07:28
  • 3
    @CholthiPaulTtiopic: I'd strongly avoid thinking in terms of "string binary". Binary data should be treated as binary data, and *not* treated as text. I've seen literally hundreds - possibly thousands - of questions on SO which basically boil down to people not taking enough care over this distinction. – Jon Skeet Sep 06 '16 at 07:29
  • @CholthiPaulTtiopic you just work with it as a string and it is fine. I don't know if there are gotchas to watch out for as a result of working with it as a string, but I know PHP normally pulls in binary data as a string and passes it around as a string. For example you can read from a binary file and save the data to a different filepath, and it should just work as far as I know. – still_dreaming_1 Feb 28 '19 at 04:44
  • 2
    @still_dreaming_1 PHP calls them `binary strings`. (source)http://php.net/manual/en/function.pack.php – Cholthi Paul Ttiopic Feb 28 '19 at 06:07
  • @CholthiPaulTtiopic Ultimately they are just strings that happen to be known by the programmer to contain binary data. In the method signature information provided for that pack function you linked to, it says it returns a "string", not a "binary string". I believe pack() and unpack() are only needed if you need to analyze or modify the binary data. – still_dreaming_1 Feb 28 '19 at 15:29
137

Base-64 encoding is a way of taking binary data and turning it into text so that it's more easily transmitted in things like e-mail and HTML form data.

http://en.wikipedia.org/wiki/Base64

Brad Wilson
  • 61,606
  • 8
  • 70
  • 82
124

It's a textual encoding of binary data where the resultant text has nothing but letters, numbers and the symbols "+", "/" and "=". It's a convenient way to store/transmit binary data over media that is specifically used for textual data.

But why Base-64? The two alternatives for converting binary data into text that immediately spring to mind are:

  1. Decimal: store the decimal value of each byte as three digits: 045 112 101 037 etc. where each byte is represented by 3 bytes. The data bloats three-fold.
  2. Hexadecimal: store the bytes as hex pairs: AC 47 0D 1A etc. where each byte is represented by 2 bytes. The data bloats two-fold.

Base-64 maps 3 bytes (8 x 3 = 24 bits) to 4 characters that each carry 6 bits (6 x 4 = 24 bits). The result looks something like "TWFuIGlzIGRpc3Rpb...". Therefore the bloating is a mere 4/3 ≈ 1.33 times the original.
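The three expansion factors can be checked with a short Python sketch. The 768 sample bytes are arbitrary; the length is a multiple of 3 so that no padding affects the ratio.

```python
import base64

# Arbitrary sample data: 768 bytes covering every byte value.
data = bytes(range(256)) * 3

decimal_text = "".join(f"{b:03d}" for b in data)   # three digits per byte
hex_text = data.hex()                              # two hex digits per byte
b64_text = base64.b64encode(data).decode("ascii")  # four characters per three bytes

for name, text in [("decimal", decimal_text), ("hex", hex_text), ("base64", b64_text)]:
    print(f"{name}: {len(text)} characters, ratio {len(text) / len(data):.2f}")
```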

Ates Goral
  • 126,894
  • 24
  • 129
  • 188
  • 12
    Do I understand correctly, that 64 is the best choice as it is the highest power of two that is convertible to a printable ASCII character (there is 95 of them)? – voho Jan 18 '17 at 09:21
  • If in both cases they're 24 bits, then isn't the bloating 1:1? Or When you say 4 characters that span 6 bits, do you mean that there's actually 8 bits per char but the first two are padded 0s? – David Klempfner Feb 27 '19 at 05:43
  • 1
    @Backwards_Dave Each 6 bits are expressed in 8 bits. So the bloating is 8:6, or 4:3. – Ates Goral Feb 27 '19 at 19:24
  • @AtesGoral am I right in my assumption, that when you use Base256, you could map it 1:1 ? because 1 byte= 8bit = 256 possible characters? – ChillaBee Jan 08 '21 at 11:32
  • @user2774480 As a thought experiment, yes. But there's probably no practicality in using Base256. – Ates Goral Jan 13 '21 at 20:06
100

Years ago, when e-mail was introduced, it was entirely text based. As time passed, the need for attachments such as images and other media (audio, video, etc.) arose. These attachments are binary data, and when they are sent over the internet in raw form, the probability of the data getting corrupted is high. Base64 came along to tackle this problem.

One problem with binary data is that it can contain null bytes, which in languages like C and C++ mark the end of a character string. Sending raw binary data that contains NULL bytes can therefore stop a file from being read fully and lead to corrupt data.

For Example :

In C and C++, this "null" character shows the end of a string. So "HELLO" is stored like this:

H E L L O

72 69 76 76 79 00

The 00 says "stop here".

Now let’s dive into how BASE64 encoding works.

Point to be noted: the encoder works on groups of 3 bytes, so the length of the string should ideally be a multiple of 3. Otherwise padding is needed (see Example 2).

Example 1 :

String to be encoded : “ace”, Length=3

1) Convert each character to decimal.

a= 97, c= 99, e= 101

2) Change each decimal to 8-bit binary representation.

97= 01100001, 99= 01100011, 101= 01100101

Combined : 01100001 01100011 01100101

3) Separate into groups of 6 bits.

011000 010110 001101 100101

4) Calculate binary to decimal

011000= 24, 010110= 22, 001101= 13, 100101= 37

5) Convert the decimal values to Base64 characters using the Base64 chart.

24= Y, 22= W, 13= N, 37= l

“ace” => “YWNl”

Example 2 :

String to be encoded: “abcd”, Length=4, which is not a multiple of 3. The 32 input bits do not split evenly into 6-bit groups, so the last group is filled up with zero bits, and “=” padding characters are appended so that the output length becomes a multiple of 4.

Point to be noted: each “=” sign stands for two zero bits that were added, so “==” means that four zeroes (0000) were added.

So lets start the process :–

1) Convert each character to decimal.

a= 97, b= 98, c= 99, d= 100

2) Change each decimal to 8-bit binary representation.

97= 01100001, 98= 01100010, 99= 01100011, 100= 01100100

3) Separate in a group of 6-bit.

011000, 010110, 001001, 100011, 011001, 00

The last group is incomplete (it has only two bits), so we fill it up with four zero bits, “0000”, and mark this with two “=” signs.

011000, 010110, 001001, 100011, 011001, 000000 ==

Now every group has 6 bits. The two equals signs at the end show that 4 zeroes were added (this helps in decoding).

4) Calculate binary to decimal.

011000= 24, 010110= 22, 001001= 9, 100011= 35, 011001= 25, 000000=0 ==

5) Convert the decimal values to Base64 characters using the Base64 chart.

24= Y, 22= W, 9= J, 35= j, 25= Z, 0= A ==

“abcd” => “YWJjZA==”
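The two worked examples above can be sketched in Python as follows. The function name `encode_by_hand` is made up for this illustration, and the code follows the steps literally rather than efficiently; in practice you would call `base64.b64encode`.

```python
import base64

# The standard Base64 chart: index 0 is "A", index 63 is "/".
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode_by_hand(data: bytes) -> str:
    # Steps 1-2: write every byte as 8 bits and join them together.
    bits = "".join(f"{byte:08b}" for byte in data)
    # Fill the last group with zero bits so the length is a multiple of 6.
    bits += "0" * (-len(bits) % 6)
    # Steps 3-5: split into 6-bit groups, turn each into a number, look up its character.
    chars = [ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
    # Append "=" signs so the output length is a multiple of 4.
    return "".join(chars) + "=" * (-len(chars) % 4)

print(encode_by_hand(b"ace"))
print(encode_by_hand(b"abcd"))
```

Comparing the result with `base64.b64encode(...)` for a few inputs is a quick way to convince yourself that the hand-rolled steps match the library.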

Rajesh Prajapati
  • 1,141
  • 1
  • 6
  • 4
92

Aside from what's already been said, two very common uses that have not been listed are

Hashes:

Hashes are one-way functions that transform a block of bytes into another block of bytes of a fixed size, such as 128 bits (MD5) or 256 bits (SHA-256). Converting the resulting bytes into Base64 makes it much easier to display the hash, especially when you are comparing a checksum for integrity. Hashes are so often seen in Base64 that many people mistake Base64 itself for a hash.

Cryptography:

Since an encryption key does not have to be text but raw bytes it is sometimes necessary to store it in a file or database, which Base64 comes in handy for. Same with the resulting encrypted bytes.

Note that although Base64 is often used in cryptography, it is not a security mechanism. Anyone can convert the Base64 string back to its original bytes, so it should not be used as a means for protecting data, only as a format to display or store raw bytes more easily.
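A minimal Python sketch of that last point: decoding needs no key or secret, only the standard library. The value of `secret` is a made-up placeholder.

```python
import base64

secret = b"api-key-12345"  # hypothetical value, for illustration only

encoded = base64.b64encode(secret)   # ASCII-only text, convenient to store in a config file
decoded = base64.b64decode(encoded)  # anyone can do this; no key is involved

print(encoded)
print(decoded == secret)
```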

Certificates

x509 certificates in PEM format are base 64 encoded. http://how2ssl.com/articles/working_with_pem_files/

Despertar
  • 19,416
  • 7
  • 69
  • 74
  • 4
    It's actually easier, processingwise, to store bytes as bytes in a lot of cases. Even in a database, and *especially* in a file (if fixed-length records are used, or the bytes are the only content). Base64 is typically used when those bytes are intended to be *transmitted* somewhere, particularly over a channel that might lop off bits or interpret some of the bytes as control codes. – cHao Aug 25 '12 at 05:51
  • I've never seen a hash written as unsigned 8 bit integers, 0,1,255,36...and displaying it with UTF-8 or any other encoding wouldn't make sense, how else would you display it other than with base64? Encryption keys and encrypted data are often stored in configuration and XML files where you cannot store the raw bytes. I agree if you can store it as raw bytes then by all means, but base64 is for those situations when you cannot. There are many uses of base64 beyond transmitting. These are simply two common scenarios where you will see it. – Despertar Aug 25 '12 at 06:23
  • 1
    You'd display the hash as hex, not decimal. For hashes, that is in fact far more common than base64. – cHao Feb 23 '14 at 09:51
  • @cHao Yes, this is also common. Hex digits can represent any binary data, but base 64 has the advantage of taking up a lot less space since it uses more characters. – Despertar Feb 23 '14 at 19:43
  • You've got the size of SHA and MD5 reversed; SHA is usually (but not always) 256, and MD5 is 128. – The Daleks Jun 11 '20 at 19:43
31

In the early days of computers, when telephone line inter-system communication was not particularly reliable, a quick & dirty method of verifying data integrity was used: "bit parity". In this method, every byte transmitted would have 7-bits of data, and the 8th would be 1 or 0, to force the total number of 1 bits in the byte to be even.

Hence 0x01 would be transmitted as 0x81; 0x02 would be 0x82; 0x03 would remain 0x03, etc.
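The even-parity rule described above can be sketched in a few lines of Python. The function name `add_even_parity` is an assumption made for this example, not something from a real protocol library.

```python
def add_even_parity(seven_bits: int) -> int:
    """Return the 8-bit value whose top bit makes the number of 1 bits even."""
    if not 0 <= seven_bits <= 0x7F:
        raise ValueError("expected a 7-bit value")
    ones = bin(seven_bits).count("1")
    parity_bit = ones % 2            # 1 only if the count of 1 bits is odd
    return seven_bits | (parity_bit << 7)

for value in (0x01, 0x02, 0x03):
    print(f"0x{value:02X} -> 0x{add_even_parity(value):02X}")
```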

To further this system, when the ASCII character set was defined, only 00-7F were assigned characters. (Still today, the characters assigned to the range 80-FF are non-standard.)

Many routers of the day put the parity check and byte translation into hardware, forcing the computers attached to them to deal strictly with 7-bit data. This forced email attachments (and all other data, which is why the HTTP and SMTP protocols are text-based) to be converted into a text-only format.

Few of the routers survived into the 90s. I severely doubt any of them are in use today.

David Klempfner
  • 6,679
  • 15
  • 47
  • 102
James Curran
  • 95,648
  • 35
  • 171
  • 253
  • 2
    This is an excellent point of discussion and an interesting history lesson, thanks. – Dan Bechard Jun 05 '15 at 13:43
  • But I think the adoption of 7-bit ASCII was primarily driven by punched paper tape formats, and its origins lie in telegraphy rather than inter-computer communication, – Michael Kay Jul 01 '20 at 20:41
27

From http://en.wikipedia.org/wiki/Base64

The term Base64 refers to a specific MIME content transfer encoding. It is also used as a generic term for any similar encoding scheme that encodes binary data by treating it numerically and translating it into a base 64 representation. The particular choice of base is due to the history of character set encoding: one can choose a set of 64 characters that is both part of the subset common to most encodings, and also printable. This combination leaves the data unlikely to be modified in transit through systems, such as email, which were traditionally not 8-bit clean.

Base64 can be used in a variety of contexts:

  • Evolution and Thunderbird use Base64 to obfuscate e-mail passwords[1]
  • Base64 can be used to transmit and store text that might otherwise cause delimiter collision
  • Base64 is often used as a quick but insecure shortcut to obscure secrets without incurring the overhead of cryptographic key management

  • Spammers use Base64 to evade basic anti-spamming tools, which often do not decode Base64 and therefore cannot detect keywords in encoded messages.

  • Base64 is used to encode character strings in LDIF files
  • Base64 is sometimes used to embed binary data in an XML file, using a syntax similar to ...... e.g. Firefox's bookmarks.html.
  • Base64 is also used when communicating with government Fiscal Signature printing devices (usually, over serial or parallel ports) to minimize the delay when transferring receipt characters for signing.
  • Base64 is used to encode binary files such as images within scripts, to avoid depending on external files.
  • Can be used to embed raw image data into a CSS property such as background-image.
Amal Murali
  • 70,371
  • 17
  • 120
  • 139
warren
  • 28,486
  • 19
  • 80
  • 115
14

Some transportation protocols only allow alphanumerical characters to be transmitted. Just imagine a situation where control characters are used to trigger special actions and/or that only supports a limited bit width per character. Base64 transforms any input into an encoding that only uses alphanumeric characters, +, / and the = as a padding character.

Konrad Rudolph
  • 482,603
  • 120
  • 884
  • 1,141
10

The usage of Base64 I'm going to describe here is somewhat a hack. So if you don't like hacks, please do not go on.

I ran into trouble when I discovered that MySQL's utf8 does not support 4-byte Unicode characters, since it uses a 3-byte version of UTF-8. So what did I do to support full 4-byte Unicode over MySQL's utf8? I Base64-encode strings when storing them in the database and Base64-decode them when retrieving.

Since base64 encoding and decoding is very fast, the above worked perfectly.

You have the following points to take note of:

  • Base64 encoding uses 33% more storage

  • Strings stored in the database won't be human readable (You could sell that as a feature that database strings use a basic form of encryption).

You could use the above method for any storage engine that does not support unicode.
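A rough Python sketch of the round trip, leaving the database itself out: the variable `stored` stands for whatever would be written to the column, and the sample text is invented for the example.

```python
import base64

text = "I \U0001F499 Base64"  # U+1F499 takes 4 bytes in UTF-8

utf8_bytes = text.encode("utf-8")
stored = base64.b64encode(utf8_bytes).decode("ascii")  # plain ASCII, safe for a 3-byte utf8 column
restored = base64.b64decode(stored).decode("utf-8")    # what the application gets back

print(stored)
print(restored == text)
print(len(stored), "characters stored for", len(utf8_bytes), "bytes of UTF-8")
```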

Basil Musa
  • 6,637
  • 6
  • 52
  • 58
  • 7
    "You could sell that as a feature that database strings use a basic form of encryption" I like your style :D – Ercan Sep 17 '15 at 17:51
  • 10
    "You could sell that as a feature that database strings use a basic form of encryption" what a horrible thing to say :D – Alex Dec 28 '16 at 14:17
  • 2
    basic form of encryption against anyone who doesn't have the base64 decode algorithm rofl :D – Eladian Oct 09 '17 at 13:18
  • 2
    @Alex Not at all a "horrible thing to say". Second degree sensitive data is okay to be base64 encoded to make it unreadable by db administrators. It's not always necessary to have the highest level of encryption for every piece of data. For example, if you want to hide "comments" from a db administrator, then base64 is suitable for the job. Gratcias! – Basil Musa Jan 27 '18 at 14:49
  • 1
    It's worth mentioning that MySQL does now have support for all of Unicode, though for purposes of backwards compatibility, their `utf8` type is still three-bytes only; if you want the real thing, use `utf8mb4`. Nice hack, but no longer necessary. – TRiG Apr 03 '18 at 10:49
  • since MySQL 5.6.1 `SELECT FROM_BASE64('YmFzZTY0IGVuY29kZWQgc3RyaW5n');` create a temp table or a view and you can read those "encrypted" comments =P – alo Malbarez Oct 24 '18 at 20:15
  • 1
    I like this hack very much. In fact I use it myself. I'm so tired of poor database drivers that can't handle utf-8 properly. So I do this: insead of `select c from t` I do `select encode_as_base64(c) from t`, then decode it in the client. It's an ugly hack but works with even the worst odbc drivers. – Juraj Jun 05 '20 at 14:24
  • 1
    "You could sell that as a feature that database strings use a basic form of encryption" I like your style :-) – Alexander Jun 09 '20 at 04:56
10

“Base64 encoding schemes are commonly used when there is a need to encode binary data that needs be stored and transferred over media that are designed to deal with textual data. This is to ensure that the data remains intact without modification during transport”(Wiki, 2017)

An example could be the following: you have a web service that accepts only ASCII characters. You want to save the user's data and then transfer it to some other location (an API), but the recipient wants to receive the data untouched. Base64 is for that. The only downside is that Base64 encoding will require around 33% more space than regular strings.

Another example: uenc = url encoded = aHR0cDovL2xvYy5tYWdlbnRvLmNvbS9hc2ljcy1tZW4tcy1nZWwta2F5YW5vLXhpaS5odG1s = http://loc.magento.com/asics-men-s-gel-kayano-xii.html.

As you can see we can’t put char “/” in URL if we want to send last visited URL as parameter because we would break attribute/value rule for “MOD rewrite” – GET parameter.

A full example would be: “http://loc.querytip.com/checkout/cart/add/uenc/http://loc.magento.com/asics-men-s-gel-kayano-xii.html/product/93/
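The `uenc` value from the example can be decoded with Python's standard library, as in the sketch below. Note that the standard alphabet itself contains "/" and "+", so for arbitrary data in a URL the URL-safe variant is the more cautious choice; this particular string just happens to contain neither.

```python
import base64

uenc = "aHR0cDovL2xvYy5tYWdlbnRvLmNvbS9hc2ljcy1tZW4tcy1nZWwta2F5YW5vLXhpaS5odG1s"

url = base64.b64decode(uenc).decode("ascii")
print(url)

# Re-encoding with the URL-safe alphabet ("-" and "_" instead of "+" and "/").
safe_uenc = base64.urlsafe_b64encode(url.encode("ascii")).decode("ascii")
print("/" in safe_uenc)  # the slashes of the original URL no longer appear
```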

jmr333
  • 173
  • 1
  • 8
9

It's used for converting arbitrary binary data to ASCII text.

For example, e-mail attachments are sent this way.

Can Berk Güder
  • 99,195
  • 24
  • 125
  • 135
8

I use it in a practical sense when we transfer large binary objects (images) via web services. So when I am testing a C# web service using a python script, the binary object can be recreated with a little magic.

In Python:

import base64
# dataFromWS holds the Base64 text returned by the web service
imageAsBytes = base64.b64decode(dataFromWS)
Alfred
  • 19,306
  • 58
  • 155
  • 232
Andrew Cox
  • 9,902
  • 2
  • 31
  • 38
4

Mostly, I've seen it used to encode binary data in contexts that can only handle ASCII, or some other simple character set.

Eric Tuttleman
  • 1,270
  • 1
  • 8
  • 16
3

To expand a bit on what Brad is saying: many transport mechanisms for email and Usenet and other ways of moving data are not "8 bit clean", which means that characters outside the standard ascii character set might be mangled in transit - for instance, 0x0D might be seen as a carriage return, and turned into a carriage return and line feed. Base 64 maps all the binary characters into several standard ascii letters and numbers and punctuation so they won't be mangled this way.
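To see why the encoded form survives such a channel, here is a short Python check. It encodes every possible byte value and confirms that the output uses only letters, digits, "+", "/" and "=", with no carriage returns or line feeds for a transport to rewrite.

```python
import base64
import string

safe_characters = set(string.ascii_letters + string.digits + "+/=")

raw = bytes(range(256))  # every byte value, including 0x00, 0x0A and 0x0D
encoded = base64.b64encode(raw).decode("ascii")

print(set(encoded) <= safe_characters)            # only "safe" characters appear
print("\r" in encoded or "\n" in encoded)         # no line-ending bytes to mangle
print(base64.b64decode(encoded) == raw)           # the original bytes come back intact
```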

Paul Tomblin
  • 167,274
  • 56
  • 305
  • 392
2

Base64

Base64 is a generic term for a number of similar encoding schemes that encode binary data by treating it numerically and translating it into a base 64 representation. The Base64 term originates from a specific MIME content transfer encoding.

Base64 encoding schemes are commonly used when there is a need to encode binary data that needs be stored and transferred over media that are designed to deal with textual data. This is to ensure that the data remains intact without modification during transport. Base64 is used commonly in a number of applications including email via MIME, and storing complex data in XML.

mugil k
  • 21
  • 1
2

One hexadecimal digit is of one nibble (4 bits). Two nibbles make 8 bits which are also called 1 byte.

MD5 generates a 128-bit output which is represented using a sequence of 32 hexadecimal digits, which in turn are 32*4=128 bits. 128 bits make 16 bytes (since 1 byte is 8 bits).

Each Base64 character encodes 6 bits (except the last non-pad character which can encode 2, 4 or 6 bits; and final pad characters, if any). Therefore, per Base64 encoding, a 128-bit hash requires at least ⌈128/6⌉ = 22 characters, plus pad if any.
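The character counts can be verified with a short Python sketch. The input `b"hello"` is an arbitrary sample, and MD5 is used only because it produces a 128-bit digest.

```python
import base64
import hashlib

digest = hashlib.md5(b"hello").digest()  # 16 raw bytes = 128 bits

hex_form = digest.hex()
b64_form = base64.b64encode(digest).decode("ascii")

print(len(digest), "raw bytes")
print(len(hex_form), "hex digits")
print(len(b64_form), "Base64 characters with padding")
print(len(b64_form.rstrip("=")), "Base64 characters without padding")
```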

A 128-bit hash therefore takes 24 Base64 characters with padding (22 without), compared with 32 hexadecimal digits. Cutting the encoded text down to a shorter length (6, 8, or 10 characters) does not compress the hash; it throws bits away, and the original value can no longer be recovered from it.

So Base64 is a more compact textual representation than hex. It adds no security, and the encoded text is still larger than the 16 raw bytes.

Jainabhi
  • 47
  • 1
  • 9
0

Base64 can be used for many purposes.

The primary reason is to convert binary data to something passable.

I sometimes use it to pass JSON data around from one site to another, store information in cookies about a user.

Note: You "can" use it for encryption - I don't see why people say you can't, and that it's not encryption, although it would be easily breakable and is frowned upon. Encryption means nothing more than converting one string of data to another string of data that can be either later decrypted or not, and that's what base64 does.

Jody Fitzpatrick
  • 349
  • 1
  • 10
  • 7
    [The difference between encryption and encoding](http://stackoverflow.com/questions/4657416/difference-between-encoding-and-encryption). – Hawkeye Parker Nov 14 '14 at 10:57
  • 2
    You're interpreting the definition of "encryption" *far* too literally. The word has evolved into something a fair bit more specific than its origins. – Dan Bechard Jun 05 '15 at 13:45