Base64 Encoding

Base64 encoding appears here and there in web development. Perhaps its most familiar usage is in HTML image tags when we inline our image data (more on this later).

The standard Base64 alphabet contains the characters A-Z, a-z, 0-9, +, and /. This is the alphabet that default encoders, such as Python's b64encode function, use. Since the + and / characters are not URL and filename safe, RFC 3548 (since obsoleted by RFC 4648) defines another variant of Base64 whose output is URL and filename safe: it substitutes - for + and _ for /. Some text editors, such as Notepad++, also ship with a Base64 encoder and decoder. Base64 itself is designed to be a safe standard for transmitting binary data over channels that only reliably support text data.

In this article, I will share both a simple and a slightly more advanced understanding of Base64 encoding. These are the methods that I use to both encode and decode in my daily work. Base64 is a binary-to-text encoding scheme. It represents binary data in a printable ASCII string format by translating it into a radix-64 representation. Base64 encoding is commonly used when binary data must be transmitted over media that do not handle binary data correctly and are designed to deal only with textual data in the 7-bit US-ASCII character set.

As a programmer, it is easy to accept this random-looking ASCII string as the “Base64 encoded” abstraction and move on. To go from raw bytes to the Base64 encoding, however, is a straightforward process, and this post illustrates how we get there. We’ll also discuss some of the why behind Base64 encoding and a couple places you may see it.

A visualization

The gist of the encoding process is captured in the following interactive visualization. Type in some ASCII characters in the top input and hit the “Encode” button.

If you run a few strings through this visualization, you may notice that the encoding process is simply a pair of nested loops. The outer loop iterates over the data in 24-bit increments; the spec refers to these as “input groups.” The inner loop iterates over each input group 6 bits at a time. Each 6-bit value is interpreted as an unsigned integer that is used to index an alphabet of 64 characters. The indexed alphabet value is the output. With the help of ES6 generators, this encoding process can be implemented with just a handful of functions:
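The original listing is not reproduced here, so what follows is only a minimal sketch of the nested loops described above, not necessarily the exact code this post was written around. The names inputGroups, encodeGroup, encodeChars, and base64Encode are hypothetical, and the input is assumed to be plain ASCII.

```javascript
// Standard Base64 alphabet (RFC 4648, Table 1).
const ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Outer loop: yield the input three bytes (24 bits) at a time.
// The last group may hold only 1 or 2 real bytes.
function* inputGroups(bytes) {
  for (let i = 0; i < bytes.length; i += 3) {
    yield bytes.slice(i, i + 3);
  }
}

// Inner loop: walk one input group 6 bits at a time and yield one
// output character per 6-bit value.
function* encodeGroup(group) {
  const realBits = group.length * 8;
  // Pack the group into a 24-bit integer; missing bytes become zero padding.
  const packed =
    (group[0] << 16) | ((group[1] || 0) << 8) | (group[2] || 0);
  for (let offset = 0; offset < 24; offset += 6) {
    if (offset >= realBits) {
      yield "="; // all six bits are padding bits
    } else {
      const index = (packed >> (18 - offset)) & 0x3f;
      yield ALPHABET[index];
    }
  }
}

// Chain the two loops together into one stream of output characters.
function* encodeChars(bytes) {
  for (const group of inputGroups(bytes)) {
    yield* encodeGroup(group);
  }
}

// Assumes ASCII input, so each character maps to exactly one byte.
function base64Encode(text) {
  const bytes = Array.from(text, (ch) => ch.charCodeAt(0));
  return [...encodeChars(bytes)].join("");
}

console.log(base64Encode("foob")); // "Zm9vYg=="
```

Generators keep the two loops separate: one function only knows how to cut the input into groups, and the other only knows how to turn a single group into four characters.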

If there is an “interesting” part to the encoding process, it is the ending conditions where we must apply padding. Each input group is required to be 24 bits long (or equivalently three 8-bit bytes). (It seems likely the spec writers chose 24-bit input groups since 24 is the least common multiple of 6 and 8.) In the implementation given above, we pad the final group with bytes of zeroes when the final input group is only 1 or 2 bytes long. As we iterate over this final input group, if the 6-bit value consists entirely of padding bits, then = is the output character, the designated padding character. If, however, the 6-bit value straddles “real” bits and padding bits—as can be seen in the input “foob”—then the alphabet is still indexed and the padding bits are taken to be zeroes.
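To see the padding rule at the bit level, here is a small sketch that breaks a final input group into its four 6-bit values. The helper name describeFinalGroup and the shape of the rows it returns are made up for illustration.

```javascript
const B64_CHARS =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical helper: show how a final input group of 1 to 3 bytes
// is split into 6-bit values and which of them become "=".
function describeFinalGroup(bytes) {
  const realBits = bytes.length * 8;
  // Write the real bytes as a bit string, then right-pad with zeroes to 24 bits.
  const bits = bytes
    .map((b) => b.toString(2).padStart(8, "0"))
    .join("")
    .padEnd(24, "0");
  const rows = [];
  for (let offset = 0; offset < 24; offset += 6) {
    const sextet = bits.slice(offset, offset + 6);
    const allPadding = offset >= realBits;
    rows.push({
      sextet,
      // How many of these six bits came from the real input.
      realBitCount: Math.max(0, Math.min(6, realBits - offset)),
      char: allPadding ? "=" : B64_CHARS[parseInt(sextet, 2)],
    });
  }
  return rows;
}

// The final group of "foob" is the single byte for "b" (0x62).
console.log(describeFinalGroup([0x62]));
```

For the "b" in "foob", the second 6-bit value has only two real bits followed by four zero padding bits, yet it still indexes the alphabet. Only the last two values, which are padding through and through, turn into =.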

A couple usages

You will not find any mention of “HTML” in the Base64 spec. Instead, the authors simply mention that Base64 encoding is used in environments where, “perhaps for legacy reasons,” the “storage or transfer” of data is limited to ASCII characters. More or less, this idea sums up the browser and its heavy consumption of HTML, JSON, CSS, and JavaScript. Increasingly, this text is encoded using UTF-8, a superset of ASCII. In this text-heavy ecosystem, Base64 encoding finds various niche applications.

Data URLs

The first part of a URL is the scheme. It is the prefix string that goes before the first colon; for example, it is the https in https://example.com or the beginning ftp in ftp://ftp.funet.fi/pub/standards/RFC/rfc4648.txt. The scheme tells the client (a browser or a different network app) how to retrieve the resource and what protocol to follow. The scheme prefix also makes URLs extensible and suitable for future protocols. If a new protocol comes along, we can create a new URL scheme for it and still identify resources by URL.
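As a quick illustration, the standard URL class (available in browsers and in Node.js) will pull the scheme out for us. The helper name schemeOf is hypothetical; note that the protocol property includes the trailing colon, which this sketch strips.

```javascript
// Hypothetical helper: return the scheme of a URL, i.e. the part
// before the first colon.
function schemeOf(url) {
  // URL#protocol includes the trailing ":", so drop the last character.
  return new URL(url).protocol.slice(0, -1);
}

console.log(schemeOf("https://example.com")); // "https"
console.log(schemeOf("ftp://ftp.funet.fi/pub/standards/RFC/rfc4648.txt")); // "ftp"
```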

The data scheme is one such extension, and it is the scheme behind the inlined image data mentioned in the introduction. This scheme tells clients, “My resource’s data is located right here in the rest of this URL string.” URLs that use the data scheme follow the format data:[<mediatype>][;base64],<data>, as defined in RFC 2397.
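To make the data scheme concrete, here is a rough sketch that builds such a URL from raw bytes. The helper name toDataUrl is hypothetical, and the sketch assumes an environment with a global btoa function (browsers and recent versions of Node.js).

```javascript
// Hypothetical helper: build a Base64 data URL of the form
//   data:<mediatype>;base64,<data>
// from an array of byte values (0-255).
function toDataUrl(bytes, mediaType) {
  // btoa expects a "binary string" with one character per byte.
  const binary = Array.from(bytes, (b) => String.fromCharCode(b)).join("");
  return `data:${mediaType};base64,${btoa(binary)}`;
}

// The two bytes of the ASCII text "Hi".
console.log(toDataUrl([0x48, 0x69], "text/plain"));
// "data:text/plain;base64,SGk="
```

For an inlined image, the same function would be called with the image file's bytes and a media type such as image/png.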

You can find more particulars about this format in the spec, but we will focus on the image “data URLs” we mentioned at the outset. Here are a couple examples of the big three binary image formats used in a few different contexts:

Source maps

Another common but less visible usage of Base64 encoding is in source maps. Below is a source map generated by Google’s Closure compiler:

Here Base64 encoding is used for the mappings field. The comma and semicolon delimited snippets are the Base64 encoded binary data of integers encoded as variable-length quantities (VLQ).
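To give a feel for what those snippets contain, here is a small decoder sketch based on my understanding of the Base64 VLQ scheme that source maps use: each Base64 character contributes 6 bits, the high bit says whether more digits follow, the other 5 bits carry data (least significant group first), and the lowest bit of the assembled value is the sign. The function names and the sample mappings string are made up for illustration.

```javascript
const VLQ_ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Decode one segment (for example "AACA") into its signed integers.
function decodeVlqSegment(segment) {
  const values = [];
  let value = 0;
  let shift = 0;
  for (const ch of segment) {
    const digit = VLQ_ALPHABET.indexOf(ch);
    if (digit === -1) throw new Error(`Invalid Base64 character: ${ch}`);
    value += (digit & 0x1f) << shift; // low 5 bits carry data
    if (digit & 0x20) {
      shift += 5; // continuation bit set: more digits follow
    } else {
      const negative = value & 1; // lowest bit is the sign
      const magnitude = value >>> 1;
      values.push(negative ? -magnitude : magnitude);
      value = 0;
      shift = 0;
    }
  }
  return values;
}

// Semicolons separate lines of generated code; commas separate segments.
function decodeMappings(mappings) {
  return mappings
    .split(";")
    .map((line) => (line === "" ? [] : line.split(",").map(decodeVlqSegment)));
}

// A made-up mappings string, not taken from a real source map.
console.log(JSON.stringify(decodeMappings("AAAA;AACA,gB")));
// [[[0,0,0,0]],[[0,0,1,0],[16]]]
```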

Images and source maps are just a couple places Base64 encoding is used. If you know of others or any novel uses of Base64 encoding, please mention them in the comments below. It also might be worth “inspecting” page sources to find others. For example, in Chrome, if you go to chrome://dino you can find that the offline dinosaur game’s image assets (and it appears sound assets) are Base64 encoded. (Examining these assets—which are also embedded on YouTube’s homepage—is how I discovered the dinosaur can duck under the low-flying pterodactyls.)