What is UTF-8?

Unicode, a character set, maps human characters to natural numbers, and UTF-8, a character encoding maps strings of those numbers to strings of bytes.

An old character set, ASCII, has 128 characters, which it represents using byte values 0-127. For example, the character ‘a’ takes number 97, and this is simply encoded using a byte with value 97. Strings of characters are encoded by simple concatenation of those bytes. For example, “hello world” is encoded as the character codes [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]. “ASCII” today refers to both the ASCII character set (including the mapping of ‘a’ to 97) and the ASCII encoding of those characters (including the mapping of 97 to 97).

The Unicode character set is a superset of ASCII: a character’s code in ASCII is the same as its code in Unicode. For example, the ASCII character ‘a’ still maps to 97, but the non-ASCII character ‘😊’ maps to 55357.

UTF-8 has the important property that ASCII text (text using the ASCII character set) has the same byte encoding in UTF-8 as it with the ASCII encoding. For example, the string “hello world” is encoded to the same bytes as above. This means new programs can interact with old programs, as long as they only use the ASCII character set.

Each non-ASCII number (character code) is encoded to between 2 and 6 bytes. The first byte contains a prefix stating the number of bytes. For example, a four-byte encoding has the five-bit prefix 11110. Every following byte has the prefix 10. For example, one pattern is 11110___ 10______ 10______ 10______. The remaining bits (here shown as underscores) contain the binary encoding of the character code.

Thus, by looking at the bit prefix of any byte, we can determine its type: a byte beginning with 1 is an ASCII character, a byte beginning with 10 is the continuation of a non-ASCII character, and other bytes (beginning with 1..10) are the beginning of a non-ASCII character. This means that if we join some UTF-8 somewhere in the middle of a stream of bytes, we can quickly find the boundaries. This property is referred to as “self-synchronizing”.

How many bits are available in a UTF-8 pattern to encode a character? An n-byte character uses n+1 bits to indicate the byte length, and 2 bits in each remaining n-1 bytes, making a total overhead of 3n-1 bits.

# bytes overhead remaining
2 bytes 5 bits 11 bits
3 bytes 8 bits 16 bits
4 bytes 11 bits 21 bits
5 bytes 14 bits 26 bits
6 bytes 17 bits 31 bits

To encode a character code, we first find how many bits the code requires, then choose the smallest encoding which will contain it. For example, 121579 in binary is 11101101011101011, which requires 17 bits, and so we choose the 4-byte encoding, which gives us 21 bits.

I wrote this because I felt like it. This post is not associated with my employer.