What is UTF-8?

Unicode, a character set, maps written characters to natural numbers, and UTF-8, a character encoding, maps strings of those numbers to strings of bytes.

An old character set, ASCII, has 128 characters, which it represents using byte values 0-127. For example, the character ‘a’ has the number 97, and this is simply encoded using a byte with value 97. Strings of characters are encoded by simple concatenation of those bytes. For example, “hello world” is encoded as the bytes [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]. “ASCII” today refers to both the ASCII character set (including the mapping of ‘a’ to the number 97) and the ASCII encoding of those characters (including the mapping of the number 97 to the byte 97).
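As a quick check of the byte values above, here is a minimal sketch, assuming Python 3. The variable names are just for illustration.

```python
# Encode a string with the ASCII encoding and list the resulting byte values.
text = "hello world"
ascii_bytes = list(text.encode("ascii"))
print(ascii_bytes)
```

Each element of the list is one byte, and each byte is simply the ASCII number of the corresponding character.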

The Unicode character set is a superset of ASCII: a character’s code in ASCII is the same as its code in Unicode. For example, the ASCII character ‘a’ still maps to 97, but the non-ASCII character ‘😊’ maps to 128522.
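One way to look up these numbers, assuming Python 3, is the built-in `ord`, which returns the Unicode code of a single character:

```python
# ord() returns the Unicode number (code point) of a one-character string.
code_a = ord("a")
code_smile = ord("😊")
print(code_a)                        # the same number as in ASCII
print(code_smile, hex(code_smile))   # a number well outside the ASCII range
```

Note that `ord` reports the full character code. APIs that expose UTF-16 code units, such as JavaScript's `charCodeAt`, instead report 55357 for ‘😊’, which is only the first half of a two-unit pair and not the character's Unicode number.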

UTF-8 has the important property that ASCII text (text using only the ASCII character set) has the same byte encoding in UTF-8 as it has with the ASCII encoding. For example, the string “hello world” is encoded to the same bytes as above. This means new programs can interact with old programs, as long as they only use the ASCII character set.
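This compatibility is easy to observe. A small sketch, assuming Python 3:

```python
# ASCII-only text produces identical bytes under both encodings.
text = "hello world"
ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
same = ascii_bytes == utf8_bytes
print(same)

# Every byte of ASCII-only UTF-8 text is below 128.
all_below_128 = all(b < 128 for b in utf8_bytes)
print(all_below_128)
```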

In the original design, each non-ASCII number (character code) is encoded to between 2 and 6 bytes. (The current standard, RFC 3629, permits at most 4 bytes, because Unicode codes stop at 1114111.) The first byte contains a prefix stating the number of bytes: one 1 bit per byte, followed by a 0. For example, a four-byte encoding has the five-bit prefix 11110. Every following byte has the prefix 10. For example, the four-byte pattern is 11110___ 10______ 10______ 10______. The remaining bits (here shown as underscores) contain the binary encoding of the character code.
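To make the four-byte pattern concrete, here is a rough sketch in Python 3 that fills the underscores with the bits of a character code. The function name `encode_four_bytes` is made up for this example, and it assumes the caller already knows that four bytes is the right length.

```python
def encode_four_bytes(code: int) -> bytes:
    """Fill the pattern 11110___ 10______ 10______ 10______ with a 21-bit code."""
    if not 0 <= code < 2**21:
        raise ValueError("code does not fit in 21 bits")
    return bytes([
        0b11110000 | (code >> 18),           # prefix 11110 + top 3 bits
        0b10000000 | ((code >> 12) & 0x3F),  # prefix 10 + next 6 bits
        0b10000000 | ((code >> 6) & 0x3F),   # prefix 10 + next 6 bits
        0b10000000 | (code & 0x3F),          # prefix 10 + lowest 6 bits
    ])

smile = encode_four_bytes(ord("😊"))
print([format(b, "08b") for b in smile])
```

Printing the bytes in binary shows the 11110 prefix on the first byte and the 10 prefix on each of the other three.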

Thus, by looking at the bit prefix of any byte, we can determine its type: a byte beginning with 0 is an ASCII character, a byte beginning with 10 is the continuation of a non-ASCII character, and other bytes (beginning with 11) are the beginning of a non-ASCII character. This means that if we join some UTF-8 somewhere in the middle of a stream of bytes, we can quickly find the boundaries. This property is referred to as “self-synchronizing”.
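The classification can be sketched in a few lines of Python 3. Here an ASCII byte has a top bit of 0, a continuation byte starts with 10, and anything else starts a multi-byte character. The names `byte_kind` and `next_boundary` are hypothetical helpers, not part of any library.

```python
def byte_kind(b: int) -> str:
    """Classify a single UTF-8 byte by its leading bits."""
    if (b & 0b10000000) == 0:
        return "ascii"          # 0_______
    if (b & 0b11000000) == 0b10000000:
        return "continuation"   # 10______
    return "lead"               # 11______

def next_boundary(data: bytes, pos: int) -> int:
    """Return the first index >= pos at which a character starts."""
    while pos < len(data) and byte_kind(data[pos]) == "continuation":
        pos += 1
    return pos

data = "a😊b".encode("utf-8")
kinds = [byte_kind(b) for b in data]
print(kinds)

# Index 2 lands in the middle of the emoji; skip forward to the next character.
start = next_boundary(data, 2)
print(start, data[start:].decode("utf-8"))
```

Resynchronizing never needs to skip more bytes than the longest encoding minus one, since a character has at most that many continuation bytes.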

How many bits are available in a UTF-8 pattern to encode a character? An n-byte character uses n+1 bits in the first byte to indicate the byte length, and 2 bits in each of the remaining n-1 bytes, making a total overhead of (n+1) + 2(n-1) = 3n-1 bits.

# bytes	overhead	remaining
2 bytes	5 bits	11 bits
3 bytes	8 bits	16 bits
4 bytes	11 bits	21 bits
5 bytes	14 bits	26 bits
6 bytes	17 bits	31 bits
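The table can be reproduced directly from the overhead formula. A small sketch in Python 3, with illustrative function names:

```python
def overhead_bits(n: int) -> int:
    """Prefix bits in an n-byte pattern: n+1 in the first byte, 2 in each other byte."""
    return (n + 1) + 2 * (n - 1)   # simplifies to 3n - 1

def remaining_bits(n: int) -> int:
    """Bits left over for the character code in an n-byte pattern."""
    return 8 * n - overhead_bits(n)

rows = [(n, overhead_bits(n), remaining_bits(n)) for n in range(2, 7)]
for n, overhead, remaining in rows:
    print(f"{n} bytes\t{overhead} bits\t{remaining} bits")
```

The remaining capacity simplifies to 5n+1 bits, so each extra byte adds 5 bits of room for the character code.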

To encode a character code, we first find how many bits the code requires, then choose the smallest encoding which will contain it. For example, 121579 in binary is 11101101011101011, which requires 17 bits, and so we choose the 4-byte encoding, which gives us 21 bits.
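Putting the pieces together, here is a minimal sketch of an encoder in Python 3 that follows this procedure. It is an illustration rather than a conforming implementation: the name `utf8_encode` is made up, it accepts the 5- and 6-byte patterns from the table, and it does not reject the codes that the real standard forbids (for example, anything above 1114111).

```python
def utf8_encode(code: int) -> bytes:
    """Encode one character code using the smallest pattern that fits."""
    if code < 0:
        raise ValueError("character codes are non-negative")
    if code < 0x80:
        return bytes([code])       # ASCII: a single byte, unchanged

    # An n-byte pattern has 5n + 1 bits of room; pick the smallest n that fits.
    bits = code.bit_length()
    for n in range(2, 7):
        if bits <= 5 * n + 1:
            break
    else:
        raise ValueError("code needs more than 31 bits")

    # Peel off 6 bits at a time for the continuation bytes, lowest bits first.
    tail = []
    for _ in range(n - 1):
        tail.append(0b10000000 | (code & 0x3F))
        code >>= 6

    prefix = (0xFF << (8 - n)) & 0xFF   # n one-bits followed by a zero
    return bytes([prefix | code] + tail[::-1])

encoded = utf8_encode(121579)
print([format(b, "08b") for b in encoded])
```

For valid Unicode codes the result should agree with Python's built-in encoder, so `utf8_encode(121579) == chr(121579).encode("utf-8")` is a handy sanity check.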

Tagged #utf-8, #unicode, #encoding, #character-set, #c, #programming.

This page copyright James Fisher 2017. Content is not associated with my employer.
