How does a distributed hash table work?

First, what is a hash table? Usually, a hash table is an implementation of a map. That is, a hash table maps keys to values. For example, a hash table h might store the map { "jim": "male", "alice": "female", "bob": "male" }. Maps provide an API, importantly including update(k,v) and lookup(k).

However, unlike a normal hash table, a distributed hash table does not really represent an arbitrary map. Instead, it represents a content-addressable set, meaning it provides this different API: insert(v), and lookup(h(v)). Notice that the API assumes a specific hash function h, such as SHA-256.

Unlike keys in a map, items in a content-addressable set are immutable; the ordering of insert operations does not matter. This greatly simplifies a distributed implementation, where operations usually have a partial order. Unlike mutable maps, the results of lookup can be verified: h(lookup(h(k))) == h(k). This too greatly simplifies a distributed implementation, where the structure must be shared with untrusted parties.

The hash table implements these operations efficiently by grouping keys into buckets with an underlying hash function. The hash table might have 8 buckets in memory, numbered 0 through 7, and a hash function that says hash("jim") = 3, hash("alice") = 3, and hash("bob") = 7. In other words, the hash function, applied to a key, tells us the key’s approximate location in memory. If the hash function gives a good estimate of the memory location, and direct memory lookups are fast, then lookups and insertions will be fast.

Notice I said “approximate” location, and “location estimate”. Hash functions have collisions, e.g. "jim" and "alice" both hash to bucket 3. There are many ways to deal with collisions. For example, open addressing says, “if bucket 3 is already full, try bucket 4, then 5, and so on”. Then when the table becomes “full”, we do an expensive hash table resize, moving every key to a new, bigger table.

Suppose our hash table must be accessible by many machines. Our hash table is in the memory of one machine, which must serve all those requests. At some point, we may not be able to buy a machine big enough to serve all requests, or to store all that data. And if this machine dies, the entire hash table dies with it. We need more machines! How do we put our hash table on those multiple machines?

One way is to clone our hash table, so that every machine has a copy. When a node wishes to look up a value, it can look in its local copy. When a machine wishes to insert a value at a key, it broadcasts this to all other nodes. This scheme provides redundancy, and distributes lookups over all copies: with n nodes, each node processes 1/n lookups. But all nodes still store all data, and all nodes still process all insert operations.

Worse, this scheme leads to inconsistencies, because different nodes can process updates in different orders. There are many ways to fix this, such as leader election and consensus algorithms. But DHTs fix this.

To fix this, they can elect a “leader” machine. To do an insert, it’s sent to the leader, and the leader then broadcasts the update to all other nodes. This fixes inconsistency, but re-introduces centralization. The leader node must be powerful enough to process all updates, and if the leader node dies, who will continue the work? A new leader must be “elected”. There are many ways to do this election, but election is a topic for another day, and it doesn’t fix our two big problems: reducing storage and operations per node.

To reduce storage per node, and number of operations per node, each node must store a smaller subset of the data.

For example, each key can be stored on one node, instead of on all nodes. This reduces each node’s storage to 1/n, and its number of processed inserts to 1/n. How do we determine a key’s owner? We use a hash function again, but the hash function, instead of giving us the number of a bucket in memory, gives us machine addresses. For example, our hash could give us hash("jim") = "234.127.23.56", hash("alice") = "244.32.167.165", and hash("bob") = "23.45.231.12". To look up the value for "jim", we don’t do a memory lookup at bucket 3; we instead send the lookup request to the address 234.127.23.56, which then does the memory lookup and responds. Under this scheme, with n nodes, each node stores 1/n keys, processes 1/n lookups, and processes 1/n inserts. Our hash table scales to the moon! But this brings new problems.

Our hash can’t really give us random IP addresses. Most of those addresses won’t be machines serving the hash table. Instead, let’s give each node the list of all nodes, e.g. nodes = ["234.127.23.56", "244.32.167.165", "23.45.231.12"]. Then our hash can index into this list, modulo the number of nodes, e.g. if hash("jim") = 72452 then we go to node nodes[72452 % 3] = "23.45.231.12".

By reducing storage per node, we have removed redundancy. There is a fundamental tension between increased redundancy and reduced number of operations per node. We must find a compromise. Let’s say instead of a key having a single owner, each key is owned by two nodes. For example, we could store key k on node hash(k) % num_nodes and on node (hash(k)+1) % num_nodes. The work required by each node is now num_nodes/n, but this is only a constant factor increase.

When an insert happens at key k, this update must be sent to node hash(k) % num_nodes and to node (hash(k)+1) % num_nodes. But now our inconsistency problem is back! These owners of k can receive updates in different orders. To fix this, we can fall back to our leader/slave relationship. Each key must have a defined leader to whom we send updates, and two slaves which are issued updates from the leader.

But remember that a leader/slave relationship requires a re-election algorithm. It’s difficult to think about how re-election works per-key.

But this scheme does not help reduce memory consumption per machine.

Notice that if a node stores a key, that node must process all updates for that key, so each key demands both memory and CPU.

I just released Vidrio, a free app for macOS and Windows to make your screen-sharing awesomely holographic. Vidrio shows your webcam video on your screen, just like a mirror. Then you just share or record your screen with Zoom, QuickTime, or any other app. Vidrio makes your presentations effortlessly engaging, showing your gestures, gazes, and expressions. #1 on Product Hunt. Available for macOS and Windows.

With Vidrio

With generic competitor

More by Jim

Tagged . All content copyright James Fisher 2017. This post is not associated with my employer. Found an error? Edit this page.