What is open addressing?

Hash tables must deal with hash collisions. If “foo” and “bar” both hash to bucket 4, we must do something to store those distinct values in distinct locations. Hash tables mainly differ in how they deal with collisions. There are two main approaches: “closed addressing” and “open addressing”.

Closed addressing is the traditional approach, which solves collisions by allowing more than one element in each bucket. One way to do closed addressing is “chaining”, where each bucket stores a linked list of elements in that bucket.

The other approach, “open addressing”, only allows one element in each bucket, but allows an element to be stored in one of many possible buckets, so a collision can be resolved by trying another bucket.

Closed addressing requires pointer chasing to find elements, because the buckets are variably-sized. In contrast, open addressing can maintain one big contiguous hash table. This can improve cache performance and make the implementation simpler.

In this post, I implement a hash table using open addressing. The hash table can store a set of integers. Here’s the structure:

typedef int elem_t;

struct Set {
  elem_t * buckets;    // One big array
  size_t num_buckets;  // Size of the array
  size_t num_elems;    // How many buckets are filled?
};

Actually, this hash table can’t quite store int values. It needs a way to distinguish a full bucket from an empty bucket. For this, we sacrifice one integer to represent an empty bucket:

#define NO_ELEM INT_MIN

The hash table has a small API demonstrating its set-like behavior:

struct Set * Set_new(size_t num_buckets);
bool Set_contains(struct Set *, elem_t elem);
void Set_add(     struct Set *, elem_t elem);
void Set_del(     struct Set *, elem_t elem);

int main(void) {
  struct Set * s = Set_new(4);
  Set_add(s, 3);
  Set_add(s, 42);
  Set_del(s, 3);
  Set_add(s, 0);
  return 0;
}

After the additions and deletions in main, what are the values in the set s? 42 and 0. The value 3 was added then deleted again.

A hash table using closed addressing can store an unlimited number of elements, because each bucket can store an unlimited number of elements. (It’ll pay a performance price, though.) In contrast, how many elements can the above hash table store? Up to num_buckets. Beyond that, it will need to resize!

A hash table needs a hash function. This is not important here, but I’m told this hash function is reasonable:

size_t hash(elem_t x) {
    x = ((x >> 16) ^ x) * 0x45d9f3b;
    x = ((x >> 16) ^ x) * 0x45d9f3b;
    x = (x >> 16) ^ x;
    return x;
}

To create a new hash table, we allocate a new array and initialize each bucket to NO_ELEM:

struct Set * Set_new(size_t num_buckets) {
  struct Set * s = malloc(sizeof(struct Set));
  s->num_buckets = num_buckets;
  s->num_elems = 0;
  s->buckets = malloc(num_buckets * sizeof(s->buckets[0]));
  for (size_t i = 0; i < num_buckets; i++) s->buckets[i] = NO_ELEM;
  return s;
}

With open addressing, the hash function defines an element’s “preferred bucket”. If the preferred bucket is full, we must try other ones. There are many ways to define the next bucket to try, but a simple one is “linear probing”. Linear probing scans every bucket starting at the preferred bucket. To add a new element to the set, we follow the “probe sequence” until we find an empty bucket, or we find the element we’re searching for.

void Set_add(struct Set * s, elem_t new_elem) {
  if (new_elem == NO_ELEM) return;
  size_t start_ix = hash(new_elem) % s->num_buckets;
  for (size_t probe = 0; probe < s->num_buckets; probe++) {
    size_t ix = (start_ix+probe) % s->num_buckets;
    elem_t elem = s->buckets[ix];
    if (elem == NO_ELEM) {
      s->buckets[ix] = new_elem;
      s->num_elems++;
      if (0.7 < ((double)s->num_elems / s->num_buckets)) Set_expand(s);
      return;
    } else if (elem == new_elem) {
      return;
    }
  }
}

To expand the table, we make a new array double the size of the old one, then rehash every element. We can do this by creating a new set and adding all elements to it:

void Set_expand(struct Set * s) {
  struct Set * ns = Set_new(s->num_buckets * 2);
  for (size_t i = 0; i < s->num_buckets; i++) Set_add(ns, s->buckets[i]);
  free(s->buckets);
  s->num_buckets = ns->num_buckets;
  s->num_elems = ns->num_elems;
  s->buckets = ns->buckets;
  free(ns);
}

The probe sequence affects every function: contains, add and del. See how Set_contains follows the same probe sequence:

bool Set_contains(struct Set * s, elem_t needle) {
  if (needle == NO_ELEM) return false;
  size_t start_ix = hash(needle) % s->num_buckets;
  for (size_t probe = 0; probe < s->num_buckets; probe++)
    if (s->buckets[(start_ix+probe) % s->num_buckets] == needle)
      return true;
  return false;
}

Set_contains could probe every bucket before it gives up. To avoid this, we ensure that the table doesn’t get too full. A particularly important line in Set_add calls Set_expand if more than 70% of buckets are full.

There’s an optimization we could have made in Set_contains: if we find an empty bucket, stop and return false, because Set_add would have filled this bucket if the element existed. However, Set_del breaks this assumption.

void Set_del(struct Set * s, elem_t del_elem) {
  if (del_elem == NO_ELEM) return;
  size_t start_ix = hash(del_elem) % s->num_buckets;
  for (size_t probe = 0; probe < s->num_buckets; probe++) {
    size_t ix = (start_ix+probe) % s->num_buckets;
    elem_t elem = s->buckets[ix];
    if (elem == del_elem) {
      s->buckets[ix] = NO_ELEM;
      s->num_elems--;
      return;
    }
  }
}

This hash table implementation was somewhat simpler than traditional chaining, at around 80 LOC.

Tagged .

More by Jim

What does the dot do in JavaScript?

foo.bar, foo.bar(), or foo.bar = baz - what do they mean? A deep dive into prototypical inheritance and getters/setters. 2020-11-01

Smear phishing: a new Android vulnerability

Trick Android to display an SMS as coming from any contact. Convincing phishing vuln, but still unpatched. 2020-08-06

A probabilistic pub quiz for nerds

A “true or false” quiz where you respond with your confidence level, and the optimal strategy is to report your true belief. 2020-04-26

Time is running out to catch COVID-19

Simulation shows it’s rational to deliberately infect yourself with COVID-19 early on to get treatment, but after healthcare capacity is exceeded, it’s better to avoid infection. Includes interactive parameters and visualizations. 2020-03-14

The inception bar: a new phishing method

A new phishing technique that displays a fake URL bar in Chrome for mobile. A key innovation is the “scroll jail” that traps the user in a fake browser. 2019-04-27

The hacker hype cycle

I got started with simple web development, but because enamored with increasingly esoteric programming concepts, leading to a “trough of hipster technologies” before returning to more productive work. 2019-03-23

Project C-43: the lost origins of asymmetric crypto

Bob invents asymmetric cryptography by playing loud white noise to obscure Alice’s message, which he can cancel out but an eavesdropper cannot. This idea, published in 1944 by Walter Koenig Jr., is the forgotten origin of asymmetric crypto. 2019-02-16

How Hacker News stays interesting

Hacker News buried my post on conspiracy theories in my family due to overheated discussion, not censorship. Moderation keeps the site focused on interesting technical content. 2019-01-26

My parents are Flat-Earthers

For decades, my parents have been working up to Flat-Earther beliefs. From Egyptology to Jehovah’s Witnesses to theories that human built the Moon billions of years in the future. Surprisingly, it doesn’t affect their successful lives very much. For me, it’s a fun family pastime. 2019-01-20

The dots do matter: how to scam a Gmail user

Gmail’s “dots don’t matter” feature lets scammers create an account on, say, Netflix, with your email address but different dots. Results in convincing phishing emails. 2018-04-07

The sorry state of OpenSSL usability

OpenSSL’s inadequate documentation, confusing key formats, and deprecated interfaces make it difficult to use, despite its importance. 2017-12-02

I hate telephones

I hate telephones. Some rational reasons: lack of authentication, no spam filtering, forced synchronous communication. But also just a visceral fear. 2017-11-08

The Three Ts of Time, Thought and Typing: measuring cost on the web

Businesses often tout “free” services, but the real costs come in terms of time, thought, and typing required from users. Reducing these “Three Ts” is key to improving sign-up flows and increasing conversions. 2017-10-26

Granddad died today

Granddad died. The unspoken practice of death-by-dehydration in the NHS. The Liverpool Care Pathway. Assisted dying in the UK. The importance of planning in end-of-life care. 2017-05-19

How do I call a program in C, setting up standard pipes?

A C function to create a new process, set up its standard input/output/error pipes, and return a struct containing the process ID and pipe file descriptors. 2017-02-17

Your syntax highlighter is wrong

Syntax highlighters make value judgments about code. Most highlighters judge that comments are cruft, and try to hide them. Most diff viewers judge that code deletions are bad. 2014-05-11

Want to build a fantastic product using LLMs? I work at Granola where we're building the future IDE for knowledge work. Come and work with us! Read more or get in touch!

This page copyright James Fisher 2017. Content is not associated with my employer. Found an error? Edit this page.

What is open addressing?

Similar posts

More by Jim