How to hash multiple values

To hash a string like "hello", you choose a specific hash function like SHA-256, then pass the string to it, getting a hash like 2cf24db...38b9824. Like this (in Go):

func hash1(v []byte) []byte {
  h := sha256.Sum256(v)
  return h[:]
}

But if you have two strings like "Jim" and "Fisher", how do you hash these two values together? And if you have an array of strings, how do you produce a hash of the entire array? We want a “composite” hash function. There are multiple ways to make a composite hash function, but it’s not obvious which way you should choose, and it’s usually left up to you to implement it!

I recently stumbled over composite hash function implementation which I’ll call hash-then-XOR. It looked like this:

func hash2_hash_then_xor(v1, v2 []byte) []byte {
  return xor(hash1(v1), hash1(v2))
}

func xor(v1, v2 []byte) []byte {
  r := make([]byte, len(v1))
  for i := 0; i < len(v1); i++ {
    r[i] = v1[i] ^ v2[i]
  }
  return r
}

Hash-then-XOR first hashes each input value, then combines all the hashes with XOR. Hash-then-XOR seems plausible, but is it a good hash function? Think about it for a moment.

No, hash-then-XOR is not a good hash function! A good hash function makes it hard to find collisions, distinct inputs which produce the same hash. Let me tell you that hash2_hash_then_xor("Jim", "Fisher") produces the hash 7cf467d...a7417a0. Can you find another distinct input which produces the hash 7cf467d...a7417a0? You don’t need to brute-force it! You only need to swap the arguments: hash2_hash_then_xor("Fisher", "Jim").

Given an input to hash-then-XOR, it’s easy to find another input with the same hash. This is known as a “second-preimage attack”, and it’s bad news for a hash function! Hash-then-XOR is not collision-resistant because XOR is commutative: forall a b, xor(a,b) == xor(b,a). Hash-then-XOR may be ideal if you wish to hash a set of values in which the order does not matter, but this is not a typical requirement.

To hash an ordered pair of values, we want a non-commutative combining function. One non-commutative function is concatenation, which is used in perhaps the most common composite hash function, which I’ll call concat-then-hash:

func hash2_concat_then_hash(v1, v2 []byte) []byte {
  return hash1(append(v1, v2...))
}

In concat-then-hash, the hash of "Jim" and "Fisher" is the hash of "JimFisher". Reversing the inputs produces the hash of "FisherJim", a distinct string, so this is not vulnerable to an attack by reordering inputs. Is concat-then-hash then invulnerable to a second-preimage attack? Can you find an example of another input which collides with hash2_concat_then_hash("Jim", "Fisher")? Again, you don’t need to brute-force it! One example is hash2_concat_then_hash("Ji", "mFisher"), because "Ji" and "mFisher" also concatenate to "JimFisher".

Concat-then-hash is also vulnerable to second-preimage attacks! Intuitively, concatenation throws away the dividing line between the two values, so we can find collisions at all the other dividing lines. You might be thinking of preserving this dividing line with a separator character, like this:

func hash2_concat_sep_then_hash(v1, v2 []byte) []byte {
  return hash1(append(append(v1, ','), v2...))
}

Here, we instead hash the string "Jim,Fisher", which claims to preserve the dividing line with a comma. Can you find an example of another input which collides with hash2_concat_sep_then_hash("Jim", "Fisher")? Are we finally collision-resistant? No, we’re not collision-resistant! I can’t find a second preimage, because no other pairs of strings concatenate to "Jim,Fisher". But you can still find collisions, because the input strings can embed the separator. For example, ("a,b", "c") collides with ("a", "b,c").

The underlying problem is that concatenation is not injective, and this is not fixed by adding a separator. At Pusher, the company I work for, we were once hit by a security issue due to the non-injectivity of concatenation, even with a separator. You should take this seriously!

To guarantee injectivity, instead of concatenation, we can use a serialization function. An example is JSON serialization, so the hash of "Jim" and "Fisher" would be the hash of ["Jim","Fisher"]. We can implement serialize-then-hash in Go like this:

func hash2_serialize_then_hash(v1, v2 []byte) []byte {
  bs, _ := json.Marshal([]string{string(v1), string(v2)})
  return hash1(bs)
}

No other two strings under serialization produce ["Jim","Fisher"]. We know a serialization function is injective because it comes with a corresponding parsing function. To my knowledge, serialize-then-hash is a collision-resistant composite hash function, assuming that the underlying hash function (such as SHA-256) is collision-resistant, and your underlying serialization function (such as JSON) is in fact injective.

There are other valid composite hash functions, such as HMAC. But HMAC distinguishes between a “key” and “message”, and is not defined over variable-length arrays of strings. In my opinion, there should be a standard construction which hashes an array of values. To my knowledge, no such standard exists.

Discussion on Hacker News.

Tagged #programming, #crypto.

More by Jim

What does the dot do in JavaScript?

foo.bar, foo.bar(), or foo.bar = baz - what do they mean? A deep dive into prototypical inheritance and getters/setters. 2020-11-01

Smear phishing: a new Android vulnerability

Trick Android to display an SMS as coming from any contact. Convincing phishing vuln, but still unpatched. 2020-08-06

A probabilistic pub quiz for nerds

A “true or false” quiz where you respond with your confidence level, and the optimal strategy is to report your true belief. 2020-04-26

Time is running out to catch COVID-19

Simulation shows it’s rational to deliberately infect yourself with COVID-19 early on to get treatment, but after healthcare capacity is exceeded, it’s better to avoid infection. Includes interactive parameters and visualizations. 2020-03-14

The inception bar: a new phishing method

A new phishing technique that displays a fake URL bar in Chrome for mobile. A key innovation is the “scroll jail” that traps the user in a fake browser. 2019-04-27

The hacker hype cycle

I got started with simple web development, but because enamored with increasingly esoteric programming concepts, leading to a “trough of hipster technologies” before returning to more productive work. 2019-03-23

Project C-43: the lost origins of asymmetric crypto

Bob invents asymmetric cryptography by playing loud white noise to obscure Alice’s message, which he can cancel out but an eavesdropper cannot. This idea, published in 1944 by Walter Koenig Jr., is the forgotten origin of asymmetric crypto. 2019-02-16

How Hacker News stays interesting

Hacker News buried my post on conspiracy theories in my family due to overheated discussion, not censorship. Moderation keeps the site focused on interesting technical content. 2019-01-26

My parents are Flat-Earthers

For decades, my parents have been working up to Flat-Earther beliefs. From Egyptology to Jehovah’s Witnesses to theories that human built the Moon billions of years in the future. Surprisingly, it doesn’t affect their successful lives very much. For me, it’s a fun family pastime. 2019-01-20

The dots do matter: how to scam a Gmail user

Gmail’s “dots don’t matter” feature lets scammers create an account on, say, Netflix, with your email address but different dots. Results in convincing phishing emails. 2018-04-07

The sorry state of OpenSSL usability

OpenSSL’s inadequate documentation, confusing key formats, and deprecated interfaces make it difficult to use, despite its importance. 2017-12-02

I hate telephones

I hate telephones. Some rational reasons: lack of authentication, no spam filtering, forced synchronous communication. But also just a visceral fear. 2017-11-08

The Three Ts of Time, Thought and Typing: measuring cost on the web

Businesses often tout “free” services, but the real costs come in terms of time, thought, and typing required from users. Reducing these “Three Ts” is key to improving sign-up flows and increasing conversions. 2017-10-26

Granddad died today

Granddad died. The unspoken practice of death-by-dehydration in the NHS. The Liverpool Care Pathway. Assisted dying in the UK. The importance of planning in end-of-life care. 2017-05-19

How do I call a program in C, setting up standard pipes?

A C function to create a new process, set up its standard input/output/error pipes, and return a struct containing the process ID and pipe file descriptors. 2017-02-17

Your syntax highlighter is wrong

Syntax highlighters make value judgments about code. Most highlighters judge that comments are cruft, and try to hide them. Most diff viewers judge that code deletions are bad. 2014-05-11

Want to build a fantastic product using LLMs? I work at Granola where we're building the future IDE for knowledge work. Come and work with us! Read more or get in touch!

This page copyright James Fisher 2018. Content is not associated with my employer. Found an error? Edit this page.

How to hash multiple values

Similar posts

More by Jim