ICFP: Ode on a Random Urn

I visited ICFP this month. Most of it went over my head, but a talk with the title “Ode on a Random Urn” caught my interest, and its content was the only technical insight I left with. It was a presentation of the paper of the same name.

The problem is how to represent a discrete probability distribution. An example of a discrete probability distribution is an urn/bag/bucket containing two red balls, four green balls, and three blue balls. One way to represent this is:

urn :: [Color]
urn = [Blue, Blue, Red, Green, Green, Green, Red, Green, Blue]

Given an urn, we want to sample from it, which means picking a ball at random. We also want to add and remove balls, e.g. “add a new green ball”, or “remove one blue ball”. Using the list representation above, we can sample the urn by indexing randomly into the list; we can add a ball to the urn by prepending it to the list; and we can remove a ball from the urn by searching the list until we find one of the specified color.
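Here is a rough sketch of those three operations. The function names are mine, not the paper's, and I assume the caller supplies a uniformly random index, so the sketch needs no random-number library:

```haskell
import Data.List (delete)

data Color = Red | Green | Blue deriving (Eq, Show)

type Urn = [Color]

-- Sample: the caller passes a uniformly random i with 0 <= i < length urn.
-- Indexing a list walks it, so this is O(n).
sampleAt :: Int -> Urn -> Color
sampleAt i urn = urn !! i

-- Add a ball by prepending it: O(1).
addBall :: Color -> Urn -> Urn
addBall c urn = c : urn

-- Remove the first ball of the given color, if any: O(n) search.
removeBall :: Color -> Urn -> Urn
removeBall = delete
```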

If there can be lots of balls of the same color, the above representation can be inefficient. Instead we can group the balls by color and just note how many we have of each:

urn :: [(Int, Color)]
urn = [(2, Red), (4, Green), (3, Blue)]

Both representations really start to suffer when we consider sampling without replacement. This is where we select a ball from the urn, but don’t put it back; the distribution will be missing the ball on the next sample, and eventually will become empty. These list representations suffer because of the quadratic time of repeated list traversals.
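To make the cost concrete, here is a sketch of drawing one ball without replacement from the grouped list. `drawAt` is a name I made up, and the random index is again assumed to come from the caller:

```haskell
data Color = Red | Green | Blue deriving (Eq, Show)

type Urn = [(Int, Color)]

-- Draw ball number i (0 <= i < total count) and return its color together
-- with the urn that is left. Returns Nothing if i is out of range.
drawAt :: Int -> Urn -> Maybe (Color, Urn)
drawAt i urn
  | i < 0 = Nothing
  | otherwise = go i urn
  where
    go _ [] = Nothing
    go k ((n, c) : rest)
      -- The ball is in this group: shrink it, or drop it if it is now empty.
      | k < n = Just (c, [(n - 1, c) | n > 1] ++ rest)
      -- Otherwise skip the group and keep walking.
      | otherwise = do
          (c', rest') <- go (k - n) rest
          Just (c', (n, c) : rest')
```

Each draw walks the list of groups, so emptying an urn with many distinct groups takes quadratic time overall.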

The urn representation is a binary tree where the leaves are those colored groups and each inner node keeps track of how many balls it has at its leaves. So one tree might have the leaves (2, Red), (4, Green), and (3, Blue). This tree would have two inner nodes. One might hold (2, Red) and (4, Green), and keep its size 6. The root would hold that node and (3, Blue), and keep the total size 9. The order of the nodes in the tree does not matter!
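A minimal sketch of such a tree, using my own names (`Tree`, `node`, `sampleAt`) rather than the paper's. Sampling takes a random index below the root's size (assumed to be chosen by the caller) and walks down, going left when the index falls within the left subtree's size:

```haskell
data Color = Red | Green | Blue deriving (Eq, Show)

data Tree
  = Leaf Int Color      -- a color group: how many balls, and which color
  | Node Int Tree Tree  -- cached total number of balls below this node
  deriving (Eq, Show)

size :: Tree -> Int
size (Leaf n _) = n
size (Node n _ _) = n

-- Smart constructor that fills in the cached size.
node :: Tree -> Tree -> Tree
node l r = Node (size l + size r) l r

-- Pick ball number i, assuming 0 <= i < size t. One step per tree level.
sampleAt :: Int -> Tree -> Color
sampleAt _ (Leaf _ c) = c
sampleAt i (Node _ l r)
  | i < size l = sampleAt i l
  | otherwise = sampleAt (i - size l) r

-- The tree described above: an inner node of size 6 under the root.
example :: Tree
example = node (node (Leaf 2 Red) (Leaf 4 Green)) (Leaf 3 Blue)
```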

The urn representation helps because the tree is balanced. Like most trees, it is mutated by insertions and deletions, and both of those operations maintain balance. How?

The node positions in an infinite tree can be enumerated like this:

    0
  1   2
 3 4 5 6
.........

We can maintain the invariant that a tree with n nodes has them at positions 0..n-1. Insertion of a new node places the node at position n. This tree is balanced: all layers are full except possibly the bottom one. It remains balanced because the positions fill up in layers. Deletion of any node works by replacing the deleted node with the node at position n-1. This works because the order of nodes does not matter.
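To find a position in this layer-by-layer numbering, we can walk up from the position to the root and reverse the steps. This is only my illustration of the numbering in the diagram (root at 0, children of p at 2p+1 and 2p+2); `Dir` and `pathTo` are made-up names:

```haskell
data Dir = L | R deriving (Eq, Show)

-- Path from the root to position p, assuming p >= 0.
-- Odd positions are left children, even positions (other than 0) are right
-- children, and the parent of p is (p - 1) `div` 2.
pathTo :: Int -> [Dir]
pathTo = reverse . go
  where
    go 0 = []
    go p = (if odd p then L else R) : go ((p - 1) `div` 2)
```

The path has one step per layer, so reaching position n to insert, or position n-1 to delete, takes logarithmic time.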

Actually, the urn algorithms don’t do this. In the urn, values are only stored at the leaves. Insertions happen by expanding a leaf into a branch with two new leaves. Instead of ensuring that nodes are inserted in the order above, we can ensure that they’re expanded in that order. Numbering the positions from 1 rather than 0: if there are 6 leaves in the tree, expand node 6 to make room. To delete a leaf from a tree with 7 leaves, contract node 6, copying one child into node 6, and the other over the deleted leaf.

Actually, the urn algorithms don’t do this either. They expand the nodes in the following order:

        1
    2       3
  4   6   5   7
 8 c a e 9 d b f
.................

This ordering is less visually obvious, but it still works because it enumerates the nodes in layers. The ordering of the nodes within each layer does not affect balance; there is only ever one non-full layer.

Why does the urn algorithm choose this strange enumeration of nodes? Take the number in binary, reverse it, ignore the last bit, then interpret 0 as left and 1 as right. This is the path to the node to expand! For example, 6 is 110 in binary; reversed, that is 011; ignoring the last bit leaves 01, which means left, then right. The path to the latest node is given by the number of leaves in the tree. By tracking the number of leaves in the tree, the algorithm knows the path to insert/delete at.
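Here is a sketch of how I understand insertion to work with this numbering. It is my own reconstruction, not code from the paper: the names `Urn`, `leafCount`, `pathTo` and `insert` are assumptions, and putting the new leaf on the right of the expanded node is an arbitrary choice, since order does not matter.

```haskell
data Dir = L | R deriving (Eq, Show)

data Tree a = Leaf Int a | Node Int (Tree a) (Tree a) deriving (Eq, Show)

-- The tree plus its number of leaves, which doubles as the next path.
data Urn a = Urn { leafCount :: Int, tree :: Tree a } deriving (Eq, Show)

size :: Tree a -> Int
size (Leaf n _) = n
size (Node n _ _) = n

singleton :: Int -> a -> Urn a
singleton w x = Urn 1 (Leaf w x)

-- Path to node k (k >= 1): read the bits of k from the least significant
-- end, stopping before the leading 1. E.g. pathTo 6 is [L, R].
pathTo :: Int -> [Dir]
pathTo k
  | k <= 1 = []
  | otherwise = (if odd k then R else L) : pathTo (k `div` 2)

-- Insert a group of w balls by expanding the node numbered leafCount.
insert :: Int -> a -> Urn a -> Urn a
insert w x (Urn n t) = Urn (n + 1) (go (pathTo n) t)
  where
    -- Follow the path, adding w to every cached size on the way down.
    go (L : ds) (Node s l r) = Node (s + w) (go ds l) r
    go (R : ds) (Node s l r) = Node (s + w) l (go ds r)
    -- At the end of the path: expand this subtree into a branch.
    go _ t' = Node (size t' + w) t' (Leaf w x)
```

Starting from `singleton` and inserting seven more groups expands nodes 1 to 7 in order, which leaves all eight leaves at depth 3.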

There’s one significant problem with the urn in my mind: there’s no efficient way to find a color group, given the color. The blues could be at any leaf location, because they’re arbitrarily ordered. So when inserting a new blue ball, the algorithm inserts a new group of (1, Blue) instead of finding the existing group and adding 1 to its size. This can lead to a much bigger tree than necessary.

This page copyright James Fisher 2017. Content is not associated with my employer.
