Low latency, large working set, and GHC’s garbage collector: pick two of three

(Originally published on the Pusher engineering blog.)

Pusher makes it easy for developers to reliably deliver data at scale. A few features we are working on required a rewrite of our internal message bus, but we knew anything we build has to perform as reliably as the existing platform. Our chosen language for this job was Haskell, compiled with GHC. Haskell, unusually obsessed with correctness, was a good choice for our focus on reliability. GHC was a good choice as a compiler with a reputation for good performance. But our performance tests showed that our new implementation had performance problems. It suffered from high worst-case latency: the time between sending and receiving a message could be in the hundreds of milliseconds.

This blog post documents our investigation of these latency problems. Our findings have some implications for designing similar systems using GHC.

The message bus

Conceptually, our in-memory message store is a buffer. New messages get pushed on one end, and old messages get popped off at the other. Systems can ask for the message history without popping them off, so the process must also be able to traverse the buffer efficiently.

Performance testing the message bus

After implementing this system, we implemented some comprehensive performance tests. Two performance characteristics we care about are throughput and latency. Throughput means messages per second, and latency means the time between sending a message and receiving it. We pushed millions of realistic messages through the system, measuring the latency of each message using wall-clock time. Finally, we charted the distribution of these using the statistics package.

Pusher’s existing message bus has high throughput and low latency. Our replacement has to achieve the same low latency, but our tests showed that the 99th percentile latency was surprisingly high. Most messages traversed the new system within 5 milliseconds. But a significant few took several hundreds of milliseconds. This is too high for Pusher’s realtime platform, which currently achieves a 99th percentile of around 100 milliseconds.

What was causing this poor worst-case? To investigate, we needed more visibility of what our program was doing. GHC provides some good observability tools. One of these, ThreadScope, gives you a microsecond-level visual history of one run of your program. Here’s the chart of one run of our performance benchmark:

threadscope_1

The chart shows the history of one run of our benchmark program. Time runs along the x-axis. The pink lines mark when the system receives messages, and so the blocks of pink lines are periods of ‘productive’ work.

What are those large gaps between sections of productive work? They represent the cause of the worst-case latencies in our tests. During these periods, sent messages were waiting to be forwarded.

The chart marks garbage collection cycles in orange, each around 8 milliseconds long. These sections exactly correspond to the periods where our program did no work! All our pauses were due to garbage collection.

Isolating the garbage collection pauses

The question we needed to answer was: what determines the length of a garbage collection pause? We took two approaches to answering this: white box and black box. Black box testing meant fiddling with our benchmarks, watching the change in garbage collection pauses. White boxing meant understanding the garbage collector in the GHC runtime.

Our black-box tests simplified our program to the simplest case which exhibited these latencies. Our real-world program can hold many interrelated buffers, with messages of many sizes, which are expired by multiple policies. The simplified program just works with 1-kilobyte messages in a Data.Map.Strict, indexed by send time. It inserts new messages in time order, and pops old ones off, to limit the structure at 200,000 messages:

module Main (main) where

import qualified Control.Exception as Exception
import qualified Control.Monad as Monad
import qualified Data.ByteString as ByteString
import qualified Data.Map.Strict as Map

data Msg = Msg !Int !ByteString.ByteString

type Chan = Map.Map Int ByteString.ByteString

message :: Int -> Msg
message n = Msg n (ByteString.replicate 1024 (fromIntegral n))

pushMsg :: Chan -> Msg -> IO Chan
pushMsg chan (Msg msgId msgContent) =
  Exception.evaluate $
    let
      inserted = Map.insert msgId msgContent chan
    in
      if 200000 < Map.size inserted
      then Map.deleteMin inserted
      else inserted

main :: IO ()
main = Monad.foldM_ pushMsg Map.empty (map message [1..1000000])

We compiled and ran this program using:

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.10.3
$ ghc -O2 -optc-O3 Main.hs
$ ./Main +RTS -s
   3,116,460,096 bytes allocated in the heap
     385,101,600 bytes copied during GC
     235,234,800 bytes maximum residency (14 sample(s))
     124,137,808 bytes maximum slop
             600 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      6558 colls,     0 par    0.238s   0.280s     0.0000s    0.0012s
  Gen  1        14 colls,     0 par    0.179s   0.250s     0.0179s    0.0515s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.652s  (  0.745s elapsed)
  GC      time    0.417s  (  0.530s elapsed)
  EXIT    time    0.010s  (  0.052s elapsed)
  Total   time    1.079s  (  1.326s elapsed)

  %GC     time      38.6%  (40.0% elapsed)

  Alloc rate    4,780,213,353 bytes per MUT second

  Productivity  61.4% of total user, 49.9% of total elapsed

GHC’s runtime can output various statistics. Above, we print out garbage collection statistics with +RHS -s. The important metric in the result is the ‘max pause’ of 0.0515s, or 51 milliseconds. This program still exhibits the same garbage collection pauses when examined in ThreadScope:

threadscope_data_map

We tried tweaking many variables to test their effects on this pause time, most of which had very little effect. Compiler optimization flags (-O, -O2) made little difference. GC options (-G, -H, -A, -c) made little difference (see below). Using different data structures made little difference (we also tried a Data.Sequence and a priority queue). Varying the size of messages made little difference (for reasons we’ll explain). Artificially increasing the amount of short-lived garbage did not increase the pause time (also for reasons we’ll explain).

Only one variable had a definite effect: the number of messages kept in the history. The relationship was roughly linear (or perhaps slightly worse):

relationship

The message history in our program is our entire working set. The results above seemed to apply more generally to other programs: the size of the working set determined the length of the maximum pause. But why?

Understanding the GHC garbage collector

The first important thing to know is that GHC’s garbage collector is stop-the-world. A stop-the-world garbage collector flips between two phases: a normal process execution phase where garbage is generated, and a garbage collection phase where garbage is collected. These are the two phases seen in the ThreadScope view of our processes, and it is these periods where ‘the world is stopped’ which cause our latency problem.

The garbage collector is generational, which is an optimization to reduce the number of long collections on the entire heap. It organizes allocated objects into buckets by ‘age’, where each bucket is called a ‘generation’. This allows GHC to perform frequent, small collections on just the young objects. Looking back at ThreadScope, you can see lots of little orange GC pauses between the problematic longer ones. While this generational design tends to improve throughput, it does not help our max-pause metric! The occasional full-heap collection in a generational collector is the same as a normal collection in a non-generational collector. It is here that we hit our worst case.

(Actually, the generational design does not improve our throughput much. It relies on the assumption that “most objects die young,” which is true of many programs. But in our use-case, most objects have the same lifetime: the time they spend in the buffer.)

Finally, you need to know that the GHC garbage collector is copying. This means it keeps two areas of memory: an active area holding all objects, and an inactive one. At each GC cycle, it flips the active and inactive areas, copying all accessible objects from the active area to the inactive area. Roughly, it does this by walking the object graph and copying each object it finds to the active area. (It actually uses a clever evacuate/scavenge algorithm, but the details are unimportant here. To see the nuts and bolts, this set of slides is great.)

copying

How long does it take to perform this copying routine on the entire heap? The object graph traversal takes time linear in the the number of object pointers in the working set. The copying takes time linear in the number of bytes. (Additionally, because objects reference each other via pointers, moving an object requires updating all pointers to it.)

Therefore it seems that GC pauses are proportional to the number of pointers in the working set plus the size of the working set. Which of these costs dominates? It depends on whether your heap is memory-heavy or pointer-heavy. In our case with messages of 1 kilobyte, the pointers seemed to dominate (changing the message size made little difference to pause times). Roughly, we can say that for a typical heap, GC pauses are proportional to the working set size.

Three non-fixes

At the beginning, we thought that since garbage collectors ‘work on garbage,’ we should reduce the amount of garbage to reduce the time taken. But the amount of generated garbage is actually irrelevant. This is because the garbage is the memory which is not touched by the copying algorithm: it takes up no time at all! This is a design decision in GHC. Functional languages like Haskell generate many short-lived objects. When the number of dead objects dwarfs the number of retained objects, it makes sense to operate on the smaller of the two sets! Reducing the amount of garbage was a non-fix for us.

We discovered that GHC’s garbage collector is parallel. This means that its garbage collection phases may exploit multiple cores to finish faster. It does not mean that garbage collection is concurrent with program execution! It turns out parallel collection was mostly irrelevant to the results here. At best, the speed-up is a constant factor, and for our particular use-case this constant factor was close to 1. Sadly, parallelism was a non-fix.

Finally, we discovered that the garbage collector can use a compacting algorithm instead of the copying algorithm. Instead of copying the working set into a new memory area, the compacting algorithm manipulates a single memory area, which potentially involves less copying. However, our experimentation showed that this was slower than the copying algorithm. This too was a non-fix.

It is likely that the compacting algorithm is slower due to immaturity. The Hotspot JVM, for example, takes a similar approach in collecting its older generations. In Hotspot, different approaches are taken to different generations. On the assumption that most new objects will be dead, the young generations use a copying algorithm. On the assumption that old objects will be retained, the old generations use a mark-and-sweep algorithm. This latter case is more complex to implement efficiently: you must retain free-lists of reclaimed memory, deal with fragmentation by merging these chunks, move objects which are standing in the way of merging nearby free chunks, and so on. Hotspot achieves impressive results here, but the person-decades of engineering poured into its implementation should not be underestimated.

After research and experimentation, our working theory was that the worst-case latencies in our message delivery were linear in the number of messages in the history, because each GC cycle has to walk over all the memory retained during a collection and copy it, which are clearly linear operations. At this point, we reached out on Stack Overflow with our conundrum. Within a few hours we received a definitive answer from Simon Marlow, the main author of GHC’s runtime system, who confirmed that our hypothesis was correct and that currently there is no way to work around this in GHC.

Conclusion

This post only repeats some basic information about GHC’s garbage collector. However, the implications of it may not be immediately obvious. We discovered in particular that its pause times relate linearly to the size of the working set (measured as a combination of byte size and object pointers). You may hear that GHC’s garbage collector is well-designed, and it is. But be sure to understand that it is designed for throughput, not latency. Understand also that it is optimised for programs where “most objects die young,” and not all programs fit this heuristic.

GHC cannot achieve low latency with a large working set. We are in the process of reimplementing our message bus with this lesson in mind. We’re exploring a few avenues for avoiding the problem. We plan to talk about the new implementation in a future blog post. Stay tuned!

Tagged #haskell, #performance, #garbage-collection, #latency, #programming.

More by Jim

What does the dot do in JavaScript?

foo.bar, foo.bar(), or foo.bar = baz - what do they mean? A deep dive into prototypical inheritance and getters/setters. 2020-11-01

Smear phishing: a new Android vulnerability

Trick Android to display an SMS as coming from any contact. Convincing phishing vuln, but still unpatched. 2020-08-06

A probabilistic pub quiz for nerds

A “true or false” quiz where you respond with your confidence level, and the optimal strategy is to report your true belief. 2020-04-26

Time is running out to catch COVID-19

Simulation shows it’s rational to deliberately infect yourself with COVID-19 early on to get treatment, but after healthcare capacity is exceeded, it’s better to avoid infection. Includes interactive parameters and visualizations. 2020-03-14

The inception bar: a new phishing method

A new phishing technique that displays a fake URL bar in Chrome for mobile. A key innovation is the “scroll jail” that traps the user in a fake browser. 2019-04-27

The hacker hype cycle

I got started with simple web development, but because enamored with increasingly esoteric programming concepts, leading to a “trough of hipster technologies” before returning to more productive work. 2019-03-23

Project C-43: the lost origins of asymmetric crypto

Bob invents asymmetric cryptography by playing loud white noise to obscure Alice’s message, which he can cancel out but an eavesdropper cannot. This idea, published in 1944 by Walter Koenig Jr., is the forgotten origin of asymmetric crypto. 2019-02-16

How Hacker News stays interesting

Hacker News buried my post on conspiracy theories in my family due to overheated discussion, not censorship. Moderation keeps the site focused on interesting technical content. 2019-01-26

My parents are Flat-Earthers

For decades, my parents have been working up to Flat-Earther beliefs. From Egyptology to Jehovah’s Witnesses to theories that human built the Moon billions of years in the future. Surprisingly, it doesn’t affect their successful lives very much. For me, it’s a fun family pastime. 2019-01-20

The dots do matter: how to scam a Gmail user

Gmail’s “dots don’t matter” feature lets scammers create an account on, say, Netflix, with your email address but different dots. Results in convincing phishing emails. 2018-04-07

The sorry state of OpenSSL usability

OpenSSL’s inadequate documentation, confusing key formats, and deprecated interfaces make it difficult to use, despite its importance. 2017-12-02

I hate telephones

I hate telephones. Some rational reasons: lack of authentication, no spam filtering, forced synchronous communication. But also just a visceral fear. 2017-11-08

The Three Ts of Time, Thought and Typing: measuring cost on the web

Businesses often tout “free” services, but the real costs come in terms of time, thought, and typing required from users. Reducing these “Three Ts” is key to improving sign-up flows and increasing conversions. 2017-10-26

Granddad died today

Granddad died. The unspoken practice of death-by-dehydration in the NHS. The Liverpool Care Pathway. Assisted dying in the UK. The importance of planning in end-of-life care. 2017-05-19

How do I call a program in C, setting up standard pipes?

A C function to create a new process, set up its standard input/output/error pipes, and return a struct containing the process ID and pipe file descriptors. 2017-02-17

Your syntax highlighter is wrong

Syntax highlighters make value judgments about code. Most highlighters judge that comments are cruft, and try to hide them. Most diff viewers judge that code deletions are bad. 2014-05-11

Want to build a fantastic product using LLMs? I work at Granola where we're building the future IDE for knowledge work. Come and work with us! Read more or get in touch!

This page copyright James Fisher 2016. Content is not associated with my employer. Found an error? Edit this page.