Jim Fisher’s blog

What is Monte Carlo integration?

2024-04-24T00:00:00.000Z

What’s the average distance that someone can jump? Let’s estimate it! I have a silly function that gives me the distance that someone can jump, based on their height h in centimeters:

function jumpDistance(h: number): number {
  return h < 0 || h > 360 ? 0 : 300 * Math.sin(h * (Math.PI / 400));
}

Next we’ll describe the heights of people in the population. To start, let’s say the population is just two people: Jane is 160 cm tall, and Peter is 180 cm tall. Then we can estimate the average jump distance:

const janeJump = jumpDistance(160);
const peterJump = jumpDistance(180);
const averageJump = (janeJump + peterJump) / 2;

With a bigger population, we can describe the population with an array of heights:

const populationHeights = [160, 180, 170, 190, 150];
let totalJump = 0;
for (const h of populationHeights) {
  totalJump += jumpDistance(h);
}
const averageJump = totalJump / populationHeights.length;

For a yet larger population, we can describe the population by keeping a tally for each height:

const populationHeights = new Map([
  [160, 23],
  [170, 45],
  [180, 32],
  [190, 12],
  [200, 8],
]);
let totalJump = 0;
let totalPeople = 0;
for (const [h, count] of populationHeights) {
  totalJump += count * jumpDistance(h);
  totalPeople += count;
}
const averageJump = totalJump / totalPeople;

For an even larger population, we can describe the population by a probability distribution, like $\text{Normal}(\mu=170, \sigma=20)$. Then the precise average jump distance is:

\[ \int_{h = -\infty}^{\infty} \text{jumpDistance}(h) \, \text{Normal}(h; \mu=170, \sigma=20) \, dh \]

Yuck! Suddenly we can’t solve the problem by just running the code, because there are infinitely many heights to consider. Our nice finite loop turned into an infinite integral. And it’s horrible to solve analytically, especially because jumpDistance is a piecewise function.

What can we do instead? If we have access to a function that gives us the population count at each height, we can do what we were doing before, but with some chosen heights:

function popCountForHeight(h: number) {
  return 1000 * Math.exp(-0.5 * ((h - 170) / 20) ** 2);
}

let totalJump = 0;
let totalPeople = 0;
for (let h = 0; h < 360; h += 1) {
  const count = popCountForHeight(h);
  totalJump += count * jumpDistance(h);
  totalPeople += count;
}
const averageJump = totalJump / totalPeople;

This approach could be called a Riemann sum. One problem with it is that we need to choose a range of heights that covers most of the population, but isn’t so wide that we waste time on heights that are very unlikely. Above, we chose to consider heights from 0 to 360 cm.

Another approach is to sample from the distribution:

const populationHeights = new NormalDistribution({ mean: 170, stdDev: 20 });

let totalJump = 0;
let totalPeople = 0;
for (let i = 0; i < 1000; i++) {
  const h = populationHeights.sample();
  totalJump += jumpDistance(h);
  totalPeople++;
}
const averageJump = totalJump / totalPeople;

Surprise, this is Monte Carlo integration!
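The NormalDistribution class used above isn’t defined in this post. Here’s a minimal sketch of what it might look like, assuming a Box-Muller transform for the sampling. The class shape and the estimateAverageJump helper are assumptions for illustration, not a library API:

```typescript
function jumpDistance(h: number): number {
  return h < 0 || h > 360 ? 0 : 300 * Math.sin(h * (Math.PI / 400));
}

// Hypothetical stand-in for the NormalDistribution class used above.
class NormalDistribution {
  private mean: number;
  private stdDev: number;

  constructor(params: { mean: number; stdDev: number }) {
    this.mean = params.mean;
    this.stdDev = params.stdDev;
  }

  // Box-Muller transform: turns two uniform samples into one normal sample.
  sample(): number {
    const u1 = 1 - Math.random(); // in (0, 1], so Math.log(u1) is finite
    const u2 = Math.random();
    const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
    return this.mean + this.stdDev * z;
  }
}

// Average jump distance over numSamples heights drawn from the distribution.
function estimateAverageJump(
  dist: NormalDistribution,
  numSamples: number,
): number {
  let totalJump = 0;
  for (let i = 0; i < numSamples; i++) {
    totalJump += jumpDistance(dist.sample());
  }
  return totalJump / numSamples;
}

const heights = new NormalDistribution({ mean: 170, stdDev: 20 });
console.log(estimateAverageJump(heights, 100_000));
```

The estimate is noisy: it changes from run to run, and the noise shrinks roughly with the square root of the number of samples.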

How to escape JavaScript for a script tag

2024-04-24T00:00:00.000Z

To add JavaScript to a web page, we use a <script> tag like this:

<script>console.log("Hello!");</script>

But what if we need to add arbitrary JavaScript to our web page? Say, a valid script like this:

if (x<!--y) { ... }

We can’t just write that in a <script> tag, because the browser will interpret the <!-- as the start of an HTML comment!

“But that’s fine,” you think. “We can just escape the string. This is how we serialize strings everywhere else in programming.”

You might reach for HTML entities, replacing < with &lt;. After all, isn’t the JavaScript just ordinary text content? No, it’s not! Once the browser sees a <script> tag, it goes into a special JavaScript parsing mode, where HTML entities are not interpreted!

In this JavaScript parsing mode, the browser is looking for one of two strings:

</script to end the script tag
<!-- to start an HTML comment

To “escape” arbitrary JavaScript, we need to avoid those two substrings.

If we find <!-- in our JavaScript, we can’t just replace it, because its meaning is context-dependent:

// This is a comment containing <!--
let foo = x <!--y; // That's valid JS operators
const s = "This is a string containing <!--";

To “escape” the above JavaScript, we’d have to write something like:

// This is a comment containing
let foo = x < !--y; // That's valid JS operators
const s = "This is a string containing <" + "!--";

This is not a simple string replacement. To do those replacements, we need to parse the JavaScript, and handle every possible context where <!-- might appear.

Here’s the HTML spec. It’s all rather horrifying.
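One narrower case is much easier. If you only need to embed data rather than arbitrary code, a common trick is to serialize it as JSON and then replace every < with the escape \u003c. In JSON output, < can only appear inside string literals, where \u003c means the same thing, so a blind string replacement is safe there. A minimal sketch (the function name jsonForScriptTag is made up for this example):

```typescript
// Serializes a value as JSON that is safe to place inside a <script> tag.
function jsonForScriptTag(value: unknown): string {
  const json: string | undefined = JSON.stringify(value);
  if (json === undefined) {
    throw new Error("Value is not JSON-serializable");
  }
  // With every < escaped, neither <!-- nor </script can appear in the output.
  return json.replace(/</g, "\\u003c");
}

const payload = { note: "x <!--y </script>" };
const page = `<script>const payload = ${jsonForScriptTag(payload)};</script>`;
console.log(page);
```

This does nothing for the general problem above: it only works because the content is known to be JSON.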

On that flickering blur in Chrome

2024-04-23T00:00:00.000Z

backdrop-filter: blur is a popular design for navbars. But it’s unusable in its current state, due to its flickery appearance in Chrome. Here’s an example:

[Live demo: a heading and a paragraph of placeholder text, scrolling behind a nav with a blurred background]

In Chrome, notice how even a 1px scroll can cause the color to completely change. Here’s a recording:

[Video: recording of the flicker in Chrome]

Here’s the general problem: a blur operation, at the edges of the image, needs to know what the content is outside the image. The correct approach would be to give the blur filter enough of the background content to work with. But in Chrome, the blur is only given the pixels immediately behind the element. So, as a hack, the blur must guess what the content is outside the nav. Wikipedia calls this “edge handling”.

Chrome seems to take the extend strategy: it guesses that the pixels at the edge of the background image extend infinitely in all directions. This is why the blur flickers so much: a tiny scroll causes the edge pixels to change, which causes the extended pixels to completely change.

One better edge-handling strategy is mirroring. Mirroring is better because, even though it’s inaccurate, it results in smoother transitions.
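To make the two strategies concrete, here’s a toy one-dimensional box blur with both kinds of edge handling. This is only an illustration: the function names are made up, and it’s not how Chrome implements backdrop-filter.

```typescript
type EdgeMode = "extend" | "mirror";

// Reads pixel i, inventing a value when i falls outside the array.
function sampleWithEdge(pixels: number[], i: number, mode: EdgeMode): number {
  const n = pixels.length;
  let j = i;
  if (mode === "mirror") {
    // Reflect about the edges: -1 -> 0, -2 -> 1, and n -> n - 1, n + 1 -> n - 2.
    if (j < 0) j = -j - 1;
    if (j >= n) j = 2 * n - j - 1;
  }
  // "extend" repeats the edge pixel; this clamp also guards very large radii.
  j = Math.min(n - 1, Math.max(0, j));
  return pixels[j]!;
}

// Averages each pixel with its neighbours up to `radius` away on each side.
function boxBlur(pixels: number[], radius: number, mode: EdgeMode): number[] {
  return pixels.map((_, i) => {
    let sum = 0;
    for (let d = -radius; d <= radius; d++) {
      sum += sampleWithEdge(pixels, i + d, mode);
    }
    return sum / (2 * radius + 1);
  });
}

// One bright pixel at the edge, the rest dark.
const row = [255, 0, 0, 0];
console.log(boxBlur(row, 2, "extend")[0]); // 153
console.log(boxBlur(row, 2, "mirror")[0]); // 102
```

With extend, the bright edge pixel counts three times out of five in the blurred edge value; with mirror, it counts twice. So a change to the edge pixel moves the result less under mirroring.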

Neither Firefox nor Safari has this problem, although I’m not sure which approach they take to edge handling.

What is the Metropolis algorithm?

2024-04-18T00:00:00.000Z

We could estimate Bob’s lunch tomorrow by counting the previous lunches you’ve seen him eating in the cafeteria:

Lunch	Count
Apple	2
Banana	3

Now we want to simulate Bob’s future lunches. Assume Bob randomly picks a lunch each morning, with the relative proportions you’ve observed. Then we can simulate Bob’s lunch with:

type Meal = "Apple" | "Banana";

function sample(): Meal {
  return Math.random() < 2 / 5 ? "Apple" : "Banana";
}

$\tfrac25$ of the samples will be Apple, and $\tfrac35$ will be Banana.

The Metropolis algorithm gives a different way to get samples with those correct proportions. It looks like this:

function nextMeal(currentMeal: Meal): Meal {
  if (currentMeal === "Apple") {
    return "Banana";
  } else {
    if (Math.random() < 2 / 3) {
      return "Apple";
    } else {
      return "Banana";
    }
  }
}

class Chain {
  currentMeal: Meal;

  constructor() {
    this.currentMeal = "Apple";
  }

  sample(): Meal {
    const current = this.currentMeal;
    this.currentMeal = nextMeal(current);
    return current;
  }
}

Yes: this algorithm is more complicated and performs worse! But the underlying technique can help us sample from more complex distributions. So let’s see what it’s doing here, and why it works at all.

Instead of each sample call being independent, the Metropolis algorithm initializes a Chain which maintains a “current meal”. Each call to sample sets the next meal.

Here’s one run of one chain:

const chain = new Chain();
for (let i = 0; i < 10; i++) {
  console.log(`Day ${i}: `, chain.sample());
}

Day 0:  Apple
Day 1:  Banana
Day 2:  Apple
Day 3:  Banana
Day 4:  Banana
Day 5:  Apple
Day 6:  Banana
Day 7:  Banana
Day 8:  Apple
Day 9:  Banana

Notice Bob’s first meal is always Apple. And if Bob’s current meal is Apple, the next meal is always Banana, so Bob’s second meal is always Banana. Here are two problems with the Metropolis algorithm:

The first few samples are not in the correct distribution.
Samples are dependent on the previous samples.

Let’s run $10{,}000$ chains, and then log the distribution of apples on each day:

const chains: Chain[] = [];
for (let i = 0; i < 10000; i++) chains.push(new Chain());

for (let i = 0; i < 10; i++) {
  const samples = chains.map((chain) => chain.sample());
  const numApples = samples.filter((sample) => sample === "Apple").length;
  console.log(`Num apples on day ${i}:`, numApples);
}

Num apples on day 0: 0
Num apples on day 1: 6673
Num apples on day 2: 2202
Num apples on day 3: 5186
Num apples on day 4: 3184
Num apples on day 5: 4563
Num apples on day 6: 3600
Num apples on day 7: 4255
Num apples on day 8: 3846
Num apples on day 9: 4041

By day $9$, approximately $4{,}000$ of the $10{,}000$ chains have apples. This is the correct proportion of $\tfrac25$. But the number bounces around before reaching this equilibrium.

This is called convergence. To analyze this more mathematically, we can instead simulate the distribution of chains for the current meal, and calculate the distribution for the next meal:

type MealDist = [number, number];
function nextMealDist(currentMealDist: MealDist): MealDist {
  const [numApples, numBananas] = currentMealDist;

  // Consider those eating apples today. What will they eat tomorrow?
  let a2a = numApples * 0; // None will eat apples.
  let a2b = numApples * 1; // All will eat bananas.

  // Then consider those eating bananas today. What will they eat tomorrow?
  let b2a = numBananas * (2 / 3); // Two-thirds will eat apples.
  let b2b = numBananas * (1 / 3); // The rest will eat bananas.

  const numApplesTomorrow = a2a + b2a;
  const numBananasTomorrow = a2b + b2b;

  return [numApplesTomorrow, numBananasTomorrow];
}

// To start, all 10,000 chains are eating apples.
let currentMealDist: MealDist = [10_000, 0];

for (let i = 0; i < 10; i++) {
  currentMealDist = nextMealDist(currentMealDist);
  console.log(`Num apples on day ${i}: ${currentMealDist[0].toFixed(3)}`);
}

Again, we see that by day $9$, around $\tfrac25$ of the chains are eating apples:

Num apples on day 0: 0.000
Num apples on day 1: 6666.667
Num apples on day 2: 2222.222
Num apples on day 3: 5185.185
Num apples on day 4: 3209.877
Num apples on day 5: 4526.749
Num apples on day 6: 3648.834
Num apples on day 7: 4234.111
Num apples on day 8: 3843.926
Num apples on day 9: 4104.049

Actually, we don’t need to start the distribution at [10_000, 0]. We can just start with [1, 0]. This is then a probability distribution, because the sum of the two numbers is $1$. Then the output is the probability distribution of each meal:

let currentMealDist: MealDist = [1, 0];

for (let i = 0; i < 10; i++) {
  currentMealDist = nextMealDist(currentMealDist);
  console.log(
    `Probability of apples on day ${i}: ${currentMealDist[0].toFixed(3)}`,
  );
}

Probability of apples on day 0: 0.000
Probability of apples on day 1: 0.667
Probability of apples on day 2: 0.222
Probability of apples on day 3: 0.519
Probability of apples on day 4: 0.321
Probability of apples on day 5: 0.453
Probability of apples on day 6: 0.365
Probability of apples on day 7: 0.423
Probability of apples on day 8: 0.384
Probability of apples on day 9: 0.410

After 93 days, the probability of apples reaches a stable state, at least in 64-bit floating-point. And that stable state is the correct distribution: $\tfrac25$ apple and $\tfrac35$ banana.

Because the first samples are not in the correct distribution, it’s common to discard them. This is called burn-in.

The precise claim of the Metropolis algorithm is: the correct distribution is a stable distribution. To prove this, evaluate nextMealDist([2/5, 3/5]), and you’ll see that it’s [2/5, 3/5]. How many will eat Apple for the next meal? None of the $\tfrac25$ currently eating apples will eat apples tomorrow. Of the $\tfrac35$ currently eating bananas, $\tfrac23$ will eat apples tomorrow, for a total of $\tfrac35 \times \tfrac23 = \tfrac25$. And so the probability of $\tfrac25$ is maintained.
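Here’s that check as a self-contained sketch. It repeats nextMealDist in condensed form, and compares with a small tolerance because 2/5 and 3/5 aren’t exactly representable in floating-point. The isStable helper is an addition for this example:

```typescript
type MealDist = [number, number];

// Condensed version of nextMealDist from above.
function nextMealDist([numApples, numBananas]: MealDist): MealDist {
  const numApplesTomorrow = numApples * 0 + numBananas * (2 / 3);
  const numBananasTomorrow = numApples * 1 + numBananas * (1 / 3);
  return [numApplesTomorrow, numBananasTomorrow];
}

// A distribution is stable if one more step leaves it unchanged.
function isStable(dist: MealDist, tolerance = 1e-12): boolean {
  const next = nextMealDist(dist);
  return (
    Math.abs(next[0] - dist[0]) < tolerance &&
    Math.abs(next[1] - dist[1]) < tolerance
  );
}

console.log(isStable([2 / 5, 3 / 5])); // true
console.log(isStable([1, 0])); // false
```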

The example algorithm above was hard-coded to generate the stable state $\tfrac25$ and $\tfrac35$. But time passes, after which we’ve counted $3$ apple meals, and $6$ banana meals. Let’s update the algorithm to work with any counts:

// Our observed counts.
// We want to generate more meals in this proportion.
const A = 3; // Count of apples
const B = 6; // Count of bananas

function nextMeal(currentMeal: Meal): Meal {
  if (currentMeal === "Apple") {
    return "Banana";
  } else {
    if (Math.random() < A / B) {
      return "Apple";
    } else {
      return "Banana";
    }
  }
}

Let’s show that, for any counts $A \leq B$, the correct probability distribution, $\tfrac{A}{A+B}$ and $\tfrac{B}{A+B}$, is a stable state of this chain.

\[ \begin{aligned} \texttt{numApples} &= \tfrac{A}{A+B} \\ \texttt{numBananas} &= \tfrac{B}{A+B} \\ \\ \texttt{a2a} &= 0 \\ \texttt{b2a} &= \texttt{numBananas} \times \tfrac{A}{B} \\ &= \tfrac{B}{A+B} \times \tfrac{A}{B} \\ &= \tfrac{BA}{(A+B)B} \\ &= \tfrac{A}{A+B} \\ \\ \texttt{numApplesTomorrow} &= \texttt{a2a} + \texttt{b2a} \\ &= 0 + \tfrac{A}{A+B} \\ &= \tfrac{A}{A+B} \\ \end{aligned} \]

So far, we’ve only observed Bob eating Apple or Banana. But then one day Bob’s in the cafeteria eating Chips! We need to handle more states. We can record our observed frequencies with a function f:

// Meal now includes Chips.
type Meal = "Apple" | "Banana" | "Chips";

function f(meal: Meal): number {
  return {
    Apple: 3,
    Banana: 6,
    Chips: 1,
  }[meal];
}

The true Metropolis sample algorithm actually starts by proposing a new meal. Then it decides whether to change to that meal, or eat the current meal again. Here’s a proposal function that picks from possible meals with uniform probability:

function proposeMeal(): Meal {
  const meals: Meal[] = ["Apple", "Banana", "Chips"];
  const i = Math.floor(Math.random() * 3);
  return meals[i]!;
}

Then the true nextMeal function looks like:

function nextMeal(currentMeal: Meal): Meal {
  const proposedMeal = proposeMeal();

  const proposedMealFreq = f(proposedMeal);
  const currentMealFreq = f(currentMeal);

  // The key line!
  const transitionProb = Math.min(1, proposedMealFreq / currentMealFreq);

  if (Math.random() < transitionProb) {
    return proposedMeal;
  } else {
    return currentMeal;
  }
}

This works, but why does it work? The key point in the proof is that, for any two states $A$ and $B$ in the steady state, the probability mass transferred from $A$ to $B$ is the same as the probability mass transferred from $B$ to $A$.

Let’s prove that. If we’re in steady state, every state $S$ has mass proportional to $f(S)$. For simplicity, just say the mass at $S$ is $f(S)$. Without loss of generality, let’s assume $f(A) \leq f(B)$.

How much mass is transferred from state $A$ to $B$? With our uniform proposal function over $N$ possible states, $\tfrac1N^{th}$ of the mass at $A$ is proposed to move to $B$. The probability of accepting this proposal is $\text{min}(1,\tfrac{f(B)}{f(A)})$. Since $f(A) \leq f(B)$, this is $1$, i.e. the proposal is always accepted. So the mass moving from $A$ to $B$ is $\tfrac{f(A)}{N}$.

How much mass is transferred from state $B$ to $A$? Again, $\tfrac1N^{th}$ of the mass at $B$ is proposed to move to $A$. The probability of accepting this proposal is $\text{min}(1,\tfrac{f(A)}{f(B)})$. Since $f(A) \leq f(B)$, this is $\tfrac{f(A)}{f(B)}$. So the mass moving from $B$ to $A$ is $\tfrac{f(B)}{N} \times \tfrac{f(A)}{f(B)} = \tfrac{f(A)}{N}$.

The same amount of mass, $\tfrac{f(A)}{N}$, is transferred from $A$ to $B$ as from $B$ to $A$. This condition is called detailed balance, and it implies that we are in a steady state.
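We can also check detailed balance numerically. This sketch builds the full transition matrix for the three-meal example, assuming the uniform proposal and the acceptance rule from nextMeal above. The helper names are made up for this example:

```typescript
// Observed frequencies for Apple, Banana, Chips, as in f above.
const mealFreqs = [3, 6, 1];

// matrix[i][j] is the probability of moving from state i to state j in one step.
function transitionMatrix(freqs: number[]): number[][] {
  const n = freqs.length;
  return freqs.map((fi, i) => {
    // Propose j with probability 1/n, then accept with probability min(1, fj/fi).
    const row = freqs.map((fj, j) =>
      i === j ? 0 : (1 / n) * Math.min(1, fj / fi),
    );
    // Whatever is left over is the probability of staying put.
    row[i] = 1 - row.reduce((a, b) => a + b, 0);
    return row;
  });
}

// Largest difference between the mass flowing i -> j and the mass flowing j -> i.
function maxImbalance(freqs: number[]): number {
  const matrix = transitionMatrix(freqs);
  let worst = 0;
  for (let i = 0; i < freqs.length; i++) {
    for (let j = 0; j < freqs.length; j++) {
      const forward = freqs[i]! * matrix[i]![j]!;
      const backward = freqs[j]! * matrix[j]![i]!;
      worst = Math.max(worst, Math.abs(forward - backward));
    }
  }
  return worst;
}

console.log(maxImbalance(mealFreqs)); // zero, up to floating-point rounding
```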

The proposeMeal function above just picks a meal uniformly at random. But other proposal functions work too. The only requirement is that the proposal function is symmetric: the probability of proposing $A$ from $B$ is the same as the probability of proposing $B$ from $A$. (Try to prove that this results in detailed balance.)

So far, we’ve been using discrete states. But the Metropolis algorithm is most useful for continuous distributions. Here’s a weird distribution over the real numbers:

function sinFreq(x: number): number {
  if (0 < x && x < Math.PI * 2) {
    return Math.abs(Math.sin(x));
  }
  return 0;
}

We can sample from this distribution using the Metropolis algorithm in its full generality:

class Chain<State> {
  constructor(
    // Initial state
    private state: State,

    // Function to calculate the frequency of any state
    private f: (state: State) => number,

    // Function to propose a new state - must be symmetric
    private propose: (state: State) => State,

    // Number of initial samples to discard
    burnIn = 1000,
  ) {
    for (let i = 0; i < burnIn; i++) {
      this.sample();
    }
  }

  sample(): State {
    const current = this.state;
    const proposed = this.propose(current);
    const prob = Math.min(1, this.f(proposed) / this.f(current));
    const next = Math.random() < prob ? proposed : current;
    this.state = next;
    return current;
  }
}

function propose(currentState: number): number {
  return currentState + (Math.random() - 0.5);
}

const chain = new Chain(0.5, sinFreq, propose);

const numSamples = 1000000;

const buckets: Record<number, number> = {};
for (let i = 0; i < numSamples; i++) {
  const sample = chain.sample();
  const bucket = Math.floor(sample * 10);
  buckets[bucket] = (buckets[bucket] ?? 0) + 1;
}

for (const bucket in buckets) {
  buckets[bucket] /= numSamples;
}

for (const bucket in buckets) {
  const len = Math.floor(buckets[bucket]! * 1000);
  console.log("#".repeat(len));
}

Sure enough, here’s that weird lumpy distribution:

#
###
######
########
##########
#############
##############
################
##################
####################
######################
#######################
########################
########################
#########################
#########################
#########################
#########################
########################
#######################
######################
#####################
###################
##################
################
##############
############
#########
#######
####
##

##
#####
#######
#########
###########
##############
###############
#################
###################
#####################
######################
#######################
########################
########################
#########################
########################
########################
########################
#######################
######################
#####################
###################
#################
################
##############
############
##########
#######
#####
###

Shape typing in Python

2024-04-12T00:00:00.000Z

While I was looking the other way, Python got advanced static types! Here’s matrix multiplication, describing the input shapes and its output shape:

def mat_mul[
    N, K, M
](
  m1: Mat[N, M],
  m2: Mat[M, K],
) -> Mat[N, K]:
    return m1 @ m2

There’s a lot going on here! In traditional Python, we’d write:

def mat_mul(m1, m2):
    return m1 @ m2

Then if we used the wrong shapes, we’d get a runtime error, like this:

>>> m1 @ m2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: matmul: Input operand 1 has a mismatch
  in its core dimension 0, with gufunc signature
  (n?,k),(k,m?)->(n?,m?) (size 2 is different from 3)

Our type-safe wrapper mat_mul uses a type Mat[N, M], which I defined as:

type Mat[N, M] = np.ndarray[
    tuple[N, M],
    np.dtype[np.float64],
]

If we try to multiply matrices of the wrong shape, Pyright gives a type error.

This uses NumPy’s np.ndarray type, which takes two arguments that describe the shape and dtype. For example, we can describe a 2x3 matrix of integers as:

mat2x3: np.ndarray[
    tuple[Literal[2], Literal[3]],
    np.dtype[np.int64],
] = np.array([[1,2,3],[4,5,6]])

At the moment, most of the NumPy API does not use these type parameters. For example, np.array(...) just gives you an np.ndarray[Any, Any]. So we have to make our own type-safe wrappers.

Automatic differentiation with dual numbers

2024-04-02T00:00:00.000Z

Differentiation is the heart of most machine learning, but how can we differentiate arbitrary functions? Perhaps the simplest accurate method is using dual numbers.

Here’s an example in JavaScript. Say we’re calculating the distance of a point (x, y) from the origin:

function distance(x, y) {
  return Math.sqrt(x * x + y * y);
}

> distance(3,4)
5

Now we want to ask: how does tweaking x = 3 change the output? In math-speak, what’s the derivative of the distance with respect to x?

The poor man’s way to answer this is numerical differentiation. We add a little bit to x, and see how much it changes the output:

const changeToInput = 0.00000001;
const changeToOutput = distance(3 + changeToInput, 4) - distance(3, 4);
const derivative = changeToOutput / changeToInput;

We get that derivative = 0.5999999608263806. That’s 0.6. Well ... almost. The numerical error is due to our changeToInput = 0.00000001 not being infinitesimally small.

Now let’s calculate the derivative without this numerical error!

We’ll start by saying that $\varepsilon$, or epsilon, is a special number that’s infinitesimally small. More precisely: $\varepsilon$ is not so small as to be zero, but $\varepsilon$ is so small that when you square it, you get zero.

Then we’ll calculate $\text{distance}(3+\varepsilon, 4)$ in JS, and see how many $\varepsilon$s are in the output. And that will be the true derivative!

Does $\varepsilon$ really exist? Not in our ordinary real numbers. We’ll just say it’s a different kind of number!

What is $42 + 7\varepsilon$? Well, because $\varepsilon$ is a different kind of number, we can’t simplify this expression, so we just leave it as $42 + 7\varepsilon$.

In general, we call these dual numbers. They’re of the form $a + b\varepsilon$, and we can represent them in TypeScript as:

type Dual = {
  // The ordinary real value part
  val: number;

  // How many tiny epsilons we have
  der: number;
};

You might vaguely remember rules from school, like the derivative of $x^n$ is $nx^{n-1}$, or something about limits. But dual numbers let us forget these rules and just use algebra! For example, let’s find the derivative of $x^2$ at $x = 5$. We’ll start by adding $\varepsilon$ to the input, to get $x = 5 + \varepsilon$. Then we simplify $x^2$ with ordinary algebra:

\[ \begin{aligned} x^2 &= x \times x \\ &= (5 + \varepsilon)(5 + \varepsilon) \\ &= (5 \times 5) + (\varepsilon \times 5) + (5\times\varepsilon) + \varepsilon^2 \\ &= 25 + (2 \times 5 \times \varepsilon) \\ &= 25 + 10\varepsilon \\ \end{aligned} \]

The value $10$ there is the derivative of $x^2$ at $x=5$! We just used ordinary arithmetic, plus the rule that $\varepsilon^2 = 0$.

Now we can write this in TypeScript:

function mul(x: Dual, y: Dual): Dual {
  return {
    val: x.val * y.val,
    der: x.der * y.val + x.val * y.der,
  };
}

If we do the same exercise for other primitive operations like add and sqrt, we end up with:

function add(a: Dual, b: Dual): Dual {
  return {
    val: a.val + b.val,
    der: a.der + b.der,
  };
}

function sqrt(a: Dual): Dual {
  return {
    val: Math.sqrt(a.val),
    der: a.der / (2 * Math.sqrt(a.val)),
  };
}
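For example, here’s one way to do that exercise for sqrt. Suppose $\sqrt{a + b\varepsilon} = c + d\varepsilon$ for some unknown $c$ and $d$, and assume $a > 0$. Squaring both sides and using $\varepsilon^2 = 0$:

\[ \begin{aligned} a + b\varepsilon &= (c + d\varepsilon)^2 \\ &= c^2 + 2cd\varepsilon + d^2\varepsilon^2 \\ &= c^2 + 2cd\varepsilon \\ \end{aligned} \]

Matching the ordinary parts gives $c = \sqrt{a}$ (taking the positive root), and matching the $\varepsilon$ parts gives $d = \tfrac{b}{2c} = \tfrac{b}{2\sqrt{a}}$. Those are the val and der in the sqrt function above.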

Now we can re-write our original distance function to work with dual numbers instead of ordinary numbers:

function distance(x: Dual, y: Dual): Dual {
  return sqrt(add(mul(x, x), mul(y, y)));
}

Now distance will give us the ordinary output, plus the derivative!

> distance(
  { val: 3, der: 1 },  // adding \varepsilon to the first argument
  { val: 4, der: 0 }
)

{ val: 5, der: 0.6 }

Do modern machine learning systems use this dual number trick? No, because efficiency. Above, you have to run the function once for every parameter you want to know about. For GPT-4, which is rumored to have around 1.76 trillion parameters, you’d have to run it that many times to tweak each parameter just once!
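To see that cost concretely, here’s a sketch of a gradient helper (a made-up name, not from a library) that finds every partial derivative by running the function once per parameter, each time putting the $\varepsilon$ on a different argument:

```typescript
type Dual = { val: number; der: number };

function mul(x: Dual, y: Dual): Dual {
  return { val: x.val * y.val, der: x.der * y.val + x.val * y.der };
}

function add(a: Dual, b: Dual): Dual {
  return { val: a.val + b.val, der: a.der + b.der };
}

function sqrt(a: Dual): Dual {
  return { val: Math.sqrt(a.val), der: a.der / (2 * Math.sqrt(a.val)) };
}

function distance(x: Dual, y: Dual): Dual {
  return sqrt(add(mul(x, x), mul(y, y)));
}

// Hypothetical helper: one run of fn per parameter.
function gradient(fn: (...args: Dual[]) => Dual, point: number[]): number[] {
  return point.map((_, i) => {
    // Put the epsilon on parameter i only.
    const args = point.map((val, j) => ({ val, der: i === j ? 1 : 0 }));
    return fn(...args).der;
  });
}

console.log(gradient(distance, [3, 4])); // [ 0.6, 0.8 ]
```

Two parameters means two runs of distance. A trillion parameters would mean a trillion runs.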

In the next post, we’ll see reverse-mode differentiation, which lets us find the derivative for each parameter, while running the function just once. If you can’t wait, take a look at Andrej Karpathy’s micrograd, a famous implementation of reverse-mode autodiff.

What is numerical differentiation?

2024-04-01T00:00:00.000Z

In school, we learned how to differentiate some functions. Maybe you remember that the derivative of x^2 is 2x. But could you differentiate an arbitrary JavaScript function? And what would that even mean?

Let’s start small:

function f(x) {
  return 7 * x;
}

How might you differentiate f? Let’s start with the stupidest thing that works:

> f(3)
21
> f(3.01)
21.07

What happened here? We increased the input x by 0.01, and as a result, the output increased by 0.07. The output increase was 7 times more than our change to the input. In math-speak, we say that the derivative of f with respect to x, at x = 3, is 7.

We’ve just discovered the simplest, stupidest form of differentiation:

function derivative(f) {
  return (x) => {
    const changeToInput = 0.00000001;
    const changeToOutput = f(x + changeToInput) - f(x);
    return changeToOutput / changeToInput;
  };
}

With our magic derivative function, we can differentiate x^2 to get a function equivalent to 2x:

> function square(x) { return x * x; }
> const derivative_of_square = derivative(square);
> derivative_of_square(-13)
-26

But JS functions can have multiple parameters. Here’s one that multiplies its arguments:

function mul(a, b) {
  return a * b;
}

What would it even mean to find the derivative of mul(2, 3)? Which argument are we tweaking, a or b? Let’s try it with both:

> mul(2, 3)
6
> mul(2.01, 3)
6.03
> mul(2, 3.01)
6.02

Above, we see that the derivative for a is 0.03 / 0.01 = 3, and the derivative for b is 0.02 / 0.01 = 2. We can package this up nicely as the array [3, 2]. In math-speak, the values in this array are called partial derivatives, and the entire array is called the Jacobian of the mul function.

We can modify our derivative function to find the partial derivative for each parameter, and return the array:

function derivative(f) {
  return (...args) => {
    const changeToInput = 0.00000001;

    const derivatives = [];

    for (let i = 0; i < args.length; i++) {
      const changedArgs = [...args];
      changedArgs[i] += changeToInput;
      const changeToOutput = f(...changedArgs) - f(...args);
      derivatives.push(changeToOutput / changeToInput);
    }

    return derivatives;
  };
}

Using this, we can find the derivative of mul at the arguments (2, 3):

> derivative(mul)(2, 3)
[ 3, 2 ]

This is the simplest numerical differentiation method. Its biggest problem is efficiency. If the function f has a million parameters, then evaluating derivative(f)(...) calls the function f two million times! In the next post, we’ll see automatic differentiation, a technique that only calls f once.
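The two million comes from re-evaluating f(...args) inside the loop. A small variation, sketched here under the assumption that f is a pure function, evaluates that baseline once, so a function with a million parameters is called a million and one times. That’s still far too many:

```typescript
function derivative(f: (...args: number[]) => number) {
  return (...args: number[]): number[] => {
    const changeToInput = 0.00000001;
    const baseline = f(...args); // evaluated once, outside the loop

    return args.map((arg, i) => {
      const changedArgs = [...args];
      changedArgs[i] = arg + changeToInput;
      return (f(...changedArgs) - baseline) / changeToInput;
    });
  };
}

// Count the calls, to check the claim above.
let numCalls = 0;
function mul(a: number, b: number): number {
  numCalls++;
  return a * b;
}

console.log(derivative(mul)(2, 3)); // approximately [ 3, 2 ]
console.log(numCalls); // 3
```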

Tell the LLM the business context

2024-03-30T00:00:00.000Z

Employees do better when they have more business context. The same is true of LLMs! To do its best work, the LLM needs to know why it’s being prompted, where its input came from, how its output will be used, and how its output will be judged. Many prompters try to tell the LLM how to achieve some task, but it’s often better to just give it the business context.

An example. Your cooking blog has a very plain homepage. Wouldn’t it be nicer, you think, if each link to a blog post included an intro paragraph to draw readers in? And isn’t this the kind of thing an LLM should be great at writing? So you write your first prompt:

Summarize the following article.

You run it on your blog posts, but the responses you get are inconsistent and mediocre. You pile more instructions into the prompt, until you end up with:

Summarize the following article.
No more than 2-3 sentences.
Make it engaging so the reader wants to know more.
Never refer to "the post"; instead summarize directly.
Never refer to "the author"; instead use "I".
Here are some examples: ...

But the LLM rarely remembers all the rules. Why is it so stupid, you wonder?

Instead, try telling the LLM the business context. Here’s an alternative prompt:

CalebCooks.com is Caleb Smith's blog about cooking.
Each post link on the homepage has the post title, plus a teaser paragraph.
You are given a post's title and content, and you write its teaser paragraph.
The goal is to convince readers to click.
Here's one example: ...

Imagine you’re describing this task to a contractor. You wouldn’t tell them how many sentences to use; you’d just tell them the business context. They’ll figure out what’s appropriate.

The problem with “Summarize the following article” is that the LLM starts out in a superposition of all the business contexts in which it might have been asked for a summary, such as:

“Bob is reading this article. He asked me for a “summary” to help understand the article. I should guess what Bob’s difficulties were in understanding, and re-explain those.”
“Jane is writing an essay that criticizes the arguments in this article. She asked me for a “summary” that she’ll scan for the argument’s weak points.”
“A search system wants to index this page. It asked me for a “summary” that it will scan for keywords to index.”
“Caleb is submitting this article to a journal. He asked me for a “summary” that he will use as the abstract.”

As the LLM produces output, it collapses into one of these contexts. So some “summaries” will look like critiques; others will look like keyword lists.

I have seen lots of LLM prompts that avoid providing the business context. One reason, I think, is that programmers treat the prompting as programming in English. Programmers are used to describing the how, but not the why. And programmers love modularity, where code is reusable and modules don’t know about each other. But the “business context” principle is anti-modular, because then the summarization module has knowledge about the whole app and business.

Auto-summarizing my blog posts

2024-03-26T00:00:00.000Z

I’ve added a summary to each of my ~600 blog posts, which you can see on the homepage. An LLM generates an initial summary, then I edit it until I’m happy. The prompt I ended up with was:

You are given an excerpt of a post from jameshfisher.com, Jim Fisher’s blog. You respond with a TL;DR of 1 or 2 sentences. The TL;DR will be added to the post front-matter. The TL;DR is shown beneath links to the post. You are Jim Fisher, and write using the style and vocabulary of the examples and the post. Paraphrase the content directly. Never mention ‘the post’. Be extremely concise, even using sentence fragments. Do not duplicate info from the title. Only include information from the post. Use Markdown for formatting. Excellent examples of TL;DRs from other posts:

TL;DR: A method for calculating a bounding circle around a head, using facial landmarks from BlazeFace. Plus a live demo that you can run on your own face.

TL;DR: const is a type qualifier in C that makes a variable unassignable, except during initialization.

...

Notice how much business context I gave to the LLM. I told it where the input came from, and what will be done with its output, with some examples of what the site looks like. Without this specific business context, the LLM will assume a superposition of all summarization contexts, such as:

Bob is reading this post. He asked me for a “summary” to help understand the post. I should guess what Bob’s difficulties were in understanding, and re-explain those.
Jane is writing an essay that criticizes the arguments in this post. She asked me for a “summary” that she’ll scan for the post’s main weak points.
A search system wants to index this page. It asked me for a “summary” that it will scan for keywords for indexing.

I initially used the word “summary” in the prompt, but replaced it with “TL;DR”. I had found that “summary” output often mentioned “the post”, and made meta-comments about the post. By contrast, the “TL;DR” summary was a direct paraphrasing of the post. “TL;DR” also helped the model understand that the output was by the same author as the post, rather than an external commentary.

Iterating was important. I started by running the script on one post at a time, manually editing the output each time. Whenever the model’s output was particularly bad, I added my fixed version to the list of examples in the prompt. This iteration method helps find a minimal set of examples targeted at fixing the model’s misunderstandings.
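A sketch of how such a script might assemble the prompt from a growing list of examples. The names are hypothetical, the prompt layout is an assumption, and the call to the model is left out:

```typescript
// Hypothetical helper: combines fixed instructions, example TL;DRs, and the post excerpt.
function buildPrompt(
  instructions: string,
  exampleTldrs: string[],
  postExcerpt: string,
): string {
  const examples = exampleTldrs.map((tldr) => `TL;DR: ${tldr}`).join("\n\n");
  return [instructions, examples, "Post excerpt:", postExcerpt].join("\n\n");
}

// After fixing a bad output by hand, add the fixed version as a new example.
function addFixedExample(examples: string[], fixedTldr: string): string[] {
  return [...examples, fixedTldr];
}

let exampleTldrs = [
  "const is a type qualifier in C that makes a variable unassignable, except during initialization.",
];
exampleTldrs = addFixedExample(exampleTldrs, "A hand-fixed TL;DR goes here.");
console.log(buildPrompt("You respond with a TL;DR of 1 or 2 sentences.", exampleTldrs, "..."));
```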

The LLM is Claude 3 Haiku. I spent only 40 cents in total! Despite its cheapness, Haiku was better than GPT-3.5. I was particularly impressed by Haiku’s lack of bullshitting. GPT-3.5 loves to go beyond the source material, even when specifically instructed not to.

It might not need a label

2024-03-20T00:00:00.000Z

One hallmark of “programmer UI” is using labels for everything. Here I show why this is often a design mistake, with some examples. I suggest how to identify over-labelling, and how to fix it. Finally I suggest why programmers are biased towards this design mistake.

Recently I built this “card” component showing the details of an event:

Date: 11th March

Time: 17:10

Venue: Down Lane Park

Address: London, N17 9AU

Description

Please arrive ten minutes before the game starts. Wear a dark top; bibs are provided.

Note the labels Date, Description, et cetera. These indicate my “programmer design”. Now here’s an alternative design, which just removes the labels:

11th March

17:10

Down Lane Park

London, N17 9AU

Please arrive ten minutes before the game starts. Wear a dark top; bibs are provided.

This re-design shows that the labels were unnecessary. We can see that the text “11th March” is a date. And the bottom text does not need to be labelled “Description”, because what else could it be?

Worse, the labels were actively distracting. One argument for labels is to make text scannable, so users can quickly find what they’re looking for in the UI. But users are not scanning for the text “Date”, they’re scanning for something that looks like a date. And to find the event details, you don’t hunt for text like “Description”; you look for something that looks like a paragraph.

Why did I add these labels? One reason is that I was translating JSON data like this:

{
  "date": "2024-03-11",
  "time": "17:10",
  "venue": "Down Lane Park",
  "address": "London, N17 9AU",
  "description": "Please arrive ..."
}

Our code uses labels everywhere. That JSON object could instead be represented as an array of five strings, but that would be a horrible data structure! It’s tempting to apply this software design principle to UI, but it often doesn’t work.

To be clear, you shouldn’t remove all your labels. Just identify which values are self-labelling. For example, the event date was self-labelling, because the user knows:

That this box represents an event
That events have dates
What dates look like

Unless you have all three, you should use a label.