# Rounding up to the next power of two in C

I was writing a memory allocator in C.
My approach was to pad each allocation to a power of two number of bytes
so that the allocations could be cleanly packed together.
For example, if the program does `malloc(49)`

,
requesting a 49-byte memory allocation,
we round this up to 64 bytes.
For this,
I wanted a function `next_pow2`

so that `next_pow2(49) = 64`

,
`next_pow2(64) = 64`

,
`next_pow2(65) = 128`

,
et cetera.

How do we implement `next_pow2`

?
A natural implementation is to check all the powers of two,
stopping at the first one which is big enough:

```
uint64_t next_pow2(uint64_t x) {
uint64_t p = 1;
while (p < x) p *= 2;
return p;
}
```

The above algorithm uses ordinary arithmetic.
We can do better if we work with the binary representation of numbers in C.
In binary representation, adding a `0`

doubles the number,
just as adding a `0`

in decimal multiplies by ten.
So `p * 2`

can be done as `p << 1`

,
using the bitwise left shift operator `<<`

to shifts the number up by one and add a rightmost `0`

:

```
uint64_t next_pow2(uint64_t x) {
uint64_t p = 1;
while (p < x) p <<= 1;
return p;
}
```

For a 64-bit number,
the `while`

loop can still take up to 64 iterations.
The number of operations is linear in the number of bits.
We can do better and make this a logarithmic number of operations.
One logarithmic-time algorithm is
to use binary search instead of linear search:

```
uint64_t pow2(uint8_t e) { return ((uint64_t)1) << e; }
uint64_t next_pow2(uint64_t x) {
uint8_t lo = 0, hi = 63;
while (lo < hi) {
uint8_t test = (lo+hi)/2;
if (x < pow2(test)) { hi = test; }
else if (pow2(test) < x) { lo = test+1; }
else { return pow2(test); }
}
return pow2(lo);
}
```

The above logarithmic-time approach has a lot of branching.
There is another logarithmic-time algorithm which uses no branching,
which I originally found here.
It relies on a function `next_pow2m1`

,
which finds the next number of the form `2^n - 1`

:

```
uint64_t next_pow2(uint64_t x) {
return next_pow2m1(x-1)+1;
}
```

To implement `next_pow2m1`

, let’s see an example:
`next_pow2m1(0b00010100) = 0b00011111`

.
Notice that in binary representation,
this is equivalent to “fill with `1`

s everything after the first `1`

”.
We can do that with iterated bitwise or:

```
uint64_t next_pow2m1(uint64_t x) {
for (int i = 0; i < 63; i++) x |= x>>1;
return x;
}
```

We can make this branchless by expanding the loop:

```
uint64_t next_pow2m1(uint64_t x) {
x |= x>>1; // Iteration 1
x |= x>>1; // Iteration 2
...
x |= x>>1; // Iteration 63
return x;
}
```

Here’s what happens in each iteration:

```
x = 0010000000000000
x = 0011000000000000
x = 0011100000000000
x = 0011110000000000
x = 0011111000000000
x = 0011111100000000
x = 0011111110000000
x = 0011111111000000
x = 0011111111100000
x = 0011111111110000
x = 0011111111111000
x = 0011111111111100
x = 0011111111111110
x = 0011111111111111
```

This is still linear-time.
But we can do this faster by reusing the `1`

bits from previous iterations!:

```
uint64_t next_pow2m1(uint64_t x) {
x |= x>>1;
x |= x>>2;
x |= x>>4;
x |= x>>8;
x |= x>>16;
x |= x>>32;
return x;
}
```

Now see how the rightmost bit doubles in size each time:

```
x = 0010000000000000
x = 0011000000000000
x = 0011110000000000
x = 0011111111000000
x = 0011111111111111
```

This is the most efficient “portable C” implementation of `next_pow2`

I’ve seen.
However, there’s a still more efficient constant-time implementation.
The above algorithm effectively finds the rightmost `1`

bit,
but there’s a constant-time function for that:
`__builtin_clzl`

, “count leading zeros”.
This is provided by GCC (not portable C).
It does have a gotcha:
`__builtin_clzl(0)`

is undefined,
so we need to special-case for it.
Here’s our final constant-time implementation of `next_pow2`

:

```
uint64_t next_pow2(uint64_t x) {
return x == 1 ? 1 : 1<<(64-__builtin_clzl(x-1));
}
```

But how is `__builtin_clzl`

implemented?
We can see by compiling to assembly:

```
$ cc -O0 -std=c99 -S next_pow2.c
```

We get these instructions:

```
movq $42, -8(%rbp)
bsrq -8(%rbp), %rax
```

The `bsrq`

instruction is “Bit Scan Reverse”.
As usual, there is literally no official documentation on this instruction.

I wrote this because I felt like it. This post is not associated with my employer. This site is hosted by Netlify (who are great, but I'm not associated with them either).