Learn more about Russian war crimes in Ukraine.

Run-length encoding in C

Question 1.5 of Cracking the Coding Interview:

Implement a method to perform basic string compression using the counts of repeated characters. For example, the string aabcccccaaa would become a2b1c5a3. If the “compressed” string would not become smaller than the original string, your method should return the original string. You can assume the string has only upper and lower case letters (a-z).

The question doesn’t name it, but this is run-length encoding. (The exception for longer strings is not traditional run-length encoding, however. It’s also a rather ugly API.)

Here’s a solution in C. An interesting property of the C implementation is that it needs to calculate the length of the returned string before it actually generates that string, in order to malloc a string of the appropriate size. The algorithm consists of two passes which look very similar, but where the second pass actually adds characters to the string, the first pass just increments a string length counter.

A handy API for this is snprintf(NULL, 0, ...), which returns the length of the string that sprintf would have printed if given a real string to print to.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <assert.h>

int run_length(char* s) {
  char c = s[0];
  if (c == '\0') return 0;
  int length = 1;
  for(; s[length] == c; length++);
  return length;
}

int strlen_i(int i) {
  return snprintf(NULL, 0, "%i", i);
}

int rle_len(char* s) {
  int len = 0;
  int rl = run_length(s);
  while (rl > 0) {
    len += 1 + strlen_i(rl);
    s += rl;
    rl = run_length(s);
  }
  return len;
}

char* rle(char* s) {
  int old_len = strlen(s);
  int new_len = rle_len(s);

  if (old_len <= new_len) {
    return strdup(s); // provides free()able copy
  }

  char* out = malloc((new_len+1) * sizeof(char));

  int out_next = 0;
  int rl = run_length(s);
  while (rl > 0) {
    out_next += sprintf(&out[out_next], "%c%i", s[0], rl);
    s += rl;
    rl = run_length(s);
  }
  out[out_next] = '\0';

  return out;
}

void test_rle_len(char* input, int expected) {
  assert(rle_len(input) == expected);
}

void test_rle(char* input, char* expected) {
  char* output = rle(input);
  assert(strcmp(output, expected) == 0);
  free(output);
}

int main(int argc, char** argv) {
  test_rle_len("aabcccccaaa", 8);
  test_rle_len("aaaaaaaaaabbbbbbbbbbbbbbbbbbbb", 6);
  test_rle_len("", 0);
  test_rle("aabcccccaaa", "a2b1c5a3");
  test_rle("aaaaaaaaaabbbbbbbbbbbbbbbbbbbb", "a10b20");
  test_rle("", "");
  test_rle("abcd", "abcd");
  return 0;
}

This implementation is constant-time in the length of the input, which is optimal.

What can computers do? What are the limits of mathematics? And just how busy can a busy beaver be? This year, I’m writing Busy Beavers, a unique interactive book on computability theory. You and I will take a practical and modern approach to answering these questions — or at least learning why some questions are unanswerable!

It’s only $19, and you can get 50% off if you find the discount code ... Not quite. Hackers use the console!

After months of secret toil, I and Andrew Carr released Everyday Data Science, a unique interactive online course! You’ll make the perfect glass of lemonade using Thompson sampling. You’ll lose weight with differential equations. And you might just qualify for the Olympics with a bit of statistics!

It’s $29, but you can get 50% off if you find the discount code ... Not quite. Hackers use the console!

More by Jim

Tagged #ctci, #programming, #c. All content copyright James Fisher 2020. This post is not associated with my employer. Found an error? Edit this page.