Reinventing 1973

A confession: I spent more evenings than I care to admit teaching a computer to turn a file into a single, preposterously large number.

That’s the whole trick, really. Take a small chunk of your file, line up every possible arrangement of its bytes in order — like names in a phone book — and ask which one is this chunk? The answer is an index, a number, and if you’ve been clever the number comes out smaller than the chunk it stands for. Store the index; throw the chunk away. Compression by counting. There’s something almost indecent about it — a line of text, a sliver of a song, reduced to a position in an imaginary list far too long to ever write down.

I built it because building it was the point. (I’ve written previously: producing nearly always trumps consuming.) I don’t suppose the world was crying out for another compressor — it has gzip and zstd and a dozen cleverer things — but the maths was lovely and I wanted to hold it in my hands. So I did. Alone and quietly. Sometimes on pre-iPad flights, often on sleepless nights, and, more recently, in a beachside shack — which is, near enough, what I called the venture: Shack C. (A small private pun, hidden where nobody will look for it. The sort of thing you do only for yourself. Much like obsessing over a compression scheme for over a decade.)

Then I read a paper by a man named Thomas Cover, who had set the essential idea down, before I was born, in 1973. Reinvented. It is a peculiar feeling, that — toiling up a mountain you are quietly proud of, only to find a tidy little cairn already on the summit, with a note. And to add insult to injury: when I finally raced my creation against the compressors people actually use, it lost. The same compression, an order of magnitude slower. Decades late and a few hundred megabytes per second short.

And yet. It meets the theoretical limit exactly, which is its own quiet kind of perfect. It does its finest work on small, orderly alphabets — the four letters of DNA fold down to a quarter of their size, no fuss. And because each chunk is only a number in a known range, you can encrypt it almost for nothing: add a secret number, let it wrap around, and you have compression and a lock in a single motion. That last trick — the lock — is the reason I bothered to commit the code and write the thing up at all: a method can be too slow to win the race and still be worth describing, if it can do the one thing the winners cannot. None of this will trouble the execs at FAANG. But it is mine, it is small and uncomplicated. And some days that is precisely and profoundly enough.

Leave a comment