Friday, August 26, 2016

The Elegance of Deflate

Richard Mitton (via Hacker News):

Deflate was invented by Phil Katz back in 1993, and forms the basis of the ZIP file format, the zlib library, gzip, PNGs, and probably a whole bunch of other stuff. At the time it was pretty cutting-edge.


So we have two compression algorithms. LZSS is reliant on finding previous data to match against, and Huffman coding is reliant on some letters being more common than others. Can we do better than picking one of those two? Can we weave them together?


Deflate is based around the idea of the unified alphabet. If an alphabet is just a set of choices, why not bring all the choices together under one umbrella? Deflate’s alphabet consists of 286 symbols. The first 256 are the ASCII codes for each letter, including all the ASCII control codes and other such. The remaining 30 symbols are used to represent lengths. That’s right, we’re storing the actual match length here.


There were a lot of encoders at the time using this general scheme (a few more values to indicate match length or distances). PKZIP won at the time because it was faster, and PK had the name recognition from his PKARC, which was a superfast implementation of SEA ARC (the dominant archiver on PCs at the time).


There’s another non-trivial thing that PKZIP had going for it - it put the directory at the end, which meant you could see the list of files in the archive without reading the entire archive! This sounds simple, but back then everyone adopted the tar-style “file header then data” appendable style, which meant that just listing the files inside a 300KB zip file (almost a whole floppy disk!) meant reading that entire floppy (30 seconds at least). PKZIP could do it in about 2 seconds.

Comments RSS · Twitter

Leave a Comment