Wednesday, December 21, 2016

PyPy’s Hash Table Implementation

One surprising part is that the new design, besides being more memory efficient, is ordered by design: it preserves the insertion order.
[…]
Here, compact_array stores all the items in order of insertion, while sparse_array is a 1/2 to 2/3 full array of integers. The integers themselves are of the smallest size necessary for indexing the compact_array. So if compact_array has less than 256 items, then sparse_array will be made of bytes; if less than 2^16, it’ll be two-byte integers; and so on.
This design saves quite a bit of memory.
[…]
The obvious benefit of having more compact dictionaries is an increased cache friendliness. In modern CPUs cache misses are much more costly than doing additional simple work, like having an additional level of (in-cache) indirection.
[…]
To preserve order, when we delete an entry, we mark the entry as removed but don’t otherwise shuffle the remaining entries. If we repeat this operation often enough, there will be a lot of removed entries in the (originally compact) array. At this point, we need to do a “packing” operation, which moves all live entries to the start of the array (and then reindexes the sparse array, as the positions changed).

Previously: Accidentally Quadratic Rust Hash Tables, Exploring Swift Dictionary’s Implementation, Exposing NSDictionary.

Algorithm Optimization Programming Python