PyPy’s Hash Table Implementation
Maciej Fijalkowski (via David Smith):
One surprising part is that the new design, besides being more memory efficient, is ordered by design: it preserves the insertion order.
[…]
Here,
compact_array
stores all the items in order of insertion, whilesparse_array
is a 1/2 to 2/3 full array of integers. The integers themselves are of the smallest size necessary for indexing thecompact_array
. So ifcompact_array
has less than 256 items, thensparse_array
will be made of bytes; if less than 2^16, it’ll be two-byte integers; and so on.This design saves quite a bit of memory.
[…]
The obvious benefit of having more compact dictionaries is an increased cache friendliness. In modern CPUs cache misses are much more costly than doing additional simple work, like having an additional level of (in-cache) indirection.
[…]
To preserve order, when we delete an entry, we mark the entry as removed but don’t otherwise shuffle the remaining entries. If we repeat this operation often enough, there will be a lot of removed entries in the (originally compact) array. At this point, we need to do a “packing” operation, which moves all live entries to the start of the array (and then reindexes the sparse array, as the positions changed).
Previously: Accidentally Quadratic Rust Hash Tables, Exploring Swift Dictionary’s Implementation, Exposing NSDictionary.