Learning Lessons The Hard Way
Around this time we began a process of re-running old analyses that had failed, and were able to reproduce the issue. This was a critical learning, because it refuted the theory that the issue was ephemeral. With this information, we took a closer look at the objects in the analysis-level cache. We discovered that these marshaled Ruby objects did not in fact hold a reference to the contents of files as we originally believed. Problematically, the object held a reference to the Git service URL to use for remote procedure calls.
When a repository was migrated, this cache key was untouched. This outdated reference led to cat-file calls being issued to the old server instead of the new server. Ironically, we’ve now determined that since the cached Ruby objects did not include the Git blob data, it never generated a performance benefit of any kind. Significantly contributing to the difficulty of debugging this issue, the library we use to read Git repositories in our git service returns empty strings if the repository directory does not exist on disk, rather than an exception. Armed with a root cause -- caches containing invalid service URLs -- we discovered one other call site that could exhibit similar problematic behavior.
Via Soroush Khanlou:
The everpresent joke in computer science says there’s two hard problems: naming and cache invalidation. When I was starting out, I remember thinking “ha ha, there’s no way cache invalidation could be that hard”. Fast forward a few years, and while I’ve learned a few lessons of my own about bad caches, reading Code Climate’s incident report for a major issue made me understand on a far deeper level what it means for cache invalidation to be hard. Lesson learned, bon mot deployed.