Entropy of Big Distributed Systems
Scoop from within Twitter: small things are breaking, not enough engineers to fix them.
[…]
This is exactly what you’d expect when a large part of the workforce is laid off, another chunk quits, and those left are expected to ship new features as the #1 priority.
All large services and platforms are “built to be resilient”. But they are also extremely complicated, with countless internal interactions between microservices, configuration systems, load balancing and directing subsystems, networking fabrics, and more.
These systems are built to be reliable in the face of things like machine failures, or entire optional microservices going down. That’s not what will take Twitter down. Twitter will crash and burn when a complex interaction between systems goes wrong and causes a cascade failure.
[…]
People think of servers as things you can just reboot and be fine. That’s not how this works. If you rebooted every single $FAANG server simultaneously right now, all of $FAANG would be down for probably months. Or worse. And that’s with functional teams. This stuff is hard.
One thing that’s been interesting about recent events is seeing how people imagine big companies operate, e.g., people saying that Twitter is uniquely bad for not having a good cold boot procedure.
Multiple $1T companies didn’t or don’t have a real cold boot procedure.
In a complex system like Twitter or AWS, there is always a trade-off between doing failure automation work up front and incurring operational burden later on. The return on that automation work diminishes, and trying to automatically handle every possible failure case just isn’t worth it.
[…]
Yes, of course you try to threat model all possible failure modes. But then you only handle the 95% or so known/expected cases and don’t bother with the 5% unknown/rare cases. For those, you just throw smart humans at the problem once it arises.
Failures that seem only theoretical in a smaller system, like bit flips from cosmic rays, suddenly become very real once you’re dealing with millions of servers and millions of requests per second. At that scale, you have to assume these things will happen.
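The scale argument above is plain arithmetic, and a minimal sketch makes it concrete. The per-server daily probability used here is a hypothetical placeholder, not a measured bit-flip rate, and the function names are illustrative only. The sketch assumes events on different servers are independent.

```python
import math


def expected_events(p_per_unit: float, units: int) -> float:
    """Expected number of events across `units` independent units."""
    return p_per_unit * units


def p_at_least_one(p_per_unit: float, units: int) -> float:
    """P(at least one event) = 1 - (1 - p)^n, computed stably for tiny p."""
    return -math.expm1(units * math.log1p(-p_per_unit))


# HYPOTHETICAL rate, chosen only to make the arithmetic easy to follow.
P_PER_SERVER_PER_DAY = 1e-6

if __name__ == "__main__":
    for fleet_size in (10, 1_000, 1_000_000):
        expected = expected_events(P_PER_SERVER_PER_DAY, fleet_size)
        at_least_one = p_at_least_one(P_PER_SERVER_PER_DAY, fleet_size)
        print(
            f"{fleet_size:>9} servers: "
            f"expected events/day = {expected:.6f}, "
            f"P(at least one) = {at_least_one:.6f}"
        )
```

Under this assumed rate, ten servers would expect one such event in roughly 100,000 days, while a million servers would expect about one per day. The exact numbers depend entirely on the placeholder rate; the point is only that the expected count grows linearly with fleet size.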
[…]
The culture at AWS, for example, was hyper-aware from the beginning of circular dependencies and the need to cold boot, and it was always a big topic in any Principal-level design or operational readiness review.
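To illustrate why circular dependencies matter for a cold boot, here is a minimal sketch that computes a start order from a service dependency map and fails loudly when a cycle makes such an order impossible. The service names and the `cold_boot_order` helper are hypothetical; this is not a description of how AWS or Twitter actually review or boot their systems. It uses the standard library's `graphlib` module (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter


def cold_boot_order(dependencies: dict[str, list[str]]) -> list[str]:
    """Return a start order in which every service follows its dependencies.

    `dependencies` maps each service to the services it needs running first.
    Raises ValueError naming the cycle if no valid order exists.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        # graphlib reports the offending cycle as the second exception argument.
        cycle = " -> ".join(err.args[1])
        raise ValueError(f"circular dependency, cannot cold boot: {cycle}") from err


# Hypothetical services, for illustration only.
HEALTHY = {
    "dns": [],
    "config": ["dns"],
    "auth": ["config"],
    "api": ["auth", "config"],
}

# Here config needs DNS to be reachable, and DNS needs config to start.
CYCLIC = {
    "dns": ["config"],
    "config": ["dns"],
}

if __name__ == "__main__":
    print(cold_boot_order(HEALTHY))
    try:
        cold_boot_order(CYCLIC)
    except ValueError as err:
        print(err)
```

A cycle like the second map can go unnoticed for years in a running system, because each service only needs the other at startup. It surfaces only when everything is down at once, which is the reason to check for it at design time.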
Many executives fail to understand why tech companies are bloated. They are bloated because everything is held together with duct tape and “task force” teams. And it’s due to a gross lack of funding when it comes to removing complexity and technical debt.
Frankly, we should probably prioritize some big rewrites to combat 10+ years of tech debt and make a call on deleting features aggressively.
Leave something poorly architected, and it can give you a hundred easy-to-fix issues a month. You fix those, you have great velocity, everyone celebrates the 10x engineer. Fix the fundamental problem, you get 1 ticket closed, they fire the low-velocity engineer.
Strongly recommend checking the list of apps where you rely on Twitter for single sign-on.
If Twitter burns to the ground, which looks increasingly likely, these are the apps that you used Twitter to log in to. Set up email backups on those accounts ASAP.
Update (2022-12-02): Cindy Sridharan:
Tech Debt is one of those things that make sense to engineering, but to leadership it sounds like “we’ve created a mess over the years that slowed the product, we did nothing to fix it, and now we need to spend even more time and people on fixing it”.
This is a consequence of tech industry consolidation. Sure, users might grumble about glitches, inconsistent UIs and poor performance, but there are no competitors they can switch to. So there’s no incentive for leadership to take a more holistic approach to software development.
Engineers don’t know how to communicate these things. To execs, valid issues in this category sound indistinguishable from desire for generic yak shaving, things to improve their own QOL, or migrating things between tech for no reason. Quick thread on how to fix this!