Cloudflare Outage Caused by Database Permissions
The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems' permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.
The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
[…]
This post is an in-depth recount of exactly what happened and what systems and processes failed. It is also the beginning, though not the end, of what we plan to do in order to make sure an outage like this will not happen again.
He also has an HN comment about the writing of the postmortem.
This is how it’s done.
See also: How Complex Systems Fail (via Thomas Ptacek).
Some affected Cloudflare customers were able to pivot away from the platform temporarily so that visitors could still access their websites. But security experts say doing so may have also triggered an impromptu network penetration test for organizations that have come to rely on Cloudflare to block many types of abusive and malicious traffic.
Cloudflare’s outage yesterday shows the mind-boggling scale of their network: the graph shows 25 million HTTP 500 errors per second.
Unpopular opinion, apparently: companies like Cloudflare and Amazon provide very high-quality services that people and enterprises actually need, with a level of uptime and security vastly superior to what most of their customers would achieve on their own or with traditional providers. Their downtimes being so visible is a consequence of their success.
2025 is the year we learn that a tiny number of large companies have become single-point failures for the Internet in the US.
I feel like this kind of consolidation is undesirable for Internet resiliency, but also inevitable, as bearing the cost of implementing “undifferentiated” (in AWS’ parlance) infrastructure is not profitable for web service owners.
Previously:
1 Comment
One good thing about the outage is I didn't have to see Gruber post nonsense about Trump!