Cloudflare Outage Caused by Database Permissions
The issue was not caused, directly or indirectly, by a cyber attack or malicious activity of any kind. Instead, it was triggered by a change to one of our database systems' permissions which caused the database to output multiple entries into a “feature file” used by our Bot Management system. That feature file, in turn, doubled in size. The larger-than-expected feature file was then propagated to all the machines that make up our network.
The software running on these machines to route traffic across our network reads this feature file to keep our Bot Management system up to date with ever changing threats. The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail.
[…]
This post is an in-depth recount of exactly what happened and what systems and processes failed. It is also the beginning, though not the end, of what we plan to do in order to make sure an outage like this will not happen again.
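In other words, a preallocated limit on the feature file turned an unexpectedly large (but otherwise valid) config into a hard failure rather than a graceful fallback. Here's a minimal Rust sketch of that failure mode; the names, the cap value, and the unwrap() are my assumptions for illustration, not Cloudflare's actual code:

```rust
// Hypothetical sketch of a loader with a hard cap on feature-file entries.
// The constant, types, and names here are illustrative assumptions only.
const MAX_FEATURES: usize = 200; // preallocated capacity; anything beyond it is an error

#[derive(Debug)]
struct FeatureFileTooLarge {
    count: usize,
}

fn load_features(lines: &[String]) -> Result<Vec<String>, FeatureFileTooLarge> {
    if lines.len() > MAX_FEATURES {
        // The file grew past the preallocated limit (e.g. after duplicate
        // rows doubled its size), so loading fails instead of truncating.
        return Err(FeatureFileTooLarge { count: lines.len() });
    }
    Ok(lines.to_vec())
}

fn main() {
    // Simulate a feature file that doubled in size past the cap.
    let doubled: Vec<String> = (0..2 * MAX_FEATURES)
        .map(|i| format!("feature_{i}"))
        .collect();

    // If the caller treats the error as impossible (e.g. with unwrap()),
    // the oversized file crashes the process instead of being rejected cleanly.
    let _features = load_features(&doubled).unwrap(); // panics here
}
```

The point of the sketch is that the bug isn't the size limit itself; it's that exceeding the limit was treated as an unreachable condition instead of a recoverable one.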
He also has an HN comment about the writing of the postmortem.
This is how it’s done.
See also: How Complex Systems Fail (via Thomas Ptacek).
Some affected Cloudflare customers were able to pivot away from the platform temporarily so that visitors could still access their websites. But security experts say doing so may have also triggered an impromptu network penetration test for organizations that have come to rely on Cloudflare to block many types of abusive and malicious traffic.
Cloudflare’s outage yesterday shows the mind-boggling scale of their network. The graph has 25 million HTTP 500 errors per second.
Unpopular opinion, apparently: companies like cloudflare and Amazon provide very high quality services people and enterprises actually need, with a level of uptime and security vastly superior to what most of their customers would achieve on their own or using traditional providers. Their downtimes being so visible is a consequence of their success.
2025 is the year we learn that a tiny number of large companies have become single-point failures for the Internet in the US.
I feel like this kind of consolidation is undesirable for Internet resiliency, but also inevitable as the cost of implementing “undifferentiated” (in AWS’ parlance) infrastructure is not profitable to web service owners.
Previously:
Update (2025-12-10): See also: Accidental Tech Podcast.
On December 5, 2025, at 08:47 UTC (all times in this blog are UTC), a portion of Cloudflare’s network began experiencing significant failures. The incident was resolved at 09:12 (~25 minutes total impact), when all services were fully restored.
A subset of customers were impacted, accounting for approximately 28% of all HTTP traffic served by Cloudflare. Several factors needed to combine for an individual customer to be affected as described below.
The issue was not caused, directly or indirectly, by a cyber attack on Cloudflare’s systems or malicious activity of any kind. Instead, it was triggered by changes being made to our body parsing logic while attempting to detect and mitigate an industry-wide vulnerability disclosed this week in React Server Components.
One good thing about the outage is I didn't have to see Gruber post nonsense about Trump!
I'm so sorry to hear that someone is forcing you to look at Gruber's blog Clockwork Orange style, Ben. Please let us know what we can do to rescue you!