Monday, October 20, 2025

AWS Outage

Amazon (Reddit, Hacker News, 2, 3):

We are investigating increased error rates and latencies for multiple AWS services in the US-EAST-1 Region.

I like how, unlike Apple’s status page, you can see a history of outages and updates.

Jess Weatherbed:

A major Amazon Web Services (AWS) outage took down multiple online services for several hours this morning, including Amazon, Alexa, Snapchat, Fortnite, ChatGPT, Epic Games Store, Epic Online Services, and more. Some of the impacted platforms, including Fortnite, Epic Games Store, and Perplexity had announced that they are fully recovered and back online earlier this morning, while others are still having issues.

The AWS dashboard first reported issues affecting the US-EAST-1 Region at 3:11AM ET, and eventually said that “The underlying DNS issue has been fully mitigated.”

I noticed this through problems with Amazon SES, which seemed to continue long after Amazon reported it as fixed. Also, the status page said the outage was confined to Northern Virginia, but I saw reports that other zones were affected, too.

caymanjim:

This is the real problem. Even if you don’t run anything in AWS directly, something you integrate with will. And when us-east-1 is down, it doesn’t matter if those services are in other availability zones. AWS’s own internal services rely heavily on us-east-1, and most third-party services live in us-east-1.

It really is a single point of failure for the majority of the Internet.

Normally, my site and store will fail over to Mailgun, but this ran into two problems:

See also: Dave Mark, Brian Webster, John Gruber, Ryan Jones, Christina Warren.


Update (2025-10-21): The cause of my Mailgun problem was, apparently, that they disable your account if you haven’t logged in in a while. After logging into the Web interface, SMTP support was automatically reactivated.
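For illustration, here is a minimal sketch of that kind of provider failover. It is not the code my site actually runs; the names `smtp_sender` and `send_with_failover`, and whatever hosts and credentials you pass in, are hypothetical. Each provider is tried in order, and any SMTP or network error (including the authentication error you would get from a deactivated account) moves on to the next one.

```python
import smtplib
from email.message import EmailMessage
from typing import Callable, Sequence

# A sender is any callable that delivers one message or raises on failure.
Sender = Callable[[EmailMessage], None]


def smtp_sender(host: str, port: int, user: str, password: str,
                timeout: float = 10.0) -> Sender:
    """Build a sender for one SMTP provider (host and credentials are placeholders)."""
    def send(msg: EmailMessage) -> None:
        with smtplib.SMTP(host, port, timeout=timeout) as smtp:
            smtp.starttls()
            smtp.login(user, password)  # raises SMTPAuthenticationError if rejected
            smtp.send_message(msg)
    return send


def send_with_failover(msg: EmailMessage,
                       senders: Sequence[tuple[str, Sender]]) -> str:
    """Try each (name, sender) pair in order; return the name that succeeded."""
    errors = []
    for name, send in senders:
        try:
            send(msg)
            return name
        except (smtplib.SMTPException, OSError) as exc:
            # Record the failure and fall through to the next provider.
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

A fallback that is never exercised can fail without anyone noticing, as happened here, so it is probably worth sending a periodic test message through the secondary provider as well.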

Corey Quinn (via Hacker News):

And so, a quiet suspicion starts to circulate: where have the senior AWS engineers who’ve been to this dance before gone? And the answer increasingly is that they’ve left the building — taking decades of hard-won institutional knowledge about how AWS’s systems work at scale right along with them.

[…]

Once you reach a certain point of scale, there are no simple problems left. What’s more concerning to me is the way it seems AWS has been flailing all day trying to run this one to ground. Suddenly, I’m reminded of something I had tried very hard to forget.

[…]

You can hire a bunch of very smart people who will explain how DNS works at a deep technical level (or you can hire me, who will incorrect you by explaining that it’s a database), but the one thing you can’t hire for is the person who remembers that when DNS starts getting wonky, check that seemingly unrelated system in the corner, because it has historically played a contributing role to some outages of yesteryear.

Axel Le Pennec:

Should we have a fallback to plain StoreKit in case RevenueCat, Superwall or Adapty are down? 🤔

I guess apps that are only using StoreKit weren’t affected by the AWS outage.

Calum Patterson:

A major Amazon Web Services (AWS) outage on October 20 had the unexpected side effect of causing chaos in bedrooms across the US, as owners of Eight Sleep’s $2,000+ ‘Pod’ mattress covers found their smart beds had no offline mode and were stuck at high temperatures and odd positions in the night.

Dave Polaschek:

The outage today reminded me of July 28, 1995, when almost all of Minnesota fell off the Internet.

Update (2025-10-22): See also: Ashley Belanger, Ben Thompson, Matt Stoller.

Update (2025-10-23): Gergely Orosz:

Today, we look into what caused this outage.

Update (2025-10-28): Thomas Claburn:

Signal president Meredith Whittaker called attention to this massive dependency in a thread on the Mastodon social network, explaining how the concentration of power among cloud hyperscalers limits the options of services like Signal in terms of resiliency and network control.

Whittaker said that the concentration of power among cloud hyperscalers (AWS, Google, and Microsoft) is less widely understood than she expected, which bodes poorly for efforts to craft realistic strategies to change this dynamic.

She explained, “The question isn’t ‘why does Signal use AWS?’ It’s to look at the infrastructural requirements of any global, real-time, mass comms platform and ask how it is that we got to a place where there’s no realistic alternative to AWS and the other hyperscalers.”

7 Comments


> It really is a single point of failure for the majority of the Internet.

It was a single point of failure big time. I had a small Amazon order scheduled for delivery yesterday. I happened to be awake at 3am EDT and suddenly realized they missed the "6-8 pm" promised delivery. No email, nothing. After making sure it wasn't delivered, I thought I'd cancel the order. I was logged in and one page said delivery would be today (10/20) but the status page for the order said 10/21. So I stupidly logged out, figuring I could log back in. Keep in mind, this is Amazon....

At first I couldn't get past the User ID page. I checked DownDetector and saw the spike. At least it was entertaining reading the comments. It was about an hour later that I could see things start coming back - at least I got through the password page - but couldn't get through the captcha, because after the first test it would reset no matter what. I finally did around 4:30. When I called Amazon to cancel the order (maybe an hour later) I was on hold for about 10 minutes. After the cancellation I wished the person on the other end a good day, explaining my experience, probably giving him a heads-up, as he said I was his first call....

Here's the last sentence from ABC News:

> Shares of Amazon ticked up 1.3% in midday trading, despite the outage.


Seems that EU governments have taken notice. Not like devs located in the EU didn't know or joke about the spof/dependency for ages.

Still a relatively minor incident when one considers what could happen had the region suffered a complete outage.


Beatrix Willius

Even my Fastspring shop was down. Not amused.


Yup, Amazon SES was definitely held up. Fortunately not critical and the retries were done automatically from my local MTA, but I only use it at all to get around "Policy" blocking. I shouldn't have needed to, and after I get a proxy up and running in Linode for all my SMTP traffic, I won't need Amazon anymore. But it just goes to show how much we depend on this shit, even when we're going out of our way to avoid it!


Hardik Panjwani

If a company can break a significant portion of the internet, then it's too big to be allowed to fail.

Either regulate it as a utility or break it up.


Hardik Panjwani

Hardik Panjwani you’re right. Same with banks. Too big to fail means too big to exist. It’s a national security issue, just like the justification for the government to buy part of Intel.


"If a company can break a significant portion of the internet"

The Internet wasn't broken at all. Just some services running on the Internet that made terrible decisions.
