Tuesday, October 5, 2021

Facebook BGP Outage

Celso Martinho and Tom Strickx (Hacker News):

Social media quickly burst into flames, reporting what our engineers rapidly confirmed too. Facebook and its affiliated services WhatsApp and Instagram were, in fact, all down. Their DNS names stopped resolving, and their infrastructure IPs were unreachable. It was as if someone had “pulled the cables” from their data centers all at once and disconnected them from the Internet.

This wasn’t a DNS issue itself, but failing DNS was the first symptom we’d seen of a larger Facebook outage.

[…]

BGP stands for Border Gateway Protocol. It’s a mechanism to exchange routing information between autonomous systems (AS) on the Internet. The big routers that make the Internet work have huge, constantly updated lists of the possible routes that can be used to deliver every network packet to their final destinations. Without BGP, the Internet routers wouldn’t know what to do, and the Internet wouldn’t work.

The Internet is literally a network of networks, and it’s bound together by BGP. BGP allows one network (say Facebook) to advertise its presence to other networks that form the Internet. As we write Facebook is not advertising its presence, ISPs and other networks can’t find Facebook’s network and so it is unavailable.
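The advertise-and-withdraw behavior described above can be illustrated with a toy routing table. This is a minimal sketch and not a BGP implementation: the `RoutingTable` class and its methods are hypothetical, and the prefix (192.0.2.0/24) and AS number (64500) are reserved documentation values, not Facebook's real ones. It assumes only that a router forwards toward the most specific advertised prefix and has nowhere to send a packet once that prefix is withdrawn.

```python
import ipaddress


class RoutingTable:
    """Toy model of the routes a router has learned (hypothetical, not real BGP)."""

    def __init__(self):
        # Maps each advertised prefix to the AS number that originated it.
        self.routes = {}

    def advertise(self, prefix, origin_as):
        """Record that origin_as announced reachability for prefix."""
        self.routes[ipaddress.ip_network(prefix)] = origin_as

    def withdraw(self, prefix):
        """Forget a prefix, as happens when its announcement is withdrawn."""
        self.routes.pop(ipaddress.ip_network(prefix), None)

    def lookup(self, address):
        """Return the origin AS for the longest matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self.routes if addr in net]
        if not matches:
            return None  # No route: the destination is unreachable.
        best = max(matches, key=lambda net: net.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.advertise("192.0.2.0/24", 64500)   # documentation prefix and AS number
print(table.lookup("192.0.2.10"))        # a route exists while it is advertised
table.withdraw("192.0.2.0/24")
print(table.lookup("192.0.2.10"))        # no route once it is withdrawn
```

In this model the servers behind the prefix could be perfectly healthy; other networks simply have no entry telling them where to send the traffic.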

Santosh Janardhan:

To all the people and businesses around the world who depend on us, we are sorry for the inconvenience caused by today’s outage across our platforms. We’ve been working as hard as we can to restore access, and our systems are now back up and running. The underlying cause of this outage also impacted many of the internal tools and systems we use in our day-to-day operations, complicating our attempts to quickly diagnose and resolve the problem.

See also: Brian Krebs (Hacker News), Bruce Schneier, Hacker News.

Update (2021-10-20): Santosh Janardhan:

Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.

Via Nick Heer:

For comparison, as I write this, Apple’s System Status page shows a resolved outage in Apple Pay and Wallet. For over seven hours yesterday, “users were not able to add, suspend, or remove existing cards to Apple Pay”, and this issue has simply been marked as “Resolved” but there are no more details. This explanation-free status update has been the standard for every iCloud-related outage, including serious incidents. It does not build confidence.

Reuters (via Hacker News):

Messaging app Telegram gained over 70 million new users during Monday’s Facebook outage, its founder Pavel Durov said on Tuesday, as people worldwide were left without key messaging services for nearly six hours.

Mark Zuckerberg (via Hacker News):

First, the SEV that took down all our services yesterday was the worst outage we’ve had in years. We’ve spent the past 24 hours debriefing how we can strengthen our systems against this kind of failure. This was also a reminder of how much our work matters to people. The deeper concern with an outage like this isn’t how many people switch to competitive services or how much money we lose, but what it means for the people who rely on our services to communicate with loved ones, run their businesses, or support their communities.

4 Comments

A wake-up call to everyone who has foolishly made their businesses dependent on Facebook in the first place.

Kevin Schumacher

@Brad I am the absolute last person to defend Facebook, but feel free to list a single company who has never had unexpected downtime. You can't, because it's not possible. Stuff happens.

Beatrix Willius

Companies have outages now and then. Companies are still able to access their buildings and are able to communicate if they have an outage (if the information is true). Other companies still allow their users to log in if the first company has an outage.

That's what makes the Facebook outage so funny. And yes, relying on "log in with Google" or "log in with Facebook" is stupid in the extreme.

There are literally billions of people outside the US using Facebook to conduct business activity without ever touching politics.
