Tuesday, May 2, 2023

Google Cloud Services Outages

Thomas Claburn (Hacker News):

Google Cloud stopped operating in Paris early on Wednesday morning local time due to “water intrusion,” said the off-prem biz, which a day earlier reported profitability for the first time.


“Water intrusion in europe-west9-a led to an emergency shutdown of some hardware in that zone,” the company’s status page explains. “There is no current ETA for recovery of operations in europe-west9-a, but it is expected to be an extended outage. Customers are advised to fail over to other zones in europe-west9 if they are impacted.”

A short while later, the incident description changed to “a multi-cluster failure and has led to an emergency shutdown of multiple zones.”


Though more brief, the load balancing problems were far broader, affecting not just the europe-west9 zone but multiple zones in Asia, Australia, Europe, North America, and South America.

Gergely Orosz (via Drew Thaler):

I have questions. How does water intrusion into one data center take a whole zone (which should be multiple, physically separate and redundant DCs) offline?

The point of availability zones is to avoid issues in one DC taking down the whole zone.

Oh, I just see: an issue in one DC took down a whole region! So all AZs within that region are down.

Wow, this is very bad: the point of AZs is exactly for this to not happen.

Joshua Burgin:

Both Google and Microsoft don’t guarantee that all zones are physically separate buildings or separated by at least <x> km/miles. Many of their “zones” in smaller regions are just separate buildings by the same DC facility

Dylan Tack:

“[AWS] AZs are physically separated by a meaningful distance, many kilometers, from any other AZ, although all are within 100 km (60 miles) of each other.”

Update (2023-09-04): Ry Crozier (via Hacker News):

Microsoft had “insufficient” staff levels at its data centre campus last week when a power sag knocked its chiller plant for two data halls offline, cooking portions of its storage hardware.


“We have temporarily increased the team size from three to seven, until the underlying issues are better understood and appropriate mitigations can be put in place.”
