Tuesday, May 2, 2023

Google Cloud Services Outages

Thomas Claburn (Hacker News):

Google Cloud stopped operating in Paris early on Wednesday morning local time due to “water intrusion,” said the off-prem biz, which a day earlier reported profitability for the first time.

[…]

“Water intrusion in europe-west9-a led to an emergency shutdown of some hardware in that zone,” the company’s status page explains. “There is no current ETA for recovery of operations in europe-west9-a, but it is expected to be an extended outage. Customers are advised to fail over to other zones in europe-west9 if they are impacted.”

A short while later, the incident description changed to “a multi-cluster failure and has led to an emergency shutdown of multiple zones.”

[…]

Though more brief, the load balancing problems were far broader, affecting not just the europe-west9 zone but multiple zones in Asia, Australia, Europe, North America, and South America.

Gergely Orosz (via Drew Thaler):

I have questions. How does water intrusion into one data center take a whole zone (which should be multiple, physically separate and redundant DCs) offline?

The point of availability zones is to avoid issues in one DC taking down the whole zone.

Oh, I just see: an issue in one DC took down a whole region! So all AZs within that region are down.

Wow, this is very bad: the point of AZs is exactly for this to not happen.

Joshua Burgin:

Both Google and Microsoft don’t guarantee that all zones are physically separate buildings or separated by at least <x> km/miles. Many of their “zones” in smaller regions are just separate buildings by the same DC facility

Dylan Tack:

“[AWS] AZs are physically separated by a meaningful distance, many kilometers, from any other AZ, although all are within 100 km (60 miles) of each other.”

Update (2023-09-04): Ry Crozier (via Hacker News):

Microsoft had “insufficient” staff levels at its data centre campus last week when a power sag knocked its chiller plant for two data halls offline, cooking portions of its storage hardware.

[…]

“We have temporarily increased the team size from three to seven, until the underlying issues are better understood and appropriate mitigations can be put in place.”
