Thursday, April 14, 2022

The Longest Atlassian Outage of All Time

Gergely Orosz (via Hacker News):

We are in the middle of the longest outage Atlassian has ever had. Close to 400 companies and anywhere from 50,000 to 400,000 users had no access to JIRA, Confluence, OpsGenie, JIRA Status page, and other Atlassian Cloud services. The outage is on its 9th day, having started on Monday, 4th of April. Atlassian estimates many impacted customers will be unable to access their services for another two weeks. At the time of writing, 45% of companies have seen their access restored.

For most of this outage, Atlassian has gone silent in communications across their main channels such as Twitter or the community forums. It took until Day 9 for executives at the company to acknowledge the outage.

[…]

For the past week, everyone has been guessing about the cause of the outage. The most common suspicion coming from several sources like The Stack was how the legacy Insight Plug-In plugin was being retired. A script was supposed to delete all customer data from this plugin but accidentally deleted all customer data for anyone using this plugin. Up to Day 9, Atlassian would neither confirm, nor deny these speculations.

[…]

  • Atlassian can, indeed, restore all data to a checkpoint in a matter of hours.

  • However, if they did this, while the impacted ~400 companies would get back all their data, everyone else would lose all data committed since that point.

  • So now each customer’s data needs to be selectively restored. Atlassian has no tools to do this in bulk.
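The difference between a full rollback and a selective restore can be sketched in a few lines. This is only a toy model, assuming each site's data is a single dictionary entry keyed by a hypothetical site ID; the function name `selective_restore` and the data layout are illustrative assumptions, not a description of Atlassian's actual storage or tooling.

```python
def selective_restore(snapshot, live, affected_sites):
    """Return a new state in which only the affected sites come from the snapshot.

    snapshot: dict of site ID -> data as of the checkpoint (hypothetical layout)
    live: dict of site ID -> current data, including writes made after the checkpoint
    affected_sites: IDs of the sites whose data was deleted
    """
    restored = dict(live)  # keep everyone else's post-checkpoint data untouched
    for site_id in affected_sites:
        if site_id in snapshot:
            restored[site_id] = snapshot[site_id]
    return restored


# Site "a" was deleted after the checkpoint; site "b" kept working and has newer data.
snapshot = {"a": "a-at-checkpoint", "b": "b-at-checkpoint"}
live = {"b": "b-after-checkpoint"}

print(selective_restore(snapshot, live, {"a"}))
```

Restoring the whole snapshot would instead overwrite `"b"` with its checkpoint value, which is the data loss for unaffected customers that the bullets above describe.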


Update (2022-05-09): Atlassian:

There was a communication gap between the team that requested the deletion and the team that ran the deletion. Instead of providing the IDs of the intended app being marked for deletion, the team provided the IDs of the entire cloud site where the apps were to be deleted.

[…]

The API used to perform the deletion accepted both site and app identifiers and assumed the input was correct – this meant that if a site ID is passed, a site would be deleted; if an app ID was passed, an app would be deleted. There was no warning signal to confirm the type of deletion (site or app) being requested.
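A minimal sketch of the kind of guard the review says was missing: the caller must state which kind of object it intends to delete, and the request is rejected when the ID turns out to be something else. Everything here (`TargetKind`, `DeletionRequest`, `validate_deletion`, the `registry` lookup, the ID strings) is a hypothetical illustration, not Atlassian's API.

```python
from dataclasses import dataclass
from enum import Enum


class TargetKind(Enum):
    APP = "app"
    SITE = "site"


class DeletionRejected(Exception):
    """Raised when a deletion request fails a safety check."""


@dataclass(frozen=True)
class DeletionRequest:
    target_id: str
    expected_kind: TargetKind            # what the requester believes the ID refers to
    confirm_site_deletion: bool = False  # extra opt-in for the destructive case


def validate_deletion(request, registry):
    """Return the kind of object to delete, or raise DeletionRejected.

    registry is a hypothetical mapping from known IDs to their actual kind.
    """
    actual = registry.get(request.target_id)
    if actual is None:
        raise DeletionRejected(f"unknown ID: {request.target_id}")
    if actual is not request.expected_kind:
        raise DeletionRejected(
            f"{request.target_id} is a {actual.value}, "
            f"but the request expected a {request.expected_kind.value}"
        )
    if actual is TargetKind.SITE and not request.confirm_site_deletion:
        raise DeletionRejected("site deletion requires explicit confirmation")
    return actual


registry = {"app-1": TargetKind.APP, "site-1": TargetKind.SITE}

# A site ID submitted where an app ID was intended is rejected instead of deleted.
try:
    validate_deletion(DeletionRequest("site-1", TargetKind.APP), registry)
except DeletionRejected as error:
    print(f"rejected: {error}")
```

With a check like this, handing over site IDs in place of app IDs fails loudly at the API boundary rather than relying on the two teams having understood each other.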

[…]

At the start of the incident, we knew exactly which sites were affected and our priority was to establish communication with the approved owner for each impacted site to inform them of the outage.

However, some customer contact information was deleted. This meant that customers could not file support tickets as they normally would. This also meant we did not have immediate access to key customer contacts.

[…]

Our full list of action items is detailed in the full post-incident review below.

1 Comment

I'm idly wondering if this would still have happened if the teams with the "communication gap" had been sitting at adjacent banks of desks in a real office...
