{"id":17320,"date":"2017-03-06T16:49:38","date_gmt":"2017-03-06T21:49:38","guid":{"rendered":"http:\/\/mjtsai.com\/blog\/?p=17320"},"modified":"2022-04-14T14:43:38","modified_gmt":"2022-04-14T18:43:38","slug":"amazon-s3-outage","status":"publish","type":"post","link":"https:\/\/mjtsai.com\/blog\/2017\/03\/06\/amazon-s3-outage\/","title":{"rendered":"Amazon S3 Outage"},"content":{"rendered":"<p><a href=\"https:\/\/aws.amazon.com\/message\/41926\/\">Amazon<\/a> (via <a href=\"https:\/\/twitter.com\/skimbrel\/status\/837358039032201216\">Sam &#x5317;&#x5CF6;-Kimbrel<\/a>, <a href=\"https:\/\/news.ycombinator.com\/item?id=13775667\">Hacker<\/a> <a href=\"https:\/\/news.ycombinator.com\/item?id=13755673\">News<\/a>):<\/p>\n<blockquote cite=\"https:\/\/aws.amazon.com\/message\/41926\/\"><p>The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.  One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests.<\/p>\n<p>[&#8230;]<\/p>\n<p>While this is an operation that we have relied on to maintain our systems since the launch of S3, we have not completely restarted the index subsystem or the placement subsystem in our larger regions for many years. S3 has experienced massive growth over the last several years and the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected.<\/p>\n<p>[&#8230;]<\/p>\n<p>From the beginning of this event until 11:37AM PST, we were unable to update the individual services&rsquo; status on the AWS Service Health Dashboard (SHD) because of a dependency the SHD administration console has on Amazon S3.<\/p><\/blockquote>\n\n<p><a href=\"https:\/\/twitter.com\/awscloud\/status\/836656664635846656\">Amazon Web Services<\/a> (<a href=\"https:\/\/news.ycombinator.com\/item?id=13756922\">Hacker News<\/a>):<\/p>\n<blockquote cite=\"https:\/\/twitter.com\/awscloud\/status\/836656664635846656\"><p>The dashboard not changing color is related to S3 issue.  See the banner at the top of the dashboard for updates.<\/p><\/blockquote>\n\n<p><a href=\"https:\/\/medium.com\/@jim_dowling\/reflections-on-s3s-architectural-flaws-71f14c05a5fa\">Jim Dowling<\/a> (via <a href=\"https:\/\/news.ycombinator.com\/item?id=13762102\">Hacker News<\/a>):<\/p>\n<blockquote cite=\"https:\/\/medium.com\/@jim_dowling\/reflections-on-s3s-architectural-flaws-71f14c05a5fa\"><p>Aside from the outage, there are many limitations of working with S3 that make it a less than ideal long term storage technology, and most of its problems relate to S3 object replication and metadata. S3 is a an eventually consistent key-value store for objects. However, eventual consistency tells us nothing about what guarantees S3 provides.<\/p>\n<p>[&#8230;]<\/p>\n<p>Netflix does not trust the metadata provided by S3. They have replaced it with their own metadata service, <a href=\"http:\/\/techblog.netflix.com\/2014\/01\/s3mper-consistency-in-cloud.html\">s3mper<\/a>, which is essentially an eventually consistent key-value store that stores a copy of the metadata in S3 [s3mper]. Netflix rewrote their applications to account for s3mper. In the diagram below, you can see that application programming becomes more complex. Creating an object in S3 becomes a write to DynamoDB and a create operation in S3. This is not done transactionally. All S3 read\/list operations need to be re-written to query DynamoDB and S3 and compare the results.<\/p><\/blockquote>\n\n\n<p>For me, the S3 outage brought down part of my <a href=\"https:\/\/fastspring.com\">FastSpring<\/a> <a href=\"https:\/\/c-command.com\/store\/\">store<\/a>, and a bunch of <a href=\"https:\/\/c-command.com\/sn\">serial number reminder<\/a> e-mails and crash reports didn&rsquo;t go out because Amazon SES kept failing. My server code had assumed that sending e-mails would always succeed. In fact, it relied on sending e-mails to myself in order to report errors with the site and store. I&rsquo;ve since added <a href=\"https:\/\/www.sparkpost.com\">SparkPost<\/a> and <a href=\"http:\/\/www.fastmail.fm\/?STKI=10293121\">FastMail<\/a> as backup SMTP providers.<\/p>\n<p>I also plan to store e-mails in a database until they&rsquo;ve been successfully sent. This seemed like it would be really easy, but I ran into a weird issue with my <a href=\"http:\/\/www.sqlobject.org\">database layer<\/a> not saving, and I haven&rsquo;t had time yet to track that down.<\/p>","protected":false},"excerpt":{"rendered":"<p>Amazon (via Sam &#x5317;&#x5CF6;-Kimbrel, Hacker News): The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"apple_news_api_created_at":"2022-04-14T18:43:40Z","apple_news_api_id":"8819b02a-04ec-4a04-9d79-1e63ff745620","apple_news_api_modified_at":"2022-04-14T18:43:41Z","apple_news_api_revision":"AAAAAAAAAAD\/\/\/\/\/\/\/\/\/\/w==","apple_news_api_share_url":"https:\/\/apple.news\/AiBmwKgTsSgSdeR5j_3RWIA","apple_news_coverimage":0,"apple_news_coverimage_caption":"","apple_news_is_hidden":false,"apple_news_is_paid":false,"apple_news_is_preview":false,"apple_news_is_sponsored":false,"apple_news_maturity_rating":"","apple_news_metadata":"\"\"","apple_news_pullquote":"","apple_news_pullquote_position":"","apple_news_slug":"","apple_news_sections":"\"\"","apple_news_suppress_video_url":false,"apple_news_use_image_component":false,"footnotes":""},"categories":[2],"tags":[19,21,602,672,596,1290,2190,481,1487,50],"class_list":["post-17320","post","type-post","status-publish","format-standard","hentry","category-technology","tag-amazon","tag-s3","tag-amazon-ses","tag-amazon-web-services","tag-fastmail","tag-fastspring","tag-outage","tag-smtp","tag-sparkpost","tag-webapi"],"apple_news_notices":[],"_links":{"self":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/17320","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/comments?post=17320"}],"version-history":[{"count":1,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/17320\/revisions"}],"predecessor-version":[{"id":17321,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/posts\/17320\/revisions\/17321"}],"wp:attachment":[{"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/media?parent=17320"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/categories?post=17320"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/mjtsai.com\/blog\/wp-json\/wp\/v2\/tags?post=17320"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}