Amazon Explains S3 Internet Outage
Some of you may have noticed that part of the Internet was not working earlier this week. Amazon's Simple Storage Service (S3) in the US-EAST-1 region partially went down, and the company has now explained what happened. An S3 team in Northern Virginia was debugging an issue with the S3 billing system and, while following an established playbook, one of the team members entered a command incorrectly. You see, S3 is designed to tolerate a loss of capacity from removal or failure, and the work being done called for taking a small number of servers offline. However, one input to the command was wrong, causing far more servers to be removed than intended.
As it turns out, the servers that were inadvertently taken offline supported two S3 subsystems: the index subsystem and the placement subsystem. The index subsystem is necessary for all GET, LIST, PUT, and DELETE requests, while the placement subsystem is needed to allocate storage during a PUT request. Because so many servers were removed, both subsystems needed a full restart, and during that time S3 could not serve requests. The restart also took a long time, both because of the safety checks needed to validate the metadata and because of how much S3 has grown in recent years. Once the index subsystem was back up, the GET, LIST, and DELETE APIs were functional, and the placement subsystem could then start recovering. Once that was done, S3 was operating normally, but some services needed to catch up on a backlog of work from the disruption.
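The dependency between request types and subsystems described above can be sketched in a few lines of Python. This is only an illustration of the relationships stated in the explanation, not Amazon's implementation; the names REQUIRED_SUBSYSTEMS and available_operations are hypothetical.

```python
# Hypothetical model of which S3 request types depend on which subsystem,
# based only on the description above (not on Amazon's actual code).
REQUIRED_SUBSYSTEMS = {
    "GET": {"index"},
    "LIST": {"index"},
    "DELETE": {"index"},
    "PUT": {"index", "placement"},  # PUT also needs placement to allocate storage
}


def available_operations(healthy_subsystems):
    """Return the request types that can be served given the healthy subsystems."""
    healthy = set(healthy_subsystems)
    return sorted(
        op for op, needed in REQUIRED_SUBSYSTEMS.items() if needed <= healthy
    )


# During recovery the index subsystem came back first, so PUT stayed unavailable
# until the placement subsystem had also recovered.
print(available_operations({"index"}))
print(available_operations({"index", "placement"}))
```

With only the index subsystem healthy, the sketch reports GET, LIST, and DELETE as available, which matches the recovery order Amazon described.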
To prevent this from happening again, the tool that removed the servers has been changed to remove capacity more slowly and now has safeguards that stop capacity from dropping below a minimum required level. Other operational tools are also being audited to make sure they have similar checks. Finally, a plan to partition the S3 subsystems into smaller cells later this year has been moved up to start immediately, as smaller partitions will limit the blast radius of future failures and improve recovery.
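To make the two safeguards concrete, here is a minimal Python sketch of a capacity-removal check. It assumes a simple model in which a tool knows the number of active servers, a minimum capacity floor, and a per-step limit; the function plan_removal and all of its parameters are hypothetical and are not taken from Amazon's tooling.

```python
def plan_removal(active, requested, min_capacity, max_per_step):
    """Plan the removal of `requested` servers in small batches.

    Hypothetical sketch: refuses the whole request if it would push
    capacity below `min_capacity`, and otherwise splits the removal
    into batches of at most `max_per_step` servers.
    """
    if requested < 0 or max_per_step <= 0:
        raise ValueError("requested must be >= 0 and max_per_step must be > 0")
    if active - requested < min_capacity:
        raise ValueError(
            f"refusing to remove {requested} servers: "
            f"capacity would fall below the minimum of {min_capacity}"
        )

    batches = []
    remaining = requested
    while remaining > 0:
        step = min(max_per_step, remaining)  # remove capacity slowly
        batches.append(step)
        remaining -= step
    return batches


# A small request is split into batches; an oversized one is rejected outright.
print(plan_removal(active=100, requested=7, min_capacity=80, max_per_step=3))
try:
    plan_removal(active=100, requested=30, min_capacity=80, max_per_step=3)
except ValueError as err:
    print(err)
```

The point of the design is that a mistyped input fails loudly before any server is touched, instead of being carried out at full speed.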