Sweeping AWS outage blamed on "network device impairment"
"This issue is affecting the global console landing page, which is also hosted in US-EAST-1"
16:30 GMT, December 7, 2021: AWS down, with services failing for millions of users. Updated 16:52: console issues linger; problems also confirmed with EC2, DynamoDB, Amazon Connect, and AWS Support Center. Updated 21:10 GMT with details on root cause and further services affected.
A number of widely used AWS services failed on Tuesday, with the cloud giant’s user console inaccessible as The Stack published this report shortly after 16:30 on December 7, 2021. After a delay of well over an hour, AWS updated its dashboard to reflect the issues, later blaming an "impairment of several network devices" for the outage.
The issue was pinpointed to AWS's error-prone US-EAST-1 region but had sweeping impact, taking down call centers and other services while also hampering access to the global console landing page -- the primary portal through which many users reach their AWS services -- which, the company confirmed earlier this evening, is also hosted in US-EAST-1.
The incident is the second major AWS outage in the region in 16 weeks, following a September outage that affected multiple services including Redshift, OpenSearch, and ElastiCache, and was blamed at the time on "increased resource contention within the EBS subsystem responsible for coordinating EBS storage hosts".
McDonald's, Netflix, Slack and Tinder were among the major companies affected today.
In a 20:26 GMT update, AWS said: "We are seeing impact to multiple AWS APIs in the US-EAST-1 Region.
"This issue is also affecting some of our monitoring and incident response tooling, which is delaying our ability to provide updates. Services impacted include: EC2, Connect, DynamoDB, Glue, Athena, Timestream, and Chime and other AWS Services in US-EAST-1. The root cause of this issue is an impairment of several network devices... We are pursuing multiple mitigation paths in parallel, and have seen some signs of recovery, but we do not have an ETA for full recovery at this time. Root logins for consoles in all AWS regions are affected by this issue, however customers can login to consoles other than US-EAST-1 by using an IAM role for authentication."
The incident raises fresh questions about the resilience of the critical cloud region. (Users still do not know what happened during September 2021's prolonged outage: to the best of our knowledge, no post-incident write-up has been shared; certainly not on AWS's post-event summaries page...)
See also: AWS US-EAST-1 suffers a critical wobble, as services fall over
US-EAST-1 has developed something of a reputation for being troublesome.
The region also suffered a sustained outage in November 2020 that affected multiple cloud services and left AWS promising to “increase […] thread count limits in our operating system configuration, which we believe will give us significantly more threads per server and give us significant additional safety margin there as well [and make] a number of changes to radically improve the cold-start time for the front-end fleet.”
That incident was triggered by a "small addition of capacity" to the front-end fleet of Kinesis servers. (Kinesis is used by developers, and by other AWS services, to capture data and video streams.) That small change set off a truly epic cascade of issues, detailed by AWS here for the curious.
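For the unfamiliar, the producer side of Kinesis is very simple; the sketch below, with a made-up stream name, pushes a single record into a Kinesis data stream using boto3.

```python
import json
import boto3

# Illustrative only: write one record to a (hypothetical) Kinesis data stream.
kinesis = boto3.client("kinesis", region_name="us-east-1")

kinesis.put_record(
    StreamName="clickstream-events",  # made-up stream name
    Data=json.dumps({"user_id": "u-123", "event": "page_view"}).encode("utf-8"),
    PartitionKey="u-123",  # records with the same key land on the same shard
)
```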
A snapshot of that lengthy post-event summary: "We have a number of learnings that we will be implementing immediately. In the very short term, we will be moving to larger CPU and memory servers, reducing the total number of servers and, hence, threads required by each server to communicate across the fleet... We are adding fine-grained alarming for thread consumption in the service. We will also finish testing an increase in thread count limits in our operating system configuration, which we believe will give us significantly more threads per server and give us significant additional safety margin there as well. In addition, we are making a number of changes to radically improve the cold-start time for the front-end fleet. We are moving the front-end server cache to a dedicated fleet. We will also move a few large AWS services, like CloudWatch, to a separate, partitioned front-end fleet. In the medium term, we will greatly accelerate the cellularization of the front-end fleet to match what we’ve done with the back-end. Cellularization is an approach we use to isolate the effects of failure within a service, and to keep the components of the service (in this case, the shard-map cache) operating within a previously tested and operated range. This had been under way for the front-end fleet in Kinesis, but unfortunately the work is significant and had not yet been completed," AWS said in 2020.
"In addition to allowing us to operate the front-end in a consistent and well-tested range of total threads consumed, cellularization will provide better protection against any future unknown scaling limit" it added.