AWS US-EAST-1 suffers a critical wobble, as services fall over
What was that about Azure's "spotty operational performance"?
Information on the December 7, 2021 AWS outage is here. The below article was published in September...
Take a cursory look at the AWS status dashboard and you'd be forgiven for thinking that nothing had happened.
"No recent events" it proclaims proudly right at the top. You'd have to scroll down through over 680 rows (yes, we counted) of AWS services to spot the small yellow triangle over EC2 (N. Virginia) and find the hyperscaler confirming that yes, there had in fact been a sustained problem rather recently, sorry about that.
Yesterday's outage in US-EAST-1 (AWS's largest and oldest region) lasted for up to eight hours, taking down encrypted messenger Signal, content management system (CMS) providers, smart homes, and more.
It appears to have been the second major issue with the region this month, after users also reported issues on September 15 (although there is no post-mortem for the latter showing on AWS's incident report page).
"The issue was caused by increased resource contention within the EBS subsystem responsible for coordinating EBS storage hosts", AWS said, somewhat opaquely, first acknowledging the issue at 08:11 PDT on September 26 and admitting five hours later that it had also affected Redshift, OpenSearch, and ElastiCache. (We speculate, but similar language from Azure in 2019 turned out to be pointing to an overloaded Redis cache...)
The AWS US-EAST-1 outage was fixed at 03:45am on September 27, as users lamented the response time.
It's the second public issue AWS has faced this month, following a disruption to the AWS Direct Connect service in the Tokyo (AP-NORTHEAST-1) region on September 2, 2021. (Direct Connect provides private connectivity between a customer’s data center and their AWS VPCs.)
US-EAST-1 has gained something of a reputation for being troublesome.
It also suffered a sustained outage in November 2020 that affected multiple cloud services and left AWS promising to "increase [...] thread count limits in our operating system configuration, which we believe will give us significantly more threads per server and give us significant additional safety margin there as well [and make] a number of changes to radically improve the cold-start time for the front-end fleet."
"AWS ec2/ebs are having issuea in us-east-1 and my entire smart home is down; Alexa, smarthings, just dead. It's amazing how we became so dependant on the cloud. Back to using physical switches like a peasant again", said one user, security company CyberArk's Ran Isenberg, on Twitter, while Bitcoin developer and Bloq Inc co-founder Jeff Garzik wrote: "One of the first #DevOps rules I instituted at @bloqinc was: Avoid AWS's us-east-1.
"It is a surprisingly difficult rule to follow", he added. "We are still not 100%."
Amazon has in the past hit out at Microsoft Azure for its "spotty operational performance".
Perhaps next time it will consider a friendly "hugops" instead.