AWS promises "larger CPU and memory servers" after US-EAST-1 outage triggers stack rethink

All servers exceeded "the maximum number of threads allowed by an OS configuration"

The post-mortem of a major outage at AWS's US-EAST-1 region has revealed a number of unexpected vulnerabilities in the hyperscale cloud provider's architecture there -- with structural fixes now underway.

The November 25 outage began after AWS added new capacity to the front-end of data streaming service AWS Kinesis.

This led "all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration," AWS admitted in an unusually detailed and prompt post-mortem of the incident.

A domino effect led to numerous major associated AWS services facing sustained issues -- including Cognito, which uses Kinesis to collect and analyse API access patterns, and CloudWatch, a widely used AWS application and infrastructure monitoring service.

AWS was also left "unable to update the [customer] Service Health Dashboard because the tool we use to post these updates itself uses Cognito, which was impacted by this event." (Yes, this perennial bugbear of online service providers is still a thing in 2020.)

The company has now promised numerous changes to its architecture and stronger safeguards to prevent recurrence, as well as the decoupling of related services that failed as a result.

Larger CPU servers promised...

"In the very short term, we will be moving to larger CPU and memory servers, reducing the total number of servers and, hence, threads required by each server to communicate across the fleet" AWS said.

The outage in the AWS data centre in northern Virginia knocked cloud-connected Ring doorbells and Roomba robot vacuum cleaners out of service, along with other more substantial workloads for a range of enterprise customers. (Kinesis is an AWS service that ingests, then analyses real-time streaming data -- including for customers running a wide range of Internet of Things devices, like the doorbells.)
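For readers unfamiliar with the service, producers typically push events into a Kinesis data stream via the AWS SDK. A minimal sketch using boto3 is below; the stream name and payload are made up for illustration:

```python
# Minimal sketch of a producer writing one event to a Kinesis data stream
# with boto3. Stream name and event payload are hypothetical.
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

event = {"device_id": "doorbell-42", "event": "motion_detected", "ts": 1606302000}

response = kinesis.put_record(
    StreamName="example-device-events",    # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["device_id"],       # keeps a device's events on one shard
)
print("Stored in shard:", response["ShardId"])
```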

AWS Kinesis outage: lessons learned?

The post-mortem also reveals that major AWS services like CloudWatch -- an application and infrastructure monitoring service -- were not running on a separate, partitioned front-end fleet of servers, but entangled with Kinesis's workloads.

AWS is now moving CloudWatch to a separate fleet of servers. It is also moving the Kinesis front-end server cache to a dedicated fleet, the company said, after what appears to have been an unpleasant episode for AWS's engineers as errors spread across services ("the diagnosis work was slowed by the variety of errors observed," AWS notes).

The full post-mortem is here.