AWS’s S3 data replication falters in US-EAST-1 as hyperscaler tackles "backlog"
"[O]ur engineers have been focused on mitigating the impact of the delayed replication through changes made to replication subsystems, resource adjustments and other modifications"
Updated 14:50 October 20. AWS says the issue is fully resolved and has confirmed that delays extended from October 17 to October 20. It finished processing the backlog of delayed replication by 3:45 AM PDT on October 20.
AWS says it is processing a “backlog” of S3 objects awaiting replication after an incident in its US-EAST-1 region triggered data copying issues. Customers that have publicly said they were affected include cloud security company Wiz, whose own customers saw delays to security scans.
The incident is likely to put AWS in breach of its S3 RTC SLAs for customers. (Replication Time Control is a service introduced by AWS in 2019 that is “designed to replicate 99.99% of objects within 15 minutes after upload, with the majority of those new objects replicated in seconds”.)
AWS said it had deployed “a code change to a subsystem that aggregates replication operations” and would then “scale up the traffic settings” after the incident, which began early on October 18, Pacific Time.
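To make the quoted design target concrete, here is a minimal sketch of how a customer might check their own replication delays against it. It assumes only the two figures AWS quotes (99.99% and 15 minutes); the function and constant names are hypothetical, and the actual SLA terms and measurement method are not described in this article.

```python
from datetime import timedelta

# Figures taken from AWS's quoted design target; everything else is illustrative.
RTC_WINDOW = timedelta(minutes=15)
RTC_TARGET = 0.9999  # 99.99% of objects


def rtc_compliance(replication_delays):
    """Return (fraction replicated within the window, whether the target is met).

    replication_delays: list of timedelta values, one per replicated object,
    measured from upload to arrival in the destination bucket.
    """
    if not replication_delays:
        raise ValueError("need at least one replication delay")
    within = sum(1 for delay in replication_delays if delay <= RTC_WINDOW)
    fraction = within / len(replication_delays)
    return fraction, fraction >= RTC_TARGET
```

On these assumptions, a sample of 10,000 objects can tolerate one object arriving late; a second late object drops the sample below the target.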
AWS S3 replication issues resolving
“[Since] our alarms fired upon detection of replication delays our engineers have been focused on mitigating the impact of the delayed replication through changes made to replication subsystems, resource adjustments and other modifications to the replication environment.”
That’s according to an AWS service status update, which also shows that approximately 11 hours after the first alerts AWS was “making progress on restarting a replication subsystem that distributes work via queues.”
“We've gradually slowed the system and are scaling it back up again so that we can more closely observe and mitigate how the software processes the increased load,” it said at 11:35 PM on October 18.
Which service wobbled?
Amazon S3 Replication Time Control (RTC) in US-EAST-1. This lets customers copy critical data within or between AWS regions in order to meet regulatory requirements for geographic redundancy as part of a disaster recovery plan, or for other operational reasons. Customers can copy within a region to aggregate logs, set up test and development environments, and address compliance requirements.
Early on Thursday, October 19, AWS told customers that “while we are restoring our replication system [you] can use the S3 COPY API directly or through S3 Batch Operations, or use AWS Backup.” It added: “For customers who have a lifecycle policy on replicated data, we can confirm that lifecycle policies will not take action on replicated data until all the replicated data in the backlog has been delivered to destination buckets.”
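For readers weighing the first of those workarounds, a minimal sketch of a direct copy might look like the following. It assumes a boto3-style S3 client is passed in (`copy_object` with `Bucket`, `Key` and `CopySource` is a real boto3 call); the function name, bucket names and keys are hypothetical, and this is an illustration rather than AWS's recommended procedure.

```python
def copy_objects(s3_client, source_bucket, dest_bucket, keys):
    """Copy each key from source_bucket to dest_bucket.

    s3_client is assumed to be a boto3 S3 client (or anything exposing the
    same copy_object method). Returns a list of (key, error message) pairs
    for copies that failed, so they can be retried later.
    """
    failed = []
    for key in keys:
        try:
            s3_client.copy_object(
                Bucket=dest_bucket,
                Key=key,
                CopySource={"Bucket": source_bucket, "Key": key},
            )
        except Exception as exc:
            # Broad catch keeps the sketch free of botocore imports;
            # real code would catch botocore.exceptions.ClientError.
            failed.append((key, str(exc)))
    return failed
```

With boto3 installed and credentials configured, this could be called as `copy_objects(boto3.client("s3"), "example-source", "example-dest", ["logs/app.json"])`. A single `copy_object` call handles objects up to 5 GB; larger objects need a multipart copy, and large key lists are what S3 Batch Operations is aimed at.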
As The Stack published, AWS was seeing “steady progress in recovery of RTC replication traffic for storage using the US-EAST-1 Region, sustaining 80% throughput for First In First Out (FIFO) RTC replication.”
AWS has periodically faced capacity and resilience issues in its aging US-EAST-1 region. The post-mortem of a 2020 outage there revealed a number of unexpected vulnerabilities in its architecture: when AWS added new capacity to the front-end of data streaming service AWS Kinesis, this led "all of the servers in the fleet to exceed the maximum number of threads allowed by an operating system configuration," AWS admitted in the detailed report.
“We will be moving to larger CPU and memory servers, reducing the total number of servers and, hence, threads required by each server to communicate across the fleet,” it added at the time, as a near-term fix. A separate 2021 incident there also pointed to capacity challenges, including "increased resource contention within the EBS subsystem responsible for coordinating EBS storage hosts".
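The arithmetic behind the 2020 Kinesis thread limit can be sketched in a few lines. This assumes, as a simplification, that each front-end server keeps one thread per peer in the fleet; the function names and the numbers in the usage note are hypothetical, not figures from AWS.

```python
def threads_per_server(fleet_size):
    """Threads one server needs if it keeps one thread per peer (assumption)."""
    return max(fleet_size - 1, 0)


def exceeds_thread_limit(fleet_size, max_threads):
    """True if the per-server thread count passes the OS-configured limit."""
    return threads_per_server(fleet_size) > max_threads
```

Under that assumption, a fleet of 1,000 servers stays under a limit of 1,024 threads, while growing it to 1,100 pushes every server over the limit at once. That is why moving to fewer, larger servers reduces the thread count each one needs.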