Google Cloud services fail for the second time in 4 days
Load balancing configuration change blamed, as major sites return 404s
Google Cloud has suffered its second severe outage in less than a week, with thousands of major customers hit late Tuesday by what the hyperscaler described as a cloud networking issue. The disruption lasted around two hours (18:10 – 20:08 BST) and left millions struggling to reach web pages.
The issue appears to have stemmed from a configuration change to Google Cloud's load balancing services, which support "advanced traffic management capabilities" for customers. The failure briefly knocked major sites including Discord, EA, Etsy and Snapchat offline.
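External proxy load balancers of this kind route each request by matching its host and path against configured rules and forwarding it to a backend service; if a bad configuration leaves a request with no matching rule and no usable default, the proxy can answer with a 404 instead of forwarding it. The sketch below is a deliberately simplified, hypothetical model of that behaviour, not Google's implementation, but it shows how a single configuration push can turn every request into a 404.

```python
# Toy model of URL-map routing in an external proxy load balancer.
# Purely illustrative: names and structures are hypothetical, not GCP internals.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class UrlMap:
    # path prefix -> backend service name
    path_rules: dict = field(default_factory=dict)
    default_service: Optional[str] = None

    def route(self, path: str):
        """Return an (HTTP status, backend) pair for a request path."""
        for prefix, backend in self.path_rules.items():
            if path.startswith(prefix):
                return 200, backend
        if self.default_service:
            return 200, self.default_service
        return 404, "no matching rule"   # the symptom customers saw

# A healthy configuration routes traffic as expected.
good = UrlMap(path_rules={"/api": "api-backend"}, default_service="web-backend")
print(good.route("/api/v1/users"))   # -> (200, 'api-backend')
print(good.route("/index.html"))     # -> (200, 'web-backend')

# A configuration push that drops the rules leaves nothing but 404s.
bad = UrlMap()
print(bad.route("/index.html"))      # -> (404, 'no matching rule')
```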
The incident comes after a November 12 Google Cloud outage blamed on “an issue with… infrastructure components” that appeared to affect primarily European customers. That also lasted just over two hours.
“Customers impacted by the issue may have encountered 404 errors when accessing web pages served by the Google External Proxy Load Balancer between 09:35 and 10:10 US/Pacific”, Google Cloud said on November 16.
“Customer impact from 10:10 to 11:28 US/Pacific was configuration changes to External Proxy Load Balancers not taking effect. As of 11:28 US/Pacific configuration pushes resumed. Google Cloud Run, Google App Engine, Google Cloud Functions, and Apigee were also impacted. We will publish an analysis of this incident, once we have completed our internal investigation”, it said of the November 16 Google Cloud outage.
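For customers trying to square their own impact against that window, a simple probe that records response codes over time is often the quickest evidence. Below is a minimal sketch using only the Python standard library; the URL is a placeholder and should point at a page served through the affected load balancer.

```python
# Minimal availability probe: logs the HTTP status of an endpoint once a minute.
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone

URL = "https://example.com/"   # placeholder: a page behind your load balancer

def probe(url: str) -> int:
    """Return the HTTP status code for one request, or -1 on a network failure."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:   # non-2xx responses, including 404s
        return exc.code
    except urllib.error.URLError:
        return -1

if __name__ == "__main__":
    while True:
        status = probe(URL)
        print(f"{datetime.now(timezone.utc).isoformat()} {URL} -> {status}")
        time.sleep(60)   # one sample per minute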
The incident comes on the same day the UK government published its response to a consultation on security and resilience risks related to Managed Service Providers (MSPs) and Cloud Service Providers (CSPs).
See also: Gov’t hints at tighter security requirements for MSPs
That report noted that “Responses to the Call for Views have highlighted a systemic dependence on a group of the most critical providers which carry a level of risk that needs to be managed proactively.”
GCP customers meanwhile will be closely checking their SLAs.
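Roughly two hours of downtime makes a visible dent in a monthly uptime figure, which is what most cloud SLA credits key off. The arithmetic below is illustrative only: the credit tiers are hypothetical placeholders, not Google Cloud's published SLA terms, so affected customers should check the terms for each product they use.

```python
# Back-of-the-envelope SLA arithmetic for a roughly 2-hour disruption.
# The credit tiers below are hypothetical placeholders, not GCP's published SLA.

HOURS_IN_MONTH = 30 * 24          # 720 hours in a 30-day month
outage_hours = 118 / 60           # 18:10-20:08 BST is roughly 1 hour 58 minutes

uptime_pct = 100 * (1 - outage_hours / HOURS_IN_MONTH)
print(f"Monthly uptime: {uptime_pct:.3f}%")   # about 99.727%

# (uptime floor, credit %) pairs, highest floor first -- illustrative only.
credit_tiers = [(99.99, 0), (99.0, 10), (95.0, 25), (0.0, 50)]
credit = next(c for floor, c in credit_tiers if uptime_pct >= floor)
print(f"Illustrative service credit: {credit}%")   # 10% under these made-up tiers
```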
Software updates pushed to production are regularly to blame for this kind of issue.
Slack, AWS, Azure, Fastly and Facebook have all faced sweeping outages this year.
Fastly blamed a “software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances” for its June incident. Facebook's outage was caused by a botched configuration change to its BGP peering routers, while Slack's September outage was traced to a DNS configuration change. Azure blamed a March outage that took down Teams, Office 365, Xbox Live and other services for over two hours on an “error [that] occurred in the rotation of [cryptographic] keys”, noting that “a particular key was marked as ‘retain’ for longer than normal to support a complex cross-cloud migration. This exposed a bug where the automation incorrectly ignored that ‘retain’ state, leading it to remove that particular key.”
AWS customers meanwhile are still awaiting a post-incident write-up on the cause of a sustained outage in September 2021 that also left major customers offline and took down smart home appliances.