About that flurry of HTTP 500/nginx errors...
Brief Cloudflare outage sets teeth on edge...
IT folks facing a brief horror show of endless HTTP 500/nginx errors this morning have a Cloudflare outage to thank. The network transit, proxy and security provider supports over 24 million websites, and Discord users, crypto traders and innumerable others briefly had a panic when everything dropped.
The Cloudflare outage was fixed in around an hour. The company is typically swift with a post-mortem; thus far users have just been told that "a critical P0 incident was declared at approximately 06:34AM UTC. Connectivity in Cloudflare’s network has been disrupted in broad regions. Customers attempting to reach Cloudflare sites in impacted regions will observe 500 errors. The incident impacts all data plane services in our network."
(A P0 incident is a critical, shit-has-gone-badly-south incident.)
A small handful of content distribution network (CDN) companies hold up vast swathes of the internet, and outages do periodically happen. A June 2021 Fastly outage, for example -- caused by a “service configuration” failure -- briefly knocked some of the internet’s biggest news and other sites offline, including the Financial Times, GitHub, and GOV.UK. A major Cloudflare outage in 2019 occurred when a new WAF rule triggered a bug that exhausted CPU on every core handling HTTP/HTTPS traffic across the Cloudflare network worldwide.
The regex-powered borkage brought down Cloudflare’s core proxying, CDN and WAF functionality.
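The 2019 incident is a reminder of how a single pathological regular expression can pin a CPU. The snippet below is not Cloudflare's WAF rule: it is a minimal illustration of catastrophic backtracking using the classic `(a+)+$` pattern, timed against inputs that almost match.

```python
import re
import time

# Classic catastrophically-backtracking pattern (NOT Cloudflare's WAF rule):
# the nested quantifiers force the engine to try exponentially many ways to
# split the run of "a"s once the final "$" fails to match.
PATHOLOGICAL = re.compile(r"(a+)+$")

for n in (10, 14, 18, 22):
    almost_match = "a" * n + "!"   # the trailing "!" guarantees a failed match
    start = time.perf_counter()
    PATHOLOGICAL.match(almost_match)
    elapsed = time.perf_counter() - start
    print(f"n={n:>2}  match time: {elapsed:.4f}s")

# Each extra character roughly doubles the work -- on a proxy evaluating the
# rule against every HTTP/HTTPS request, that is enough to exhaust every core.
```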
Cloudflare later explained the outage as follows: "We were rolling out a change to standardize the BGP communities we attach to a subset of the prefixes we advertise. Specifically, we were adding informational communities to our site-local prefixes. These prefixes allow our metals to communicate with each other, as well as connect to customer origins. As part of the change procedure at Cloudflare, a Change Request ticket was created, which includes a dry-run of the change, as well as a stepped rollout procedure. Before it was allowed to go out, it was also peer reviewed by multiple engineers. Unfortunately, in this case, the steps weren’t small enough to catch the error."
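Cloudflare has not published the exact policy diff here, but the shape of the problem -- a routing-policy edit that looked harmless in review and only bit on one class of prefixes -- is easy to sketch. The Python below is a hypothetical model, not Cloudflare's tooling: it tags prefixes, applies an "old" and a "new" advertisement policy, and diffs the results, the kind of check a dry run would use to flag any prefix that silently stops being advertised.

```python
from dataclasses import dataclass

# Hypothetical model of reviewing a routing-policy change; none of these
# names or values come from Cloudflare's configuration.

@dataclass(frozen=True)
class Prefix:
    cidr: str
    site_local: bool  # prefixes the "metals" use to reach each other / origins


def old_policy(p: Prefix) -> tuple[bool, set[str]]:
    """(advertise?, BGP communities) before the change."""
    return True, set()


def new_policy(p: Prefix) -> tuple[bool, set[str]]:
    """After the change: attach an informational community to site-local
    prefixes. The subtle mistake is modelled here as the advertise flag
    being dropped for exactly those prefixes."""
    if p.site_local:
        return False, {"65535:100"}   # bug: community added, advertisement lost
    return True, set()


def dry_run(prefixes, before, after):
    """Report prefixes whose advertisement state would change."""
    withdrawn = []
    for p in prefixes:
        adv_before, _ = before(p)
        adv_after, _ = after(p)
        if adv_before and not adv_after:
            withdrawn.append(p.cidr)
    return withdrawn


prefixes = [
    Prefix("198.51.100.0/24", site_local=False),
    Prefix("10.42.0.0/16", site_local=True),
]

lost = dry_run(prefixes, old_policy, new_policy)
if lost:
    print("DRY RUN FAILED - these prefixes would be withdrawn:", lost)
```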
"03:56 UTC: We deploy the change to our first location. None of our locations are impacted by the change, as these are using our older architecture.
06:17: The change is deployed to our busiest locations, but not the locations with the MCP architecture.
06:27: The rollout reached the MCP-enabled locations, and the change is deployed to our spines. This is when the incident started, as this swiftly took these 19 locations offline.
06:32: Internal Cloudflare incident declared.
06:51: First change made on a router to verify the root cause.
06:58: Root cause found and understood. Work begins to revert the problematic change.
07:42: The last of the reverts has been completed. This was delayed as network engineers walked over each other's changes, reverting the previous reverts, causing the problem to re-appear sporadically.
08:00: Incident closed."
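Cloudflare's own conclusion was that "the steps weren’t small enough to catch the error" before the change reached the MCP spines. A common mitigation, sketched below purely hypothetically (the location names and health check are placeholders, not Cloudflare infrastructure), is to gate each rollout step on a health check and halt on the first regression rather than continuing on to busier locations.

```python
import time

# Hypothetical staged-rollout gate; deploy/healthy/rollback are stubs.

ROLLOUT_STEPS = [
    ["canary-1"],                      # smallest, least critical first
    ["edge-a", "edge-b"],
    ["spine-1", "spine-2"],            # riskiest locations last
]


def deploy(location: str) -> None:
    print(f"deploying change to {location}")


def healthy(location: str) -> bool:
    """Placeholder: e.g. probe the data plane and check 5xx rates."""
    return True


def rollback(locations: list[str]) -> None:
    for loc in locations:
        print(f"reverting change on {loc}")


def staged_rollout() -> bool:
    done: list[str] = []
    for step in ROLLOUT_STEPS:
        for loc in step:
            deploy(loc)
            done.append(loc)
        time.sleep(1)  # soak time before judging the step
        if not all(healthy(loc) for loc in step):
            print("health check failed - halting rollout and reverting")
            rollback(list(reversed(done)))  # one ordered revert pass
            return False
    return True


if __name__ == "__main__":
    staged_rollout()
```

Running the revert as a single ordered pass also avoids the 07:42-style problem of engineers undoing each other's reverts.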
Full post-mortem is here.