Nationwide Aussie telco outage cause "too technical" to explain: The answer may be in a (heavily redacted) Canadian report
How not to share a root cause analysis: Lessons from Australia's Optus and Canada's Rogers...
The CEO of Australian telco Optus – which managed to simultaneously knock out both broadband and mobile services for five hours – has declined to comment on the Optus outage cause, saying it is too “technical” to explain, even as Optus sources told local press it was “routing-related.”
Optus CEO Kelly Bayer Rosmarin told the Australian Financial Review that a “technical network fault” caused the outage but would not specify what exactly it was, or how it occurred: “It’s a very technical explanation for what happened. There is no soundbite that is going to do it justice.”
Optus, owned by Singapore Telecommunications, earlier said engineers were investigating a “network fault” after the nationwide outage of Australia’s second-largest telco, which lasted for five hours, hit millions relying on its networks, and prevented large numbers of enterprise users from logging in to apps dependent on 2FA codes shared via the telco for authentication.
See also: Supplier hack had “scope to impact entire telco industry”: Vodafone
Asked if it was a cybersecurity-related incident, Bayer Rosmarin earlier said there was “no indication that it is anything to do with spyware” – a peculiarly specific choice of phrasing that may be intentional or may just reflect low levels of executive cyber-literacy around the risk landscape.
Early speculation by informed experts suggests that a BGP routing howler may have been to blame but details remain thin on the ground for now.
Optus outage cause: Lessons from Rogers?
Doug Madory, Director of Internet Analysis at network observability platform Kentik, said that “thus far” the incident looks similar to a nationwide multi-service outage of Canadian telco Rogers in July 2022.
Rogers described the cause of that incident in a letter to regulators as follows: “A specific coding was introduced [causing a] routing configuration change to three Distribution Routers in our common core network.”
“Unfortunately, the configuration change deleted a routing filter and allowed for all possible routes to the Internet to be distributed; the routers then propagated abnormally high volumes of routes throughout the core network. Certain network routing equipment became flooded, exceeded their memory and processing capacity and were then unable to route and process traffic, causing the common core network to shut down. As a result, the Rogers network lost connectivity internally and to the Internet for all incoming and outgoing traffic for both the wireless and wireline networks for our consumer and business customers.”
Unfortunately, and unhelpfully, the company redacted the technical details.
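Even so, the failure mode Rogers describes – a deleted route filter letting “all possible routes to the Internet” pour into core routers until they exhaust memory – is easy to picture. The toy Python sketch below is purely illustrative (the class names, route counts and thresholds are invented, not anything from Rogers’ or Optus’ networks): it contrasts a router with no prefix limit against one that tears down the session once a configured maximum is exceeded.

```python
# Toy illustration only: invented names and numbers, not Rogers' or Optus' config.
# A full IPv4 Internet table is roughly 900k+ routes; an internal filter normally
# keeps only a small subset of those routes inside a telco's core.

class CoreRouter:
    def __init__(self, name, max_prefixes=None):
        self.name = name
        self.max_prefixes = max_prefixes   # None = no safeguard (filter deleted)
        self.rib = set()                   # routing table
        self.capacity = 1_000_000          # stand-in for memory/processing limits

    def receive_routes(self, prefixes):
        for prefix in prefixes:
            if self.max_prefixes is not None and len(self.rib) >= self.max_prefixes:
                # Safeguard: drop the session rather than exhaust memory
                print(f"{self.name}: prefix limit reached, session torn down, router stays up")
                return
            self.rib.add(prefix)
            if len(self.rib) > self.capacity:
                print(f"{self.name}: table exceeded capacity, router down")
                return

# Stand-in for "all possible routes to the Internet" flooding the core
full_table = [f"prefix-{i}" for i in range(1_200_000)]

CoreRouter("distribution-1").receive_routes(full_table)                       # no filter or limit -> down
CoreRouter("distribution-2", max_prefixes=50_000).receive_routes(full_table)  # survives
```

Real networks get the same protection from BGP maximum-prefix limits and inbound route filters on the sessions feeding the core; the sketch simply shows why losing that filter turns a single configuration change into a network-wide failure.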
Madory added on X: “I see a lot of citations of this Cloudflare Radar page showing a spike in BGP announcements around the time of the outage, as if it was this spike that caused the outage. In fact, this is simply a natural consequence of AS4804 [Optus] withdrawing half of its routes.
“In this outage, Optus (AS4804) withdrew ~150 of the 271 BGP prefixes it normally announces. The withdrawal of a prefix triggers a flurry of messages as ASes search in vain for a route to replace a lost route. The more routes withdrawn, the larger the flurry of messages…”
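Madory’s numbers can be sanity-checked against public routing data. As a rough sketch – assuming RIPEstat’s public “announced-prefixes” data API, which is our choice of tool rather than anything Kentik or Optus referenced – a few lines of Python will count the prefixes AS4804 is originating at a given moment:

```python
# Rough sketch: count the prefixes AS4804 (Optus) announces, via RIPEstat's public
# "announced-prefixes" Data API. Endpoint and response fields per RIPEstat's docs;
# adjust if the API has changed.
import json
import urllib.request

ASN = "AS4804"  # Optus
URL = f"https://stat.ripe.net/data/announced-prefixes/data.json?resource={ASN}"

with urllib.request.urlopen(URL, timeout=30) as resp:
    data = json.load(resp)

prefixes = [entry["prefix"] for entry in data["data"]["prefixes"]]
print(f"{ASN} announces {len(prefixes)} prefixes")

# Comparing a snapshot from before the outage with one taken during it would show
# the ~150-prefix withdrawal Madory describes.
```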
OK, telco network folks, so how do you architect to massively reduce the risk of this kind of thing ever happening? Share the knowledge...