NYSE left with tough controls questions after disaster recovery error

4,341 trades cancelled. But why does DR system need manual shutdown?

The New York Stock Exchange (NYSE) has been left with tough questions to answer after “a manual error involving the Exchange’s Disaster Recovery configuration” resulted in billions-worth of trades being cancelled.

The NYSE initially blamed a “systems issue” for its failure to conduct opening auctions in a subset of its listed securities on January 24. The incident caused huge volatility and wild price swings, then triggered a temporary freeze of trading in the stocks of companies including Exxon, Mastercard, and McDonald’s.

It later attributed the incident (described by Dennis Dick, founder of Triple D Trading, as “a real mess…I’ve traded for 22 years, and I’ve never seen the opening cross at NYSE go haywire”) to a manual error in its disaster recovery (DR) configuration, raising real questions about its broader operational resilience and controls.

NYSE trades cancelled after manual disaster recovery shutdown error

Fleshing out the NYSE’s own description of the cause, Bloomberg reported that an NYSE employee failed to shut down a DR system at the exchange’s secondary data centre in Chicago. Because that system was left running overnight, NYSE trading systems treated Tuesday’s trading as a continuation of Monday’s session.

That meant the exchange – in the NYSE’s own words – saw “continuous trading in 2,824 of 3,421 NYSE-listed securities without attempting to conduct an opening auction due to a technical issue, following which approximately 84 of these impacted symbols entered Limit Up-Limit Down (“LULD”) pauses.”

As a result, NYSE said: “Approximately 4,341 trades in 251 symbols should be busted” (that is, cancelled).

(The incident came three days after another issue on the exchange that NYSE attributed to “failed hardware”. This successfully failed over to a backup but caused open orders in symbols DOCU to FSMD to be cancelled.)

“Such events are extremely rare, and we are thoroughly examining the day’s activity to assure the highest level of resilience in our systems,” NYSE Chief Operating Officer Michael Blaugrund said in a statement.

A full post-mortem, please...

Former Amazon VP Al Lindsay noted: “I’d love to see the COE (correction of error) on this one.

“Amazon uses a COE process to uncover why something failed, and ensure it never happens again. This starts with the Five Whys. I wonder why it requires a human to stop a process before regular daily business processes can begin? Why isn’t this automated? And when that process is still running when it shouldn’t be (which is on a fixed schedule easily known in advance) why isn’t there an alarm, handled by an oncall who has a play book and escalation paths to quickly resolve this easily detectable problem?” he asked in a LinkedIn post.

“When you run mission critical systems you need to have a structured approach to operational excellence whereby you establish monitors, alarms, procedures, escalation paths, and automated recovery. You need to regularly test all of the above with simulations of large scale events, and routinely evaluate and revisit what you monitor and what your thresholds are to avoid silent failures due to stale alarms. Your team members need training to know how to handle events. And when bad stuff does happen you need a COE process to understand what happened, and create and track corrective actions to ensure it never happens again. The COE also serves as an effective way to communicate all this to senior leadership and affected partners.”
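The kind of alarm Lindsay describes can be illustrated with a short sketch. This is not how NYSE systems work, and nothing here comes from the exchange: the `SystemStatus` type, the `pre_open_check` function, the system names, and the 09:00 cutoff are all hypothetical, chosen only to show a scheduled check that flags a DR process still running when it should have been shut down.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class SystemStatus:
    """Hypothetical snapshot of one system's state."""
    name: str
    running: bool


def pre_open_check(statuses, now, cutoff=time(9, 0)):
    """Return alert messages for systems still running at or after the cutoff.

    The cutoff is an assumed time by which every DR system should be down.
    """
    if now.time() < cutoff:
        # Too early to judge: the shutdown window has not closed yet.
        return []
    return [
        f"ALERT: {s.name} still running at {now:%H:%M}; escalate to on-call"
        for s in statuses
        if s.running
    ]


# Example with made-up system names and an arbitrary date.
statuses = [SystemStatus("dr-secondary", True), SystemStatus("dr-archive", False)]
for message in pre_open_check(statuses, datetime(2023, 1, 24, 9, 5)):
    print(message)
```

In practice a check like this would be wired to a pager and an escalation path rather than a print statement, and, as Lindsay argues, the shutdown itself would ideally be automated so the alarm only fires when that automation fails.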

NYSE President Stacey Cunningham has previously reflected on the need for more automation to reduce risk, in late 2021 noting: “We process 330 billion messages on a busy day. Market makers are setting the prices for securities when markets can move so quickly [so managing] risk becomes an important part of that. That comes back to that observability, what are we seeing in the data… you need to have automated responses when you’re talking about that kind of scale, so that you can manage risk in an automated fashion.”
