Poor Knight Capital. A software release in the summer of 2012 went badly, and the result was a nearly immediate $460 million loss. Liquidity became an issue, and the firm was acquired shortly thereafter.
For the next year, it seemed like every vendor with a tool that helps with software deployment, testing, configuration management, or monitoring used Knight as an example. The problem was that, other than knowing that a new software release resulted in massive numbers of trades and a lot of lost money, we didn’t really know what went wrong. I was certainly guilty of this. After all, a release went spectacularly wrong, and I work on deploy and release software. As we’ll see, I can breathe a sigh of relief: better deployment and release tools would have helped, but there’s much more to the story.
Now, thanks to this report (http://www.sec.gov/litigation/admin/2013/34-70694.pdf) from the Securities and Exchange Commission, we have a decent picture of what went wrong. The SEC also announced that it fined Knight another $12 million for poor controls, bringing the total bill to $472 million. Knight has suffered enough, and I don’t want to pick on them. I do want to look at this report and see what we can learn from it and apply to our own businesses and release operations.
We must acknowledge that the failure was of a complex system that had some safeguards in place to prevent failure. The Knight team looks to have done some good things. As Richard Cook describes in his essay and presentation “How Complex Systems Fail,” these types of failures require multiple smaller failures.* So here’s what went wrong.
I see dead code
In section B.13 of the SEC report, we find the following:
13. Upon deployment, the new RLP code in SMARS was intended to replace unused code in the relevant portion of the order router. This unused code previously had been used for functionality called “Power Peg,” which Knight had discontinued using many years earlier. Despite the lack of use, the Power Peg functionality remained present and callable at the time of the RLP deployment. The new RLP code also repurposed a flag that was formerly used to activate the Power Peg code. Knight intended to delete the Power Peg code so that when this flag was set to “yes,” the new RLP functionality—rather than Power Peg—would be engaged.
In section 14, the report clarifies that “many years earlier” means in 2003. So functionality that was no longer needed sat around for nearly a decade. The “Power Peg” became a powder keg ready to blow and was primed in 2005 when controls over how many orders to process were refactored elsewhere.
The SEC complains about a lack of written policies for testing unused code that remains callable. That strikes me as terrible guidance. The solution is simpler: delete dead code. You have source control and can get it back if you ever need it. Tools ranging from test coverage to code reviews to runtime analytics can help surface dead code. Trash it before it trashes you.
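As a concrete example of the runtime-analytics approach, here’s a minimal Python sketch of a “tombstone”: a decorator (all names here are hypothetical, not from Knight’s system) that logs whenever supposedly dead code is actually called. If the log stays silent for a full business cycle, you can delete the function with some confidence.

```python
import functools
import logging

log = logging.getLogger("tombstone")

def tombstone(func):
    """Mark a function we believe is dead. If this ever logs,
    the code is NOT dead and must be investigated before deletion."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.warning("tombstone hit: %s.%s is still being called",
                    func.__module__, func.__qualname__)
        return func(*args, **kwargs)
    return wrapper

@tombstone
def power_peg_order(order):
    # Hypothetical stand-in for the retired code path.
    ...
```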
Hey, there’s a flag we can borrow
Did you notice how a flag was repurposed in the quote above? This smells of rushing and/or laziness, but we don’t know why it was repurposed, only that it was. Basically, you have one sub-system creating a request with a flag that used to route to the dead Power Peg code. Assuming the server handling the request has been upgraded, the hand-off is valid. If a server still running the old code gets called, the Power Peg code will be activated and some unknown behavior will be triggered by the zombie code.
If a new flag had been created in the request for the new code, requests reaching a server that had not been upgraded would simply have errored out. Knight might have failed to process some orders and suffered some embarrassment and minor losses, but the odds of catastrophe would have been vanishingly small.
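Here’s a minimal sketch of that alternative in Python, with hypothetical names. The point is that new functionality gets a new flag value, and anything unrecognized is rejected loudly rather than routed to whatever code happens to answer to the old flag.

```python
from enum import Enum

class RoutingFlag(Enum):
    """Flags this server version understands. A POWER_PEG value was
    removed when the feature was retired; RLP gets a NEW value."""
    STANDARD = "standard"
    RLP = "rlp"

def parse_routing_flag(raw: str) -> RoutingFlag:
    try:
        return RoutingFlag(raw)
    except ValueError:
        # An old server receiving "rlp" lands here and rejects the
        # order loudly, instead of silently waking retired code.
        raise ValueError(f"unknown routing flag {raw!r}; rejecting order")
```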
87.5% Correct is a Failing Grade
Neither dead code nor reused flags are a problem if the code is deployed correctly. The good news is that it was deployed correctly on seven servers. The bad news is in section 15 of the report:
During the deployment of the new code, however, one of Knight’s technicians did not copy the new code to one of the eight SMARS computer servers. Knight did not have a second technician review this deployment and no one at Knight realized that the Power Peg code had not been removed from the eighth server, nor the new RLP code added.
With the Power Peg code activated on the eighth server, and the order controls that once limited it refactored into another part of the code base, huge numbers of orders flowed.
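A post-deployment verification step could have caught the eighth server. Here’s a rough Python sketch, assuming hypothetical hostnames and artifact paths, that compares the hash of the deployed artifact on every server against the expected release hash. Run it as the last step of the deploy and fail the release if any host mismatches.

```python
import subprocess

EXPECTED_SHA = "3f6c..."  # hash of this release's artifact (hypothetical)
SERVERS = [f"smars{i:02d}" for i in range(1, 9)]   # all eight hosts
ARTIFACT = "/opt/smars/current/smars.jar"          # hypothetical path

def verify_deployment() -> bool:
    ok = True
    for host in SERVERS:
        # Ask each server for the hash of what is actually on disk.
        out = subprocess.run(
            ["ssh", host, "sha256sum", ARTIFACT],
            capture_output=True, text=True, check=True,
        ).stdout
        actual = out.split()[0]
        if actual != EXPECTED_SHA:
            print(f"MISMATCH on {host}: found {actual}")
            ok = False
    return ok

if __name__ == "__main__":
    raise SystemExit(0 if verify_deployment() else 1)
```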
Dousing a fire with gasoline
Alerts fired that Knight was spending more money in its order processing account than it should have been, and the problem was quickly traced to the problematic sub-system. That worked, although the SEC is fair in its suggestion that the monitoring system should have been able to automatically shut down trading.
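Here’s a minimal sketch of what such an automated shut-down might look like, with hypothetical thresholds and a hypothetical halt_trading hook: a circuit breaker that stops the order flow itself instead of only paging a human.

```python
import time

class SpendCircuitBreaker:
    """Halt trading automatically when outflow exceeds a hard limit.
    The limit and the halt hook are hypothetical placeholders."""

    def __init__(self, max_outflow_per_minute: float, halt_trading):
        self.limit = max_outflow_per_minute
        self.halt_trading = halt_trading  # callback that stops the order router
        self.window_start = time.monotonic()
        self.outflow = 0.0

    def record_fill(self, cost: float) -> None:
        now = time.monotonic()
        if now - self.window_start >= 60:
            # Start a fresh one-minute window.
            self.window_start, self.outflow = now, 0.0
        self.outflow += cost
        if self.outflow > self.limit:
            # Don't just alert a human -- stop the bleeding first.
            self.halt_trading(
                reason=f"outflow {self.outflow:,.0f} exceeded per-minute limit")
```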
While the SMARS order placing component was identified as the trouble-maker, the response was problematic. They rolled back the new code, putting the old version of the component back out there. This is a totally natural response: we released a new version of SMARS, and SMARS is acting very badly, so go back to an old version of SMARS to stabilize.
Had the team had visibility into which versions of each component had been deployed, with a red mark identifying the missed deployment on one server, they likely wouldn’t have gone that route.
Further, what had been released was not just an update to a single component (SMARS); the sub-systems calling into it and recycling the old flag should also have been rolled back. They were not. The result was a system with a collection of component versions that had never been tested together. With their powers combined, they failed spectacularly. The SEC gently describes this in section 27:
This action worsened the problem, causing additional incoming parent orders to activate the Power Peg code that was present on those servers, similar to what had already occurred on the eighth server.
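One safeguard against this failure mode is a release manifest: record which combinations of component versions were actually tested together, and treat any other mix as untested and unsafe. A toy Python sketch, with hypothetical component names and version numbers:

```python
# Hypothetical release manifest: the only combinations of component
# versions that have actually been tested together as a set.
TESTED_COMBINATIONS = [
    {"smars": "7.3", "order_gateway": "2.1"},   # previous release
    {"smars": "8.0", "order_gateway": "2.2"},   # new RLP release
]

def is_tested_combination(deployed: dict[str, str]) -> bool:
    return deployed in TESTED_COMBINATIONS

# Rolling back SMARS alone yields {"smars": "7.3", "order_gateway": "2.2"},
# a combination no one ever tested -- exactly Knight's situation.
assert not is_tested_combination({"smars": "7.3", "order_gateway": "2.2"})
```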
I started by saying that Knight Capital did a lot of things correctly. The report details places where checks validated order counts; they were there, but mostly on the front end of the system rather than the back. We also have no reason to believe that Knight failed to test the code. There’s no indication that the behavior was incorrect when the proper versions of each component were running together. Expense monitoring detected the flow of money out of the account the trading system used and alerted people, who quickly found the component that was to blame.
But in a complex system, a series of small errors can result in catastrophic failure. We can look at what happened and say:
- If only the dead code had been detected and removed, everything would have been fine
- If only the flag hadn’t been repurposed, everything would have been fine
- If only the deployment had pushed the new component to all eight servers, everything would have been fine
- If only the monitoring had been able to shut down trading, the losses would have been minimal
- If only the rollback had covered the full system, not just a component, much less money would have been lost
To release code changes frequently and with confidence, we need to build a chain of processes and tools that protect us. There will always be errors. Servers, tools, and people will fail. We build complex systems so that things go mostly OK even when there are failures. The more safeguards, especially automated safeguards, that are in place, the less fragile our systems are and the lower the likelihood of catastrophe we face.
* You should also read the chapter Richard co-authored with Allspaw in the “Web Operations” book – still my favorite DevOps book.