As with telephone switches, you end up needing a large fraction of your code to monitor the state of the system and make sure everything's working. But you also don't get that kind of reliability by having systems that flake out for a few milliseconds - you end up with multiple systems in parallel, where you know the failure probabilities of the individual systems and build the thing to try to eliminate common-mode failures. So Box A, Box B, and Box C are in parallel, and each one has alarms that detect whether the others are down and the backup needs to take over for the primary. If Box A is down, Box B takes over while you fix Box A, and Box C is there to take over if Box B also fails (or was already down when Box A failed). Alternatively, you've got an A/B pair and a C/D pair, and if A fails, the C/D pair takes over while you fix A, reducing the risk that B will fail before you've done that.
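A rough sketch of the arithmetic and the takeover rule for the A/B/C case, assuming the boxes fail independently (which is what eliminating common-mode failures is trying to buy you). The function names, the fixed A-then-B-then-C priority, and the 1% figures are made up for illustration, not taken from any real system.

```python
from math import prod

def all_down_probability(unavailabilities):
    """Chance that every box is down at the same moment.

    Only valid if failures are independent; a common-mode failure
    (shared power, shared bug) breaks this assumption.
    """
    return prod(unavailabilities)

def pick_active(boxes_up, priority=("A", "B", "C")):
    """Return the first box in priority order that is up, or None."""
    for name in priority:
        if boxes_up.get(name, False):
            return name
    return None

# Hypothetical numbers: each box is down 1% of the time.
p = all_down_probability([0.01, 0.01, 0.01])   # about one in a million

# Box A is down, so Box B takes over; Box C is still in reserve.
active = pick_active({"A": False, "B": True, "C": True})
```

The multiplication is why you care so much about common-mode failures: one shared weakness and three boxes are only as good as one.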
We were part of the project because we're good at systems integration and government projects, but also because we had processor chips that did trig functions really, really fast (for late-80s definitions of fast). Turns out that the data was all coming from sensors with 12-bit A/D converters, and the fastest way to do trig functions on 12-bit values isn't a floating-point chip - it's a lookup table :-)
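A minimal sketch of the lookup-table idea: a 12-bit converter can only produce 4096 distinct codes, so you can compute every answer once up front and index into the table afterwards. The mapping of codes onto a full circle is an assumption for illustration (the real sensor scaling isn't given here), and `fast_sin` / `fast_cos` are hypothetical names.

```python
import math

BITS = 12
N = 1 << BITS  # 4096 possible A/D codes

# Assumption: code k stands for the angle 2*pi*k/N (one full turn).
SIN_TABLE = [math.sin(2 * math.pi * k / N) for k in range(N)]

def fast_sin(code):
    """Sine of a 12-bit angle code: one mask and one table read."""
    return SIN_TABLE[code & (N - 1)]

def fast_cos(code):
    """cos(x) = sin(x + quarter turn), so reuse the same table."""
    return SIN_TABLE[(code + N // 4) & (N - 1)]
```

On late-80s hardware the table would more likely hold fixed-point integers than floats, but the shape of the trick is the same.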