Comment Re:Not the engineers fault (Score 1) 383
The Therac-25 incidents happened partly because earlier models had hardware interlocks and the updated model did not. However, a simple "don't kill the patient" interlock would not have worked. The basic problem is that the same machine delivered both e-beam and X-ray doses, and you get X-rays by hitting a target with an e-beam of much, much greater power. The target absorbs the e-beam and emits a much weaker X-ray beam. If I remember what I read about this incident correctly, all of the incidents were some form of "we wanted X-rays, but the target was rotated out as if we wanted an e-beam, so the entire e-beam was applied to the person instead of the X-ray target". In standard X-ray operation (which was by far the majority of the doses that were requested), the beam had to run at high power in most cases. Since this beam was more than strong enough to kill anyone if the target was improperly placed, almost every single treatment would involve someone overriding a "don't kill the patient" safeguard. A safeguard like that is just begging to be bypassed each and every time without thinking about it.
The fact that there were several bugs that led to similar results with no backup is the major issue. There are various ways to fix this issue, including hardware interlocks, actual software review, and exhaustive test methodology (including designing the software so that it can be tested exhaustively). In the end, they cut corners and this killed patients. They reduced the cost by removing the "extraneous" hardware interlocks found on the Therac-20 model, because they didn't realize that those interlocks were actually tripping and saving lives. They reduced the cost by hiring programmers who clearly did not understand proper code design and by reusing old code that depended on the interlocks. They reduced the cost by not requiring exhaustive testing, or code designed to support exhaustive testing. In particular, the hardware interlocks were not simple "low power or else" checks, but more complicated checks on which power levels were valid for which other settings. That is more complicated than a simple "don't go to high power without authorization" check, and thus more expensive.
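To make the difference between those two kinds of interlock concrete, here is a rough sketch of the logic in Python. The power levels, function names and the three example treatments are all made up for illustration; the real interlocks were hardware, not code, and I don't know their actual thresholds.

```python
# Hypothetical power levels in arbitrary units, not real Therac figures.
LOW, HIGH = 1, 100

def threshold_interlock(power):
    """Simple 'low power or else' check: refuses any high-power request."""
    return power <= LOW

def combination_interlock(power, target_in_place):
    """Allow high power only when the X-ray target is in the beam path."""
    return power <= LOW or target_in_place

# (power, target_in_place) for three illustrative situations.
treatments = {
    "normal X-ray": (HIGH, True),
    "normal E-beam": (LOW, False),
    "high power, no target": (HIGH, False),
}

for name, (power, target) in treatments.items():
    print(name, threshold_interlock(power), combination_interlock(power, target))
```

The threshold check refuses every normal X-ray treatment, so it would have to be overridden constantly. The combination check only refuses the one case that actually hurts someone.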
I can remember two examples of errors that caused problems. One of the incidents involved an 8-bit integer that was incremented in a continuous loop each time the machine was checked and found not ready. This integer was part of what checked to see if the target was in place. So a testing procedure where you make a slight mistake, fix that mistake, but then forget to rotate the target back in would be stopped by this check... 255 out of 256 times. The other 1 time out of 256, the integer had just rolled over and the check gave an incorrect output. Someone lost that game of Russian roulette.
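The rollover is easy to reproduce. This is only a sketch of the pattern as I described it, not the actual Therac-25 code: the function names are invented, and Python has no 8-bit integers, so the `& 0xFF` mask stands in for the one-byte variable.

```python
def passes_where_check_skipped(passes):
    """Return the loop passes on which the target check was silently skipped."""
    flag = 0  # hypothetical one-byte flag; 0 means "nothing to check"
    skipped = []
    for n in range(1, passes + 1):
        flag = (flag + 1) & 0xFF  # incremented on every pass instead of being set
        if flag == 0:             # rolled over to 0, so the check does not run
            skipped.append(n)
    return skipped

def passes_where_check_skipped_fixed(passes):
    """Same loop, but the flag is set to a fixed nonzero value."""
    flag = 0
    skipped = []
    for n in range(1, passes + 1):
        flag = 1                  # can never wrap around to 0
        if flag == 0:
            skipped.append(n)
    return skipped

print(passes_where_check_skipped(600))        # [256, 512]
print(passes_where_check_skipped_fixed(600))  # []
```

Every 256th pass through the loop, the incrementing version behaves as if there were nothing to check. Setting the flag to a constant instead of incrementing it removes the window entirely.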
Another of the incidents involved fast data entry. You enter the dosage as if you were going to give the patient an X-ray beam (which was much more common than E-beam treatments and had become a habit for some operators), and hit enter at the bottom of the setup form. This starts the beam strength calibration. You then realize you really wanted an E-beam of the same strength for this patient, so you go back to the top, change one entry from X-ray to E-beam, and fly through hitting enter on the rest of the form to reach the bottom in under 8 seconds. The beam strength calibration finishes 8 seconds after you hit enter the first time, exits its loop, and checks to make sure the form is still properly filled out (which by now it is). Then it removes the target because you asked for E-beam. Because it only checks the single "form properly filled out" variable and never re-checks the power setting (originally set for X-rays) against the E-beam/X-ray selection, it is inherently dangerous. This was fixed by the company's infamous "remove the up key on the keyboard" hack, which forced people to take more than 8 seconds to fill out the form again.
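Here is a small Python sketch of that sequence, with the 8-second window reduced to "the form as it was when calibration started" versus "the form as it is when calibration ends". Everything in it (the `Form` class, `power_for`, the two-level power setting) is made up to illustrate the pattern; it is not how the real software was structured.

```python
from dataclasses import dataclass

@dataclass
class Form:
    mode: str       # "xray" or "ebeam"
    complete: bool  # the single "form properly filled out" flag

def power_for(mode):
    # Hypothetical two-level setting: X-rays need the high-power beam.
    return "high" if mode == "xray" else "low"

def setup_flag_only(form_at_start, form_at_end):
    power = power_for(form_at_start.mode)   # calibrated from the first entry
    if not form_at_end.complete:            # only the flag is re-checked
        return None
    target_in = form_at_end.mode == "xray"  # target follows the edited form
    return power, target_in

def setup_with_recheck(form_at_start, form_at_end):
    result = setup_flag_only(form_at_start, form_at_end)
    if result is None:
        return None
    power, target_in = result
    if power != power_for(form_at_end.mode):  # stale calibration: refuse
        return None
    return result

first = Form("xray", True)    # what the operator typed first
edited = Form("ebeam", True)  # what the form says 8 seconds later
print(setup_flag_only(first, edited))     # ('high', False)
print(setup_with_recheck(first, edited))  # None
```

The flag-only version ends up with high power and no target, which is exactly the lethal combination. Re-validating the power against the final form catches it no matter how fast the operator types, which is why removing a key is a hack and not a fix.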
While I'm more of a hardware engineer than a software one, even I can see that neither of these errors should have been made by anyone who knew what the heck they were doing. The fact that they were not reviewed exhaustively before going into a product as potentially dangerous as a radiation treatment machine is... well, a case study in how to do things wrong.