Comment Distilled from the Therac-25 article (Score 1) 383

"There is always another software bug."

Changes made by the operator, and confirmed back to the operator on the display, were not carried forward into the execution of the job itself. In addition, this problem was time dependent: had the same sequence of events been executed at a different (slower) rate, the resulting operation would have been different, which is not always easy to anticipate or test for.

There was an improper overlap of instruction input and execution. This was due to improper use of concurrency and shared variable space (keyword: atomic variables).
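One way to picture the fix for this kind of overlap is to have the execution side work only from a consistent snapshot of what the operator entered. The following is a minimal Python sketch, not the Therac-25 code: `Prescription`, `SharedPrescription`, the mode labels and the dose numbers are all hypothetical names and values made up for illustration.

```python
import threading
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Prescription:
    """Hypothetical operator entry; immutable so a snapshot cannot change later."""
    mode: str  # "xray" or "electron" (illustrative labels)
    dose: int  # arbitrary units


class SharedPrescription:
    """Holds the operator's current entry; edits and reads share one lock."""

    def __init__(self, initial: Prescription) -> None:
        self._lock = threading.Lock()
        self._current = initial

    def edit(self, **changes) -> None:
        # Swap in a whole new record under the lock, so a reader never
        # sees a half-applied edit (new mode paired with the old dose).
        with self._lock:
            self._current = replace(self._current, **changes)

    def snapshot(self) -> Prescription:
        # The execution task works from this immutable copy only.
        with self._lock:
            return self._current


shared = SharedPrescription(Prescription(mode="xray", dose=200))
shared.edit(mode="electron", dose=20)  # operator corrects the entry
job = shared.snapshot()                # execution takes one consistent copy
shared.edit(dose=999)                  # a later edit cannot alter the running job
```

The point of the design is that what the display confirmed and what the job executes come from the same record, regardless of how fast the operator types.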

Failures became lethal because permissive hardware and software allowed nonsensical operations to proceed (a failure to provide application-specific, common-sense hardware and software interlocks).
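A software interlock in this sense is just a check that refuses to proceed when the requested combination makes no sense. Here is a rough Python sketch under my own assumptions: the function name, the parameters and the single rule (high beam current only with the X-ray target in place) are simplifications for illustration, not the machine's actual logic, and a software check like this complements a hardware interlock rather than replacing it.

```python
class InterlockError(Exception):
    """Raised when a requested configuration must not be executed."""


def check_interlock(mode: str, target_in_place: bool, high_current: bool) -> None:
    """Refuse nonsensical setups before anything is energized (hypothetical rule set)."""
    if mode not in ("xray", "electron"):
        raise InterlockError(f"unknown mode: {mode!r}")
    # Assumed rule: high beam current is only acceptable in X-ray mode
    # with the target physically in the beam path.
    if high_current and not (mode == "xray" and target_in_place):
        raise InterlockError(
            "high beam current requested without the X-ray target in place"
        )


check_interlock("xray", target_in_place=True, high_current=True)        # allowed
check_interlock("electron", target_in_place=False, high_current=False)  # allowed
```

Anything else raises `InterlockError`, which the caller has to treat as a hard stop rather than a code the operator can dismiss.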

Frequent non-critical error codes trained staff to ignore failures. Nonsensical error codes made it difficult for staff to follow up on failures. Misclassified error codes (combined with nonsensical error codes) permitted staff to proceed even in the cases where proceeding could cause harm.

Those writing software, because they are dealing with logic, can fall into a trap of overconfidence brought about by the binary ideology represented by the statement: (!wrong == right) && (!right == wrong). In reality, this can be a dangerous assumption.

(256 - 1) + 1 = 0 in eight bits. It may be inappropriate to increment a flag value without checking for special cases BEFOREHAND (especially in a concurrent environment).
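To make the wraparound concrete, here is a small Python sketch that emulates an eight-bit value by masking. The function names are mine, and the saturating version is only one possible way of checking for the special case first; simply assigning a fixed nonzero value to the flag would be another.

```python
def increment_u8(value: int) -> int:
    """Increment as an unsigned 8-bit register would: 255 wraps to 0."""
    return (value + 1) & 0xFF


def increment_u8_saturating(value: int) -> int:
    """Check the special case beforehand: stay at 255 instead of wrapping."""
    return value if value >= 0xFF else value + 1


flag = 255
wrapped = increment_u8(flag)             # a "set" flag now reads as zero
clamped = increment_u8_saturating(flag)  # the flag stays nonzero
```

If zero means "no check needed", the wrapped value silently skips the check once every 256 increments.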

It was assumed that this design was as reliable as past designs despite the removal of certain hardware interlocks. It is therefore presumed that the modern designers did not have any statistics on how often the hardware interlocks actually operated. If they'd known how important they were to the safety record of the previous machines, perhaps they would have considered them a higher priority in the design.

On the fourth page of the Therac_25 article (http://courses.cs.vt.edu/cs3604/lib/Therac_25/Therac_4.html), one of the letters AECL sent to the users group, dated 7/6/1987, mentioned an improvement to the software that caught my eye because I don't understand why it is an improvement. Quote: "preventing copying of the control program on site". Was this to prevent unexpected alterations, or was it to cover AECL in case of further liability?

It seems to me there is a conflict of interest between software infrastructure in life-critical environments (i.e. where lack of fail-safe is unforgiving) and the idea of a company's right to the privacy of its intellectual property. It is in our interests both individually (we as patients who wish to avoid fatal radiation burns) and as companies (our products sell better and we are less liable because many eyes have brought our errors to our attention) for this type of embedded software to be open to public scrutiny. In terms of competition, it does not seem to me that there is much risk of other companies using this IP to their advantage against AECL, because the specifics of the hardware are what dictate the details of the software solution.

In terms of liability, it sucks to live in a world where we have to hide our ignorance, mistakes, and missteps from others in order to temporarily maintain our reputations instead of presenting them for review so that we can learn from what others have already learned. I'd hate to die of a problem that you have already solved.
