
Debug your Code, or Else!

Trevor Lovett writes "I ran across a collection of famous software bugs that have caused large-scale disasters, including the explosion of the Ariane 5 rocket due to integer overflow and the misfiring of a US Patriot missile that caused 28 deaths because of accumulated floating-point error."
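
The two failure modes in that summary are easy to sketch. The Ariane 5 bug was an unchecked conversion of a 64-bit floating-point value into a 16-bit signed integer; the flight code was Ada, where the out-of-range conversion raised an unhandled exception that shut the guidance unit down. A minimal illustration in C (the value is invented, and in C the unchecked conversion is undefined behavior rather than an exception):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* The horizontal-bias value grew with the rocket's velocity.
           Ariane 4 trajectories never exceeded the 16-bit range;
           Ariane 5's did. */
        double horizontal_bias = 40000.0;   /* outside int16_t's range */

        if (horizontal_bias > INT16_MAX || horizontal_bias < INT16_MIN) {
            fprintf(stderr, "conversion would overflow; handle the error\n");
            return 1;
        }
        /* Safe only after the range check above; without it, the cast
           is undefined behavior. */
        int16_t converted = (int16_t)horizontal_bias;
        printf("converted: %d\n", converted);
        return 0;
    }
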
  • by rosewood ( 99925 ) <<ur.tahc> <ta> <doowesor>> on Thursday May 02, 2002 @01:17PM (#3451818) Homepage Journal
    Remember that time when that kid dialed into NORAD and used that security exploit to get into the Thermo-nuclear war simulator and everyone thought it was real until he and the inventor were able to trick the computer into playing Tic-Tac-Toe? I see a LOT of bugs in the software there but no one ever seems to care about that...
    • by btellier ( 126120 )
      Also, I remember hearing about this good looking hacker chick who used extremely large fonts and a camouflaged computer. Due to Penn Jillette's incompetence he failed to notice this good looking individual's ability to literally fly through the network, superman style, before crashing into the garbage file. Teller had neglected to take out the trash and was summarily beaten. In the resulting hilarity, more large fonts are exchanged and a virus is disassembled via Matrix-style code-fu. Near the end of this caper one of the hax0rs in question has sex with another human being, possibly as a result of his cult following of thousands of IRC kiddies boasting knockoffs of his nick.
  • by BiggestPOS ( 139071 ) on Thursday May 02, 2002 @01:17PM (#3451820) Homepage
    1) 1999 - Buffer overflow causes Half-Life to crash while I'm in an important clan match (Counter-Strike); we lose the match, and I lose many friends.

    2) 2000 - Poorly coded garbage collection causes Word 97 to crash, losing the last 2 hours of my research paper. Class was in 30 minutes; the paper was late. I lost my scholarship.

    3) 2002 - IE crashes while I'm writing an AWESOME first post for /.; my karma never recovered.

  • by SplendidIsolatn ( 468434 ) <splendidisolatn@yah[ ]com ['oo.' in gap]> on Thursday May 02, 2002 @01:20PM (#3451835)
    The surprise isn't how many situations have cropped up because of software bugs, but rather how few. Think of all the things code is written for, and yet there hasn't been any truly major 'disaster'. Yes, the deaths and accidents are tragic, but in the grander scheme of things, it's amazing that nothing truly catastrophic has happened.
    • Umm, define catastrophic, because I define it as loss of life. I'm sure the fly-by-wire Airbus passengers who went down would consider it catastrophic, as would the medical patients who received lethal doses of radiation. If you don't think death due to a bug is catastrophic, send me your resume; I have some dosage-control software I need a test subject for.
  • Millennium Bridge (Score:4, Interesting)

    by rde ( 17364 ) on Thursday May 02, 2002 @01:22PM (#3451852)
    I'd take issue with the inclusion of the London Millennium Bridge; that wasn't so much a failure of software as a flawed model, one that failed to take into account the effect of swaying pedestrians. After the bridge was rectified, the new data - never before available to any bridge model - were incorporated into such models so that it won't happen again. That's science, not a bug.
    • I'd take issue with the inclusion of the London Millennium Bridge; that wasn't so much a failure of software as a flawed model, one that failed to take into account the effect of swaying pedestrians

      Pedestrians were also asked not to sway anymore.

    • That's science, not a bug.

      Actually, the phenomenon of resonance is well documented in science. Engineering involves discarding the pieces of science that are unimportant. For example, most of quantum theory can be ignored when modelling a bridge, which simplifies things a bit. In the case of the Millennium Bridge, they oversimplified and ignored the horizontal force exerted by people walking, which is what caused the swaying. So this is not science, not software, just an engineering error.

      not_cub

    • by victim ( 30647 ) on Thursday May 02, 2002 @02:44PM (#3452390)
      Human effects on bridges are hardly a surprise. Recall 1981, when the Kansas City Hyatt's skywalk collapsed, killing 114, because the pedestrians were dancing (and the design had been altered to ease construction). You'd think that would have been enough of a wake-up call for the Millennium designers to consider human motion. more info [uh.edu]

      Armies have broken cadence when marching across bridges at least as far back as Napoleon's time. Presumably they learned that the hard way.

      On a more personal note, I have participated in the unintentional destruction of a gymnasium. 80 or so people crowded together in the middle, bouncing up and down, and then "down and down". We fractured the engineered wooden joists. Fortunately it failed gracefully. Just sagged down about 4 feet in the middle.

      What I'm trying to say, not particularly directly, is "don't give the designers of the bridge a pass because this new phenomenon struck their bridge". Chastise them for risking people's lives and wasting resources by neglecting the loads placed on bridges.
      • Your assumption about the nature of pedestrian motion that caused the bridge wobble is incorrect:

        They did take into account pedestrian movement on the bridge; they didn't take into account pedestrian motion locking in to the motion of the bridge:

        1) Pedestrians walk on bridge
        2) Bridge wobbles slightly
        3) Pedestrians adjust their walking to be in phase with bridge
        4) Bridge wobbles more

        This was a new phenomenon, due to the lightness of the bridge's construction. It has now been fixed by the addition of dampers.
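
        A toy model makes that feedback loop concrete. The sketch below (my own illustration, not anyone's actual bridge model) integrates a lightly damped oscillator whose forcing stays in phase with the deck's velocity, the way the synchronized pedestrians' footfalls did; the amplitude grows until damping balances the input, and raising the damping constant (the retrofit) keeps the wobble small:

            #include <math.h>
            #include <stdio.h>

            int main(void)
            {
                double x = 0.0, v = 0.001;  /* tiny initial wobble */
                double k = 40.0;            /* deck stiffness */
                double c = 0.05;            /* damping; try 2.0 to "fix" it */
                double f = 0.5;             /* pedestrian forcing amplitude */
                double dt = 0.001;

                for (int i = 0; i < 60000; i++) {
                    /* Forcing locked in phase with velocity: walkers adjust
                       their steps to the deck, always pumping energy in. */
                    double force = (v > 0) ? f : -f;
                    double a = -k * x - c * v + force;
                    v += a * dt;
                    x += v * dt;
                    if (i % 10000 == 0)
                        printf("t=%4.0fs  amplitude ~ %.4f\n", i * dt, fabs(x));
                }
                return 0;
            }
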
      • by igrek ( 127205 ) on Thursday May 02, 2002 @03:25PM (#3452701)
        In the old USSR (Stalin times), there was a standard bridge acceptance test:
        1) put project managers, lead architects and engineers under the bridge;
        2) put heavily loaded trucks on the bridge.

        That was real extreme testing.
      • Human effects on bridges are hardly a surprise. Recall 1981, when the Kansas City Hyatt's skywalk collapsed, killing 114, because the pedestrians were dancing (and the design had been altered to ease construction). You'd think that would have been enough of a wake-up call for the Millennium designers to consider human motion.

        The Hyatt's skywalk collapsed solely because of the change in design. The design change caused the walkway to fail to meet building code [unl.edu]. Some civil engineers who studied the disaster were surprised it could support its own weight, much less the weight of the pedestrians.

        Quoting from a Kansas City Star article [kcstar.com]:

        The National Bureau of Standards concluded failure was just a matter of time. "The walkways," its probe found, "had only minimal capacity to resist their own weight."

        The dancing people were by and large on the floor below the skywalk, participating in a dance contest.

        The mistake that caused the Hyatt disaster was not one of failing to consider human motion in the design, but failing to consider the effects of seemingly minor changes in design.

  • by teambpsi ( 307527 ) on Thursday May 02, 2002 @01:24PM (#3451875) Homepage
    It's really amazing how many software project managers don't fully understand what regression testing is all about.

    Software engineers simply cannot be trusted to do more than small unit level testing! We get into a pattern of behavior, we know what to expect, and simply do not stress test the system.

    That's why I like hiring salespeople and 2-year-olds to test my code at the unit/integration level.
    • by dgb2n ( 85206 )
      Testing is critical.

      Others would argue that testing alone may not suffice. Particularly for these kinds of mission-critical applications, nothing short of formal methods of software engineering [sbu.ac.uk] will do. Formal, as opposed to natural-language, specifications reduce ambiguity. Safety conditions can then be derived and verified through rigorous mathematical proofs.

      Of course none of this obviates the need for testing but it can lead to a more predictable system.
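
      Even short of full proof, a derived safety condition can at least be stated and enforced in the code itself. A minimal sketch in C, with an invented dose limit in the spirit of the radiation-therapy incidents mentioned here (illustrative only, not anyone's real method):

          #include <assert.h>
          #include <stdio.h>

          #define MAX_DOSE_CGY 200.0   /* invented safety limit */

          /* Derived safety condition: no single delivered dose may exceed
             the limit, no matter what the operator typed. */
          double deliver_dose(double requested_cgy)
          {
              assert(requested_cgy >= 0.0);          /* precondition */
              if (requested_cgy > MAX_DOSE_CGY) {
                  fprintf(stderr, "request clamped to safe limit\n");
                  return MAX_DOSE_CGY;
              }
              return requested_cgy;
          }

          int main(void)
          {
              /* In a formally developed system this bound would be proved,
                 not just tested; here it is at least explicit. */
              printf("delivered: %.1f cGy\n", deliver_dose(5000.0));
              return 0;
          }
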
    • by billnapier ( 33763 ) <{moc.xobop} {ta} {reipan}> on Thursday May 02, 2002 @02:38PM (#3452348) Homepage

      That's why I like hiring salespeople and 2-year-olds to test my code at the unit/integration level

      You didn't need to repeat yourself.

    • by Junks Jerzey ( 54586 ) on Thursday May 02, 2002 @02:49PM (#3452425)
      It's really amazing how many software project managers don't fully understand what regression testing is all about.

      Not in important fields like telecom. In those fields you live and die by testing, and you can be held accountable for bugs found in your code. If there are too many, you might be in for it.

      What's shocking to me is that almost no open source authors or advocates give a hoot about automated testing of any kind. The only free software I've found with a test suite is gcc. As much as I hate to say it, there's a good chance that the relative inexperience of most open source authors is a factor here.
      • Not saying they are *good* tests, but many source install instructions include the basic steps
        • ./configure
        • make
        • make test
        • make install

        I always run these.

        I had a job writing regression tests, but I have not looked into any of these install tests. I doubt they are as thorough as they could be, but they have failed once in a while, and I have always investigated the failures.
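
        For what it's worth, those test targets work by exit status: make runs one or more test programs and treats any nonzero exit as a failure. A minimal sketch of such a test program in C (the function under test is hypothetical):

            #include <stdio.h>

            /* Hypothetical function under test. */
            static int add(int a, int b) { return a + b; }

            int main(void)
            {
                int failures = 0;
                if (add(2, 2) != 4)  { fprintf(stderr, "FAIL: add(2,2)\n");  failures++; }
                if (add(-1, 1) != 0) { fprintf(stderr, "FAIL: add(-1,1)\n"); failures++; }
                /* A `make test` target treats a nonzero exit as failure. */
                return failures ? 1 : 0;
            }
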
      • by slamb ( 119285 ) on Thursday May 02, 2002 @05:16PM (#3453607) Homepage

        What's shocking to me is that almost no open source authors or advocates give a hoot about automated testing of any kind. The only free software I've found with a test suite is gcc. As much as I hate to say it, there's a good chance that the relative inexperience of most open source authors is a factor here.

        Perl is really good about this. The Test::Harness [uwinnipeg.ca] and Test::More [uwinnipeg.ca] modules make it very easy to write test suites, so CPAN modules have lots of automated tests. It might even be a requirement to get a module into CPAN; I'm not sure.

        PostgreSQL has regression tests.

        There's a really nice test environment for Java code called JUnit [junit.org]. Lots of stuff is using it. Lots of articles about how to write effective tests. There's a project [mockobjects.org] to develop mock versions of common objects (servlet requests, SQL queries) that fail in interesting, predefined ways. I'm using a C++ workalike called CppUnit [sourceforge.net] in one of my projects.

        The Boost [boost.org] code has automated testing.

        There's a project called qmtest [codesourcery.com].

        The Wine people have recently started using regression tests.

      • What's shocking to me is that almost no open source authors or advocates give a hoot about automated testing of any kind. The only free software I've found with a test suite is gcc.

        Bullshit [junit.org]
  • another bug page (Score:5, Insightful)

    by blooher ( 40990 ) on Thursday May 02, 2002 @01:25PM (#3451881)
    Software Horror Stories [tau.ac.il], linked from the article in the post.
  • by DeadSea ( 69598 ) on Thursday May 02, 2002 @01:27PM (#3451896) Homepage Journal
    Of all of them, this is my favorite. It doesn't say whether it was a software bug or not, though.
    [Source:
    A 51-year-old woman was subjected to a harrowing two-hour ordeal [on 16 Apr 2001] when she was imprisoned in a hi-tech public convenience. Maureen Shotton, from Whitley Bay, was captured by the maverick cyberloo during a shopping trip to Newcastle-upon-Tyne. The toilet, which boasts state-of-the-art electronic auto-flush and door sensors, steadfastly refused to release Maureen, and further resisted attempts by passers-by to force the door. Maureen was finally liberated when the fire brigade ripped the roof off the cantankerous crapper. Maureen's terrifying experience confirms that it is a short step from belligerent bogs to Terminator-style cyborgs hunting down and exterminating mankind.
  • the BSOD. I still have nightmares.
  • by Alomex ( 148003 ) on Thursday May 02, 2002 @01:32PM (#3451929) Homepage
    Just to be clear, all processors out there have bugs. The Pentium bug is in no way exceptional. The only reason it deserves to be there is because the list is called "a collection of famous software bugs that caused large scale disasters".


    The Pentium bug is certainly famous, because every idiot and his brother thinks it is rare for a CPU to be buggy. The second condition in the list is "caused a large scale disaster". This condition is, sadly, also met: it caused a large-scale public relations disaster for Intel, because once again said idiots thought that a CPU bug is rare.

    • Just to be clear, all processors out there have bugs. The Pentium bug is in no way exceptional. The only reason it deserves to be there is because the list is called "a collection of famous software bugs that caused large scale disasters".

      What is exceptional is that instead of just announcing a new erratum (which is what Intel and most CPU makers normally do in such a case), Intel tried to bury the problem, initially denying that it existed and then denying that anyone would ever run into it. This really pissed off the numerical computing community and destroyed confidence in the accuracy of Intel's floating point unit. That's why it was a public relations fiasco.

      see:

  • by mesocyclone ( 80188 ) on Thursday May 02, 2002 @01:34PM (#3451938) Homepage Journal
    Back in 1973 we built a system for hotel reservations that had over 1000 mini-computers distributed in hotels all over the US. These computers periodically dialed in to an 800 number to get outstanding messages (it was cheaper for them to dial in than for us to dial out to them).


    I wrote the algorithm that scheduled the dialins. It used a pseudo-random approach during the day, weighted by outstanding traffic.


    But at night, there was a period during which we had to unload all messages before the next day's processing. During this time, the pseudo-random algorithm was replaced by a deterministic one that assigned computers time slots.


    The computers also had auto-retry in case of failure, so each call could turn into several if it was blocked.


    Unfortunately, during coding I had set the number of modems answering phones to 20 (an arbitrary number for testing). During the hectic rollout, this was never changed to the actual number, which was much smaller.


    Once the system came online, every night at 1 AM portions of Omaha (which included lots of call centers) would lose all long-distance service for a couple of hours, as all these computers called in and retried several times.


    Eventually the phone company figured it out and contacted us, and we discovered and corrected the discrepancy.


    Another issue was that we had a number of hotels using pulse dialing (this was a long time ago, in a galaxy far, far away). Sometimes the dialed number would be off by one digit, due to the inherent unreliability of pulse dialing, and the result was a lot of middle-of-the-night calls to numbers one digit away from the 800 number.


    BTW... as far as I know, this was the first large widely distributed commercial computing system to use switched telephone circuits for communications (but no doubt some other grey-haired slashdotter knows of another).

    • We had a similar situation where we accidentally DDoS'd our university's engineering school. We were working on a file-sharing service that had over 600 people sharing at any one time. The lead programmer changed how the clients and main server pinged each other, to make it more compatible with firewalls: the client would send out a ping, the server would catch it and throw it back, and so on. The problem was that he forgot to set a delay on this exchange.

      One night our system vanished from the web. Our clients couldn't connect, the website was gone, and we couldn't ssh in. Later on we found out what had happened: as more and more clients auto-updated to the new version, they began pinging the server to announce their presence. It in turn responded, and soon it was doing nothing but sending and receiving pings. To over 700 computers. As fast as it could.

      Somewhere between 700 and 800 clients, the router died, taking with it the internet connection for the entire engineering school. Somehow we were never disciplined, and everything was brought back online within the next day or so. Now that's something to put on a resume: effectively launched a 700+ system DDoS on my own university. Remember kids, make sure you trust the company that makes your P2P software!
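
      The fix, presumably, was a single line of pacing in that keepalive loop. A stripped-down sketch of the client side in C over UDP (address, port, and interval are all invented, and it assumes something on the other end echoes the datagram back):

          #include <arpa/inet.h>
          #include <netinet/in.h>
          #include <stdio.h>
          #include <sys/socket.h>
          #include <unistd.h>

          int main(void)
          {
              int s = socket(AF_INET, SOCK_DGRAM, 0);
              if (s < 0) return 1;

              struct sockaddr_in srv = {0};
              srv.sin_family = AF_INET;
              srv.sin_port = htons(7);   /* echo service, for illustration */
              inet_pton(AF_INET, "127.0.0.1", &srv.sin_addr);

              char buf[16];
              for (;;) {
                  sendto(s, "ping", 4, 0, (struct sockaddr *)&srv, sizeof srv);
                  recv(s, buf, sizeof buf, 0);   /* wait for the "pong" */
                  sleep(30);   /* the forgotten line: without it, every client
                                  ping-pongs as fast as the network allows */
              }
          }
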

  • Click on the links; otherwise you don't get all the details. "French-friendly Exocet missile"? Huh? Unless you click through, you don't realize that the British radar thought the Exocet was a friendly munition because the British arsenal included Exocets. Any munition headed straight at you is probably not friendly.
    • Re:links (Score:3, Funny)

      by scott1853 ( 194884 )
      <private> Incoming missile sir! What do we do?
      <officer> Don't worry, it's one of ours.
      <private> But sir, it's still going to HIT us!

      This not only sounds like something that belongs in a Dilbert strip, but also like the basis for the logic that allows the spreading of all these e-mail viruses.
  • by delphin42 ( 556929 ) on Thursday May 02, 2002 @01:39PM (#3451972) Homepage
    The professor of the C programming course I took was considering making Fatal Defect required reading. From Amazon.com:

    In Fatal Defect: Chasing Killer Computer Bugs, Ivars Peterson describes dozens and dozens of hoary computer bugs and gives biographical sketches of the bug detectives who located and fixed them. This book, which reads like a novel, is both entertaining and informative. Many of the bugs that Peterson discusses are not in computer programs per se but in the human systems that run and operate the computers. Very often the operator fails to understand what the computer program requires as input and types in an incorrect command. The computer then executes the command, with potentially disastrous results. Fatal Defect has important lessons for both those who design computers and those who use them.

    He also insisted that we not call them bugs: "They are ERRORS; calling them bugs makes it sound like they are cute little accidental things that pop up, when actually they are programming mistakes."
    • I bet he's one of those people who refers to automobile accidents as "collisions". "Bug" is just a term that has been in use so long that we keep using it. I'm sure most people know that nearly all automobile "accidents" are preventable at some point, just as we all know that a bug is the result of a human error. It's the origin of the word that made it what it is today: "bug" describes how the problem appears from the user's perspective - an odd quirk. And most people still know where to place the blame, most of the time (rather than, say, blaming Windows for a bug in a particular piece of software running on it).
  • Read comp.risks (Score:5, Informative)

    by kzinti ( 9651 ) on Thursday May 02, 2002 @01:43PM (#3452001) Homepage Journal
    Make reading the ACM's RISKS digest a part of your regular routine, and you'll hear about these kinds of software-related problems and many others - usually shortly after they happen. The RISKS digest is available on Usenet as comp.risks, as a mailing list, and on the WWW at http://catless.ncl.ac.uk/Risks [ncl.ac.uk]. A new issue is published semi-regularly, every one to two weeks. It's not only informative but interesting, too.

    --Jim
  • by Anonymous Coward on Thursday May 02, 2002 @01:43PM (#3452005)
    Sure, some people here gripe about this not being newsworthy. But as a hardware guy, I am happy to see that software guys are finally going to be held to some sort of standard.

    In electronics, if your hardware has ONE little problem, it's almost bankruptcy time. Remember the Pentium FP bug, and how little it would actually have affected anyone? Remember the hoopla: people wanted new processors, etc.

    But software bugs? Who cares! It's NORMAL, it's EXPECTED. Well, geeks and nerds, time to get your asses in gear and live up to the same standards mechanical and electrical engineers have been living up to for decades.

    I'm tired of being held to a standard of perfection that the software people (who make more money than me!) don't even KNOW about.
    • by jc42 ( 318812 ) on Thursday May 02, 2002 @02:41PM (#3452360) Homepage Journal
      There is one highly relevant difference between the way that we deal with hardware and software. With hardware, inner details, schematics, and the like are usually easily available. Often this is required by law in any critical applications.

      With software, most programmers are writing code to run on systems (kernels, runtime libraries, and the like) that are usually proprietary. The inner details are not just neglected; the companies intentionally keep them secret and prosecute people who leak them.

      As a result, software can't be made reliable, not even in principle.

      We do have a few exceptions, e.g. linux and all the GNU stuff. If *everything* underneath your code is Open Source, then in principle you can examine it and find problems. (It ain't easy, but at least it's doable if your employer will permit the time that it takes).

      But we're facing a major battle just getting Open Source software accepted by a tiny part of the market. In most jobs, you are required to write code for systems whose inner working you are not permitted to know.

      The US government is even using proprietary, binary-only computer systems in secure and mission-critical situations. Anyone who expects the code in such situations to be reliable is either utterly ignorant or actively malicious.

      Myself, I'd welcome rules that make me and other software developers responsible for bugs in our code. If there were such a legal requirement, I could point to it when someone denies me access to the information I need, and say "I can't possibly write correct code when you are keeping vital information from me. Show me the inner details of these parts of the system, and I'll agree to write reliable code for it."

      Of course, in a couple of cases, when I've gotten my hands on such details, I've proceeded to write a proof that certain things could not be done reliably on that system. "Fix that bug in that library, and I'll vouch for my code. Until then, here's my bug report describing exactly how it will fail."

      Unfortunately, when I've done this, the usual result was that I was looking for another job soon thereafter.

      (One such lost job was when I proved that certain sensors in a nuclear power plant could not be made to work reliably due to their software. But that was 20 years ago; maybe they've fixed it by now. ;-)
  • GPL (Score:2, Insightful)

    by lostchicken ( 226656 )
    The GPL raises this to new levels of concern. You can never know where your code will be used. It might just find itself in a cruise missile.
  • So much for poor Visual Basic Programmers :)

    Damn, they never get to do fun stuff like this.

  • This one time I missed closing an /a tag on a post, and missed getting a wicked killer First Post.
  • I hope you don't mind a little nit-picking. The thread is titled "Debug Your Code," but a lot of the problems listed in the article were errors that occurred in situations outside the parameters the programmers were expecting.

    I personally see debugging as the art of making sure the code works and fulfills the logical expectations of the programmer.

    These problems show that there is a need to go way beyond traditional debugging and do aggressive testing outside the programmer's box. Debugging ain't enough. Those dregs toiling away in the testing department might be worth their skin after all.

    kd [slashdot.org]
  • Phoenix (Score:2, Interesting)

    by The Bungi ( 221687 )
    It's interesting that the list includes the Denver airport baggage-system breakdown but not the Phoenix Sky Harbor one: a system designed by some consulting firm (I forget which) over the course of three years, at a cost of millions of dollars, finally had to be scrapped and replaced with custom software done by IBM.

    It delayed the re-opening of the airport by about seven months. Even after the airport finally opened, the system still wasn't working, so the baggage had to be handled manually for a couple of months.

  • I was a consultant at a major bank 3-4 years ago. An FTE made a one-line error in a COBOL program for printing bank statements. Everyone in a small town of about 6,000 got the first page of their own statement and pages 2, 3, 4, etc. of someone else's.
  • How about that horrid bug that extends the width of the window data by a factor of 8, making it impossible to read?

    Does Slashdot want me to just waste more time at work or what?!
  • by MountainLogic ( 92466 ) on Thursday May 02, 2002 @01:54PM (#3452071) Homepage
    I'm sure we all have those bugs that we catch in bench testing. Mine was forgetting to add a Cancel button to the following dialog box:

    "OK to delete database"

    When I caught that one, I had visions of a user whose million-dollar database had just been deleted charging into our office with a shotgun and... well, you read the papers. Glad I caught that one before I released it to test.

    • by thrillbert ( 146343 ) on Thursday May 02, 2002 @02:31PM (#3452301) Homepage
      After sitting down for 36 hours straight when I was first learning to program in C, I wrote a small but useful payroll program. At the end, in the function that printed the check, I added "Press any key to continue, any other key to abort". Lucky for me I never released that program.

      ---
      All comments are not factual unless stated otherwise.
  • A good number of these incidents are NOT due to bugs in software but to faulty assumptions fed into that software.

    If I misestimate the mass of a planet, is that a software bug?

    If my software sells stock when a certain threshold is hit, and yours does the same, and that leads to a financial-industry meltdown, did my software not work as planned, or is the issue more that the dynamics of the market are somewhat unpredictable?

    The Tacoma Narrows and London Millennium bridges are both listed here, yet neither one is a software issue - hell, the Tacoma bridge collapsed in 1940!


    That said, it is a pretty interesting list, but calling it a list of software bugs and using it to underscore the importance of regression testing software is a bit of a stretch. If anything, it underscores the importance of editing and proofreading your content.

  • by T.E.D. ( 34228 ) on Thursday May 02, 2002 @01:59PM (#3452103)
    I'd call it a bad sign when the first two entries on a page that purports to show famous software bugs are not, in fact, software bugs.

    The bug that caused the Ariane explosion was a requirements-analysis bug. The Pentium FP bug was a hardware bug.

    A quick skim of the rest nets me at least six more non-software "software bugs":
    • 4. Mars Climate Orbiter, Loss (Mixture of pounds and kilograms, 1999) - Specification bug
    • 27. Distributed denial-of-service attacks - Malicious people
    • 31. Florida Voting Chaos - not a damn thing to do with computers
    • 34. Wall Street Crash, October 1987 (Acceleration of the crash) - computers did precisely what their users wanted them to do
    • 42. Great Concert Disasters - WTF?!
    • 43. Tacoma Bridge (not a computer bug)(collapse, 1940) - he said so himself

    After seeing that, I can't really trust the list on things I don't have a good knowledge about.

    Here's a challenge for someone: Go through the list and find out how many (if any) of the listed software bugs are actually software bugs.
    • We have to accommodate our users and unknown environments. When a reasonable user makes your program do something bad, that's either a user-training problem or a software bug: you didn't check input data carefully enough, or you didn't provide a good user interface, or your requirements were bad, etc. All of those are software bugs.

      The Pentium bug, I'll agree with. Florida voting, I'll agree with. DDoS is a software bug: software run in an environment like the web has to accommodate malicious users. The Wall Street crash, software bug: there was a set of conditions in which the software did bad things.
  • Patriot and Scud (Score:3, Interesting)

    by vondo ( 303621 ) on Thursday May 02, 2002 @02:00PM (#3452111)
    The claim for this one is that a Patriot during the Gulf War failed to intercept a Scud missile, and the Scud killed 28 people. Ergo, a software bug killed 28 people.

    Considering that even the military now admits that no Patriot *ever* intercepted an Iraqi Scud, this inference is unfounded.
    • What was going on in this case was that the launching system had a minor but cumulative rounding error in its time measurement. After accumulating for several days, the deviation was big enough that the launched Patriot completely missed its target (timing is essential when traveling at several times the speed of sound) and slammed into the ground in the wrong place.

      Whether the deaths were a direct consequence of the bug is debatable. But the missile might have hit its target, rather than slamming into the ground on the wrong side of the front line, had that bug not been present.
      • What was going on in this case was that the launching system had a minor but cumulative rounding error in its time measurement.

        What's interesting is that this isn't a bug. The system in question was designed for a maximum duration of 48 hours between resets, and they ran it for a week.

  • The Patriot missile failure was blamed on a roundoff error causing an accumulating time error, resulting in a miss.

    But the bug was more fundamental: The missile and radar computers synchronized clocks when the system was booted, then drifted apart. After a hundred hours the drift from the roundoff was enough to make it lose a target.

    But had the missile synchronized its clock upon launch (or better: target acquisition, to give it time to settle), the tiny roundoff error accumulated in flight wouldn't have mattered. Meanwhile, had the calculation been perfect, differential clock speeds still would have caused a drift.
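
    The widely cited numbers behind that roundoff are easy to reproduce: the system counted time in tenths of a second and converted to seconds by multiplying by a 24-bit fixed-point approximation of 1/10, which is off by roughly 0.000000095 per tick. A quick check in C:

        #include <stdio.h>

        int main(void)
        {
            /* 1/10 chopped to the 24-bit fixed-point value usually quoted
               in analyses of the failure: binary 0.00011001100110011001100 */
            double tenth_chopped = 838860.0 / 8388608.0;  /* 0.0999999046... */
            double error_per_tick = 0.1 - tenth_chopped;  /* ~9.5e-8 s */

            double hours = 100.0;                  /* uptime at Dhahran */
            double ticks = hours * 3600.0 * 10.0;  /* clock counts tenths */
            double drift = ticks * error_per_tick; /* ~0.34 s */

            printf("clock drift after %.0f hours: %.4f s\n", hours, drift);
            printf("a ~1700 m/s Scud covers ~%.0f m in that time\n",
                   drift * 1700.0);
            return 0;
        }
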
  • Nice (Score:4, Insightful)

    by pete-classic ( 75983 ) <hutnick@gmail.com> on Thursday May 02, 2002 @02:06PM (#3452139) Homepage Journal
    The actual article links to http://www.byte.com/art/9509/sec7/art20.htm, which says:

    THE BUG THAT KILLED

    1985-1987: At least four people died when they were exposed to lethal doses of radiation from Therac-25 linear accelerator machines (made by Atomic Energy of Canada Ltd.), used for radiation treatment of cancer. Software errors caused the machines to incorrectly calculate the amount of radiation being delivered to the patient. The most tragic incident to date of death or injuries to human beings due to defective computer software, [emphasis mine] this incident is a reminder that, as we entrust human lives and health to computers, the seriousness of eliminating bugs becomes a life-or-death proposition.


    and goes on to say:


    SIN OF OMISSION

    1991: American Patriot missiles were fairly successful. However, the failure of some Patriot missiles to track and destroy Iraqi Scud missiles during the Persian Gulf War may have been due to a software problem of the system. During one such Iraqi missile attack, 28 American soldiers were killed in their barracks in Dhahran, Saudi Arabia.


    Seven times the loss of human life, but less of a tragedy? I guess they're soldiers, so fuck 'em, eh?

    This story is over two years old, so they have had ample opportunity to correct it. The "comment" button on that page just takes me to the front page. Nice.

    Also on that page: "The DoubleSpace automati hard disk comparision software included in Microsoft MS-DOS 6.0 [. . .]" WTF is "automati"? "Comparision" isn't even a word as far as I know, though it looks a lot like "comparison". And DoubleSpace is disk compression software, not comparison software.

    Ironic that there are such glaring errors in an article about buggy software.

    Well, I wasn't particularly a fan of Byte before, but now I'm convinced that they suck.

    -Peter
  • by jc42 ( 318812 ) on Thursday May 02, 2002 @02:09PM (#3452153) Homepage Journal
    Some years back, as a grad student, I saw a bunch of colleagues do a rather unnerving experiment. Much of the number crunching was, as usual, done in Fortran. So they instrumented the compiler to silently test for integer overflow, report when it happened, and also report whether the program tested for it.

    Their result was that roughly 50% of the Fortran programs on the mainframe computer produced at least one number in the output that was wrong due to undetected integer overflow.

    This itself would be bad enough. But a bunch of us followed it up by asking Fortran programmers about the issue. Specifically, we pointed out that, unlike floating point, where there's an interrupt, integer arithmetic required a separate instruction to test the overflow flag, so testing for integer overflow took extra CPU cycles. Then we asked them whether they thought that software should be modified to always test for integer overflow, as is done with floating point.

    The answer was overwhelmingly that if it took extra CPU cycles, the software should not check for overflow.

    When we pointed out that this introduced the risk of programs producing incorrect results, the Fortran programmers invariably said that didn't matter. Faster is better, even if some of the results are wrong.

    I think of this whenever I read about computers used in medical, transportation, or other areas where malfunctioning software could put lives at risk. I don't believe that the "software culture" has changed significantly in this respect since then.
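
    The trade-off those programmers rejected costs a single flag test per operation. A minimal sketch of a checked add in C, using the GCC/Clang builtin as a modern stand-in for the separate test instruction described above:

        #include <limits.h>
        #include <stdio.h>

        int main(void)
        {
            int a = INT_MAX, b = 1, sum;

            /* One extra test per operation: the price of not silently
               producing a wrong number. */
            if (__builtin_add_overflow(a, b, &sum))
                fprintf(stderr, "overflow detected; result discarded\n");
            else
                printf("sum = %d\n", sum);
            return 0;
        }
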

    • I think of this whenever I read about computers used in medical, transportation, or other areas where malfunctioning software could put lives at risk. I don't believe that the "software culture" has changed significantly in this respect since then.


      That's precisely why people developing safety-critical apps should be (and quite often are) using Ada rather than Fortran or C. Not only does the language put in all the checks you mention (and more), but the "software culture" among Ada programmers is significantly better where bugs and safety are concerned.

      Take a look at Praxis' SPARK [sparkada.com] to see how responsible people develop safety-critical software. The approach takes more effort than the typical "hack something together, then bash it into shape with the debugger" approach, but in many cases it is well worth the cost.
  • So is it a webpage bug when you see:
    "Pentium Prozessor" and "Pentium Porcessors" in the writeup... [rant] This is the same kind of sloppy work that causes cars to explode, missiles to veer off course, and a busload of nuns to get blown up by a rogue robot (Nun Soup?) [/rant]

    Oh well :)
  • Coupla Notes (Score:4, Informative)

    by StormyMonday ( 163372 ) on Thursday May 02, 2002 @02:20PM (#3452215) Homepage
    1. The Patriot time-drift was caused by the system being operated outside of its design parameters. It was designed to operate during a Soviet invasion of Western Europe and was expected to have to relocate every 8 hours or so. The spec, therefore, assumed that the software would reboot every 8-12 hours. From my experience with the military, if a programmer had put in a clock algorithm that would track indefinitely, he or she would have been ordered to take it out. (Been there. Done that. Broke the coffee mug.)
    2. The Yorktown crash was the result of mixing mission-critical and non-mission-critical programs on the same box. Big no-no.

    So we have a specification problem and a system design problem. Neither is a pure "programming problem".

    Software crashes are like airplane crashes -- blame the lowest guy on the totem pole. In air crashes, it's the pilot. In software, it's a coder.

    • From my experience with the military, if a programmer had put in a clock algorithm that would track indefinitely, he or she would have been ordered to take it out.


      Any particular reason why? Is it just because the specs assume a reboot every 8-12 hours?

  • "Wrong Starting Estimate of Uranus mass in
    Iteration, Data Compression, 1986"


    Caused wife 1.0 to go into a panic and terminate all sex threads for the next three weeks.
  • a fucking Scud did. The Patriot bug prevented it from helping, but it didn't kill anyone. Sheesh.
  • by nomadicGeek ( 453231 ) on Thursday May 02, 2002 @02:27PM (#3452272)
    My software always works perfectly on my system. Zero bugs.

    I have no idea what the hell the users do to it to screw it up.
  • by kylef ( 196302 ) on Thursday May 02, 2002 @02:33PM (#3452314)
    There were many things that went wrong during the incident, but one of the FEW things that worked correctly was the AEGIS weapons system on board the guided-missile cruiser. The error lay in the crew's mistaking the range information reported on the radar screen for altitude information. As a result, the CO thought that the incoming contact was flying straight toward his ship and decreasing in altitude (preparing to attack).

    Blaming a "cryptic display" is hardly a software bug if anyone is familiar with radar screens. That's why we train people to read them!
  • ...the ice cream [snopes2.com] bug!
  • During live coverage of a Scud attack, one of the Patriots veered sharply to the left and hit an insurance office. The cause was said to be an error due to a time leak: the longer the Patriots were "online", the more leakage occurred.
  • by Malic ( 15038 ) on Thursday May 02, 2002 @02:50PM (#3452430)
    This IS the text on this very sort of thing. I love techno-"oops, that's not right, is it?"-horror stories, and this book is filled with them. I REALLY recommend it! Here's an example of the page-after-page of entries it contains:

    Making Rupee!

    Due to a bank error in the current exchange rate, an Australian man was able to purchase Sri Lankan rupees for (Australian) $104,500, and then to sell them to another bank the next day for $440,258. (The first bank's computer had displayed the Central Pacific franc rate in the rupee position.) Because of the circumstances surrounding the bank's error, a judge ruled that the man had acted without intended fraud, and could keep his windfall of $335,758.

    Computer Related Risks - Peter G. Neumann - ACM Press - 1995


    The bottom line here is that "computing is, in a technical sense, a risk". Actually, technology - of any kind - is a risk. Which I suppose leads us to remember that life is a risk.

    At which point, I'll just stop rambling and point you all to Amazon to buy the book [amazon.com].
  • CUI (Score:5, Insightful)

    by Ozan ( 176854 ) on Thursday May 02, 2002 @02:56PM (#3452468) Homepage
    I think most of the bugs in software are the result of "Coding Under the Influence". Whether it is a strict time limit, ambiguous specifications, no sleep, or other disturbances, it leads to blatantly dumb assumptions or similar faults. Everyone knows that driving under the influence is dangerous and can lead to accidents. Why do "software architects" think it is different when someone writes important programs?
    I think part of the problem is that writing software is a rather new craft compared to, say, metalworking. Programmers don't have a union, and they often work under poorer conditions than workers at conveyor belts, if you consider the higher responsibility they carry.
  • more here (Score:3, Informative)

    by 3-State Bit ( 225583 ) on Thursday May 02, 2002 @02:59PM (#3452485)
  • The web page has it as the F14, but I remember a posting from that time saying it was the F15 (which makes more sense, since the F15 was one of the first fly-by-wire aircraft, while the F14 is, I think, pretty much fly-by-cable).

    In any case, the SERIOUS problem was that when you flew over the equator, the computer would suddenly "realize" that you were upside down relative to where you wanted to be and try to immediately roll the aircraft over to the "proper" orientation. It was said that the aircraft would have survived the maneuver, but the pilot's neck would not.

    Luckily this was found during simulations. If it had happened during a real flight, it could have taken a long time (and lots of fatalities) to figure out.

    On a lighter note, there is apparently a subroutine - phonetically referred to as either wait_on_wheels() or weight_on_wheels() - that was added after some slap-happy test pilot tried retracting the landing gear while sitting on the runway (resulting in millions of dollars of repairs).
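
    Whatever its spelling, the interlock itself is just a guard clause. A hypothetical sketch in C (all names invented):

        #include <stdbool.h>
        #include <stdio.h>

        /* Hypothetical sensor read: true while the struts are compressed,
           i.e. the aircraft is sitting on its gear. */
        static bool weight_on_wheels(void) { return true; }

        static void raise_landing_gear(void)
        {
            if (weight_on_wheels()) {   /* the check added after the incident */
                fprintf(stderr, "gear retraction inhibited: on ground\n");
                return;
            }
            puts("retracting gear");
        }

        int main(void) { raise_landing_gear(); return 0; }
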

  • I remember reading about the USS Yorktown a couple years back. I laughed so hard I almost came apart.


    I wonder what experiences anyone else has had with divide-by-zero "glitches." Anybody else have a similar story?
