Tracking the Blackout Bug

Posted by michael on Saturday April 10, 2004 @03:02PM from the hindsight-is-20-20 dept.

Alien54 writes "This earlier Slash story cited a CNN news report on how the August blackout was preventable. But, as seen in this Security Focus article, things are not so simple. 'In the initial stages, nobody really knew what the root cause was,' says Mike Unum, manager of commercial solutions at GE Energy. 'We test exhaustively, we test with third parties, and we had in excess of three million online operational hours in which nothing had ever exercised that bug,' says Unum. 'I'm not sure that more testing would have revealed that. Unfortunately, that's kind of the nature of software... you may never find the problem. I don't think that's unique to control systems or any particular vendor software.' Which leads to a number of other questions."

This discussion has been archived. No new comments can be posted.

Software bug was just one part of bigger problem (Score:5, Informative)

by bonnyman ( 662966 ) * writes: on Saturday April 10, 2004 @03:04PM (#8825908) Homepage

The software bug was just one piece of a much bigger problem; I wouldn't want to overstate its role. There were many other factors; here are just a few:

Poor vegetation management probably played an even bigger role as overloaded power lines warmed up, expanded and sagged into trees and bushes that were supposed to have been cut back.

Poor communications between utilities played a major role.

This whole section of the transmission system was known to be unstable.

An inadequate regulatory structure lacked teeth to deal with known problems.

Lack of adequate transmission line capacity

If all these other problems hadn't been in place, the software bug might never have surfaced. And certainly, the problems would have been contained within a much smaller area -- maybe just First Energy's service area.

An article [tipmagazine.com] featured on Slashdot last year lays out the underlying complexity of the power grid very well: "The World's Largest Machine"

- Re:Software bug was just one part of bigger proble (Score:5, Interesting)
  
  by Raindance ( 680694 ) * writes: <johnsonmx.gmail@com> on Saturday April 10, 2004 @03:07PM (#8825926) Homepage Journal
  
  I agree that there's more to this than just one line of code, as some folks seem to believe. I think referring to it as 'one bug' is rather misleading.
  
  As well refer to the things leading up to WWII as 'one problem'.
  
  - This Defines All Catastrophic Failures (Score:2)
    
    by Allen Zadr ( 767458 ) * writes:
    
    They say that no airline crash was ever the result of a single failure. There are always failures in at least three systems or sub-systems, human or computer (usually both), that lead to an airline crash.
    In the case of HVAC fire systems, there are probably over 500,000 installations of HVAC systems, and these are tested under real fire conditions several times a year (where the type of feedback seen in this blackout investigation is made, each time).
    I think this should support Raindance's point
- Re:Software bug was just one part of bigger proble (Score:2, Flamebait)
  
  by Vancorps ( 746090 ) writes:
  
  I believe you can trace it all to one problem. Lack of management...
  Realistically none of these problems had to happen and wouldn't have happened if the people in charge were doing their jobs. Maybe they were working on a way to make cold fusion feasible, I don't know but if they were negligent then they need to be removed from their position. If they were just too busy with other aspects of the system then they need to bring more people in so the system can be properly maintained. A power outage is a big
  - Re:Software bug was just one part of bigger proble (Score:3, Funny)
    
    by Detritus ( 11846 ) writes:
    
    They were doing their job, cutting budgets and payroll costs. Oh, you wanted the system to operate reliably too?
- World's largest machine (Score:5, Interesting)
  
  by stefanb ( 21140 ) * writes: on Saturday April 10, 2004 @03:30PM (#8826050) Homepage
  
  An article featured on Slashdot last year lays out the underlying complexity of the power grid very well: "The World's Largest Machine"
  
  OK, it's nitpicking, but the largest machine is arguably the telephone system. Among other things, it maintains a synchronized clock (8 kHz base), even across oceans and continents.
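
The 8 kHz figure is presumably the PCM voice sampling rate, from which the standard digital trunk rates follow. A quick arithmetic sketch, assuming the North American T1 framing (the constant and function names are made up for illustration, and this says nothing about how clocks are actually distributed between networks):

```python
SAMPLE_RATE_HZ = 8000      # PCM voice sampling rate
BITS_PER_SAMPLE = 8        # one companded sample per channel per frame
T1_CHANNELS = 24           # voice channels multiplexed onto one T1
FRAMING_BITS = 1           # one framing bit per T1 frame


def ds0_bit_rate() -> int:
    # Bit rate of a single digitized voice channel.
    return SAMPLE_RATE_HZ * BITS_PER_SAMPLE


def t1_bit_rate() -> int:
    # One frame carries a sample from every channel plus the framing bit,
    # and frames are sent at the sampling rate.
    bits_per_frame = T1_CHANNELS * BITS_PER_SAMPLE + FRAMING_BITS
    return bits_per_frame * SAMPLE_RATE_HZ


print(ds0_bit_rate(), t1_bit_rate())
```

That works out to 64 kbit/s per voice channel and 1.544 Mbit/s per T1, which is why every frame boundary in the digital network ticks at 8 kHz.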
  
  - Re:World's largest machine (Score:2)
    
    by Creepy Crawler ( 680178 ) writes:
    
    Well then, it's the Internet too. Tele lines are just data lines with splitters for a/d and d/a coxes for the "phone" part.
    - Re:World's largest machine (Score:2)
      
      by the unbeliever ( 201915 ) writes:
      
      There was no concept of 'data lines' when most of the current telephone infrastructure was laid out.
      - Re:World's largest machine (Score:2)
        
        by Creepy Crawler ( 680178 ) writes:
        
        Stupid. They WERE data lines that sent analog voice data. That's all they could send.
        
        When we went more digital, we assigned large optic ring networks (SONET) where parts were for digital sending of analog and the others were for true digital data. Ever since the switch of ESS back in '87, we've been all using digital phone lines. They just happen to have a D/A hooked to them (that's what leads to our houses).
  - Re:World's largest machine (Score:3, Insightful)
    
    by IncohereD ( 513627 ) writes:
    
    Among other things, it maintains a synchronized clock (8 kHz base), even across oceans and continents.
    
    It's actually plesiochronous, and only synchronized within certain (relatively large) regions. And I don't know where you're getting that 8 kHz figure from.
    
    Basic relativity (not to mention propagation) will tell you that what you're describing is impossible.
- de centralised power (Score:2, Insightful)
  
  by zogger ( 617870 ) writes:
  
  I think the nation/region would be served better if we stepped back a bit and took another look at more decentralised power generation as a full bore government encouraged option. Not as a complete replacement, but frankly, I see no reason we can't have millions more solar panels and wind generators out there. Economy of scale in manufacturing, spurring on even more R&D, etc, works for everything else it appears. And having a lot more points of production, spread out, would help to mitigate cascading fa
- Comment removed (Score:5, Insightful)
  
  by account_deleted ( 4530225 ) writes: on Saturday April 10, 2004 @04:20PM (#8826338)
  
  Comment removed based on user account deletion
  - Re:Software bug was just one part of bigger proble (Score:2, Interesting)
    
    by Grayswan ( 260299 ) writes:
    
    Why don't we point out the real problem that likely caused this to happen. Energy deregulation in the first place.
    
    I think it is more accurate to say that deregulation enabled, not caused, the problem. Certainly First Energy used deregulation to put in place many of the pieces of the problem. You just don't hear about all the well run deregulated power systems.
    - Re:Software bug was just one part of bigger proble (Score:2, Insightful)
      
      by wintermute1974 ( 596184 ) writes:
      
      You just don't hear about all the well run deregulated power systems.
      Yes, we do not hear about them, because they do not exist.
      Sure, it was First Energy's lines that failed initially, but if it wasn't First Energy, some other utility would have failed eventually. The engineering and the legal descriptions of the current electrical generation and distribution system in North America are at odds with one another.
      There's a good technical discussion [tipmagazine.com] on the failings of the power grid that may interest you
    - Re: (Score:2)
      
      by account_deleted ( 4530225 ) writes:
      
      Comment removed based on user account deletion
  - Re:Software bug was just one part of bigger proble (Score:2)
    
    by bluGill ( 862 ) writes:
    
    Perhaps because the municipal power authorities don't pay any attention to the future, take new lines that the non-municipal utilities paid to install without paying for them, have many more customers per mile, and do minimal maintenance.
    At least in my area it is like that. I'm a member of an electric co-op. We have 16 customers per mile of line on average, the nearest investor owned utility has ~45, and the municipal ~115. The municipal takes the high profit lines, and leaves the rest to someone else. Both the
    - Re: (Score:3, Insightful)
      
      by account_deleted ( 4530225 ) writes:
      
      Comment removed based on user account deletion
  - - Re: (Score:3, Interesting)
      
      by account_deleted ( 4530225 ) writes:
      
      Comment removed based on user account deletion
- Re:Software bug was just one part of bigger proble (Score:2)
  
  by RobinH ( 124750 ) writes:
  
  There were many other factors; here are just a few:
  
  Yeah, and don't forget the biggest cause: Canada! We all knew immediately that it was their fault. They probably wrote this software too.
Well if you've got no warning... (Score:2, Insightful)

by mindless4210 ( 768563 ) writes:

how can you respond to an incident? It just goes to show the need for multiple monitoring systems in mission critical systems.
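
One way to picture a second, independent monitor is a heartbeat watchdog: the alarm process stamps the time whenever it finishes a cycle, and a separate checker flags the alarm system itself as dead if the stamp goes stale. This is only a sketch; `Heartbeat`, `is_stale` and the 30-second limit are invented for illustration and say nothing about how the real control-room software was built.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    """Last time (in seconds) the alarm process reported that it was alive."""
    last_beat: float = 0.0

    def beat(self, now: float) -> None:
        # Called by the alarm process at the end of every processing cycle.
        self.last_beat = now


def is_stale(hb: Heartbeat, now: float, max_silence: float = 30.0) -> bool:
    # Run by a separate watchdog: True means the alarm process has gone quiet.
    return (now - hb.last_beat) > max_silence


hb = Heartbeat()
hb.beat(now=100.0)
print(is_stale(hb, now=110.0))  # alarm process reported 10 s ago
print(is_stale(hb, now=200.0))  # silent for 100 s, so operators should be told
```

The point of keeping the watchdog separate is that a stalled alarm process cannot also silence the thing that reports the stall.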
For the 21st century... (Score:3, Funny)

by Anonymous Coward writes: on Saturday April 10, 2004 @03:10PM (#8825935)

If a bug exists in the code, but it's never triggered, is it really a bug?

- Re:For the 21st century... (Score:5, Insightful)
  
  by Raven42rac ( 448205 ) writes: on Saturday April 10, 2004 @03:14PM (#8825966)
  
  Yes, yes it is. If a mime gets hit by a tree in the forest, does anyone care? Sometimes, no matter how much testing you do, shit just happens. It is a fact of life. Show me one perfect, bug free, piece of software. Stuff breaks all the time, we only notice it when it affects us. We take for granted sometimes how good we have it. Power in this country is extremely reliable. We act as if a bomb dropped when the power goes out. Some parts of the world do not have power, clean water, etc. We should think of that before we start whining about having to actually talk to each other, use candles, read books, etc.
  
  - Re:For the 21st century... (Score:2)
    
    by Vancorps ( 746090 ) writes:
    
    Those parts of the world don't rely on power for virtually every aspect of their lives. Electricity is used everywhere; it is like a bomb dropped if power goes out in the U.S.
    Many businesses can't function, businesses of all types from banks to some hotels, to retailers, printing presses, the list goes on.
    
    We are very fortunate to have a power grid as stable as it is. For the most part things do just work, although there is no telling how much damage is done every year to electronics because someone turne
  - Re:For the 21st century... (Score:3, Interesting)
    
    by timeOday ( 582209 ) writes:
    
    I suppose one silver lining in having an outage once a year or so is that it forces us to keep backup systems for hospitals etc in place. If we only lost power once every 10 years, probably nobody at the hospital would even know what to do when power was lost, and people could die. It's just so hard to keep a backup system maintained and working if you are never forced to really use it once in a while. Like planning ahead for a weeklong camping trip, if you don't work up to it by taking shorter trips you
  - Bug free! (Score:5, Funny)
    
    by Ghoser777 ( 113623 ) writes: <fahrenba@NosPaM.mac.com> on Saturday April 10, 2004 @03:43PM (#8826125) Homepage
    
    int main()
    {
        return 0;
    }
    
    Because I have shown you bug free software, does that invalidate the rest of your argument?
    
    Matt Fahrenbacher
    
    - Re:Bug free! (Score:2)
      
      by Creepy Crawler ( 680178 ) writes:
      
      Then your compiler must've fucked up.
      
      See you relied on BUGGY software to make a binary of your "perfect program"
  - Re:For the 21st century... (Score:5, Insightful)
    
    by evilviper ( 135110 ) writes: on Saturday April 10, 2004 @04:15PM (#8826301) Journal
    
    Power in this country is extremely reliable.
    
    Actually, that's statistically untrue. We have, perhaps, the least reliable power system in all the countries of the first-world. Sure, 3rd-world countries have worse-off power systems, but the comparison isn't valid at all.
    
    Some parts of the world do not have power, clean water, etc. We should think of that before we start whining about having to actually talk to each other, use candles, read books, etc.
    
    Since when does the hardship of others make an unreliable power system a plus? Some places may be worse, but so what? We pay a lot for power, and expect our money is being spent on making sure we DO NOT have many outages.
    
    Meanwhile, in California, prices are high, and power was VERY unreliable. "Rolling Blackouts" anyone?
    
    My point is this. If something is broken, we want to fix it. We don't want to sit around saying "Well, it isn't as broke as that one". If we do, pretty soon it will get worse, and worse, and worse, until we have no other countries to point at.
    
    How about our medical system, and water utility? Should we accept thousands of deaths due to malpractice, or contaminated water, by just saying "Well, it's not as bad as country XYZ"? No, I don't think anyone would believe that, but it's really the same thing. Power outages do mean deaths, and do mean losses of lots of money. Businesses can't run, food can't be properly preserved, or even delivered. People die of heat-stroke, or hypothermia due to power loss. Ambulances can't get through dense traffic caused by traffic signals losing power, etc.
    
    A power outage is a lot more serious than people "whining" about not being able to watch TV... And yet you get moderated up anyhow... Amazing.
    
    - Re:For the 21st century... (Score:5, Informative)
      
      by Mark_in_Brazil ( 537925 ) writes: on Saturday April 10, 2004 @06:07PM (#8826954)
      
      Meanwhile, in California, prices are high, and power was VERY unreliable. "Rolling Blackouts" anyone?
      
      Good point. I live in Brazil, and there's a real sick tendency among people here to kiss American ass and fantasize that the United States are a place where everything works perfectly and nobody has to pay for anything. When they do that, I chuckle and point out things like the difference in the electrical power systems in the two countries.
      NOTE: I AM NOT SAYING BRAZIL IS BETTER THAN THE USA... JUST THAT IT'S NOT WORSE EITHER.
      Brazil's electrical power, as of 2001, was about 97% hydroelectric. Because of years of below-average rainfall, this system was threatened, and in 2001, we were told there might be "rolling blackouts" here (except that the Brazilian government, unlike the US government, was honest enough to call it what it was: power rationing). We ended up not getting any "rolling blackouts," and a regression toward the mean in rainfall has left us sufficiently well off that we don't even have to use the new polluting thermo plants that were built around the time of the crisis. Electrical power here is cheap and reliable, especially compared to places like California, where a lot of my friends had to endure "rolling blackouts" because the folks at the deregulated power companies decided to put more money on their bottom line by not investing in infrastructure upgrades and maintenance. So the execs who made those decisions increased profits in the short term, increasing their bonuses and the value of their stock. When the $#!+ hit the fan, guess who had to pay, both in damages from "rolling blackouts" and in higher rates? The consumers, of course!
      The only power problems I've had here in São Paulo were a neighborhood issue, not a city-wide, state-wide, or nation-wide problem. Basically, the new condo across the street overloaded the local grid 3 times in a 2-week span. The worst thing is that the new condo has its own generator, so the newcomers would knock out the neighborhood power and then not even notice, because their generator kicked in. Meanwhile, those of us who had already been in the neighborhood were screwed. Even those problems have been resolved, though. With even more people moving into the new condo, it's been about 6 weeks since we had a problem. The power companies here are pretty efficient. Yeah, I'd have liked for somebody to stop people from moving into the new condo until the local power grid was adequately updated, but they responded pretty quickly once the problem did present itself in an inconvenient way.
      
      --Mark
      
      - Re:For the 21st century... (Score:2)
        
        by evilviper ( 135110 ) writes:
        
        I live in Brazil, and there's a real sick tendency among people here to kiss American ass and fantasize that the United States are a place where everything works perfectly and nobody has to pay for anything.
        I'm sure it's not just Brazil.
        
        In the US, the big thing we have going for us is a lot of competition between companies. That means a lot of choice in products, and low prices. However, there are a lot of exceptions to that, and we are completely screwed when it comes to anything monopolized, or govern
    - Re:For the 21st century... (Score:2)
      
      by DerekLyons ( 302214 ) writes:
      
      Meanwhile, in California, prices are high, and power was VERY unreliable. "Rolling Blackouts" anyone?
      
      Sorry, but those blackouts have nothing to do with low reliability, and everything to do with lack of capacity. Nothing in the system was broken, nothing in the system failed, there simply wasn't enough power.
      Lack of capacity isn't lack of reliability.
      - Re:For the 21st century... (Score:2)
        
        by evilviper ( 135110 ) writes:
        
        Lack of capacity isn't lack of reliability.
        
        Yes, it is. In fact, lack of capacity is a CAUSE of the lack of reliability.
        
        If there is not power to your wall, the power system has failed. It doesn't matter how reliable the individual power plants are, we are talking about overall grid reliability.
  - Re:For the 21st century... (Score:2)
    
    by /dev/trash ( 182850 ) writes:
    
    Power is reliable in this country? Hmm, then the UPS Home Office market must be a myth.
    
    I have two units myself.
  - - Re:For the 21st century... (Score:2)
      
      by Raven42rac ( 448205 ) writes:
      
      Don't pull a "what about the children?!?!" on me. I tend to not worry about things. People who expect perfection are likely to be disappointed. Am I a medical software developer? No. If I were I would not hold the same viewpoint. I would strive to make the best code possible, but it is not possible to test for every contingency. Expecting perfection from everything is insane. As an atheist, I do not pray that often. I know it is just a figure of speech though.
B Method? (Score:5, Interesting)

by starseeker ( 141897 ) writes: on Saturday April 10, 2004 @03:12PM (#8825951) Homepage

"the bug was unmasked as a particularly subtle incarnation of a common programming error called a "race condition," triggered on August 14th by a perfect storm of events and alarm conditions on the equipment being monitored. The bug had a window of opportunity measured in milliseconds."

Isn't this the type of problem the B Method (and maybe the Z language too) are designed to address? Use proof logic initially - once you have decided on a behavior you want, design the system in such a way that it is provable it executes this design.

That doesn't mean the DESIGN is flawless, of course. But if we start engineering software on as many levels as we can, mightn't things improve? Normal software development and testing would never have found a critical bug with rare trigger conditions and a millisecond window. If you need precision on that level, you need (for starters) to KNOW your implementation of your design is sound, and preferably the code you are running exactly implements the proven logic. Isn't this what the B Method was created for?
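
For anyone who hasn't met one, a race condition can be shown without real threads by replaying the steps of two writers in a chosen order. The sketch below is a toy (the `run` helper and the step names are invented here, not taken from the actual alarm code): each writer does a read-modify-write on a shared counter, and one interleaving silently loses an update.

```python
def run(schedule):
    """Replay read/write steps of writers 'A' and 'B' in the given order.

    Each writer reads the shared counter into a private copy, then writes
    back that copy plus one. Returns the final counter value.
    """
    shared = 0
    local = {}
    for writer, step in schedule:
        if step == "read":
            local[writer] = shared
        elif step == "write":
            shared = local[writer] + 1
    return shared


# One writer finishes before the other starts: both increments survive.
serial = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]

# Both writers read before either writes: one increment is lost.
racy = [("A", "read"), ("B", "read"), ("A", "write"), ("B", "write")]

print(run(serial), run(racy))
```

Ordinary testing mostly exercises the first schedule. The second only shows up when the timing lines up, which is how a millisecond window can stay hidden for years.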

- Re:B Method? (Score:5, Interesting)
  
  by mccalli ( 323026 ) writes: on Saturday April 10, 2004 @03:38PM (#8826102) Homepage
  
  Isn't this the type of problem the B Method (and maybe the Z language too) are designed to address? Use proof logic initially - once you have decided on a behavior you want, design the system in such a way that it is provable it executes this design.
  Ye gods, you've frightened the hell out of me with reference to Z. I'd almost entirely forgotten it, and had hoped its cold corpse would lie in the ground undisturbed, undiscovered and most importantly of all unreferenced until the end of time. Still, "That is not dead which may eternal lie"...
  Z is a beautiful way to mathematically prove that you have design bugs at the highest level possible. You can then design your unit tests around those bugs, and confirm that they're valid.
  That's it. It provides nothing else that unit testing on its own couldn't do, with the exception of a few salaries and a research grant here and there. Whilst you can mathematically prove implementations of certain designs, the vast majority of designs have more complex interactions. Try using Z for a multithreaded real-time environment for example - my Software Engineering tutor at the time, Iain Sommerville (well known in the field due to his books, oh and 'at the time' would be ~1993), basically said that Z just breaks down in those circumstances. I wouldn't know - I personally had no clue how to even make it begin in those circumstances, let alone break down.
  Please confine Z to camp-fire ghost stories used to scare new programmers. It always was a living hell, and it really shouldn't be resurrected now.
  Cheers,
  Ian
  
- Re:B Method? (Score:5, Interesting)
  
  by Orne ( 144925 ) writes: on Saturday April 10, 2004 @03:39PM (#8826104) Homepage
  
  SCADA systems transport data samples. My company's system collects from several hundred thousands of meters, about half of which are expected to send in a sample about once every 10 seconds, some as fast as once every two seconds. The concept is that you have a communications buffer that collects the data, the link writes to the memory while the other EMS applications (about a dozen) read from the memory.
  
  Now admittedly, FirstEnergy's system is a little smaller in territory, but I wonder if their mergers over the recent years (Cleveland Electric and Ohio Edison became FE, and then proceeded to take Toledo Edison and GPU of PA) have outpaced the collection capabilities of their mainframe (which was already at the end of its life and was scheduled to be replaced). That could account for some of the "slowing" that the G.E. testers said they had to do to make the race condition appear.
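
A minimal sketch of the buffer arrangement described above, assuming Python threads stand in for the communications link and the EMS applications. The class and method names (`SampleBuffer`, `write`, `snapshot`) are hypothetical, and nothing here reflects how the vendor's system is actually implemented; it only shows one writer and several readers sharing memory behind a lock.

```python
import threading


class SampleBuffer:
    """Latest sample per meter, shared by one writer and several readers."""

    def __init__(self):
        self._lock = threading.Lock()
        self._latest = {}

    def write(self, meter_id, value, timestamp):
        # Called by the communications link for every incoming sample.
        with self._lock:
            self._latest[meter_id] = (value, timestamp)

    def snapshot(self):
        # Called by reader applications; returns a private, consistent copy.
        with self._lock:
            return dict(self._latest)


buf = SampleBuffer()


def link():
    # Stand-in for the telemetry link: 100 scans of three meters.
    for t in range(100):
        for meter_id in ("m1", "m2", "m3"):
            buf.write(meter_id, value=t * 0.5, timestamp=t)


def reader(seen):
    # Stand-in for an EMS application polling the buffer.
    for _ in range(50):
        seen.append(buf.snapshot())


seen_a, seen_b = [], []
threads = [
    threading.Thread(target=link),
    threading.Thread(target=reader, args=(seen_a,)),
    threading.Thread(target=reader, args=(seen_b,)),
]
for th in threads:
    th.start()
for th in threads:
    th.join()

print(buf.snapshot())
```

Handing readers a copy instead of the live dictionary keeps a slow application from ever seeing the buffer change underneath it, at the cost of copying on every read.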
  
- Re:B Method? (Score:2, Interesting)
  
  by Mr. Slippery ( 47854 ) writes:
  
  Use proof logic initially - once you have decided on a behavior you want, design the system in such a way that it is provable it executes this design.
  
  Problem is, doing and verifying proofs is just as subject to error as creating and reviewing code. All you've really done is change your symbol set.
- Re:B Method? (Score:3, Insightful)
  
  by bruthasj ( 175228 ) writes:
  
  design the system in such a way that it is provable it executes this design
  
  Unfortunately, if you actually come out of the library and the computer labs, software has to be done -- yesterday. Flawless, provable code would cause most software houses to go bankrupt. It's a fact of life...
I don't trust this Mike Unum guy... (Score:2, Funny)

by JessLeah ( 625838 ) writes:

...I don't know him from a hole in the wall. But his cousin, E. Pluribus Unum.... that guy, I trust. :)
The American jackasses who blamed Canada (Score:5, Interesting)

by Kevin Mitnick ( 324809 ) writes: on Saturday April 10, 2004 @03:14PM (#8825963) Homepage Journal

Did anyone ever retract their statements? I know the NY Mayor was pretty quick to blame us Canucks.

- Canada has a history of bad grid control (Score:3, Informative)
  
  by Orne ( 144925 ) writes:
  
  From the perspective of New York, they saw a surge race through their system East to West, through the choke point into Canada at Niagara station. NY constantly has problems with IMO not following schedules, and from their perspective, this was yet another incident of bad reliability control across the border.
  
  What they didn't know is that the energy was routed through the southern bit of Canada along the lake area, back into the USA in Michigan, to feed all of the communities along the southern shores of th
  - Re: (Score:2, Insightful)
    
    by account_deleted ( 4530225 ) writes:
    
    Comment removed based on user account deletion
    - Re: (Score:2)
      
      by account_deleted ( 4530225 ) writes:
      
      Comment removed based on user account deletion
- Re:The American jackasses who blamed Canada (Score:2)
  
  by Scrameustache ( 459504 ) writes:
  
  Blame Canada, Blame Canada!
  
  With their beady lil' eyes and flapping heads so full of lies!
  
  Xenophobic, yes.
  
  "In the initial stages, nobody really knew what the root cause was,"
  But you know, there are freedom canadians! And they put gravy and cheese on their freedom fries, those foreign weirdos...
- Re:The American jackasses who blamed Canada (Score:4, Funny)
  
  by spinkham ( 56603 ) writes: on Saturday April 10, 2004 @04:11PM (#8826280)
  
  We blame you, you blame the Newfies. It's the pecking order around here, deal with it ;-)
  
Testing isn't the answer... (Score:5, Insightful)

by evilviper ( 135110 ) writes: on Saturday April 10, 2004 @03:25PM (#8826023) Journal

You can't expect just testing to reveal all bugs in a program. Even a simple program would have to be fed completely random data constantly, in every different order and circumstance conceivable, for a very long time, to reveal all bugs. That's just not a real option.
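
To put a rough number on "every different order", here is a back-of-the-envelope sketch. The function names are made up for this post, and the event and step counts are arbitrary examples, not figures from the blackout investigation.

```python
import math


def orderings(n_events: int) -> int:
    # Number of distinct orders in which n distinct events can arrive.
    return math.factorial(n_events)


def interleavings(steps_a: int, steps_b: int) -> int:
    # Ways to interleave two threads' steps while keeping each thread's own order.
    return math.comb(steps_a + steps_b, steps_a)


print(orderings(10))          # ten distinct alarm events
print(interleavings(20, 20))  # two threads of twenty steps each
```

Ten events already allow 3,628,800 arrival orders, and two short threads of twenty steps each can interleave in over a hundred billion ways, before you even start varying the data.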

The only way to have bug-free software is to write it properly. You have to modularize and simplify everything down to the point that each one is easily understandable, and it is easy to detect when one is providing a senseless answer (in other words, cross-checking every result). Then, you have to tie them all together in a robust but simple way.

I know it's far easier to say it than do it, but it seems like nobody even tries to do it these days. Even mission-critical systems are commonly built as a single monolithic program, and when you have a lot of things going on within a single program, with no checks of the sanity of the data going into or coming out of each component, there is no way to be 100% certain that the program is theoretically and genuinely perfect. Meanwhile, by modularizing everything, you can PROVE that it is actually perfect.
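
As a rough idea of what checking the data going into and coming out of each component could look like, here is a sketch. The `checked` decorator, the `line_loading_percent` component and its limits are all hypothetical, and a range check like this only catches senseless values at the boundary; it is not a proof of anything.

```python
def checked(check_in, check_out):
    """Wrap a component so its input and output are both sanity-checked."""
    def decorate(component):
        def wrapper(value):
            if not check_in(value):
                raise ValueError(f"{component.__name__}: bad input {value!r}")
            result = component(value)
            if not check_out(result):
                raise ValueError(f"{component.__name__}: bad output {result!r}")
            return result
        return wrapper
    return decorate


@checked(check_in=lambda amps: 0 <= amps <= 5000,
         check_out=lambda pct: 0 <= pct <= 200)
def line_loading_percent(amps, rating_amps=1000):
    # Hypothetical component: converts a current reading to percent of rating.
    return 100.0 * amps / rating_amps


print(line_loading_percent(500))
```

A negative reading is rejected on the way in, and a reading that would give 300% loading is rejected on the way out, so a bad value stops at the module that produced it instead of flowing downstream.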

But this is really just the old Macrokernel vs. Microkernel argument all over again. A Microkernel can be perfect, while a macrokernel can never be completely bug-free, but people just find the latter to be easier to write, and then spend hundreds of times more man-hours finding and removing bugs, rather than spending (less, overall) time doing it correctly in the first place.

Oh yes, almost forgot, IMHO...

- Re:Testing isn't the answer... (Score:2, Insightful)
  
  by Grimmtooth ( 187628 ) writes:
  
  Your comments remind me of an old QA maxim: "We can only prove the existence of bugs - not the absence of them."
  
  You invoke the magic buzzwords of "modular design" as if it were a new thing. It isn't. That concept is older - in practice, even - than the median user on /.. Edsger W. Dijkstra was one of the earliest proponents of such coding practices - you can find archives of his papers HERE [utexas.edu] and see for yourself.
  
  Magic buzzwords can't prevent defects from occurring. QA can't find them all, no matter the bu
  - Re:Testing isn't the answer... (Score:2)
    
    by evilviper ( 135110 ) writes:
    
    "We can only prove the existence of bugs - not the absence of them."
    Fortunately, that's not true. Unfortunately, nobody seems to even try to write provably bug-free code.
    
    You see, everything dealing with computers is math. Everything that happens can be simplified down to a binary-base math problem. In math (unlike the real world) you CAN prove that something is perfect.
    
    For instance, it was a while back that a /. story touted the first provably unbreakable encryption method. Now, it's not a method tha
- Re:Testing isn't the answer... (Score:3, Insightful)
  
  by Kirill Lokshin ( 727524 ) * writes:
  
  Meanwhile, by modularizing everything, you can PROVE that it is actually perfect.
  
  Umm, no. Modular design is great for theoretical process correctness, i.e. if a certain input is made to the running program, will it provably produce a certain output. The main problem with this, of course, is it assumes that the program is physically running the whole time.
  
  The systems that (I assume) are being used here have to deal with more ephemeral and unpredictable conditions: failing hardware, CPUs going offline in the
Reasons for power blackouts (Score:5, Interesting)

by pcraven ( 191172 ) writes: <paul@NOSPAM.cravenfamily.com> on Saturday April 10, 2004 @03:27PM (#8826030) Homepage

I've been reading several papers on this for a grad class I'm taking. One of the several problems is no government control. If a power outage might be prevented by shedding some load (turning out power to some people), no company wants to step up to the plate and be the one to turn out the power to their customers. So they luck out, or they have a massive power outage.
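
To make "shedding some load" concrete, here is a toy sketch. The feeder names, megawatt figures and priority numbers are invented, and real load-shedding schemes are far more involved; this only shows the basic idea of dropping the least critical demand until what remains fits the available capacity.

```python
def shed_load(feeders, capacity_mw):
    """Pick feeders to disconnect so that remaining demand fits capacity.

    feeders: list of (name, demand_mw, priority); lower priority is shed first.
    Returns (names_shed, remaining_demand_mw).
    """
    remaining = sum(demand for _, demand, _ in feeders)
    shed = []
    for name, demand, _ in sorted(feeders, key=lambda f: f[2]):
        if remaining <= capacity_mw:
            break
        shed.append(name)
        remaining -= demand
    return shed, remaining


feeders = [("hospital", 40, 9), ("suburb", 120, 3), ("mall", 80, 1), ("plant", 150, 5)]
print(shed_load(feeders, capacity_mw=300))
```

The arithmetic is the easy part. The hard part, as noted above, is that somebody has to own the decision to pull the switch on their own customers.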

This paper [computer.org] (click on the PDF link) has a good summary of the problems in keeping power outages from happening again.

- Re:Reasons for power blackouts (Score:2)
  
  by ctr2sprt ( 574731 ) writes:
  
  I'm not really seeing where government control would change that. If they were quicker to pull the trigger and cut power to 100,000 homes, we'd just be seeing that every 3 months as soon as anything trivial went wrong. And because it's the government, there's nothing you can do about it.
  No, I'm not ready to give up on an industry that, so far, is so exceptionally reliable that most people are without electricity for maybe 5 hours out of the year. We get excited just for approaching that level of reliab
  - Re:Reasons for power blackouts (Score:2)
    
    by mc6809e ( 214243 ) writes:
    
    I'm not really seeing where government control would change that. If they were quicker to pull the trigger and cut power to 100,000 homes, we'd just be seeing that every 3 months as soon as anything trivial went wrong. And because it's the government, there's nothing you can do about it.
    No, I'm not ready to give up on an industry that, so far, is so exceptionally reliable that most people are without electricity for maybe 5 hours out of the year. We get excited just for approaching that level of reliability
- Re:Reasons for power blackouts (Score:2)
  
  by Orne ( 144925 ) writes:
  
  I disagree, government is never the answer if you want something truly fixed. There are plenty of rules in place on how to maintain a reliable system, rules formed by the industry itself as "best practice" procedures; not to mention that there's already an alliance called NERC [nerc.com] for US & Canada that's supposed to be managing it. A similar government commission, FERC [ferc.gov], exists for setting USA policy only. Thirdly, there's another coalition called NAESB [naesb.org] that sets the common standards for energy markets.
  
  What
Race conditions are nasty ... (Score:5, Insightful)

by cagle_.25 ( 715952 ) writes: on Saturday April 10, 2004 @03:36PM (#8826078) Journal

As you programmers all know, avoiding race conditions is really difficult. The fellow Neumann quoted in the article who said
But Peter Neumann, principal scientist at SRI International and moderator of the Risks Digest, says that the root problem is that makers of critical systems aren't availing themselves of a large body of academic research into how to make software bulletproof.

is overly optimistic; it's theoretically impossible to write a general test to find all race conditions in code. This is a variant of the Halting Problem [wikipedia.org].

- Re:Race conditions are nasty ... (Score:2, Informative)
  
  by Animats ( 122034 ) writes:
  
  it's theoretically impossible to write a general test to find all race conditions in code.
  Baloney. It is possible to write programs for which race conditions are undecidable. Such programs are broken. It is possible to write programs for which race condition detection is NP-hard. Such programs are broken if N is large. It is also possible to write programs for which race conditions can be proven to be absent. That's what you want to do.
  Actually, it's straightforward to design software to be fre
  - Re:Race conditions are nasty ... (Score:3, Informative)
    
    by platipusrc ( 595850 ) writes:
    
    how do you have a large nondeterministic?
    
    hint: NP-hard [wikipedia.org] is a problem that is NP-complete [wikipedia.org], or worse. An NP-hard problem does not have to be solvable. NP [wikipedia.org] in this context stands for nondeterministic polynomial (with reference to time bounds). NP means that a problem can be solved in polynomial time with an infinitely parallel system. NP-complete problems are at least as hard as all other NP problems.
    
    Sorry, it just bugs me whenever people try to talk about theory of CS and use "non-polynomial" or something
- Re:Race conditions are nasty ... (Score:2, Insightful)
  
  by Mr. Slippery ( 47854 ) writes:
  
  it's theoretically impossible to write a general test to find all race conditions in code. This is a variant of the Halting Problem.
  
  I doubt PGN was referring to software to test for race conditions; I expect he was alluding to methods for writing code that does not contain them. People have, after all, been thinking about Dining Philosophers [mtu.edu] for quite a while now, yet coders still do amazingly stupid things with threads.
- Mutexes and Locks (Score:2)
  
  by Detritus ( 11846 ) writes:
  
  It isn't that difficult for most common cases. You just put mutex semaphores or locks on shared data structures.
  You need programmers with a good background in real-time and concurrent programming, who understand the hazards and how to avoid them.
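A minimal sketch of the "put a lock on the shared data structure" approach, in Python. The `AlarmLog` class and the `writer` helper are hypothetical names made up for illustration; nothing here is taken from the XA/21 code.

```python
import threading


class AlarmLog:
    """Hypothetical shared structure: a list of messages guarded by a lock."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = []

    def append(self, message):
        # Only one thread at a time may mutate the list.
        with self._lock:
            self._entries.append(message)

    def snapshot(self):
        # Copy under the lock so readers never see a half-updated list.
        with self._lock:
            return list(self._entries)


def writer(log, name, count):
    # Each writer thread appends `count` uniquely named entries.
    for i in range(count):
        log.append(f"{name}-{i}")


log = AlarmLog()
threads = [threading.Thread(target=writer, args=(log, f"t{n}", 1000)) for n in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(log.snapshot()))  # 4000
```

This only covers the common case of one lock around one structure. Once several locks are involved, lock ordering and deadlock become the hard part.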
  - Re:Mutexes and Locks (Score:2)
    
    by Tony-A ( 29931 ) writes:
    
    You need programmers with a good background in real-time and concurrent programming, who understand the hazards and how to avoid them.
    
    Agreed. Including all the places that look innocent but are capable of encountering such hazards. Including the pathological cases where innocent-looking code can have extremely evil consequences. Including code that looks dangerous but is in fact safe. Including code that looks safe but is in fact dangerous.
- - Re:Race conditions are nasty ... (Score:2)
    
    by Tony-A ( 29931 ) writes:
    
    It's easy to avoid race conditions:
    
    Right. Just one step at a time.
    
    Unfortunately, the real world is asynchronous and it doesn't really work to say "Stop the world, I've got some computing to do".
    
    it's also easy -- quite seductively easy -- to try to write excessively complex, multithreaded systems that are too complicated for you or anyone else to understand,
    You're right, but methinks you understate the case.
Software ENGINEERING (Score:4, Interesting)

by Anonymous Coward writes: on Saturday April 10, 2004 @03:39PM (#8826112)

If I want to build a large structure (bridge or building) where it is possible that public safety is at issue, I had better have an engineer's signature on the drawings.

This case seems like a real good argument for having the same requirement for software.

Good engineering practice would probably have prevented this. A simple example of such a system would be a burglar/fire alarm panel. The system is self-checking. If any part of the system isn't working (e.g., someone cuts a wire), then that causes an alarm.

I realize that there will be strange undetectable bugs in software but if the system as a whole is properly engineered, the system will fail gracefully and safely.

- Re:Software ENGINEERING (Score:4, Interesting)
  
  by Orne ( 144925 ) writes: on Saturday April 10, 2004 @04:45PM (#8826478) Homepage
  
  The two systems you describe are fundamentally different from the design of this alarming system. In fire or safety, the "reading" is the voltage of the closed loop wire itself; 12 volts connected, 0 volts open.
  
  Now imagine if you have a layer in between; you want to monitor the fire status of a complex of warehouses from a single room several miles away. Run the signals from all of the individual buildings through analog-to-digital conversion, transport the data to a common computer, and view the data there. Figure you have several hundred buildings you're watching at once, and now you're getting closer in scale to how the grid dispatchers get their data.
  
  Now imagine that the computer's software back at the main station reads all these meters, and if a line's open (say you're tracking window openings for security), it writes an alarm to a text log on the screen; on a good day, you don't get any alarms. Now suppose the driver that writes the alarms to the screen hangs; since you weren't expecting any alarms, you're not that concerned that you aren't seeing anything. That's pretty much what caught FirstEnergy for those 3 hours that afternoon, while the system was failing and they didn't realize they needed to act.
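One way to picture the gap described above: an empty alarm list only means "all clear" if the alarm writer is known to be alive. The sketch below is a hypothetical illustration in Python; `panel_status`, the heartbeat idea, and the 10-second limit are all assumptions, not details of FirstEnergy's system.

```python
def panel_status(active_alarms, last_heartbeat, now, max_age=10.0):
    """Classify what the operator should see.

    active_alarms:  list of alarm strings currently raised
    last_heartbeat: time (seconds) the alarm writer last reported in
    now:            current time (seconds)
    max_age:        hypothetical staleness limit in seconds
    """
    # A silent writer is reported as STALE, never as "no alarms".
    if now - last_heartbeat > max_age:
        return "STALE"
    if active_alarms:
        return "ALARM"
    return "OK"


print(panel_status([], last_heartbeat=100.0, now=105.0))             # OK
print(panel_status(["line open"], last_heartbeat=100.0, now=105.0))  # ALARM
print(panel_status([], last_heartbeat=100.0, now=500.0))             # STALE
```

The point of the third state is that "quiet because nothing is wrong" and "quiet because the writer hung" stop looking identical on the screen.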
  
- Re:Software ENGINEERING (Score:3, Insightful)
  
  by Kirill Lokshin ( 727524 ) * writes:
  
  the system will fail gracefully and safely.
  
  A mission-critical system, by definition, cannot fail "safely", since it must not fail at all.
- Re:Software ENGINEERING (Score:4, Interesting)
  
  by sjames ( 1099 ) writes: on Saturday April 10, 2004 @05:52PM (#8826867) Homepage Journal
  
  At first glance, that can seem like a good idea, but are you prepared to pay for that signoff from each engineer whenever you install a piece of software?
  
  A PE signs off on each particular instance of a design taking intended use, site and other construction into account. If you then build elsewhere, you need a new signoff. If you make any significant change (including adding other structural elements to the design, that is, installing more software), you'll need a new signoff. Add a new network driver, another signoff. Upgrade the CPU? You guessed it!
  
  Some software is poorly designed and crash prone. Other software is well designed but cannot be signed off on because it might be installed on nearly anything that pretends to be a compatible platform.
  
  The one justification for that sort of signoff is in situations where a bug will kill someone. Even then, the system should be divided into critical and auxiliary parts to limit what must be signed off on.
  
  Autopilots work that way. You have a small and relatively simple part that assures safe conditions, is extensively tested, and rarely changed. Another portion is more frequently updated, attempts to optimize the flight and provides a nicer interface. The latter can fail completely and the plane will continue to fly (possibly with poor fuel economy and the pilot navigating manually, but it won't fall out of the sky).
  
  There are many tradeoffs. In some sense, many small distributed systems are more robust than centralized control. However, it's a lot easier to create a chaotic system that way. If you do, you won't know until the system falls into a weird state without warning.
  
- Re:Software ENGINEERING (Score:3, Informative)
  
  by iabervon ( 1971 ) writes:
  
  One issue is that there is no safe state for the system to go to if the control system breaks down. Bringing the power grid in an area down safely is as hard as bringing it up safely (which, if you remember, took a while) and is harder than just keeping the system running.
  
  The system is full of inductors, whose voltage drop is determined by the change in current through them. If you disconnect a transmission line, suddenly you're trying to change the current to 0, which puts all of the inductors at whatever
Additional Information (Score:4, Interesting)

by Orne ( 144925 ) writes: on Saturday April 10, 2004 @03:47PM (#8826145) Homepage

Oddly enough, while writing a comment to another user's message, I threw some info into Google to learn about FirstEnergy's EMS system, and found this other SecurityFocus story [securityfocus.com] from February 2004, which gives more raw facts than this newer story.

"DiNicola said Thursday that the company, working with GE and energy consultants from Kema Inc., had pinned the trouble on a software glitch by late October and completed its fix by Nov. 19..."

"With the software not functioning properly at that point, data that should have been deleted were instead retained, slowing performance, he said. Similar troubles affected the backup systems. " This dovetails well with why the testers had to "slow" their testing to make the race condition appear.

342 years of online operational hours? (Score:3, Insightful)

by VoidEngineer ( 633446 ) writes: on Saturday April 10, 2004 @03:51PM (#8826169)

So, as far as I can figure, there are 24 hours in a day, and 365 days in a year, which equals about 8760 hours in a year (give or take).

Now then, 3 million hours divided by 8760 hours per year equals approximately 342 years, modulo 4080 hours (i.e. approximately 170 days...).

Now then... how the hell do they get the idea that they've been up-and-running for 342 years? Are they counting things in parallel? Even if they were counting end-user operational hours, the number should at least be a couple orders-of-magnitude higher, no?

3M online operational hours sounds like fuddy-duddy accounting to me... although, obviously I haven't looked over the books. I would be interested to see how they came up with this number.
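For anyone who wants to check the arithmetic, here is a short Python sketch. The figure of 100 installations is purely a made-up number to show how counting systems in parallel changes the picture; the article does not say how many systems the 3 million hours were spread over.

```python
HOURS_PER_YEAR = 24 * 365  # 8760, ignoring leap years

total_hours = 3_000_000
years, leftover_hours = divmod(total_hours, HOURS_PER_YEAR)
print(years, leftover_hours, leftover_hours / 24)  # 342 4080 170.0


def calendar_years(total_hours, installations):
    """Calendar years needed if `installations` systems run in parallel."""
    return total_hours / (installations * HOURS_PER_YEAR)


# Hypothetical install count, for illustration only.
print(round(calendar_years(total_hours, 100), 2))  # 3.42
```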

- Re:342 years of online operational hours? (Score:4, Interesting)
  
  by Creepy Crawler ( 680178 ) writes: on Saturday April 10, 2004 @03:58PM (#8826199)
  
  342/x
  
  x = "how many reactors they have in operation"
  
Testing vs RTFS. Proprietary vs open. (Score:5, Insightful)

by SharpFang ( 651121 ) writes: on Saturday April 10, 2004 @04:02PM (#8826227) Homepage Journal

if(int(rand()*1e20)==31337){
    blow_up();
} else {
    do_your_work();
}

I can't imagine any amount of testing of proprietary software that could reveal this example of malicious code. In open source, one look at the code will reveal it. Of course not all cases are so obvious, but reading the code should always be used together with "testing the software". How do you know that lots of proprietary, closed-source software isn't, e.g., a gateway for terrorists? How do you know the biggest companies' stuff isn't all trojans? It wouldn't be hard to hide it. Say your software is a kind of server. It does its job okay unless it receives TCP packets starting with a certain string. Then it just executes the commands contained after that string. Boom. No amount of -testing- will reveal this.
And there are bugs that can be triggered once in several billion cases. Only looking at the code could fix them and explaining "we did a lot of tests" is bullshit.
I put a lot of iron, gum, different materials, C4, glass and some more together and it goes, I call it "a car" and I rode 1000's of kilometers okay. Now no amount of testing in all road conditions will reveal it contains the C4 explosives. Looking under the hood will reveal it really fast.
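To put a rough number on "no amount of testing", here is a small Python sketch of the odds. It is an idealized model, assuming each test run is independent and hits the trigger with probability 1 in 1e20, as the snippet above suggests; `hit_probability` is a hypothetical helper name.

```python
import math


def hit_probability(p, trials):
    """Probability that at least one of `trials` independent runs hits a
    trigger that fires with probability `p` per run: 1 - (1 - p)**trials."""
    # log1p/expm1 keep precision when p is tiny.
    return -math.expm1(trials * math.log1p(-p))


p = 1e-20  # one specific value out of 1e20
print(hit_probability(p, 10**9))   # roughly 1e-11 after a billion test runs
print(hit_probability(p, 10**15))  # roughly 1e-5 after a million billion runs
```

Under those assumptions, even a million billion test runs would be expected to miss the trigger almost every time, while a reader of the source would spot it at once.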

"We test exhaustively..." (Score:4, Insightful)

by Fratz ( 630746 ) writes: on Saturday April 10, 2004 @04:12PM (#8826286)

Um, no you don't. By definition, if you tested exhaustively, you'd have found everything that could possibly go wrong with whatever you tested.
I'm not saying it's always feasible to test exhaustively, but don't say you did when you clearly didn't.
Also: "we had in excess of three million online operational hours in which nothing had ever exercised that bug"
Taken with the "exhaustively" statement, I'm thinking that whoever said these things doesn't understand QA very well. It's easy to write code that works well when everything's good, and it's often just as easy to test that. It's another thing entirely to write code that works well (or fails gracefully) when everything's wrong. And again, it's harder to test that.

- Statistics (Score:2)
  
  by Detritus ( 11846 ) writes:
  
  Exhaustive testing, however you wish to define that, can reduce the number of defects in the code, but it isn't going to eliminate them in a complex system. The number of defects found per unit of test time follows a predictable curve, where each new defect found requires more test time. It's like accelerating to the speed of light, the closer you get to 0 defects, the more test time is needed.
  - Re:Statistics (Score:2, Informative)
    
    by chgros ( 690878 ) writes:
    
    Exhaustive testing, however you wish to define that
    Exhaustive \Ex*haust"ive\, a.
    Serving or tending to exhaust; exhibiting all the facts or arguments; as, an exhaustive method. Ex*haust"ive*ly, adv.
    
    Basically, it should mean you've tested everything (which is of course impossible in most cases).
    The term usually used (and rightfully so) is extensive testing.
    - Re:Statistics (Score:3, Funny)
      
      by aardvarkjoe ( 156801 ) writes:
      
      Maybe it just means that they got very tired while testing the software?
A perfect example (Score:2)

by maximilln ( 654768 ) writes:

This probably would've been prevented if they had compiled using -O3 and -march=athlon-xp.

Someone said "always go with package installs" and that person had more seniority.

Unum. 'I'm not sure that more testing would have revealed that. Unfortunately, that's kind of the nature of software... you may never find the problem. I don't think that's unique to control systems or any particular vendor software.'
bugs are not inevitable (Score:4, Insightful)

by ummit ( 248909 ) writes: <scs@eskimo.com> on Saturday April 10, 2004 @06:21PM (#8827029) Homepage

We test exhaustively... I'm not sure that more testing would have revealed that.
For an obscure race condition, this is undoubtedly true.
Unfortunately, that's kind of the nature of software... you may never find the problem.
This is sorta true, sorta false, and definitely misleading.
I don't think that's unique to control systems or any particular vendor software.
No, it's not unique; bugs that may never be found are rampant in most varieties of software. What's false -- tragically, crushingly false -- is the presumption that these unfindable bugs are therefore inevitable. They are not.
If there's a class of bugs that's hard to test for -- and of course there are many such classes -- the prudent thing to do is to find development methodologies that skirt those bugs entirely. If you don't put in so many bugs in the first place, you obviously don't have to work so hard trying to find and fix them.

Why isn't this Open Source? (Score:2)

by chris_sawtell ( 10326 ) * writes:

While I can understand that one does not necessarily want every Tom, Dick, and Henrietta checking changes into the current CVS branch, software which is created to reliably serve the General Public's need on a 24/7 basis, should be available for the said General Public to at least examine and critique. This would create not only the much needed conduit between Industry and Academia, but also the background 'body of literature' which is so essential to all learning. It would also vastly improve the code qual
- Re:The problem with SCADA systems (Score:5, Interesting)
  
  by Vancorps ( 746090 ) writes: on Saturday April 10, 2004 @03:13PM (#8825959)
  
  This all reminds me of the movie Resident Evil where they shut down power and all the doors unlock when power is restored.
  You bring up a great point about failure states. I work for several large hotels and the fire control systems are the ones that alert whenever there is any problem of any kind largely because any problem of any kind needs to be addressed immediately so it makes sense.
  I would think power systems would think along the same lines, since the odds are ANY failure whatsoever needs the immediate attention of the engineers that maintain the system. This is not a requirement for all software, but when it comes to such critical services why doesn't everybody follow the same practice? It seems so blatantly obvious that alarms should have been raised.
  Also, in situations where you don't work on a live environment you can always create a test environment that is for all intensive purposes "live". For web development work I do I have a testing domain which is used to test sites to ensure that because they work here in my lab they will work when I hand them off to the client. It's 100% accurate, I've seen it done with countless other systems, so why wasn't it done here?
  
  - Permanent Alarms (Score:2)
    
    by Detritus ( 11846 ) writes:
    
    That's assuming the faults get fixed. I've seen buildings with the new fancy computerized fire alarm systems where alarms for sensor and wiring faults get ignored for months.
    - Re:Permanent Alarms (Score:2)
      
      by Vancorps ( 746090 ) writes:
      
      In my experience it's impossible to ignore since it sets off the fire alarms throughout the building and calls the fire department, who have to come out and investigate. Only they are allowed to shut it off; if you do, then you get yourself a nice hefty fine. I guess it's not like that everywhere, from the sounds of it.
  - Re:The problem with SCADA systems (Score:5, Insightful)
    
    by miu ( 626917 ) writes: on Saturday April 10, 2004 @05:22PM (#8826666) Homepage Journal
    
    For web development work I do I have a testing domain which is used to test sites to ensure that because they work here in my lab they will work when I hand them off to the client. It's 100% accurate, I've seen it done with countless other systems, so why wasn't it done here?
    Mostly because web systems are still toys compared to real systems.
    These systems get real and very intensive testing in labs as close to live as they can get. Even once they knew the conditions and affected subsystems it took the dev and testing teams months to recreate this bug in the lab. The lab is never just like real life, it cannot be - because even real life now is not always the real life of 10 seconds ago.
    
    - Re:The problem with SCADA systems (Score:2, Informative)
      
      by wintermute1974 ( 596184 ) writes:
      
      I agree. These SCADA systems can become quite complex. If you are interested, you can even read General Electric's brochures [gepower.com] for the XA/21 system.
    - Re:The problem with SCADA systems (Score:3, Interesting)
      
      by Vancorps ( 746090 ) writes:
      
      Web systems were but one example. I'll throw in another, much more complex example. Take DNA from bacteria and splice it with stem cells to produce nerves much more resistant to damage. You are talking thousands upon thousands of long protein strands, most of which you have no idea what perform what task. Do this without destroying the cell. When you are done with that test you move on to a more complex test until ultimately you are ready to do it with humans, at which time you can accurately predict exactly
      - Re:The problem with SCADA systems (Score:3, Interesting)
        
        by miu ( 626917 ) writes:
        
        When it comes to troubleshooting systems you always have the option of making an exact scale model. You scale it up for more precision. This is a simple concept and apparently a lot of people think just because a system is complex and antiquated the same ideas can't apply.
        Even if you could create a model to test with that is identical to the live system you cannot test every possible situation which can occur in the real world. Integration testing can only test those things which can be envisioned by th
  - Re:The problem with SCADA systems (Score:2, Interesting)
    
    by Fishead ( 658061 ) writes:
    
    Wasn't Chernobyl taken out by a test gone bad?
    
    Testing is all fine and good, but there are always going to be instances where something will remain undetectable for years until circumstances are just right (wrong?)
    
    I am a technician at a plant that makes batteries and we see this all the time.
    
    I remember one time where an operator was cleaning a conveyor with a cloth soaked in Methanol (standard procedure) but forgot about the rag he had left on the underside of the running conveyor. Once the Meth had all
  - - - Re:The problem with SCADA systems (Score:2)
        
        by Vancorps ( 746090 ) writes:
        
        Every test you can run will behave exactly the same in the real environment. I'm not sure why that is even remotely hard to understand.
        
        Re:The problem with SCADA systems (Score:2)
        
        by EddWo ( 180780 ) writes:
        
        It should be "for all intents and purposes", that's what he's getting at. Never mind, you're not the first person to write it as they hear it.
        
        Re:The problem with SCADA systems (Score:2)
        
        by Vancorps ( 746090 ) writes:
        
        Yeah, I figured that is what it was, but language is about communicating an idea and the point got across so it is considered acceptable. The meaning still fits my purpose even though it is not the common saying.
        
        You can't simulate the real world (Score:2)
        
        by A nonymous Coward ( 7548 ) * writes:
        
        No matter how fancy your testing system, the real world has more connections, more idiots with fingers on keyboards, more feet tripping over cables, more weather knocking out transformers and lines, more everything.
        
        I'm not sure why that is even remotely hard to understand.
        
        Re:You can't simulate the real world (Score:3, Insightful)
        
        by Vancorps ( 746090 ) writes:
        
        Forgot rats, for some reason they like to chew cables.
        Now, for an example. I stress tested a database I am in the process of building for Mercedes, I made the machine come to a crawl. I did it to a dual cpu server, a quad cpu server, and a 16 cpu server. Guess what? They all behaved exactly the same as the system grew. Now scale it up to the DB/2 cluster that it will actually be working on. I do the same thing and guess what? Yep, the exact same result.
        
        If testing fails to produce an outcome that brings a
        
        Testing cannot guarantee systems (Score:3, Insightful)
        
        by Goonie ( 8651 ) * writes:
        
        If testing fails to produce an outcome that brings a fault then there is a flaw in the testing procedure. The real world can have more connections, but I don't care, software can be 100% bug free.
        
        The first thing they teach you in a software testing course is that testing cannot guarantee the absence of bugs. The only way you can guarantee, through testing alone, that your program is error-free is to exhaustively test every possible "input" (combination of external inputs and internal state) and check
- Fixing with words what needs engineering. (Score:2)
  
  by Futurepower(R) ( 558542 ) writes:
  
  From the Slashdot story: "Unfortunately, that's kind of the nature of software... you may never find the problem."
  
  What the parent poster said sounds right. The GE spokesperson is just trying to fix with bullshit what should be fixed with engineering.
- Clocks (Score:3, Interesting)
  
  by Detritus ( 11846 ) writes:
  
  That's one reason that I like to put UTC clocks on displays. A quick glance at the clock will tell you if the display subsystem has crashed.
  I'm also a big fan of watchdog timers. The process that periodically resets the timer can make all sorts of health and sanity checks.
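A minimal software sketch of the watchdog idea, in Python. The `Watchdog` class is a hypothetical illustration, with time passed in explicitly so the behaviour is easy to follow; a real system would more likely use a hardware timer or a monotonic clock such as `time.monotonic()`.

```python
class Watchdog:
    """Hypothetical software watchdog: trips if not kicked within `timeout` seconds."""

    def __init__(self, timeout, now):
        self.timeout = timeout
        self.last_kick = now

    def kick(self, now, healthy=True):
        # The monitored process calls this periodically. The timer is only
        # reset if the process's own health and sanity checks passed.
        if healthy:
            self.last_kick = now

    def expired(self, now):
        # True once too much time has passed since the last good kick.
        return now - self.last_kick > self.timeout


wd = Watchdog(timeout=5.0, now=0.0)
wd.kick(now=3.0)
print(wd.expired(now=6.0))       # False: last good kick was 3 seconds ago
wd.kick(now=7.0, healthy=False)  # sanity check failed, timer not reset
print(wd.expired(now=9.0))       # True: last good kick was 6 seconds ago
```

Whatever reads `expired()` has to live outside the monitored process, otherwise a hang takes the watchdog down with it.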
  - Re:Clocks (Score:2, Interesting)
    
    by corngrower ( 738661 ) writes:
    
    A watchdog timer on the alarm system (that was deadlocked) would probably have prevented this scenario. And I also agree that displaying clocks on the screen is a good way for the operator to see if the display system is functioning properly.
    
    Not having the display system give a visual indication of stale data was also a deficiency.
    
    There also seems to have been a problem in that the data collection and monitoring portion of the system was held up by a malfunctioning alarm system.
- Re:The problem with SCADA systems (Score:2)
  
  by rand.srand() ( 243903 ) writes:
  
  Having been at the plant where they make/support the XA/21, it's no wonder the thing failed. In the last few years they've axed the entire support crew, and tried to sub it out to recent high school grads. The last few good people worked really hard but couldn't document a single thing in the pressure to release systems.
  
  As for the updates and how that works, etc, the XA/21 system uses RTU's in the field which are basically 1200 baud modems with some instrumentation and a simple controller. They call back i
- Re:The problem with SCADA systems (Score:3, Interesting)
  
  by Kirill Lokshin ( 727524 ) * writes:
  
  This is exactly the same as software in my industry (HVAC fire/security systems for large buildings), where if you lose communication to a subsystem or the field, you have to raise alarms all over the place.
  
  And perhaps the software in question also tries to do that. However, there are any number of reasons it could still fail.
  
  Consider the following scenario: one software component (a process, if you will) is responsible for synchronizing the data between the remote testing station and the local data st
- Re:The problem with SCADA systems (Score:2, Interesting)
  
  by spurdy ( 590954 ) writes:
  
  You make a good point, but in my company, we have hundreds of data points reporting continuously. When the communications (telephone company) fails, which it does multiple times every day, you end up with wrong data temporarily. If the operator had to investigate every comm failure, he'd never get anything else done. So, there has to be a threshold somewhere of when does a problem reach a level that it needs to generate an alarm.
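One simple form such a threshold could take is a grace period: ignore short communication dropouts and only alarm once the outage has lasted long enough. The Python sketch below is a hypothetical illustration; `should_alarm` and the 30-second grace period are assumptions, not anything described in the thread.

```python
def should_alarm(failure_started, now, grace_period=30.0):
    """Return True once a comm failure has persisted for at least grace_period.

    failure_started: time (seconds) the link went down, or None if the link is up
    now:             current time (seconds)
    grace_period:    hypothetical threshold in seconds
    """
    if failure_started is None:
        return False
    return now - failure_started >= grace_period


print(should_alarm(None, now=100.0))  # False: link is up
print(should_alarm(95.0, now=100.0))  # False: only 5 seconds of outage
print(should_alarm(60.0, now=100.0))  # True: 40 seconds of outage
```

Picking the grace period is the real design question: too short and the operator drowns in nuisance alarms, too long and stale data goes unnoticed.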
- Re:The problem with SCADA systems (Score:2, Interesting)
  
  by fermion ( 181285 ) writes:
  
  It kind of depends on how often the out of data conditions occur and how long they occur. My understanding is that the design of proper alarms is actually a complicated security issue, and improper alarms leads to less effective security.
  For example, I once worked at a place with many many Windows web servers. Every time a server failed, an alarm would sound. But the reason we used Windows servers is that they were dirt cheap so we could buy enough to compensate for the expected frequent failures. The r
- Blame Game (Score:2)
  
  by The Monster ( 227884 ) writes:
  
  We all know it was Microsoft's fault...Blaster Worm?
  
  If you want to know the truth, ask former White House Cyberterrorism expert Richard Clarke. He'll tell you that he had been warning both the Clinton and Bush administrations about this, and although Clinton's team had approved a plan to deal with the menace (but never actually got around to implementing it), none of Bush's senior aides listened to him, and instead wanted to do a pre-emptive strike on Kazaa to eliminate Weapons of Mass Distribution. It
- Re:The real problem (Score:3, Informative)
  
  by sjames ( 1099 ) writes:
  
  From the reports I have seen, other than FE, the various companies did take appropriate action and shed load where necessary, it's just that the situation developed too quickly (from their perspective) and was too large to save by the time they could see it.
  
  The problem was that the grid was running too close to capacity in general. Since the electricity is traveling as fast as any control signal could, it is necessary for the system to be able to tolerate whatever condition may exist long enough for syst
- Re:two words: formal methods (Score:2)
  
  by Tony-A ( 29931 ) writes:
  
  take this formal description and produce a rigorous proof of some property, e.g., that some state is never reached ... and then have the system go berserk when that state is reached.
  
  The problem is that while you can get a rigorous proof (Wasn't the parallel postulate "proved" in the 13th century or so?) of the formal description, you have nothing remotely like a proof, formal or otherwise, that the formal description actually matches reality.


Re:Permanent Alarms (Score:2)

Re:The problem with SCADA systems (Score:5, Insightful)