
Comment Re:PSU failures (Score 1) 301

I don't have numbers, but I have witnessed it.
The worst PSU case was with some IBM SP2 systems. These have multiple redundant power supplies. However, a design/manufacturing fault in some early parts meant that they were prone to failure on power-up, and when they failed they'd trigger a failure in their brethren. I encountered this the first time when we had a rack powered off for hardware maintenance, so the good news was we already had a scheduled outage. However, what with getting replacement parts, the outage had to be extended by a couple of hours, which wasn't popular with our users.

I've also seen PSUs fail on normal tower servers when they're powered on. A case which comes to mind was, again, a server powered off for hardware maintenance. One PSU failed at power-on but fortunately this time the redundant PSU was OK.

I've seen a couple of other types of hardware problems. The most common was with disk drives. Some older SCSI disks suffered from stiction if left powered down for a prolonged period (over an hour, say). Sometimes you could revive them with a bit of physical intervention; sometimes not. The worst case was when we had our entire machine room powered off for upgrades to the power supply. When we came to restart, two servers wouldn't reboot because their boot disks wouldn't spin up, and half a dozen or so external disks failed. (Mostly these were mirrored, so it wasn't the end of the world, and the servers weren't critical.)

Several years ago we had a weird problem with a server, which took a while to identify. We got a lot of intermittent I/O and memory errors (iirc). They never actually brought the server down, but they caused some application glitches and we couldn't find the cause. Eventually, one of the engineers worked out that a connector to one of the boards hadn't been seated correctly. Every time the server was shut down, the pins would cool down and contract; when they heated up again they expanded and loosened the connector slightly, leading to the errors.

There may be something to be said for powering off idle equipment, but if so I think it's particularly important to have some redundancy built in.
