Comment: Re:Why? (Score 1) 297

by georgewilliamherbert (#39564937) Attached to: Ask Slashdot: How Do You Test Storage Media?

RAID is toast. I don't care WHAT RAID you are running, none of them can withstand a loss of 50% of the drives.

Really? I used to do that as a routine acceptance test for clusters. The only times it failed for real were when we'd screwed something up.

For that to work, you have to rigorously separate RAID mirrors into their own trays so that a whole tray failure (or cable, as you said) only takes one mirror down. For something like RAID 10, 50, or 60, you just make sure all of one side is on one array and all of the other side is on another (or, if you have more than two arrays, that you separate them into pairs, with one array of each pair used for each side).

Physical separation helps as well, so that you don't accidentally unplug A while starting servicing on B. That exact scenario is one of the canonical HA oopses.
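
As a rough sketch of the tray-separation rule above, a layout check might look like the following. The drive names, tray labels, and both helper functions are hypothetical, made up for illustration; a real check would pull the layout from whatever your array tooling reports.

```python
# Hypothetical layout: each drive name maps to the tray (or array) it sits in.
drive_tray = {
    "d0": "tray_a", "d1": "tray_b",
    "d2": "tray_a", "d3": "tray_b",
}

# Hypothetical mirror pairs: the two sides of each RAID 1 set.
mirrors = [("d0", "d1"), ("d2", "d3")]


def mirrors_sharing_a_tray(mirrors, drive_tray):
    """Return the mirror pairs whose two sides sit in the same tray.

    An empty list means losing any single tray takes down at most
    one side of every mirror.
    """
    return [
        (left, right)
        for left, right in mirrors
        if drive_tray[left] == drive_tray[right]
    ]


def survives_tray_loss(mirrors, drive_tray, lost_tray):
    """True if every mirror still has at least one drive outside lost_tray."""
    return all(
        any(drive_tray[d] != lost_tray for d in pair) for pair in mirrors
    )
```

Running `survives_tray_loss` once per tray is the paper version of the pull-the-whole-tray acceptance test; it does not replace actually pulling the tray.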

Comment: Re:Why? (Score 1) 297

by georgewilliamherbert (#39564875) Attached to: Ask Slashdot: How Do You Test Storage Media?

Also: HARDWARE RAID CARDS.

I can't stress that enough. software and semi-software raid is a joke.

Not until the hardware fails and you need the data that was on there but not on the backup (or realized the backup failed a long time ago...).

For performance, yes, hardware is fastest. For reliability though, software RAID is better (hardware RAID can have interesting firmware version issues).

Old SAN / Cluster folks believe in belt+suspenders. That is, they often use both.

Use Software RAID 1 across a couple of LUNs (or separate controllers / drive array stacks, for non-SAN environments). Build the LUNs with internal RAID (5, 6, hot spares, figure out your rebuild times, etc.)

Also, a hugely common failure is that the operators aren't properly monitoring the underlying hardware RAID drive status. You need to know immediately when a drive fails, even if there's RAID6 and a couple of hot spares in the array. When I worked for a VAR on clusters, I can't count the number of times I arrived and found that they'd had 2, 3, 4 failures nobody noticed, and were one more failure away from catastrophic data loss...
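
A minimal sketch of the kind of check that catches this, assuming you already have some way to read drive status out of the controller. The `ArrayStatus` record and its fields are hypothetical, not any vendor's API; the point is only that the alert fires on the first failure, not on the last one.

```python
from dataclasses import dataclass


@dataclass
class ArrayStatus:
    """Hypothetical snapshot of one hardware RAID set."""
    name: str
    fault_tolerance: int  # drive losses the RAID level survives (RAID 6 -> 2)
    failed_drives: int    # failed members not yet rebuilt onto a spare
    spares_left: int      # unused hot spares remaining


def remaining_tolerance(status):
    """Further drive failures the set can absorb right now without data loss.

    Ignores rebuild time: a spare only helps once its rebuild completes.
    """
    return status.fault_tolerance - status.failed_drives


def needs_attention(status):
    """Alert on any unrebuilt failure or exhausted spares, not only on data loss."""
    return status.failed_drives > 0 or status.spares_left == 0
```

A RAID6 set with two unrebuilt failures and no spares left reports a remaining tolerance of 0, which is the "one more failure away" state described above.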

Comment: Re:Why? (Score 2) 297

by georgewilliamherbert (#39564761) Attached to: Ask Slashdot: How Do You Test Storage Media?

There is a very slight bathtub-type curve. All numbers rounded, it's about 3% AFR in the first quarter (i.e. about 0.75% of drives failing in the first quarter) and 2% AFR for drives in the 3-12 month range (i.e. about 1.5% failing over those nine months). If I read the statistics presentation there right, about 33% of first-year failures happen in the first quarter, which is a detectable but minor elevated initial rate. That's dwarfed by the 1-2 year AFR (about 8%) and the 2-3 year AFR (about 9%); the rate drops slightly after that.

They presented the AFRs rather than the cumulative losses in an initial cohort per quarter/year, which would have been slightly clearer, but whichever way they did the analysis it's about like that.
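
For anyone who wants to redo the arithmetic, here is a small sketch that converts the rounded AFRs above into per-period failures and cumulative cohort losses. It assumes a simple linear approximation (AFR times period length) within each period and compounds survival across periods; that compounding is my assumption, not something taken from the presentation.

```python
# Rounded AFRs quoted in the comment above, with period lengths in years.
PERIODS = [
    ("0-3 months", 0.03, 0.25),
    ("3-12 months", 0.02, 0.75),
    ("1-2 years", 0.08, 1.0),
    ("2-3 years", 0.09, 1.0),
]


def period_failure_fraction(afr, years):
    """Fraction of drives entering a period that fail during it.

    Linear approximation (AFR times period length), which is what the
    0.75% and 1.5% figures above use.
    """
    return afr * years


def cumulative_losses(periods):
    """Cumulative fraction of the original cohort lost by the end of each period."""
    surviving = 1.0
    results = []
    for name, afr, years in periods:
        surviving *= 1.0 - period_failure_fraction(afr, years)
        results.append((name, 1.0 - surviving))
    return results


first_quarter = period_failure_fraction(0.03, 0.25)  # about 0.0075
rest_of_year = period_failure_fraction(0.02, 0.75)   # about 0.015
share_in_first_quarter = first_quarter / (first_quarter + rest_of_year)

for name, lost in cumulative_losses(PERIODS):
    print(f"{name}: {lost:.1%} of the original cohort lost")
```

Under those assumptions the first quarter accounts for about a third of first-year failures, matching the 33% figure, and roughly 18% of the original cohort is gone by the end of year three.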

Comment: Re:Why? (Score 1) 297

by georgewilliamherbert (#39564651) Attached to: Ask Slashdot: How Do You Test Storage Media?

I have worked for an OEM that installed about 30,000 drives a year and for end users with 10,000-drive environments, and I built out new 1,000-HDD and 600-SSD environments in the last year. I know all about static, having had the manufacturer-level training on how not to zap.

It's not just static. Some drives come with SMART errors (or bad blocks that matter), despite $MFGR assurances. Some of the failures develop in the factory and get shipped anyways as unlikely to get worse, some develop while being packaged or shipped or unpackaged. Run SMART data collection across hundred-drive collections (or thousands or more) and you get a lot of useful and scary info.

Also, there are well documented runs of drives - specific models, time ranges, factories involved etc - which all just blew up. Also happens to chips sometimes - I've been seriously bit by bad CPUs by Sun and Intel, support chips from several vendors. Also RAM going bad.

One prototype CPU literally melted the system down: all the plastic nearby inside the casing melted and puddled on the bottom of the case, and the CPU label plastic was carbonized.

Comment: Re:*SMOOTCH!* Buh-bye Enterprise! (Score 2) 165

by georgewilliamherbert (#35645600) Attached to: Intel Replaces Consumer SSD Line, Nixes SLC-SSD

Doubling lifespan that way requires that you only use half the disk capacity.

I have burned out a Major Name Brand SLC SSD with a high traffic OLTP DB in eight months. I have heard the same from Large Internet Companies which tested these for internal use. There are ongoing independent reliability expert studies in FAST, HOTDEP, other conferences which are uniformly highly skeptical of vendors' claims on SSD lifetime.

If you have not actually tested the drive out to six years of service, run an accelerated pilot test unit out ahead of your main prod usage to give you the canary warning.

NetFlix outage

Submitted by Anonymous Coward
An anonymous reader writes "As of a bit after 7 pm EDT, the NetFlix site started to experience problems, going from being completely unreachable to intermittent responses, and back down to being unreachable. Given the outage pattern, it is likely that an outage on a limited number of servers caused a cascading outage when the remainder of the servers could not handle the combined load. No information seems to be available at this point concerning the expected duration of the outage."

Comment: Re:The tried & trusted will still rule the ser (Score 1) 237

by georgewilliamherbert (#35410494) Attached to: Hard Disk Sector Consolidates Amid Uncertain Future

I've tried to do large database server farm tests on modern enterprise SSDs with TRIM, the best wear load leveling, SLC, etc. They go "poof" at moderate (few months, for my loads) lifetimes.

IOPS x Lifetime / price is a metric I find useful. Unfortunately, it makes SSD look even worse than it does just on a price basis 8-(
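
The metric is trivial to compute; a sketch follows. The device names and numbers are placeholders only, not measurements of any real product, and the lifetime you plug in should be what you observed under your own load rather than the vendor's rating.

```python
def iops_lifetime_per_dollar(iops, lifetime_years, price):
    """IOPS x lifetime / price; higher is better."""
    if price <= 0:
        raise ValueError("price must be positive")
    return iops * lifetime_years / price


def rank_devices(devices):
    """Sort (name, iops, lifetime_years, price) tuples, best metric first."""
    return sorted(
        devices,
        key=lambda d: iops_lifetime_per_dollar(d[1], d[2], d[3]),
        reverse=True,
    )


# Placeholder numbers only, chosen to make the arithmetic easy to follow.
candidates = [
    ("device_a", 1000, 2.0, 500.0),
    ("device_b", 4000, 0.5, 1000.0),
]

for name, iops, years, price in rank_devices(candidates):
    print(name, iops_lifetime_per_dollar(iops, years, price))
```

With these placeholders, the device with four times the IOPS still ranks lower because its lifetime is a quarter as long and it costs twice as much.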

Comment: Re:The tried & trusted will still rule the ser (Score 1) 237

by georgewilliamherbert (#35409982) Attached to: Hard Disk Sector Consolidates Amid Uncertain Future

Not really improved. I burned out a REALLY GOOD (best available) SLC SSD in 7 months with a mirrored production workload at a previous jobsite not that long ago.

Poof. All gone.

At the FAST conference there was yet another presentation on SSD lifetime burnout mechanisms, and the news on lifetime is not actually improving in the slightest so far. SLC is not good enough; MLC is toast in write-intensive apps.

Phase-change memory or one of the others, with millions of write cycles per bit, may pull this out, but Flash is not proving good enough for enterprises.

The Internet

Last free IPv4 blocks allocation in progress

Submitted by georgewilliamherbert
georgewilliamherbert writes "IANA has announced that the last two unrestricted IPv4 /8 network blocks were allocated today to APNIC. By preexisting agreement, to avoid timing concerns from putting any regional IP number registry at a relative disadvantage, the remaining 5 /8 blocks are now to be allocated immediately to the 5 RIRs, which will presumably happen very soon.

Though one can semantically argue whether the final 5 allocations or the last 2 free blocks represent the actual end of IANA's IPv4 allocation, today was a major milestone toward the end of new IPv4 use and the coming IPv6 future."
