
Comment: Re:And what about... (Score 2) 444

by MetricT (#46036399) Attached to: Who Makes the Best Hard Disk Drives?

I manage a couple of petabytes of scientific data (LHC) on our own object filesystem, and at that scale, RAID really isn't an option any more: with that many drives you will, with unacceptable frequency, manage to have two drive failures at once.

All our new data is being stored with Reed-Solomon 6+3 redundancy. And I greatly look forward to the day when a drive can fail at 3 am and I don't have to get paged to repair it.

And Seagate well and truly sucks. Not only do they have an unacceptably high failure rate, but they have some pretty annoying non-complete failure modes, like firmware bugs causing the drive to hard-lock, and the only way to get them back is to power-cycle the entire server. And they don't support TLER, so drives blipping and generating a 3 am ticket is a regular occurrence.

One other thing we learned is that Linux *really* needs a defragment utility. We started having complete, permanent slot failures. Turns out we had hundreds of drives with extreme fragmentation, and the amount of vibration the head would cause trying to read fragmented files 24x7 would destroy the slot. We have a "warmer" script that scrubs the drives for bitrot errors, and it also opportunistically defragments really fragmented files.

Comment: Cool science coming... (Score 5, Interesting) 136

by MetricT (#46029459) Attached to: CERN Antimatter Experiment Produces First Beam of Antihydrogen

http://arxiv.org/abs/1106.0847

One of the most interesting physics papers I've read in recent years. Does away with dark matter by presuming that antimatter has the opposite gravitational sign from matter (which pops out very naturally once you apply CPT to general relativity).

As the electromagnetic force is almost 10^40 times stronger than gravity, it would be virtually impossible to test with anti-protons or positrons. But with electrically neutral anti-hydrogen, it becomes potentially testable.
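For a sense of where the 10^40 figure comes from, here's a rough back-of-the-envelope check (my own sketch, not something from the paper): the ratio of the electrostatic to the gravitational attraction between a proton and an electron. The constants are rounded values and the function name is just for illustration. The separation cancels out, since both forces go as 1/r^2.

```python
# Rounded physical constants in SI units (illustrative precision only).
COULOMB_K = 8.9875e9      # N m^2 / C^2
ELEM_CHARGE = 1.6022e-19  # C
GRAV_G = 6.674e-11        # N m^2 / kg^2
M_PROTON = 1.6726e-27     # kg
M_ELECTRON = 9.109e-31    # kg


def em_to_gravity_ratio(m1=M_PROTON, m2=M_ELECTRON, q=ELEM_CHARGE):
    """Coulomb force divided by Newtonian gravity for two particles
    carrying charge of magnitude q. The distance cancels out."""
    return (COULOMB_K * q * q) / (GRAV_G * m1 * m2)


if __name__ == "__main__":
    print(f"proton-electron: {em_to_gravity_ratio():.2e}")
    print(f"proton-proton:   {em_to_gravity_ratio(M_PROTON, M_PROTON):.2e}")
```

The exact exponent depends on which pair of particles you pick, but for any charged particle the electric term swamps the gravitational one, which is why a neutral atom is the interesting test case.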

Comment: Checksumming + sufficient redundancy (Score 1) 321

by MetricT (#45653255) Attached to: Ask Slashdot: Practical Bitrot Detection For Backups?

We wrote our own parallel filesystem to handle just that. It stores a checksum of the file in the metadata. We can (optionally) verify the checksum when a file is read, or run a weekly "scrubber" to detect errors.
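If you don't have a filesystem that does this for you, the same idea can be approximated in userspace. This is only a minimal sketch under my own assumptions: a JSON manifest stands in for the filesystem metadata, SHA-256 stands in for whatever checksum you prefer, and the function names are made up for illustration.

```python
import hashlib
import json
from pathlib import Path


def file_checksum(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files don't fill RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksums(root, manifest_path):
    """Store a checksum for every file under root.
    Keep the manifest outside root so it doesn't checksum itself."""
    root = Path(root)
    manifest = {
        str(p.relative_to(root)): file_checksum(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest


def scrub(root, manifest_path):
    """Return the relative paths whose contents no longer match the
    recorded checksum (or that have gone missing)."""
    root = Path(root)
    manifest = json.loads(Path(manifest_path).read_text())
    damaged = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or file_checksum(target) != expected:
            damaged.append(rel)
    return damaged
```

Run record_checksums() once after writing the data, then call scrub() from cron weekly. Note this only detects bitrot; you still need redundancy somewhere to repair it.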

We also have Reed-Solomon 6+3 redundancy, so fixing bitrot is usually pretty easy.

Comment: Re:Been there. Done that. (Score 3, Insightful) 841

by MetricT (#45636175) Attached to: Employee Morale Is Suffering At the NSA

I made a $3 mistake on my income tax return (Scottrade updated my tax info *after* I'd sent mine in, but they didn't notify me).

The IRS apparently took that as an excuse to torment me for most of a year. I got audited for the above $3 claim, as well as for "falsely claiming that I was due a tax deduction for student loans" (I took some night classes at the local community college). Apparently that $3 claim was justification for a fishing expedition.

First time, I take an entire day off to redo my taxes, discover that I have made a $3 error, cut them a $3 check, and send them the 1098-T from the college to prove that the other claim is false.

Couple months later, they send me the exact same form. I again take another day off to recompute my taxes (I was correct), and again send them the same 1098-T info that they requested.

Third time, I'm told that I will be taken to court because I haven't provided the proof required. I take yet *another* day off to go to the local IRS office in Nashville and sit down with a lady to explain that I've already sent the 1098-T form in.

She logs into her computer, turns it toward me, and starts hitting page-down. "We don't have any record that you sent it in." I see it flash by and tap on the screen. "Yes you did, it was just on your screen a second ago." She pages up and stares at it in silence for 2-3 minutes. "Well I just don't understand that."

Great. So now that the IRS knows I've sent it in, we can put this whole misunderstanding behind us, right? "I'm sorry, but there's nothing I can do to fix this." My choices were to pay it off, send an appeal to the IRS and hope that they suddenly grow a brain after the **4th** time, or go to tax court, lose yet another day's salary, and hope the judge was smarter than the IRS. So I paid.

The IRS's excruciating, devastating, mind-numbing incompetence cost me roughly $1000 in lost salary for a $3 difference. And the whole collective IRS can go pleasure itself with a saguaro cactus.

Comment: Scaling problems... (Score 1) 270

by MetricT (#45602279) Attached to: For First Three Years, Consumer Hard Drives As Reliable As Enterprise Drives

I manage a couple of petabytes worth of disks (consumer, not enterprise) for the HPC center at Vanderbilt University, and they get absolutely hammered by CMS-HI users 24/7/365. At scale, you will daily see problems that you would never even think of.

The firmware on consumer hard drives is often crap. Very few of them support TLER. We have ~400 drives (Seagates) that needed a firmware fix to prevent sudden death, but the fix wouldn't work in bulk over the SAS controller, so we had to yank/flash/replace/repeat. And drives will occasionally lock up hard and require a power-cycle.

Don't believe for a second that Linux doesn't need a defrag utility. We were mystified by a sudden influx of permanent drive *slot* failures. After *much* investigation, it turns out that our users were filling them 100% full, erasing 5%, refilling, erasing 5%, etc, until the average file (~100 MB) had thousands of extents. The vibration from the head frantically scanning the disk to read the file was enough to make the drive's SATA connector destroy the connector on the backplane (Supermicro chassis, would *NOT* buy again, Chenbro is the way...) We wrote a simple defrag script that copied the worst files to a different location and then moved them back.
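For anyone who wants to try the copy-and-move-back trick, here's a rough sketch of the idea. It is not our actual script: the 1000-extent threshold, the scratch directory, and the function names are placeholders I picked for illustration. The default extent counter parses the output of `filefrag` (from e2fsprogs), so that part is Linux-only; you can pass in your own counter instead.

```python
import os
import re
import shutil
import subprocess


def extents_via_filefrag(path):
    """Count a file's extents by parsing `filefrag` output
    (e2fsprogs; prints e.g. 'file: 12 extents found')."""
    result = subprocess.run(
        ["filefrag", path], capture_output=True, text=True, check=True
    )
    match = re.search(r"(\d+) extents? found", result.stdout)
    if match is None:
        raise ValueError(f"unexpected filefrag output: {result.stdout!r}")
    return int(match.group(1))


def defrag_by_copy(path, scratch_dir, threshold=1000,
                   count_extents=extents_via_filefrag):
    """If the file has at least `threshold` extents, copy it to
    scratch_dir and move the copy back over the original.
    Returns True if the file was rewritten."""
    if count_extents(path) < threshold:
        return False
    tmp = os.path.join(scratch_dir, os.path.basename(path) + ".defrag")
    shutil.copy2(path, tmp)  # fresh copy, preserving timestamps and mode
    shutil.move(tmp, path)   # move it back in place of the original
    return True
```

Put scratch_dir on a different disk so the move back is a fresh sequential write onto the original disk. It only helps if the disk has enough contiguous free space for that write, and it is not safe to run on files that are being written to at the time.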

RAID5 isn't nearly sufficient at this point because you will eventually have two or more simultaneous failures just due to the number of disks. We wrote our own filesystem to offer Reed-Solomon 6+3 redundancy.
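To put rough numbers on that, here's a toy model. It assumes drives fail independently at a constant annual failure rate, which real drives don't quite do (bad batches and shared backplanes correlate failures), and every number in the example is a placeholder rather than a measurement from our cluster. Function and parameter names are mine.

```python
from math import comb

HOURS_PER_YEAR = 8760


def p_fail_in_window(afr, window_hours):
    """Chance that one drive fails during a repair window, given an
    annual failure rate `afr` (0.05 means 5% per year)."""
    return afr * window_hours / HOURS_PER_YEAR


def p_group_loss_given_first_failure(group_size, parity, afr, window_hours):
    """After one drive in a group has failed, the chance that at least
    `parity` more of the remaining drives fail before the repair is
    done, which exceeds what `parity` redundant drives can absorb."""
    p = p_fail_in_window(afr, window_hours)
    n = group_size - 1
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(parity, n + 1)
    )


def expected_losses_per_year(total_drives, group_size, parity, afr, window_hours):
    """Expected data-loss events per year across the whole installation."""
    first_failures = total_drives * afr
    return first_failures * p_group_loss_given_first_failure(
        group_size, parity, afr, window_hours
    )


if __name__ == "__main__":
    # Placeholder inputs: 1000 drives, 5% AFR, 24-hour repair window.
    for name, parity in (("RAID5 8+1", 1), ("Reed-Solomon 6+3", 3)):
        rate = expected_losses_per_year(1000, 9, parity, 0.05, 24)
        print(f"{name}: {rate:.3g} expected loss events per year")
```

Both layouts here use 9-drive groups so the comparison is apples to apples. The point is the scaling: the single-parity loss rate grows linearly with drive count and repair time, while surviving three failures pushes it down by many orders of magnitude.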

I'd love to know if you guys have any similar "WTH" horror stories.

Comment: That's interesting! (Score 1) 190

by MetricT (#45491699) Attached to: Elevation Plays a Role In Memory Error Rates

A couple of years back at one of the Supercomputing conferences (I think in Phoenix), Fermilab had a cloud chamber in their booth, and you simply *would* *not* believe the amount of ambient radiation passing you at all times. I can easily believe that altitude would have an effect.

Another interesting idea would be to do the same experiment by latitude. Does the Arctic Region Supercomputing Center have a higher rate than the Maui Supercomputing Center? What happens during an aurora?

Comment: Re:Physics versus MBA (Score 2) 343

by MetricT (#45476603) Attached to: Elon Musk Talks About the Importance of Physics, Criticizes the MBA

I have an MBA from a top 25 school, but I also have 4 years towards a Ph.D. in theoretical physics and 12 years' experience in academic high-performance computing, so I hope I have street cred when I say this...

Saying you can get a "12 Hour MBA" is like saying you can get a Ph.D. in astrophysics by reading Carl Sagan's "Cosmos". It's Dunning-Kruger made manifest.

I found my MBA to be just as challenging as my physics degree. Strategy, game theory, operations, and economics aren't exactly powder-puff courses. And there's a reason they hand out Nobels in economics.

Don't confuse the body of knowledge with the person seeking it. There's a difference between someone who /has/ an MBA, and someone who /is/ an MBA. The latter are annoying, but they would be annoying no matter what degree they had.

Comment: He's an idiot (Score 5, Insightful) 568

by MetricT (#45218747) Attached to: Top US Lobbyist Wants Broadband Data Caps

Bandwidth is a time-sensitive commodity. The line is going to be sending either a 0 or a 1 at all times, so capacity that isn't used right now is simply gone. Instead of caps, they should think about allowing customers to volunteer to be throttled for a reduced fee.

It's similar to an airplane ticket, in that it's worth full price right up until the point the gate is about to close, at which point they will take any price over the marginal cost of fuel. I know many people who would be happy to let "full price" guy go first if it saved them a few bucks.

Comment: Tornado *resistant*... (Score 1) 189

by MetricT (#45043459) Attached to: Engineers Design Tornado Proof Home

The walls may help shield from debris in the event of an EF-1 to EF-3 (which granted is the vast majority of tornadoes). But there isn't much on this earth (above ground, anyway) that's going to survive a direct hit from an EF-5 tornado.

My dad saw the track left by one that hit in Alabama years ago. The thing sucked up everything, including grass, in a 1/2 mile wide path. The only thing left behind was orange clay. There wasn't a single intact structure left, not even foundations.

Closest thing humanity has to an EF-5-proof structure is probably the pyramids in Giza, and I'm not sure about that either.
