Data Storage

Expedition Reaches Robotic Antarctic Observatory

Submitted by Michael Ashley
Michael Ashley (812193) writes "20 people from the Chinese inland traverse team arrived at Dome A, Antarctica earlier today after an arduous 19-day, 1220 km traverse with 7 vehicles and about 500 tonnes of supplies. The arrival of the expedition was captured by a webcam installed in PLATO. PLATO is a completely robotic jet-fuel-powered astronomical observatory that provides its own electricity, heat, and internet connectivity via Iridium satellites. PLATO operated throughout 2009 at Dome A, collecting astronomical and site-testing data with no humans on-site. The PLATO computers are PC/104 systems running Debian from solid state disks, and are able to function down to -60°C. China is building a major new inland base at Dome A called Kunlun Station, with astronomy as one of its objectives.

Slashdot readers might be interested in the following story about how we recovered from an interesting disk corruption issue.

One of the instruments in PLATO used four 500GB hard disk drives in external FireWire boxes (model AE5SACSUF from Addonics Technologies). The disk drives appeared to work perfectly during testing, but in Antarctica we noticed data corruption, and it was then that we realized that there was a 127GB limit due to firmware problems in the Oxford Semiconductor 911 chipset. The Addonics website claims that you can get around the limit by simply creating a number of 127GB partitions. However, we found this not to be the case.

We originally noticed the problem following a file system check. The fsck showed errors, which we let it correct, but subsequent checks showed even more errors, and so on, until the filesystem rapidly became unusable.

To understand the problem we wrote a short program to write a diagnostic block to every block on the disk. The block simply contained the number of the block being written to. Upon reading the data back, we discovered that about 1% of the time some blocks were written with the wrong number, and multiple reads of the same block would sometimes give different results. Furthermore, the problem appeared to be dependent on the previous history of reading/writing the disk.
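For anyone who wants to try the same test, here is a minimal sketch of that kind of diagnostic in Python. It is not our original program: the 512-byte block size, the function names, and the in-memory `io.BytesIO` stand-in for the disk are all assumptions made for illustration.

```python
import io
import struct

BLOCK_SIZE = 512  # assumed block size; not stated in the write-up above


def make_block(block_number: int) -> bytes:
    """Fill a block with repeated copies of its own 8-byte block number."""
    return struct.pack("<Q", block_number) * (BLOCK_SIZE // 8)


def write_pattern(dev, n_blocks: int) -> None:
    """Write the diagnostic block to every block on the device."""
    for n in range(n_blocks):
        dev.seek(n * BLOCK_SIZE)
        dev.write(make_block(n))


def find_mismatches(dev, n_blocks: int) -> list[tuple[int, int]]:
    """Return (expected, found) pairs for blocks holding the wrong number."""
    bad = []
    for n in range(n_blocks):
        dev.seek(n * BLOCK_SIZE)
        data = dev.read(BLOCK_SIZE)
        (found,) = struct.unpack_from("<Q", data)
        if found != n:
            bad.append((n, found))
    return bad


# Demo on an in-memory stand-in for the disk (hypothetical, not a real device).
disk = io.BytesIO()
write_pattern(disk, 16)
clean = find_mismatches(disk, 16)

# Simulate one misdirected write: block 13's data lands in block 5.
disk.seek(5 * BLOCK_SIZE)
disk.write(make_block(13))
mismatches = find_mismatches(disk, 16)
```

On a real device you would open the raw disk in binary mode and repeat the read pass several times, since the same block did not always read back the same way.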

So, we had four 500GB disks that would unpredictably read/write data from/to the wrong blocks. We spent a week trying to see if there were any patterns in the errors, and there was some interesting bit-twiddling going on between the desired and actual block numbers. However, a consistent pattern didn't emerge.
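One simple way to look for such patterns is to XOR the desired and actual block numbers and count which bit positions differ. The sketch below assumes you have (expected, found) pairs from a diagnostic pass; the sample pairs are made up for illustration and are not our measured data.

```python
from collections import Counter


def differing_bits(expected: int, found: int) -> list[int]:
    """Return bit positions (0 = least significant) where the numbers differ."""
    diff = expected ^ found
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


def bit_histogram(pairs) -> Counter:
    """Count how often each bit position differs across all mismatched pairs."""
    counts = Counter()
    for expected, found in pairs:
        counts.update(differing_bits(expected, found))
    return counts


# Hypothetical sample pairs, NOT real measurements from the PLATO disks.
sample = [(5, 13), (1024, 1032), (7, 6)]
hist = bit_histogram(sample)
```

A histogram dominated by one or two bit positions would point at a stuck or swapped address bit. In our case no such consistent pattern showed up.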

Since the problem had first been identified back in 2001, we had hoped that we would at least be able to find out what the firmware issue was, so that we could work around it. However, Oxford Semiconductor were completely uninterested in providing any assistance. IMHO, it is fairly extraordinary that you can buy FireWire enclosures in 2008 that have unpatched firmware bugs from 2001.

Anyway... what to do? Remember that our disks were now alone at Dome A, with no chance of human intervention on-site for 12 months.

Those of us who are old enough to remember the VMS tape backup utility will recall that it was legendary how it could recover from all sorts of media errors. You could cut and re-splice the tape and backup would fix it for you.

Perhaps there was something similar that could be used for disks?

Yes, there is: par2. par2 is a program that creates "parchive parity files", which basically start with an original data file and add a user-specifiable amount of redundant parity information to it. You can then take a par2 file and delete, say, 20% of it, or reorder blocks in it, or corrupt parts of it, and par2 will be able to verifiably reconstruct the original file. Wonderful!
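To give a feel for how redundant parity lets you rebuild lost data, here is a deliberately simplified sketch. It is not how par2 works internally: par2 uses Reed-Solomon codes and can recover many damaged blocks, whereas this toy uses a single XOR parity block and can rebuild only one missing block. The block contents and names are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks: list[bytes]) -> bytes:
    """XOR equal-length blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Three data blocks plus one parity block computed from them.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Pretend block 1 was lost: XOR of the survivors and the parity rebuilds it.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
```

The same idea, generalised with Reed-Solomon arithmetic, is what lets par2 tolerate losing a user-chosen percentage of blocks rather than just one.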

We wrote a Perl wrapper for par2 that writes par2 archives to the disks in raw mode. We kept an index of the starting block number of each file, pointers to the next file, and various md5 checksums. We also stored multiple special blocks at the beginning and end of each par2 file on disk so that it would be possible to recover the index by scanning the disk.
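The layout idea can be sketched roughly as follows. This is in Python rather than Perl and is not our actual wrapper: the marker string, header fields, 512-byte block size, and the use of one marker block at each end (we stored several) are assumptions made to keep the example short.

```python
import hashlib
import io
import struct

BLOCK_SIZE = 512                     # assumed block size
MAGIC = b"PLATOIDX"                  # hypothetical 8-byte marker string
# Header: marker, start block, next file's start block, payload length, md5.
HEADER = struct.Struct("<8sQQQ16s")


def marker_block(start: int, next_start: int, payload: bytes) -> bytes:
    """Build a special block describing one stored file."""
    digest = hashlib.md5(payload).digest()
    head = HEADER.pack(MAGIC, start, next_start, len(payload), digest)
    return head.ljust(BLOCK_SIZE, b"\0")


def write_record(dev, start: int, payload: bytes) -> int:
    """Write marker, payload, marker at block `start`; return next free block."""
    n_payload = -(-len(payload) // BLOCK_SIZE)  # ceiling division
    next_start = start + n_payload + 2
    marker = marker_block(start, next_start, payload)
    dev.seek(start * BLOCK_SIZE)
    dev.write(marker)
    dev.write(payload.ljust(n_payload * BLOCK_SIZE, b"\0"))
    dev.write(marker)
    return next_start


def scan_index(dev, n_blocks: int) -> dict:
    """Rebuild {start: (next_start, length, md5)} by scanning for markers."""
    index = {}
    for n in range(n_blocks):
        dev.seek(n * BLOCK_SIZE)
        block = dev.read(BLOCK_SIZE)
        if len(block) < HEADER.size or not block.startswith(MAGIC):
            continue
        _, start, next_start, length, digest = HEADER.unpack_from(block)
        index[start] = (next_start, length, digest)
    return index


# Demo on an in-memory stand-in for the raw disk.
disk = io.BytesIO()
second = write_record(disk, 0, b"x" * 1000)
end = write_record(disk, second, b"y" * 100)

# Destroy the leading marker of the first file; the trailing copy survives.
disk.seek(0)
disk.write(b"\0" * BLOCK_SIZE)
index = scan_index(disk, end)
```

Because every marker carries the file's own start block, a marker that turns up at the wrong place on disk still tells you which file it belongs to, and the md5 lets you confirm the payload once par2 has repaired it.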

This technique worked very well, and we were able to store about 400GB of compressed data per disk and accurately recover any file we wished despite the firmware occasionally reading/writing the wrong blocks.

In fact, we are so pleased with the par2 technique that we are thinking of using it for archiving data on disks that aren't afflicted with an Oxford Semiconductor 911 controller. It is much easier to reconstruct the data from a par2-encoded disk than it is from any standard filesystem. A single bad block can result in substantial data loss with a normal filesystem, but with par2 you can have 10% or more of the disk unreadable and still recover everything.

[NOTE TO SLASHDOT EDITORS: I can't seem to be able to reset the tags. I just wanted "Science" and "Data Storage"]"


Comment: Re:some information on the computer control system (Score 3, Interesting) 128

by Michael Ashley (#22332236) Attached to: Robotic Telescope Installed on Antarctica Plateau

Hi Guillaume, good to hear from you! (Slashdotters - do yourself a favour and visit Guillaume's website and have a look at some of his amazing photos). We aren't currently running anything at Dome C. Dome A is likely to have similar seeing to Dome C above the boundary layer, but the layer is expected to be lower, possibly touching the ice. That is one of PLATO's prime goals - to measure the height of the boundary layer with a sonic radar.

The Chinese took an Australian Antarctic Division AWS to Dome A in 2005.

Yes, the reliability of our equipment continues to improve. It is now even better than the stuff we took to Dome C!

We mounted everything on shock absorbers to survive the 1200 km sled trip. There was no damage.

The traverse team should arrive back at Zhongshan Station today.

"Religion is something left over from the infancy of our intelligence, it will fade away as we adopt reason and science as our guidelines." -- Bertrand Russell
