
Archiving Digital Data an Unsolved Problem 405

mattnyc99 writes, "It's a huge challenge: how to store digital files so future generations can access them, from engineering plans to family photos. The documents of our time are being recorded as bits and bytes with no guarantee of readability down the line. And as technologies change, we may find our files frozen in forgotten formats. Popular Mechanics asks: Will an entire era of human history be lost?" From the article: "[US national archivist] Thibodeau hopes to develop a system that preserves any type of document — created on any application and any computing platform, and delivered on any digital media — for as long as the United States remains a republic. Complicating matters further, the archive needs to be searchable. When Thibodeau told the head of a government research lab about his mission, the man replied, 'Your problem is so big, it's probably stupid to try and solve it.'"
This discussion has been archived. No new comments can be posted.

  • by csoto ( 220540 ) on Monday November 20, 2006 @06:00PM (#16921568)
    Working at a University, this is not a subject I'm unfamiliar with. We've had lots of discussions about this. Everyone always talks about how many zillions of "pieces of information" are out there. The number of web pages in existence is always bandied about. My point in these discussions is that most of what's out there is crap. Humanity is not lessened by its loss. Good stuff gets reproduced, reviewed, studied, dissected, etc. and survives. It *is* stupid to try to solve this problem, because the problem doesn't need solving.
  • by duh P3rf3ss3r ( 967183 ) on Monday November 20, 2006 @06:00PM (#16921574)
    I've seen this very thing happen where I work -- we've lost data over the years because of incompatibility issues. On the other hand, as with many things, it's a huge problem but not an insurmountable one. The key is in planning an anti-obsolescence strategy into every IT decision. Store data files in open formats on robust media and put someone in charge of ensuring the archives are maintained and accessible.

    It's not easy, sure, but neither are many of the other tasks we take on as humans.
  • Re:Not too long... (Score:5, Interesting)

    by eln ( 21727 ) on Monday November 20, 2006 @06:01PM (#16921580)
    Your timeline may be a little off (at least I hope so), but you're right that it's a silly goal. Whether the US has 10 or 1000 years left, history shows us it will most likely fall at some point, and that point will be fairly soon when compared to the entirety of human history.

    Making a format that will survive a thousand years so long as our advanced civilization is still around and still cares is pointless, because as long as there is a continuous line of people that care, they will be willing to transfer at least the more important stuff to new media. The trick is coming up with something that will still be readable when archaeologists dig it up 10, 50, or 100 thousand years from now.
  • The solution (Score:4, Interesting)

    by alexwcovington ( 855979 ) on Monday November 20, 2006 @06:07PM (#16921672) Journal
    In this era of virtualization, the solution for x86 software is as easy as retaining a copy of the primary partition of a computer originally used to work with the desired files. Searchability could be a problem for proprietary data formats, but the move to open standards in the future will mitigate that.

    The real problem is 60 years of archives of antiquated, proprietary, task-specific and mainframe computer data cards and tapes whose original programmers are halfway to cedar boxes; if the government can't get their support in time it may as well call all the early stuff a loss and hand it over to archaeologists.
  • by John.P.Jones ( 601028 ) on Monday November 20, 2006 @06:10PM (#16921726)
    Keeping 'a copy of every program' is tractable; 'and a system to run them on', however, is not. Data (programs) can be easily copied to new media and thus live forever (as long as people are around to order new media, install it and copy the data anyway, but that's just a staffing problem). But hardware is not so easily ported, unless you have an open, easy-to-port emulator that will run your programs. Preferably this emulator should require very little, say just a functional C compiler for future hardware. So there you have used a common CS solution: you have REDUCED the problem of saving all your data to the problem of maintaining hardware for which you have a functional C compiler, a much easier task. If you can't find such a machine, your solution would then be to implement a C compiler, again a tractable problem.

    I have simplified for the sake of being lazy, but the essence of portable emulators + extensive software and data backup and storage is sound; you don't even have to concern yourself with speed if you are willing to accept that future hardware will be fast enough.
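To make the reduction above concrete, here is a minimal sketch of a portable emulator, assuming a made-up stack machine. The opcode names (`PUSH`, `ADD`, `MUL`, `PRINT`, `HALT`) and the `run` function are hypothetical, and Python stands in for the C the comment has in mind. The point it tries to show is that the archived program is plain data, and the only thing that ever needs porting is the small interpreter loop.

```python
def run(program):
    """Execute a list of (opcode, operand) pairs on a tiny, hypothetical stack machine."""
    stack, output = [], []
    for op, arg in program:
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "PRINT":
            output.append(stack.pop())
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode: {op}")
    return output


# An "archived program": pure data that can be copied to any new medium.
archived_program = [
    ("PUSH", 2), ("PUSH", 3), ("ADD", None),
    ("PUSH", 4), ("MUL", None),
    ("PRINT", None), ("HALT", None),
]
print(run(archived_program))
```

A real emulator for real hardware is far larger, but the shape is the same: a loop that decodes instructions and updates machine state.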
  • by quanticle ( 843097 ) on Monday November 20, 2006 @06:11PM (#16921736) Homepage

    It's different because of the sheer volume of information being created today. Ancient cultures were not creating millions of pages of information every day.

    Your Rosetta Stone analogy is inappropriate. We have not discovered any sort of Rosetta Stone for the ancient Maya hieroglyphs but we have had success in deciphering them because we can apply linguistic analysis techniques to figure out what words correspond to what actions/things. It's a little more complicated for abstract concepts, but you can figure out a surprising amount from basic language knowledge.

  • by s20451 ( 410424 ) on Monday November 20, 2006 @06:23PM (#16921918) Journal
    Say western civilization is disrupted for a period of time that is short by historical standards -- 40-50 years would be enough. Electrical power is only sporadically available, and as a result the Internet collapses and PCs become useless. With much more important issues to deal with, such as finding food, people ignore digital data storage.

    The era of restoration comes. However, when people blow the dust off those old DVDs and players, they discover that the DVDs have decayed to the point of unreadability. Massive quantities of archived data and knowledge are irretrievably lost.

    The main problem in our age is thermodynamics -- information is stored so densely that it tends to decay naturally, on its own. By contrast, ancient stone carvings (as well as their keys, such as the Rosetta stone), are sufficiently durable to last (basically) for ever.

  • Re:Not too long... (Score:1, Interesting)

    by Anonymous Coward on Monday November 20, 2006 @06:32PM (#16922036)
    Using a book of fiction for anything isn't usually a good idea.

    Granted it's not like most people care nowadays. Look at any slashdot discussion on education, rather sad how people complain about having to take history (heck or any subject they're not "interested in deeply") in school. People want to be ignorant sheep.

    Hell, look at Xena or the dozens of other "historical" tv shows out there. I shudder to think of how many people's knowledge of history is probably based on such crap alone. In 20 thousand years they'll think Princess Diana was running around with a lightsaber killing communists or something.
  • Stuff I can't read (Score:3, Interesting)

    by Animats ( 122034 ) on Monday November 20, 2006 @06:34PM (#16922056) Homepage
    Media I actually have useful data on:
    • MacOS floppies. (Maybe on an older Mac.)
    • MacOS-only CD-ROMs. (Could be read on a Mac, if I still had one.)
    • 4mm DAT-II tapes from NT systems compressed with HP's hardware compression. (I still have a drive for this.)
    • 1600BPI 9 track open reel magnetic tape, UNIX TAR format. (I managed to get that copied before the last 9 track drives at Stanford died.)
    • 8" floppies for the IBM Series/1 minicomputer controller for the IBM RS-1 industrial robot. (Not really very useful at this point, but it would be nice to look at that work again.)
    • IBM PC/AT 5.25" high-density floppies in compressed Fastback backup format for DOS. (Years of DOS work, now obsolete)
    • 8" floppies for the Marinchip 9900 (A small theorem prover, in Pascal)
    • UNIVAC UNISERVO steel tape, 8 tracks, 200bpi, written on an UNIVAC UNISERVO IIA on a UNIVAC 1107. (A compiler I wrote as an undergraduate, plus some very early 3D graphics software.)
  • UK/BBC Domesday book (Score:3, Interesting)

    by bLanark ( 123342 ) * on Monday November 20, 2006 @06:36PM (#16922106)
    It happened recently. When I was a lad, the BBC and UK schools composed a "domesday book", which was supposed to be a parallel to the original Domesday book [wikipedia.org], which was a bit more than a census from the UK made in 1086. The modern one used the popular home PC the BBC Micro (made by Acorn). It was made on laserdisk, and distributed around the UK to the schools that had compiled the information.

    Well, 15 years on, it was useless. The then-proprietary format was not readable on anything modern, and there was not much of the old hardware around either. You can google for it ("UK domesday bbc data" should do it), the first link I saw was on the Guardian Online [guardian.co.uk].

    I've still got stuff on floppies, but no-one builds PCs with them anymore. I've got two old laptops with floppy drives, the other three computers have none. (OK, I also have two corpses with floppy drives, and the controllers on two of the new PCs will accept floppy drives, but, please take my point - they're going out of fashion.)

    In 20 years' time, there will probably be no CD/DVD drives; we'll all be using a new, more portable, more backupable, lighter, faster, probably online-only storage medium. Kids won't recognise laserdisks, floppies, or USB ports. They might not recognise keyboards either - who knows?
  • Re:CDs (Score:2, Interesting)

    by lethalwp ( 583503 ) on Monday November 20, 2006 @06:47PM (#16922246)
    Afaik, cds are the worst media to 'backup' your precious data.

    The first burnable cds you could buy (in the '90s) were of a decent quality; I still have some burned ones around, and they are still readable (older than 10yrs).
    But some newer ones (cheaper, & mass-marketing 'mode') are of an awful quality: I have plenty that "died" when reading them: it begins with some bad CRCs, and then more & more & more, till nothing valuable can be read off it. This happened in LESS THAN 2 YEARS.

    The problem with cds:
      - They hate sunlight
      - they hate being in a too hot, or too cold place
      - they hate being in a place with too much/not enough humidity
      - and the worst: they react with air (oxygen).

    It's built with 1.2mm plastic, the dye is on top, with some 'protective' layer over it. Some are better than others.

    Now with DVDs, they seem to be of a much better quality already, the explanation is simple: the dye isn't on the surface anymore, but between 2 slides of plastic glued together. The reaction with air seems to be insignificant. Atm, I have no single failing DVDR that I know of.
    But some brands are of better quality than others.

    And btw:
    "Real men don't use backups, they post their stuff on a public ftp server and let the rest of the world make copies." - Linus Torvalds
  • Re:CDs (Score:3, Interesting)

    by Doctor Memory ( 6336 ) on Monday November 20, 2006 @06:54PM (#16922340)
    Pressed/stamped CDs (like commercial audio CDs) age fairly well, given appropriate handling (well, at least my 20yo copy of Greetings from Asbury Park, NJ is still playable). Recordable CDs, however, aren't stamped. Instead, they use a photosensitive organic dye. Some of the earliest used a blue dye (cyanine) that wasn't stable and degraded after just a few years (10). Even discs with better dyes are sometimes not sealed properly and can go bad.

    That said, there are some newer dyes that are claimed to be stable for a hundred years. I haven't ever seen these in stores, so they may be seriously expensive, or maybe I just don't know where to shop... ;)
  • by Marxist Hacker 42 ( 638312 ) * <seebert42@gmail.com> on Monday November 20, 2006 @06:56PM (#16922354) Homepage Journal
    Now that's the right problem. What is needed isn't some mysterious Universal Translator Format- it's storing the read hardware, with programs in ROM that understand the format, along with the electronic copy. Hell, store the whole thing in ROM chips with a well documented interface printed on the outside of the chip. Libraries could be made up of whatever reading technology exists at the time the library is built- with this common pin-level interface.
  • Re:Not too long... (Score:4, Interesting)

    by FooAtWFU ( 699187 ) on Monday November 20, 2006 @07:29PM (#16922774) Homepage
    I've been wondering, with our global nature now, will we need archeologists in the future? While I believe civilizations will surely 'collapse', won't we all be around to immediately take note of it, and update Wikipedia?

    Archaeology is the search for fact. Not truth. If it's truth you're interested in, Doctor Tyree's Philosophy class is right down the hall. So forget any ideas you've got about lost cities, exotic travel, and digging up the world. We do not follow maps to buried treasure, and 'X' never, ever marks the spot. Seventy percent of all archaeology is done in the library. Research. Reading.

    -- Indiana Jones and the Last Crusade
  • Re:Not too long... (Score:4, Interesting)

    by nido ( 102070 ) <nido56@yahoo . c om> on Monday November 20, 2006 @07:37PM (#16922864) Homepage
    Granted it's not like most people care nowadays. Look at any slashdot discussion on education, rather sad how people complain about having to take history (heck or any subject they're not "interested in deeply") in school. People want to be ignorant sheep.

    History is interesting, school makes it suck: "In Year ABC, XYZ happened. Test next week - students who regurgitate well will get an 'A'."

    People don't want to be sheep - totalitarian governments need populations to be docile. School is designed to suck the uniqueness out of children so, as adults, they'll take up a spot on a standardized assembly line.

    Kinda cruel how the government has encouraged the shipping of assembly line jobs to China... Dumb down the population, then get rid of the reason for the dumbing-down.

    See Gatto's Underground History [johntaylorgatto.com], for example.
  • Funding (Score:3, Interesting)

    by Detritus ( 11846 ) on Monday November 20, 2006 @07:59PM (#16923176) Homepage
    Don't forget funding. I've seen vast amounts of data disappear when nobody was willing to pay for its storage. This is common in large bureaucracies. You've spent years building and maintaining a library, and then it all ends up in a dumpster when the parent organization is eliminated.
  • by Panaqqa ( 927615 ) on Monday November 20, 2006 @08:01PM (#16923200) Homepage
    Unless I miss my guess, Google will continue towards its stated objective of making all the world's information searchable and retrievable. Want something archived, Google will take care of it. And if Google fails, my suspicion is the entity that takes their place will take it on.

  • by darrenadelaide ( 860548 ) on Monday November 20, 2006 @08:17PM (#16923368)
    Just because a job isn't easy doesn't mean it's not important.

    In the early 1960s a wise man spoke

    / quote

    We choose to go to the moon. We choose to go to the moon in this decade and do the other things, not because they are easy, but because they are hard, because that goal will serve to organize and measure the best of our energies and skills, because that challenge is one that we are willing to accept, one we are unwilling to postpone, and one which we intend to win, and the others, too.

    / unquote

    We went to the Moon, and all the signals received, including a high-definition version of the pictures (by the standards of the time), were recorded at NASA (and also, I believe, at the Parkes receiving station in Australia, where the signals were received through its deep-space radio telescope). These most important "documents" of our time have been lost and can never be recovered, leaving us purely with the broadcast version, which was of much lower quality (like a poor photocopy).

    It's important for our history, for the essence of our technology, and for who we are as a people to preserve these important events for future generations.

    When you look at this planet, we regularly go on a rampage where technology is lost and we are thrown back hundreds of years. Take Ancient Egypt, the technology of the first millennium, the Great Library of Alexandria (Atlantis, etc.): so much of the past that we have lost, and we are poorer as a result.

    Can't we get it right this time as we face our possible next destructive surge, whether it be by climate, economics, famine, nuclear war, microbiological warfare / disease (whether natural or manmade), a chemical accident causing a chain reaction, etc.? So many risks. Let's do this before it's too late: too late to be done and too late to be able to be done.

    Darren Stephens
    Adelaide, Australia
  • by Dun Malg ( 230075 ) on Monday November 20, 2006 @08:52PM (#16923688) Homepage
    The main problem in our age is thermodynamics -- information is stored so densely that it tends to decay naturally, on its own. By contrast, ancient stone carvings (as well as their keys, such as the Rosetta stone), are sufficiently durable to last (basically) for ever.
    Of course, preserving the data is only half the battle. Figuring out what it says is the second part. This is, of course, nothing new. We still can't read Linear A [wikipedia.org]. In the case of the Rosetta Stone we were simply lucky to find something relating hieroglyphics to a language we knew. The Rosetta Stone is rather unusual. Normally we have nothing so convenient.
  • by Anonymous Coward on Monday November 20, 2006 @09:34PM (#16924044)
    Stored so densely -- but that is the whole point. Now you can store the same information in more than one spot on the same disc for the sake of redundancy. And you still have enough room to store a lot of information. Also, a disc is cheap enough to be copied and scattered throughout the globe. If you want to improve reliability, also bundle your disc with more discs that contain an OS and other application software. As we take the Moore's Law joyride in our computational amusement park, you might bundle your important data media with the requisite hardware to extract it.
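As a rough illustration of the redundancy idea above, the sketch below assumes the same data was written to several spots and that each copy has decayed in different places. It rebuilds the original by taking a byte-by-byte majority vote. The function name `majority_restore` is hypothetical, and real media rely on error-correcting codes instead of whole-copy voting; this is only the simplest version of the idea.

```python
from collections import Counter


def majority_restore(copies):
    """Rebuild data byte by byte, taking the most common value across the copies."""
    length = len(copies[0])
    if any(len(c) != length for c in copies):
        raise ValueError("all copies must have the same length")
    restored = bytearray(length)
    for i in range(length):
        # The value most copies agree on wins at this position.
        value, _count = Counter(c[i] for c in copies).most_common(1)[0]
        restored[i] = value
    return bytes(restored)
```

With three copies this survives any damage that hits a given byte position in only one copy; more copies tolerate more overlapping damage.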
  • Re:Old stuff. (Score:2, Interesting)

    by TropicalCoder ( 898500 ) on Monday November 20, 2006 @10:12PM (#16924320) Homepage Journal

    As I perused the contents of said stack of discs, I found that almost 90% of them were redundant or out of date copies of files I had completely forgotten about.

    Well then I have a question that I would like to throw out to Slashdot readers. Like the person who wrote the parent, I have tons of old files on my hard drives. I always run at least two hard drives, using one for backups. Then when I upgrade computers, I bring over one of the old hard drives to the new computer, copy it to the new drive, then continue to use it to back up new material. By now I have files duplicated and triplicated all over the place. After almost a decade of this, I have many GBs of files which would probably condense down to a fraction if all the duplications were eliminated. What kind of software do I need that will analyze all my files and automatically find and remove duplicates? - or do I need to develop such software for myself? ...and if I do, then is there a niche for commercialization of such software?
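One common way to approach the duplicate-file question above is to group files by size and then by a content hash. The sketch below is a minimal version of that idea, not any particular product: `file_digest` and `find_duplicates` are hypothetical names, and it only reports duplicate groups, leaving the decision to delete to a human.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory use."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(root):
    """Return lists of paths under root whose contents are byte-for-byte identical."""
    # Pass 1: bucket by size, which is cheap and rules out most files.
    by_size = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    # Pass 2: hash only the files that share a size with another file.
    by_digest = defaultdict(list)
    for paths in by_size.values():
        if len(paths) < 2:
            continue
        for path in paths:
            by_digest[file_digest(path)].append(path)

    return [sorted(group) for group in by_digest.values() if len(group) > 1]
```

Hashing only same-sized files keeps the run time manageable on drives full of large, unique files; calling `find_duplicates` on a backup drive's mount point would return one list per set of identical files.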

  • by adrianmonk ( 890071 ) on Monday November 20, 2006 @10:51PM (#16924630)
    They aren't going to see files. They are going to see 1's and 0's. Lots of them - billions on a memory card and trillions on a hard drive. They won't have a clue how to interpret the file system, even for something relatively simple like FAT16. They may not even know that a byte is 8 bits.

    They might not know that a byte is 8 bits, but with a little analysis, it shouldn't be hard to figure out. There are numerous statistical properties that can be exploited to figure this out relatively easily. For example, with most types of data, the higher-order bits (in any size byte) are more likely to be 0 than the lower-order bits are. Think about how booleans are stored in most systems. Think about the characters in this message: 100% of them have a zero high-order bit. To put it a little differently, there is more entropy in the lower-order bits.

    So, to figure out how many bits there are in a byte, you take your data, and for all reasonable sizes of bytes (say, from 4 bit bytes up to 36 bit bytes), you compute the function that maps bit position (low- or high-order) to an entropy value for that bit. Then you can tell by the shape of that curve which guess about bits per byte was the right guess. Heck, it should be such a strong trend that you can probably automate it!

    Remember that future civilizations will probably also use digital data as well, at least ones sophisticated enough to try to read the optical and magnetic media. They may not know the FAT32 filesystem, but they will have invented statistics and information theory, and they will be able to make some awfully good guesses at things. And yeah, it might take them 10 or 20 years to be able to read a FAT32 volume correctly if some poor college student of the distant future has to do it on a shoestring budget of grant money, but if they're reading 10,000 year old data, how much does that matter?
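The per-bit entropy analysis described in this comment can be sketched roughly as follows. This is one possible reading of the idea, not a tested forensic method: all function names are hypothetical, the bit stream is assumed to be read most significant bit first, and the score is simply the mean entropy across bit positions for each candidate width.

```python
import math


def bits_of(data):
    """Unpack bytes into a flat list of 0/1 values, most significant bit first."""
    return [(byte >> shift) & 1 for byte in data for shift in range(7, -1, -1)]


def binary_entropy(p):
    """Shannon entropy, in bits, of a coin that lands 1 with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def position_entropies(bits, width):
    """Entropy of each bit position when the stream is cut into width-bit words."""
    words = len(bits) // width
    usable = words * width  # ignore a trailing partial word
    return [binary_entropy(sum(bits[k:usable:width]) / words) for k in range(width)]


def guess_word_width(data, candidates=range(4, 37), tolerance=0.02):
    """Return the smallest width whose mean per-position entropy is near the minimum."""
    bits = bits_of(data)
    widths = list(candidates)
    if len(bits) < max(widths):
        raise ValueError("not enough data for the largest candidate width")
    scores = {w: sum(position_entropies(bits, w)) / w for w in widths}
    best = min(scores.values())
    return min(w for w, s in scores.items() if s <= best + tolerance)
```

A wrong width smears predictable and unpredictable bits together, which raises the mean entropy. Multiples of the true width (16, 24, 32) score about as well as the true width itself, which is why the sketch returns the smallest width within `tolerance` of the best score.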

  • Re:Not too long... (Score:3, Interesting)

    by cultrhetor ( 961872 ) on Monday November 20, 2006 @11:31PM (#16924828) Journal
    Mod parent up. The most interesting aspect of history - one that's never taught in high school - is that it is a constructed narrative: the writings and accounts of several pieced into varying "histories." American high schools teach history as frozen in time, according to the same revisionist history that they were taught fifty years ago: Columbus, "Injuns," Pilgrims, the whole nine yards.
  • by Anonymous Coward on Monday November 20, 2006 @11:53PM (#16925004)
    Ok, this has been happening since the beginning of computers.

    I still have some punch cards that I need to convert. It doesn't matter that IMSL solved all the same problems better.

    I have some:
    - 5.25" floppies
    - RLL/MFM hard drives with data
    - parallel port QIC80 tapes (250MB ea)
    - 1/4" tapes (not so important)

    All of these need to be converted to useful media. Or, I figure they are probably corrupt by now.

    Hence, the addition of par2 http://www.par2.net/ [par2.net] which provides parity protection against partial media failures. Corruption can be handled by home users. For enterprise customers, EMC, Veritas and STK have handled this for years. For home users, the extra effort that par2 requires, http://en.wikipedia.org/wiki/QuickPar [wikipedia.org], may not be worth it. But if it is your wedding pictures, RAID on disk, off-site storage and optical media are **all** required to save your marriage. That 1GB of GMail space might be worth it?

    If it is your corporate data - go ahead, take your chances. How much can it possibly be worth if it is missing? YOUR JOB perhaps?
    The cost of time to find all the data and work interruption is nothing compared to gracefully handling a disk failure without a production impact. Where I work, we do backups **very** well and remote vaulting in alternate data centers.
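For readers unfamiliar with parity protection, the sketch below shows the simplest form of the idea: a single XOR parity block that can rebuild any one lost data block. This is a deliberately simplified stand-in and not how par2 itself works (PAR2 uses Reed-Solomon coding, which can recover several lost blocks); `make_parity` and `recover_block` are hypothetical names.

```python
def make_parity(blocks):
    """XOR equal-length blocks together into a single parity block."""
    size = len(blocks[0])
    if any(len(b) != size for b in blocks):
        raise ValueError("all blocks must have the same length")
    parity = bytearray(size)
    for block in blocks:
        for i, byte in enumerate(block):
            parity[i] ^= byte
    return bytes(parity)


def recover_block(blocks, parity):
    """Rebuild the single missing block (marked None) from the survivors and the parity."""
    missing = [i for i, b in enumerate(blocks) if b is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can rebuild exactly one missing block")
    survivors = [b for b in blocks if b is not None]
    # XOR of all survivors and the parity cancels everything except the lost block.
    return make_parity(survivors + [parity])
```

Storing the parity block on a different disc from the data means a single unreadable disc costs nothing, at the price of one extra block of storage.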
  • Re:Not too long... (Score:2, Interesting)

    by ExFCER ( 1001188 ) on Tuesday November 21, 2006 @12:04AM (#16925082)
    Sorry for beating a dead horse. "indifference to objective truth is encouraged by the sealing off of one part of the world from another, which makes it harder and harder to discover what is actually happening. There can often be doubt about the most enormous events... .The calamities that are constantly being reported -- battles, massacres, famines, revolutions -- tend to inspire in the average person a feeling of unreality. One has no way of verifying the facts, one is not even fully certain that they have happened, and one is always presented with totally different interpretations from different sources. Probably the truth is undiscoverable but the facts will be so dishonestly set forth in that the ordinary reader can be forgiven either for swallowing lies or for failing to form an opinion ..."
  • by frdmfghtr ( 603968 ) on Tuesday November 21, 2006 @12:30AM (#16925264)
    This reminds me of the study done for the Waste Isolation Pilot Plant (http://downlode.org/Etext/wipp/#executivesummary) . The study looked at how to mark the site in such a way that the purpose of the site would be indicated for 10,000 years.

    While the WIPP site won't have the benefit of constant updating of the media (it's designed to survive on its own for 10,000 years), it does address some of the same points: longevity of the media, a format that will be usable into the future, and ability of future civilizations to understand the message.

    Off-topic perhaps but an interesting read.
  • by take5 ( 561870 ) on Tuesday November 21, 2006 @04:03AM (#16927070)
    If you are serious about archiving, print your stuff on thin (0.3-0.5 mm) high grade ceramic plates the size of A4 paper, using a laser to remove ceramic material in order to form letters. Then put the plates in large pyramids, with several copies in various parts of the world.

    Not every piece of digital info can be saved that way, or needs to be saved, as others have pointed out. Current college textbooks, some history books, literature and music and an encyclopaedia will go a long way to create a useful memory of our times for the future.

    Some years ago, in California, they opened up a 100-year time capsule. I do not remember the stuff that was in it, but it was mostly useless junk by our standards today. If we could send an e-mail back in time, we would ask them to include totally different things. It is easy to make the same mistake now as to content.

  • Re:Not too long... (Score:3, Interesting)

    by mattpalmer1086 ( 707360 ) on Tuesday November 21, 2006 @07:33AM (#16928512)
    No one seriously working in digital preservation is trying to make a single thing that will last for 50, 100 or 1000 years. The point is not to preserve information in the event of a total civilization collapse, to make it easier for future archaeologists, or some such scenario. The point is to keep our historical digital records *currently* readable at any given point in time. If our civilization collapses, it will be up to those who come after to figure out what we were up to.

    There are two basic strategies to keep our digital files *currently* accessible:

    1) emulation. Check out IBM's Universal Virtual Computer project.
    2) migration. Not only migration of storage media, but migration to new and currently readable formats.

    We will need to migrate all of our digital files every 5-10 years or so to keep them current. And yes, information will get lost along the way - everything decays eventually.
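For the storage-media half of the migration strategy above, a common safeguard is a fixity check: hash the file before and after the copy and refuse to trust the new copy unless the digests match. The sketch below assumes plain files on mounted filesystems; `sha256_of` and `migrate` are hypothetical names. It does not attempt format migration, which needs format-specific converters.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def migrate(src, dst):
    """Copy src to dst, confirm the copy is bit-identical, and return its digest."""
    before = sha256_of(src)
    shutil.copyfile(src, dst)
    after = sha256_of(dst)
    if before != after:
        raise OSError(f"fixity check failed for {src}")
    return after
```

Recording the returned digest alongside the file gives the next migration, 5-10 years on, something to verify against before it copies anything.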
