Technology

On Data Obsolescence and Media Decay

Posted by Cliff
from the data-that-is-still-readable-20-yrs-later dept.
mouthbeef asks: "What's the future of storage media? With CDs and tapes prone to relatively speedy decay, and hard drives an entropic nightmare of moving parts, how will we keep our data safe over the long haul? I just got some e-mail from a writer pal who isn't really technologically sophisticated, alarmed because someone told him that his backup CDs would decay and rot in 20 years. He's an SF writer, and he was thinking "big picture": a coming infopocalypse in which sysadmins devote their every waking moment to re-archiving their old backup data." Is such a scenario likely? Why or why not?

"I wrote back that I didn't think that would happen, because:

  • Every time I buy a computer, it's got more storage on-board than all the computers I've owned until then, and I just migrate all the data files I've ever created or saved to the new box, like a hermit-crab changing shells
  • With broadband becoming more widespread and cheaper, it makes sense that in the long run we'll store most (if not all) of our data on remote servers -- encrypted, of course -- that are managed by trained pros with access to mirror drives, climate-controlled vaults, etc. etc.
  • Even if this doesn't happen, most of your data files will be in stupid, proprietary formats like Word 3.0 that won't be openable, anyway
(I've since changed my mind about the last one: thinking about it, I'm willing to believe that the high-speed, high-capacity distributed servers of tomorrow will have VMWare-style emulators for every chipset and every OS ever made on them as public utilities like grep or perl)

How reasonable does this seem to you folks? What do you do with data that you need to preserve for the ages? "

This discussion has been archived. No new comments can be posted.

  • by Anonymous Coward
    our valuable data is like money. if we keep our money in big piles and no one buys or sells anything, the money is worthless. witness the 1930s depression. to keep things going, we buy and sell goods and services, keeping the money flowing. if we let our data sit in local clusters of decaying storage, not only is the physical media being destroyed, but the value of said data is diminished in that no one is looking at it. i think the answer is to keep the networks busy, seeing as we'll all have fiber to the home, just passing our backups around. ahh, but it's not secure you say. and to answer you, i should question your faith in Open Source. and now i feel quite silly for writing all this, as it's probably going to show as the 347th anonymous post in the last hour, and ain't none a yous gonna read it. but i got it out there, it's not just backed up in the recesses of my mind.
  • by Anonymous Coward
    Some brands claim 200 years. Maybe this is under ideal conditions and assumes you never (or seldom) actually read the CDs.

    BTW, I hang old CDs from strings outside the house as decorations and have noticed something. The Sun destroys the green ones and the gold ones, but the blue ones do not fade at all. I took a blue one down after a year of hard sun's rays (I live in the deserts of AZ), and read back a full and complete ISO image of Red Hat 4.2! The other colored CDs wouldn't even register as valid discs. The blue ones were generic, blank-label Verbatims. Don't forget to think about the ability to survive abuse when choosing backup media.

  • by Anonymous Coward
    There is a subtle point being missed on this thread. We are not facing a data extinction problem, we are facing two, with different time constants.

    Data within an institution typically has a short half-life, say years to decades (banks, tax info, etc.) The problem here is moving all still useful data into a format that is still readable by the rest of the firm, an in-house job in most cases. The hermit crab analogy is particularly apt, going from tape to (say) CD to solid state.

    This emerging problem will demand innovation, and specialists. Specialists to resurrect or maintain the old formats and reading machines, specialists to oversee the transfer, and specialists to find the latest and greatest encoding scheme. The real fulcrum here is the manager (information management, I suspect, will become a new and major field), who must schedule the maintenance, oversee the buying and employment of equipment (and stay in budget and on time!!!), and most importantly, get the biggest bang for the buck by keeping only the most necessary data.

    The second problem of the infopocalypse has a slower time constant, decades to centuries. It is largely irrelevant to institutions (all but the very few that will survive that long, and who can predict that?). Works of art, philosophy, and science are the major players at this time scale.

    Whereas the first problem (of institutions) is particularly Sisyphean, pushing the data up one hill only to have it roll back down in another 5-10 years, the second problem is not really a problem. Unless you're so anal that you consider all works of art and science to be worth saving.

    Think about it, how many physics students read Newton's Principia Mathematica? I know of none. They get the summary and biography in the textbook. From the library. How many art students need to see the original? None. They get a print. From the library. The enduring themes and ideas of our culture last because they are enduring, not because someone chooses to furiously keep copying them down. Sure, there are more new scientific ideas per day now than in Newton's time, but the distilled product is kept in the text, while the old theories and bad ideas are not.

    Science, art, and history advance on their own as fields, with students getting the necessary detail from their teachers (not, say, Bacon or Descartes or Michelangelo). It would be arrogant indeed to assume that we know ahead of time that this or that is worth saving. If it is, someone will save it.

    We need to dispense with the silly notion that Visa's database needs to be saved for 20,000 years, or that string theory needs to be repeatedly transferred from CD to solid state to quantum computers. Only the most boneheaded of archaeologists would hope to save all of our present culture for future generations to laugh at.

    Only the most idiotic fool would want them to.
  • by Anonymous Coward
    650MB in punch cards? That's a truck full.

    A binary-punched card has 80 columns of 12 bits (possible holes) each, or 120 bytes per card. 650e6/120 is about 5.4 million cards. Was it 500/box? I remember carrying more than 4 boxes was a chore, but over 10,000 boxes?

    Also consider what one CD holds in plain ASCII, equivalent to typing novels on a one-font typewriter. If you type 120 wpm and words average 5 characters plus a space, that's 650e6/(120*6), roughly 900,000 minutes of typing, or about 1.7 years of continuous 24/7 typing -- decades at any realistic pace. I don't think the average person will live long enough to fill up a CD that way. So you can assume all your novels and all your source code and all your tax data will fit on a CD.

    Of course, your digital movies are another matter. It's just interesting to see relative scales.
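For what it's worth, the arithmetic is easy to sanity-check in a few lines of Python (assuming 500 cards per box and decimal megabytes, as in the comment above):

```python
# Back-of-the-envelope check of the punch-card and typing figures.
CD_BYTES = 650 * 10**6            # 650 MB, decimal

# A binary-punched card: 80 columns x 12 rows = 960 bits = 120 bytes.
BYTES_PER_CARD = 80 * 12 // 8
cards = CD_BYTES / BYTES_PER_CARD
boxes = cards / 500               # assuming 500 cards per box

# Typing: 120 words/min, 6 characters per word (5 letters + a space).
CHARS_PER_MIN = 120 * 6
minutes = CD_BYTES / CHARS_PER_MIN
years_continuous = minutes / (60 * 24 * 365)

print(f"{cards:,.0f} cards, {boxes:,.0f} boxes")
print(f"{years_continuous:.1f} years of nonstop typing")
```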

  • by Anonymous Coward
    This is what the worldly part of religions is about (IMHO): Creating a pattern of behavior that has built-in self-preservation and adaptation, in order to carry a message through time and across language barriers, about how to discover a way of experiencing being.

    The same general principles could probably be applied to transmitting less important messages.

  • by Anonymous Coward
    Carve your data 5 meters deep into a mountain range. Immune to most natural disasters except impending geologic eras.
  • by Anonymous Coward
    With the exception of the first formula for CD and CD-R, the potential for CD decay is an urban myth that is being exploited by the media, including librarians.

    Please reality check your perceptions at:
    http://www.cd-info.com/CDIC/Industry/news/media-problem.html

    If you are still worried about decay, there is a permanent CD format that will last millions of years. HD-ROM (200 gigabyte) and HD-ROSETTA is immune to technology obsolescence, electromagnetic failures, and withstands the effects of time.
    http://www.norsam.com/
  • a way to map all your data against say the position of all the stars in the milky way

    Say, ever heard of the three-body problem?

    Stars move in literally unpredictable ways. After a while, your mapping will produce gibberish. Even more ironic, you'll find that your key will be as large as, if not larger than, the information you are trying to record.

    Claude Shannon will not be denied.

  • - a 6-foot stack of source code, punch cards, old Algol and FORTRAN programs. Unreadable by current hardware. Value:zero/sentimental.

    Historian: "What's with all these pieces of paper? Do you think those holes could be encoding something?" (researcher then trips, falls, and unsettles the whole 6-foot deck)

    - a stack of Apple ][ disks with "all Apple ][ software ever written". Unreadable by current hardware. Value:near-zero.

    Apple IIs are like cockroaches :)

    - A couple of thousand 400K diskettes containing Mac System 1.0, Microsoft Word 1.0, Adobe Photoshop 1.0 and similar stuff. Unreadable by current hardware. Value:who knows?

    Still good for competitive upgrades in some situations :)

  • Evidence? I have several (audio) CDs from the early 80s which are no longer readable.

    No longer readable, or just no longer readable by your equipment? One thing that doesn't seem to have been mentioned is that our CD readers are generally designed for speed, rather than making absolutely sure every last pit is read. Given that mass-produced CDs have physical pits in the media, I think it's likely the pits themselves will remain, and be readable to some device which reads them rather more slowly than a 40x (or even a 1x) CD-ROM.
  • It's easy to deal with. Just costs a lot of money.

    data goes to disk, disk migrates to tape.
  • I think the problem isn't just how long will the storage last, but how long will it take to move the gargantuan amount of material that has already been collected and is still being added to at a furious pace.

    There was another Slashdot article [slashdot.org] about this a while back.

    true but there are people that don't want to be responsible for their own backups. like the writer who submitted the post. he's a writer, not a computer guy. anyway, i never thought about it this way until scott mcnealy said something that rang true. he said that keeping data on your hard drive or anywhere else locally is like stuffing money under your mattress. we put money in banks because we know that our money is safe and secure there. if your hard drive crashes or you walk through an airport x-ray with your laptop or your house burns down, you're shit out of luck. i may get a feeling of security with a local backup but i still think remote storage is safer.

    "The lie, Mr. Mulder, is most convincingly hidden between two truths."
  • I worked for a company that (in 1988) did all their backups onto paper tape, because it would last longer than magnetic media.

    They weren't so forward thinking in all respects though. This was the same company that had a Y1988 bug caused by using 1-digit years in their databases.
  • I'd say the whole thing is a fairly moot point - it used to be (way back) that if you wanted to save some data, you threw it on a floppy. Floppies tend to degrade (depending on quality, conditions, treatment, etc...) after a few years. Of course, hard drives got bigger, so people didn't need to move so much off onto floppies...then programs got bigger (bloat) so hard disk space was once again at a premium. The Zip drive came along into the mainstream (I believe it was in use by graphics houses for a while before it became popular with the average Joe) --- so now it was a larger (capacity-wise) storage medium that didn't degrade quite so fast...but even Zip drives are being replaced by CD-Rs, as the drives become less expensive. I'd say that it isn't inconceivable that in the next 20 years we'll see a new, larger-capacity storage medium that will outlast CD-Rs by a LARGE factor come into play...and by the time THAT is starting to break down, we'll have something better.

    Anything that's actually important enough to keep forever will survive by any means necessary (barring Murphy's Law taking hold). The rest can peacefully degrade.
  • No one ever complains that his or her eight track collection is going to rot. Bigger and better media are always on the horizon. Rearchive what's important, but not everything.

    Mankind has always dreamed of destroying the sun.

  • If you appeal to the right interests, humans will preserve something indefinitely.

    Case in point: If you go to the Museo del Oro (sp?) in Lima, Peru, you can see some of the few Incan gold artifacts that the Spanish didn't melt down into gold bars. There aren't that many religious items there, but, well, if you wonder what they did for fun...

  • It seems to me the best storage material would be stainless steel platters from 5 1/4 to 12 inches in diameter. Information could be recorded by literally blasting pits into the material on both sides. The first disk in the archive set would be marked so it would be obvious that it is the first. I would make each disk a different color, following the sequence of colors in the spectrum; that way, if the disks got out of order, there would be an easy, natural way to tell what order to put them back in. A society advanced enough to get to the archive would know what the spectrum is.

    I thought about using gold, since it has really good properties for storing lasting information, but decided against it for several reasons. One, gold is soft and can easily deform, destroying the information. Stainless steel will last forever where I want to put it. Not years, decades or even millennia, but billions of years. Second, gold is valuable. Stainless steel itself is practically worthless. Only the information on the disks has any real value, other than the fact that the disks themselves would tell a small story about how technologically advanced in materials research our society was.

    The first disk I would make side one low density. So low that you could tell there was information on it by rubbing your fingers across it. The first few tracks would contain a Rosetta stone on how to read the rest of the disks. The next few tracks would contain a summary of what is in the archive and why it is here. The next tracks would contain all the information on how to build a reader for the much higher density disks to follow. Hell, I would include a "simple" mechanical reader in the archive for the first disk and base the higher density readers on the simple one.

    Now where would I put this archive? I would put it in the most stable, secure environment I could think of: space. The first archive I would put on the moon, in Tycho crater. The second archive I would put in a secure orbit around Jupiter, and the third I would put in deep space beyond the orbit of Pluto. Now here is where I start ripping off 2001. I would put each archive set inside a black obsidian monolith of perfect dimensions. You want it so that any intelligent creature that came across it would know from looking at it that this was placed here by another intelligence. So you want it to look as unnatural as possible.

    I would have the archive set inside a special case in each monolith that would resonate when a high-intensity radio beam was focused on the monolith, so that any probing intelligence would know there is something inside it. The only way to open it would be to break it.

    Now why put it on the moon and in space? Well, I couldn't think of a better place to put them. Certainly couldn't keep them here on Earth. By placing it on the moon, if humans were blasted back to the stone age, we would have to climb back to a level of technology equal to the '60s before we could retrieve the first archive. Since by that time we would have forgotten about the second archive, the first one would point to it on the last disk. The second archive would contain all the contents of the first archive plus much more. The third archive would contain the first and second, but no new information. See, the third archive is not meant for humans; it is meant for others.

    In 4 billion or so years the sun is going to expand and consume the first and maybe the second archive if they are not found. The third archive will last forever in a deep orbit outside Pluto. It would be for any future aliens that came along in a few billion years. It would be our way of saying "we were here, this is us and what we were." It would point to the second and first archives, but the first and second would not reference it. We want humans to forget about it, since it is not for us.

    What would you put in such an archive? I would put us. Our cultures, our history, copies of our music, stories, religious beliefs, and genetic makeup. I doubt I would put much technology in it, because that would be redundant. Any society advanced enough to retrieve the archives past the first would be far in advance of our society when we placed them there.

    Somebody is going to see the word space and balk because of some imaginary cost they are going to pull out of their ass. While I don't know how much such a project would cost, I would rather spend a couple billion on this than on some multibillion dollar Pentagon wartoy. Besides, I don't think it would cost a couple billion.

  • well, partially (the process is close) but the idea is to get a single, playable CD with as durable a construction as possible. metal and glass rather than foil and plastic, and so forth

    Of course, you should be careful how you store those (vertical or horizontal), considering glass is really an extremely viscous fluid.

  • Typically they store a disk for a while in really bad storage conditions (high temperature, humidity, elephants etc) and then extrapolate from those results to guess how long a disk would last if it was stored 'properly'.

    Personally I was ripping a lot of my CD collection over the weekend (just so much more convenient to click on a playlist entry rather than having to find a CD every time I want to play it), and several of the early disks that I haven't touched for years have oxidation tracks curling in from the edges. Luckily not far enough yet to destroy data, but had they been 70-minute CDs rather than 40-minute CDs many of the tracks would probably be unreadable.

    Incidentally, this isn't just a problem in the digital domain; finding good prints of old movies is becoming harder and harder, and apparently when the Babylon 5 folks got their negatives back from Warner Bros to re-edit the pilot episode they found that many of the rolls had been soaked in a flood in the Warner film vaults, and others had been eaten by rats! In fact, it's quite possible that current DVDs will be the best version of many older movies that will be available in a century from now.
  • Not true, you can still buy readers. We bought one a couple of years ago to dump a bunch of Harvard survey cards (which use IBM-style cards with slightly different punch codes) for a customer. Since the automated testing outfits still use cards the same size as the IBM cards, it's an easy mod for the manufacturer to make the optical scanner look for holes rather than splotches of #2 pencil graphite.

    For what it's worth, these cards had been "folded, spindled and mutilated" and we still managed to read them, although some extra effort was required. They had been stored in a sub-tropical climate since the sixties, and several of them were bundled with rubber bands -- most of the rubber bands had dissolved and crystallized, so we had to scrape the residue off the outside cards to get them past the key slot on the reader. Overall, I'd say the punched cards survived pretty well, despite having been so poorly treated. They did much better than any oxide-coated media would have.

  • Try this. Get a strong magnet. I used an old speaker magnet. It had lots of force, enough to support a hammer. Pass the magnet over the zip drive a few times. Heck, leave it on there a while. Now try and read it. Do the same with a floppy and see whether you can read it. I could read both just fine. I would have thought a strong magnet would totally wipe them, but apparently not. A buddy of mine asked how he could quickly "destroy evidence" on storage media. I told him a strong magnet was bound to do it. I had another thought coming. Apparently, magnets are not a reliable way to destroy data.
  • I work for STK (the company that owns the tape storage market for big companies with lots of data). Our customers already have this problem.

    NASA (which has all that satellite and other automatically collected data that needs to be stored; not all of it has been processed yet despite being 20 years old or more) is in the habit of migrating to the latest tape technology every couple of years (3? no more than 6), because the latest and greatest allows them to get double the storage in the same space. They do this not only for the space savings, but also to keep that data from becoming unreadable.

    They are not alone, but I can't remember the specifics. (I'm also not sure I'm allowed to mention more)

    STK equipment has a reputation for reliability. Then again, you pay a minimum of $20,000 for a tape drive, and it goes up to $150,000. (Or buy the OEMed DLT drives for $6,000.)

    As a Linux user, right now the best you can do is copy to a new medium every couple of years. Make sure you do a verified write, and keep a copy offsite (in case of fire, if not protection from overzealous law enforcement). Better yet is a vaulting company, which do in fact exist, but they are immature at this point. (Meaning that you shouldn't trust your data to them without researching them; there are good ones and there are those that will lose your data. Pricing may also be more than you want to spend.) I would not trust any one medium to be my backup.

    Remember that most data isn't worth backing up (the Linux source -- except for local mods that are not yet in the mainline source -- /usr, most jpegs...). Think carefully: what is worth saving to backup? Probably "My dog by jessica age 6" (mementos of your kids), pictures of the family, the project you are working on today, tax records (for three years in most cases). There is more, but the majority of your 50 gig hard drive isn't worth the bother.

    Don't forget what others have said about the reliability of the medium. They appear to have more data than me, so I won't cover that ground. They had other insightful things to say too.
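One way to make the "verified write" concrete is to checksum every file on both sides after the copy. A minimal sketch in Python (modern standard library assumed; the function names and paths are illustrative, not any particular tool):

```python
import hashlib
from pathlib import Path

def sha256sum(path: Path) -> str:
    """Hash a file in chunks so large files don't exhaust memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_copy(source: Path, backup: Path) -> list[Path]:
    """Return the source files whose backup copy is missing or differs."""
    bad = []
    for src in source.rglob("*"):
        if not src.is_file():
            continue
        dst = backup / src.relative_to(source)
        if not dst.is_file() or sha256sum(src) != sha256sum(dst):
            bad.append(src)
    return bad
```

Run it right after the copy, and again periodically: a non-empty result the second time means the backup medium itself has started to rot.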

  • Jukervin says,
    Currently there aren't any real long time (500 years for example) preservation solutions for digital information.

    Colin Smith notes,

    Man - stone tablets are the way to go!

    Only guaranteed storage mechanism! Good for thousands of years.
    Capacity: 2Kb/tablet
    I/O: 1 byte/hr...

    Have a look at http://www.norsam.com/rom.html [norsam.com] for digital archiving and http://www.norsam.com/rosetta.html [norsam.com] for analog archival storage. The basic technology is to use particle beams to write very high resolution to silicon wafers ("high-performance rock" :-), which are extremely durable as long as you don't go after them with a sledgehammer or something.

    ...impervious to electromagnetic disturbances and has the ability, where needed, to store data on materials that are extremely durable and resistant to abrasion, atmospheric contamination, heat and other types of physical deterioration.
    The digital version stores 200 GB on a side of a 5 1/4 inch platter (with 10-disk and 300-disk jukeboxes, making possible a "petabyte machine room"), with very high speed (30 MB/s) write rate and reasonable (3 MB/s) read-rate. The analog version you can think of as "super-microfiche", writing analog page-images to the wafer (at something like the entire Encyclopedia Britannica on one wafer); it is readable by even such lo-tech methods as a good microscope (so it shouldn't suffer from reader-obsolescence).

    Norsam is partially funded by IBM venture capital, by the way.

  • magnetic tape and punched card formats which can no longer be read, because there are no surviving readers

    Actually, punched cards are the easiest legacy format to read. A reader is an easily constructed electromechanical device. If a great many have to be read, optical is the way to go. If speed isn't an issue, precision alignment isn't required either. A sheetfeed scanner and simple software can also read the cards. For that matter, if it's important enough, they can be manually transcribed. All of the above applies to paper tape as well. All other storage techniques (magtape etc.) require more sophisticated readers. If the punched cards have not been read, the data on them must not be all that important.

    On the other hand, what would one use (other than a CD-ROM drive) to read a CD? Any option I can think of requires far more effort than for magtape or punched tape/cards. The best bet (since it's not practical to transfer everything to punched tape) is to keep updating storage media. When denser or more durable media come into wide usage, that's the time to make a transfer. If the data is to survive civilization itself, the reader should be documented in an easy-to-read form (such as diagrams and text on hard plastic plates).

    The biggest problem I see is stupid copyright protections. When such measures are employed, the media and data are ACTIVELY HOSTILE to archiving/preservation efforts. For example, over 20 years ago, the BBC lost several early episodes of Dr. Who. A number of them were recovered because fans had recorded them on professional video tape machines (home VCRs were not available at that time). Had copy prevention mechanisms been in place (like the MPAA and others want now), those episodes would simply be gone, because nobody could have recorded them. Consider: how long is forever for a DIVX silver disk?

    A few TV shows may not seem all that important (and with TV, they probably aren't), but consider encyclopedias, textbooks, novels, etc. All of which are moving to electronic form, and all of which will probably be in some stupid proprietary copy protected form. That's a problem even now. Just try reading an old Excel spreadsheet today, and then consider trying it in 50 years (Good luck).

    On the other hand, nice simple comma delimited ASCII is pretty easy to read no matter how old it is. Even if the fields are not documented, it's not too hard to guess.

    In summary, unprotected open standard formats are the way to go if preservation is important.
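To make the "easily constructed" claim above concrete, here is a toy decoder for a subset of the standard Hollerith card code, where each card column is just the set of punched rows (12, 11, and 0 through 9). Real decks also used multi-punch special characters not covered in this sketch:

```python
# Toy decoder for a subset of the Hollerith punch-card code.
# A column is represented as a set of punched rows: 12, 11, or 0-9.

def _build_table():
    table = {frozenset(): " "}                # blank column = space
    for d in range(10):                       # digits 0-9: a single punch
        table[frozenset({d})] = str(d)
    for i, d in enumerate(range(1, 10)):      # A-I: zone 12 + digit 1-9
        table[frozenset({12, d})] = chr(ord("A") + i)
    for i, d in enumerate(range(1, 10)):      # J-R: zone 11 + digit 1-9
        table[frozenset({11, d})] = chr(ord("J") + i)
    for i, d in enumerate(range(2, 10)):      # S-Z: zone 0 + digit 2-9
        table[frozenset({0, d})] = chr(ord("S") + i)
    return table

HOLLERITH = _build_table()

def decode_card(columns):
    """Decode a list of punched-row sets into text ('?' if unknown)."""
    return "".join(HOLLERITH.get(frozenset(c), "?") for c in columns)

# "IBM" punched as 12-9, 12-2, 11-4
print(decode_card([{12, 9}, {12, 2}, {11, 4}]))  # -> IBM
```

The scanner's job reduces to turning dark spots into those row sets; everything after that is a table lookup.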

  • The promising upcoming technology is the Norsam HDROM. [norsam.com]
    NORSAM HD-ROM is impervious to electromagnetic disturbances and has the ability, where needed, to store data on materials that are extremely durable and resistant to abrasion, atmospheric contamination, heat and other types of physical deterioration.

    They had a test done at Los Alamos National Labs [norsam.com] where they tested the media for corruption after exposure to extreme heat and corrosive conditions.

    It's not quite ready for people to have an HDROM burner in their home PCs, but I suspect that when the patents run out in a dozen years, many will take interest in the technology...

  • Well, I was thinking that they might find a floppy and not even realise that it might be used for data storage.
    This reminds me of the "prayer fans" from Fred Pohl's Heechee Saga (the Gateway series). For decades humans had been finding these little alien artifacts all around the cosmos: crystalline cylinders that open out into a fan shape if you squeeze them right. No clue what they were for, so they called 'em "prayer fans", figuring they had some religious significance.

    Then they found the disk drive for them.
  • The BBC's policy in the mid 70's was "nobody wants to watch that old black-and-white stuff", and so the Evil Pamela Nash went, torch in hand, to vanquish The Doctor and his companions... along with all the old Quatermass stuff, the moon landings, many early BBC broadcasts, LOTS of early soaps & dramas, a vast number of pop music programs, etc.

    Many other TV stations did the same thing. When ABC TV (the UK station by that name) was bought by Thames TV, all the old ABC tapes were left in a pile outside. Anything not picked up by collectors was trashed. That would have included the early episodes of The Avengers (many of which ARE now lost forever).

    Some programs were lucky and were missed by the raving hordes of vandals and Huns that inhabited TV at the time. Sapphire & Steel escaped by turning into a door-stop.

  • most likely services like xdrive [xdrive.com] will be used for storage we want to keep. even now this is safer than keeping it on your local drive, because they handle all the backup, and if your local drive crashes you're S.O.L.

    "The lie, Mr. Mulder, is most convincingly hidden between two truths."
  • Stored properly, writable CD's last 100 years or more

    Evidence? I have several (audio) CDs from the early 80s which are no longer readable. OTOH, I have 9-tracks made at the same time which are still OK (presumably; I don't have access to a 9-track drive currently, but they were fine five years ago).

    Personally, I think that microfiche is the way to go. Plastic lasts quite a while, and OCR software is already good enough to read in straight text in a standard typeface. And even if civilization collapses, all you'll need is a decent lens and a mirror to review your pre-cataclysm tax records...
  • It will be a problem if the speed of the storage media does not keep pace with the capacity. There is a long technical paper [nasa.gov] on the issue of a storage survivability crisis over at the NASA site:
    Over the past 10 years, tape data storage density (with the same form factor) has increased according to Moore's law, doubling every 18 months. However, during the same period, data transfer speeds have only increased at a rate of about 1.3 times every 18 months, and thus have fallen behind data density growth rates by a factor of at least 3.
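Compounded over the decade the paper describes, that rate gap is dramatic: the time needed to read back a full cartridge keeps growing. A rough calculation from the quoted growth rates (assuming they hold steadily):

```python
# Density doubles every 18 months; transfer speed grows only 1.3x in the
# same period. Over 10 years that's 120/18 ~ 6.67 compounding periods.
periods = 120 / 18

density_growth = 2.0 ** periods    # data per cartridge: ~100x
speed_growth = 1.3 ** periods      # transfer rate: ~6x

# The time to read (or verify!) a full cartridge grows by the ratio.
fill_time_growth = density_growth / speed_growth
print(f"density x{density_growth:.0f}, speed x{speed_growth:.1f}, "
      f"full-read time x{fill_time_growth:.0f}")
```

So a migration pass that once took a weekend can end up taking weeks, which is exactly why outfits like NASA have to migrate on a fixed schedule rather than waiting until the media fails.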
  • I've read all the posts here and I think we've missed the entire question. We're thinking about existing media. Why?

    I think the answer is a self-contained recorder and playback device which is sealed and can accept a wide variety of power sources. Call it "BackAnywhere"! A data time capsule.

    The premise is simple: encase the actual storage device (likely solid-state and non-magnetic, for obvious reasons) in a case, write the data out, and seal it. The catch is in the interface: since 100 years from now we can't be certain ASCII will still be in use, we shouldn't necessarily write the data in that format. However, history has shown that languages with a sufficiently large text base can be deciphered even if they are thousands of years old (or a Rosetta stone can be found to translate)... I suggest we put a well-known book into the encoding stream. When you start it up and press one of the buttons, out comes Shakespeare or something. After the archaeologists have figured out what it says, press another button and there's the stored data, whatever it may be. Hell, you could bury an entire library in a 6"x6"x4" space.

    The thing about the power supply is the only problem: electronics require power. How this power is put into such a system in a way that ensures you won't blow the thing to kingdom come if you plug it in wrong will be the problem. After all, after WWIII in 3200, when we're rediscovering the lightbulb, somebody might have the "bright" idea of plugging it into a 30kV generator.

    Something to think about....



  • From Kodak's CD-R Overview. [kodak.com]

    The InfoGuard protection system includes a special coating that resists damage due to scratches, dirt, rough handling or other common mishaps. As a result, it's reasonable to expect a life of 100 years or more when discs are stored in average home or office conditions.



    --
    Why pay for drugs when you can get Linux for free ?
  • ...an article in the New Yorker dealt with this issue last year, and discussed the National Archives, where more than half the staff is now dedicated to transferring items from optical discs (the archival format of choice in the 1980s) to more modern media, while plenty of archival information (newspapers, video, gov't papers), both pre-1980 and current, is still not yet archived. The problem is that optical disc turned out to be a dead end; the format they standardized on no longer has anyone manufacturing players for it! The article finished by pointing out that we now generate more 'recordable' information than ever before, but we are also losing it at a higher rate than ever before. The format least likely to be obsolete, ironically: paper. --Philip
  • >Up until the 1930's somewhere, journals are pretty well
    >preserved. Then they suddenly get awful as paper mills switched to new methods.

    s/9/8/ for most printed materials.

    I have several books I inherited that were printed in the 1800s. The two oldest -- bound periodicals from the 1830s -- handle like they were just printed a few years ago. The books from the 1870s are very brittle, & when I have children, I'll have to hide them away from the rugrats until they're old enough to understand just how fragile the darned things are.

    And we're not talking quality literature here: the bound periodicals are examples of popular magazines, full of sentimental stories & poetry. At some point the covers were torn off, & my grandmother rescued them just before they were tossed into a fire pit. The one book from the 1870s, a translation of Schiller, was far more carefully produced & has an inscription from my great-uncle to my great-aunt.

    In a few hundred years, a lot of stuff from the 19th & 20th centuries will be lost. And it'll puzzle people how it happened.


    Geoff
  • If my local harddisk crashes, I always have my own backup. A separate computer is turned on automatically every 3 days to make a backup :)

    I use a small program to signal the parallel port and turn on a computer which NFS mounts the harddisk of my PC to make backups... sorta like the coffee mini howto but different ;P

    There's prolly better ways to do it, I just built that stuff for fun once and it doesn't take any effort to keep it running ;)

    I still trust my own backups better, I wouldn't want somebody else to be responsible for that.
  • It's interesting how some of the oldest technologies for "data" storage are proving to be the longest lasting.

    Here's an example:

    Wire recorders and wire recordings. These date back to the 1940s and 1950s. Instead of using a magnetic tape as we know it today, you record on a stainless steel wire.

    The disadvantages are:

    o Mono. It's a single wire. You can't put multiple tracks on it.

    o Frequency response. Not so good, but acceptable for voice recording and radio recording.

    The advantages are:

    o Little to no hiss! Tape hiss is mostly due to the fact that magnetic tape is covered with small, irregularly shaped magnets. A stainless steel wire is continuous, with no individual magnetic particles. Wire recordings sound surprisingly clean.

    o In theory, they can last forever! Tape formulations tend to break down over time. The plastic backing dries out and the oxide flakes. A wire recording is just a spool of stainless steel wire. It doesn't deteriorate. I have recordings from the late 40s that still sound pristine, and may well last forever.

    A couple more examples of "obsolete" technologies that are incredibly archival:

    o Black and white photography: Daguerreotypes. These were made by plating silver on copper, sensitizing with iodine, developing with mercury fumes, fixing with salt, and toning with gold chloride. The image is basically gold on silver, and the images do not fade.

    o Color photography: Technicolor. Technicolor pictures were originally made on a special camera that performed color separation in the camera, and produced three black and white negatives, each representing one of the primary colors red, green, or blue. These negatives were then used to create "matrices", which are essentially printing plates. Finally, the three matrices were used to print the release films -- using highly stable, acid-based cyan, magenta, and yellow dyes.

    The Technicolor process was replaced in the 70s by monopack film, which has three color layers in the film base. Monopack film is much cheaper to produce and easier to use, but the dyes used are dictated by the chemistry requirements of the process, and the dyes are not stable. This is why original prints of such films as "The Wizard of Oz" retain their color unfaded, while most films from the late 70s and early 80s have faded to shades of pink and red.

    Another example is punched cards. As someone pointed out, they can rot, but in a hundred years, if you found a stack of punched cards in the bottom of a desk drawer, next to a magnetic tape, I'll lay odds that you can recover the data off the cards, but not the tape.

  • We need a standard for the long term storage of data. This would consist of a number of mini-standards for media and file formats; call it LTDS (Long Term Data Storage standard). LTDS 1.0 might support the CD-R format as the media, and a flotilla of file formats - say MP3/WAV for audio, MPEG for video, HTML 4 for documents, PDF, and Java for programs, and of course some sort of file system standard (probably the current ISO CD file system standard).

    Hardware and software 'readers' would then be certified as LTDS 1.0 compatible, meaning that it can read all the physical media in the standard and all the file formats in the standard.

    As time progresses LTDS 2.0 will of course be developed say on DVD-RAM with newer file formats, but LTDS 1.0 would be a subset of the 2.0 standard. Hardware and Software readers would have to be LTDS 1.0 compatible as well as LTDS 2.0 compatible to be certified LTDS 2.0 compatible. You would always be able to read your stuff, no matter how old the format you saved it in.

    There is still the problem of physical media decay, but I am sure that the media manufacturers can address this and make some especially long-lived CD-R packaging (or DVD-RAM in the future, or what have you).

    -josh
  • With media obsolescence and decay a fact of life, your best bet is to keep copying your content between machines and media. In fact, there is historical precedent: that's how a lot of content has survived so far.

    It's also important not to store content in formats that become undecodable. So, Word 97 is out for archival storage. If the content is in ASCII or Unicode format, you can probably hack up a parser that gets most of the information back. It is also useful to store, along with the content, source code in a common and reasonably simple language (C, Fortran, core Java, Scheme; not C++) that can decode it. For example, for encrypted data, I usually store a source copy of the crypto program along with it. I consider formats like HTML, PBM, JPEG (with decoder), MPEG, and Sun audio good for long-term storage.
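The "store the decoder with the content" idea takes only a few lines to prototype; the directory layout and file names below (payload.gz, DECODER.py) are illustrative assumptions, not any standard:

```python
import gzip
import pathlib
import textwrap

# Sketch: archive a text plus a plain-ASCII copy of the source needed to
# read it back. A future reader can retype or port the decoder by hand.
DECODER_SOURCE = textwrap.dedent("""\
    # Recovery instructions: run with any Python, or port by hand.
    import gzip
    print(gzip.open('payload.gz', 'rt', encoding='ascii').read())
""")

def write_archive(dest: pathlib.Path, text: str) -> None:
    dest.mkdir(parents=True, exist_ok=True)
    with gzip.open(dest / "payload.gz", "wt", encoding="ascii") as f:
        f.write(text)
    # The decoder travels with the data, in the simplest encoding we have.
    (dest / "DECODER.py").write_text(DECODER_SOURCE, encoding="ascii")

def read_archive(dest: pathlib.Path) -> str:
    # What a future reader does once the decoder source is understood.
    with gzip.open(dest / "payload.gz", "rt", encoding="ascii") as f:
        return f.read()
```

The same pattern scales up: keep the decoder in the least exotic format available (ASCII source), and the payload in whatever compressed or encrypted form you like.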

  • Well, actually....

    Tubes are better than transistors in certain applications just because they DON'T work right. They color the sound in a way that is appealing to audiophiles. It has nothing to do with clarity; it has to do with a listening experience. Tubes apply sort of a dynamic EQ to playback as their electrical properties muddle with the sound. A lot of it is foolishness on the part of gearheads, but a fair amount of it is actual fact, that tubes are perceived to sound better. Remember this is all perception; it's not black and white.

    About digital versus analog: find someone with a good quality audio card and record something at 22 kHz, 44 kHz, and 48 kHz. Now do the same across 16, 20 and 24 bits. If you're using good listening and recording equipment, you WILL hear the difference. That doesn't mean digital won't sound better than certain analog gear, but it does mean that in theory analog has the ability to do better. Until I sit in a recording studio and hear state-of-the-art digital vs state-of-the-art analog of the same event recorded at the same time, I'll have to side with analog.

    The other thing at work here is that people take tapes, make them digital, and then whine about digital's quality. That doesn't make much sense, as the medium is PART of the recording. The limitations and strengths of analog are part of an analog recording; running that through A/D converters to change the format is going to be lossy. Similarly, taking a digital recording and transcribing it to analog is probably lossy as well.

    As for MP3, yeah, the quality is bad. Any self-respecting audiophile would never archive to MP3 given analog as an option. Of course, most people don't have recording-studio-quality analog gear, and the leap of clarity of a digital reproduction of the master tapes is tremendous when compared with consumer analog devices.

    -Rich
  • ..or make a gold gramophone disc of your vital data, then nail it to the side of the nearest convenient space probe.

    Bit difficult to retrieve, though.
  • I agree. We don't know what we lost, so we can't judge the usefulness.

    It is unlikely that the library of Alexandria contained scientific knowledge we haven't rediscovered (I don't believe in lost Atlantis, etc) but it certainly contained facts which are now forever lost, like more historical records of the time than exist in biased recordings (the Bible, etc). To have lost that library, and others like it, is tragic from a historian's POV.

    So, we should have a way to record all the data that we want, such that none is ever lost accidentally.

    For this, data havens aren't great. If the owner of the data is lost, it's all too easy for the data to be meaningless to everyone, strongly encrypted until it appears to be white noise. Physical data is handy this way: if your backups are in a safety deposit box, you might decide on less encryption, enabling heirs to read your files if you didn't pass on encryption keys in your will.

    Speaking of which, we need a strong encryption system whereby you can unlock data with a certain number of secondary keys, or a master key, but where the data doesn't get easier to unlock with fewer than the required number of secondary keys. For instance, the boss can unlock the data, as can any five of the seven employees, but if four conspire, the cracking is no easier. This will let keys be passed on after death, etc, in wills and by delayed mail, such that records can be unlocked, but in such a way that a dishonest person can't look at your will and gain premature access to sensitive data.
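Such a scheme already exists: Shamir's secret sharing has exactly this threshold property (any k of n shares recover the secret; fewer than k reveal nothing). A minimal sketch over a prime field, for illustration only, not hardened crypto:

```python
import random

PRIME = 2**127 - 1  # a Mersenne prime, big enough for a 16-byte secret

def split(secret: int, n: int, k: int):
    # Random polynomial of degree k-1 with constant term = secret;
    # each share is a point (x, f(x)). Any k points determine f(0).
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]
    return [(x, sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME)
            for x in range(1, n + 1)]

def combine(shares):
    # Lagrange interpolation evaluated at x = 0 recovers the secret.
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret
```

With `split(key, 7, 5)`, the boss keeps the key itself while any five of seven employees can reconstruct it; four shares constrain nothing, since a degree-4 polynomial through four points can still have any constant term.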

    On the subject of easily recovered digital information with a fairly high density, have you considered printing digital data to paper as a series of light/dark areas? This way it can easily be scanned into a bitmap (something we'll always have the ability to do) and a programmer could whip up a translator in an hour or two. Then print an intro page describing the text format (65 -> 'A'), etc, and the encoding (if you need to use anything special) as well as the dimensions, etc. These pages could be printed on high quality paper and laminated, or in the most paranoid case, photographically etched onto non-reactive metal film (which allows a better resolution, btw.)

    Testing of this method allowed 2048x2688 (or so) resolution, which translates into 672KB / page, or just over two pages/floppy disk.

    It's the longest term data storage we could think of because if you did the paranoid route and used metal film, it would theoretically last thousands of years, and all it requires to access is a scanner, which we have to assume there will be in the future, and a semi-talented programmer.

    It does have file-format problems, but you can completely document the file format in text at the beginning (it could even be very small, requiring a magnifying glass or scan+enlarge to read), or at least provide bootstrap info - that is, describe how to read a text file, then include as the first data a text file describing what to do with the rest of the digital data, etc.

    With metal film, or even very good paper and photographic printing, you could get 4-8MB / page. At 500+ pages per volume, it's pretty compact storage, and it would be used for rosetta stone type info, or the most important records. Everything else can just be translated from one format to the next every 10-20 years, as storage should allow ten times as much data on the same size medium in that time.
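The bit-to-cell scheme above is easy to prototype: map each bit to one cell and emit a plain-text PBM bitmap, a format simple enough to re-derive from its two-line header. A toy round-trip, without the checksums a real archival page would need (the fixed width and the need to pass the byte length back in are simplifications):

```python
def to_pbm(data: bytes, width: int = 64) -> str:
    # One bit per cell; the last row is padded with zeros. PBM "P1" is
    # plain ASCII: header, dimensions, then 0/1 cells.
    bits = "".join(f"{b:08b}" for b in data)
    bits += "0" * (-len(bits) % width)
    rows = [bits[i:i + width] for i in range(0, len(bits), width)]
    return f"P1\n{width} {len(rows)}\n" + "\n".join(" ".join(r) for r in rows)

def from_pbm(pbm: str, length: int) -> bytes:
    # Strip the two header lines, rejoin the cells, regroup into bytes.
    bits = "".join(pbm.split("\n", 2)[2].split())
    return bytes(int(bits[i:i + 8], 2) for i in range(0, length * 8, 8))
```

A real page would store the payload length and checksums in the bitmap itself, and use cells large enough for a scanner to register reliably.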

  • I used to think that our civilization would leave more info about itself than any other. But when I started to think about it, I quickly realized that most of the data is in degradable media. And the problem is getting worse: today most of the data we store is in digital format and usually compressed. Compressed data usually looks like random bits, and even a few missing bits could render the data useless.

    Some of the data is already encrypted, like DVDs (OK, bad encryption, but it is encryption), and soon the USA will make it even easier for everyone to encrypt data by further relaxing its crypto policy.

    Also, the media is evolving, but usually it is to accommodate more data. The durability of the data is usually less important; if the media can survive 5 years then it is more than enough (after all, in 5 years this media will be obsolete).

    So what is the memory that our civilization will leave for future archaeologists? Tiny little disks, with 1000s of terabytes of encrypted and compressed data that will probably be half damaged. Even if they could read the data - remember, it is probably stored at almost the atomic level - it would require finding the decompression and decryption algorithms and a key.


    --
    "take the red pill and you stay in wonderland and I'll show you how deep the rabitt hole goes"

  • However, I think he was mistaken. Ancient societies left stone tablets, cave paintings and the like behind, and there's no-one who fully understands the languages or the contexts (when an archaeologist says an object is of "ritual significance" he actually means he doesn't know what it's for). We do have the technology now, as the poster says, to migrate our data ever forwards into new storage, assuming no cataclysm occurs. And even if it does, it is far more important, in terms of recovering data, that the language (source code) survives, rather than CD ROM drives, Minidisc players etc (the binaries), because then data recovery is an essentially straightforward task.

    The point is that stone tablets are damn durable, while digital media (take your pick) aren't. When you see a stone tablet you see the inscription and you can say, "Golly gee! There's something written on this! It looks like a horse." or what not. When you see a CD, you look at it and say, "Hmm, kind of shiny. Mirror?" or if you're smart/lucky, "CD!" Then of course you have to figure out whether it's full or empty, whether it's an audio CD or a data CD. Okay, now there's files on it. Is it using Rock Ridge, Joliet, or plain ISO? Is this file data, or is this an executable, or some support file like a library? Is it for Mac, Windows, Solaris, Linux, Be...

    We have this problem today. I can give you an 8 inch floppy disk and say, "Behold! The answer to all the world's problem lies within. All you must do is read it and begin." Do you know where to even get an 8 inch drive? I sure don't. The only one I've ever seen was in "Wargames".

    You talk about the fact that it's more important for source code to survive. That way you can reconstruct the system that produced the data, and then you can read it. Sounds reasonable enough. One problem: what is source code typically stored on? Big stuff sure as hell isn't stored on paper. (However, I did one time see PGP source code printed and bound in an appendix to a book; I don't remember which one, though. It had something to do with PGP. Surprise, surprise.) Source code is typically stored on a digital medium, because it makes it easier to use. It's a catch-22.

    Now don't tell me "well, everyone will know", because "everyone" knew back in the past how to read Mayan, and we all know how well that turned out.
  • A good point. But honestly I don't think this will be a problem in 20 years. More recent standards like DVD look like they're going to maintain upward compatibility with CD-ROMs. A hundred years might be more of a problem, but hopefully, within a hundred years, there will be ample time to transfer the data to a newer format.

  • Is the product of capacity times lifespan divided by price. For a CD, you have something like 5*10^10 o.yr/$ (that's byte-years per dollar). For printed paper, it's maybe 3*10^5 o.yr/$, so we've definitely made some progress. Floppies are utterly worthless, by the way: nowadays, they survive about one week before going bad. Anyone care to calculate how much a tape is worth by this standard (I don't know how much they cost)?
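The arithmetic behind that CD figure, with the assumed inputs spelled out (650 MB capacity, ~75-year lifespan, ~$1 per blank — all rough guesses, not measurements):

```python
def byte_years_per_dollar(capacity_bytes: float, lifespan_years: float,
                          price_dollars: float) -> float:
    # The figure of merit proposed above: capacity x lifespan / price.
    return capacity_bytes * lifespan_years / price_dollars

# A 650 MB CD-R lasting ~75 years at ~$1 per blank:
cd = byte_years_per_dollar(650e6, 75, 1.0)   # ~4.9e10 byte-years/$
```

Plug in a tape's capacity, expected shelf life, and street price to answer the question at the end of the comment.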

  • by ben_ (30741)
    A recent article in the UK paper The Guardian commented that the sheepskin on which the earliest known version of Beowulf is written had lasted far longer than any modern medium, and was therefore superior :-)
    So go for holes punched in sheepskin: the storage medium of the last millennium.
  • Many people seem to think that it's about storage size; it isn't. There will be no problem finding the space.

    The problems instead are actually migrating the data. Ideally, the data should be kept in a live state, transferred from old storage media and converted to more modern formats (and classified and indexed!) during the available migration period, while such migration is supported. That, for even a medium-sized organization, will be a full time job for a few people.

    In the worst case, you're only transferring from old media. Then, recovering any data instead becomes a full time job of locating it, researching storage formats, finding something able to read those formats, and eventually converting the documents to something readable.

    Of course, it mostly becomes a problem if your organization is using proprietary data formats. Using the simplest, most standard formats, such as ASCII or SGML-formatted documents, makes it far easier.
  • How many millions of trees' worth of redundant paper do we have in cold storage, just in case someone needs it?


    ...

    The answer is simple. If it is of relevance, and people want to keep it (remember Oliver North ?), then people will keep it. People will judge the value, and take appropriate action.


    One problem is that the next generation may not care about a particular piece of data, but the one after that would find that data invaluable.

    My parents have recently gotten into genealogy and I find it very interesting as well. The thing is that we continually find people who didn't write down their parents' or grandparents' names, because they "knew" that information. But two or three generations on, and you don't know who your great-grandparents were, where they lived, or anything else about them and their lives.

    People who keep regular journals, even of the most mundane and day-to-day activities and events are invaluable to those who are trying to find out about their lives. Trivial information provides glue that ties historical events into perspective and show how things relate to the individual and not just what is represented by the history books.
  • Seems to me that there are already two solutions that will handle exactly what you describe. They are both low cost, intuitive and user-friendly. Arguably, once you've used either product, you may find it difficult to manage without them (at least, I know I do ;->). The products are:

    1) "The Brain" by Natrificial. You can check it out at thebrain.com [thebrain.com]
    A relational File-manager for Windows.

    2) BeOS [be.com]

    Clearly the solution is here. The question is: will enough people adopt it to make it work?



  • I've read of one decent long-term solution. It's not particularly convenient, but it *is* known to be capable of surviving hundreds of years.

    Print the data on acid-free paper. Use an impact printer, not a laser printer. (Are you sure that the toner binding agents will last hundreds of years?)

    You could print textual material in an OCR-friendly manner (e.g., the source listings in the "Cracking DES" book). This will obviously take a lot of paper and space, but it could be read by a human.

    Or you could print the material in a "barcode" type style with plenty of embedded checksums. If you use 1 mm square cells (which should be large enough to allow scanners to adjust for paper warp, water damage, etc.), the amount of data which fits onto a single sheet of paper isn't much more than you get with raw text... until you consider that this is true 8-bit data with error recovery, not 7-bit text.

    I think it goes without saying that this format is not intended for frequent use. But if you had information that you *had* to archive for centuries and you had unlimited access to vast underground storage vaults, this is probably the most stable media known today.
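A back-of-the-envelope capacity check for the 1 mm cell scheme above (the page size, margins, and text-page figure are assumptions for illustration):

```python
# One bit per 1 mm cell on an A4 sheet, minus generous margins.
usable_w_mm, usable_h_mm = 190, 250

raw_bits = usable_w_mm * usable_h_mm   # 47,500 cells per page
raw_bytes = raw_bits // 8              # 5,937 bytes before checksums

# A dense page of printed text holds roughly 3-4 KB of characters, so the
# raw win is modest -- the real gain is 8-bit data plus error recovery.
```

This matches the comment's point: at 1 mm cells the density barely beats raw text, and it is the checksummed, arbitrary-binary property that justifies the format.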
  • A closely related problem is the question of whether you're storing data in the first place.

    This question has come up twice in the past decade. In the first case, a tape backup drive quietly failed and there was no indication of a problem until they attempted to retrieve a file.

    In the second case, the person responsible for performing backups carefully ticked off the paperwork... but as far as anyone could tell he never actually swapped any tapes. The company discovered this after he (intentionally) corrupted the Netware database and then walked out the door.

    Both problems can be solved by simple procedural changes (e.g., always "verify" tapes after writing, have someone else run "verify" or rotate the duty).

    Yet... twice in the past decade I have direct knowledge of data loss. In one case this happened despite a competent and dedicated IT staff. Assuming this wasn't just a statistical fluke, it follows that there must be a significant risk that archival data is bad at the moment it's produced - perhaps a 5-20% chance of one or more bad media per year per backup group.

    It doesn't make much sense to invest in premium media if you're saving garbage.
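The "always verify" rule is cheap to automate. A sketch that re-reads every file from the archive and compares digests against the originals (file layout and names here are illustrative; a tape drive would re-read from the device instead of the filesystem):

```python
import hashlib
import pathlib
import tarfile

def backup(src: pathlib.Path, archive: pathlib.Path) -> None:
    with tarfile.open(archive, "w") as tar:
        tar.add(src, arcname=src.name)

def verify(src: pathlib.Path, archive: pathlib.Path) -> bool:
    # Re-read every file out of the archive and compare digests with the
    # originals; a silently failing drive shows up now, not in 5 years.
    with tarfile.open(archive) as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            stored = hashlib.sha256(tar.extractfile(member).read()).digest()
            original = hashlib.sha256(
                (src.parent / member.name).read_bytes()).digest()
            if stored != original:
                return False
    return True
```

Having a second person (or a cron job) run the verify step also covers the ticked-the-paperwork failure mode from the second anecdote.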
  • This raises an interesting point. Given the efforts made by the PGP group, and with fewer limitations, would it be possible to generate pages of text on paper (or possibly something more resilient, like plastic) with both English and binary encodings of the same information? Something similar to an efficient barcode-style encoding down the side of the margin (which is rarely used in documents), with standard English text and pictures in the document? Scanners these days are easily capable of distinguishing text from pictures, so those needn't be encoded, leaving one page of text plus checksums, which surely can be binary-encoded within the margins?

    Just a thought.
  • Get yourself a big radio transmitter, and just beam the stuff into space with lots of error correction. When you wanna retrieve it, you just have to hope for a faster-than-light drive. No media decay problems, and with technology advances your ability to retrieve the data properly increases every year, depending on the rate, this could mean thousands of years of archiving. Even better is if some alien races pick it up and store it as well.
  • That is totally true; let me give another example. I loved the Atari 2600 when I was a kid (who didn't?) and I have two emulators for it. I downloaded all my favourite games and I was happy to know that I actually remembered how to play them a good 10, 12 years after the fact.

    But, I downloaded a ton of games that I never had but always wanted - but was not able to find manuals for them all online. Now I am a bit confused as to some of the games... what the blinking blob means, or why x happens when my little guy picks up item y. So I have sadly given up on some games that looked cool, but I'll be durned if I know how to play them. A lot of the early games lacked "begin" or "end" screens, too.

    I've also read about Finnish scientists who are trying to come up with signs to last at least 500 years in a language/medium that people will understand in the future, or perceive as a danger sign. Our yellow and black shield will probably have as much meaning to people in the future as Venus figurines do to people now.
  • This is an old thought experiment that has one flaw (see if you can spot it!). But it makes an interesting case for low tech solutions.

    Iron Bar Storage System

    First, take all your data and stream it out as one long multi-digit number. Now place a decimal point in front of the number and treat it as a very precise fraction. With the length of your iron bar treated as 1, measure off the distance along the bar equal to your data's fractional value and file a little notch in the bar.

    That's it. The notch now represents the fraction that is your data. Anytime you want to recover your data, just measure the distance to the notch, divide it by the length of the bar, remove the decimal place, and convert the number back into your data bytes.

    Voila! Instant low tech storage!
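With exact rational arithmetic the scheme even works; the flaw, of course, is that no physical measurement of a notch carries anywhere near enough digits of precision. A toy demonstration (helper names are made up):

```python
from fractions import Fraction

def notch_position(data: bytes) -> Fraction:
    # Append a 0x01 guard byte (keeping the value odd, so the fraction
    # stays fully reduced) and scale the integer into (0, 1).
    n = int.from_bytes(data + b"\x01", "big")
    return Fraction(n, 256 ** (len(data) + 1))

def read_notch(pos: Fraction) -> bytes:
    # Multiply by 256 until the denominator clears; the step count gives
    # the byte length back, and the final integer gives the bytes.
    steps = 0
    while pos.denominator != 1:
        pos *= 256
        steps += 1
    return int(pos).to_bytes(steps, "big")[:-1]  # strip the guard byte
```

Storing even one kilobyte this way would require measuring the bar to thousands of decimal digits, i.e. to a precision far below the width of an atom — which is the flaw the comment invites you to spot.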
  • As some other posters also touched upon, I'm not as concerned about how we save data, or how much we can save, as I am about /what/ we save. It would be a tragedy to save /everything/ and then find that anything of any significance is entirely drowned out by the noise of the irrelevant and insignificant.

    We are getting to a point at which traditional "file systems" are going to become archaic. When "file systems" were first created, drives had very low volume, and very few files. The name-space-to-"file" ratio was very high. It was possible for everything you could fit on a disk to have a uniquely identifiable name/location which gave you instant insight into what that file contained or was for. However, we now have /very/ high volume capacities. Traditional "file systems" (systems of files) don't scale very well. It is not feasible to uniquely name every file on your hard drive; if it is possible, it is at least very torturous to maintain. The problem is that traditional file systems have only one dimension. They account for only one stratum, one plane, one cross-section, of the many attributes files have - namely their "name"/location. But files have much more meta-data than just their "name". They have content. When we want to obtain a file, we aren't looking for its name, but its content.

    A new paradigm needs to be introduced. I think traditional file systems will need to acquire characteristics of relational databases. What good is a 17 GB drive if it takes you half an hour to find something you want? Today we have much richer and more diverse content in our data, and our storage systems need to accommodate that. We need to be able to make intelligent, high-level queries, like "all email files which contain spreadsheets on last week's product demonstration". This is what we are looking for, not "prddemosprsheet012500.text". Files aren't just of one type, or one attribute, anymore.

    Our data contains many planes of meta-data. We need a storage system that understands that, and allows us to make intelligent and intuitive high-level use of it. We need an associative/relational storage mechanism, whereby files are stored not according to an absolute location, but according to their attributes and relations to other things.

    Jazilla.org - the Java Mozilla [sourceforge.net]
  • E.g. A video file is not just "video"...it could be "documentation", "work-related", "outdated". Should I put it in my "media" folder or my "docs" folder, or my "work" folder, or my backup "folder", etc...

    The answer is all of the above - through a relational database-type system, associate the file with /all/ of the above attributes. The answer cannot be solved simply by a 1-dimensional hierarchy. Data is of various dimensions we should account for.
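A toy version of that multi-attribute lookup, using an SQLite tag table in place of a single hierarchy slot (the schema and names are invented for illustration):

```python
import sqlite3

# Files get arbitrarily many tags in a relational store instead of exactly
# one place in a directory tree.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tags  (file_id INTEGER, tag TEXT);
""")
fid = db.execute("INSERT INTO files (name) VALUES ('demo.avi')").lastrowid
db.executemany("INSERT INTO tags VALUES (?, ?)",
               [(fid, t) for t in ("video", "documentation", "work-related")])

# "all work-related documentation", regardless of where the file 'lives':
rows = db.execute("""
    SELECT f.name FROM files f
    JOIN tags a ON a.file_id = f.id AND a.tag = 'documentation'
    JOIN tags b ON b.file_id = f.id AND b.tag = 'work-related'
""").fetchall()
```

The same file turns up under "media", "docs", or "work" queries alike, because location is just another attribute rather than the file's one identity.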

    Jazilla.org - the Java Mozilla [sourceforge.net]
  • Hmm... you basically just described the process for making a conventional CD master, except on the master stamp all the pits are inverted.
    Well, partially (the process is close), but the idea is to get a single, playable CD with as durable a construction as possible: metal and glass rather than foil and plastic, and so forth.
    --
  • How about this:
    Make a standard backup of your vital data.
    Take it to a special "data preservation agent" who will probably do it as a sideline for normal Disaster Recovery stuff
    Agent makes an optical mask of a CDR image onto a blank, metal disk
    optical mask is acid-etched to give a metallic CD (using two metals with noticeably different optical properties, or burning right through if the disk is thin enough)
    in a suitable atmosphere, mold glass around the metal disk to give you a metal-and-glass CD.
    place in padded, light-opaque, metal case, and state you only guarantee the data readable if it is kept in that case full-time.

    Obviously, the DRA would need to keep some hardware capable of reading these, but as he will probably be offering vaulting services for these disks anyhow, he will be wanting to access them on demand in any case. Any comments?
    --

  • Modern word processors still open really old file formats like Windows .WRI and Word 1.0, and I don't see that likely to change in the near future.

    But it could happen at any time and without warning.

    In my experience the typical non-geek computer user buys a computer and uses it hard for eight or ten years. I know people who still use AppleWorks on an Apple //e on a daily basis, and many more people who are still using old word processors (Word or WordPerfect) for Windows 3.1.

    All someone at Microsoft has to do is say in a meeting, "You know, I wonder if it's time we dropped support for ancient Word version X.YZ?" and it could easily happen.
  • They're enough of a problem, though, especially if they're based on closed-source protocols.

    The question of how long you want your data to live is important. Who do you want to be able to read it--yourself ten years from now, your grandchildren, or an archeologist? Compare these three time scales to the rates of evolution for various data storage technologies and protocols.

    Choose ASCII. Absolutely anything can read it. It has been around for nearly forty years and serves as the foundation for Unicode and every other significant modern encoding scheme. If someone can recover the bits you wrote, they can read what you wrote with nothing more complex than cat. If you want to go beyond ASCII, choose HTML. If you want to go beyond HTML, choose TeX. HTML is wonderful for formatted or semi-formatted documents because it is utterly platform independent and almost intuitively obvious to the reader. If formatting is critical to you, TeX is slightly less readable, but clean enough and well-documented enough that your document can be recovered with only slightly more effort than the HTML.

    As for storage media, gosh, I can't really help you there. I've seen various reports in here of what does and does not survive for how long. Congratulations if your CD-ROMs last 300 years; doubtless you'll be able to fire up your creaky old computer and read those files back in ten or twenty years from now. But in three hundred years, or even fifty, who's going to have (a) a CD-ROM drive that will read your 300-year-old disk, (b) a computer that will interface to the CD-ROM, or (c) enough documentation of how those technologies work to reproduce a working example?

    If a writer wants her stuff to last, her best bet is to print it as text on acid-free paper. The disadvantage to paper is its editability. With slow-decaying acid-free paper and reliable storage/handling protocols, her worst-case scenario is that she has to scan it back in and hack it up. Scanners are wonderfully architecture-independent--they translate what is universal into the currently fashionable file formats of the day. If she wants better editability and format preservation, let her print out the HTML or TeX source; then she can scan it back in and continue messing with its layout, fonts, styles, whatever. So what's the best font to use if you want to scan text back in later and you want humans to be able to read it?

    --

  • Most data has a useful lifetime after which it is of little use to the owner. My tax returns are only worth keeping for a few years, the same with the financial documents that support them. My birth certificate is useful for my lifetime and has some value to genealogists after that. Flame mail to the network that aired some boneheaded Y2K alarmist story 7 weeks ago is already obsolete.

    The problem is to organize data in a way that highlights how long it is needed. It is difficult to give a date in advance after which some things will be obsolete. If I write a book, when I am finished writing it, if no one buys it, the file is useful as long as I want it. If it becomes a best-seller, my biographer will probably want the rough drafts in 20 years. But I don't know that when I save them.

    The solution to this is to learn a better strategy for identifying data. Some file formats already make provisions for this. LaTeX and DocBook already provide tags for quite a bit of identifying information about the source. Meta information can be placed into HTML. CVS stores records of who made changes and why in addition to retaining a record of each revision.

    In fact, now that I think about it, CVS provides a good model for data storage. You get a way to retrieve each version of a file. You get a way to link together corresponding revisions of several files. And you have a record of when, why and by whom all of the changes were made. But at its heart, it is a system for data that is still alive. It is not a system for organizing the historical records of a person, company or government. And it doesn't address the question of media decay because it is independent of the specific media.
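    As a rough illustration of that CVS-style record (who, when, why, plus a checksum so a future reader can verify the bits survived intact), here's a minimal sketch in Python. The field names are my own invention for the example, not any standard:

```python
import hashlib
import json
import time

def manifest_entry(path: str, data: bytes, author: str, reason: str) -> dict:
    """One archival record per file revision: identity, size, checksum,
    and the who/when/why that CVS keeps alongside each change."""
    return {
        "path": path,
        "sha1": hashlib.sha1(data).hexdigest(),   # content fingerprint
        "bytes": len(data),
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "author": author,
        "reason": reason,
    }

entry = manifest_entry("draft.txt", b"chapter one...", "mouthbeef", "first draft")
print(json.dumps(entry, indent=2))
```

    A pile of such records, one per revision, gives you the "when, why and by whom" audit trail independently of whatever media the files themselves sit on.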
  • Interestingly enough, with all the recent hubbub over "millennium capsules," the proposals for the NYTimes capsule wrestled with this very question of data loss. One group came up with a pretty impressive solution: genetically splice/embed data into the DNA of a cockroach - then breed them and set them loose in the wild.

    http://www.nytimes.com/library/magazine/millennium/m6/design-lanier.html

    That data will survive everything.
  • For a time, early to mid 20th century, acid-free paper was noticeably more expensive, so there was a tendency not to use it. However, its price has come down, so now most printing uses it. Here are some statistics from a published source [stanford.edu]:

    Of all the new acquisitions tested, 89% were printed on acid-free paper. Approval orders--books arriving as part of a vendor selection plan--were printed on acid-free paper 97% of the time. Only 1% of hardcover approval orders were acidic. Compare this with 30% of softcover approval orders and the difference is marked. Of firm orders--orders placed by the library to a vendor or publishing house--82% were acid-free. Hardcover books (which accounted for 70% of the firm orders) were 93% acid-free while 80% of the softcover were printed on acid-free stock.
  • Take a look at the work being done with etched nickel disks, of the sort made by Norsam Technologies [norsam.com]. They are making disks which can have the actual textual content as the data-carrying element, visible under very high magnification. Norsam builds automated retrieval workstations, which are really just a computer attached to a powerful microscope, with some automation for finding and downloading the data. The beauty of the system is that the disks are _very_ long-lived and stable (expected to last _thousands_ of years) and the content is readable by anyone with a microscope. Optical microscope readers can allow up to 20000 pages per 5cm disk, and electron microscope readers can allow something like 350000 pages. The key here is that data can be written either as text alone, or as a combination of digital data and textual descriptions of the methods required to decode it. The Norsam web page shows color digital images (TIFF-format) etched onto the disk digitally.

    Several groups are looking into this technology as a possible way to stably maintain their archives over a very long period of time. Take a look at the Long Now Foundation [longnow.org] library for an example.

  • Magneto-optical is probably one of the most stable storage mediums available.

    Theory and practice diverge in an unhelpful manner. 3 years ago I worked on a project to convert a 5 year old MO system to another MO system, simply because the old drives were no longer available and ongoing maintenance was a hassle. Owing to stupid cost-cutting on my project, the "new" drives we used were already becoming obsolete. Today no-one still makes drives that can read either set of disks and on-going maintenance of the #2 system is dubious.

  • At my site, we're tasked with creating and maintaining the archive of satellite data for the GOES series of weather satellites operated by NOAA. Currently the archive spans about 25 years and 150+ terabytes of data. Most of the data lives on Sony U-matic tapes (video production quality equipment). It was very cutting edge at the time and required some interesting H/W hacks to get it interfaced with the dish electronics.

    Currently the system is hopelessly obsolete and the remaining units are being carefully nursed as we begin the migration effort. Furthermore, many of the older tapes have become "read once" media, so you can't afford to miss anything.

    Many of the suggestions about formats and media life ignore some of the realities and complexities of the real world. Our archive necessarily breaks a large number of these assumptions (as I'm sure do many others).

    1) ASCII and higher-order representations are not adequate for scientific data.

    2) Selecting media with a longer life span only defers the problem to a later date and makes the migration process longer. It also makes it even less likely that adequate readers will be available when the migration begins.

    3) Raw data formats can change arbitrarily often during the lifetime of the archive.

    4) There is unlikely to be adequately stable online storage media available to hold the entire archive as well as the "live" data set (data volumes will increase to match existing storage capacity).

    So, what can be done? Many of the suggestions already posted are good and should be incorporated into any archive strategy. So here are some suggestions based on things we're looking at:

    1) Identify what needs to be archived. As many have noted, most things don't need to be archived.

    2) Build a migration strategy into the plan right away.

    3) Keep source code and any auxiliary data needed to access the data available with the data itself.

    4) Keep at least two copies of the archive. It's amazing how many archives exist in only one place (depressing, really).

    Of course the biggest challenge is to keep all that data in a meaningful form. That's really the biggest part of the problem, and it's likely to get worse as data volumes grow. Things are coming down the road that will make our current demands look pretty small. That's good and bad. On the good side, our existing problem will fit easily into any solution we come up with at that point. On the down side, it's not clear what those solutions will be.

  • This brings up two personally relevant items for me that others may find illustrative.

    The first is laserdiscs. They were advertised as permanent. After all, what could go wrong? The media was sealed in plastic and couldn't get any air, so deterioration was impossible, right? Wrong! "Laser rot" became obvious within a few years of the introduction of the technology. There are now gazillions of first-generation (and later) discs that are simply unreadable. I have dozens of them and can personally attest to the sinking feeling that comes with seeing data degrade and become unavailable. (In a similar vein, music CDs will degrade while the vinyl they "replaced" will, given proper care, soldier on for another century or two. And records sound better/hold more data, too. CDs were supposed to be an improvement?!?) The lesson? Don't trust industry shills who tell you a technology is good for 100 years. They simply don't know.

    The second example is more personal. As a former photographer, I have some works that I want to preserve forever. Maybe I'm conceited enough to think that in a thousand years my works will be found and I'll be proclaimed a great artist. Maybe I'm just anal. Either way, I want my photos to be around for a long, long time. Now, properly processed silver halide-based film is pretty stable. My negatives will last a long time. But for the ultimate in longevity, I've begun making platinum-based prints on a variety of media, including plastic squares and enamel tiles. If I can find a source of enameled titanium squares (about 10 inches or so), I'll have a combination of media and chemicals that can reasonably be expected to last for a couple of thousand years. The lesson? True long-term data integrity sometimes requires an open-minded approach. If I rejected platinum processes because they were 100 years old, I'd have never discovered their permanence.

    Until someone can come up with a novel way to store data that is truly permanent, I'll rely on the "bigger hard drives, cheaper, every year" theory to keep my data safe. But I don't really feel good about it.
  • What I've noticed is that most of the data we're accumulating is quickly becoming useless. 10-year-old schoolwork isn't worth archiving. The data you really want to keep shouldn't be very large anyway...

    This may be true of households and many businesses, however there are government requirements for keeping data for things like clinical trials and funded research which are subject to this problem.

    There is also a public-interest issue here. Imagine if the tobacco industry, which has in effect lied to the public and hence murdered for the past three or four decades, had been required to have its research data archived in a retrievable format. Another example is the archived PROFS email correspondence of the National Security Council members during the Reagan era, which led to the smoking guns of the Iran-Contra scandal. A final example: a small city near where I live recently had difficulties deciding what its City Charter actually said, because of poor recordkeeping of its officially adopted amendments over the years.

    While most data is of little use to us after a year or a few years, there are longer-term projects and public-interest requirements that make it a public issue. True, I doubt if anyone would want to save most of those Letterman Top-10 lists, blonde-jokes and similar net-chaff for very long. I do think the Stephen Wright stuff will endure, however....

    -Dave
  • and the listing itself explains why. In the (relatively) not-so-distant future, it's very possible that an entire century's worth of data could be stored on thousands or hundreds of dollars' worth of equipment (as opposed to millions) - keep in mind that not much data was produced in centuries previous to this one, and not much in this one compared to future ones. (By this century I mean the 20th: one more year, folks.)

    Perhaps the amount of data will increase faster than the amount (price) of storage, but I doubt it. 640k should be enough for anyone! In any case, all the data generated thus far is likely to remain safely stored somewhere until extinction, if it is ever digitized and made publicly available (and anybody cares to store it).

    I can imagine them in 3000 CE looking back on the logs of the (at that future time) most popular web site ever to have existed, and reading this very thread :) I suppose it'll be found somewhere on ENCYCLOPEDIA, DISC 1: PREHISTORY-2012.
  • It's a real problem. IBM Research has an answer, which draws on the fact that they're a Big Old Company. They have a huge archive of historical material which they copy to new media every few years as new media are developed.

    Two years ago, I was involved in an effort to preserve the archives of the Stanford AI lab from the 1960s and 1970s. Several alumni spent weeks taking turns reading 9-track backup tapes through the last two working 9-track tape drives at Stanford. The raw data was sent over the Internet to a big file server at IBM Almaden. There, somebody who remembered how the old SAIL tapes were formatted had written a program that extracted the files from the backup tapes.

    Once the files had been extracted, they were processed into standard formats. (The SAIL machine had its own weird character set and its own image formats.) Text was converted to Unicode and (monochrome) images were converted to GIF. The MD5 hash of each file was computed, and duplicate files (these were backup tapes) were removed based on the MD5 hash.
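    That dedup step -- hash each file, keep one copy per distinct digest -- is a few lines in any modern scripting language. A sketch (illustrative only, not the actual SAIL tooling):

```python
import hashlib

def dedupe(files: dict) -> dict:
    """Collapse duplicate contents: map each distinct MD5 digest to the
    first filename that carried it, the way repeated backup tapes of the
    same data can be reduced to one copy."""
    seen = {}
    for name, data in files.items():
        digest = hashlib.md5(data).hexdigest()
        seen.setdefault(digest, name)   # first occurrence wins
    return seen

tapes = {"1976/mail.txt": b"hello", "1977/mail.txt": b"hello", "1977/notes": b"misc"}
print(len(dedupe(tapes)))  # 2 distinct contents out of 3 files
```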

    The material was then indexed with a web-spider type program, so it could be searched readily. CD-ROMs were made of the content belonging to individuals, and sent to those who could be identified. Permission is being obtained from each individual to have their data published. (The files include private E-mail, for example.) Data approved for public release will be visible on the Web in a year or so.

    If you ever had a SAIL account at Stanford, Bruce Baumgard at IBM Almaden has your stuff, and you can contact him for a copy.

    It's a lot of work. And this was only about 10GB of data. This gives you a sense of how hard the problem is.

  • The concept: build a hyper-reliable storage device or something, something that can last for millennia. Then ship it to the moon. Voila! Permanent information deep-freeze.
  • Assuming that I have no information about a disk drive, then I agree that it would be unreadable, but not punchcards. If you know the alphabet, then ASCII punchcards containing source code probably wouldn't be too hard to decipher.
    Newspapers are frequently archived using photographic reduction. These are expected to last a very long time. Reading these just requires a very good magnifying glass.
    Even CDs would probably be quite easy to decode and decipher with technology equivalent to ours.

    Also, we keep a lot of data around purely for future historians. I think destroying everything, or even just that part of all the worlds data stored to prevent decay would be virtually impossible.
    What might be harder to decipher than the technology is the language that these are written in. We need to start making some Rosetta Stones, and burying them all over the world.
  • It is reckoned that 90% of paper, once filed, is never read again. How many millions of trees' worth of redundant paper do we have in cold storage, just in case someone needs it? Do we want to repeat the mistakes of history, by wasting resources and lives in archiving every piece of data in sight, for fear of losing it one day? Do you realise that we are generating data faster than ever before (yet genuine information remains at a premium)?

    The answer is simple. If it is of relevance, and people want to keep it (remember Oliver North ?), then people will keep it. People will judge the value, and take appropriate action. Otherwise, it goes to the bit bucket in the sky, and remains there ......

    In years to come, will we be complaining of 'data pollution', caused by spurious archives that no-one dare ditch, or noxious chemicals from discarded CDs ?
  • Most of the stuff people archive tends to be logs or other such chunks of 'low value per byte' data - expensive stuff tends to get kept live and backed up regularly. The only other thing that I think is habitually backed up is stuff like important configs and important content - in other words, things which must NOT be lost, but which are frequently obsolete in a few months. With hard drives expanding fast, most critical data can just be migrated onto ever more capable storage systems. I know where I work (an ISP), customers get backups done for them, and no-one thinks about the long-term viability of those backups, because all critical data is backed up regularly, because it is kept live. Who cares about last year's web pages?

    As for an entropic nightmare of backups of backups :> The tools for controlling backups and the speed of backups are improving fast enough that we spend LESS time doing backups, not more. Plus, with speciality data vaults coming more into play, I see the future being one of many people and companies with data, backed up by their choice of 'backup provider', who keep backups, and who in turn are backed up by national or international 'secure backup services', unseen companies known only to those in the business, who aren't interested in backing up less than a coupla hundred terabytes per customer.

    There's been a few rumbles about very high-density CDs and laser-read semi-biological crystals as storage media, the notable point being a move from '2-D' storage media to '3-D'. Which makes me wonder if some physical process of copying a crystal block might take over from the informational process of writing a new one. Data usage is expanding to fill the services that are provided, and technology races to stay ahead of that demand. Given that we are not yet storing data by the electron excitation states of atoms ("NEW from Store-TEK - a Titanium based storage cell, allowing a whole byte per atom! Forget your 4-bit arrays and buy the new...") I think that the storage media industry has got plenty of cards to play to meet the expanding needs of data storage.

    Cache-Boy
  • One thing that has always disturbed me is the chaotic way in which information technology evolves. It is natural that it is so, considering evolution comes through individual steps taken here and there. But I believe that if we opted for a 1-year freeze in new developments and set up new standards, everyone should benefit in the long run.

    It sounds absurd, ok, but I would like to see a standard defining a general information storage format, which would encapsulate whatever format would be chosen to really organize data (FAT, NFS, journaling, orthogonal, etc.), much in the way that IP encapsulates other specific protocols. If well designed, such a standard could allow for really substantial growth in storage capacity and still provide backwards compatibility, like reading a 720kB floppy disk in a 1.44MB drive. It could be designed to work on disks, tapes, chips and so on.

    But that, of course, is just wishful thinking...
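    For what it's worth, the IP-style encapsulation idea can be sketched in miniature: a toy container header carrying a magic number, a version, an ID for the inner format, and the payload length. All names and field widths here are made up for illustration, not a real standard:

```python
import struct

MAGIC = b"ARCH"  # hypothetical container signature

def wrap(format_id: int, payload: bytes) -> bytes:
    """Prefix the payload with a self-describing header:
    magic (4 bytes), version (1), inner-format ID (2), length (4)."""
    return MAGIC + struct.pack(">BHI", 1, format_id, len(payload)) + payload

def unwrap(blob: bytes):
    """Parse the header and return (inner-format ID, payload)."""
    assert blob[:4] == MAGIC, "not a container"
    version, fmt, length = struct.unpack(">BHI", blob[4:11])
    return fmt, blob[11:11 + length]

fmt, data = unwrap(wrap(7, b"hello"))
print(fmt, data)  # 7 b'hello'
```

    The point of such a scheme is exactly the one the comment makes: a reader that understands the envelope can at least identify what it is holding, even if the inner format has to be decoded by some later tool.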

  • Magneto-optical is probably one of the most stable storage mediums available.

    It can be rewritten up to 10 million cycles, has a shelf life of 50+ years, and is only affected by magnetic fields if you heat the surface to 300 degrees.

    Another good contender with a higher data density is AME tape technology, such as AIT, Mammoth and VXA-1. AME tapes are good for around 20,000 passes, with an archival life of 30+ years.

    SLR is another good contender for high-capacity data storage, with a shelf life of 20+ years.

    For large scale storage, using a form of hierarchical data management would be the best approach, with MO drives (which have a "mere" capacity of 5+ GB) serving out files that are still accessed on a regular basis, and using large capacity tapes on the backend (such as SLR100 or AIT-2, each boasting 100GB compressed).

    As data warehousing becomes a more important industry, HSM systems will likely integrate automatic migration from media that is reaching the end of its archival lifecycle.
  • Punched cards. The only medium to survive nuclear blast radiation! Hope this helps.
  • The answer is simple. If it is of relevance, and people want to keep it (remember Oliver North ?), then people will keep it. People will judge the value, and take appropriate action. Otherwise, it goes to the bit bucket in the sky, and remains there ......

    Historians are also interested in the stuff that people did not explicitly choose to keep because it may reveal facts that seemed unimportant at the time, but have been forgotten since.

    However, I don't think that means that all our receipts should be converted to more lasting media. We'd drown in our own history.

  • Banks can replace your money if it disappears. They're insured. Storage companies cannot replace your data if it disappears.
  • by Colin Smith (2679) on Monday January 31, 2000 @03:57AM (#1319808)

    The only guaranteed storage mechanism! Good for thousands of years.

    Capacity: 2Kb/tablet

    I/O: 1byte/hr

    Media cost: £50/tablet

    Error rate*: 1 per 100bytes

    Note: Error rate assumes fully qualified and certified stone mason.

  • by Detritus (11846) on Monday January 31, 2000 @07:08AM (#1319809) Homepage
    The tapes in question are 3600 foot, 7-track analog tapes recorded at 15 inches per second. During recovery, the analog experiment data is digitized at 40,000 (16-bit) samples per second. That comes out to about 230 megabytes per track. Not all of the tracks are used for experiment data, some are used for frequency reference, time code and low rate spacecraft PCM data, others are unused. Assuming one track with experiment data, the result is about 250 megabytes per tape.
  • I can't speak for anyone else here, but so far my personal experience has been that Maxell CD-R's are the absolutely worst available out there and Verbatim have so far been the best.

    By saying that I'm referring to how I bought my first CD-R about three years ago, and of the 20 or so Maxell disks that I've archived data onto, only one is still readable by any CD-ROM/CD-R drive that I insert it into. By contrast, every one of the Verbatim disks that I've burned, which were stored in exactly the same environment as the Maxells, is fully readable, and I haven't had any problems with them.

    I've also used a few Sony and Memorex disks with which I haven't had any problems (that I'm aware of), but I have found my Verbatim disks to be incredibly durable. I burned 20 or so audio CDs onto Verbatim disks two years ago before leaving on a cross-country road trip, and despite vast changes of heat and cold, as well as being literally tossed around my car, every one of those CDs is also still working.

    Again, this is just my personal experience, but whenever I see someone picking up a spindle of 50 or so no-name brand disks at a local computer store, I have to wonder how important the data they're putting on there must be...

    --Cycon
  • by rillian (12328) on Monday January 31, 2000 @03:09AM (#1319811) Homepage
    This is a very real problem, but it won't amount to an apocalypse unless we ignore the issue.

    As others have pointed out, the exponential increase in storage capacity makes it relatively easy to "keep buying more disk" and migrate your data all the time. Certainly the convenience of having everything online is nice, too. And everything online should have periodic backups happening. I've managed to do this for the past decade with my data, but I've lost the eight or so years before that, and I miss some of it.

    But there's logical as well as physical bitrot. The media itself deteriorates, making it hard to get the information back, but understanding what that bitstream represents after a few years can be a real problem too. If you've got binary word processor files from an Apple II or C64, you'll probably not be able to read them unless you also have the binary and can get it running in an emulator. Given the amazing progress that's been made in the last 150 years deciphering the records of dead civilizations, I wouldn't say that reading your MS Word 5 documents will be impossible in twenty years, but it might not be worth the effort. Open standards and open source really help a lot with this issue. If you can find a document describing the file format, you're saved. And the same applies to hardware formats. Also, it's much easier to keep open source software alive--essentially carrying the 'make a copy on the new system' approach over to executables.

    I'd say the solution is pretty much that simple: keep track of your data, plan to make a complete copy every 5-10 years, and choose formats that are publicly documented and that (you hope) will be easy for future software to support.
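    The "publicly documented formats" rule can even be mechanized a little. A toy sketch -- the allowlist below is illustrative, not authoritative, and extension checking is only a crude proxy for actual format inspection:

```python
import os

# Hypothetical allowlist of openly documented formats (illustrative only).
OPEN_FORMATS = {".txt", ".html", ".tex", ".csv", ".xml"}

def flag_risky(paths):
    """Return the paths whose extension is not on the open-format
    allowlist -- candidates for conversion before the next migration."""
    return [p for p in paths
            if os.path.splitext(p)[1].lower() not in OPEN_FORMATS]

print(flag_risky(["novel.txt", "draft.doc", "notes.html"]))  # ['draft.doc']
```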
  • by gotan (60103) on Monday January 31, 2000 @06:27AM (#1319812) Homepage
    This approach will work well with anything that is in daily use by a reasonably large group of people. Also, it works best with information already stored in digital form. There is other information worth keeping: historical data, literature, even texts intended only for reading once (adverts, notes, email) may give later generations an insight into present everyday life and hence be worth keeping.

    Many of these texts are not yet broadly available in digital form and are not important or interesting enough for enough people to be kept handy. Try looking for some older book by a not-so-famous author. Even encyclopaedic works are reworked for each new edition, and older bits of information have to make way for newer ones.

    With historical facts it's even worse, in most cases there's at least two versions of one event and who was in the right is mostly determined by who survived. Just have a look how warfare now concentrates on media control or try to imagine the twisted version of history if the nazis had won WWII, even now there are some denying the existence of the holocaust.

    I think all this information is well worth keeping, and since it's difficult to see today what later generations might find worthwhile, the 'evolutionary' approach (if I/we don't want to keep it, later generations won't want it either) doesn't work. And it doesn't suffice to just keep this information somewhere; it has to be kept in an accessible form, on media readable with modern equipment (who will go through the trouble of reading an old magnetic tape?) and indexed (if you have 1GB of unsorted texts/fragments on a hard disk, are you ever going to wade through that to get the piece of information presently of interest?)
  • by dingbat_hp (98241) on Monday January 31, 2000 @05:57AM (#1319813) Homepage

    I disagree almost entirely.

    Very little of the data volume becomes useless, because we don't know what "useless" will be to the readers in the future. Contemporary archaeologists spend much useful time sifting the contents of rubbish pits and latrines - if that turns out to be interesting, how can we ever say that data won't be? Maybe your schoolwork is dull and uninteresting to you, but how about to an educational historian in a century or so? Wouldn't you like to know how teaching was carried out in the past?

    Also, the majority (by volume) of data will always be automatically generated sensor data (humans can't type fast enough to keep up), and that tends not to become useless with time. NASA have already lost interesting telemetry data.

    Authors have definitely lost early book drafts because modern WPs don't open old WP formats. Word 1.0 isn't old! That's not even a decade ago. What about stuff from the '70s on hardware formats that no longer have players? CP/M WP formats used by some of the first great novelists to work digitally? (Mind you, losing the whole of Pournelle is fine by me.) Personally I'd find it very hard to read my own degree work, and I'd probably have to do it by scanning in the paper copies.

    Solutions ? I'm not a hardware guy, so I can only talk about the soft data side of it. I think XML (and similar) has a big part to play here. Let's stop thinking of data formats subjectively as "the data format that belongs to SprongWriter 4.2a" and instead work with formats that have objective definitions that extend beyond the client app of the day. Why should I need a copy of that particular WP to open the data, if the data is already in a format that's inherently accessible. We already have the technical skills and tools for this, I call on all developers to make use of them and to stop writing these proprietary data oubliettes.

    Book recommendation: The Clock of the Long Now, by Stewart Brand. Why this sort of thing matters, and what a few people are trying to do about it. Best book I've read this year.

    PS - SciAm also had a piece on digital data loss, a year or so back.

  • by ctj2 (113870) on Monday January 31, 2000 @06:09AM (#1319814) Homepage

    This is not a new problem. People have been dealing with the question of recovering data from old media for years. As a first data point, a number of years ago, about 5 IIRC, some people finally decided that some old music tapes had to be rescued.

    The method used was to find this old RCA gentleman who had retired more than a few years before then. They then went to the Smithsonian and got the last remaining example of the tape recording/playback device that had been used to make the original master tapes. The RCA guy used the specs and his knowledge to tune the tape deck to perfection. They then put a high quality amp and splitter downstream of the tape deck to feed two digital tape decks (the professional version, not DAT; more bits and a somewhat faster sampling rate) and a couple of analog tape decks as well.

    After testing, they carefully placed one of the master tapes on the deck, started all the recorders and pressed "play". As the master tape played, it just came apart. They had to keep the heads clean, but this was a one-time, one-chance thing. They succeeded.

    From the recordings they made some wonderful CDs. Amazingly enough, the master tape had almost no "hiss" in it.

    Data point two. MIT, I believe it was, decided to move some of their older theses to CD-ROM for easier online access. The first thing they noticed is that many of the data tapes they had stored things on were 7-track tapes, and of course they had no 7-track tape drives any more. Again people went to the museums, got out a 7-track drive, spent the time to fix it and make it work, then built an interface box to connect it all up, and away they went.

    3rd data point. Somebody sent out to a mailing list that they were looking for some old code to run on an emulator for a PDP-11(?). We ended up going into our machine room and found some old release tapes. This included a copy of BRLUNIX (based on a BSD release) and, I think, an AT&T Sixth Edition. These were 9-track reel-to-reel tapes. We went into the machine room, powered up the tape drive, and copied the tapes verbatim to disk. We set it up to do the least amount of reading. These tapes were around 15 or 20 years old.

    Because of this rescue, which happened late last year, we saved the tape drive when the machine was tossed due to "inability to prove Y2K compliance". So the tape drive still sits on the machine room floor. The operators turn it on and clean it once a week. It isn't currently hooked up to anything, but we expect it to be hooked up to something again in the next year or two, just to be able to read all those old tapes we still have.

    At home I use EXABYTE 8200s for my backups. I have 3 drives and you can still get them refurbished. Each tape only holds 2GB (compared to a max of 150MB for a 9-track tape), but the media is small and low cost. The Exabyte encoding also has a great deal of redundancy in it, making it an excellent choice for long term storage.

    At work they do much of their backups on EXABYTE 8500s. For the Crays, they used to use IBM 3480 tape cartridges; when they changed tape formats, they spent a few weeks moving all the data from the older format to the new one.

    Of course, our most reliable storage media to date have been our paper tape and punch cards. They may be low density, and sometimes we've had to make readers for them (auto-feed the cards into a flatbed scanner, process each scan for holes, and voilà), but they have outlasted everything else.

    CD-ROMs have the problem of decaying due to light exposure. If you want to keep them for years and years and years, they have to be kept out of sunlight. And because our long-term, low-cost storage media keep dropping in cost and increasing in capacity, I suspect that what we will find in 3 years is that everybody is carefully copying all their data from CD-ROM to DVDs, which will have a twenty-year life span.

    The basic rules on saving your data for the long term are:

    1. Have more than one copy.
    2. Copy from older media to newer media when density more than triples and the safety/economy matches.
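    Rule 1 is only useful if you can tell your copies are still good. A toy sketch of how I'd keep checksums alongside each copy and verify before a migration (paths and filenames here are made up for illustration; `md5sum` is the standard tool):

```shell
# Two independent "copies" of an archive, stood in for by temp directories.
mkdir -p /tmp/copy1 /tmp/copy2
echo "precious data" > /tmp/copy1/thesis.txt
cp /tmp/copy1/thesis.txt /tmp/copy2/thesis.txt

# Record a checksum when the first copy is written...
(cd /tmp/copy1 && md5sum thesis.txt > /tmp/archive.md5)

# ...then verify the second copy against it before (and after) copying
# it onward to newer media. Any decay shows up as a mismatch.
(cd /tmp/copy2 && md5sum -c /tmp/archive.md5)
```

    Keep the checksum file with each copy; it costs almost nothing and turns "I hope the tape is fine" into a yes/no answer.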

    Chris

  • by luckykaa (134517) on Monday January 31, 2000 @03:27AM (#1319815)
    This suggests that ALL data should be made freely available for archiving. If NASA had made an effort to make sure as many people as possible had copies of that data, then you wouldn't need to do all this transferring. It would have been transferred to newer systems by someone already.

    Apart from MAME, nobody is making any effort to archive old computer games. The BBC managed to destroy a lot of valuable original video tapes (apparently they taped over their copy of the moon landings). These examples show that data is kept around much longer if copying is encouraged rather than discouraged.
  • by Anonymous Coward on Monday January 31, 2000 @07:07AM (#1319816)
    I am an archivist. My job is to sift through data and decide what is worth saving. Generally about 5 percent of collections of modern records are saved. Popular culture is indeed documented to some degree in any historical library and there are several repositories which are dedicated specifically to the preservation of popular culture.

    The filter of decay has served mankind well? How illogical: when you have no idea what has been destroyed, how do you know mankind has been served well? Was mankind well served by the destruction of the Library of Alexandria, the Aztec library destroyed by the Spanish, or the historical libraries destroyed by the Serbs in the Balkans?

    Sure, CDs may last 100 years (we really don't know), but it is unlikely anything will be able to read them by then. Paper is still the most stable format available (although it is impractical for many reasons to transfer digital data to paper, as some of my colleagues are prone to doing), and there are many vast libraries of data open to the public. We had well over 40,000 researchers use our library last year, and less than 1 percent were scholars.

    My profession is wrestling with two technology-related questions.

    1. How to make paper collections accessible electronically. For example, the papers of ONE congressman (approx. 400k documents) took 5 years and nearly 3 million dollars to digitize. We have one collection which has 32M documents. Sure, digital copies are cheap -- IF the original was electronic and in a form easily translated.

    2. How to preserve much of the information which currently only exists in electronic form, be it governmental databases, personal computer files or web pages. We did an interesting experiment a couple of years ago when we captured about six dozen web sites which documented the devastating Red River flood in Minnesota, North Dakota and Canada. Most of these sites existed on the internet for only 2-3 months and were disappearing as we captured them. I think it will be possible to study how the internet was used as a tool in response to catastrophe, from the governmental level down to local churches and organizations. Of course, current copyright law makes it illegal for us to post this database of websites on the internet, but that's another issue.

    Aging Newbie is correct in the assertion that only a small percentage of data need be preserved, yet I feel that conscious, reasoned choices about what should be saved serve mankind far better than the filter of decay. I also believe that the solution ultimately will involve a combination of strategies, including electronic ones.

    Skavvy(whose firewall apparently won't allow him to register)
  • by Anonymous Coward on Monday January 31, 2000 @03:06AM (#1319817)
    WOW, I cannot believe that half of the /. readers are not working on data recovery as we speak. I spent a good couple of months of my life running back and forth across hallways doing tape retrieval, because the machines that were made in the late 70s and early 80s couldn't be replaced. This was made even worse by the fact that half the tapes were corrupted.

    Fact is, we have lost a lot of the Voyager space probe missions. With data centers poorly funded, the race to copy all the data from older 7-track format tape to new media is slow and grueling. 7-track machines are NO LONGER MADE, and the companies outfitting newer tape heads to read the old data are charging way more than the scientific centers can afford. Not only Voyager, but Magellan and so forth... GONE... and going as we speak. As the few machines that can retrieve the data struggle to re-read the tapes literally hundreds of times, trying to recover those last missing bits, tapes yet to be re-archived are falling apart.

    Once the data is stored, what does one DO with half-complete 1970s computer records? There is as yet no "emulator" to read most of this stuff. Fact is, it is gone, and anyone who says this problem isn't going to pop up again has yet to store anything important on a floppy drive.

    bortbox
  • by sql*kitten (1359) on Monday January 31, 2000 @02:48AM (#1319818)
    In his book "Silicon Snake Oil", Clifford Stoll talks of a similar subject. His point was that all our media is essentially perishable and quickly becomes obsolete: for example, there are magnetic tape and punched card formats which can no longer be read, because there are no surviving readers (or if there are, there is nothing to connect them to). His point was that our society would leave little behind in terms of data to be discovered by future archaeologists, and even if we did, they couldn't access it.

    However, I think he was mistaken. Ancient societies left stone tablets, cave paintings and the like behind, and there's no-one who fully understands the languages or the contexts (when an archaeologist says an object is of "ritual significance" he actually means he doesn't know what it's for). We do have the technology now, as the poster says, to migrate our data ever forwards onto new storage, assuming no cataclysm occurs. And even if one does, it is far more important, in terms of recovering data, that the language (source code) survives rather than the CD-ROM drives, Minidisc players, etc. (the binaries), because then data recovery is an essentially straightforward task.

    I expect acid-free paper to survive long enough after an ecological catastrophe or, say, a meteor strike, to be useful to the survivors (better start moving the engineering textbooks down into the bunkers). And of course, Ship-It awards will outlast the end of time, not to mention non-biodegradable shopping bags.

    As a civilisation, if we wish to preserve a legacy, we currently possess the skills and technologies to do so -- if we choose to.

  • by Aging_Newbie (16932) on Monday January 31, 2000 @03:20AM (#1319819)
    We should look at the information we have to save before we decry the methods of saving it. Society's popular culture is preserved poorly if at all, while "everything" worthwhile from all of civilization still fits in a few libraries. The filter of decay has served mankind well so far -- sorting out that which somebody treasured enough to save from the vast ocean of lesser stuff. In this century the Dead Sea Scrolls were discovered, nicely preserved for over a millennium, because somebody thought them worthwhile.

    Stored properly, writable CDs last 100 years or more, while each holds well in excess of an encyclopedia. The problem of preservation is considerably simplified as compared to paper. By 100 years, paper documents are of limited utility and only scholars can access them. With digital media, copies are simple and cheap, so anyone could have a copy if they wanted.

    I think the challenge of the future will be one of sorting the trash; i.e., selecting moon landing data from the mountain of memos, reports, and minutiae surrounding it. But that would seem to have been the problem since history began.

    For all of our ego, I think we might have only a few times more real value to save for posterity than did our counterparts at the turn of the century or in the '50s. People seem comfortable with what we saved in the past -- why not admit that we are really not that much more advanced, and that the real value of our lives and era can be summarized on a few (or a few thousand) CDs a year. Not enough to cause an information apocalypse or anything but a shelf in a library...



  • by David A. Madore (30444) on Monday January 31, 2000 @04:32AM (#1319820) Homepage

    From what I've understood, the lifespan of a CD-R is around 20 years for those based on cyanine or AZO dyes (which appear blue or blue-green when you look at them) and around 100 years for those based on phthalocyanine (which appear golden to the eye).

    Of course, it depends very much on how you treat those CDs. If you put one in a light-free, dust-free safe deposit box, it can probably survive several kyr (uh, thousands of years) without damage.

    The unfortunate thing, however, is that because the error correcting codes work so well, it is not always easy to tell that a CD has begun noticeably deteriorating until the data is actually unreadable, and then it is too late. It would be nice if the drives could return some sort of ``CD quality'' status.

    I always write down (on paper) the md5 fingerprint of the raw ISO image when I burn a CD. That way, I can check later whether I still have pristine data. (And if I make copies, I can be sure the copy is exactly identical to the original.)
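    Something like this (a toy sketch; an ordinary file stands in for the raw image here, and the paths are made up):

```shell
# Fingerprint the raw image at burn time and keep the checksum
# somewhere safe (on paper, or on separate media).
echo "pretend ISO contents" > /tmp/backup.iso
md5sum /tmp/backup.iso > /tmp/backup.iso.md5

# Years later: re-read and compare. On a real disc you would read the
# image back block-for-block first, e.g. something like
#   dd if=/dev/cdrom bs=2048 count=$IMAGE_BLOCKS | md5sum
# reading exactly as many blocks as the original image contained.
md5sum -c /tmp/backup.iso.md5
```

    This sidesteps the problem above: even if the drive's error correction silently papers over early decay, a changed fingerprint tells you it's time to make a fresh copy.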

    This information is provided in the hope that it will be useful but WITHOUT ANY WARRANTY. Without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

  • by Jish (80046) on Monday January 31, 2000 @03:14AM (#1319821)
    This site claims 70-200 years... Anybody else have other evidence?

    Yamaha CD-R site [yamahayst.com]

    Josh

  • by dattaway (3088) on Monday January 31, 2000 @04:19AM (#1319822) Homepage Journal
    One could always do what Linus did for backing up his work --sharing it with the world. I heard he didn't have a tape drive for many years until he was given an Alpha [lwn.net], but his work could always be found somewhere on the internet in good hands.

    The internet will always save your best work [google.com] and discard the junk [waldherr.org].
  • by Detritus (11846) on Monday January 31, 2000 @06:18AM (#1319823) Homepage
    I worked on a data recovery project [nasa.gov] for NASA/GSFC. The spacecraft data was originally recorded on analog 7-track 1/2" instrumentation recorders at ground stations all over the world. There were 100,000 tapes stored in the Public Archives of Canada. The tapes were deteriorating and destined for a landfill. It costs a substantial amount of money, every year, to store that many tapes in a climate controlled facility. That was just the data from one family of spacecraft (Alouette and ISIS).

    Recovering the data from just a portion of the tapes requires substantial amounts of time and money due to the labor intensive nature of the task. Think of copying 20,000 LP records to CD-R disks.

    With limited budgets, NASA and other scientific research agencies are often in the unhappy position of having huge amounts of potentially valuable data on rapidly deteriorating media, of which only a fraction can be saved. Unless someone invents a time machine, the data is irreplaceable.

    For many years, magnetic tape has been the medium of choice for storing spacecraft data. Storing it on an on-line system, on disk, just wasn't practical or affordable. Huge amounts of data were archived on 7-track 1/2" digital computer tapes, the same kind of tapes that you see in cheesy science fiction movies from the 1960s. Try to find one of those tape drives today, or a computer that can talk to it.

  • by rillian (12328) on Monday January 31, 2000 @03:26AM (#1319824) Homepage
    As someone who just loves books... most are not printed on acid-free paper anymore, and a huge number of them are going to be lost within the next 10 to 30 years.

    I'm sorry to hear that. I've been fascinated by this phenomenon in our university library. Up until the 1930's somewhere, journals are pretty well preserved. Then they suddenly get awful as paper mills switched to new methods. Pages are yellowed and brittle. In the 1950's the error was discovered and pages become white again with the switch back to acid-free paper.

    Let's hope we don't make the same mistake with digital media. And it could be worse: almost all the film from the first half of the century is lost to self-rot and environmental damage. For all its faults, DVD is probably the best thing that's ever happened to film from a historical perspective.
  • by hernick (63550) on Monday January 31, 2000 @02:54AM (#1319825)
    What I've noticed is that most of the data we're accumulating is quickly becoming useless. 10 year old schoolwork isn't something so worthy of archiving. The data you really want to keep shouldn't be very large anyway...

    Modern word processors still open really old file formats like Windows .WRI and Word 1.0, and I don't see that changing in the near future. The filters will probably stay, but become optional. If you want to future-proof your documents, run a mass conversion utility on them and convert them to a more "standard" format than Word or WordPerfect -- say, pure ASCII, HTML or RTF. Sure, you're going to lose formatting, but if these are documents you're not likely to ever use again, yet there's a slight chance you will, then losing formatting isn't important. If you need the content again, you shouldn't mind too much having to redo the formatting...

    Floppy disks are degrading rapidly, but most people's floppy collection can fit on a single CD-R. Then again, most people just don't care about their floppy collection, and will just let it die. The data contained on it isn't useful anymore.

    Let's see about audio CDs. They degrade over time (scratches) and possibly rot. I believe that what will happen is that we're going to convert them to some format like MP3. I'm fairly certain that MP3 capability will continue to be implemented in computers for a very long time. And if it shows signs of getting phased out, then you might simply batch-convert everything to the new format. Or just re-rip your audio CDs that are sitting in storage, if you really care about the quality (since batch conversion between lossy formats will result in degradation, unless we find a way to actually enhance the audio quality... which might or might not happen...)

    Movies. VHS tapes degrade... Probably, we'll be converting what we really want onto some kind of optical disc in the future, and the rest will decay, and we won't care about it decaying. When that format (DVD-R perhaps?) is being phased out, since it's digital, it should be quite easy to simply transfer our DVD-Rs to a higher-capacity medium... perhaps 10 discs onto a single one... saving a lot of space and letting the format live another 20 years. After all, how hard will it be to include MPEG-2 decompression in next-generation video players? The cost of an MPEG-2 decoding circuit probably won't be very high anymore.

    The other possibility I see is that bandwidth gets cheap enough that we may consider remote storage vaults. That has a couple of privacy issues I'm certain you can see... But it's incredibly convenient, and will probably be adopted by everyone if we just find a way to get a high-speed switched pipe into everybody's home at a reasonable cost...

    If we do indeed have high bandwidth in every house, I can see the media companies also getting their acts together and putting up their own gigantic media archives. They could offer a monthly media license that'd give you access to any music or movie you want. Or perhaps just make you pay for every access to the archive. Of course, such a thing... I can think of so many ways it could go wrong. What if they decide to have only censored material in the archive? What about independent artists? Perhaps we'll just see a protocol to access and pay for access to media archives, and have a dozen of them appear. Let's say, DisnABCTimeAOL could have theirs, AndoTransmeVAMicrosoChryslerDaimler could have theirs...

    This could be so horrible if not properly done -- a lot of "non-approved" content could suddenly become unavailable if you killed off every distribution channel except those media archives... So. Is this just an incoherent rant? Would you care to add any constructive comment to it? Answers? Questions? Anything at all.
  • by jw3 (99683) on Monday January 31, 2000 @03:20AM (#1319826) Homepage
    The books of my youth -- those were books by Stanisław Lem, the Polish SF writer (he's also #1 in Germany, and quite well known in the States, AFAIK). He described an infopocalypse for the first time in a book entitled "A diary found in a bath" -- a book written in the early sixties. The disaster doesn't play an important role in the story itself; it is only mentioned in the "introduction" -- written by an editor somewhere in the far future, a representative of another civilization which arose on Earth after the fall of ours, a fall caused by a virus eating... paper.

    In many later books Lem refers to an informatic catastrophe. Sometimes it is caused by a necro-virus, a product of computer evolution (the arms race was banned from Earth and transported to the Moon, where sophisticated computer systems worked automatically on weapons development; each nation was allowed to bring its weapons back to Earth, but that meant the others could prepare equally; somehow the automata on the Moon got out of control and started evolving, finally leading to a nanobot virus thriving on silicon chips -- hence the title, "Peace on Earth"). Sometimes it is caused by basic physical properties: in the humorous story "Prof. A. Donda" the title hero discovers a basic equivalence between energy, mass *and* information, and one of the consequences is that if information reaches a certain density, it changes into matter -- that is, a new universe. God's word was counting from infinity to zero in an infinitely small time :-) .

    I admit it -- I was shaped by Lem's writing. Many of his ideas from the sixties and seventies came to life in the nineties (e.g. virtual reality, or sciences which deal only with information retrieval). I do believe that information storage is a problem -- but not because the medium won't last forever; rather, because of the signal/noise ratio you have even in your personal files. As I look at the four Macs we work with in our lab, and the couple of gigabytes of data, and then dozens of GB of backups, different versions, obsolete versions, alternate versions, gel pictures you have no idea where they came from and who needs them, and so on, and so on... Yes, there are better solutions than using a Macintosh in a multiuser environment, but that's not the point. I've been using Linux for years and have my personal data at home, and I seem to have a GB or so of data I'm too afraid to remove, just in case. And there are so many alternatives for storage, backup, databases... and I'm just a simple biologist!

    Returning to Lem -- yes, I do believe we are approaching a critical point, like a bifurcation in a chaotic equation, and the word "chaotic" fits especially well here. What happens next? He who cometh and giveth us a system (not an OS, but an information retrieval system), he hath the power and our souls. Well, mine at least. Hope he doesn't come from Redmond, though.

    Regards,

    January
