Slashdot Log In
On Preservation of Digital Information
from the keeping-things-for-future-generations dept.
Preservation of Digital Information
Recently there was an Ask Slashdot about the the problem of preserving digital material. The basic idea was that we are creating a massive wealth of digital information, but have no clear plan for preserving it. What happens to all of those poems I write when I try to access them for my grandkids? What about the pictures of my kids I took with that digital camera? Can I still get to them in time to embarrass them in the future?
Obsolescence of digital media can happen in three different ways:
- Media Decay: Even when magnetic media are kept in dry conditions, away from sunlight and pollution, and hardly ever accesses they will still decay. Electrons will wander over the substrate of the media, causing digital information to become lost. CD-ROMs luckily do not have this same problem with electron loss. They still are sensitive to sunlight and pollution though. Many people mentioned last week that distributors of blank CD media often make claims of an hundred years or more for the duration of their products. Research seems to indicate the truth is closer to 25 years,which seems like a long time, until you consider the factors below. Besides, information professionals often think in terms of centuries rather than decades.
- Hardware obsolescence: Far more dangerous than the degradation of the actual information container is the loss of machines that can read it. For instance, the Inter-University Consortium of Political and Social Research received a bunch of data on old punch cards. The problem was they had no punch card reader. It took a decent chunk of time, and a good deal of money to eventually be able to read the data off of these cards, even requiring some old technicians to come out of retirement to help tweak the system. Hardware extinction is hardly a foreign topic to Slashdotters. It happens, and as technology increases its pace of change, it will happen more quickly.
- Software obsolescence: The real stone in the shoe of digital preservation is obsolescence of the software needed to open the digital document. This can include drivers, OSS, or plain old application software. We all have piles of old software that were written for older systems, or come across an old file the bottom of a drawer where we can't even remember what application it used.
There are several strategies for preserving digital information. People mentioned some last week:
- Transmogrification: printing the digital document into an analog form and preserving the analog copy. An example would be printing out a Web page and archiving the print of that Web page. This, obviously, takes out the main strength of a Web document, hyperactivity, and may also ignore important color and graphical content. An alternative form of this is the creation of hardcopy binary that could later be data entered into the computers of the future. The media suggested have ranged from acid free paper to stainless steel disks etched with the binary code. The two major problems with this idea are that any misrepresentation of the binary could have disastrous results for the renewal of the document, and transformation to hard copy limits the functionality of many types of digital documents to the point of uselessness.
- Hardware museums: preserving the necessary technology needed to run the outdated software. There are several weaknesses to this plan. Even hardware that is carefully maintained breaks and becomes un-usable. In addition, there is no clear established agency that will be responsible for maintaining these machines. Spare parts eventually become impossible to find and legacy skills are required for maintenance. There must be technicians with the requisite skills to service these preserved machines. Finally, it does not create efficient use if all possible future users must bottleneck to just a handful of viewing sites to have access to the information.
- Standards: reliance on industry-wide standardization of formats to prevent obsolescence. Market place pressures for software produces create an incentive for a company to differentiate their product from their competitors. While unrealistic in a capitalistic marketplace, standards such as SGML have proven successful for large scale digital document repositories, like the Making of America archive hosted by the University of Michigan. However, many of these large repositories also receive information from donors that is not in a standardized format, and do not feel comfortable turning away those documents.
- Refreshing: moving a digital object from one medium to another. For instance, transferring information on a floppy disk to a CD-ROM. This definitely seemed to be the preferred method of most Slashdotters. While this takes care of degradation and obsolescence of the media, it does not solve the problem of software obsolescence. A perfectly readable copy of a digital document is useless if there is not software program available to translate it into human-readable form.
- Migration: moving the digital document into newer formats. An example might be taking a Word 95 document and saving it as a Word 97 document. Single generation leaps are usually not a problem, so large volumes of information could be saved. Unfortunately, migrations over several generations are often impossible, as is migrating from a document type that was abandoned, and did not evolve. Also, information loss is common in migration, and may cause the document to become unreadable. While this may be the best single method available, it is very labor intensive, and some knowledge of the nature of documents would be essential to determining which information containers to migrate. For instance, often you lose aspects of a document (good and bad) when you migrate it, but which of those aspects are important?
- Emulation: creating a program that will fake the original behavior of the environment in which the digital object resided. This is another very intriguing method that could be used. It's actually already pretty common. For instance, most processor chips include emulators for lower level processors. There also aleady exists on the Internet a very active group of people who are interested in emulating old computer platforms. Still, we need to do a lot of research yet on the cost of this method, and what sorts of metadata are necessary to bundle with the digital object to facilitate its eventual emulation. Another problem is the intellectual property hassle caused by emulation. Reverse engineering is a big no no, and there is no point in making the lawyers rich. This area is actually where Open Source can be of biggest help to preserving the longevity of different kinds of applications.
Many people in the discussion last week seemed to believe that simple refreshment or migration of the data would be a sufficient answer to the problem. At a personal level that may be true, but for anyone responsible for large amounts of digital information, neither is a completely convincing method. Here are a couple of reasons why:
- Not all documents are the same- In the digital preservation literature, most people talk as if all digital information is in ASCII format. Au contraire. As computing becomes increasingly robust, so do the documents we create. Multimedia games, three dimensional engineering models, recorded speeches, linked spreadsheets, virtual museum exhibits and a host of other documents spurred by the development of the Web have cropped up. How are they going to be affected by migration to a new environment?
- It's so darned expensive- It's a little gauche to talk about, but the Y2K bug caused what ended up being a huge migration of digital information. How much did the US alone spend on that fiasco? $8 billion? For smaller organization who do not prepare for the preservation of their digital information, the cost of emergency migrations could cause all sorts of budget trouble.
There is some belief that there is no reason to preserve information at all. Most of what is created is just tripe anyway, and we should be more focused on creating content than preserving it. There are two reasons why some sort of preservation is important. First of all, it is inefficient to recreate information that already exists. Human energy is better spent on building upon existing knowledge to create new wisdom. How much do we already spin our wheels as several people collect the same data? What more could we be doing if we spent the energy instead on new pursuits? Secondly, there is some data that is irreplacable.
Which is not to say that we should keep everything. In a traditional archive, only 1% of documents received are kept. Ninety nine out of one hundred documents are destroyed for various reasons. A similar ratio is not unreasonable for digital documents. Consider that 16 billion email messages are sent each day. It seems ridiculous to keep all of them, but how do we weed out the ones we do want to keep? Appraisal of digital documents for archival purposes is going to become a major issue in the not distant future. There are already examples of data that have been lost, or nearly lost. NASA lost a ton of data off of decayed tapes. The U.S. Census nearly lost the majority of the data from the 1960 census. These huge datasets are important for establishing a scientific record that reveals longitudinal effects.
Increasingly, the record of the human experience is kept in a digital format. The act of preserving that information is the act of creating the future's past, the literal reshaping of our world in the eyes of the future. Nobody knows the best answer yet. There is probably not a single answer that will fit absolutely all situations. Information professionals are just beginning to do research in the form of user testing, cost-benefit analysis and modeling to answer some of the thornier issues raised by the preservation of digital information. There are things out there worth saving, we just need to figure out the best way to do it.
Some links of interest in case you would like to read more:
- a really good bibliography of related sources by Michael Day
- an article by Jeffrey Rothenberg outlining some of the issues
- a site at Leeds University with many related links
those AOL CD's (Score:4)
more info (Score:3)
--
BBC article (Score:3)
magnetic storage (Score:4)
The solution would be to use an optical storage media, but as others have pointed out, CDR storage has a life expectancy of 75-100 years depending on the brand. Which wouldn't be too bad except you have to realize that in 100 years you need to start putting resources into copying all that data off and re-writing it again. After awhile you'll have a snowball effect where you spend more time writing the old data than the new!
What we really need is a piece of technology that doesn't age - an entirely self-contained computer (nuclear powered, maybe?) that has the media, the reading/writing mechanisms and has several failsafe mechanisms to alert you well before any data is lost. Think of it as a computer time capsule - you bury it and in 500 years come back and it has all the human interface necessary to reproduce the data in a usable format. Of course, you'll still need someone who reads English then..
agh, the problems, the problems....
A great challenge (Score:3)
Unless we take steps to archive, transcribe and preserve all this information (yes, grits, petrification et al) then we are in effect building a new Library of Alexandria.
It would be the greatest loss ever for archaeologists of the future to be unable to access archives of the WWW. Every day is a unique snapshot of the world as the endless churning of webpage updates/dead link removals changes the WorldWideWeb.
This information Ocean is something unique. Archiving such a huge store of information generates a challenge in itself.
I don't often wax lyrical about the internet but it is in effect becoming a snapshot of our civilisation.
What a loss for future generations if they cannot see the views of ordinary human beings (through the endless websites) preserved.
Old issue (at least in sci-fi) (Score:3)
Data Decay, Readability, and ASCII text. (Score:3)
When you look back at history, and you look back at documents that are a "mere" thousand years old, the wealth of information in these documents makes you wonder what could be found if all the documents from that time had survived. Just because the format is digital, rather than analog or (eek!) paper, does not mean that this media is impervious to decay.
However, I think that decay is much, much more serious in digital media. The root of the problem is that if you are looking at physical document with water damage, even though the original "packets" of information (letters and words) are damaged, the human brain can sometimes extract meaning from smearing ink and crumbling paper. When an electron wanders on magnetic media or when a CD begins to decompose, that bit is lost forever. Digital media is much more sucepitble to lapsing into unintelligibility than physical media like paper.
Preservation in a media that will not become obselete is the key. As mundane as it may sound, plain ASCII text will probably never become obselete because there is no real reason to come up with a new standard. Some people may scream at me: "*ML! *ML!", but at the rate that these things will obescelece, plain text will still be around when XSGHTML has been long dead.
Just a thought. If you have something to add, feel free to respond.
Brandon Nuttall, the inquisitor of Reinke
An excellent summary of the problem (Score:4)
Personally, I encountered the issue of software obsolescence well over a decade ago. I migrated my resume to TeX because it had already been through four other formats and I no longer had access to the tools to read them. I picked TeX because I firmly believed that a tool that I had the source for was likely to continue to be useful to me for a longer period. And the source for the document is ASCII text, which I was able to convert to HTML a couple of years ago with little trouble. I will not rely on the future availability of any tool that I have no control over.
This is one of the reasons that The Unix Philosophy, a fine book, recommends text formats for data. You can manipulate it with a wide variety of tools including text editors. It is unlikely that we will abandon those completely in our lifetimes. It also suggests, if memory serves, keeping notes online in text form. They are more portable and more accessible that way.
One worthwhile source of literature preserved as plain text files is Project Gutenburg [promo.net]. It is probably also the oldest such project around. It is to text in some senses what Free Software is to code. Although they aren't doing collaborative authoring projects, they are collaborating on getting old books whose copyrights have expired into electronic form. If you haven't ever visited their site, take a look.
It's not just digital magnetic information... (Score:3)
In this case, the main problem is not bit-rot (although this will occur sooner or later) but rather problems with not recalling the information for an extended period of time. For example:-
- Reels of tape start to inprint signals to adjacent tape causing loud passages to have ghost versions either before or after them.
- Tape actually becoming stuck to itself due to using bad binding materials leading to baking [audio-restoration.com] of tape as desperate restorative measures.
These and other issues are discussed on www.audio-restoration.com [audio-restoration.com]. Does anybody know if there are similar problems associated with digital media (the cross-talk problem will be virtually negligible due to noise-floor issues being irrelevant)? If so then it makes archiving a much more difficult thing if you have to physically do something to the archives every couple of years (especially with the exponential growth rate of information generation).Re:Inversely proportional? (Score:3)
CD's and related optical media do have problems with sunlight, but you have to remember that they where created (AFAIK) by the audio industry which is one of the most notoriously fickle industries in the world: they want you to buy a new CD from a new group every week, not have a single CD that is perfect and that lasts forever. I think that the concept of people being able to listen to their CD's for 10 years is already far too long for them.
The problem is that there doesn't really appear to be anyone making storage media that is optimised for long-term persistent storage. But do you think that such a format would be the way forward? Each year, we generate an exponentially larger amount of information. All the hard disks on the planet now would not be enough to store the new information that will be generated in the next 5 (wild guess) years. Therefore we are going to need progressively larger and efficient forms of data storage as the information bloat gets larger. As new formats come out, the important thing is to look at the movement of legacy data onto the new formats. If data is not treated as a static thing to be boxed up and forgotten, but rather as part of the on-going current set of information and transferred onto new technologies as they are developed then you will not have the situation where people are looking at a hard disk in 50 years time and going "what's an IDE interface?".
Of course, then you have the 'minor' issue of application file formats...
So just what *is* the life of a CDR? My results. (Score:4)
I put some CDRs out in the direct sun hede in the Las Vegas desert ofer the last summer. Blue, gold, green, pale green, and an RW. Both sides of the CDs had their chance to roast in the 100F+ (40C+) degree sun for several months each. And here's the results of attempting to read the data back on each type:
Old TDK green CDR: dead, nothing readable. Faded to a mostly clear plastic disc!
Ricoh gold/gold CDR: dead, nothing readable. The golds faded visibly first of them all. Area where data was stored faded to clear!
Verbatim (blue): I was stunned. I read back a full and complete iso image of Red Hat 4.2. No fading at all.
Ricoh gold/gold CDR: dead, nothing readable. The golds faded visibly first of them all. Area where data was stored faded to clear!
Memorex silver/green CDR: mostly dead, some files readable. Faded in a few isolated patchy blotches.
The CDRW... just started this test. No results yet. Looks OK, though.
Overall, I'd say the blue CDRs are the best choice for long term data storage.
Re:Is this likely? (Score:4)
BTW, I think the original author missed one future problem - encrypted information. I foresee hardware-based encryption becoming almost ubiquitous so that most data is encrypted. If encryption becomes universal, then much info will be encrypted that really wasn't burn-before-reading secret. What happens to all that information - of potential interest to historians looking back on the 21st century - under those conditions?
Somewhat of a paradox (Score:3)
Point 1:
With the amount of data that we produce, archiving it will take an increasing amount of time. How much new content is created daily? At best, we will plateu in a state where as much effort is required to archive content as is needed to create new content.
With the emphasis placed squarely on non-duplication of effort, archiving becomes a secondary issue. Indexing, searching, sorting and categorizing of the archive becomes a first priority, since creative efforts should now check if they are redundant.
If the bold statement is to be a guideline, than the idea of an archive is moot, since all new work depends on old work, and so tracks well with where the author feels human effor should go. Much like with biological evolution, new data is the fittest of the old data that was applicable to the new context. I suppose that the call for archives is little more than a suggestion that we need an organized and deliberate fossil record of how we got to where we will be at some point in the future.
What is needed is an archive, yes, but an archive of what? Not of content, but of the essence of the content. The lessons learned, the conclusions drawn and the optimizations realized in the process of creating the content. The content is fleeting - though arguably of inherent value... Which brings us to...
Point 2:
Yes, some things are irreplacable. Who decides? Who defines what is art, what is fact, and what deserves eternal life?
Some things are of immediate and significant value, but for an unknown duration. The value of other things can not be realized for a very long time, and so the alternative is to store everything. Further more, the value of certain data is totally subjective, and this begs the question of "who's in charge" of defining that 1% that is to be kept.
On the small scale, this will lead to vanity. Any 'artist' will consider their work a masterpiece, and save it. (I have code I wrote in CS101, don't you?) Companies will store and archive all email, all financials, anything that can potentially be used to mine data or identify trends or fertalize litigation. People will pigeon-hole videos of their baby's first steps, though nobody outside themselves really cares - unless the child grows up to be the next Einstein, or Hitler.
"Hitler" raises an interesting question on the larger scale. Who has the responsibility of deciding what 'big' facts to store? And isn't that the path to propaganda, history-making, and such things?
And then, when the leadership changes, and the 'book burning' starts...
To bring the concept down from the paranoid-sphere, let's recall the
Same issue with Newton and Leibnitz. Leibnitz was the German Mathematician who beat Newton to the concepts of Calculus. Newton, a member of the Royal Academy of Sciences (or something to that effect) politicised HIS influence, and so was credited with all of the work - where his contribution was not complete.
Some things are not outright lies, but oral histories get lost while written records persist.
Who gets to choose what to write down?
A paranoid addition... (Score:4)
One thing I believe was missed in the original article is intentional change to the historical record. In addition to having to store old information, and worry about how we're going to get to it later, I think we need to pay at least half a though to intentional modification of the historical record.
With paper and ink, it's rather time consuming and expensive to alter historical documents, even assuming you can get near them. With digital media, the situation may be different - it may become very simple to alter historical documents, especially if you're the guy who's in charge of copying them to the newest form of media.
Aside from the obvious political reasons someone might want to do this (can you think of a fundamentalist movement of any sort that wouldn't modify old documents to read they way they would like, given the chance?), I can also see where money might come into play.
For instance, suppose MassiveDrugCo, Inc. is introducing a new drug which prevents newly detected disease Y. Now, in order to sell a lot of this drug, you have show that Y harms enough people to worry about. Unfortuately, the historical record being used for retrospective studies doesn't show that. So, instead of going back to the drawing board and finding something else to cure, MassiveDrugCo instead feeds a modified copy of the historical data to unsuspecting independant researchers. These honest and unbribable researchers draw the conclusion desired by MassiveDrugCo - in spite of the reality of the situation.
Civilization Bootstrapping (Score:4)
Are you trying to preserve episodes of the Simpsons so our relatively near term, technologically advanced descendants can watch them? Well, they're technologically more advanced and thus more clever than we; we just need to have suficiently stable media (micromachined gold plates would work nicely) and a either a simple minded encoding scheme or an easily readable description of the algorithm prepended. In the 22nd century, some bright Norwegian 16 year old armed with a yottaflop coputer will figure out how to read it if he cares enough.
A bigger concern (in my opinion) is what happens when our civilization collapses. Historically, it is almost certain happen sooner or later. Rome lasted well over a thousand years; if you told a 1st century CE roman that there would ever be an end to the empire he'd think you were crazy. Yet our civilization is in many ways much more fragile because the information it is based on is in much more ephemeral form (both media and format).
What we need is to devise a bootstrap procedure.
(1)Reading primers in various languages.
(2) Primers on basic technology: mathematics, simple mechanics, mining and elementary metallurgy.
These should be in highly durable form, but the problem is that you don't want people making off with them for building materials. The problem with using gold plates is that you don't want people to have access to them until the information on them is more valuable than the substrate. Perhaps these first items could be carved onto stone pillars inconveniently large to move.
Next, you need repositories on more advanced science and technology: chemical engineering, electronics and so forth. Perhaps you could rig a way to prevent savages from accessing these repositories; a mechanical puzzle perhaps, that requires a certain mathematical sophistication to solve. The most critical records could be kept in forms that could readily be read without mechanical assitance or with only simple mechanical assistance such as optical magnification (my local librarian likes micofilm, because she knows it will be readable for decades). Less critical things like old Simpsons episodes could be on very cryptic media that would require considerable technical finesse to read, but would be cheap to transfer to.
Pretty much, as you go from the most basic and critical information to the least critical information, you go from the easiest to read and most expensive to produce per bit, to the hardest to read and most convenient to produce.
Re:those AOL CD's (Score:3)
Actually no. Paper has proven to be one of our most durable ways of storing data. Egyptian papyrus from 3000 years ago is still more or less intact. CD's on the otherhand, will last for a 100 years in a "BEST CASE" scenario. Most will last much less time. CDRs might last 25 years. There are other variables besides media itself. I've seen several CD's from the early 80's that refuse to play these days. They're not at all scratched, but the theory goes that the original ink they used to print on them actually was a bit corrosive over a great span of time.
In order to remain readable, digital data must remain more or less intact. A few missing bits in the application needed to open a file can pretty much reduce your odds of opening that file by 100%. Analog data, on the other hand, degrades much more gracefully... It may start to fade, but there's no intermediary between having the data and being able to read it (you don't need an extra "application" to read a newspaper).
I've heard that this is actually going to be one of the least documented periods in human history, because all of our data is stored digitally and periodically purged. Even if it's not, places like NASA are generating data faster than they're able to back it up and move their old archives onto newer media.
"Proprietary Formats" are still a problem (Score:3)
The way that the DVD Fourm [dvdforum.com] (formerly known as the DVD Consortium, with oversees the DVDCCA... this is the group of companies that cross-license each other's patents and shares information regarding DVD development) currenly requires you to sign a non-disclosure agreement (NDA) to obtain the specifications, and that NDA also prohibits you from even discussing the specifications with anybody unless they have also signed the same NDA. Since this is covered under the trade secret laws, this particular bit of intellectual property is theirs theoretically forever. At least until you can hire a bunch of lawyers to demonstrate that a DVD is no longer a trade secret.
I've also set up a seperate mailing list from the main Linux Video group that is in the process of developing an Open Video Disc [mailto] specification which is trying to allow people to develop products without having to pay royalties or deal with patent infringments. Fees for most of the current video formats range from over $10,000 (for the DVD specs.... license fees are on top of that) to the MPEG Licensing Authority [mpegla.com] who is being quite reasonable for most close-source projects, but if you read the details of what you must do to license a product, is contrary to the nature of most open-source projects. It is still possible to write a GPL'ed MPEG player, but it would only be free as in speech and not free as in beer. In fact, you would probabally have to charge somebody to download the software. Shareware MPEG players are probabally skating on some very thin ice legally, and certainly part of the registration costs would have to go to the MPEGLA.
One of the things that is so nice about HTML is the fact that this standard is open, patent and royalty free. If CERN had tried to put a patent on HTML I doubt that the web would have developed nearly so quickly. Or rather imagine if Apple's hypercard system had been developed with the GPL and file formats were made open for anybody on any platform to use.
One of the things that I believe is killing the Unicode character encoding is that all kinds of intellectual property restrictions are placed on it, and you need to pay royalties to develop much software that uses it. Again, think what would have happened with ASCII had it been kept closed up, and why EBDIC isn't being used for character encoding.
More importantly, open and free specifications are critical to data preservation, and a point that really hasn't been brought up by Calc (the author of the original post on
A modest proposal (Score:3)
I suspect the problem of file formats is less serious than people make it out to be. A well-documented format should be reconstructable indefinitely. Few software companies don't document their file formats. Even without documentation, it ought to be easier than reconstructing dead languages. We learned to read Egyptian hieroglyphs primarily from one attested translation and a lot of careful deduction. Given a thousand Word 6 documents, I think a good computer archeologist ought to be able to construct a program to open and edit them.
Museums of old hardware, and perhaps some sort of custom computer factor to make ancient hardware strikes me as a good idea. It could be like blacksmiths at SCA festivals, "Ye Olde ASIC Mill."
The real problem strikes as the one most heavily emphasised in the article: decaying media. I suspect the best solution with presently forseeable technology would be to preserve data in crystalised DNA. Even in nature, DNA takes centuries to decay, and if it were crystalised and kept somewhere cool and dry, it would likely last for millenia. Encoding a document onto a billion strands of DNA weighs basically nothing and it would be a very highly redundant storage system.
It isn't easy to do right now, but I suspect that technology is right around the corner and probably only requires a little bit of research money to become practical.
Re:NASA problem too (Score:4)
Problem is, it's entirely possible for us to not understand the importance of a data collection for years. That old Landsat data would be a great baseline for information about global climate change.
Re:those AOL CD's (Score:3)
I read an article once in that hotbed of liberal thinking, readers digest, about book deterioration. Older books were printed with a different method, and will last a couple hundred years. Newer books will only last maybe 50 years.
This begs the question : how long will computer printouts last?
Re:A paranoid addition... (Score:3)
I believe it was King Tutankamun's father, Akenahten, who threw his world into a tizzy by rejecting the established religion and invented a new one that worshipped the sun. He went off a built a new city to go along with it too.
Well, the bureacracy of the day didn't like this at all because it messed with their job security. And as soon as he was dead, they went around hacking his face off anywhere it appeared (of course we're talking about monuments, etc., made from stone) and I believe they went after any mention of him in text (hieroglyphs) too.
And they almost got away with it and just about completely expunged his existence from their records. But they missed a few things and we've been able to piece together a little bit about him.
So anyhow, there's my Discovery channel understanding of that little story. What it means in relation to this subject I'm not quite sure. I thought it was a good idea to point out that this is certainly not a new issue.