News for nerds, stuff that matters

On Preservation of Digital Information

Posted by Hemos on Thu Feb 24, 2000 10:51 AM
from the keeping-things-for-future-generations dept.
Cacl, a PhD student at the University of Michigan in their School of Information Division, has written a feature addressing the concerns and problems of preserving digital information. This is an area of study of his, and it makes for interesting reading.

Preservation of Digital Information

Recently there was an Ask Slashdot about the problem of preserving digital material. The basic idea was that we are creating a massive wealth of digital information, but have no clear plan for preserving it. What happens to all of those poems I write when I try to access them for my grandkids? What about the pictures of my kids I took with that digital camera? Can I still get to them in time to embarrass them in the future?

Obsolescence of digital media can happen in three different ways:

  • Media Decay: Even when magnetic media are kept in dry conditions, away from sunlight and pollution, and hardly ever accessed, they will still decay. Electrons will wander over the substrate of the media, causing digital information to become lost. CD-ROMs luckily do not have this same problem with electron loss. They are still sensitive to sunlight and pollution, though. Many people mentioned last week that distributors of blank CD media often claim a lifetime of a hundred years or more for their products. Research seems to indicate the truth is closer to 25 years, which seems like a long time until you consider the factors below. Besides, information professionals often think in terms of centuries rather than decades.
  • Hardware obsolescence: Far more dangerous than the degradation of the actual information container is the loss of machines that can read it. For instance, the Inter-university Consortium for Political and Social Research received a bunch of data on old punch cards. The problem was they had no punch card reader. It took a decent chunk of time and a good deal of money to eventually be able to read the data off of these cards, even requiring some old technicians to come out of retirement to help tweak the system. Hardware extinction is hardly a foreign topic to Slashdotters. It happens, and as technology increases its pace of change, it will happen more quickly.
  • Software obsolescence: The real stone in the shoe of digital preservation is obsolescence of the software needed to open the digital document. This can include drivers, operating systems, or plain old application software. We all have piles of old software that was written for older systems, or have come across an old file at the bottom of a drawer and can't even remember what application it used.

There are several strategies for preserving digital information. People mentioned some last week:

  • Transmogrification: printing the digital document into an analog form and preserving the analog copy. An example would be printing out a Web page and archiving the printout. This, obviously, removes the main strength of a Web document, its hyperlinking, and may also lose important color and graphical content. An alternative form of this is the creation of hardcopy binary that could later be keyed back into the computers of the future. The media suggested have ranged from acid-free paper to stainless steel disks etched with the binary code. The two major problems with this idea are that any misrepresentation of the binary could have disastrous results for the renewal of the document, and that transformation to hard copy limits the functionality of many types of digital documents to the point of uselessness.
  • Hardware museums: preserving the necessary technology needed to run the outdated software. There are several weaknesses to this plan. Even hardware that is carefully maintained breaks and becomes unusable. In addition, there is no clearly established agency that will be responsible for maintaining these machines. Spare parts eventually become impossible to find, and maintenance requires legacy skills: there must be technicians able to service these preserved machines. Finally, access is inefficient if all possible future users must funnel through just a handful of viewing sites to reach the information.
  • Standards: reliance on industry-wide standardization of formats to prevent obsolescence. Marketplace pressures on software producers create an incentive for each company to differentiate its product from its competitors'. While universal standardization is unrealistic in a capitalistic marketplace, standards such as SGML have proven successful for large scale digital document repositories, like the Making of America archive hosted by the University of Michigan. However, many of these large repositories also receive information from donors that is not in a standardized format, and do not feel comfortable turning away those documents.
  • Refreshing: moving a digital object from one medium to another. For instance, transferring information on a floppy disk to a CD-ROM. This definitely seemed to be the preferred method of most Slashdotters. While this takes care of degradation and obsolescence of the media, it does not solve the problem of software obsolescence. A perfectly readable copy of a digital document is useless if no software program is available to translate it into human-readable form.
  • Migration: moving the digital document into newer formats. An example might be taking a Word 95 document and saving it as a Word 97 document. Single generation leaps are usually not a problem, so large volumes of information could be saved. Unfortunately, migrations over several generations are often impossible, as is migrating from a document type that was abandoned, and did not evolve. Also, information loss is common in migration, and may cause the document to become unreadable. While this may be the best single method available, it is very labor intensive, and some knowledge of the nature of documents would be essential to determining which information containers to migrate. For instance, often you lose aspects of a document (good and bad) when you migrate it, but which of those aspects are important?
  • Emulation: creating a program that will fake the original behavior of the environment in which the digital object resided. This is another very intriguing method that could be used. It's actually already pretty common. For instance, most processor chips include emulators for lower level processors. There also already exists on the Internet a very active group of people who are interested in emulating old computer platforms. Still, we need to do a lot of research yet on the cost of this method, and on what sorts of metadata must be bundled with the digital object to facilitate its eventual emulation. Another problem is the intellectual property hassle caused by emulation. Reverse engineering is a big no-no, and there is no point in making the lawyers rich. This area is actually where Open Source can be of biggest help in preserving the longevity of different kinds of applications.
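
Refreshing only helps if the copy on the new medium is bit-for-bit identical to the old one. As a minimal sketch of that check (not a description of any particular archive's practice), the Python below copies a file and compares SHA-256 digests before and after. The names `sha256_of` and `refresh`, and the "floppy" and "cdrom" directories standing in for old and new media, are assumptions made up for this illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(source: Path, destination: Path) -> str:
    """Copy source to destination and confirm the copy is bit-identical.

    Returns the shared digest, or raises IOError if the two files differ.
    """
    before = sha256_of(source)
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    after = sha256_of(destination)
    if before != after:
        raise IOError(f"refresh of {source} failed: digests differ")
    return before


if __name__ == "__main__":
    # Hypothetical stand-ins for an old and a new medium.
    with tempfile.TemporaryDirectory() as scratch:
        old_medium = Path(scratch) / "floppy" / "poem.txt"
        new_medium = Path(scratch) / "cdrom" / "poem.txt"
        old_medium.parent.mkdir(parents=True)
        old_medium.write_bytes(b"Roses are red\n")
        print(refresh(old_medium, new_medium))
```

A check like this says nothing about whether any software will still open the file, which is exactly the limit of refreshing described above. Storing the digest alongside the object would at least let a later refresh confirm that nothing decayed in between.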

Many people in the discussion last week seemed to believe that simple refreshment or migration of the data would be a sufficient answer to the problem. At a personal level that may be true, but for anyone responsible for large amounts of digital information, neither is a completely convincing method. Here are a couple of reasons why:

  • Not all documents are the same: In the digital preservation literature, most people talk as if all digital information is in ASCII format. Au contraire. As computing becomes increasingly robust, so do the documents we create. Multimedia games, three dimensional engineering models, recorded speeches, linked spreadsheets, virtual museum exhibits and a host of other documents spurred by the development of the Web have cropped up. How are they going to be affected by migration to a new environment?
  • It's so darned expensive: It's a little gauche to talk about, but the Y2K bug caused what ended up being a huge migration of digital information. How much did the US alone spend on that fiasco? $8 billion? For smaller organizations that do not prepare for the preservation of their digital information, the cost of emergency migrations could cause all sorts of budget trouble.

There is some belief that there is no reason to preserve information at all. Most of what is created is just tripe anyway, and we should be more focused on creating content than preserving it. There are two reasons why some sort of preservation is important. First of all, it is inefficient to recreate information that already exists. Human energy is better spent on building upon existing knowledge to create new wisdom. How much do we already spin our wheels as several people collect the same data? What more could we be doing if we spent the energy instead on new pursuits? Secondly, there is some data that is irreplaceable.

Which is not to say that we should keep everything. In a traditional archive, only 1% of documents received are kept. Ninety-nine out of one hundred documents are destroyed for various reasons. A similar ratio is not unreasonable for digital documents. Consider that 16 billion email messages are sent each day. It seems ridiculous to keep all of them, but how do we weed out the ones we do want to keep? Appraisal of digital documents for archival purposes is going to become a major issue in the not-too-distant future. There are already examples of data that have been lost, or nearly lost. NASA lost a ton of data off of decayed tapes. The U.S. Census nearly lost the majority of the data from the 1960 census. These huge datasets are important for establishing a scientific record that reveals longitudinal effects.

Increasingly, the record of the human experience is kept in a digital format. The act of preserving that information is the act of creating the future's past, the literal reshaping of our world in the eyes of the future. Nobody knows the best answer yet. There is probably not a single answer that will fit absolutely all situations. Information professionals are just beginning to do research in the form of user testing, cost-benefit analysis and modeling to answer some of the thornier issues raised by the preservation of digital information. There are things out there worth saving; we just need to figure out the best way to do it.

Some links of interest in case you would like to read more:

  • a really good bibliography of related sources by Michael Day
  • an article by Jeffrey Rothenberg outlining some of the issues
  • a site at Leeds University with many related links
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Ok ... by Bad Mojo (Score:1) Thursday February 24 2000, @05:57AM
  • 99%?? by Sakhmet (Score:2) Thursday February 24 2000, @05:58AM
  • those AOL CD's (Score:4)

    by BrentRJones (68067) <me AT _remove_this_brentjones DOT org> on Thursday February 24 2000, @06:00AM (#1248687) Journal
    I'm keeping all those old AOL CD-ROMs. Some software archaeologist will need them to see what Internet pioneers struggled with.
  • Limited problem, if... by MPolo (Score:1) Thursday February 24 2000, @06:03AM
  • more info (Score:3)

    by meighan (151487) on Thursday February 24 2000, @06:04AM (#1248689) Homepage Journal
    If interested, there is a report from '96 which offers some more information on the subject. From the Task Force on Archiving of Digital Information, here [rlg.org]

  • BBC article (Score:3)

    by Nafta (42011) on Thursday February 24 2000, @06:05AM (#1248690) Homepage
    BBC currently has an article [bbc.co.uk] on the same subject. This is a great advantage of Open Source (preaching to the converted, I know) because that is the only open standard (and therefore durable) format. All other proprietary formats will come and go with the companies that make them.
  • magnetic storage (Score:4)

    by Signal 11 (7608) on Thursday February 24 2000, @06:06AM (#1248691)
    Unfortunately the solutions we've employed up until recently are fatally flawed - they all use magnetic storage. The problem is that the higher the density, the sooner "bit rot" occurs - those magnetized iron oxide particles work against each other to depolarize. After several years (or several dozen, depending on the media) the data's unsalvageable. That's problem #1.

    The solution would be to use an optical storage medium, but as others have pointed out, CDR storage has a life expectancy of 75-100 years depending on the brand. Which wouldn't be too bad except you have to realize that in 100 years you need to start putting resources into copying all that data off and re-writing it again. After a while you'll have a snowball effect where you spend more time writing the old data than the new!

    What we really need is a piece of technology that doesn't age - an entirely self-contained computer (nuclear powered, maybe?) that has the media, the reading/writing mechanisms and has several failsafe mechanisms to alert you well before any data is lost. Think of it as a computer time capsule - you bury it and in 500 years come back and it has all the human interface necessary to reproduce the data in a usable format. Of course, you'll still need someone who reads English then..

    agh, the problems, the problems....

  • Is this likely? by luckykaa (Score:1) Thursday February 24 2000, @06:07AM
  • My philosophy is ... by Cyclope (Score:1) Thursday February 24 2000, @06:09AM
  • Re:99%?? by Zurk (Score:1) Thursday February 24 2000, @06:11AM
  • Re:99%?? by Matty Boy (Score:1) Thursday February 24 2000, @06:12AM
  • A great challenge (Score:3)

    by AjR (148833) on Thursday February 24 2000, @06:13AM (#1248697) Homepage
    balancing the endless churning of the web against the need for a stable archive.

    Unless we take steps to archive, transcribe and preserve all this information (yes, grits, petrification et al) then we are in effect building a new Library of Alexandria.

    It would be the greatest loss ever for archaeologists of the future to be unable to access archives of the WWW. Every day is a unique snapshot of the world as the endless churning of webpage updates/dead link removals changes the WorldWideWeb.

    This information Ocean is something unique. Archiving such a huge store of information generates a challenge in itself.

    I don't often wax lyrical about the internet but it is in effect becoming a snapshot of our civilisation.

    What a loss for future generations if they cannot see the views of ordinary human beings (through the endless websites) preserved.
  • by dillon_rinker (17944) on Thursday February 24 2000, @06:16AM (#1248698) Homepage
    IIRC, Orson Scott Card addressed this issue in a story set in Isaac Asimov's universe. The library on Trantor had indices going back thousands of years, but the contents of the library had never been refreshed. The librarians knew exactly what they had lost.
  • by inquis (143542) on Thursday February 24 2000, @06:18AM (#1248700)

    When you look back at history, and you look back at documents that are a "mere" thousand years old, the wealth of information in these documents makes you wonder what could be found if all the documents from that time had survived. Just because the format is digital, rather than analog or (eek!) paper, does not mean that this media is impervious to decay.

    However, I think that decay is much, much more serious in digital media. The root of the problem is that if you are looking at a physical document with water damage, even though the original "packets" of information (letters and words) are damaged, the human brain can sometimes extract meaning from smearing ink and crumbling paper. When an electron wanders on magnetic media or when a CD begins to decompose, that bit is lost forever. Digital media is much more susceptible to lapsing into unintelligibility than physical media like paper.

    Preservation in a medium that will not become obsolete is the key. As mundane as it may sound, plain ASCII text will probably never become obsolete because there is no real reason to come up with a new standard. Some people may scream at me: "*ML! *ML!", but at the rate that these things obsolesce, plain text will still be around when XSGHTML has been long dead.

    Just a thought. If you have something to add, feel free to respond.

    Brandon Nuttall, the inquisitor of Reinke

  • by dsplat (73054) on Thursday February 24 2000, @06:21AM (#1248701)
    This is something that is going to be more of a concern for those of us who conduct a significant portion of our lives online already. Ask yourselves, have you ever had a moment of unusual brilliance in which you posted something to Slashdot or Usenet which was truly worth saving? Can you find it now?

    Personally, I encountered the issue of software obsolescence well over a decade ago. I migrated my resume to TeX because it had already been through four other formats and I no longer had access to the tools to read them. I picked TeX because I firmly believed that a tool that I had the source for was likely to continue to be useful to me for a longer period. And the source for the document is ASCII text, which I was able to convert to HTML a couple of years ago with little trouble. I will not rely on the future availability of any tool that I have no control over.

    This is one of the reasons that The Unix Philosophy, a fine book, recommends text formats for data. You can manipulate it with a wide variety of tools including text editors. It is unlikely that we will abandon those completely in our lifetimes. It also suggests, if memory serves, keeping notes online in text form. They are more portable and more accessible that way.

    One worthwhile source of literature preserved as plain text files is Project Gutenberg [promo.net]. It is probably also the oldest such project around. It is to text in some senses what Free Software is to code. Although they aren't doing collaborative authoring projects, they are collaborating on getting old books whose copyrights have expired into electronic form. If you haven't ever visited their site, take a look.
  • Not a technological problem... by captaineo (Score:2) Thursday February 24 2000, @06:22AM
  • some other problems by sloth jr (Score:1) Thursday February 24 2000, @06:23AM
  • by fingal (49160) on Thursday February 24 2000, @06:29AM (#1248706) Homepage
    The problem of transient information storage is not just connected with reading old digital data stores, but also with analogue information as well, typically audio recordings.

    In this case, the main problem is not bit-rot (although this will occur sooner or later) but rather problems with not recalling the information for an extended period of time. For example:-

    • Reels of tape start to imprint signals onto adjacent tape, causing loud passages to have ghost versions either before or after them.
    • Tape actually becoming stuck to itself due to using bad binding materials leading to baking [audio-restoration.com] of tape as desperate restorative measures.
    These and other issues are discussed on www.audio-restoration.com [audio-restoration.com]. Does anybody know if there are similar problems associated with digital media (the cross-talk problem will be virtually negligible due to noise-floor issues being irrelevant)? If so then it makes archiving a much more difficult thing if you have to physically do something to the archives every couple of years (especially with the exponential growth rate of information generation).
  • Inversely proportional? by Yaruar (Score:2) Thursday February 24 2000, @06:32AM
  • Re:Ok ... by gfxguy (Score:1) Thursday February 24 2000, @06:34AM
  • Thanks to "proprietary formats" info will be lost. by Anonymous Coward (Score:1) Thursday February 24 2000, @06:36AM
  • Exponential data and storage by sterno (Score:2) Thursday February 24 2000, @06:38AM
  • Just carve it by Kaa (Score:2) Thursday February 24 2000, @06:38AM
  • Information evolution by Steve Burnap (Score:1) Thursday February 24 2000, @06:41AM
  • Data *and* code by AstronomyDomine (Score:2) Thursday February 24 2000, @06:42AM
  • Ermmmm.... Moderators! Bart's Swearing! by luckykaa (Score:1) Thursday February 24 2000, @06:42AM
  • Re:those AOL CD's by MupwI (Score:2) Thursday February 24 2000, @06:47AM
  • Help? ---> A question... by ATKeiper (Score:1) Thursday February 24 2000, @06:47AM
  • I'm off to save the world... by Chops-Frozen-Water (Score:1) Thursday February 24 2000, @06:48AM
  • Re:My philosophy is ... by Ioldanach (Score:1) Thursday February 24 2000, @06:48AM
  • Caching as a possible approach to preservation by adam (Score:2) Thursday February 24 2000, @06:48AM
  • by fingal (49160) on Thursday February 24 2000, @06:49AM (#1248722) Homepage
    One point to remember when looking at problems with digital information storage media is that they are not really intended to be archived. What they are (mostly) designed to do is read and write the information very fast at a high frequency with a high degree of accuracy. Most of them are quite good at this and the issue of bit-rot tends to go away if you are continually reading and writing your hard-disk.

    CD's and related optical media do have problems with sunlight, but you have to remember that they were created (AFAIK) by the audio industry which is one of the most notoriously fickle industries in the world: they want you to buy a new CD from a new group every week, not have a single CD that is perfect and that lasts forever. I think that the concept of people being able to listen to their CD's for 10 years is already far too long for them.

    The problem is that there doesn't really appear to be anyone making storage media that is optimised for long-term persistent storage. But do you think that such a format would be the way forward? Each year, we generate an exponentially larger amount of information. All the hard disks on the planet now would not be enough to store the new information that will be generated in the next 5 (wild guess) years. Therefore we are going to need progressively larger and more efficient forms of data storage as the information bloat gets larger. As new formats come out, the important thing is to look at the movement of legacy data onto the new formats. If data is not treated as a static thing to be boxed up and forgotten, but rather as part of the on-going current set of information and transferred onto new technologies as they are developed, then you will not have the situation where people are looking at a hard disk in 50 years' time and going "what's an IDE interface?".

    Of course, then you have the 'minor' issue of application file formats...

  • by Anonymous Coward on Thursday February 24 2000, @06:50AM (#1248723)
    What is the lifespan of data stored on a CDR? An old 550MB (63min) CDR? A 650MB (74min) CDR? Green CDRs? Gold CDRs? Blue CDRs? A CDRW? etc.? Under non-ideal conditions?

    I put some CDRs out in the direct sun here in the Las Vegas desert over the last summer. Blue, gold, green, pale green, and an RW. Both sides of the CDs had their chance to roast in the 100F+ (40C+) degree sun for several months each. And here are the results of attempting to read the data back on each type:

    Old TDK green CDR: dead, nothing readable. Faded to a mostly clear plastic disc!
    Ricoh gold/gold CDR: dead, nothing readable. The golds faded visibly first of them all. Area where data was stored faded to clear!
    Verbatim (blue): I was stunned. I read back a full and complete iso image of Red Hat 4.2. No fading at all.
    Memorex silver/green CDR: mostly dead, some files readable. Faded in a few isolated patchy blotches.
    The CDRW... just started this test. No results yet. Looks OK, though.

    Overall, I'd say the blue CDRs are the best choice for long term data storage.

  • Moore's Law by stevelinton (Score:2) Thursday February 24 2000, @06:52AM
  • Re:magnetic storage by raygundan (Score:2) Thursday February 24 2000, @06:53AM
  • Re:more info by Gerald (Score:2) Thursday February 24 2000, @06:53AM
  • Re:A great challenge by pakratt (Score:2) Thursday February 24 2000, @06:54AM
  • Re:Data Decay, Readability, and ASCII text. by Steve Burnap (Score:2) Thursday February 24 2000, @06:57AM
  • related article by ATKeiper (Score:1) Thursday February 24 2000, @06:59AM
  • Analog vs. Digital by doranb (Score:1) Thursday February 24 2000, @06:59AM
  • Re:magnetic storage by Anonymous Coward (Score:1) Thursday February 24 2000, @07:01AM
  • Vanishing Web Content by raygundan (Score:1) Thursday February 24 2000, @07:03AM
  • How long can it last? by blackdefiance (Score:1) Thursday February 24 2000, @07:06AM
  • Re:Thanks to "proprietary formats" info will be lo by Detritus (Score:2) Thursday February 24 2000, @07:08AM
  • Re:Is this likely? by thunderbee (Score:1) Thursday February 24 2000, @07:09AM
  • Data Havens, Archive and standards oh my! by jfrisby (Score:2) Thursday February 24 2000, @07:10AM
  • Re:Analog vs. Digital by fingal (Score:1) Thursday February 24 2000, @07:11AM
  • My solution to access obselete documents.. by technos (Score:2) Thursday February 24 2000, @07:13AM
  • Re:Inversely proportional? by Yaruar (Score:1) Thursday February 24 2000, @07:14AM
  • Re:Data Decay, Readability, and ASCII text. by clintp (Score:2) Thursday February 24 2000, @07:21AM
  • Data preservation by LostOne (Score:2) Thursday February 24 2000, @07:21AM
  • Re:those AOL CD's by Shadowlion (Score:1) Thursday February 24 2000, @07:23AM
  • Why TeX is better than PostScript by tilly (Score:2) Thursday February 24 2000, @07:23AM
  • Re:Is this likely? (Score:4)

    by dierdorf (37660) on Thursday February 24 2000, @07:25AM (#1248749) Homepage
    Gee, I just happen to have a bunch of steel-ribbon tapes from a Univac I. Maybe that information is vital to civilization as we know it. Do you really think those tapes can be read today without tremendous expenditure of time and effort?

    BTW, I think the original author missed one future problem - encrypted information. I foresee hardware-based encryption becoming almost ubiquitous so that most data is encrypted. If encryption becomes universal, then much info will be encrypted that really wasn't burn-before-reading secret. What happens to all that information - of potential interest to historians looking back on the 21st century - under those conditions?

  • Interesting timing on this article by Mindwarp (Score:2) Thursday February 24 2000, @07:28AM
  • Re:How long can it last? by Kulibali (Score:2) Thursday February 24 2000, @07:29AM
  • Re:An excellent summary of the problem by dsplat (Score:2) Thursday February 24 2000, @07:33AM
  • Re:Old issue (at least in real life) by djfiander (Score:1) Thursday February 24 2000, @07:34AM
  • by jabber (13196) on Thursday February 24 2000, @07:36AM (#1248756) Homepage
    There is some belief that there is no reason to preserve information at all. Most of what is created is just tripe anyway, and we should be more focused on creating content than preserving it. There are two reasons why some sort of preservation is important. First of all, it is inefficient to recreate information that already exists. [Point 1] Human energy is better spent on building upon existing knowledge to create new wisdom. How much do we already spin our wheels as several people collect the same data? What more could we be doing if we spent the energy instead on new pursuits? [Point 2] Secondly, there is some data that is irreplaceable.

    Point 1:
    With the amount of data that we produce, archiving it will take an increasing amount of time. How much new content is created daily? At best, we will plateau in a state where as much effort is required to archive content as is needed to create new content.

    With the emphasis placed squarely on non-duplication of effort, archiving becomes a secondary issue. Indexing, searching, sorting and categorizing of the archive becomes a first priority, since creative efforts should now check if they are redundant.

    If the bold statement is to be a guideline, then the idea of an archive is moot, since all new work depends on old work, and so tracks well with where the author feels human effort should go. Much like with biological evolution, new data is the fittest of the old data that was applicable to the new context. I suppose that the call for archives is little more than a suggestion that we need an organized and deliberate fossil record of how we got to where we will be at some point in the future.

    What is needed is an archive, yes, but an archive of what? Not of content, but of the essence of the content. The lessons learned, the conclusions drawn and the optimizations realized in the process of creating the content. The content is fleeting - though arguably of inherent value... Which brings us to...

    Point 2:
    Yes, some things are irreplaceable. Who decides? Who defines what is art, what is fact, and what deserves eternal life?

    Some things are of immediate and significant value, but for an unknown duration. The value of other things can not be realized for a very long time, and so the alternative is to store everything. Furthermore, the value of certain data is totally subjective, and this begs the question of "who's in charge" of defining that 1% that is to be kept.

    On the small scale, this will lead to vanity. Any 'artist' will consider their work a masterpiece, and save it. (I have code I wrote in CS101, don't you?) Companies will store and archive all email, all financials, anything that can potentially be used to mine data or identify trends or fertilize litigation. People will pigeon-hole videos of their baby's first steps, though nobody outside themselves really cares - unless the child grows up to be the next Einstein, or Hitler.

    "Hitler" raises an interesting question on the larger scale. Who has the responsibility of deciding what 'big' facts to store? And isn't that the path to propaganda, history-making, and such things?

    And then, when the leadership changes, and the 'book burning' starts...

    To bring the concept down from the paranoid-sphere, let's recall the /. article about Nikola Tesla. His work is not well known to most, because it was not made prominent, and subsequently, not well archived. We know of him, and we can dig for more about him, but the credit goes where it may not necessarily belong.

    Same issue with Newton and Leibnitz. Leibnitz was the German Mathematician who beat Newton to the concepts of Calculus. Newton, a member of the Royal Academy of Sciences (or something to that effect) politicised HIS influence, and so was credited with all of the work - where his contribution was not complete.

    Some things are not outright lies, but oral histories get lost while written records persist.

    Who gets to choose what to write down?
  • Resiliant media by grappler (Score:2) Thursday February 24 2000, @07:40AM
  • Memetic perspective by DezMo (Score:2) Thursday February 24 2000, @07:41AM
  • by mhkohne (3854) on Thursday February 24 2000, @07:41AM (#1248759) Homepage
    Disclaimer: I know I'm being a bit paranoid, but I think this should be brought up, at least for purposes of discussion. There is probably less to worry about here than in other places, but it still should, I think, be in the back of the mind of anyone trying to solve this problem.

    One thing I believe was missed in the original article is intentional change to the historical record. In addition to having to store old information, and worry about how we're going to get to it later, I think we need to pay at least half a thought to intentional modification of the historical record.

    With paper and ink, it's rather time consuming and expensive to alter historical documents, even assuming you can get near them. With digital media, the situation may be different - it may become very simple to alter historical documents, especially if you're the guy who's in charge of copying them to the newest form of media.

    Aside from the obvious political reasons someone might want to do this (can you think of a fundamentalist movement of any sort that wouldn't modify old documents to read the way they would like, given the chance?), I can also see where money might come into play.

    For instance, suppose MassiveDrugCo, Inc. is introducing a new drug which prevents newly detected disease Y. Now, in order to sell a lot of this drug, you have to show that Y harms enough people to worry about. Unfortunately, the historical record being used for retrospective studies doesn't show that. So, rather than going back to the drawing board and finding something else to cure, MassiveDrugCo feeds a modified copy of the historical data to unsuspecting independent researchers. These honest and unbribable researchers draw the conclusion desired by MassiveDrugCo - in spite of the reality of the situation.
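
    One hedged illustration of how silent edits by a copyist could at least be detected: record a cryptographic digest of each document when it is archived, keep that digest with parties the copyist does not control, and check every later copy against it. The sketch below assumes SHA-256 from Python's standard hashlib module; the function names and the sample records are hypothetical.

```python
import hashlib


def fingerprint(document: bytes) -> str:
    """Return the SHA-256 digest of a document as a hex string."""
    return hashlib.sha256(document).hexdigest()


def is_unaltered(document: bytes, recorded_digest: str) -> bool:
    """Check a later copy against the digest recorded at archive time."""
    return fingerprint(document) == recorded_digest


# Hypothetical historical record and the digest taken when it was archived.
original = b"Cases of disease Y reported in 1999: 12"
recorded = fingerprint(original)

# A copy that was quietly edited while being moved to new media.
tampered = b"Cases of disease Y reported in 1999: 12000"

print(is_unaltered(original, recorded))  # the faithful copy matches
print(is_unaltered(tampered, recorded))  # the edited copy does not
```

    This only detects a change; it does nothing to prevent one, and it is worthless if whoever alters the data can also replace the recorded digest.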

  • Wired article on this topic by vallee (Score:2) Thursday February 24 2000, @07:42AM
  • Re:Just carve it by Cid Highwind (Score:1) Thursday February 24 2000, @07:44AM
  • by hey! (33014) on Thursday February 24 2000, @07:47AM (#1248762) Homepage Journal
    I think you have to ask, what are you preserving information for?

    Are you trying to preserve episodes of the Simpsons so our relatively near term, technologically advanced descendants can watch them? Well, they're technologically more advanced and thus more clever than we; we just need to have sufficiently stable media (micromachined gold plates would work nicely) and either a simple minded encoding scheme or an easily readable description of the algorithm prepended. In the 22nd century, some bright Norwegian 16 year old armed with a yottaflop computer will figure out how to read it if he cares enough.

    A bigger concern (in my opinion) is what happens when our civilization collapses. Historically, it is almost certain to happen sooner or later. Rome lasted well over a thousand years; if you told a 1st century CE Roman that there would ever be an end to the empire he'd think you were crazy. Yet our civilization is in many ways much more fragile because the information it is based on is in much more ephemeral form (both media and format).

    What we need is to devise a bootstrap procedure.

    (1) Reading primers in various languages.

    (2) Primers on basic technology: mathematics, simple mechanics, mining and elementary metallurgy.

    These should be in highly durable form, but the problem is that you don't want people making off with them for building materials. The problem with using gold plates is that you don't want people to have access to them until the information on them is more valuable than the substrate. Perhaps these first items could be carved onto stone pillars inconveniently large to move.

    Next, you need repositories on more advanced science and technology: chemical engineering, electronics and so forth. Perhaps you could rig a way to prevent savages from accessing these repositories; a mechanical puzzle perhaps, that requires a certain mathematical sophistication to solve. The most critical records could be kept in forms that could readily be read without mechanical assistance or with only simple mechanical assistance such as optical magnification (my local librarian likes microfilm, because she knows it will be readable for decades). Less critical things like old Simpsons episodes could be on very cryptic media that would require considerable technical finesse to read, but would be cheap to transfer to.

    Pretty much, as you go from the most basic and critical information to the least critical information, you go from the easiest to read and most expensive to produce per bit, to the hardest to read and most convenient to produce.

  • Re:My philosophy is ... by plague3106 (Score:1) Thursday February 24 2000, @07:48AM
  • Value degradation by Nyarly (Score:2) Thursday February 24 2000, @07:52AM
  • Monkey by Louziffer (Score:1) Thursday February 24 2000, @07:55AM
  • Re:How long can it last? by blackdefiance (Score:1) Thursday February 24 2000, @08:02AM
  • Re:those AOL CD's (Score:3)

    by um... Lucas (13147) on Thursday February 24 2000, @08:03AM (#1248768) Journal
    My worry is that at some distant point in the future, all our paper will have rotted away and people think that the CDs were our primary means of communication, like Babylonian pottery shards..

    Actually no. Paper has proven to be one of our most durable ways of storing data. Egyptian papyrus from 3000 years ago is still more or less intact. CD's, on the other hand, will last for 100 years in a "BEST CASE" scenario. Most will last much less time. CDRs might last 25 years. There are other variables besides the media itself. I've seen several CD's from the early 80's that refuse to play these days. They're not at all scratched, but the theory goes that the original ink used to print on them was actually a bit corrosive over a great span of time.

    In order to remain readable, digital data must remain more or less intact. A few missing bits in the application needed to open a file can pretty much reduce your odds of opening that file by 100%. Analog data, on the other hand, degrades much more gracefully... It may start to fade, but there's no intermediary between having the data and being able to read it (you don't need an extra "application" to read a newspaper).

    I've heard that this is actually going to be one of the least documented periods in human history, because all of our data is stored digitally and periodically purged. Even if it's not, places like NASA are generating data faster than they're able to back it up and move their old archives onto newer media.
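
    The point above about a few missing bits can be tried out directly. This is a minimal sketch, assuming Python's standard zlib module as a stand-in for "a format that needs an application to read it"; the helper names and the sample text are hypothetical. It flips a single bit in a plain-text copy and in a compressed copy of the same text.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with exactly one bit inverted."""
    damaged = bytearray(data)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


def try_unpack(blob: bytes):
    """Decompress blob, or return None if zlib rejects it."""
    try:
        return zlib.decompress(blob)
    except zlib.error:
        return None


text = b"Information professionals think in centuries. " * 20

# Damage the same bit position in the plain copy and in the compressed copy.
plain_damaged = flip_bit(text, 100)
packed_damaged = flip_bit(zlib.compress(text), 100)

# The plain copy has one wrong character; the rest still reads fine.
wrong_bytes = sum(a != b for a, b in zip(text, plain_damaged))
print("plain text bytes damaged:", wrong_bytes)

# The compressed copy no longer gives back the original text at all.
print("compressed copy recovered intact:", try_unpack(packed_damaged) == text)
```

    Plain text degrades by one character, much as a faded newspaper does, while the compressed stream is typically rejected outright by the decoder.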
  • Re:Dead Media Problems by jms (Score:2) Thursday February 24 2000, @08:04AM
  • I've been working with the Linux Video [linuxvideo.org] group where we've been trying to make an open source player for DVD discs. The ONLY problem that we're fighting right now is not the know-how to get it done, but rather trying to obtain the file format documents for DVD-Video and being able to use them legally. Indeed, the recent DeCSS program is another really good example of how file format specifications can be illegal to implement, even if you have obtained the specifications legally.

    The DVD Forum [dvdforum.com] (formerly known as the DVD Consortium, which oversees the DVDCCA; this is the group of companies that cross-license each other's patents and share information regarding DVD development) currently requires you to sign a non-disclosure agreement (NDA) to obtain the specifications, and that NDA also prohibits you from even discussing the specifications with anybody unless they have also signed the same NDA. Since this is covered under the trade secret laws, this particular bit of intellectual property is theirs theoretically forever. At least until you can hire a bunch of lawyers to demonstrate that a DVD is no longer a trade secret.

    I've also set up a separate mailing list from the main Linux Video group that is developing an Open Video Disc [mailto] specification, which aims to let people develop products without having to pay royalties or deal with patent infringements. Fees for most of the current video formats range from over $10,000 for the DVD specs (license fees are on top of that) to the terms of the MPEG Licensing Authority [mpegla.com], which is being quite reasonable for most closed-source projects but whose licensing requirements, if you read the details, are contrary to the nature of most open-source projects. It is still possible to write a GPL'ed MPEG player, but it would only be free as in speech and not free as in beer. In fact, you would probably have to charge somebody to download the software. Shareware MPEG players are probably skating on some very thin ice legally, and certainly part of the registration costs would have to go to the MPEGLA.

    One of the things that is so nice about HTML is the fact that this standard is open, patent and royalty free. If CERN had tried to put a patent on HTML I doubt that the web would have developed nearly so quickly. Or rather imagine if Apple's HyperCard system had been developed with the GPL and file formats were made open for anybody on any platform to use.

    One of the things that I believe is killing the Unicode character encoding is that all kinds of intellectual property restrictions are placed on it, and you need to pay royalties to develop much software that uses it. Again, think what would have happened with ASCII had it been kept closed up, and why EBCDIC isn't being used for character encoding.

    More importantly, open and free specifications are critical to data preservation, a point that really hasn't been brought up by Cacl (the author of the original post on /.).
  • NASA problem too by peter303 (Score:1) Thursday February 24 2000, @08:16AM
  • It's the first time we've tried to store data by scotpurl (Score:1) Thursday February 24 2000, @08:18AM
  • Black Hole Applications Software by Detritus (Score:2) Thursday February 24 2000, @08:18AM
  • Bit density by Tau Zero (Score:1) Thursday February 24 2000, @08:19AM
  • COPY, COPY, COPY by peter303 (Score:1) Thursday February 24 2000, @08:20AM
  • Reducing data for archiving purposes by tjwhaynes (Score:2) Thursday February 24 2000, @08:25AM
  • Re:Is this likely? by luckykaa (Score:1) Thursday February 24 2000, @08:25AM
  • Re:Analog vs. Digital by Detritus (Score:2) Thursday February 24 2000, @08:30AM
  • Re:Is this likely? by Psycho S. Illusion (Score:1) Thursday February 24 2000, @08:30AM
  • The Internet Archive by Pseudonymus Bosch (Score:2) Thursday February 24 2000, @08:31AM
  • Upgrade fever speeds the process by Junks Jerzey (Score:2) Thursday February 24 2000, @08:31AM
  • picky, picky, picky by unitron (Score:1) Thursday February 24 2000, @08:34AM
  • as in archival paper... by fantomas (Score:1) Thursday February 24 2000, @08:34AM
  • A modest proposal (Score:3)

    by vlax (1809) on Thursday February 24 2000, @08:35AM (#1248787)
    Archiving is important. I'm actually surprised at the number of /.'ers who just want to let the data die. I remember taking a tour of the Maginot Line in France. Having proven useless as a military outpost, the entire chain of caverns was converted to document storage decades ago. In a thousand years, archeologists will be able to substantially reconstruct life in twentieth century France. Information about births, deaths and marriages need never be lost. Detailed census reports can be preserved so historians can make new theories about the social behaviour of man. I think this is a fairly important task. Imagine how much easier it would be to reconstruct human history if past civilisations hadn't kept shoddy records.

    I suspect the problem of file formats is less serious than people make it out to be. A well-documented format should be reconstructable indefinitely. Few software companies don't document their file formats. Even without documentation, it ought to be easier than reconstructing dead languages. We learned to read Egyptian hieroglyphs primarily from one attested translation and a lot of careful deduction. Given a thousand Word 6 documents, I think a good computer archeologist ought to be able to construct a program to open and edit them.

    Museums of old hardware, and perhaps some sort of custom computer factory to make ancient hardware, strike me as a good idea. It could be like blacksmiths at SCA festivals, "Ye Olde ASIC Mill." :^) I doubt it would ever be profitable, but museums, even working ones, rarely are. Although who knows? A Commodore 64 could be an objet d'art in a hundred years, just as ugly African masks are now.

    The real problem strikes me as the one most heavily emphasised in the article: decaying media. I suspect the best solution with presently foreseeable technology would be to preserve data in crystalised DNA. Even in nature, DNA takes centuries to decay, and if it were crystalised and kept somewhere cool and dry, it would likely last for millennia. A document encoded onto a billion strands of DNA would weigh basically nothing, and it would be a very highly redundant storage system.

    It isn't easy to do right now, but I suspect that technology is right around the corner and probably only requires a little bit of research money to become practical.
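
    To make the DNA idea a little more concrete, here is a toy sketch of what "encoding a document onto DNA" could mean at the bit level. It assumes a naive mapping of two bits per base (A=00, C=01, G=10, T=11); that mapping and the function names are hypothetical, and a real scheme would need error correction and chemical constraints that are ignored here.

```python
BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    strand = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            strand.append(BASES[(byte >> shift) & 0b11])
    return "".join(strand)


def dna_to_bytes(strand: str) -> bytes:
    """Decode a strand produced by bytes_to_dna back into bytes."""
    if len(strand) % 4 != 0:
        raise ValueError("strand length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(strand), 4):
        byte = 0
        for base in strand[i:i + 4]:
            # BASES.index raises ValueError for anything but A, C, G, T.
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


document = b"Births, deaths and marriages"
strand = bytes_to_dna(document)
print(strand)
print(dna_to_bytes(strand) == document)
```

    At two bits per base a document needs four bases per byte, and the redundancy comes from synthesising many identical copies of each strand.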
  • Sometimes the 99% is destructive by Tau Zero (Score:2) Thursday February 24 2000, @08:40AM
  • What's wrong with Reverse Engineering? by xant (Score:1) Thursday February 24 2000, @08:53AM
  • Gutenberg Factoid by Savage Henry Matisse (Score:1) Thursday February 24 2000, @08:54AM
  • A contrarian view by Tau Zero (Score:1) Thursday February 24 2000, @09:08AM
  • Re:magnetic storage by alleria (Score:1) Thursday February 24 2000, @09:14AM
  • by Mr. Slippery (47854) <tms AT infamous DOT net> on Thursday February 24 2000, @09:21AM (#1248795) Homepage
    Many of the earliest mission datasets from the 60s and 70s are unrecoverable due to media degradation and format incompatibility.
    ...including, IIRC, a bunch of old Landsat data. "So what?" I hear you ask. "If the data were important, it would have been accessed more often and ended up being transcribed and preserved."

    Problem is, it's entirely possible for us to not understand the importance of a data collection for years. That old Landsat data would be a great baseline for information about global climate change.

  • Re:An excellent summary of the problem by 0xdeadbeef (Score:2) Thursday February 24 2000, @09:23AM
  • Re:those AOL CD's (Score:3)

    by karb (66692) on Thursday February 24 2000, @09:29AM (#1248798)
    Egyptian papyrus from 3000 years ago is still more or less intact.

    I read an article once in that hotbed of liberal thinking, Reader's Digest, about book deterioration. Older books were printed with a different method, and will last a couple hundred years. Newer books will only last maybe 50 years.

    This raises the question: how long will computer printouts last?

  • simplify by whatever3 (Score:1) Thursday February 24 2000, @09:30AM
  • But what is the solution? by -tji (Score:2) Thursday February 24 2000, @09:38AM
  • Re:Black Hole Applications Software by Mr. Slippery (Score:2) Thursday February 24 2000, @09:41AM
  • Is this relevant: Deep Time By Gregory Benford???? by Anonymous Coward (Score:1) Thursday February 24 2000, @09:50AM
  • Re:Data Decay, Readability, and ASCII text. by qbzzt (Score:1) Thursday February 24 2000, @09:51AM
  • Re:those AOL CD's by um... Lucas (Score:2) Thursday February 24 2000, @09:52AM
  • Problems with modern paper by Anonymous Coward (Score:1) Thursday February 24 2000, @09:54AM
  • by ludovicus (131490) on Thursday February 24 2000, @10:05AM (#1248808)
    This isn't anything new. I'll probably misspell all of the following names, but I think you'll get the gist of it.

    I believe it was King Tutankhamun's father, Akhenaten, who threw his world into a tizzy by rejecting the established religion and inventing a new one that worshipped the sun. He went off and built a new city to go along with it too.

    Well, the bureaucracy of the day didn't like this at all because it messed with their job security. And as soon as he was dead, they went around hacking his face off anywhere it appeared (of course we're talking about monuments, etc., made from stone) and I believe they went after any mention of him in text (hieroglyphs) too.

    And they almost got away with it and just about completely expunged his existence from their records. But they missed a few things and we've been able to piece together a little bit about him.

    So anyhow, there's my Discovery channel understanding of that little story. What it means in relation to this subject I'm not quite sure. I thought it was a good idea to point out that this is certainly not a new issue.
  • Re:Is this likely? by morzel (Score:1) Thursday February 24 2000, @10:09AM
  • Re:more info by Tuscahoma (Score:2) Thursday February 24 2000, @10:19AM
  • Very relevant project by Sanity (Score:2) Thursday February 24 2000, @10:30AM
  • Re:A great challenge by Ig0r (Score:1) Thursday February 24 2000, @10:36AM
  • Re:Information evolution by jwhyche (Score:1) Thursday February 24 2000, @10:41AM
  • Sexy Data by Anonymous Coward (Score:1) Thursday February 24 2000, @10:46AM
  • Re:Data Decay, Readability, and ASCII text. by s-gunn (Score:1) Thursday February 24 2000, @11:02AM
  • Re:An excellent summary of the problem by Oxryly (Score:2) Thursday February 24 2000, @11:15AM
  • The web embodies decay ... by shango dee (Score:2) Thursday February 24 2000, @11:31AM
  • Re:Information evolution by Steve Burnap (Score:1) Thursday February 24 2000, @11:33AM
  • Re:Ok ... by grouchomarxist (Score:1) Thursday February 24 2000, @11:36AM
  • Re:An excellent summary of the problem by dsplat (Score:2) Thursday February 24 2000, @11:38AM
  • Re:Civilization Bootstrapping by Once&FutureRocketman (Score:1) Thursday February 24 2000, @12:06PM
  • Re:Civilization Bootstrapping by jwhyche (Score:1) Thursday February 24 2000, @12:08PM
  • Re:An excellent summary of the problem by kimihia (Score:1) Thursday February 24 2000, @12:18PM
  • Re:BBC article by coaxial (Score:2) Thursday February 24 2000, @12:25PM
  • Strategy by i (Score:1) Thursday February 24 2000, @12:28PM
  • Re:Civilization Bootstrapping by Animats (Score:2) Thursday February 24 2000, @12:32PM
  • Re:Limited problem, if... by ShawnMcCool (Score:1) Thursday February 24 2000, @12:49PM
  • Re:Civilization Bootstrapping by hey! (Score:2) Thursday February 24 2000, @01:12PM
  • Re:Problems with modern paper by Anonymous Coward (Score:2) Thursday February 24 2000, @01:24PM
  • the final answer! by jlb (Score:1) Thursday February 24 2000, @01:33PM
  • Re:Data Decay, Readability, and ASCII text. by sjames (Score:2) Thursday February 24 2000, @01:58PM
  • Re:BBC article by Nafta (Score:1) Thursday February 24 2000, @01:59PM
  • Re:Data Decay, Readability, and ASCII text. by sjames (Score:2) Thursday February 24 2000, @02:02PM
  • Re:Is this likely? by Pxtl (Score:1) Thursday February 24 2000, @02:15PM
  • Re:Civilization Bootstrapping by kristau (Score:1) Thursday February 24 2000, @02:20PM
  • I propose... by Squeeze Truck (Score:1) Thursday February 24 2000, @02:44PM
  • Ancient Egyptians by Tech (Score:1) Thursday February 24 2000, @03:07PM
  • The Clock of the Long Now by grumling (Score:1) Thursday February 24 2000, @03:22PM
  • What about ... by N3mo (Score:1) Thursday February 24 2000, @03:52PM
  • I have the answer by bcilfone (Score:1) Thursday February 24 2000, @03:58PM
  • Digital media is analog underneath by dialect (Score:2) Thursday February 24 2000, @04:03PM
  • Re:magnetic storage by SEWilco (Score:1) Thursday February 24 2000, @04:47PM
  • Studying Real Analog Backup by Garund (Score:1) Thursday February 24 2000, @04:57PM
  • Re:How long can it last? (language change) by SPK (Score:1) Thursday February 24 2000, @06:47PM
  • CD-ROMs improving by gibber (Score:2) Thursday February 24 2000, @08:27PM
  • Re:those AOL CD's by MupwI (Score:1) Thursday February 24 2000, @11:43PM
  • Re:A contrarian view by stevelinton (Score:2) Friday February 25 2000, @12:49AM
  • Character sets... by guran (Score:2) Friday February 25 2000, @01:02AM
  • Rosetta stone by quiddity (Score:1) Friday February 25 2000, @02:36AM
  • Re:My philosophy is ... by plague3106 (Score:1) Friday February 25 2000, @02:41AM
  • Re:Thanks to "proprietary formats" info will be lo by RomulusNR (Score:1) Friday February 25 2000, @03:29AM
  • Re:Ok ... books last? by Lunchmeat (Score:1) Friday February 25 2000, @03:35AM
  • Re:those AOL CD's by RomulusNR (Score:1) Friday February 25 2000, @03:48AM
  • Re:magnetic storage by hotseat (Score:1) Friday February 25 2000, @03:54AM
  • even the bible has degraded by HelloKitty (Score:1) Friday February 25 2000, @04:02AM
  • Preservation by Chance by beroul (Score:1) Friday February 25 2000, @04:31AM
  • Re:those AOL CD's by karb (Score:1) Friday February 25 2000, @04:41AM
  • Re:A contrarian view by Tau Zero (Score:2) Friday February 25 2000, @05:03AM
  • Re:Civilization Bootstrapping by ballestra (Score:1) Friday February 25 2000, @05:16AM
  • Re:Umm.. by Teancum (Score:2) Friday February 25 2000, @05:19AM
  • Re:So just what *is* the life of a CDR? My results by Sponge! (Score:1) Friday February 25 2000, @07:48AM
  • Re:VA / Slash-dot Giveaway NOT AGAIN!!! by Kit Cosper (Score:1) Friday February 25 2000, @08:13AM
  • Re:A paranoid addition... by RomulusNR (Score:1) Friday February 25 2000, @09:00AM
  • We are the Internet Archive by bollacker (Score:1) Friday February 25 2000, @09:14AM
  • Worried about the wrong things, a bit. by RomulusNR (Score:1) Friday February 25 2000, @09:33AM
  • Re:Limited problem, if... by bonzoesc (Score:1) Friday February 25 2000, @09:46AM
  • Re:Character sets... by dsplat (Score:1) Tuesday February 29 2000, @06:08AM
  • Re:Character sets... by guran (Score:1) Wednesday March 01 2000, @12:26AM