
On Preservation of Digital Information 199

Cacl, a PhD student in the University of Michigan's School of Information, has written a feature addressing the concerns and problems of preserving digital information. It's his area of study, and it makes for interesting reading.

Preservation of Digital Information

Recently there was an Ask Slashdot about the problem of preserving digital material. The basic idea was that we are creating a massive wealth of digital information, but have no clear plan for preserving it. What happens to all of those poems I write when I try to access them for my grandkids? What about the pictures of my kids I took with that digital camera? Can I still get to them in time to embarrass them in the future?

Obsolescence of digital media can happen in three different ways:

  • Media Decay: Even when magnetic media are kept in dry conditions, away from sunlight and pollution, and hardly ever accessed, they will still decay. Electrons wander over the substrate of the media, causing digital information to become lost. CD-ROMs luckily do not have this same problem with electron loss, though they are still sensitive to sunlight and pollution. Many people mentioned last week that distributors of blank CD media often make claims of a hundred years or more for the duration of their products. Research seems to indicate the truth is closer to 25 years, which seems like a long time, until you consider the factors below. Besides, information professionals often think in terms of centuries rather than decades.
  • Hardware obsolescence: Far more dangerous than the degradation of the actual information container is the loss of machines that can read it. For instance, the Inter-university Consortium for Political and Social Research received a batch of data on old punch cards. The problem was they had no punch card reader. It took a decent chunk of time and a good deal of money to eventually read the data off of those cards, even requiring some old technicians to come out of retirement to help tweak the system. Hardware extinction is hardly a foreign topic to Slashdotters. It happens, and as technology increases its pace of change, it will happen more quickly.
  • Software obsolescence: The real stone in the shoe of digital preservation is obsolescence of the software needed to open the digital document. This can include drivers, operating systems, or plain old application software. We all have piles of old software written for older systems, or come across an old file at the bottom of a drawer and can't even remember what application created it.

There are several strategies for preserving digital information. People mentioned some last week:

  • Transmogrification: printing the digital document into an analog form and preserving the analog copy. An example would be printing out a Web page and archiving the print of that Web page. This, obviously, throws away the main strength of a Web document, its hyperlinks, and may also lose important color and graphical content. An alternative form of this is the creation of hard-copy binary that could later be keyed back into the computers of the future. The media suggested have ranged from acid-free paper to stainless steel disks etched with the binary code. The two major problems with this idea are that any misrepresentation of the binary could have disastrous results for the recovery of the document, and that transformation to hard copy limits the functionality of many types of digital documents to the point of uselessness.
  • Hardware museums: preserving the necessary technology needed to run the outdated software. There are several weaknesses to this plan. Even hardware that is carefully maintained breaks and becomes unusable. In addition, there is no clear established agency that will be responsible for maintaining these machines. Spare parts eventually become impossible to find, and technicians with the requisite legacy skills are needed to service the preserved machines. Finally, it is hardly efficient if all future users must funnel through just a handful of viewing sites to get at the information.
  • Standards: reliance on industry-wide standardization of formats to prevent obsolescence. Marketplace pressures create an incentive for software producers to differentiate their products from their competitors', so universal standardization is unrealistic in a capitalistic marketplace. Even so, standards such as SGML have proven successful for large-scale digital document repositories, like the Making of America archive hosted by the University of Michigan. However, many of these large repositories also receive information from donors that is not in a standardized format, and do not feel comfortable turning away those documents.
  • Refreshing: moving a digital object from one medium to another. For instance, transferring information on a floppy disk to a CD-ROM. This definitely seemed to be the preferred method of most Slashdotters. While this takes care of degradation and obsolescence of the media, it does not solve the problem of software obsolescence. A perfectly readable copy of a digital document is useless if there is no software program available to translate it into human-readable form.
  • Migration: moving the digital document into newer formats. An example might be taking a Word 95 document and saving it as a Word 97 document. Single generation leaps are usually not a problem, so large volumes of information could be saved. Unfortunately, migrations over several generations are often impossible, as is migrating from a document type that was abandoned, and did not evolve. Also, information loss is common in migration, and may cause the document to become unreadable. While this may be the best single method available, it is very labor intensive, and some knowledge of the nature of documents would be essential to determining which information containers to migrate. For instance, often you lose aspects of a document (good and bad) when you migrate it, but which of those aspects are important?
  • Emulation: creating a program that will fake the original behavior of the environment in which the digital object resided. This is another very intriguing method that could be used. It's actually already pretty common. For instance, most processor chips include emulators for lower-level processors. There already exists on the Internet a very active group of people who are interested in emulating old computer platforms. Still, a lot of research remains to be done on the cost of this method, and on what sorts of metadata are necessary to bundle with the digital object to facilitate its eventual emulation. Another problem is the intellectual property hassle caused by emulation. Reverse engineering is a big no-no, and there is no point in making the lawyers rich. This area is actually where Open Source can be of biggest help to preserving the longevity of different kinds of applications. (A toy sketch of what an emulator core boils down to follows this list.)
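Here is that toy sketch, in Python: a loop that fetches, decodes, and executes the instructions of a legacy machine entirely in software. The three-instruction machine below is invented purely for illustration; a real emulator adds memory maps, I/O devices, and timing on top of the same skeleton.

    # Toy emulator for a hypothetical legacy machine with four opcodes:
    # 0 = HALT, 1 = LOAD immediate, 2 = ADD immediate, 3 = PRINT accumulator.
    def emulate(program):
        acc = 0                          # emulated accumulator register
        pc = 0                           # emulated program counter
        while pc < len(program):
            op = program[pc]
            if op == 0:                  # HALT
                break
            elif op == 1:                # LOAD the next value into the accumulator
                acc = program[pc + 1]
                pc += 2
            elif op == 2:                # ADD the next value to the accumulator
                acc += program[pc + 1]
                pc += 2
            elif op == 3:                # PRINT the accumulator
                print(acc)
                pc += 1
            else:
                raise ValueError("unknown opcode %d" % op)

    emulate([1, 40, 2, 2, 3, 0])         # loads 40, adds 2, prints 42, halts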

Many people in the discussion last week seemed to believe that simple refreshing or migration of the data would be a sufficient answer to the problem. At a personal level that may be true, but for anyone responsible for large amounts of digital information, neither is a completely convincing method. Here are a couple of reasons why:

  • Not all documents are the same. In the digital preservation literature, most people write as if all digital information were in ASCII format. Au contraire. As computing becomes increasingly robust, so do the documents we create. Multimedia games, three-dimensional engineering models, recorded speeches, linked spreadsheets, virtual museum exhibits and a host of other documents spurred by the development of the Web have cropped up. How are they going to be affected by migration to a new environment?
  • It's so darned expensive. It's a little gauche to talk about, but the Y2K bug caused what ended up being a huge migration of digital information. How much did the US alone spend on that fiasco? $8 billion? For smaller organizations that do not prepare for the preservation of their digital information, the cost of emergency migrations could cause all sorts of budget trouble.

There is some belief that there is no reason to preserve information at all. Most of what is created is just tripe anyway, and we should be more focused on creating content than preserving it. There are two reasons why some sort of preservation is important. First of all, it is inefficient to recreate information that already exists. Human energy is better spent on building upon existing knowledge to create new wisdom. How much do we already spin our wheels as several people collect the same data? What more could we be doing if we spent the energy instead on new pursuits? Secondly, there is some data that is irreplaceable.

Which is not to say that we should keep everything. In a traditional archive, only 1% of documents received are kept. Ninety-nine out of one hundred documents are destroyed for various reasons. A similar ratio is not unreasonable for digital documents. Consider that 16 billion email messages are sent each day. It seems ridiculous to keep all of them, but how do we weed out the ones we do want to keep? Appraisal of digital documents for archival purposes is going to become a major issue in the not-too-distant future. There are already examples of data that have been lost, or nearly lost. NASA lost a ton of data off of decayed tapes. The U.S. Census nearly lost the majority of the data from the 1960 census. These huge datasets are important for establishing a scientific record that reveals longitudinal effects.

Increasingly, the record of the human experience is kept in a digital format. The act of preserving that information is the act of creating the future's past, the literal reshaping of our world in the eyes of the future. Nobody knows the best answer yet. There is probably not a single answer that will fit absolutely all situations. Information professionals are just beginning to do research in the form of user testing, cost-benefit analysis and modeling to answer some of the thornier issues raised by the preservation of digital information. There are things out there worth saving; we just need to figure out the best way to do it.

Some links of interest in case you would like to read more:

  • a really good bibliography of related sources by Michael Day
  • an article by Jeffrey Rothenberg outlining some of the issues
  • a site at Leeds University with many related links
This discussion has been archived. No new comments can be posted.


Comments Filter:
  • Most content could be published in book format. I know books are so ... old-wave, but they work pretty well.


    Bad Mojo
  • I should think that a 99% destruction rate is awful! Kind of defeats the purpose of an "archive" doesn't it?

    With digital documents, there's no real reason not to save all of it, even if much of it is "tripe".

    Information is information, whether or not we find it useful. Some day, someone else might find our tripe is a goldmine of information, if only for anthropological study.

    Sakhmet.
    (The REAL McCoy)


    "The surest way to corrupt a youth is to instruct him to hold in higher esteem those who think alike than those who think differently."

  • by BrentRJones ( 68067 ) <slashdotme@NosPam.brentjones.org> on Thursday February 24, 2000 @07:00AM (#1248687) Homepage Journal
    I'm keeping all those old AOL CD-ROMs. Some software archaeologist will need them to see what Internet pioneers struggled with.
  • I don't think there's too much trouble with losing games and other applications as the hardware that runs them obsolesces... New ones will be created, and the best of the old will be ported.

    As to the data already archived on various media, there could indeed be a problem if people fail to move the data to newer media... Think of your pile of 5 1/4" disks that's just rotting in the corner because your new computer only has a 3 1/2" drive -- and that's not even a huge leap in technology.

    There's also the question of formats, especially for users of M$. After two revisions of the software, it can't read any of the old data! Try reading a Word 6 document in Word 97 for laughs, especially if you use any special characters ü á € in your documents...

  • by meighan ( 151487 ) on Thursday February 24, 2000 @07:04AM (#1248689) Homepage Journal
    If interested, there is a report from '96 which offers some more information on the subject. From the Task Force on Archiving of Digital Information, here [rlg.org]

    --

  • by Nafta ( 42011 ) on Thursday February 24, 2000 @07:05AM (#1248690) Homepage
    BBC currently has an article [bbc.co.uk] on the same subject. This is a great advantage of Open Source (preaching to the converted, I know), because an open standard is the only durable format. All other proprietary formats will come and go with the companies that make them.
  • by Signal 11 ( 7608 ) on Thursday February 24, 2000 @07:06AM (#1248691)
    Unfortunately the solutions we've employed up until recently are fatally flawed - they all use magnetic storage. The problem is that the higher the density, the sooner "bit rot" occurs - those magnetized iron oxide particles work against each other to depolarize. After several years (or several dozen, depending on the media) the data's unsalvageable. That's problem #1.

    The solution would be to use an optical storage medium, but as others have pointed out, CDR storage has a life expectancy of 75-100 years depending on the brand. Which wouldn't be too bad except you have to realize that in 100 years you need to start putting resources into copying all that data off and re-writing it again. After a while you'll have a snowball effect where you spend more time writing the old data than the new!

    What we really need is a piece of technology that doesn't age - an entirely self-contained computer (nuclear powered, maybe?) that has the media, the reading/writing mechanisms and has several failsafe mechanisms to alert you well before any data is lost. Think of it as a computer time capsule - you bury it and in 500 years come back and it has all the human interface necessary to reproduce the data in a usable format. Of course, you'll still need someone who reads English then..

    agh, the problems, the problems....

  • A perfectly readable copy of a digital document is useless if there is no software program available to translate it into human-readable form.

    Is there an example of a computer system that doesn't exist anymore, and can't be emulated at a much greater speed than the original using existing software? Even most arcade machines can be emulated these days.
  • If something is of value and needs to be preserved, it will be preserved somehow (book, updating to a new software or whatever).

    If a piece of information has not been preserved and is now inaccessible, it probably means that it was of minimal value anyway.

    That's probably not the greatest way to look at this but I'm thinking that half of all the info that's presently out there is useless anyway and is just taking up space for nothing. Maybe it's a good thing that these will be lost with time. It's kind of like a good spring cleaning.


    *******************************
    This is where I should write something
    intelligent or funny but since I'm
  • Unfortunately it's impossible and worthless to preserve everything. Data grows exponentially... I used to back up to paper and audio tape (300bps), moving slowly to floppy disk (360K), to higher density floppies (1.44MB), to hard disks (40MB Seagates), to tape again (250MB HP Colorado digital), to CDRs (650MB Sony). I'm probably going to go DVD next. The point is that the number of audio tapes I used to have is now the same as the number of CDROMs I have... even though the data storage capacity is exponentially higher, the data GROWS exponentially to fill available capacity. And I throw out most of the data I create... and it still GROWS.
  • Some day, someone else might find our tripe is a goldmine of information, if only for anthropological study.

    But just think, in hundreds of years' time someone might come across Microsoft's marketing literature and think it's actually true!!

    DOWN WITH ARCHIVING!!!
  • by AjR ( 148833 ) on Thursday February 24, 2000 @07:13AM (#1248697) Homepage
    balancing the endless churning of the web against the need for a stable archive.

    Unless we take steps to archive, transcribe and preserve all this information (yes, grits, petrification et al) then we are in effect building a new Library of Alexandria.

    It would be the greatest loss ever for archaeologists of the future to be unable to access archives of the WWW. Every day is a unique snapshot of the world as the endless churning of webpage updates/dead link removals changes the WorldWideWeb.

    This information Ocean is something unique. Archiving such a huge store of information generates a challenge in itself.

    I don't often wax lyrical about the internet but it is in effect becoming a snapshot of our civilisation.

    What a loss for future generations if they cannot see the views of ordinary human beings (through the endless websites) preserved.
  • by dillon_rinker ( 17944 ) on Thursday February 24, 2000 @07:16AM (#1248698) Homepage
    IIRC, Orson Scott Card addressed this issue in a story set in Isaac Asimov's universe. The library on Trantor had indices going back thousands of years, but the contents of the library had never been refreshed. The librarians knew exactly what they had lost.
  • by inquis ( 143542 ) on Thursday February 24, 2000 @07:18AM (#1248700)

    When you look back at history, and you look back at documents that are a "mere" thousand years old, the wealth of information in these documents makes you wonder what could be found if all the documents from that time had survived. Just because the format is digital, rather than analog or (eek!) paper, does not mean that this media is impervious to decay.

    However, I think that decay is much, much more serious in digital media. The root of the problem is that if you are looking at a physical document with water damage, even though the original "packets" of information (letters and words) are damaged, the human brain can sometimes extract meaning from smearing ink and crumbling paper. When an electron wanders on magnetic media or when a CD begins to decompose, that bit is lost forever. Digital media are much more susceptible to lapsing into unintelligibility than physical media like paper.

    Preservation in a medium that will not become obsolete is the key. As mundane as it may sound, plain ASCII text will probably never become obsolete because there is no real reason to come up with a new standard. Some people may scream at me: "*ML! *ML!", but at the rate that these things obsolesce, plain text will still be around when XSGHTML has been long dead.

    Just a thought. If you have something to add, feel free to respond.

    Brandon Nuttall, the inquisitor of Reinke

  • by dsplat ( 73054 ) on Thursday February 24, 2000 @07:21AM (#1248701)
    This is something that is going to be more of a concern for those of us who conduct a significant portion of our lives online already. Ask yourselves, have you ever had a moment of unusual brilliance in which you posted something to Slashdot or Usenet which was truly worth saving? Can you find it now?

    Personally, I encountered the issue of software obsolescence well over a decade ago. I migrated my resume to TeX because it had already been through four other formats and I no longer had access to the tools to read them. I picked TeX because I firmly believed that a tool that I had the source for was likely to continue to be useful to me for a longer period. And the source for the document is ASCII text, which I was able to convert to HTML a couple of years ago with little trouble. I will not rely on the future availability of any tool that I have no control over.

    This is one of the reasons that The Unix Philosophy, a fine book, recommends text formats for data. You can manipulate it with a wide variety of tools including text editors. It is unlikely that we will abandon those completely in our lifetimes. It also suggests, if memory serves, keeping notes online in text form. They are more portable and more accessible that way.

    One worthwhile source of literature preserved as plain text files is Project Gutenberg [promo.net]. It is probably also the oldest such project around. It is to text in some senses what Free Software is to code. Although they aren't doing collaborative authoring projects, they are collaborating on getting old books whose copyrights have expired into electronic form. If you haven't ever visited their site, take a look.
  • This is an excellent summary of the technical challenges to digital media preservation.

    But the technical issues are insignificant compared to the legal concerns - copyrights, patents, etc.

    Sure, most of these forms of copy limitation do expire, but until a large amount of "digital literature" becomes public domain, nobody's even going to *try* developing a preservation system, for fear of lawsuit by irate copyright-holders.

    My university's library collection totals nearly seven million books. Yet extracting information from this huge paper collection has been an incredible hassle... I would be willing to pay a significant annual fee if I could access every page in the library via a Web interface. I leave the juicy technical details to the reader's imagination. (I bet a few people with hand-held scanners and rudimentary OCR could digitize the entire library in a reasonable amount of time).

    But guess what - this is never going to happen in my lifetime.

    These seven million volumes of knowledge are never going to be preserved, because no library director in his/her right mind would risk slipping up and getting sued for violating a long-lasting copyright.
  • For me, the issue is knowing what it is I have, since knowing what to keep is dependent on this. While asset tracking is "biz as usual" for archivists, I'm not one of them. How do I keep track of whatever I have, over the last dozen years, and a ton of different machines, ranging from a MicroVax II to multi-processor SGI graphics boxen? And how do I track this info in a way that doesn't consume huge gobs of time and thought?

    This assumes that some information SHOULD be thrown away. I'm not interested in becoming a pack rat, I already have enough "stuff" to keep track of, thanks. I suppose I'm just not all that interested in making my information, no matter how trivial, available to future archaeologists.

  • by fingal ( 49160 ) on Thursday February 24, 2000 @07:29AM (#1248706) Homepage
    The problem of transient information storage is not just connected with reading old digital data stores, but also with analogue information, typically audio recordings.

    In this case, the main problem is not bit-rot (although this will occur sooner or later) but rather the problems that arise when the information is not recalled for an extended period of time. For example:

    • Reels of tape start to imprint signals onto adjacent layers of tape, causing loud passages to have ghost versions either before or after them.
    • Tape actually becoming stuck to itself due to bad binder materials, leading to baking [audio-restoration.com] of the tape as a desperate restorative measure.
    These and other issues are discussed on www.audio-restoration.com [audio-restoration.com]. Does anybody know if there are similar problems associated with digital media (the cross-talk problem will be virtually negligible due to noise-floor issues being irrelevant)? If so then it makes archiving a much more difficult thing if you have to physically do something to the archives every couple of years (especially with the exponential growth rate of information generation).
  • It just struck me that data storage lifetimes seem to be inversely proportional to the level of technology around in the age.

    Books can have a life of hundreds, if not thousands, of years if treated right. Even with abuse they will survive for years.

    There is a problem of obsolescence of language, although usually there is a Rosetta Stone equivalent.

    With modern media, technology is progressing so fast that it is almost throwaway. At my previous company we had good backups, but we had no way of accessing them: before we went to DAT and then DLT, we didn't actually possess the devices needed to read the old tapes, or, before that, the disks.

    It could be argued that with the internet, archiving is going to be more dynamic and fluid, but where does this leave information, and especially information for future generations? It is all well and good moving from the printed page to the digital page, but in 2000 years' time will they be able to revive the contents of a hard disc, or will the information on the internet have evolved dynamically, leaving no snapshot behind? Or will they look through the books of our time???

    What will be our dead sea scrolls?

  • But you missed part of the argument:
    • It defeats the point of having a digital archive, which takes much less space than printed form, and is easier to use (searching, for example).

    • It doesn't work with other forms of media:
      • Music
      • Video
      • Any binary format: images (which could be printed, but that's even worse than printed text), DEM data, cat scans and MRI, etc.
    In fact, in my research into archiving (I work in the 3D department of a video production company), it turns out that the cheap CD's only last about 10 years. But we decided (which coincides with a lot of what this was saying) that any 10 year old data would be obsolete for our purposes, and CD's would be good enough.

    The project files in Alias|Wavefront, Maya, and Softimage formats, as well as other miscellaneous formats, are just not suited to printing. Even if they were, even if you printed out ASCII versions of Maya files, for example, imagine what you'd have to do to get it back in the computer to reuse the project!


    ----------

  • And since the copyright on data formats is the author's (or company's) life plus 100 years (gee, thanks Sonny Bono for extending this, I won't miss you), we can never hope to see any legal 3rd party readers for these files. If the IP owner decides to sit on an old format and not support it, we are officially hosed.
  • Sure, the data grows exponentially, but as you just pointed out, so does the storage media. At one point, holding on to all of the e-mail I ever received would have been a ludicrous concept. But now using CD-R or even just a big mirrored hard drive, I can keep a limitless archive. I think the bigger concern is not the limitation of the physical media's capacity to store everything, it is the ability to view that stuff a few years down the line. E-mail is all ASCII text so it isn't too difficult to deal with, but as the data becomes more robust and complex, the issues of obsolescence become more pronounced.

    ---

  • I would argue for the historically tested method of storing data: take a chisel and carve it into rock.

    The software obsolescence is not a big problem -- humans (we hope) are going to be around for some time and the brain wiring changes awfully slowly. Languages do get forgotten, but smart people are very good at understanding dead languages and will probably get only better. Readers are also not likely to be a problem: just like brain wiring, eyeball construction is quite stable and not going to be superseded by a better design any time soon.

    The media -- provided you pick a good hard rock like granite (avoid limestone and its derivatives like marble, they don't like acid rain) -- does not suffer from bit rot, completely ignores magnetic fields, is stable with regard to solar radiation, and is fairly resistant to pollutants.

    You are not limited to ASCII, and even have limited graphical capability. In fact, rock has a huge advantage over current digital media -- it's perfectly possible to create, view, and store 3D objects in rock. Just try that with your 21" monitor!

    Just in case you think I am being funny, there is a company which in exchange for a sum of money will take your text, etch it on metal plates (nickel, I believe), and store it in some cave. They are estimating >5,000 years MTBF. I still think a good slab of granite is better, though.

    Kaa
  • It seems to me that trying to generalize a way to archive information just isn't worth the effort. Information that people consider worthwhile will get copied because people want it, and will thereby be saved. Information that isn't worth much won't get copied and will thus be forgotten.

    The common response to this is that we may not know what is worthwhile, or that future ages may not take appropriate care. Lost Greek plays that would be worth millions now were overwritten by some monk's laundry list in a less enlightened age. We feel we must save our information from that fate. But that is an impossible task. Etch the information on steel disks and some future, more barbaric age may melt those disks down for swords.

    So forget about trying to save everything. Just work to save what you think is important. Yes, stuff will get lost, but that will happen anyway. You will never get perfection. More likely is that future generations will curse you for the stuff you thought too trivial for your archive project, while finding the information you did archive worthless.

  • This is actually a topic close to home for me. I've had to work both with archival data and legacy code for several years now. Recovering and transferring data from real-life systems isn't always trivial. A few years ago, I recovered some data from my advisor's old IBM workstation. It had a hard disk and several floppies full of EBCDIC data. It took me a few weeks of phone calls to networking support before I discovered the wonderful "dd" command on my own (no flames about taking two weeks, I'm an astronomer, not a CS major; a small sketch of the conversion follows at the end of this comment). Another example is when existing machines migrate to new operating systems. A big recent headache for me was getting Cray CTSS data migrated to UNICOS ASCII data. I think the key in instances like that is to make sure that old data standards are well known and easily translatable.

    Now, I'm dealing with legacy code, too. One solution of course is to write vanilla code in a common language, but who knows what language is going to be used in 25 years? C+++? Fortran 2020? And vanilla code isn't always optimal, when hardware vendors build cutesy hotrodding tricks into their architecture and compilers.

    Somebody just needs to build a giant computer version of babelfish for all languages ever. Starting with cave paintings. :)
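    For what it's worth, the EBCDIC-to-ASCII step above can be done in a few lines; here is a minimal sketch, assuming Python's standard cp037 (EBCDIC) codec and a hypothetical input file name. At the shell, dd's conv=ascii does much the same thing.

        # Minimal sketch: translate an EBCDIC (code page 037) file into plain text.
        # "census.ebc" is a hypothetical file name; adjust the code page to match the source system.
        with open("census.ebc", "rb") as f:
            raw = f.read()

        text = raw.decode("cp037")                    # cp037 is Python's built-in EBCDIC codec

        with open("census.txt", "w", encoding="utf-8") as out:
            out.write(text)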

  • Or if you prefer to avoid the Simpsons reference look at the link. It goes to http://www.hardcoresex.com

    (Then there's the fact that the BBC website can probably handle more traffic than Slashdot, so a mirror is pointless.)
  • My worry is that at some distant point in the future, all our paper will have rotted away and people think that the CDs were our primary means of communication, like Babylonian pottery shards..

    "February 24, 6423:Archeologists have discovered evidence that ancient humans worshipped a God called 'McDonald,' whose temples were signified by golden arches..."
  • Thanks for a great summary of the problems with keeping data viable. Maybe some /. reader will create a start-up to help schools and businesses deal with the problem, perhaps by creating the "museum" you alluded to. (I notice, for instance, that the domain www.datadecay.com is still available.)

    I have a question, however, about the other end of the data life-cycle: its birth. Certainly data disappears, but what is the best way to describe or define "data," broadly speaking? What is the best definition anybody here has ever heard for "information"? I'm having trouble finding a straight answer. Is data (information) a representation of something in the real world? Is it like a shadow of something else? We have seen how it can be created, we have seen how it can evolve, and we have seen how it can fade away and die, but what is the best definition of what it is?

    This is one of those philosophical questions that just nags at the mind. If anybody can suggest definitions (or resources), I'd be grateful.

    A. Keiper [mailto]
    The Center for the Study of Technology and Society [tecsoc.org]

  • ...as long as I don't run out of disk space. (Paraphrasing a comment I heard at a DC thinktank.)
    It was noted that storage requirements for geographic data (geologic, topographic, etc.) would require petabytes. Multiple petabytes. And a petabyte is 1000 terabytes (right?). And we're thinking 36GB hard drives and DVD-RAM drives have a lot of space...
    --
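    To put rough numbers on the petabyte figure above, here is a quick back-of-the-envelope sketch (mine, decimal units assumed) of what one petabyte means in terms of those 36GB drives:

        # How many 36 GB drives does one petabyte take? (decimal units assumed)
        PB = 1000 ** 5                     # bytes in a petabyte
        GB = 1000 ** 3                     # bytes in a gigabyte
        drive = 36 * GB                    # one of those 36 GB hard drives

        print(PB // GB)                    # 1,000,000 GB -- i.e. 1000 terabytes, so yes, right
        print(round(PB / drive))           # about 27,778 such drives per petabyte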
  • If a piece of information has not been preserved and is now inaccessible, it probably means that it was of minimal value anyway

    You didn't read the whole article, did you? Or perhaps the 1960 census resulted in information of "minimal value" that we didn't need lying around anyways? This is data that cannot be recreated, and is irrevocably lost.

  • There is a project that has started recently here at Stanford to investigate the possibility of using distributed web caches as a means of preserving information on the Web. The project is called "LOCKSS" (Lots Of Copies Keep Stuff Safe), and more information can be found at lockss.stanford.edu [stanford.edu].

    This project definitely does not address all the issues with digital-document preservation; it definitely does _not_ solve the document-format problem. Its goal is to make digital publishing "immutable" so that publishers cannot modify or withdraw their work after it is published.

    Disclaimer: I work for one of the groups which is participating with the LOCKSS project, but I'm not working directly with it.
  • by fingal ( 49160 ) on Thursday February 24, 2000 @07:49AM (#1248722) Homepage
    One point to remember when looking at problems with digital information storage media is that they are not really intended to be archived. What they are (mostly) designed to do is read and write the information very fast at a high frequency with a high degree of accuracy. Most of them are quite good at this and the issue of bit-rot tends to go away if you are continually reading and writing your hard-disk.

    CD's and related optical media do have problems with sunlight, but you have to remember that they were created (AFAIK) by the audio industry, which is one of the most notoriously fickle industries in the world: they want you to buy a new CD from a new group every week, not have a single CD that is perfect and lasts forever. I think that the concept of people being able to listen to their CD's for 10 years is already far too long for them.

    The problem is that there doesn't really appear to be anyone making storage media optimised for long-term persistent storage. But do you think that such a format would be the way forward? Each year, we generate an exponentially larger amount of information. All the hard disks on the planet now would not be enough to store the new information that will be generated in the next 5 (wild guess) years. Therefore we are going to need progressively larger and more efficient forms of data storage as the information bloat gets larger. As new formats come out, the important thing is to look at the movement of legacy data onto the new formats. If data is not treated as a static thing to be boxed up and forgotten, but rather as part of the ongoing current set of information, and transferred onto new technologies as they are developed, then you will not have the situation where people are looking at a hard disk in 50 years' time and going "what's an IDE interface?".

    Of course, then you have the 'minor' issue of application file formats...

  • by Anonymous Coward on Thursday February 24, 2000 @07:50AM (#1248723)
    What is the lifespan of data stored on a CDR? An old 550MB (63min) CDR? A 650MB (74min) CDR? Green CDRs? Gold CDRs? Blue CDRs? A CDRW? etc.? Under non-ideal conditions?

    I put some CDRs out in the direct sun here in the Las Vegas desert over the last summer. Blue, gold, green, pale green, and an RW. Both sides of the CDs had their chance to roast in the 100F+ (40C+) sun for several months each. And here's the results of attempting to read the data back on each type:

    Old TDK green CDR: dead, nothing readable. Faded to a mostly clear plastic disc!
    Ricoh gold/gold CDR: dead, nothing readable. The golds faded visibly first of them all. Area where data was stored faded to clear!
    Verbatim (blue): I was stunned. I read back a full and complete iso image of Red Hat 4.2. No fading at all.
    Memorex silver/green CDR: mostly dead, some files readable. Faded in a few isolated patchy blotches.
    The CDRW... just started this test. No results yet. Looks OK, though.

    Overall, I'd say the blue CDRs are the best choice for long term data storage.

  • At the moment, Moore's law is the only thing that stops this problem becoming really acute. Although I keep all my email, and the total size of the archive grows almost exponentially, so does the size of my hard disk, and the speed at which I can run grep over it.

    Handling terabyte databases now needs leading-edge hardware and state-of-the-art software specially optimised for the data format. In 20 years, however, we will just be able to haul the terabyte database into emacs, and hack up some macros to reformat it and search it.

    If Moore's law ever tops out, then we are in trouble!
  • Assuming media continues to get bigger, the snowball effect is mitigated significantly. If, in 10 years, I take the mp3's (on CDROM) from my entire music collection and move it to some new super-high-density media, it will probably fit on a single disk. Thus, next time around, copying all the older stuff requires me to copy only one disk. Every 10-20 years, when I have to re-archive everything, I will have only a TINY fraction of data from the previous cycle, because it will be so small compared to the new data.
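    A rough sketch of why the re-archiving burden shrinks rather than snowballs, assuming (purely for illustration) that accumulated data grows 10x per decade while the capacity of a single disc grows 100x:

        # Toy model: fraction of one current-generation disc needed to hold the ENTIRE archive.
        # Assumptions are illustrative only: data 10x per decade, disc capacity 100x per decade.
        data, disc = 1.0, 1.0                        # start: the archive exactly fills one disc
        for decade in range(1, 5):
            data *= 10                               # all data accumulated so far
            disc *= 100                              # capacity of one new disc
            print(decade, data / disc)               # 0.1, 0.01, 0.001, 0.0001 -- the old data becomes a rounding error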
  • There is also a page [missouri.edu] at the University of Missouri that talks about media lifespans:

    "Computer reel tapes, VCR tapes, and audio tapes last about as long as a Chevy or a poodle."

  • If you want to preserve the contents of the web for future generations (research, entertainment, whatever) then a huge, high power antenna should just broadcast non-stop internet.

    This would serve two purposes:
    1) Extra-terrestrial beings (assuming they have the technology and could decode it) could have a window into life on earth.
    2) Whenever mankind figures out how to make wormholes or travel faster than light they could simply warp out to whenever they want info from and recover that day's web broadcast.

    Altogether, not a bad idea, huh?
  • As mundane as it may sound, plain ASCII text will probably never become obsolete because there is no real reason to come up with a new standard.

    Someone who speaks a language that doesn't use the basic roman character set may beg to differ. There are very real reasons to consider moving to something like Unicode.

  • From "Old Computers Lose History Record [bbc.co.uk]" (BBC, 23 Feb 00)

    Irony of ironies: Data records on floppy disks relating to an archaeological dig decayed by 5 percent in under a decade - after everything had survived the journey from the Bronze Age intact.

    A. Keiper [mailto]
    The Center for the Study of Technology and Society [tecsoc.org]

  • I agree that the problem of preservation isn't exclusive to digital media, but one of the big differences is that analog media tend to degrade MUCH better than digital media. True, old records get scratchy or warp, and tapes can have their oxide coating flake off, but at least there's some data available (i.e. you can still listen to the recording through the clicks or dropouts). With digital media, it's often an all-or-nothing affair. Either it's in perfect shape or it's gone. Of course this isn't always the case (sometimes you can extract some digital data from a damaged source) but it's much more difficult than with analog media.
  • by Anonymous Coward
    It's very hard to cheat the laws of thermodynamics. Things tend towards entropy. The closest the universe has come to escaping it is life (on a long term scale). Even that information mutates and is corrupted over time. I'm not sure we can save a perfect bit copy of everything, but we can carry on the legacy in some form. - Darwin
  • I'm not nearly as worried by media decay as I am about content just disappearing altogether. The internet saves us from media decay-- if I keep my files on a network-capable machine, then transfer to the next generation machine is easy. Every time I get a new PC, I plug it into the hub, and let the file copying begin! On the other hand, "disappearing info" on the web may result in all sorts of archival losses! Magazines and Newspapers are archived and kept in libraries for years. What about news web sites? I'm sure most large sites keep their own archives, but will anyone ever have access to this data again? Once it is replaced by newer info on a site, is it gone forever? I'm afraid that the popularity of the web may result in the loss of good data archives in libraries for the future.
  • Regardless of how it's stored, eventually the data itself becomes meaningless. I read an article that made this point last year. Ever try to read Chaucer in the original English? Same language, more or less, but over several hundred years it has become unintelligible to all but a handful of people. The way language is changing today, it could take even less time for all these articles on Slashdot to become gibberish. So with a perfect medium, who would we be preserving things for? A handful of scholars, ignored by everyone? No one at all?
  • As far as I know, there is no copyright protection for file formats. You can copyright a document that describes a file format, but not the file format itself.
  • I had an apple//. There are emulators. But you need an apple// to read the floppies. As a matter of fact, they are still readable (I tried recently). But when my old apple finally dies, nothing will read the disks. If they have not been made into disk images suitable for the emulator, content will be lost.
  • Not all information needs to be archived. Most of the e-mail I receive can go in the bit bucket for all I care. The rest, I archive. As for the information that can/should be archived, the author's statements to the contrary, industry standards can be used to archive what should be archived.

    Given a format that is a) adequately documented, b) accurately represents the data it encompasses, and c) has sufficient widespread adoption, we can simply archive to that format as we need to.

    Let's consider various and sundry data types, the prominent format for handling them, and the potential longevity of those formats.

    Text: For raw text of course you have ASCII. While not a permanent fixture, nobody can argue with its longevity. We'll call this the baseline. Moving up from ASCII you need some way of defining formatting and such. There are really only a couple of realistic solutions: some SGML-based system, HTML, or PDF. I'll get into the latter two cases a little further down. Let's say that for plain text, SGML has the best longevity because of widespread adoption, and simplicity.

    Rich Text (beyond simple formatting): As above, we need something better than ASCII. I'll vote for PDF here. It's a proprietary format, but it seems to be pretty well understood, and it does an accurate job of representing the original document. Mac OS X groks it very well, and Adobe has ensured that there's a viewer for every platform. If conversion tools can be made, then this is a good format.

    Images (bitmap): PNG, JPG, GIF, and TIFF. TIFF seems to be less relevant these days although most scanner software still produces it. JPG/GIF are where the majority of data presently exists, and PNG is where everything should be archived, IMHO... PNG being lossless, and supporting about every feature known to man, this seems to be the best solution. One could crawl the web, grabbing every single GIF or JPG, archive it to PNG format with no loss of data and quickly build a significant archive. (A minimal conversion sketch follows at the end of this comment.)

    Image (vector): Sorry, don't know much about the formats used here...

    Audio: The obvious solution for archival is uncompressed, raw audio in a well understood format like WAV. This is an area that doesn't seem to be changing much...

    Video: Again, I can't really comment on the formats here...

    Things become more complicated when you have interactive media, or other very specialized forms of data... But I'd rather save that for the experts...

    The author brings up the "loss of fidelity" issue when updating documents to a new format. I think this really only is an issue when making a lateral move. Converting from JPG to PNG wouldn't be a problem, nor GIF to PNG. Converting from WordPerfect to Word on the other hand, is problematic at best...

    Thus the need for archival formats with some longevity. Perhaps a commission should be formed on data archival formats? A group of OSS developers who do nothing but strictly define what format(s) are to be used for "data archival" purposes, and ensure that tools to read/write these formats are readily available on every platform -- including new ones as they come out.

    The trick is to avoid lateral conversions at all costs.
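    Here is a minimal sketch of the GIF/JPG-to-PNG re-archiving step mentioned above, assuming the Python Imaging Library (Pillow) and hypothetical file names. PNG's own compression is lossless, so nothing further is thrown away (the original lossy JPEG step obviously can't be undone).

        # Re-save existing bitmap images as PNG for archival.
        from PIL import Image

        for name in ["photo.jpg", "diagram.gif"]:          # hypothetical input files
            img = Image.open(name)
            img.save(name.rsplit(".", 1)[0] + ".png")      # e.g. photo.jpg -> photo.png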
  • Hmmm. Yes and no. You can infer stuff from damaged audio recordings, but think about what is required to cause errors in a digital recording: an error happens when a 1 or a 0 is read as its opposite value. In order for this to happen (using conventional readers) you are going to have to have a background noise floor of somewhere around 45% of your headroom (assuming that 1 is written as 100% 'on'). Even at this level you will get a reasonable chance of reading the data correctly. However, if you listen to an audio recording with 45% background noise added, it is going to be virtually impossible to make out the original sound clearly.

    Yes, when digital breaks, it definitely breaks (although checksums and duplication of data can reduce the chances of losing data), but the level to which you can push the degradation of a digital device before it breaks is really quite high.
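    A quick simulation of that point, assuming Python: write random bits, read them back through heavy additive noise with a simple mid-point threshold, and count the errors.

        # Toy model: does noise at 40% of the headroom ever flip a bit read with a 0.5 threshold?
        import random

        errors, trials = 0, 100000
        for _ in range(trials):
            bit = random.randint(0, 1)                   # the value actually written
            level = bit + random.uniform(-0.4, 0.4)      # signal plus uniform noise at 40% of headroom
            if (level > 0.5) != bool(bit):               # simple threshold detector
                errors += 1

        print(errors / trials)                           # 0.0 -- uniform noise this size never crosses the threshold

    The same 40% of noise laid over an audio recording would be painfully audible, which is the asymmetry described above.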

  • I keep what I loosely term a knowledge base; every bit of useful tech data I run across, or have reason to believe I will need again, gets stuffed into a designated folder on my HD and later archived. I have stuff going back to Phrack 4, WordStar copies of C128 documentation, programs I wrote fifteen years ago for a hardware platform that no longer exists, System 3000 performance data, etc. While at the time I put each of them in I had access to the machinery and software to read and run them, much of it is dead now. Now I take the extra step to make sure anything new will be readable in the future. If it requires a viewer, an emulator, etc., they are saved with it. When the day comes that the ia32 everything runs on, and the CDs the data is held on, are deprecated and forgotten, they will be replaced by DVD-ROM and an ia32 emulator before obsolescence becomes an insurmountable hurdle.

    We must actively, and over the course of time, make sure what we do is available for posterity. Next time you burn MP3's to a CD-R, burn a copy of the mpg123 source too. Thirty years down the road, the information will be usable to anyone with the ability to read C and a DVD-ROM, even if MP3 is a forgotten format. When CDROM becomes hard to find, copy it to new media. I started on an Atari, and have managed to propagate that data through audio tape, floppy disc, magnetic tape and CD-R with little effort. Preservation shouldn't be an afterthought. Just do it!
  • I agree, information sizes are growing at an incredible rate, although this isn't entirely because of content; I would say it's mainly through format.

    Archiving should be all about finding efficient ways of storing information in retrievable ways for as long as possible.

    However, archival seems to have become all about storing all the information (take the British Library or the Library of Congress, which can't really keep up with publishing in terms of space and resources...)

    Maybe the answer is something like Gutenberg, where plain text is used along with a fluid medium, albeit one which seems to be stable (i.e. multiple mirrored servers with backup devices).

    I think there needs to be a shift in focus from the sheer need to store to the methods of storing and the reasons for storing.

    I think the internet is an interesting snapshot of our time, but I think its transience and fluidity are the things that make it what it is, and the things that make archival a difficult process...

    Hmm, more thought needed... I would plug my employer now, as information management and storage is our thing, but that would be crass... (unless anyone in the field of information storage wants a scholarship, in which case mail me for details.)

  • Agreed. I think the original author missed some points, which a lot of people do, by not looking at storage in a historical perspective. If you take the long view, analog storage on paper is a very reliable and proven technology for preserving data for the long haul.

    We've recovered data from thousands of years ago on crumbled bits of paper that are still quite legible despite the decay, and that paper was a new technology for some civilizations then.

    [Of course, a better argument can be made for simply using clay tablets and inscriptions in stone. We've recovered carvings MUCH older than anything that's been found on paper. But you have to draw the line for convenience somewhere. ]

    Any modern technology you're relying on is bound to be inadequate. Think about this: every technology for information storage invented in the last 200 years has failed for long-term use. A thousand years from now they'll look back on 1880-2000 as a series of dark ages. The only thing that will remain are the paper records. Even paper that's badly treated remains legible for long periods. There are archaeological surveys going on now in garbage dumps for large cities (NY, for example) that are finding well-preserved newspapers from half a century ago. Newspaper is not a good paper, and newspaper ink is a poor ink. This says a lot for the staying power of a good technology.

    Film deteriorates, magnetic media loses bits and the substrates crack and crumble, records lose crispness, wax and foil canisters wear out. Take magnetic media for example: it was thought that with careful storage and infrequent use this stuff would last a long time. As it turns out, magnetic tape barely lasts 15 years under the best of conditions. We simply don't have enough experience with these technologies to know if they'll work.

    Just because your 10 year old CD's can still be played today, doesn't mean they'll work in 2025. As a technological culture, we don't have enough experience with the materials to know. Old transparent plastics grow cloudy eventually--optical storage will probably not be your saviour in the 21st century either.

    Not until we've had a century (or two, or three) to observe technologies like CD-ROM will we know how they'll work for long-term storage. Until then, don't bet the farm.

  • I can't help but wonder what future humans will think of our efforts to preserve information. Will they even have the records that show that we tried to preserve anything? Will they believe the records we leave behind are factual? How much of the fiction that is floating around will be mistaken for fact? How much of the information we currently have will survive only in fragments yanked out of context?

    This leads me to wonder how much context information we need to bequeath to our descendants in order for them to be able to understand the information we leave behind. Consider how much information we have from ancient times which we do not truly understand because we do not have enough contextual information to really understand what was meant by it. Look at how many conflicting translations there are of many of the documents that do still exist.

    Even if we manage to prevent the degradation of the media on which the information is stored and the devices and software necessary to read the information are preserved, what of language shifts and culture gaps across time? We will still have the problem of information being lost as meanings of words change with time or as information is translated from one language to another. This is, in fact, exactly the same problem we face with the various software revisions for products like MS Word.

    This is not to say, however, that we shouldn't make a significant effort to preserve information. I would also think that having a significant amount of contextual information (which should come along for the ride while preserving information) should help our descendants comprehend the information we leave behind. However, if our current track record for preserving contextual information is maintained, the outlook is not good for our descendants understanding our information in two or three centuries (assuming the information survives).

    Well, that's my 93.2 cents worth on the subject.


  • I doubt it (although I did get a chuckle out of the McDonalds bit).

    Assuming the future archaeologists uncover/engineer a way of reading our digital formats (and that assumes, of course, our digital formats - like CDs - exist in any number in several thousand years), they'll easily uncover evidence of how we communicated. Think about it - how many references are there to "printing," paper, books, television, movies, etc., in common use today? In my email archives, I probably have hundreds of references to printing things out or watching TV.

    Further, there will inevitably be documents left around. Look at such things as the Dead Sea Scrolls, which survived two thousand years. If anything, they'll simply have a misunderstood idea of what we committed to paper (since "important" historical documents like the Constitution were written, and everyday crap like the specs on my desk will no doubt be destroyed, they may simply consider that paper was reserved only for important things).

    Just an observation.
  • I have several times had discussions with people who wondered why TeX continues to exist when PostScript is more universally available and gives a sufficiently good resolution for practically any purpose. These people generally fail to appreciate the following:
    1. PostScript is not guaranteed to give the same output from printer to printer, let alone over a period of decades.
    2. PostScript cannot easily be altered for different output formats. (eg The author and a publisher may wish to use different sizes of paper, or a different bibliography style.)
    3. Extracting content from PostScript is a very non-trivial process. TeX is simple ASCII.
    4. PostScript is insecure. PostScript is a full programming language, and the equivalent of the root password is rarely changed, and the default value is generally known. (IIRC it is "000000". I may have the number of 0's wrong.)
    5. Try editing a PostScript document (say to insert a correction). I dare you. :-)


    So if you need to store formatted documents for archival purposes in a system where you may later need to output the documents in a different form, you should look at TeX...

    Cheers,
    Ben
  • by dierdorf ( 37660 ) on Thursday February 24, 2000 @08:25AM (#1248749) Homepage
    Gee, I just happen to have a bunch of steel-ribbon tapes from a Univac I. Maybe that information is vital to civilization as we know it. Do you really think those tapes can be read today without tremendous expenditure of time and effort?

    BTW, I think the original author missed one future problem - encrypted information. I foresee hardware-based encryption becoming almost ubiquitous so that most data is encrypted. If encryption becomes universal, then much info will be encrypted that really wasn't burn-before-reading secret. What happens to all that information - of potential interest to historians looking back on the 21st century - under those conditions?

  • A couple of days ago the BBC reported in an article entitled Old computers lose history record [bbc.co.uk] how archaeological records are being lost due to exactly the issues raised in this story. The story reports that "[ironically, the] archaeological information held in magnetic format is decaying faster than it ever did in the ground".

    So, it looks like we're going to have to start transferring all those old ZX81 game tapes (Timex Sinclair 1000 for our U.S. cousins) to CD-ROM then. That should be good for another 25 years of '3D Monster Maze' :-)

    --
  • Actually, this is not quite accurate. Languages tend to change most in isolation. That's why in places like central Africa and Papua New Guinea you have hundreds of very different languages within very small areas. With the rise of electronic communication, and especially the web, it seems likely that the rate of change in English will be a lot slower, since it's used so widely.
  • What you are calling "text format" some of us choose to call 7-bit ASCII. And 7-bit ASCII wastes approximately 1/8 of the storage channel with the redundant eighth bit that's always zero.


    Okay, you are right about that. I used ASCII as my example for three reasons. First, Slashdot is in English. Second, many if not most of the common character sets today are supersets of ASCII for compatibility. Finally, the primary but not sole input character set for TeX, which I mentioned, is ASCII.

    As for wasted space, the amount of redundant information in every written language that I am aware of is very high. The actual information content of a single character is only a bit or two in context. That can be demonstrated with any good compression program. So, I would suggest that for saving space, either we all need to abandon our human languages for one with no redundancy (not a likely proposition) or compress everything we want to save and document the compression algorithm in uncompressed files, preferably with source code.
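
    As a rough illustration of that suggestion (a sketch only, using Python's zlib, which implements the publicly documented DEFLATE algorithm of RFC 1951; the filename is hypothetical):

        import zlib

        # Ordinary English text is highly redundant, so a general-purpose
        # compressor shrinks it substantially. A plain-text copy of the
        # DEFLATE spec could accompany the archive, per the suggestion above.
        sample = open('archive_notes.txt', 'rb').read()   # any plain ASCII text file
        packed = zlib.compress(sample, level=9)

        print(f"original:   {len(sample)} bytes")
        print(f"compressed: {len(packed)} bytes "
              f"({8 * len(packed) / max(len(sample), 1):.2f} bits per character)")

        # round-trip check: compression for archiving is only safe if lossless
        assert zlib.decompress(packed) == sample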

    I'm sorry, but "The Unix Philosophy" all boils down to trying to force all information metaphors to ultimately equate to an old crufty teletype.

    Force all information to flow through a 'tty' and you've already filtered out most of the digital content people use in the present time.


    I disagree with your premise, although your conclusion would follow from it. The idea is to have human-readable streams of data that can be treated as if they are simply being sent to a tty. HTML is an excellent example. With a browser, it is enormously powerful and useful. Yet at the core, it is a sequence of characters that I can type and read. I can edit it without special tools. Admittedly, those tools can make achieving just the look I want easier. They can speed my writing and make the results more reliable, but they aren't necessary.
  • Data preservation is not a new problem, it's one that traditional librarians and archivists have been dealing with for the entire 100 years of modern librarianship, and certainly for much longer than that in less academic ways. Can you say acidic paper? How about the restorations of the Mona Lisa and the ceiling of the Sistine Chapel?

    It's not at all surprising, to me at least, that this paper was written by somebody at what was once the UMich school of library science, until they discovered that they could pump up their prestige and funding by going dot-edu.

    - David
  • by jabber ( 13196 ) on Thursday February 24, 2000 @08:36AM (#1248756) Homepage
    There is some belief that there is no reason to preserve information at all. Most of what is created is just tripe anyway, and we should be more focused on creating content than preserving it. There are two reasons why some sort of preservation is important. First of all, it is inefficient to recreate information that already exists. [Point 1] Human energy is better spent on building upon existing knowledge to create new wisdom. How much do we already spin our wheels as several people collect the same data? What more could we be doing if we spent the energy instead on new pursuits? [Point 2] Secondly, there is some data that is irreplaceable.

    Point 1:
    With the amount of data that we produce, archiving it will take an increasing amount of time. How much new content is created daily? At best, we will plateau in a state where as much effort is required to archive content as is needed to create new content.

    With the emphasis placed squarely on non-duplication of effort, archiving becomes a secondary issue. Indexing, searching, sorting and categorizing of the archive becomes a first priority, since creative efforts should now check if they are redundant.

    If the bold statement is to be a guideline, then the idea of an archive is moot, since all new work depends on old work, and so tracks well with where the author feels human effort should go. Much like with biological evolution, new data is the fittest of the old data that was applicable to the new context. I suppose that the call for archives is little more than a suggestion that we need an organized and deliberate fossil record of how we got to where we will be at some point in the future.

    What is needed is an archive, yes, but an archive of what? Not of content, but of the essence of the content. The lessons learned, the conclusions drawn and the optimizations realized in the process of creating the content. The content is fleeting - though arguably of inherent value... Which brings us to...

    Point 2:
    Yes, some things are irreplaceable. Who decides? Who defines what is art, what is fact, and what deserves eternal life?

    Some things are of immediate and significant value, but for an unknown duration. The value of other things cannot be realized for a very long time, and so the alternative is to store everything. Furthermore, the value of certain data is totally subjective, which raises the question of "who's in charge" of defining that 1% that is to be kept.

    On the small scale, this will lead to vanity. Any 'artist' will consider their work a masterpiece, and save it. (I have code I wrote in CS101, don't you?) Companies will store and archive all email, all financials, anything that can potentially be used to mine data or identify trends or fertilize litigation. People will pigeon-hole videos of their baby's first steps, though nobody outside themselves really cares - unless the child grows up to be the next Einstein, or Hitler.

    "Hitler" raises an interesting question on the larger scale. Who has the responsibility of deciding what 'big' facts to store? And isn't that the path to propaganda, history-making, and such things?

    And then, when the leadership changes, and the 'book burning' starts...

    To bring the concept down from the paranoid-sphere, let's recall the /. article about Nikola Tesla. His work is not well known to most, because it was not made prominent, and subsequently, not well archived. We know of him, and we can dig for more about him, but the credit goes where it may not necessarily belong.

    Same issue with Newton and Leibniz. Leibniz was the German mathematician who published the concepts of calculus before Newton did. Newton, a member of the Royal Society (or something to that effect), politicised HIS influence, and so was credited with all of the work - even though his contribution was not the whole story.

    Some things are not outright lies, but oral histories get lost while written records persist.

    Who gets to choose what to write down?
  • Perhaps a "resilient disk" standard ought to be created, for stuff you would really like to last. Perhaps a WORM (write once read many) optical disk, like a CDR, but made to be very resilient, perhaps lasting up to a thousand years.

    Perhaps they could even be made to work with existing CDROM drives and perhaps even existing CD writers. Then you just start selling a new kind of disk. Anyone who wants something to last puts it on those. If they want lots of space per penny, they can buy something else.

    --
    grappler
  • I find it interesting to think about this from the perspective of the notion of memes. What has evolved from human consciousness is a rich ecosystem that generates and values an enormous diversity of information. Thinking about what will be preserved, and how, gives rise to an image of our several billion minds, aided by technology as simple (!) as spoken language or as complex as electro/magneto/optical storage, operating as a kind of primordial informatic soup.

    Out of this fecund brew maybe, just maybe, a carrier as successful as DNA will emerge, with the capability to preserve the "best" of the information. Maybe it already has, in plain old text, which will be decipherable for as long as the bits can be gotten at, and which then has the benefit of the redundancy of human languages for further decoding and understanding. Then we drop down to the question of how exactly the bits manage to survive, and it seems the only ultimate answer is some human has to care enough to refresh them. Or be clever enough to teach them to take care of themselves.

    It also seems clearly impossible that everything can be preserved, and also impossible that what is preserved will always be something to be proud of. Some extinctions, however tragic, are inevitable, and some, however richly deserved, never occur. It's part of the beauty (and maybe mercy) of conscious life that there are moments that will never appear again, can never be adequately captured for later replay. Being aware of that fact is what encourages us once in a while to put down the camcorder, shut off the microphones, maybe even try to still the stream of words in our heads, and just drink it in.
  • by mhkohne ( 3854 ) on Thursday February 24, 2000 @08:41AM (#1248759) Homepage
    Disclaimer: I know I'm being a bit paranoid, but I think this should be brought up, at least for purposes of discussion. There is probably less to worry about here than in other places, but it still should, I think, be in the back of the mind of anyone trying to solve this problem.

    One thing I believe was missed in the original article is intentional change to the historical record. In addition to having to store old information, and worry about how we're going to get to it later, I think we need to pay at least half a thought to intentional modification of the historical record.

    With paper and ink, it's rather time consuming and expensive to alter historical documents, even assuming you can get near them. With digital media, the situation may be different - it may become very simple to alter historical documents, especially if you're the guy who's in charge of copying them to the newest form of media.

    Aside from the obvious political reasons someone might want to do this (can you think of a fundamentalist movement of any sort that wouldn't modify old documents to read the way they would like, given the chance?), I can also see where money might come into play.

    For instance, suppose MassiveDrugCo, Inc. is introducing a new drug which prevents newly detected disease Y. Now, in order to sell a lot of this drug, you have to show that Y harms enough people to worry about. Unfortunately, the historical record being used for retrospective studies doesn't show that. So, instead of going back to the drawing board and finding something else to cure, MassiveDrugCo instead feeds a modified copy of the historical data to unsuspecting independent researchers. These honest and unbribable researchers draw the conclusion desired by MassiveDrugCo - in spite of the reality of the situation.

  • Hi all,

    There is also a very well-written, very accessible article on this topic, titled "Saved", available at Wired magazine's archive [wired.com]. It was written by Steven Gulie in 1998; I distinctly remember reading it, and it had a profound impact on my thinking about this topic.

    Take a look. -Paul

  • <offtopic>
    Granite probably isn't the best choice either. Over time the feldspar in the granite breaks down, and the rock falls apart. Pollution and water accelerate this process. Basalt would probably be a better choice, or pure quartz, or some corrosion-resistant metal like gold or platinum.

    As for me, I'm backing up my data by encoding ASCII text as a pattern of platinum-plated titanium pins hammered into a slab of good dense shale. After that, I'll drop the slabs into the Mississippi delta, and in a few million years, my wit and wisdom will become part of the rock strata. The MTBF should be about 100 million years, barring a major tectonic event.
    </offtopic>
  • by hey! ( 33014 ) on Thursday February 24, 2000 @08:47AM (#1248762) Homepage Journal
    I think you have to ask, what are you preserving information for?

    Are you trying to preserve episodes of the Simpsons so our relatively near term, technologically advanced descendants can watch them? Well, they're technologically more advanced and thus more clever than we; we just need to have sufficiently stable media (micromachined gold plates would work nicely) and either a simple-minded encoding scheme or an easily readable description of the algorithm prepended. In the 22nd century, some bright Norwegian 16 year old armed with a yottaflop computer will figure out how to read it if he cares enough.

    A bigger concern (in my opinion) is what happens when our civilization collapses. Historically, it is almost certain to happen sooner or later. Rome lasted well over a thousand years; if you told a 1st century CE Roman that there would ever be an end to the empire he'd think you were crazy. Yet our civilization is in many ways much more fragile because the information it is based on is in much more ephemeral form (both media and format).

    What we need is to devise a bootstrap procedure.

    (1)Reading primers in various languages.

    (2) Primers on basic technology: mathematics, simple mechanics, mining and elementary metallurgy.

    These should be in highly durable form, but the problem is that you don't want people making off with them for building materials. The problem with using gold plates is that you don't want people to have access to them until the information on them is more valuable than the substrate. Perhaps these first items could be carved onto stone pillars inconveniently large to move.

    Next, you need repositories on more advanced science and technology: chemical engineering, electronics and so forth. Perhaps you could rig a way to prevent savages from accessing these repositories; a mechanical puzzle perhaps, that requires a certain mathematical sophistication to solve. The most critical records could be kept in forms that could readily be read without mechanical assistance or with only simple mechanical assistance such as optical magnification (my local librarian likes microfilm, because she knows it will be readable for decades). Less critical things like old Simpsons episodes could be on very cryptic media that would require considerable technical finesse to read, but would be cheap to transfer to.

    Pretty much, as you go from the most basic and critical information to the least critical information, you go from the easiest to read and most expensive to produce per bit, to the hardest to read and most convenient to produce.

  • This may be ill-considered, but it seems to me that data's value diminishes with time far faster than its quality.

    Sure, poems and photos for the grandkids. That's a hundred years, tops, and migration, translation and CDR covers it, fairly easily. As far as showing pictures to people who will have only vaguely heard of me? Or preserving the IRS tax code for four thousand years? Somewhere I'm sure is codified the idea that data is useless without context. If not, there it is, Nyarly's First Thought on information theory. I'm sure it is though...

    But my noodlings with fiction, my code, my photos and graphics won't be any more useful without the cultural context they were created for than an arbitrary collection of 16 bits without a description. Is that a Float or a Fixed? Is that English or Spanish?

    And if a modern creator does produce something of Eternal Meaning, there's precedent for its propagation by those it has meaning for. Think of the Bible, or the Collected Works of Shakespeare. These continue to exist not because they were recorded perfectly on a perfect medium, but because people found them worthwhile enough to continue them.

    What good would a perfect storage method be, anyway? If people forget it, or if they cease to care, a record could be painted in Liquid Unobtainium on God's backside, and it would be just as lost as if someone had scratched it in sand. Or on the base of a bronze statue. "Look on my works, ye mighty..."

    Paper rots, stone erodes, metal corrodes. The only eternal medium is word of mouth. Anything else is just a memory aid.

  • by um... Lucas ( 13147 ) on Thursday February 24, 2000 @09:03AM (#1248768) Journal
    My worry is that at some distant point in the future, all our paper will have rotted away and people think that the CDs were our primary means of communication, like Babylonian pottery shards..

    Actually no. Paper has proven to be one of our most durable ways of storing data. Egyptian papyrus from 3000 years ago is still more or less intact. CDs, on the other hand, will last for 100 years in a "BEST CASE" scenario. Most will last much less time. CDRs might last 25 years. There are other variables besides the media itself. I've seen several CDs from the early 80's that refuse to play these days. They're not at all scratched, but the theory goes that the original ink used to print on them turned out to be slightly corrosive over a long span of time.

    In order to remain readable, digital data must remain more or less intact. A few missing bits in the application needed to open a file can pretty much reduce your odds of opening that file to zero. Analog data, on the other hand, degrades much more gracefully... It may start to fade, but there's no intermediary between having the data and being able to read it (you don't need an extra "application" to read a newspaper).

    I've heard that this is actually going to be one of the least documented periods in human history, because all of our data is stored digitally and periodically purged. Even if it's not, places like NASA are generating data faster than they're able to back it up and move their old archives onto newer media.

  • Some of you will, no doubt, remember the issue of whether or not Heisenberg was building an atomic bomb for the Nazis, and if so, whether he was actively interfering with the project because he disagreed with the Nazis' goals. It turned out that after the war, Heisenberg and some other scientists were being held in Britain. The British secretly tape-recorded all of their conversations. The medium? Spools of wire. (Think of a spool of wire being used just like a magnetic tape.)

    A few years back some scholars wanted to listen to these recordings and had a terrible time finding a player. Eventually they found a collector who had one in working order. Wire recorders have not been made since the fifties. But they eventually found a player and carefully transcribed them. (And it seems that Heisenberg was actively trying to build a bomb, but lacked the resources to do so.)

    How interesting that they had problems finding a working wire recorder. At any time, there are between a half dozen and a dozen wire recorders for sale on eBay. The circuitry of a wire recorder is so simple that any good old-school tube radio repairman could get one working in an afternoon.

    Wire recordings are an example of an early technology that turned out, unintentionally, to be a fantastic archival medium. Sure, the recording is monophonic, and the frequency response is limited, but for voice recording, those are acceptable compromises, considering that a spool of stainless steel wire can last for centuries. Short of physically destroying the spool, or deliberately erasing it, it will not decay. There's no plastic backing to decay. There are no oxide particles to flake off. Just corrosion-proof steel wire. Fantastic!

    I have dozens and dozens of original wire recordings from the late 1940s and early 1950s, and they all sound as good today as a freshly recorded wire.

  • I've been working with the Linux Video [linuxvideo.org] group where we've been trying to make an open source player for DVD discs. The ONLY problem that we're fighting right now is not the know-how to get it done, but rather trying to obtain the file format documents for DVD-Video and being able to use them legally. Indeed, the recent deCSS program is another really good example of how file format specifications can be illegal to implement, even if you have obtained the specifications legally.

    The DVD Forum [dvdforum.com] (formerly known as the DVD Consortium, which oversees the DVDCCA... this is the group of companies that cross-license each other's patents and share information regarding DVD development) currently requires you to sign a non-disclosure agreement (NDA) to obtain the specifications, and that NDA also prohibits you from even discussing the specifications with anybody unless they have also signed the same NDA. Since this is covered under trade secret law, this particular bit of intellectual property is theirs theoretically forever. At least until you can hire a bunch of lawyers to demonstrate that a DVD is no longer a trade secret.

    I've also set up a separate mailing list from the main Linux Video group that is in the process of developing an Open Video Disc [mailto] specification which is trying to allow people to develop products without having to pay royalties or deal with patent infringements. Fees for most of the current video formats range from over $10,000 (for the DVD specs... license fees are on top of that) to the MPEG Licensing Authority [mpegla.com], which is being quite reasonable for most closed-source projects, but if you read the details of what you must do to license a product, the terms are contrary to the nature of most open-source projects. It is still possible to write a GPL'ed MPEG player, but it would only be free as in speech and not free as in beer. In fact, you would probably have to charge somebody to download the software. Shareware MPEG players are probably skating on some very thin ice legally, and certainly part of the registration costs would have to go to the MPEGLA.

    One of the things that is so nice about HTML is the fact that this standard is open, patent and royalty free. If CERN had tried to put a patent on HTML I doubt that the web would have developed nearly so quickly. Or rather imagine if Apple's hypercard system had been developed with the GPL and file formats were made open for anybody on any platform to use.

    One of the things that I believe is killing the Unicode character encoding is that all kinds of intellectual property restrictions are placed on it, and you need to pay royalties to develop much software that uses it. Again, think what would have happened with ASCII had it been kept closed up, and why EBCDIC isn't being used for character encoding.

    More importantly, open and free specifications are critical to data preservation, which is a point that really hasn't been brought up by Calc (the author of the original post on /.)
  • One problem that I've noticed with certain programs is that they let you import data from old or competing file formats but they do not let you export the data to other file formats. What happens when the program and/or computer becomes obsolete?

    One of the email programs that I use stores everything in a database file. Short of saving messages to files, one at a time, there is no way to extract the messages from the database.

  • One of the big problems with storing data is the sheer size of it. In astronomy, almost all data collected by telescopes, be they radio, optical or otherwise, goes through a stage known as 'Reduction'. I've put this in quotes, mainly because it doesn't necessarily reduce the size of it. In essence, Reduction is about obtaining the most important or most complete information out of the data and discarding or minimizing the redundant, the useless and the misleading out of the data so that future analysis can be carried out on the important stuff without having to wade through all the noise. For instance, 70 or 80 images of one optical observation in various wavelength bands will be collapsed into three to five optimal images, one for each band. In Radio Astronomy, collating 60 - 80 12hr observations into one file removes all the 'bad' data and is optimal for future reuse.
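
    For a rough idea of what such a reduction step can look like (a sketch only, not the poster's actual pipeline; the array shapes and values are invented for illustration), median-combining many exposures of one band into a single archived image might be done like this:

        import numpy as np

        def reduce_band(exposures):
            """Collapse many exposures of the same field (one wavelength band)
            into a single combined image.

            Median-combining is a common, simple reduction step: it rejects
            cosmic-ray hits and other one-off artifacts that appear in only a
            few frames, while keeping the signal common to all of them.
            """
            stack = np.stack(exposures)      # shape: (n_exposures, height, width)
            return np.median(stack, axis=0)

        # e.g. 80 noisy 1024x1024 exposures -> one archived image per band
        frames = [np.random.normal(100.0, 5.0, (1024, 1024)) for _ in range(80)]
        combined = reduce_band(frames)
        print(combined.shape)   # (1024, 1024)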

    To effectively make a useful archive requires some filtering of what goes into the archive. Nowadays I work for IBM on DB2 UDB, and the roadmaps suggest that the size of databases is growing exponentially - fortunately this is balanced by a proportional growth in processor power, storage space and access speed. So while we have terabyte databases today, we could easily be looking at petabyte databases in a few years. These databases will probably hold a vast amount of digitized analogue information - memos, diagrams, papers - which currently is stored in more conventional storage. The advantages of moving to a fully digital archive are great - searching and retrieval are faster, and the space saved by putting scans of 20 boxes of papers onto a hard drive or other storage is also considerable. However, there is a danger of archives growing out of control - if you initiate a search which will visit every part of a petabyte database, you are going to have to wait for it to finish, even with the best search algorithms and vastly faster hardware. Making sure that information is not duplicated many times over in the database, and that redundant data is not added without regard to database retrieval performance, is extremely important. If we set up a project to 'mirror the web' for archival purposes, we'll be hamstringing ourselves right at the start - most data is not needed for future reference. By applying methods to distill the important information, archives can be updated, maintained and searched without exhausting the available resources.

    Cheers,

    Toby Haynes

  • Yes, when digital breaks, it definitely breaks (although checksums and duplication of data can reduce the chances of losing data), but the level to which you can push the degradation of a digital device before it breaks is really quite high.

    One major problem with digital formats is the absence of error recovery in common hardware and software. 99.9% of the data may be intact but one bad block at the beginning of a magnetic tape can make all of the data unrecoverable.
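
    One simple illustration of how a little redundancy changes that picture (a sketch only, not a description of any real tape or disc format): a single XOR parity block computed over equal-sized data blocks lets you rebuild any one lost block from the survivors, which is the idea behind RAID-style parity.

        def xor_parity(blocks):
            """XOR all equal-sized blocks together to form one parity block."""
            parity = bytearray(len(blocks[0]))
            for block in blocks:
                for i, b in enumerate(block):
                    parity[i] ^= b
            return bytes(parity)

        def rebuild(blocks, parity):
            """Recover a single missing block (marked None) from the survivors."""
            missing = [i for i, b in enumerate(blocks) if b is None]
            assert len(missing) == 1, "only one lost block can be rebuilt this way"
            survivors = [b for b in blocks if b is not None]
            blocks[missing[0]] = xor_parity(survivors + [parity])
            return blocks

        data = [b"one bad ", b"block at", b" the sta", b"rt of a ", b"tape... "]
        parity = xor_parity(data)
        damaged = data.copy()
        damaged[0] = None                      # simulate the unreadable block
        assert rebuild(damaged, parity) == data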

  • The Internet Archive [archive.org] is devoted to preserving the information contained in the Internet.
    And I have just found an article from Steve Baldwin, the guy from Ghost Sites [disobey.com]!
    --
  • Part of the solution is to avoid knee-jerk changes in format. For example, the Word file format gets changed every few years, but to what end? ASCII may eventually go out of date (as did EBCDIC), but at least a text file tends to be more future proof than a proprietary binary format. In terms of the web, there's already been a lot of nonsense caused by some people using Flash and other people using style sheets and other people using Microsoft or Netscape extensions to HTML. Is it really worth it? Or would it be better to stick to the least common denominator of pure HTML? I say yes, but apparently a significant number of web page creators disagree.
  • by vlax ( 1809 ) on Thursday February 24, 2000 @09:35AM (#1248787)
    Archiving is important. I'm actually surprised at the number of /.'ers who just want to let the data die. I remember taking a tour of the Maginot Line in France. Having proven useless as a military outpost, the entire chain of caverns was converted to document storage decades ago. In a thousand years, archeologists will be able to substantially reconstruct life in twentieth-century France. Information about births, deaths and marriages need never be lost. Detailed census reports can be preserved so historians can make new theories about the social behaviour of man. I think this is a fairly important task. Imagine how much easier it would be to reconstruct human history if past civilisations hadn't kept shoddy records.

    I suspect the problem of file formats is less serious than people make it out to be. A well-documented format should be reconstructable indefinitely. Few software companies don't document their file formats. Even without documentation, it ought to be easier than reconstructing dead languages. We learned to read Egyptian hieroglyphs primarily from one attested translation and a lot of careful deduction. Given a thousand Word 6 documents, I think a good computer archeologist ought to be able to construct a program to open and edit them.

    Museums of old hardware, and perhaps some sort of custom computer factory to make ancient hardware, strike me as a good idea. It could be like blacksmiths at SCA festivals, "Ye Olde ASIC Mill." :^) I doubt it would ever be profitable, but museums, even working ones, rarely are. Although who knows? A Commodore 64 could be an objet d'art in a hundred years, just as ugly African masks are now.

    The real problem strikes me as the one most heavily emphasised in the article: decaying media. I suspect the best solution with presently foreseeable technology would be to preserve data in crystallised DNA. Even in nature, DNA takes centuries to decay, and if it were crystallised and kept somewhere cool and dry, it would likely last for millennia. Encoding a document onto a billion strands of DNA weighs basically nothing and it would be a very highly redundant storage system.

    It isn't easy to do right now, but I suspect that technology is right around the corner and probably only requires a little bit of research money to become practical.
  • I should think that a 99% destruction rate is awful! Kind of defeats the purpose of an "archive" doesn't it?
    If I didn't work so hard to keep down the amount of repetitive joke-list and similar traffic I get, I could easily trash 75-90% of the total volume of the e-mail I get with no loss of information. The real problem is when there is a strong disincentive to keeping data around. For instance, corporate memos. Our hyperlitigious society is turning "documentation" into land mines for companies and individuals alike, so there is a very strong incentive to dispose of everything which is not absolutely necessary to keep. If it doesn't exist, no attorney for the plaintiff bar can subpoena it and use it to sink you under a huge verdict.

    Worse yet, the labor involved with separating out the 1% of stuff that ought to be kept is going to mean a non-zero error rate; people will toss things that are still of value just because they have no time to examine them in detail. What are you going to do....
    --

  • by Mr. Slippery ( 47854 ) <tms&infamous,net> on Thursday February 24, 2000 @10:21AM (#1248795) Homepage
    Many of the earliest mission datasets from the 60s and 70s are unrecoverable due to media degradation and format incompatibility.
    ...including, IIRC, a bunch of old Landsat data. "So what?" I hear you ask. "If the data were important, it would have been accessed more often and ended up being transcribed and preserved."

    Problem is, it's entirely possible for us to not understand the importance of a data collection for years. That old Landsat data would be a great baseline for information about global climate change.

  • Yes, but compare it to Unicode, that wastes *nine* bits when used by all right-thinking people of the world. And my god, think of all those binary file formats that pad space for fields reserved for future use. And what about the people who don't compress their media, or the whiners who think they are too good for lossy compression algorithms. Don't they realize that all *meaningful* information can be expressed in an mpeg bitstream?

    Don't even get me started on those luddites who still insist on using dried wood pulp as their storage medium. It's as if they think all information metaphors equate to a 16th century printing press.
  • by karb ( 66692 ) on Thursday February 24, 2000 @10:29AM (#1248798)
    Egyptian papyrus from 3000 years ago is still more or less intact.

    I read an article once in that hotbed of liberal thinking, Reader's Digest, about book deterioration. Older books were printed with a different method, and will last a couple hundred years. Newer books will only last maybe 50 years.

    This raises the question: how long will computer printouts last?

  • He talks about making upgrades of all documents standard procedure. The example he gives is upgrading all documents from Word 95 to Word 97.

    But, isn't this missing the point?

    The problem exists because products like Word build in incompatibilities to force consumers to always purchase the newest product. We don't have to accept this.

    The solution is to promote open document standards for everything. This should be part of the decision process when organizations are choosing applications.

    Hopefully, in the near future, we will be able to choose an office suite that stores everything in XML format, and uses open object types like PNG or JPG images.

    Also, exporting documents to a format like PDF or PostScript would solve a lot of problems. Open Source applications exist for both of these formats, ensuring that you are not at the mercy of the application vendor.
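
    As a rough sketch of what "storing everything in XML" could look like (the element names and filenames below are invented for illustration, not any real office-suite schema):

        import xml.etree.ElementTree as ET

        # Hypothetical, minimal document structure: a title, a paragraph,
        # and a reference to an image stored in an open format alongside it.
        doc = ET.Element("document", attrib={"title": "Quarterly notes"})
        para = ET.SubElement(doc, "paragraph")
        para.text = "Plain text content, readable even without the original application."
        ET.SubElement(doc, "image", attrib={"format": "png", "href": "chart.png"})

        ET.ElementTree(doc).write("notes.xml", encoding="utf-8", xml_declaration=True)

        # Decades later, any XML parser (or a pair of eyes) can recover the content:
        for p in ET.parse("notes.xml").getroot().iter("paragraph"):
            print(p.text)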

  • One of the email programs that I use stores everything in a database file. Short of saving messages to files, one at a time, there is no way to extract the messages from the database.
    Proprietary data formats are evil. Not only do they introduce the threat of software obsolescence, they prevent you from working with the information with any tools other than the creating software.

    I use MH to handle my mail. I can use all the standard MH tools as well as nice front ends like exmh, and they give a reasonable amount of power; but better yet is that, since MH stores one message per file in a plain format, I can use find, grep, perl, emacs, and all our other friends to manage my messages. I write my documents and correspondence in HTML and/or LaTeX (often via LyX) for the same reason. If it's supposed to have written-language content and I can't grep it, it sucks.
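
    A sketch of why the one-message-per-file, plain-text layout pays off (the directory path is hypothetical): a few lines of Python can search the whole archive with no mail client at all, just as find and grep can.

        from pathlib import Path

        # Hypothetical MH-style mail directory: one plain-text message per file.
        maildir = Path.home() / "Mail" / "inbox"

        def search_mail(needle):
            """Grep-like search over every message, no mail client required."""
            for msg in sorted(maildir.iterdir()):
                if not msg.is_file():
                    continue
                text = msg.read_text(errors="replace")
                if needle.lower() in text.lower():
                    subject = next((line for line in text.splitlines()
                                    if line.lower().startswith("subject:")), msg.name)
                    print(f"{msg.name}: {subject}")

        search_mail("digital preservation")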

  • If you want something to last, you can make it last. Acid-free paper is available, and it does a great job at preserving documents. But today, so much of our information becomes outdated that there seems to be little point in preserving what we know no longer applies. Old science books, etc...

    Laser prints, I'd guess, will be fairly durable... Inkjet prints, probably not... that's just a pure guess, though.
  • by ludovicus ( 131490 ) on Thursday February 24, 2000 @11:05AM (#1248808)
    This isn't anything new. I'll probably misspell all of the following names, but I think you'll get the gist of it.

    I believe it was King Tutankhamun's father, Akhenaten, who threw his world into a tizzy by rejecting the established religion and inventing a new one that worshipped the sun. He went off and built a new city to go along with it too.

    Well, the bureaucracy of the day didn't like this at all because it messed with their job security. And as soon as he was dead, they went around hacking his face off anywhere it appeared (of course we're talking about monuments, etc., made from stone) and I believe they went after any mention of him in text (hieroglyphs) too.

    And they almost got away with it and just about completely expunged his existence from their records. But they missed a few things and we've been able to piece together a little bit about him.

    So anyhow, there's my Discovery channel understanding of that little story. What it means in relation to this subject I'm not quite sure. I thought it was a good idea to point out that this is certainly not a new issue.
  • There is an article [wired.com] in the Sept. 98 issue of Wired that addresses the same issues on a personal level; the writer looks for a medium to preserve a dying friend's voice and poetry.
  • The Freenet Project [sourceforge.net] is extremely relevant to this - if it isn't mentioned in some of the links given, it definitely should be.

    --


  • This doesn't address the most difficult parts of this problem: multimedia. Images and sounds don't have the equivalent of ASCII. There is no universal standard that all tools access the same way. GIF used to be like that, but look what happened to it. JPEG is nice, but it's lossy, so there goes your perfect archive.

    Then there is the further problem of giving Joe Computer User out there the capability of building a "digital" history. With companies like Kodak and Apple goading people into using proprietary data formats like FlashPix or QuickTime, it's an uphill battle. And once again, there's no ASCII equivalent to fall back upon.

    Ugh... this really brings back the corporatism fueled pessimism I was feeling earlier with the DVD/DeCSS debacle.

    Oxryly
  • Strangely enough, this is something I've been dwelling on a lot lately. Everyone praising web-based news publications (Katz! bah) and online magazines seems to always overlook the fact that once the issue is gone off the web, it's usually gone forever. And if not now, where will it be in 40 years? I can still go down into my basement, look through my family's huge collection of periodicals, and find issues / articles from decades ago. With the quick pace of the web, such an act doesn't seem like it's going to be feasible.

    I don't know about anyone else, but there's something disturbing about this fact. Even the first few sites I've done back in the early / mid 90's have been lost forever, and while they were fairly insignificant, it's not an uncommon occurrence for information to be lost.

  • Yes, but compare it to Unicode, that wastes *nine* bits when used by all right-thinking people of the world.


    <sarcasm>
    Oh yeah, all right thinking people speak languages that can be represented by the letters available in ASCII. Yes, Unicode was invented for all the wrong thinking people who insist on using those funny looking letters with lines or dots around them or arcane characters that no right thinking person can understand anyway.
    </sarcasm>

    If you use the UTF-8 encoding and you restrict your text to the characters available in ASCII, the resulting text is ASCII. Besides, do you have any idea how hard it is to write the credits for a big free software project these days in anything other than Unicode without mangling somebody's name?
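
    That claim is easy to check (a minimal sketch; the sample strings are invented):

        ascii_text = "Plain old ASCII survives unchanged."
        assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")

        # Names with "funny looking letters" still round-trip losslessly in UTF-8,
        # at a cost of a couple of extra bytes per non-ASCII character:
        name = "Åsa Öström"
        encoded = name.encode("utf-8")
        assert encoded.decode("utf-8") == name
        print(len(name), "characters,", len(encoded), "bytes")   # 10 characters, 13 bytes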
  • This is a great advantage of Open Source (preaching to the converted, I know) because that is the only open standard (and therefore durable) format. All other proprietary formats will come and go with the companies that make them.

    What? I'm sorry, but open source IS NOT a magic bullet. There are two problems with your statement; and ironically they work in opposite directions.

    1. Just because a new data format comes around, doesn't mean people will use it. Look at PNG usage versus GIF usage.
    2. Say your format gets adopted. Now say a new format comes along that can handle something ultra-cool that the previous data format simply can't do. Why would people want to stay with the obsolete standard?
  • ... Primers on basic technology: mathematics, simple mechanics, mining and elementary metallurgy.

    Back in the 1950s and 1960s, the U.S. Office of Civil Defense actually did that. A library of information on how to make and do key practical and industrial operations was created, microfilmed, and thousands of copies placed in fallout shelters. This was beyond the usual survival-handbook stuff; more like "how to build or fix an oil refinery/power plant/water system/auto factory" information.

    If anyone knows where a copy of those microfilms still exists, please let me know. Thanks.

  • Your ideas are interesting and I think sound; however I think the moon is a little too remote. After all, by the time the future civilisation reaches the moon, it will almost certainly already have weapons of mass destruction; in all probability by the time they found it they would be more advanced than we, and not very amenable to learning from our mistakes.

  • by Anonymous Coward
    Papyrus only survives under very special circumstances (notably the arid conditions of Middle Eastern deserts). It rots as fast as anything else organic elsewhere in the ancient Mediterranean world and although we know it was used in the north-western provinces of the Roman Empire, all we have in Britain (for example) to prove its presence here are a tiny impression on a pre-Roman coin and a mineral-preserved fragment in a box of rusty Roman armour from near Hadrian's Wall.

    So much for papyrus; the Romans generated vast amounts of data and wrote a lot of it in ink on wooden tablets... a fact we only realised comparatively recently because so few survived. So much for wood...

    If you want a really durable medium for conveying information through the ages (and you have to exclude stone a) because it is much prized for re-use and b) because a lot of it, particularly the soft sandstones the Romans often used for monumental inscriptions, weathers rapidly) you need ceramics. Pots are vulnerable, but potsherds virtually indestructible. Ostraka (graffiti on sherds) survive now in as good condition as the day they were scratched, not something you can say of ancient papyrus, vellum, wood etc etc.

    So, start working out a way to put data onto dinner plates, and you have the perfect storage medium... sort of...

  • magnetic media or when a CD begins to decompose, that bit is lost forever.

    ECC would be a beginning (CD-ROM already uses it; that's why a 74 minute CD-ROM only holds 650M of data). Microencoding of some sort in a durable substance is perfectly acceptable as long as the instructions for building the reader are in a more readily accessible format.
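
    For a feel of what ECC buys you, here is a minimal sketch of a Hamming(7,4) code, which stores 4 data bits in 7 and corrects any single flipped bit. (The ECC actually used on CDs is cross-interleaved Reed-Solomon coding, far more elaborate than this; the sketch only illustrates the principle.)

        def hamming74_encode(d):
            """Encode 4 data bits as 7 bits (positions 1..7, parities at 1, 2, 4)."""
            d1, d2, d3, d4 = d
            p1 = d1 ^ d2 ^ d4          # covers positions 3, 5, 7
            p2 = d1 ^ d3 ^ d4          # covers positions 3, 6, 7
            p3 = d2 ^ d3 ^ d4          # covers positions 5, 6, 7
            return [p1, p2, d1, p3, d2, d3, d4]

        def hamming74_correct(c):
            """Locate and fix a single flipped bit, then return the 4 data bits."""
            s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
            s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
            s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
            syndrome = s1 + 2 * s2 + 4 * s3   # 0 = clean, else 1-based error position
            if syndrome:
                c[syndrome - 1] ^= 1
            return [c[2], c[4], c[5], c[6]]

        code = hamming74_encode([1, 0, 1, 1])
        code[5] ^= 1                           # flip one bit "on the disc"
        assert hamming74_correct(code) == [1, 0, 1, 1]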

  • Just because your 10 year old CD's can still be played today, doesn't mean they'll work in 2025.

    Some of my 10 year old CDs WON'T play on a new CD player, but WILL on an ancient old player. Only ten years, and there's enough drift in standards to make that happen. Of course, neither player is top of the line, perhaps a really good one would play the old CDs.

  • Reading over many of the comments on decay of digital media, it occurs to me that many people are missing the point that digital data is really analog when you get right down to the fundamental formatting. (Until we're storing data in quantum media, that is...)

    Even if a standard CD player can't play a degraded CD, if someone wants the data badly enough, they'll build an error-correcting CD player that will reconstruct bits that a normal player can't read. Just like archeologists reconstruct paper or hieroglyphs or fossils today, future archeologists will no doubt reconstruct CDs and hard drives.

    Even today, data recovery specialists can read off multiple generations of files. Maybe archeologists will have optical readers which will read the CD/magnetic surface at many times their original resolution / sensitivity and reconstruct the data. Of course it would be nice for us to leave them some equivalent of the rosetta stone so they can decipher the various formats. But overall, I think today's digital media will be far more recoverable than people might think.

    Just a thought.
    -dialect
  • Newer CD-ROMs have a much better chance of surviving than original CDs did. There have been several major changes.
    • Plastics are now more flexible and less prone to shattering.
    • Thicker plastic places media farther from surface, hence farther from scratches. Scratches can also be filled since the actual media hasn't been damaged.
    • No new CDs from the "Crash Test Dummies" have been released. The last copy of "Mmmmm...Mmmmmm...Mmmmm...Mmmm..." I saw was the victim of a drill bit, a microwave, a utility knife and, eventually, a concrete wall.
    • Going back to aluminum foil presses rather than silver -- a couple of companies (one of the larger British CD manufacturers) attempted silver for a while until someone discovered that silver tarnishes over time.

    +------->
  • You have a fair point. The 1985 media could have been copied in 1995 onto newer media a fraction the size. If this was not done then a problem arises. It's not insuperable, unless the media decay, but it can be expensive. We can always hand-build a tape drive, or read the magnetic field directly off the tape with an electron microscope, but it costs.

    Books are not as good as you think though. People needed to (a) be able to read and (b) speak the relevant language, for archives of old books to be of any use. Neither is completely automatic. Also, to make real use of old books, the readers would need a fair amount of cultural context for them, and that is positively expensive to acquire.
  • Oh my,...

    Some of us don't have English as our mother tongue. That is something that is too often forgotten at places like Redmond.

    (OK, this is slightly OT, but I'll rant anyway.)
    Between DOS and Windows, that company we all love to hate decided to change character sets. Suddenly three letters in the Swedish alphabet have a new character code. One and a half decades later (count that in internet time...) we are still struggling with documents with mixed encoding.

    That means every damn application has to provide a way to recode OEM to ANSI. AND deal with users who try to do this conversion on files already converted.

    This is *before* dealing with unix and mac files.

    So if we can't read freaking text files after ten years, how are we supposed to read binaries?

    Sometimes I just get too tired...

  • Ah, but language and cultural context are "software" issues. I can look at Hebrew or Arabic text and copy it reasonably well even though I can't understand what it says. This would allow me to preserve it even if I couldn't make use of the content myself.
    --
  • Why not? There was some extreme competition back in the early days of computing, and just about everything else was a proprietary standard anyway, so why not the character encoding scheme as well? (Which IBM did anyway.)

    I think the real reason why ASCII was adopted had nothing to do with the computer industry, but rather with Western Union (which was a part of American Telephone and Telegraph at the time). All of the teletype machines used ASCII, and it proved to be a stock terminal option for many years. The control characters are a legacy of this heritage as well. The reason for the ASCII codes #13 followed by #10 is that the teletype machines had to be told to physically move the print head back to the left side of the page and then scroll the paper up one line. How many systems still need these codes, even though all it really means is that the cursor is moved to a new position on the monitor?
