Preservation of Digital Information
Recently there was an Ask Slashdot about the problem of preserving digital material. The basic idea was that we are creating a massive wealth of digital information, but have no clear plan for preserving it. What happens to all of those poems I write when I try to access them for my grandkids? What about the pictures of my kids I took with that digital camera? Can I still get to them in time to embarrass them in the future?
Obsolescence of digital media can happen in three different ways:
- Media Decay: Even when magnetic media are kept in dry conditions, away from sunlight and pollution, and hardly ever accessed, they will still decay. Electrons will wander over the substrate of the media, causing digital information to become lost. CD-ROMs luckily do not have this same problem with electron loss, though they are still sensitive to sunlight and pollution. Many people mentioned last week that distributors of blank CD media often make claims of a hundred years or more for the duration of their products. Research seems to indicate the truth is closer to 25 years, which seems like a long time, until you consider the factors below. Besides, information professionals often think in terms of centuries rather than decades.
- Hardware obsolescence: Far more dangerous than the degradation of the actual information container is the loss of machines that can read it. For instance, the Inter-university Consortium for Political and Social Research received a bunch of data on old punch cards. The problem was they had no punch card reader. It took a decent chunk of time, and a good deal of money, to eventually read the data off of these cards, even requiring some old technicians to come out of retirement to help tweak the system. Hardware extinction is hardly a foreign topic to Slashdotters. It happens, and as technology increases its pace of change, it will happen more quickly.
- Software obsolescence: The real stone in the shoe of digital preservation is obsolescence of the software needed to open the digital document. This can include drivers, operating systems, or plain old application software. We all have piles of old software written for older systems, or come across an old file at the bottom of a drawer and can't even remember what application created it.
There are several strategies for preserving digital information. People mentioned some last week:
- Transmogrification: printing the digital document into an analog form and preserving the analog copy. An example would be printing out a Web page and archiving the printout. This, obviously, throws away the main strength of a Web document, hypertext linking, and may also lose important color and graphical content. An alternative form of this is the creation of hardcopy binary that could later be keyed back into the computers of the future. The media suggested have ranged from acid-free paper to stainless steel disks etched with the binary code. The two major problems with this idea are that any misrepresentation of the binary could have disastrous results for the recovery of the document, and that transformation to hard copy limits the functionality of many types of digital documents to the point of uselessness.
- Hardware museums: preserving the technology needed to run the outdated software. There are several weaknesses to this plan. Even hardware that is carefully maintained breaks and becomes unusable. In addition, there is no established agency responsible for maintaining these machines. Spare parts eventually become impossible to find, and maintenance requires legacy skills: there must be technicians able to service these preserved machines. Finally, it is hardly efficient if all possible future users must bottleneck through just a handful of viewing sites to have access to the information.
- Standards: reliance on industry-wide standardization of formats to prevent obsolescence. Marketplace pressures give software producers an incentive to differentiate their products from their competitors', so universal standardization is unrealistic in a capitalistic marketplace. Still, standards such as SGML have proven successful for large-scale digital document repositories, like the Making of America archive hosted by the University of Michigan. However, many of these large repositories also receive information from donors that is not in a standardized format, and do not feel comfortable turning away those documents.
- Refreshing: moving a digital object from one medium to another. For instance, transferring information on a floppy disk to a CD-ROM. This definitely seemed to be the preferred method of most Slashdotters. While this takes care of degradation and obsolescence of the media, it does not solve the problem of software obsolescence. A perfectly readable copy of a digital document is useless if there is no software program available to translate it into human-readable form. (A sketch of a verified refresh appears after this list.)
- Migration: moving the digital document into newer formats. An example might be taking a Word 95 document and saving it as a Word 97 document. Single-generation leaps are usually not a problem, so large volumes of information can be saved. Unfortunately, migrations over several generations are often impossible, as is migrating from a document type that was abandoned and did not evolve. Also, information loss is common in migration and may cause the document to become unreadable. While this may be the best single method available, it is very labor intensive, and some knowledge of the nature of the documents is essential to determining which information containers to migrate. For instance, you often lose aspects of a document (good and bad) when you migrate it, but which of those aspects are important?
- Emulation: creating a program that fakes the original behavior of the environment in which the digital object resided. This is another very intriguing method, and it is actually already pretty common: most processor chips include emulators for lower-level processors, and there already exists on the Internet a very active community interested in emulating old computer platforms. Still, a lot of research remains on the cost of this method, and on what sorts of metadata are necessary to bundle with the digital object to facilitate its eventual emulation. Another problem is the intellectual property hassle caused by emulation. Reverse engineering is a big no-no, and there is no point in making the lawyers rich. This area is actually where Open Source can be of biggest help to preserving the longevity of different kinds of applications.
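As a concrete illustration of the refreshing strategy above, here is a minimal sketch in Python: copy every file to the new medium, then verify each copy bit-for-bit with a checksum before trusting it. The mount points are hypothetical.

    import hashlib
    import shutil
    from pathlib import Path

    def sha256_of(path: Path) -> str:
        """Return the SHA-256 digest of a file, read in chunks."""
        h = hashlib.sha256()
        with path.open("rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def refresh(source: Path, target: Path) -> None:
        """Copy every file to the new medium and verify each copy."""
        for src in source.rglob("*"):
            if not src.is_file():
                continue
            dst = target / src.relative_to(source)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)
            if sha256_of(src) != sha256_of(dst):
                raise IOError(f"copy of {src} does not match the original")

    refresh(Path("/mnt/floppy_dump"), Path("/mnt/new_cdrom_master"))

The verification step matters: a refresh that silently copies corrupted bits forward just launders the damage onto fresher media.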
Many people in the discussion last week seemed to believe that simple refreshment or migration of the data would be a sufficient answer to the problem. At a personal level that may be true, but for anyone responsible for large amounts of digital information, neither is a completely convincing method. Here are a couple of reasons why:
- Not all documents are the same: In the digital preservation literature, most people talk as if all digital information is in ASCII format. Au contraire. As computing becomes increasingly robust, so do the documents we create. Multimedia games, three-dimensional engineering models, recorded speeches, linked spreadsheets, virtual museum exhibits, and a host of other documents spurred by the development of the Web have cropped up. How are they going to be affected by migration to a new environment?
- It's so darned expensive: It's a little gauche to talk about, but the Y2K bug forced what ended up being a huge migration of digital information. How much did the US alone spend on that fiasco? $8 billion? For smaller organizations that do not prepare for the preservation of their digital information, the cost of emergency migrations could cause all sorts of budget trouble.
There is some belief that there is no reason to preserve information at all: most of what is created is just tripe anyway, and we should be more focused on creating content than preserving it. There are two reasons why some sort of preservation is important. First of all, it is inefficient to recreate information that already exists. Human energy is better spent building upon existing knowledge to create new wisdom. How much do we already spin our wheels as several people collect the same data? What more could we be doing if we spent that energy on new pursuits instead? Secondly, some data is simply irreplaceable.
Which is not to say that we should keep everything. In a traditional archive, only 1% of documents received are kept; ninety-nine out of one hundred are destroyed for various reasons. A similar ratio is not unreasonable for digital documents. Consider that 16 billion email messages are sent each day. It seems ridiculous to keep all of them, but how do we weed out the ones we do want to keep? Appraisal of digital documents for archival purposes is going to become a major issue in the not-too-distant future. There are already examples of data that have been lost, or nearly lost. NASA lost a ton of data off of decayed tapes. The U.S. Census nearly lost the majority of the data from the 1960 census. These huge datasets are important for establishing a scientific record that reveals longitudinal effects.
Increasingly, the record of the human experience is kept in a digital format. The act of preserving that information is the act of creating the future's past, the literal reshaping of our world in the eyes of the future. Nobody knows the best answer yet. There is probably not a single answer that will fit absolutely all situations. Information professionals are just beginning to do research in the form of user testing, cost-benefit analysis and modeling to answer some of the thornier issues raised by the preservation of digital information. There are things out there worth saving, we just need to figure out the best way to do it.
Some links of interest in case you would like to read more:
- a really good bibliography of related sources by Michael Day
- an article by Jeffrey Rothenberg outlining some of the issues
- a site at Leeds University with many related links
Ok ... (Score:1)
Bad Mojo
99%?? (Score:2)
With digital documents, there's no real reason not to save all of it, even if much of it is "tripe".
Information is information, whether or not we find it useful. Some day, someone else might find our tripe is a goldmine of information, if only for anthropological study.
Sakhmet.
(The REAL McCoy)
"The surest way to corrupt a youth is to instruct him to hold in higher esteem those who think alike than those who think differently."
those AOL CD's (Score:4)
Limited problem, if... (Score:1)
I don't think there's too much trouble with losing games and other applications as the hardware that runs them obsolesces... New ones will be created, and the best of the old will be ported.
As to the data already archived on various media, there could indeed be a problem if people fail to move the data to newer media... Think of your pile of 5 1/4" disks that's just rotting in the corner because your new computer only has a 3 1/2" drive -- and that's not even a huge leap in technology.
There's also the question of formats, especially for users of M$. After two revisions of the software, it can't read any of the old data! Try reading a Word 6 document in Word 97 for laughs, especially if you use any special characters ü á € in your documents...
more info (Score:3)
--
BBC article (Score:3)
magnetic storage (Score:4)
The solution would be to use an optical storage medium, but as others have pointed out, CDR storage has a life expectancy of 75-100 years depending on the brand. Which wouldn't be too bad, except you have to realize that in 100 years you need to start putting resources into copying all that data off and re-writing it again. After a while you'll have a snowball effect where you spend more time re-writing the old data than the new!
What we really need is a piece of technology that doesn't age: an entirely self-contained computer (nuclear powered, maybe?) that has the media and the reading/writing mechanisms, and has several failsafe mechanisms to alert you well before any data is lost. Think of it as a computer time capsule: you bury it, and in 500 years someone comes back and it has all the human interface necessary to reproduce the data in a usable format. Of course, you'll still need someone who reads English then...
agh, the problems, the problems....
Is this likely? (Score:1)
Is there an example of a computer system that doesn't exist anymore and can't be emulated, at a much greater speed than the original, using existing software? Even most arcade machines can be emulated these days.
My philosophy is ... (Score:1)
If a piece of information has not been preserved and is now inaccessible, it probably means that it was of minimal value anyway.
That's probably not the greatest way to look at this but I'm thinking that half of all the info that's presently out there is useless anyway and is just taking up space for nothing. Maybe it's a good thing that these will be lost with time. It's kind of like a good spring cleaning.
*******************************
This is where I should write something
intelligent or funny but since I'm
Re:99%?? (Score:1)
Re:99%?? (Score:1)
But just think, in hundreds of years' time someone might come across Microsoft's marketing literature and think it's actually true!!
DOWN WITH ARCHIVING!!!
A great challenge (Score:3)
Unless we take steps to archive, transcribe and preserve all this information (yes, grits, petrification et al.), then we are in effect building a new Library of Alexandria.
It would be the greatest loss ever for archaeologists of the future to be unable to access archives of the WWW. Every day is a unique snapshot of the world as the endless churning of webpage updates/dead link removals changes the WorldWideWeb.
This information Ocean is something unique. Archiving such a huge store of information generates a challenge in itself.
I don't often wax lyrical about the internet but it is in effect becoming a snapshot of our civilisation.
What a loss for future generations if they cannot see the views of ordinary human beings (through the endless websites) preserved.
Old issue (at least in sci-fi) (Score:3)
Data Decay, Readability, and ASCII text. (Score:3)
When you look back at history, at documents that are a "mere" thousand years old, the wealth of information in those documents makes you wonder what could be found if all the documents from that time had survived. Just because the format is digital, rather than analog or (eek!) paper, does not mean that this media is impervious to decay.
However, I think that decay is much, much more serious in digital media. The root of the problem is that if you are looking at a physical document with water damage, even though the original "packets" of information (letters and words) are damaged, the human brain can sometimes extract meaning from smeared ink and crumbling paper. When an electron wanders on magnetic media or when a CD begins to decompose, that bit is lost forever. Digital media are much more susceptible to lapsing into unintelligibility than physical media like paper.
Preservation in a medium that will not become obsolete is the key. As mundane as it may sound, plain ASCII text will probably never become obsolete because there is no real reason to come up with a new standard. Some people may scream at me: "*ML! *ML!", but at the rate that these things obsolesce, plain text will still be around when XSGHTML has been long dead.
Just a thought. If you have something to add, feel free to respond.
Brandon Nuttall, the inquisitor of Reinke
An excellent summary of the problem (Score:4)
Personally, I encountered the issue of software obsolescence well over a decade ago. I migrated my resume to TeX because it had already been through four other formats and I no longer had access to the tools to read them. I picked TeX because I firmly believed that a tool that I had the source for was likely to continue to be useful to me for a longer period. And the source for the document is ASCII text, which I was able to convert to HTML a couple of years ago with little trouble. I will not rely on the future availability of any tool that I have no control over.
This is one of the reasons that The Unix Philosophy, a fine book, recommends text formats for data. You can manipulate it with a wide variety of tools including text editors. It is unlikely that we will abandon those completely in our lifetimes. It also suggests, if memory serves, keeping notes online in text form. They are more portable and more accessible that way.
One worthwhile source of literature preserved as plain text files is Project Gutenberg [promo.net]. It is probably also the oldest such project around. It is to text in some senses what Free Software is to code. Although they aren't doing collaborative authoring projects, they are collaborating on getting old books whose copyrights have expired into electronic form. If you haven't ever visited their site, take a look.
Not a technological problem... (Score:2)
But the technical issues are insignificant compared to the legal concerns - copyrights, patents, etc.
Sure, most of these forms of copy limitation do expire, but until a large amount of "digital literature" becomes public domain, nobody's even going to *try* developing a preservation system, for fear of lawsuit by irate copyright-holders.
My university's library collection totals nearly seven million books. Yet extracting information from this huge paper collection has been an incredible hassle... I would be willing to pay a significant annual fee if I could access every page in the library via a Web interface. I leave the juicy technical details to the reader's imagination. (I bet a few people with hand-held scanners and rudimentary OCR could digitize the entire library in a reasonable amount of time).
But guess what - this is never going to happen in my lifetime.
These seven million volumes of knowledge are never going to be preserved, because no library director in his/her right mind would risk slipping up and getting sued for violating a long-lasting copyright.
some other problems (Score:1)
This assumes that information SHOULD be thrown away. I'm not interested in becoming a pack rat; I already have enough "stuff" to keep track of, thanks. I suppose I'm just not all that interested in making my information, no matter how trivial, available to future archaeologists.
It's not just digital magnetic information... (Score:3)
In this case, the main problem is not bit-rot (although this will occur sooner or later) but rather problems with not recalling the information for an extended period of time. For example:-
Inversely proportional? (Score:2)
Books can have a life of hundreds, if not thousands, of years if treated right. Even with abuse they will survive for years.
There is a problem of obsolescence of language, although usually there is a Rosetta Stone equivalent.
With modern media, technology is progressing so fast that it is almost throwaway. At my previous company we had good backups, but we had no way of accessing them: before we went to DAT and then DLT, we didn't actually possess the devices needed to read the tapes, and before that the disks. It could be argued that with the internet, archiving is going to be more dynamic and fluid, but where does this leave information, and especially information for future generations? It is all well and good moving from the printed page to the digital page, but in 2000 years' time will they be able to revive the contents of a hard disc? Will the information on the internet have evolved dynamically, leaving no snapshot? Or will they look through the books of our time???
What will be our dead sea scrolls?
Re:Ok ... (Score:1)
The project files in Alias|Wavefront, Maya, and Softimage formats, as well as other miscellaneous formats, are just not suited to printing. Even if they were, even if you printed out ASCII versions of Maya files, for example, imagine what you'd have to do to get it back in the computer to reuse the project!
----------
Thanks to "proprietary formats" info will be lost. (Score:1)
Exponential data and storage (Score:2)
---
Just carve it (Score:2)
Software obsolescence is not a big problem -- humans (we hope) are going to be around for some time, and the brain wiring changes awfully slowly. Languages do get forgotten, but smart people are very good at understanding dead languages and will probably only get better. Readers are also not likely to be a problem: just like brain wiring, eyeball construction is quite stable and not going to be superseded by a better design any time soon.
The medium -- provided you pick a good hard rock like granite (avoid limestone and its derivatives like marble; they don't like acid rain) -- does not suffer from bit rot, completely ignores magnetic fields, is stable under solar radiation, and is fairly resistant to pollutants.
You are not limited to ASCII, and you even have limited graphical capability. In fact, rock has a huge advantage over current digital media -- it's perfectly possible to create, view, and store 3D objects in rock. Just try that with your 21" monitor!
Just in case you think I am being funny, there is a company which in exchange for a sum of money will take your text, etch it on metal plates (nickel, I believe), and store it in some cave. They are estimating >5,000 years MTBF. I still think a good slab of granite is better, though.
Kaa
Information evolution (Score:1)
The common response to this is that we may not know what is worthwhile, or that future ages may not take appropriate care. Lost Greek plays that would be worth millions now were overwritten by some monk's laundry list in a less enlightened age. We feel we must save our information from that fate. But that is an impossible task. Etch the information on steel disks and some future, more barbaric age may melt those disks down for swords.
So forget about trying to save everything. Just work to save what you think is important. Yes, stuff will get lost, but that will happen anyway. You will never get perfection. More likely, future generations will curse you for the stuff you thought too trivial for your archive project, while finding what you did archive worthless.
Data *and* code (Score:2)
Now, I'm dealing with legacy code, too. One solution of course is to write vanilla code in a common language, but who knows what language is going to be used in 25 years? C+++? Fortran 2020? And vanilla code isn't always optimal, when hardware vendors build cutesy hotrodding tricks into their architecture and compilers.
Somebody just needs to build a giant computer version of babelfish for all languages ever. Starting with cave paintings. :)
Ermmmm.... Moderators! Bart's Swearing! (Score:1)
(Then there's the fact that the BBC website can probably handle more traffic than Slashdot, so a mirror is pointless.)
Re:those AOL CD's (Score:2)
"February 24, 6423:Archeologists have discovered evidence that ancient humans worshipped a God called 'McDonald,' whose temples were signified by golden arches..."
Help? ---> A question... (Score:1)
I have a question, however, about the other end of the data life-cycle: its birth. Certainly data disappears, but what is the best way to describe or define "data," broadly speaking? What is the best definition anybody here has ever heard for "information"? I'm having trouble finding a straight answer. Is data (information) a representation of something in the real world? Is it like a shadow of something else? We have seen how it can be created, we have seen how it can evolve, and we have seen how it can fade away and die, but what is the best definition of what it is?
This is one of those philosophical questions that just nags at the mind. If anybody can suggest definitions (or resources), I'd be grateful.
A. Keiper [mailto]
The Center for the Study of Technology and Society [tecsoc.org]
I'm off to save the world... (Score:1)
It was noted that storage requirements for geographic data (geologic, topographic, etc.) would require petabytes. Multiple petabytes. And a petabyte is 1000 terabytes (right?). And we're thinking 36GB hard drives and DVD-RAM drives have a lot of space...
--
Re:My philosophy is ... (Score:1)
If a piece of information has not been preserved and is now unaccessible, it probably means that it was of minimal value anyway
You didn't read the whole article, did you? Or perhaps the 1960 census resulted in information of "minimal value" that we didn't need lying around anyways? This is data that cannot be recreated, and is irrevocably lost.
Caching as a possible approach to preservation (Score:2)
This project definitely does not address all the issues with digital-document preservation; it definitely does _not_ solve the document-format problem. Its goal is to make digital publishing "immutable" so that publishers cannot modify or withdraw their work after it is published.
Disclaimer: I work for one of the groups which is participating with the LOCKSS project, but I'm not working directly with it.
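For flavor, here is a toy, single-cache sketch in Python of the tamper-evidence idea: record a document's digest when it is first archived, then flag any later divergence. This is an assumption-laden simplification; the real LOCKSS system goes much further, comparing copies held by many cooperating caches, and the URL below is hypothetical.

    import hashlib
    import urllib.request

    class CachingArchive:
        """Toy model of a cache that makes publishing tamper-evident."""

        def __init__(self):
            self.store = {}  # url -> (archived content, SHA-256 digest)

        def _fetch(self, url):
            with urllib.request.urlopen(url) as resp:
                return resp.read()

        def archive(self, url):
            content = self._fetch(url)
            self.store[url] = (content, hashlib.sha256(content).hexdigest())

        def still_matches(self, url):
            """True if the publisher's live copy matches our archived copy."""
            _, digest = self.store[url]
            return hashlib.sha256(self._fetch(url)).hexdigest() == digest

    cache = CachingArchive()
    cache.archive("http://publisher.example/article.html")
    # ...months later...
    if not cache.still_matches("http://publisher.example/article.html"):
        print("document was modified or withdrawn after publication")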
Re:Inversely proportional? (Score:3)
CD's and related optical media do have problems with sunlight, but you have to remember that they were created (AFAIK) by the audio industry, which is one of the most notoriously fickle industries in the world: they want you to buy a new CD from a new group every week, not have a single CD that is perfect and lasts forever. I think that the concept of people being able to listen to their CD's for 10 years is already far too long for them.
The problem is that there doesn't really appear to be anyone making storage media optimised for long-term persistent storage. But do you think that such a format would be the way forward? Each year, we generate an exponentially larger amount of information. All the hard disks on the planet now would not be enough to store the new information that will be generated in the next 5 (wild guess) years. Therefore we are going to need progressively larger and more efficient forms of data storage as the information bloat gets larger. As new formats come out, the important thing is to look at the movement of legacy data onto the new formats. If data is treated not as a static thing to be boxed up and forgotten, but rather as part of the ongoing current set of information, transferred onto new technologies as they are developed, then you will not have the situation where people are looking at a hard disk in 50 years' time and going "what's an IDE interface?".
Of course, then you have the 'minor' issue of application file formats...
So just what *is* the life of a CDR? My results. (Score:4)
I put some CDRs out in the direct sun here in the Las Vegas desert over the last summer. Blue, gold, green, pale green, and an RW. Both sides of the CDs had their chance to roast in the 100°F+ (40°C+) sun for several months each. And here are the results of attempting to read the data back on each type:
Old TDK green CDR: dead, nothing readable. Faded to a mostly clear plastic disc!
Ricoh gold/gold CDR: dead, nothing readable. The golds faded visibly first of them all. Area where data was stored faded to clear!
Verbatim (blue): I was stunned. I read back a full and complete iso image of Red Hat 4.2. No fading at all.
Memorex silver/green CDR: mostly dead, some files readable. Faded in a few isolated patchy blotches.
The CDRW... just started this test. No results yet. Looks OK, though.
Overall, I'd say the blue CDRs are the best choice for long term data storage.
Moore's Law (Score:2)
Handling terabyte databases now needs leading-edge hardware and state-of-the-art software specially optimised for the data format. In 20 years, however, we will just be able to haul the terabyte database into emacs and hack up some macros to reformat it and search it.
If Moore's law ever tops out, then we are in trouble!
Re:magnetic storage (Score:2)
Re:more info (Score:2)
"Computer reel tapes, VCR tapes, and audio tapes last about as long as a Chevy or a poodle."
Re:A great challenge (Score:2)
This would serve two purposes:
1) Extra-terrestrial beings (assuming they have the technology and could decode it) could have a window into life on earth.
2) Whenever mankind figures out how to make wormholes or travel faster than light they could simply warp out to whenever they want info from and recover that day's web broadcast.
Altogether, not a bad idea, huh?
Re:Data Decay, Readability, and ASCII text. (Score:2)
Someone who speaks a language that doesn't use the basic Roman character set may beg to differ. There are very real reasons to consider moving to something like Unicode.
related article (Score:1)
Irony of ironies: Data records on floppy disks relating to an archaeological dig decayed by 5 percent in under a decade - after everything had survived the journey from the Bronze Age intact.
A. Keiper [mailto]
The Center for the Study of Technology and Society [tecsoc.org]
Analog vs. Digital (Score:1)
Re:magnetic storage (Score:1)
Vanishing Web Content (Score:1)
How long can it last? (Score:1)
Re:Thanks to "proprietary formats" info will be lo (Score:2)
Re:Is this likely? (Score:1)
Data Havens, Archive and standards oh my! (Score:2)
Given a format that is a) adequately documented, b) accurately represents the data it encompasses, and c) has sufficient widespread adoption, we can simply archive to that format as we need to.
Let's consider various and sundry data types, the prominent format for handling them, and the potential longevity of those formats.
Text: For raw text of course you have ASCII. While not a permanent fixture, nobody can argue its longevity. We'll call this the baseline. Moving up from ASCII you need some way of defining formatting and such. There are really only a couple of realistic solutions: some SGML-based system, HTML, or PDF. I'll get into the latter two cases a little further down. Let's say that for plain text, SGML has the best longevity because of widespread adoption and simplicity.
Rich Text (beyond simple formatting): As above, we need something better than ASCII. I'll vote for PDF here. It's a proprietary format, but it seems to be pretty well understood, and it does an accurate job of representing the original document. Mac OS X groks it very well, and Adobe has ensured that there's a viewer for every platform. If conversion tools can be made, then this is a good format.
Images (bitmap): PNG, JPG, GIF, and TIFF. TIFF seems to be less relevant these days, although most scanner software still produces it. JPG/GIF are where the majority of data presently exists, and PNG is where everything should be archived, IMHO... PNG being lossless, and supporting about every feature known to man, this seems to be the best solution. One could crawl the web, grabbing every single GIF or JPG, archive it in PNG format with no loss of data, and quickly build a significant archive. (A sketch of such a conversion follows this list.)
Image (vector): Sorry, don't know much about the formats used here...
Audio: The obvious solution for archival is uncompressed, raw audio in a well understood format like WAV. This is an area that doesn't seem to be changing much...
Video: Again, I can't really comment on the formats here...
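As a sketch of the archival crawl suggested under Images above, the following Python uses the Pillow imaging library (an assumption; any lossless re-encoder would do) to re-save files as PNG. Two caveats: a JPEG's decoded pixels are preserved exactly, but the lossy compression already baked into the JPEG is not undone, and animated GIFs or embedded metadata would need extra handling. The directory names are hypothetical.

    from pathlib import Path
    from PIL import Image  # pip install Pillow

    def archive_as_png(src: Path, dest_dir: Path) -> Path:
        """Re-encode an image's decoded pixels losslessly as PNG."""
        out = dest_dir / (src.stem + ".png")
        with Image.open(src) as im:
            im.save(out, format="PNG")
        return out

    # Hypothetical layout: re-encode every harvested GIF into the archive.
    for image in Path("incoming").glob("*.gif"):
        archive_as_png(image, Path("archive"))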
Things become more complicated when you have interactive media, or other very specialized forms of data... But I'd rather save that for the experts...
The author brings up the "loss of fidelity" issue when updating documents to a new format. I think this really only is an issue when making a lateral move. Converting from JPG to PNG wouldn't be a problem, nor GIF to PNG. Converting from WordPerfect to Word on the other hand, is problematic at best...
Thus the need for archival formats with some longevity. Perhaps a commission should be formed on data archival formats? A group of OSS developers who do nothing but strictly define what format(s) are to be used for "data archival" purposes, and ensure that tools to read/write these formats are readily available on every platform -- including new ones as they come out.
The trick is to avoid lateral conversions at all costs.
Re:Analog vs. Digital (Score:1)
Yes, when digital breaks, it definitely breaks (although checksums and duplication of data can reduce the chances of losing data), but the level to which you can push the degradation of a digital device before it breaks is really quite high.
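A minimal sketch in Python of the duplication idea mentioned above, assuming three replicas kept on independent media: hash each copy and let the majority outvote a corrupted one.

    import hashlib

    def vote(copies):
        """Return the replica content that the majority of copies agree on."""
        tally = {}
        for copy in copies:
            tally.setdefault(hashlib.sha256(copy).hexdigest(), []).append(copy)
        best = max(tally.values(), key=len)
        return best[0]

    good = b"the poem for the grandkids"
    bad = b"the poem for the grandkidz"  # one replica picked up an error
    assert vote([good, bad, good]) == good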
My solution to accessing obsolete documents.. (Score:2)
We must actively, and over the course of time, make sure what we do is available for posterity. Next time you burn MP3s to a CD-R, burn a copy of the mpg123 source too. Thirty years down the road, the information will be usable to anyone with the ability to read C and a DVD-ROM, even if MP3 is a forgotten format. When CDROM becomes hard to find, copy it to new media. I started on an Atari, and have managed to propagate that data through audio tape, floppy disc, magnetic tape and CD-R with little effort. Preservation shouldn't be an afterthought. Just do it!
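This advice can even be scripted. A sketch using Python's standard zipfile module; the file names are hypothetical, and ZIP_STORED is used so no extra compression layer stands between a future reader and the bits.

    import zipfile
    from pathlib import Path

    with zipfile.ZipFile("archive-2000.zip", "w", zipfile.ZIP_STORED) as z:
        for track in Path("music").glob("*.mp3"):
            z.write(track, arcname=f"data/{track.name}")
        # Bundle the decoder's C source right next to the data it decodes.
        z.write("mpg123-0.59r.tar.gz", arcname="tools/mpg123-0.59r.tar.gz")
        z.writestr("README.txt",
                   "tools/ holds C source for a decoder that can play "
                   "everything under data/.\n")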
Re:Inversely proportional? (Score:1)
Archiving should be all about finding efficient ways of storing information in retrievable ways for as long as possible.
However, archival seems to have become all about storing all the information (take the British Library or Library of Congress, which can't really keep up with publishing in terms of space and resources...).
Maybe the answer is something like Gutenberg, where plain text is used along with a fluid medium, albeit one which seems to be stable (i.e. multiple mirrored servers with backup devices).
I think there needs to be a shift in focus from the sheer need to store to the methods of storing and the reasons for storing.
I think the internet is an interesting snapshot of our time, but I think its transience and fluidity are the things that make it what it is, and the things that make archival a difficult process...
Hmm, more thought needed... I would plug my employers now, as information management and storage is our thing, but that would be crass... (unless anyone in the field of information storage wants a scholarship, in which case mail me for details).
Re:Data Decay, Readability, and ASCII text. (Score:2)
We've recovered data from thousands of years ago on crumbled bits of paper that are still quite legible despite the decay, and that paper was a new technology for some civilizations then.
[Of course, a better argument can be made for simply using clay tablets and inscriptions in stone. We've recovered carvings MUCH older than anything that's been found on paper. But you have to draw the line for convenience somewhere. ]
Any modern technology you're relying on is bound to be inadequate. Think about this: every technology for information storage invented in the last 200 years has failed for long-term use. A thousand years from now they'll look back on 1880-2000 as a series of dark ages. The only thing that will remain are the paper records. Even paper that's badly treated remains legible for long periods. There are archaeological surveys going on now in the garbage dumps of large cities (NY, for example) that are finding well-preserved newspapers from half a century ago. Newsprint is not a good paper, and newspaper ink is a poor ink. This says a lot for the staying power of a good technology.
Film deteriorates, magnetic media lose bits and their substrates crack and crumble, records lose crispness, wax and foil cylinders wear out. Take magnetic media for example: it was thought that with careful storage and infrequent use this stuff would last a long time. As it turns out, magnetic tape barely lasts 15 years under the best of conditions. We simply don't have enough experience with these technologies to know if they'll work.
Just because your 10-year-old CD's can still be played today doesn't mean they'll work in 2025. As a technological culture, we don't have enough experience with the materials to know. Old transparent plastics grow cloudy eventually -- optical storage will probably not be your saviour in the 21st century either.
Only after we've had a century (or two, or three) to observe technologies like CD-ROM will we know how they'll work for long-term storage. Until then, don't bet the farm.
Data preservation (Score:2)
This leads me to wonder how much context information we need to bequeath to our descendants in order for them to be able to understand the information we leave behind. Consider how much information we have from ancient times which we do not truly understand, because we do not have enough contextual information to know what was really meant. Look at how many conflicting translations there are of many of the documents that do still exist.
Even if we manage to prevent the degradation of the media on which the information is stored and the devices and software necessary to read the information are preserved, what of language shifts and culture gaps across time? We will still have the problem of information being lost as meanings of words change with time or as information is translated from one language to another. This is, in fact, exactly the same problem we face with the various software revisions for products like MS Word.
This is not to say, however, that we shouldn't make a significant effort to preserve information. I would also think that having a significant amount of contextual information (which should come along for the ride while preserving information) should help our descendants comprehend the information we leave behind. However, if our current track record for preserving contextual information is maintained, the outlook is not good for our descendants understanding our information in two or three centuries (assuming the information survives).
Well, that's my 93.2 cents worth on the subject.
Re:those AOL CD's (Score:1)
Assuming the future archaeologists uncover/engineer a way of reading our digital formats (and that assumes, of course, that our digital formats - like CDs - exist in any number in several thousand years), they'll easily uncover evidence of how we communicated. Think about it - how many references are there to "printing," paper, books, television, movies, etc., in common use today? In my email archives, I probably have hundreds of references to printing things out or watching TV.
Further, there will inevitably be documents left around. Look at such things as the Dead Sea scrolls, which survived two thousand years. If anything, they'll simply have a misunderstood idea of what we committed to paper (since "important" historical documents like the Constitution were written down, and everyday crap like the specs on my desk will no doubt be destroyed, they may simply conclude that paper was reserved only for important things).
Just an observation.
Why TeX is better than PostScript (Score:2)
So if you need to store formatted documents for archival purposes in a system where you may later need to output the documents in a different form, you should look at TeX...
Cheers,
Ben
Re:Is this likely? (Score:4)
BTW, I think the original author missed one future problem - encrypted information. I foresee hardware-based encryption becoming almost ubiquitous so that most data is encrypted. If encryption becomes universal, then much info will be encrypted that really wasn't burn-before-reading secret. What happens to all that information - of potential interest to historians looking back on the 21st century - under those conditions?
Interesting timing on this article (Score:2)
So, it looks like we're going to have to start transferring all those old ZX81 game tapes (Timex 2000 for our U.S. cousins) to CD-ROM then. That should be good for another 25 years of '3D Monster Maze'
--
Re:How long can it last? (Score:2)
Re:An excellent summary of the problem (Score:2)
Okay, you are right about that. I used ASCII as my example for three reasons. First, Slashdot is in English. Second, many if not most of the common character sets today are supersets of ASCII for compatibility. Finally, the primary but not sole input character set for TeX, which I mentioned, is ASCII.
As for wasted space, the amount of redundant information in every written language that I am aware of is very high. The actual information content of a single character is only a bit or two in context. That can be demonstrated with any good compression program. So, I would suggest that for saving space, either we all need to abandon our human languages for one with no redundancy (not a likely proposition) or compress everything we want to save and document the compression algorithm in uncompressed files, preferably with source code.
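That claim is easy to check with any general-purpose compressor. A sketch using Python's zlib on a hypothetical file of English prose; zlib typically lands around two to three bits per character, and stronger statistical models get closer to one.

    import zlib

    text = open("english_prose.txt", "rb").read()  # hypothetical input
    packed = zlib.compress(text, 9)
    print(f"{8 * len(packed) / len(text):.2f} bits per character")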
I disagree with your premise, although your conclusion would follow from it. The idea is to have human-readable streams of data that can be treated as if they are simply being sent to a tty. HTML is an excellent example. With a browser, it is enormously powerful and useful. Yet at the core, it is a sequence of characters that I can type and read. I can edit it without special tools. Admittedly, those tools can make achieving just the look I want easier. They can speed my writing and make the results more reliable, but they aren't necessary.
Re:Old issue (at least in real life) (Score:1)
It's not at all surprising, to me at least, that this paper was written by somebody at what was once the UMich school of library science, until they discovered that they could pump up their prestige and funding by going dot-edu.
- David
Somewhat of a paradox (Score:3)
Point 1:
With the amount of data that we produce, archiving it will take an increasing amount of time. How much new content is created daily? At best, we will plateau in a state where as much effort is required to archive content as is needed to create new content.
With the emphasis placed squarely on non-duplication of effort, archiving becomes a secondary issue. Indexing, searching, sorting and categorizing of the archive becomes a first priority, since creative efforts should now check if they are redundant.
If the bold statement is to be a guideline, then the idea of an archive is moot, since all new work depends on old work, and so tracks well with where the author feels human effort should go. Much like with biological evolution, new data is the fittest of the old data that was applicable to the new context. I suppose that the call for archives is little more than a suggestion that we need an organized and deliberate fossil record of how we got to where we will be at some point in the future.
What is needed is an archive, yes, but an archive of what? Not of content, but of the essence of the content. The lessons learned, the conclusions drawn and the optimizations realized in the process of creating the content. The content is fleeting - though arguably of inherent value... Which brings us to...
Point 2:
Yes, some things are irreplaceable. Who decides? Who defines what is art, what is fact, and what deserves eternal life?
Some things are of immediate and significant value, but for an unknown duration. The value of other things cannot be realized for a very long time, and so the alternative is to store everything. Furthermore, the value of certain data is totally subjective, and this begs the question of who's in charge of defining that 1% that is to be kept.
On the small scale, this will lead to vanity. Any 'artist' will consider their work a masterpiece, and save it. (I have code I wrote in CS101, don't you?) Companies will store and archive all email, all financials, anything that can potentially be used to mine data, identify trends or fertilize litigation. People will pigeon-hole videos of their baby's first steps, though nobody outside themselves really cares - unless the child grows up to be the next Einstein, or Hitler.
"Hitler" raises an interesting question on the larger scale. Who has the responsibility of deciding what 'big' facts to store? And isn't that the path to propaganda, history-making, and such things?
And then, when the leadership changes, and the 'book burning' starts...
To bring the concept down from the paranoid-sphere, let's recall the same issue with Newton and Leibniz. Leibniz was the German mathematician who beat Newton to the concepts of calculus. Newton, a member of the Royal Academy of Sciences (or something to that effect), politicised HIS influence, and so was credited with all of the work - where his contribution was not complete.
Some things are not outright lies, but oral histories get lost while written records persist.
Who gets to choose what to write down?
Resilient media (Score:2)
Perhaps they could even be made to work with existing CDROM drives, and perhaps even existing CD writers. Then you just start selling a new kind of disk. Anyone who wants something to last puts it on those. If they want lots of space per penny, they can buy something else.
--
grappler
Memetic perspective (Score:2)
Out of this fecund brew maybe, just maybe, a carrier as successful as DNA will emerge, with the capability to preserve the "best" of the information. Maybe it already has, in plain old text, which will be decipherable for as long as the bits can be gotten at, and which then has the benefit of the redundancy of human languages for further decoding and understanding. Then we drop down to the question of how exactly the bits manage to survive, and it seems the only ultimate answer is some human has to care enough to refresh them. Or be clever enough to teach them to take care of themselves.
It also seems clearly impossible that everything can be preserved, and also impossible that what is preserved will always be something to be proud of. Some extinctions, however tragic, are inevitable, and some, however richly deserved, never occur. It's part of the beauty (and maybe mercy) of conscious life that there are moments that will never appear again, can never be adequately captured for later replay. Being aware of that fact is what encourages us once in a while to put down the camcorder, shut off the microphones, maybe even try to still the stream of words in our heads, and just drink it in.
A paranoid addition... (Score:4)
One thing I believe was missed in the original article is intentional change to the historical record. In addition to having to store old information, and worry about how we're going to get to it later, I think we need to pay at least half a thought to intentional modification of the historical record.
With paper and ink, it's rather time consuming and expensive to alter historical documents, even assuming you can get near them. With digital media, the situation may be different - it may become very simple to alter historical documents, especially if you're the guy who's in charge of copying them to the newest form of media.
Aside from the obvious political reasons someone might want to do this (can you think of a fundamentalist movement of any sort that wouldn't modify old documents to read the way they would like, given the chance?), I can also see where money might come into play.
For instance, suppose MassiveDrugCo, Inc. is introducing a new drug which prevents newly detected disease Y. Now, in order to sell a lot of this drug, you have to show that Y harms enough people to worry about. Unfortunately, the historical record being used for retrospective studies doesn't show that. So, instead of going back to the drawing board and finding something else to cure, MassiveDrugCo instead feeds a modified copy of the historical data to unsuspecting independent researchers. These honest and unbribable researchers draw the conclusion desired by MassiveDrugCo - in spite of the reality of the situation.
Wired article on this topic (Score:2)
There is also a very well-written, very accessible article on this topic, titled "Saved", available at Wired magazine's archive [wired.com]. It was written by Steven Gulie in 1998, and I distinctly remember reading it and thinking it had a profound impact on my thinking about this topic.
Take a look. -Paul
Re:Just carve it (Score:1)
Granite probably isn't the best choice either. Over time the feldspar in the granite breaks down, and the rock falls apart. Pollution and water accelerate this process. Basalt would probably be a better choice, or pure quartz, or some corrosion-resistant metal like gold or platinum.
As for me, I'm backing up my data by encoding ASCII text as a pattern of platinum-plated titanium pins hammered into a slab of good dense shale. After that, I'll drop the slabs into the Mississippi delta, and in a few million years, my wit and wisdom will become part of the rock strata. The MTBF should be about 100 million years, barring a major tectonic event.
</offtopic>
Civilization Bootstrapping (Score:4)
Are you trying to preserve episodes of the Simpsons so our relatively near-term, technologically advanced descendants can watch them? Well, they're technologically more advanced and thus more clever than we; we just need sufficiently stable media (micromachined gold plates would work nicely) and either a simple-minded encoding scheme or an easily readable description of the algorithm prepended. In the 22nd century, some bright Norwegian 16-year-old armed with a yottaflop computer will figure out how to read it if he cares enough.
A bigger concern (in my opinion) is what happens when our civilization collapses. Historically, it is almost certain to happen sooner or later. Rome lasted well over a thousand years; if you had told a 1st-century CE Roman that there would ever be an end to the empire, he'd have thought you were crazy. Yet our civilization is in many ways much more fragile, because the information it is based on is in a much more ephemeral form (both media and format).
What we need is to devise a bootstrap procedure.
(1) Reading primers in various languages.
(2) Primers on basic technology: mathematics, simple mechanics, mining and elementary metallurgy.
These should be in highly durable form, but you don't want people making off with them for building materials. The problem with using gold plates is that you don't want people to have access to them until the information on them is more valuable than the substrate. Perhaps these first items could be carved onto stone pillars inconveniently large to move.
Next, you need repositories on more advanced science and technology: chemical engineering, electronics and so forth. Perhaps you could rig a way to prevent savages from accessing these repositories; a mechanical puzzle perhaps, that requires a certain mathematical sophistication to solve. The most critical records could be kept in forms that could readily be read without mechanical assistance, or with only simple mechanical assistance such as optical magnification (my local librarian likes microfilm, because she knows it will be readable for decades). Less critical things like old Simpsons episodes could be on very cryptic media that would require considerable technical finesse to read, but would be cheap to transfer to.
Pretty much, as you go from the most basic and critical information to the least critical information, you go from the easiest to read and most expensive to produce per bit, to the hardest to read and most convenient to produce.
Value degradation (Score:2)
Sure, poems and photos for the grandkids. That's a hundred years, tops, and migration, translation and CDR cover it, fairly easily. As for showing pictures to people who will have only vaguely heard of me? Or preserving the IRS tax code for four thousand years? Somewhere, I'm sure, is codified the idea that data is useless without context. If not, there it is: Nyarly's First Thought on information theory. I'm sure it is, though...
But my noodlings with fiction, my code, my photos and graphics won't be any more useful without the cultural context they were created for than an arbitrary collection of 16 bits without a description. Is that a Float or a Fixed? Is that English or Spanish?
And if a modern creator does produce something of Eternal Meaning, there's precedent for its propagation by those it has meaning for. Think of the Bible, or the Collected Works of Shakespeare. These continue to exist not because they were recorded perfectly on a perfect medium, but because people found them worthwhile enough to continue them.
What good would a perfect storage method be, anyway? If people forget it, or if they cease to care, a record could be painted in Liquid Unobtainium on God's backside, and it would be just as lost as if someone had scratched it in sand. Or on the base of a bronze statue. "Look on my works, ye mighty..."
Paper rots, stone erodes, metal corrodes. The only eternal medium is word of mouth. Anything else is just a memory aid.
Re:those AOL CD's (Score:3)
Actually, no. Paper has proven to be one of our most durable ways of storing data. Egyptian papyrus from 3000 years ago is still more or less intact. CD's, on the other hand, will last for 100 years in a "BEST CASE" scenario. Most will last much less time. CDRs might last 25 years. There are other variables besides the media itself. I've seen several CD's from the early 80's that refuse to play these days. They're not at all scratched, but the theory goes that the original ink used to print on them was actually a bit corrosive over a great span of time.
In order to remain readable, digital data must remain more or less intact. A few missing bits in the application needed to open a file can pretty much reduce your odds of opening that file to zero. Analog data, on the other hand, degrades much more gracefully... It may start to fade, but there's no intermediary between having the data and being able to read it (you don't need an extra "application" to read a newspaper).
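That brittleness is easy to demonstrate. In this Python sketch, flipping a single bit in the middle of a compressed stream renders the entire stream unreadable:

    import zlib

    original = b"Analog degrades gracefully; digital does not. " * 100
    packed = bytearray(zlib.compress(original))
    packed[len(packed) // 2] ^= 0x01  # flip one bit in the middle

    try:
        zlib.decompress(bytes(packed))
    except zlib.error as err:
        print("one flipped bit, whole stream lost:", err)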
I've heard that this is actually going to be one of the least documented periods in human history, because all of our data is stored digitally and periodically purged. Even if it's not, places like NASA are generating data faster than they're able to back it up and move their old archives onto newer media.
Re:Dead Media Problems (Score:2)
Some of you will, no doubt, remember the issue of whether or not Heisenberg was building an atomic bomb for the Nazis, and if so, whether he was actively interfering with the project because he disagreed with the Nazis' goals. It turned out that after the war, Heisenberg and some other scientists were being held in Britain. The British secretly recorded all of their conversations. The medium? Spools of wire. (Think of a spool of wire being used just like a magnetic tape.)
A few years back some scholars wanted to listen to these recordings and had a terrible time finding a player; wire recorders have not been made since the fifties. Eventually they found a collector who had one in working order, and they carefully transcribed the recordings. (And it seems that Heisenberg was actively trying to build a bomb, but lacked the resources to do so.)
How interesting that they had problems finding a working wire recorder. At any time, there are between a half dozen and a dozen wire recorders for sale on eBay. The circuitry of a wire recorder is so simple that any good old-school tube radio repairman could get one working in an afternoon.
Wire recordings are an example of an early technology that turned out, unintentionally, to be a fantastic archival medium. Sure, the recording is monophonic, and the frequency response is limited, but for voice recording those are acceptable compromises, considering that a spool of stainless steel wire can last for centuries. Short of physically destroying the spool, or deliberately erasing it, it will not decay. There's no plastic backing to decay. There are no oxide particles to flake off. Just corrosion-proof steel wire. Fantastic!
I have dozens and dozens of original wire recordings from the late 1940s and early 1950s, and they all sound as good today as a freshly recorded wire.
"Proprietary Formats" are still a problem (Score:3)
The DVD Forum [dvdforum.com] (formerly known as the DVD Consortium; it oversees the DVDCCA, and is the group of companies that cross-license each other's patents and share information regarding DVD development) currently requires you to sign a non-disclosure agreement (NDA) to obtain the specifications, and that NDA also prohibits you from even discussing the specifications with anybody unless they have also signed the same NDA. Since this is covered under trade secret law, this particular bit of intellectual property is theirs theoretically forever. At least until you can hire a bunch of lawyers to demonstrate that a DVD is no longer a trade secret.
I've also set up a separate mailing list from the main Linux Video group that is in the process of developing an Open Video Disc [mailto] specification, which is trying to allow people to develop products without having to pay royalties or deal with patent infringements. Fees for the current video formats range from over $10,000 for the DVD specs (license fees are on top of that) to the terms of the MPEG Licensing Authority [mpegla.com], which are quite reasonable for most closed-source projects but, if you read the details of what you must do to license a product, contrary to the nature of most open-source projects. It is still possible to write a GPL'ed MPEG player, but it would only be free as in speech and not free as in beer. In fact, you would probably have to charge somebody to download the software. Shareware MPEG players are probably skating on some very thin ice legally, and certainly part of the registration costs would have to go to the MPEGLA.
One of the things that is so nice about HTML is the fact that the standard is open, patent-free and royalty-free. If CERN had tried to patent HTML, I doubt that the web would have developed nearly so quickly. Or imagine if Apple's HyperCard system had been developed under the GPL, with file formats open for anybody on any platform to use.
One of the things that I believe is killing the Unicode character encoding is that all kinds of intellectual property restrictions are placed on it, and you need to pay royalties to develop much software that uses it. Again, think what would have happened with ASCII had it been kept closed up, and why EBCDIC isn't being used for character encoding.
More importantly, open and free specifications are critical to data preservation, a point that really hasn't been brought up by Calc (the author of the original post).
Black Hole Applications Software (Score:2)
One of the email programs that I use stores everything in a database file. Short of saving messages to files, one at a time, there is no way to extract the messages from the database.
Reducing data for archiving purposes (Score:2)
One of the big problems with storing data is the sheer size of it. In astronomy, almost all data collected by telescopes, be they radio, optical or otherwise, goes through a stage known as 'Reduction'. I've put this in quotes mainly because it doesn't necessarily reduce the size of the data. In essence, Reduction is about obtaining the most important or most complete information out of the data and discarding or minimizing the redundant, the useless and the misleading, so that future analysis can be carried out on the important stuff without having to wade through all the noise. For instance, 70 or 80 images of one optical observation in various wavelength bands will be collapsed into three to five optimal images, one for each band. In radio astronomy, collating 60-80 twelve-hour observations into one file removes all the 'bad' data and leaves something optimal for future reuse.
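For the curious, here is what a crude version of that collapsing step can look like; a minimal NumPy sketch (the array shapes and the median-combine choice are illustrative assumptions, not the actual pipeline described above). A median combine rejects transient artifacts, such as cosmic-ray hits, that appear in only a few frames.

    import numpy as np

    rng = np.random.default_rng(seed=42)

    # Pretend we have 80 exposures of the same field in one wavelength band:
    # a fixed "sky" plus per-frame noise and occasional cosmic-ray hits.
    n_frames, height, width = 80, 64, 64
    sky = rng.random((height, width))
    frames = sky + 0.1 * rng.standard_normal((n_frames, height, width))

    # Sprinkle cosmic-ray hits: bright spikes in random frames and pixels.
    hits = rng.random((n_frames, height, width)) < 0.001
    frames[hits] += 50.0

    # Median-combining along the frame axis rejects the outliers,
    # collapsing 80 noisy frames into one clean image for the band.
    combined = np.median(frames, axis=0)
    print("max pixel before:", frames.max(), "after:", combined.max())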
To make a useful archive requires some filtering of what goes into the archive. Nowadays I work for IBM on DB2 UDB, and the roadmaps suggest that the size of databases is growing exponentially; fortunately this is balanced by proportional growth in processor power, storage space and access speed. So while we have terabyte databases today, we could easily be looking at petabyte databases in a few years. These databases will probably hold a vast amount of digitized analogue information (memos, diagrams, papers) which is currently kept in more conventional storage.

The advantages of moving to a fully digital archive are great: searching and retrieval are faster, and the space saved by putting scans of 20 boxes of papers onto a hard drive or other storage is considerable. However, there is a danger in archives growing out of control: if you initiate a search that must visit every part of a petabyte database, you are going to have to wait for it to finish, even with the best search algorithms and vastly faster hardware. Making sure that information is not duplicated many times over, and that redundant data is not added without regard to retrieval performance, is extremely important.

If we set up a project to 'mirror the web' for archival purposes, we'll be hamstringing ourselves right at the start; most data is not needed for future reference. By applying methods to distill the important information, archives can be updated, maintained and searched without exhausting the available resources.
Cheers,
Toby Haynes
Re:Analog vs. Digital (Score:2)
One major problem with digital formats is the absence of error recovery in common hardware and software. 99.9% of the data may be intact but one bad block at the beginning of a magnetic tape can make all of the data unrecoverable.
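To make the contrast concrete, here is a minimal Python sketch (the fixed-block framing is invented for illustration, not any real tape format): give every block its own CRC, and a bad block announces itself while the rest of the archive stays readable, instead of one bad header dooming everything after it.

    import zlib
    import struct

    BLOCK = 512  # fixed-size data blocks, like tape records

    def pack(blocks):
        """Store each block with its own CRC so damage stays localized."""
        out = bytearray()
        for data in blocks:
            data = data.ljust(BLOCK, b"\x00")
            out += struct.pack(">I", zlib.crc32(data)) + data
        return out

    def unpack(raw):
        """Yield (ok, data) per block; a bad block doesn't doom the rest."""
        step = 4 + BLOCK
        for off in range(0, len(raw), step):
            crc, = struct.unpack_from(">I", raw, off)
            data = raw[off + 4 : off + step]
            yield zlib.crc32(data) == crc, data

    archive = bytearray(pack([b"block %d" % i for i in range(4)]))
    archive[4 + 3] ^= 0xFF  # corrupt one byte inside the first block
    for i, (ok, data) in enumerate(unpack(bytes(archive))):
        print(i, "ok" if ok else "CORRUPT", data[:8])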
The Internet Archive (Score:2)
And I have just found an article from Steve Baldwin, the guy from Ghost Sites [disobey.com]!
Upgrade fever speeds the process (Score:2)
A modest proposal (Score:3)
I suspect the problem of file formats is less serious than people make it out to be. A well-documented format should be reconstructable indefinitely, and few software companies fail to document their file formats at all. Even without documentation, it ought to be easier than reconstructing dead languages: we learned to read Egyptian hieroglyphs primarily from one attested translation and a lot of careful deduction. Given a thousand Word 6 documents, I think a good computer archeologist ought to be able to construct a program to open and edit them.
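As a hedged taste of what that computer archeology might look like at step one (the script and file name here are hypothetical): before reconstructing a format, you pull the runs of printable text out of the mystery bytes, much as the Unix strings tool does, which alone often recovers most of the prose in a word-processor file.

    import re
    import sys

    def printable_runs(blob, min_len=4):
        """Extract runs of printable ASCII from an undocumented binary.

        Finding the readable text is usually the first step in
        reverse-engineering an unknown file format."""
        pattern = rb"[\x20-\x7e]{%d,}" % min_len
        return [m.group().decode("ascii") for m in re.finditer(pattern, blob)]

    if __name__ == "__main__":
        # Usage: python strings.py mystery.doc   (file name is hypothetical)
        with open(sys.argv[1], "rb") as f:
            for run in printable_runs(f.read()):
                print(run)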
Museums of old hardware, and perhaps some sort of custom computer factory to fabricate ancient hardware, strike me as a good idea. It could be like blacksmiths at SCA festivals: "Ye Olde ASIC Mill."
The real problem strikes me as the one most heavily emphasised in the article: decaying media. I suspect the best solution with presently foreseeable technology would be to preserve data in crystallised DNA. Even in nature, DNA takes centuries to decay, and if it were crystallised and kept somewhere cool and dry, it would likely last for millennia. A document encoded onto a billion strands of DNA weighs basically nothing, and it would be a very highly redundant storage system.
It isn't easy to do right now, but I suspect that technology is right around the corner and probably only requires a little bit of research money to become practical.
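Purely to make the encoding half of the proposal concrete (the two-bits-per-base mapping below is an illustrative assumption, not an established scheme): DNA's four bases can each carry two bits, so encoding a document is a straight change of radix; the redundancy would come from synthesizing many copies of each strand.

    # Map two bits to one of DNA's four bases; any fixed assignment works.
    BASES = "ACGT"

    def to_dna(data: bytes) -> str:
        """Encode bytes as a DNA sequence, two bits per base."""
        return "".join(
            BASES[(byte >> shift) & 0b11]
            for byte in data
            for shift in (6, 4, 2, 0)
        )

    def from_dna(seq: str) -> bytes:
        """Decode a base sequence back into bytes, four bases per byte."""
        out = bytearray()
        for i in range(0, len(seq), 4):
            byte = 0
            for base in seq[i : i + 4]:
                byte = (byte << 2) | BASES.index(base)
            out.append(byte)
        return bytes(out)

    message = b"preserve me"
    strand = to_dna(message)
    assert from_dna(strand) == message
    print(strand)  # e.g. the byte 0x70 ('p') becomes CTAA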
Sometimes the 99% is destructive (Score:2)
Worse yet, the labor involved with separating out the 1% of stuff that ought to be kept is going to mean a non-zero error rate; people will toss things that are still of value just because they have no time to examine them in detail. What are you going to do....
Re:NASA problem too (Score:4)
Problem is, it's entirely possible for us to not understand the importance of a data collection for years. That old Landsat data would be a great baseline for information about global climate change.
Re:An excellent summary of the problem (Score:2)
Don't even get me started on those luddites who still insist on using dried wood pulp as their storage medium. It's as if they think all information metaphors equate to a 16th-century printing press.
Re:those AOL CD's (Score:3)
I read an article once in that hotbed of liberal thinking, Reader's Digest, about book deterioration. Older books were printed with a different method and will last a couple hundred years. Newer books will only last maybe 50 years.
This raises the question: how long will computer printouts last?
But what is the solution? (Score:2)
But, isn't this missing the point?
The problem exists because products like Word build in incompatibilities to force consumers to always purchase the newest product. We don't have to accept this.
The solution is to promote open document standards for everything. This should be part of the decision process when organizations are choosing applications.
Hopefully, in the near future, we will be able to choose an office suite that stores everything in XML format, and uses open object types like PNG or JPG images.
Also, exporting images to a format like PDF or PostScript would solve a lot of problems. Open Source applications exist for both of these formats, ensuring that you are not at the mercy of the application vendor.
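As a sketch of what that could look like (the element names here are invented for illustration, not any proposed standard), Python's standard library can already write such a self-describing document:

    import xml.etree.ElementTree as ET

    # Build a toy document in an open, self-describing format.
    doc = ET.Element("document", title="Preservation notes")
    para = ET.SubElement(doc, "paragraph")
    para.text = "Open formats outlive the applications that wrote them."
    ET.SubElement(doc, "image", href="figure1.png", format="PNG")

    # Write it out as plain UTF-8 XML, declaration included.
    ET.ElementTree(doc).write("notes.xml", encoding="utf-8",
                              xml_declaration=True)

    # Decades later, any XML parser (or a human with a text editor)
    # can read the file back without the original office suite.
    print(ET.parse("notes.xml").getroot().find("paragraph").text)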
Re:Black Hole Applications Software (Score:2)
I use MH to handle my mail. I can use all the standard MH tools as well as nice front ends like exmh, and they give a reasonable amount of power; but better yet, since MH stores one message per file in a plain format, I can use find, grep, perl, emacs, and all our other friends to manage my messages. I write my documents and correspondence in HTML and/or LaTeX (often via LyX) for the same reason. If it's supposed to have written-language content and I can't grep it, it sucks.
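In the same spirit, a small sketch (the folder path is an assumption; MH's actual layout is one plain-text message per numbered file) of how little machinery such a store needs:

    import pathlib
    import re

    def grep_mail(folder, pattern):
        """Search an MH-style folder: one plain-text message per file."""
        regex = re.compile(pattern, re.IGNORECASE)
        for msg in sorted(pathlib.Path(folder).iterdir()):
            if not msg.is_file():
                continue
            text = msg.read_text(errors="replace")
            for lineno, line in enumerate(text.splitlines(), 1):
                if regex.search(line):
                    print(f"{msg.name}:{lineno}: {line.strip()}")

    # Hypothetical MH folder; no mail client or database needed.
    grep_mail("Mail/inbox", r"digital preservation")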
Re:those AOL CD's (Score:2)
Laser prints, I'd guess, will be fairly durable... inkjet prints, probably not... that's just a pure guess, though.
Re:A paranoid addition... (Score:3)
I believe it was King Tutankhamun's father, Akhenaten, who threw his world into a tizzy by rejecting the established religion and inventing a new one that worshipped the sun. He went off and built a new city to go along with it too.
Well, the bureaucracy of the day didn't like this at all, because it messed with their job security. And as soon as he was dead, they went around hacking his face off anywhere it appeared (of course we're talking about monuments, etc., made from stone), and I believe they went after any mention of him in text (hieroglyphs) too.
And they almost got away with it and just about completely expunged his existence from their records. But they missed a few things and we've been able to piece together a little bit about him.
So anyhow, there's my Discovery channel understanding of that little story. What it means in relation to this subject I'm not quite sure. I thought it was a good idea to point out that this is certainly not a new issue.
Re:more info (Score:2)
Very relevant project (Score:2)
Re:An excellent summary of the problem (Score:2)
This doesn't address the most difficult part of this problem: multimedia. Images and sounds don't have the equivalent of ASCII. There is no universal standard that all tools handle the same way. GIF used to be like that, but look what happened to it. JPEG is nice, but it's lossy, so there goes your perfect archive.
Then there is the further problem of giving Joe Computer User out there the capability of building a "digital" history. With companies like Kodak and Apple goading people into using proprietary data formats like FlashPix or QuickTime, it's an uphill battle. And once again, there's no ASCII equivalent to fall back upon.
Ugh... this really brings back the corporatism-fueled pessimism I was feeling earlier with the DVD/DeCSS debacle.
Oxryly
The web embodies decay ... (Score:2)
I don't know about anyone else, but there's something disturbing about this fact. Even the first few sites I did back in the early-to-mid '90s have been lost forever, and while they were fairly insignificant, it's not an uncommon occurrence for information to be lost.
Re:An excellent summary of the problem (Score:2)
<sarcasm>
Oh yeah, all right-thinking people speak languages that can be represented by the letters available in ASCII. Yes, Unicode was invented for all the wrong-thinking people who insist on using those funny looking letters with lines or dots around them, or arcane characters that no right-thinking person can understand anyway.
</sarcasm>
If you use the UTF-8 encoding and you restrict your text to the characters available in ASCII, the resulting text is ASCII. Besides, do you have any idea how hard it is to write the credits for a big free software project these days in anything other than Unicode without mangling somebody's name?
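That compatibility claim is easy to verify; a tiny Python check:

    text = "plain ASCII text"
    # UTF-8 encodes the ASCII range to the very same single bytes,
    # so pure-ASCII text is byte-for-byte identical in both encodings.
    assert text.encode("utf-8") == text.encode("ascii")

    # Non-ASCII letters simply take extra bytes; nothing else changes.
    print("Ångström".encode("utf-8"))  # b'\xc3\x85ngstr\xc3\xb6m'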
Re:BBC article (Score:2)
What? I'm sorry, but open source IS NOT a magic bullet. There are two problems with your statement, and ironically they work in opposite directions.
Re:Civilization Bootstrapping (Score:2)
Back in the 1950s and 1960s, the U.S. Office of Civil Defense actually did that. A library of information on how to make and do key practical and industrial operations was created, microfilmed, and thousands of copies placed in fallout shelters. This was beyond the usual survival-handbook stuff; more like "how to build or fix an oil refinery/power plant/water system/auto factory" information.
If anyone knows where a copy of those microfilms still exists, please let me know. Thanks.
Re:Civilization Bootstrapping (Score:2)
Re:Problems with modern paper (Score:2)
So much for papyrus; the Romans generated vast amounts of data and wrote a lot of it in ink on wooden tablets... a fact we only realised comparatively recently because so few survived. So much for wood...
If you want a really durable medium for conveying information through the ages (and you have to exclude stone a) because it is much prized for re-use and b) because a lot of it, particularly the soft sandstones the Romans often used for monumental inscriptions, weathers rapidly) you need ceramics. Pots are vulnerable, but potsherds virtually indestructible. Ostraka (graffiti on sherds) survive now in as good condition as the day they were scratched, not something you can say of ancient papyrus, vellum, wood etc etc.
So, start working out a way to put data onto dinner plates, and you have the perfect storage medium... sort of...
Re:Data Decay, Readability, and ASCII text. (Score:2)
"...magnetic media or when a CD begins to decompose, that bit is lost forever."
ECC would be a beginning (CD-ROM already uses it; that's why a 74-minute CD-ROM only holds 650 MB of data). Microencoding of some sort in a durable substance is perfectly acceptable, as long as the instructions for building the reader are in a more readily accessible format.
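To make the ECC idea concrete, here is a toy Hamming(7,4) encoder/decoder in Python; note this is only an illustration of the principle, since CDs actually use the much stronger cross-interleaved Reed-Solomon coding. The point is that redundant bits let a reader correct, not merely detect, a flipped bit.

    def hamming74_encode(nibble):
        """Encode 4 data bits as 7 bits with 3 parity bits."""
        d = [(nibble >> i) & 1 for i in range(4)]        # d0..d3
        p0 = d[0] ^ d[1] ^ d[3]
        p1 = d[0] ^ d[2] ^ d[3]
        p2 = d[1] ^ d[2] ^ d[3]
        # Bit positions 1..7; parity bits sit at positions 1, 2 and 4.
        return [p0, p1, d[0], p2, d[1], d[2], d[3]]

    def hamming74_decode(bits):
        """Correct up to one flipped bit, then return the 4 data bits."""
        b = bits[:]
        s0 = b[0] ^ b[2] ^ b[4] ^ b[6]
        s1 = b[1] ^ b[2] ^ b[5] ^ b[6]
        s2 = b[3] ^ b[4] ^ b[5] ^ b[6]
        syndrome = s0 | (s1 << 1) | (s2 << 2)   # 1-based error position
        if syndrome:
            b[syndrome - 1] ^= 1                # repair the damaged bit
        d = [b[2], b[4], b[5], b[6]]
        return d[0] | (d[1] << 1) | (d[2] << 2) | (d[3] << 3)

    word = hamming74_encode(0b1011)
    word[5] ^= 1                                # simulate media decay
    assert hamming74_decode(word) == 0b1011     # still recovered intact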
Re:Data Decay, Readability, and ASCII text. (Score:2)
"Just because your 10-year-old CDs can still be played today doesn't mean they'll work in 2025."
Some of my 10-year-old CDs WON'T play on a new CD player, but WILL on an ancient player. Only ten years, and there's enough drift in standards to make that happen. Of course, neither player is top of the line; perhaps a really good one would play the old CDs.
Digital media is analog underneath (Score:2)
Even if a standard CD player can't play a degraded CD, if someone wants the data badly enough, they'll build an error-correcting CD player that will reconstruct bits a normal player can't read. Just like archeologists reconstruct paper or hieroglyphs or fossils today, future archeologists will no doubt reconstruct CDs and hard drives.
Even today, data recovery specialists can read off multiple generations of files. Maybe archeologists will have optical readers which read the CD or magnetic surface at many times the original resolution and sensitivity and reconstruct the data. Of course, it would be nice for us to leave them some equivalent of the Rosetta Stone so they can decipher the various formats. But overall, I think today's digital media will be far more recoverable than people might think.
Just a thought.
-dialect
CD-ROMs improving (Score:2)
Re:A contrarian view (Score:2)
Books are not as good as you think, though. People need to (a) be able to read and (b) know the relevant language for archives of old books to be of any use. Neither is completely automatic. Also, to make real use of old books, readers need a fair amount of cultural context, and that is positively expensive to acquire.
Character sets... (Score:2)
Some of us don't have English as our mother tongue. That is something that is too often forgotten at places like Redmond.
(OK this in slightly OT, but I'll rant anyway.)
Between DOS and Windows, that company we all love to hate decided to change character sets. Suddenly three letters in the Swedish alphabet had new character codes. One and a half decades later (count that in Internet time...) we are still struggling with documents with mixed encoding.
That means every damn application has to provide a way to recode OEM to ANSI, AND deal with users who try to do this conversion on files that were already converted.
This is *before* dealing with unix and mac files.
So if we can't read freaking text files after ten years, how are we supposed to read binaries?
Sometimes I just get too tired...
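For what it's worth, the mechanical half of the fix is tiny; a hedged Python sketch, with cp437 and cp1252 standing in for "OEM" and "ANSI" (the real pain, as noted above, is knowing which files were already converted):

    # Swedish text as stored by a DOS-era application ("OEM" code page).
    oem_bytes = "på svenska: å ä ö".encode("cp437")

    # Recoding to the Windows "ANSI" code page is mechanically trivial...
    ansi_bytes = oem_bytes.decode("cp437").encode("cp1252")
    print(ansi_bytes.decode("cp1252"))  # på svenska: å ä ö

    # ...the hard part is knowing which files were already converted.
    # Converting twice silently mangles every non-ASCII letter:
    mangled = ansi_bytes.decode("cp437").encode("cp1252", errors="replace")
    print(mangled.decode("cp1252"))     # garbage where å ä ö used to be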
Re:A contrarian view (Score:2)
Re:Umm.. (Score:2)
I think the real reason why ASCII was adopted had nothing to do with the computer industry, but rather with Western Union (which was a part of American Telephone and Telegraph at the time). The teletype machines used ASCII, and it proved to be a stock terminal option for many years. The control characters are a legacy of this heritage as well. The reason for the ASCII codes of #13 followed by #10 is that the teletype machines had to be told to physically move the print head to the left side of the carriage and then scroll the paper up one line. How many systems still need these codes, even though all they really mean is that the cursor is moved to a new position on the monitor?
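Those teletype ghosts are still visible in every text file; a short Python illustration (the file name is invented) of the conventions they left behind:

    # CR (#13) returned the teletype carriage; LF (#10) advanced the paper.
    # The pair lives on as the line terminator on DOS/Windows and in many
    # network protocols, while Unix kept only LF.
    line = "carriage return + line feed"
    dos_style = line + "\r\n"    # bytes 13 then 10 at the end
    print([ord(c) for c in dos_style[-2:]])   # [13, 10]

    # Python's universal-newlines mode hides the difference when reading:
    with open("legacy.txt", "w", newline="\r\n") as f:
        f.write(line + "\n")                  # written to disk as CR+LF
    print(open("legacy.txt").read() == line + "\n")   # True: plain LF again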