Project Gutenberg's 32nd Birthday
David Moynihan writes "July 4th marks the 32nd anniversary of that day in 1971 when Michael Hart first sped an all-caps version of the Declaration of Independence to anyone and everyone then on what later became the web, thus founding Project Gutenberg. Thanks to an army of volunteers and the Distributed Proofreaders, this is the last year PG will have fewer than 10,000 titles.
Strangely, Microsoft picked this dual anniversary of literacy and freedom to re-launch their Reader product, with three free bestsellers a week, if you activate the new version with Passport, sign a EULA, etc. The real reason for the upgrade might be that the DRM on MS's old Reader was cracked. If you're not into giving away data, or are running a system other than Windows, maybe you could take the time to tell a friend about free books online, or even help out by visiting the Distributed Proofreaders and editing one page per day."
Now for the marketing... (Score:5, Insightful)
Re:Now for the marketing... (Score:1, Insightful)
'reader' books not much cheaper (Score:4, Insightful)
I went to the MS Reader site and followed the links to the on-line publishers sites (such as B&N and amazon). In most cases, the reader format is only $1 cheaper, and sometimes $2 more expensive, than the corresponding paper book (soft or hardcover).
So... why in the world would anyone want to use a format that ties them to the computer?? With a paperback, I can read it anywhere, read for as long as I want without having to change batteries, and even pass the book on to a friend.
If they want to make the electronic formats more attractive, they need to make them a LOT cheaper than the corresponding paper version.
Huh??? (Score:2, Insightful)
XML please (Score:4, Insightful)
Re:Really great work by the guys behind the projec (Score:4, Insightful)
I remember poking around on PG not long ago but soon forgot about it.
If you're not looking for something specific then the site is kinda, meh. As you suggested, they need a news site, ratings, and other stats so you can see what's available.
And sections. "Technical", "Poetry", etc. Otherwise it's not very useful to the casual browser.
Re:XML please (Score:4, Insightful)
This is complete bullshit. With a proper setup you would convert the source into multiple output formats, including TXT, but you would keep the source in a format that maintains meta information such as formatting, chapters and pages. XML is used in the entire industry exactly with the expectation that it will be around for decades. Even if it won't, the open source code that we have to parse it will not magically disappear -- PG would keep using it to generate output texts from the XML source through all these years. You might as well argue that ASCII will go away.
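A minimal sketch of that single-source setup, in Python. The element names (`book`, `chapter`, `title`, `p`, `em`) are made up for illustration and are not any actual Project Gutenberg format; the only point is that one marked-up source can be rendered as both TXT and HTML.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical source markup; these element names are assumptions.
SOURCE = """<book>
  <chapter>
    <title>Chapter 1</title>
    <p>It was a <em>bright</em> cold day in April.</p>
  </chapter>
</book>"""


def inline_text(elem, emphasis, esc=str):
    """Flatten one paragraph, passing <em> content through emphasis()."""
    parts = [esc(elem.text or "")]
    for child in elem:
        inner = inline_text(child, emphasis, esc)
        parts.append(emphasis(inner) if child.tag == "em" else inner)
        parts.append(esc(child.tail or ""))
    return "".join(parts)


def to_txt(root):
    """Plain-text output: titles in capitals, emphasis as _underscores_."""
    blocks = []
    for chapter in root.iter("chapter"):
        blocks.append(chapter.findtext("title", "").upper())
        for p in chapter.iter("p"):
            blocks.append(inline_text(p, lambda s: "_" + s + "_"))
    return "\n\n".join(blocks)


def to_html(root):
    """HTML output: the same source, with emphasis kept as real markup."""
    out = []
    for chapter in root.iter("chapter"):
        out.append("<h2>" + escape(chapter.findtext("title", "")) + "</h2>")
        for p in chapter.iter("p"):
            body = inline_text(p, lambda s: "<em>" + s + "</em>", escape)
            out.append("<p>" + body + "</p>")
    return "\n".join(out)


root = ET.fromstring(SOURCE)
txt = to_txt(root)
html_out = to_html(root)
```

The plain-text reader loses nothing compared with today, while the HTML reader gets the emphasis that a text-only master copy would have thrown away.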
Re:XML please (Score:2, Insightful)
XML hasn't been around long enough to say whether it is a fad or not. ASCII has been around longer than most of us have existed.
Re:XML please (Score:5, Insightful)
I think the basic problem with the Gutenberg/DP people is that they've been doing things a certain way for so long, and they don't want to retool. And I can see their point -- changing over to XML is a lot of work. And the core DP team already seems pretty busy keeping the web site going.
On the other hand, I do wish they'd make it a priority. Right now I'm a volunteer proofreader, concentrating on getting out the famous Britannica 11th edition [wikipedia.org]. The amount of information that gets lost in scanning in Greek and other text with weird phonological conventions is just appalling. And the conventions for math and science formulas and equations produce a complex linear format I can't believe anyone would actually want to read.
Then again, it wouldn't be that hard to go back and insert proper markup. For 90% of the text there's a simple transform between the Gutenberg conventions and a reasonable XML format. The other 10% probably need another look anyway, and wouldn't be hard to do if they've saved the scan images. I haven't had the heart to ask if they do.
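To give a rough idea of how simple that transform could be for the easy 90%, here is a Python sketch. The two plain-text conventions it handles (`_word_` for italics and `[Greek: ...]` for transliterated Greek) and the output tags (`i`, `foreign`) are assumptions chosen for illustration, not the official PG/DP rules or any agreed XML format.

```python
import re
from html import escape

# Assumed plain-text conventions (illustrative only):
#   _word_        marks italics
#   [Greek: abc]  marks a transliterated Greek passage
ITALIC = re.compile(r"_([^_]+)_")
GREEK = re.compile(r"\[Greek: ([^\]]+)\]")


def to_markup(line):
    """Convert one plain-text line into a hypothetical XML paragraph."""
    text = escape(line, quote=False)  # protect &, < and > before adding tags
    text = ITALIC.sub(r"<i>\1</i>", text)
    text = GREEK.sub(r'<foreign lang="el">\1</foreign>', text)
    return "<p>" + text + "</p>"


sample = "The word [Greek: logos] is _not_ lost & stays searchable."
result = to_markup(sample)
```

Anything these patterns do not match (tables, formulas, ambiguous capitals) would fall into the other 10% that needs a human look.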
Re:XML please (Score:3, Insightful)
Re:XML please (Score:4, Insightful)
As for XML being long dead, this is highly unlikely. XML is just structured data and is itself just text. It would be trivial 5, 10, or even 100 years from now to pull out the data from the XML format in any way you please. Unless the grammar is horribly mangled (MS Office), it would even be possible to infer the structure without knowing the grammar. I would trust Gutenberg to collectively come up with a format which would be simple for proofreaders and parsers alike.
Re:Now for the marketing... (Score:2, Insightful)
The first Gutenberg books I came across were being passed around BBSs at 2400 bps or so. When they started 32 years ago, 110, maybe 300 bps. Who cares? Check the size of the files, these aren't Word documents, you know.
Re:Huh??? (Score:3, Insightful)
I think they are saying in 1971 it was distributed to anyone and everyone...
Then, on what later became the web, they distributed it there too.
Keep in mind that the web took most of its ideas from Gopher, and FTP before that, so the web wasn't a breakthrough idea out of nothingness.
But I don't think they meant it as 'distributed on one medium, which later turned into the web'.
That's at least how I believe it was supposed to be read. Hard to tell without commas and whatnot.
We should all actually read this (Score:5, Insightful)
In this day and age when it seems everyone is a suspected terrorist and our liberties are stripped one by one in the name of homeland security, and in the name of the rights of large companies, I wish some of our elected officials would actually read these documents sometime.
A red, white, and blue flag isn't what makes this country great, nor does an extremely high gross domestic product -- it is the set of ideas that were written over 200 years ago that makes the USA great.
So everyone go to this site and read those documents. Even if you aren't American you should still read those documents because everyone has the right to the freedoms that our founding fathers wrote about.
Cheaper, but useful? (Score:3, Insightful)
A speedy internet connection and tons of computers wouldn't be needed to print out documents from Gutenberg.
It still costs money to turn downloaded digital copies of works into printed copies for 100 students in a grade level.
they would realize that it would be cheaper in the longrun to get texts off Gutenberg, instead of buying pre-bound books elsewhere.
Public domain etexts, such as those offered by Project Gutenberg, would be useful in schools only under limited circumstances. Though they would be useful in literature classes in high school (and possibly middle school), forget about them in elementary school, where most books are illustrated, because most PG editions leave out illustrations. Forget about them in science classes as well; the 1911 Encyclopaedia Britannica [1911encyclopedia.com] contains outdated views of anything scientific, and anything significantly newer is tied up forever in the Bono Act and its obligatory sequels. And what keeps a publisher from tying purchases of its science books to purchases of its literature books?
Re:XML please (Score:5, Insightful)
I think you're a little unclear as to what ASCII is. As the "A" in "ASCII" indicates, it's oriented towards American applications. And it consists of a mere 128 characters, which includes 32 control characters that you don't use in text.
In point of fact, Project Gutenberg has long outgrown the 96 graphic characters in ASCII, though I think they themselves are ignorant of the fact. They seem to have experimented with characters until they found a set that displays the same on "normal" Windows, Macs and Unix/Linux. The result is something they call "extended ASCII" but that's actually a subset of both ISO's Latin1 character set [czyborra.com] and Microsoft's Latin1 code page [microsoft.com].
When is this an issue? Well, I'm a DP volunteer, and I'm concentrating on the Britannica 11th edition. Lots of geographic entries, all of which contain degree symbols. This symbol is not in ASCII! If you follow the DP instructions, you end up entering byte 186 (decimal). If you're using the ISO or Microsoft Latin1 set (and if your computer is localized for the U.S., Canada, or Western Europe, you probably are) then 186 does in fact display as a degree symbol. But if your system is localized for Eastern Europe, you're probably using Latin2, and this byte stands for an S with a cedilla accent!
In short, "ASCII" is actually less universal than well-formed HTML, in which you represent the degree symbol with a character entity (&deg;) that's the same everywhere.
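The code-page dependence is easy to check. The short Python sketch below decodes the same byte under ISO 8859-1 (Latin1) and ISO 8859-2 (Latin2) and compares it with the HTML entity. One detail worth noting: under Latin1, byte 186 is strictly the masculine ordinal indicator (º), which merely looks like a degree symbol; the degree sign proper is byte 176. The variable names are only for illustration.

```python
import html

raw = bytes([186])  # the byte the DP instructions reportedly lead you to enter

as_latin1 = raw.decode("iso8859_1")  # Western European reading of the byte
as_latin2 = raw.decode("iso8859_2")  # Eastern European reading of the byte

# The character entity names the character itself, not a byte value,
# so it does not depend on the reader's code page.
from_entity = html.unescape("&deg;")
true_degree = bytes([176]).decode("iso8859_1")

print(repr(as_latin1), repr(as_latin2), repr(from_entity))
```

The same byte comes out as two different characters, while the entity always resolves to U+00B0.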
Hardly a representative example. The Declaration of Independence [archives.gov] was hand-written, and thus doesn't include a lot of fancy fonts or formatting. A better example is a contemporary novel, such as 1984.
As it happens I just finished re-reading this one. I read a Plucker [plkr.org] file that somebody had transformed from an HTML version [adelaide.edu.au], which in turn came from the Project Gutenberg "ASCII" version. Readable enough. But all the typographic niceties -- italics, boldface, etc. -- were reduced to ALL CAPS in the text version, and that was retained in the HTML version. Pretty distracting -- made me feel like somebody was shouting at me. Double Plus Ungood! Thoughtcrime!
You make it sound easy. A lot of information is lost when your primary version is "ASCII". It all has to be put back by hand. There's no avoiding this for the large body of existing Gutenberg texts. And of course as recently as 5 years ago, there wasn't a real choice anyway. Even HTML had issues, and serious XML tools didn't exist.
But now XML technology is pretty mature. It makes sense to store new Gutenberg texts in XML. If people still want "ASCII" copies, the XML is easily transformed into that. Though I think a lot more people will want the HTML version -- a format which is actually accessible to more people than "ASCII".
There are two reasons this won't happen soon.
The first is that somebody will have to design and implement the necessary XML apps for inputting and proofreading the texts. (Which would also eliminate a lot of the errors proofreaders make, like entering [Greek: Tau] when they mean [Greek: T].) A huge project. As it stands, the people who maintain the DP web site have their work cut out just to keep the existing software working. That's a vali
Size (Score:3, Insightful)
Re:XML please (Score:3, Insightful)
Using ASCII presupposes that all the important texts you want to preserve are in American English. Since a fair amount of the important pieces of literature come from mainland Europe (actually even the British £ sign isn't in ASCII), it is clearly not up to the job and should be replaced.
Further, authors often use devices like italics or bold to add emphasis to their work and nowadays even completely different fonts and typefaces. Translating these works to ASCII with no markup actually destroys some of the information in the original works.
I'm not an enthusiastic fan of XML - too many people advocate it as a silver bullet - but this sort of thing seems to be an ideal application.
Re:XML please (Score:3, Insightful)
ASCII is simply 128 characters (256 in the 8-bit "extended" variants). Writing software to translate it is trivial, and it can even be decoded by hand, if necessary.
XML adds a lot of complexity beyond this, which hampers a person's ability to read a file with practically no software tools.
Also, XML is not as ubiquitous as you think, and huge numbers of people don't know how to use the tools to work with it.
A sterling mistake (Score:3, Insightful)
You made a similar mistake when you entered that character, since you just entered it from your keyboard. (A natural mistake if you have a British keyboard, as I assume you do.) On some web sites, this would only read correctly on systems similarly configured. However, Slashdot puts out the header:
which should prevent that. Still, the character entity &pound; is more portable, and will work even when the web page doesn't specify a character set -- and most do not. On the other hand, Slashcode sometimes mangles eight-bit characters when it archives them. So if you seek true immortality, use the character entity!
How to spread the word... (Score:5, Insightful)
A while back, I used wget to mirror the entire Project Gutenberg works. (I did it off-hours, and contacted them to see if it was a problem, or if there was some other more efficient way to do things.)
Anyhow, with my GBs of text, I used bzip2 -9 to compress each text file. In the end, the entire collection of PG was able to fit on one CD. Since most people don't have bzip2 support, I also included the free archiver Ultimate Zip [ultimatezip.com] on the CD. I also put a read-me on the CD (that would appear as the first file) with basic instructions on what to do.
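For anyone who wants to repeat the compression step without the command-line tool, here is a small Python sketch using the standard `bz2` module at the same maximum level. The function name `compress_tree` and the `.txt` filter are my own choices for illustration; unlike plain `bzip2 -9`, this version leaves the original files in place.

```python
import bz2
from pathlib import Path


def compress_tree(root):
    """bzip2-compress every .txt file under root, keeping the originals
    (roughly what `bzip2 -9 -k` does for each file)."""
    written = []
    for src in sorted(Path(root).rglob("*.txt")):
        dst = src.with_name(src.name + ".bz2")
        # compresslevel=9 corresponds to the -9 flag (largest block size).
        dst.write_bytes(bz2.compress(src.read_bytes(), compresslevel=9))
        written.append(dst)
    return written
```

Calling `compress_tree("mirror/")` on a local mirror directory (the path is hypothetical) returns the list of `.bz2` files it wrote.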
One of the great things about CDs is how easy they are to transfer... One stamp, and a 5cent CD envelope, and you can send 2 CDs anywhere in the country (this predated Netflix AFAIK).
Anyhow, I sent these CDs to two different people, and the next time I talked with them, I found out they'd made several copies of it. Basically, they heard someone mention some subject that related to one of the files on the CD, brought up the CD, and offered to make a copy for them. This happened a few times that I know of, and quite possibly many times that I don't know of. Quite an easy way to spread the word.
Of course, with that said, I don't read the PG texts myself... There are two reasons. The first is that I have yet to come across decent software designed for long-term reading. Something that saves your place (automatically?), something with a legible font, and something with light colored text on a dark background, which brings me to my next point...
The second reason is that monitors are all backlit... That means reading on a computer screen is like reading text on a fluorescent lightbulb. It's possible for a while, but your eyes are quickly fatigued. The only screen I have that doesn't do that is the 640x240 B&W LCD screen on my Psion handheld. As good as that is, it's just too small for effective reading. Someone needs to create a non-backlit LCD screen, approx 6" (about the size of a book page), that is small, light, silent, compatible with everything, and most importantly, it needs to have good software that makes reading less work than it normally is on a computer... Until then, electronic reading isn't really going to be feasible. Screw electronic paper, just give me a screen that doesn't hurt my eyes, and I'm set to go.