News for nerds, stuff that matters

Yahoo Competes with Google in Book Scanning

Posted by ScuttleMonkey on Mon Oct 03, 2005 04:04 PM
from the my-literary-collection-is-bigger-than-yours dept.
UltimaGuy writes "A consortium backed by Yahoo has launched an ambitious effort to digitize classic books and technical papers and make them freely available on the Web. The company is partnering with the newly formed Open Content Alliance, which aims to offer PDF documents of books to the public at no charge. Consumers will be able to search the contents of the Open Content Alliance's database and download the entire content of any work, such as a scanned copy of a book."
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • RIAA Problems Solved (Score:5, Funny)

    by GreggyBUIUC (262370) on Monday October 03 2005, @04:07PM (#13707510)
    Someone start up a "Open Content Alliance" for music... then we can digitize and share it all we want.

  • by Anonymous Coward on Monday October 03 2005, @04:07PM (#13707512)
    I can't wait to read the whole book on one page.
  • no mention of project gutenberg (Score:4, Insightful)

    by justforaday (560408) on Monday October 03 2005, @04:08PM (#13707526)
    I find it interesting that in all the articles I've looked at today about this that only one has mentioned Project Gutenberg. Naturally, I can't recall which source it was...
  • What a concept. (Score:5, Informative)

    by Anonymous Coward on Monday October 03 2005, @04:09PM (#13707539)
    I liked the idea the first time I heard it - back when it was called Project Gutenberg. :P
  • ...that we don't?

    It seems to me that they're throwing money at an unnecessary application. Does Yahoo know something that we don't? I'd venture that they're starting with PD books to shake the bugs out of their platform so the app works well in round 2.

    Round 2 (current commercial books) won't occur without a massive copyright law change or the support of the Authors Guild.

    Hmm.
  • Project Gutenberg (Score:5, Informative)

    by timeToy (643583) on Monday October 03 2005, @04:11PM (#13707546)
    16k ebooks to choose from today, more to come, no Google, no Yahoo.
    http://www.gutenberg.org/ [gutenberg.org]
  • Whew! (Score:5, Interesting)

    by op12 (830015) on Monday October 03 2005, @04:12PM (#13707555)
    (http://symbii.com/)
    I almost panicked after seeing we had gone so long without a Google-related article.

    The opt-in rather than opt-out strategy is really what Google probably should have done, but it'll be interesting to see who comes out as a winner, Yahoo or Google, in all of this.
    • Re:Whew! by ChocoBean (Score:1) Monday October 03 2005, @04:18PM
  • by Anonymous Coward on Monday October 03 2005, @04:12PM (#13707556)
    In the US, books published after 1922 can still be public domain if the author was American, it was originally published in the US, and the copyright was not extended at the end of the original copyright period. Google Library does not seem to be making an exception for this, will OCA? Project Gutenberg does.
  • Not really an up-stage (Score:4, Informative)

    by ChocoBean (890202) on Monday October 03 2005, @04:14PM (#13707569)

    Actually this won't "Upstage" google in any way.

    FTA:
    all the content will be made available so it can be indexed by all the other major search engines, including Google's

    Yahoo is just going to scan, scan and scan. We all already prefer Google's indexing and searching and cleaner interfaces, so the only thing Yahoo! will accomplish by this is to help Google Print along, shielding all (other) copyright lawsuits. Once the stuff is online, we all know that Google-bots will be all over it "like a fly on a pile of very seductive manure (Zapp)"

    Excellent.

    I just hope publishers realise that in this case neither Google nor Yahoo is trying to be their best friend.

  • What about China? (Score:4, Interesting)

    by DAldredge (2353) <SlashdotEmail@GMail.Com> on Monday October 03 2005, @04:14PM (#13707573)
    (Last Journal: Sunday October 14, @10:49PM)
    Will Yahoo provide sorted or unsorted lists of books that China's Internet users view to the thugs that run China?
  • by doctor_no (214917) on Monday October 03 2005, @04:17PM (#13707595)
    Seems like the crucial difference between Google's efforts and the OCA (Open Content Alliance) is that Google has an "opt-out" policy for copyrighted material, while the OCA specifically requires the copyright holder to contact them and essentially allow them to use the material.

    The OCA likely won't be sued by the Authors Guild as Google was. For searching, however, Google will likely be better, since its search will likely include a massive amount of copyrighted material, legal or not. Also, it seems that Google itself will be allowed to fold all the material from the OCA into its own project as well.
  • Companies should Get Original (Score:2, Insightful)

    by TarrySingh (916400) on Monday October 03 2005, @04:18PM (#13707605)
    (http://tarrysingh.blogspot.com/)
    Why can't companies come up with some cooler ideas? Why ape each other? First Google and then Yahoo. Sure, MS will also want to play.
  • NOT competing (Score:5, Informative)

    by daniil (775990) <evilbj8rn@hotmail.com> on Monday October 03 2005, @04:23PM (#13707633)
    (Last Journal: Thursday September 28 2006, @01:06PM)
    There's a slight difference between an 'Internet-based library' and 'searching inside books'.
  • by merreborn (853723) on Monday October 03 2005, @04:26PM (#13707655)
    Google Print's goal is to allow people to search book content, WITHOUT giving them the content of the book.

    For example, searching "Zoroastrianism" would return a list of book titles on the subject, and links to purchase the books in question. You CANNOT download the content of the book!

    The OCA (The group Yahoo just joined) is an opt-in, full content hosting project.

    Searching "Zoroastrianism" would return a (much smaller) list of books, with the *full* content of the book available for download with the explicit consent of the publisher/author!
  • Sad thing about Yahoo though (Score:3, Interesting)

    by totallygeek (263191) on Monday October 03 2005, @04:26PM (#13707656)
    (http://www.totallygeek.com/)
    You will be reading the content to Moby Dick on Yahoo [yahoo.com] and in the top right it will say, "content provided by Google [google.com]."
  • Annoying (Score:2)

    by rm999 (775449) on Monday October 03 2005, @04:30PM (#13707690)
    I am getting tired of the big internet companies straight up copying each other. Yes, it means that products slowly get improved over time (eg. yahoo mail -> gmail -> yahoo mail) but it also means that the companies aren't innovating enough. Yahoo is spending time and money on providing a product that is already offered. We would probably be better off if they spent the effort on providing a unique service - like scanned magazines or something.
    • Re:Annoying by ScentCone (Score:3) Monday October 03 2005, @04:44PM
      • Re:Annoying by rm999 (Score:2) Monday October 03 2005, @05:15PM
        • Re:Annoying (Score:4, Informative)

          "very few new features come out"

          Have you seen Google Earth?

          How about the disaster wiki that went together in about 20 minutes, where people were posting status reports of New Orleans properties?

          I think you're damning with faint praise. Google, at least, consistently builds superb offerings, and the price is right. Not quite sure what you're grousing about...
  • by megify (873328) on Monday October 03 2005, @04:30PM (#13707691)
    I think this is only good for short documents....
    I think if I read Finnegans Wake or Hawaii on-screen, my eyes would bleed and tear themselves out of my skull (not to mention downloading PDFs for days). In that case, I'd much rather just go buy a paperback for $3. Then I don't have to read on-screen, the pages are conveniently sized and bound, and I can take my book to places I wouldn't bring a laptop. Like a bubble bath, bed, or my commute to work every day.
  • by pin_gween (870994) on Monday October 03 2005, @04:31PM (#13707699)
    will it take to download that PDF of War and Peace?
  • by dananderson (1880) on Monday October 03 2005, @04:41PM (#13707754)
    (http://dan.drydog.com/)
    I find it funny (in an ironic way only) that the University of California is allowing its public domain books to be scanned by Yahoo. At the same time, UC libraries prohibit scanning for Project Gutenberg [gutenberg.org] or other true "open" content projects unless they receive $$$$ in royalties.

    I hate to see a University pander to commercial interests, while at the same time, welcome commercial interests such as Yahoo. Money talks, and I'm sure UC is being paid a lot, but libraries are supposed to be public resources too, not exclusive profit-centers :-(.

  • Reading Between the Lines (Score:2, Redundant)

    by 99BottlesOfBeerInMyF (813746) on Monday October 03 2005, @04:48PM (#13707794)

    Reading between the lines of this proposal, we seem to have another print.google.com, except it will not index the huge number of works whose copyright holders do not "opt in" to the program. The advantage of this is that it may make some copyright holders feel better about the whole thing and, hopefully, submit entire works to be viewed by the public. It is also possible that Yahoo is worried about the legal issues and wants to wait and see how Google weathers any legal challenges.

    From a purely technical perspective, this system seems inferior in most ways. It only displays full text and does not give copyright holders the ability to show only an excerpt, or a set number of pages. Providing them as PDFs is nice, though. I wish Google would add that feature for works that are shown in their entirety. In general though, if I'm looking for particular data I don't see why I'd use Yahoo, which will have a much smaller index of work.

  • PDF?! yuck (Score:2, Insightful)

    by BillHop (82717) on Monday October 03 2005, @04:48PM (#13707797)
    Does anyone else find there is no way to read a PDF with the scroll buttons (mouse wheel, etc.) without the viewer constantly breaking your flow by jumping to the next page?

    This goes along with the concept that for an electronic format, I do NOT need a sentence (or even worse, hyphenated word) broken up by two inches of top and bottom margin filled with page numbers, miscellaneous watermarks, repetitive titles, etc.

    PS. This being flamebait does not make it false.
    • Re:PDF?! yuck (Score:5, Informative)

      by Fiver- (169605) on Monday October 03 2005, @05:11PM (#13707938)
      "Does anyone else find there is no way to read a PDF with the scroll buttons..."

      No. I just set it to Continuous. See those four icons in the lower right corner? (assuming you've got a recent version) Play with those. You want the second button from the left

      "This goes along with the concept that for an electronic format, I do NOT need a sentence (or even worse, hyphenated word) broken up by two inches of top and bottom margin filled with page numbers, miscellaneous watermarks, repetitive titles, etc."

      Well, the whole purpose of PDF is to "preserve the look and integrity of your original documents ... regardless of the application and platform used to create it." Blame the creators of that particular pdf file if you don't like the headers, footers and margin size. When I make pdf books to read on the train...I just finished Dream Quest of Unknown Kadath by Lovecraft...I open the original ascii text file in Word, make the top & bottom margins tiny, change the font to something tolerable and export it.
  • Bookripper on its way? (Score:5, Interesting)

    Google maintains its scanning represents "fair use" allowed under the law because it only allows Web surfers to view excerpts from copyrighted books.


    Soon after Google Mail was introduced, somebody created a SourceForge project that lets you use Google Mail as a database. How long until somebody releases a "Bookripper" app that assembles a whole book from search extracts? As I understand it Google displays two pages at a time (or wait, that's Amazon, but I bet they're similar). All you would need to know is a quote from a book's first page as a seed, and you should be able to grab the whole book by doing a series of searches using text from the second page returned by each search. The trick would be to knit the pieces together and eliminate the overlapping text. Seems almost trivial. Another possibility would be to search for random words and look for overlaps between the results, assembling them like a linear jigsaw puzzle until there are no gaps.
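The "linear jigsaw puzzle" step described above, joining fragments and dropping the text they share, is ordinary suffix/prefix overlap matching. A minimal sketch follows, assuming the fragments are already in reading order and are plain strings; the function names `overlap_len` and `stitch` and the `min_overlap` parameter are made up for illustration and do not come from any real tool or service.

```python
def overlap_len(left: str, right: str, min_overlap: int = 3) -> int:
    """Return the length of the longest suffix of `left` that is also a prefix of `right`."""
    max_len = min(len(left), len(right))
    # Try the longest candidate first so the first hit is the best one.
    for size in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0  # no overlap of at least `min_overlap` characters


def stitch(fragments: list[str], min_overlap: int = 3) -> str:
    """Join ordered fragments, dropping the text each one repeats from the previous one."""
    if not fragments:
        return ""
    result = fragments[0]
    for frag in fragments[1:]:
        size = overlap_len(result, frag, min_overlap)
        # With no overlap (size == 0) the fragment is appended as-is,
        # which marks a gap rather than a true join.
        result += frag[size:]
    return result


pieces = [
    "Call me Ishmael. Some years ago",
    "Some years ago, never mind how long",
    "how long precisely, having little",
]
print(stitch(pieces))
```

The harder part the comment mentions, ordering fragments that arrive in random order, is not covered here; this only handles the knitting once the order is known.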
  • "Do no Evil" done right (Score:5, Insightful)

    by Chunni Babu (920014) on Monday October 03 2005, @05:03PM (#13707883)
    (Last Journal: Monday October 10 2005, @01:10PM)

    Now this is a right step towards making book contents searchable online. I would hate to see one company like Google copying and caching all books in its massive cluster of servers. I know that the Google kool-aid that "we are about general good" is running deeply in the veins of slashdot types.

    Since when was scanning books from libraries and making them available to the public for a profit considered "fair use"? This kind of stuff is done by pirates. Go to the major cities in China and India and you will see piles of copied books in the streets, all sold for 1/10th the original price without giving anything back to the authors. The pirates can say that they are doing a favor to the authors by driving them out of obscurity.

    The message the alliance is sending out to the authors is

    • we are not for profit
    • we will scan your book only if you want us to do so
    • your book will be indexed based on your approval and copyright agreement with you and the publishers
    Compare this to what Google is telling the authors
    • we will scan your book, fill a form and tell us if you don't want us to do so
    • we will take sale commissions from amazon, buy.com, bn.com, etc. without sharing anything with you
    • if we show ads, we will share the profits with you
    • we will show excerpts of your book, so if a researcher is researching on a topic he can find what you have written about a topic without ever having to buy your book, too bad, heh heh, write a fiction book dude
    • we will cache your book in our servers and only we will reserve the right to profit from your scanned book
    So much for do no evil. Kudos to yahoo for bringing the open content alliance, gutenberg, and other similar projects to limelight - these are some really nice collections that were hidden by the noise created by 'google print'.
  • by Anonymous Coward on Monday October 03 2005, @05:04PM (#13707887)

    I've read through the first few posts, and people really don't have a clue about what this is all about. "Open Content Alliance"... It means what it says. Open f'ing content. Let there be content available to the masses... Is it more important that I can get a snippet from some copyrighted text, or that millions of children can read Alice in Wonderland with all its wonderful illustrations?

    This is beyond PDF or anything like that. Some people want PDF, so Adobe will make them. Some people want decent OCR versions, perhaps to go into Distributed Proofreaders or into someone's text-only PDA. It's ALL possible. This is NOT an exclusive club, it's an INCLUSIVE community that is dedicated to Open f'ing Content.

    Why don't you people get it? By allowing people to have full texts of some of humanity's greatest works we are doing more than a few snippets of the latest Ken Follett novel... a lot more.

    It's bigger than Yahoo or Google. Yahoo is NOT an also-ran.... The Internet Archive has been scanning books and hosting Million Book Project texts as well as Project Gutenberg texts for a long time... long before Yahoo or even Google were in the picture. Ignorant comments made here suggest somehow Yahoo is following.

    I say Yahoo is leading by embracing a project that by definition is bigger than themselves. Good for them.

  • by obli (650741) on Monday October 03 2005, @05:04PM (#13707888)
    With all this competition it won't be long until they start indexing comic books, then I'll finally be able to find that awesome Donald Duck comic I read back in 1996 and couldn't ever find again.
  • Will they come in opendocument format? Or proprietary PDF?

    Just wondering.
  • New and Radical (Score:4, Funny)

    by Corydon76 (46817) on Monday October 03 2005, @05:46PM (#13708160)
    (http://drunkcoder.com/)
    Hey, wow, that is completely original [gutenberg.org]. Nobody else could have possibly thought [promo.net] of this idea before [wikipedia.org].
    • More like... by Grendel Drago (Score:2) Tuesday October 04 2005, @12:50AM
  • Noticed on boingboing.net that a Chinese company is marketing a DRM-free version of an ebook reader [boingboing.net] using an eInk screen.

    Although I don't think it's on sale, it is the Holy EBook Reader Grail we've been seeking for ten years.

    If we're gonna download ebooks, we should have a reader to read them with, no?
  • by baomike (143457) on Monday October 03 2005, @07:37PM (#13708896)
    They can't even keep their message boards and finance services running. I doubt that they will be any more successful at this.
    Google is safe.
    I give 'em three years and they're toast.
  • Bad Redundancy (Score:1)

    by Kattana (635282) on Monday October 03 2005, @07:42PM (#13708925)
    What a waste of time. Now we have 3 projects doing the same thing; can they not think of something original to do? Now effectively we have 1 FOSS collection of books and 2 redundant proprietary collections because they could not work together. Oh no, then we might have all this scanning done 3 times faster. Someone needs to scan them a book to teach them cooperation.
    No amount of computer resources can make up for human inefficiency.
  • by icepick72 (834363) on Monday October 03 2005, @08:20PM (#13709101)
    they must be using rooms full of trained squirrels.
  • by Leobinus (782479) on Monday October 03 2005, @08:58PM (#13709282)
    Instead of just providing the texts, why doesn't Yahoo or Google go in-depth with a collection of texts, and provide insights on them? Open Source Shakespeare [opensource...speare.org] is an example of what I mean. There's an automatically-generated concordance [opensource...speare.org], you can look at all of a character's lines at once [opensource...speare.org], and see statistics [opensource...speare.org] about the plays, etc. Those are actual research tools. Being able to search a text is useful, but you could do that in 1975. I would be more impressed if these big search companies figured out how to do something more useful.
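For readers unfamiliar with the term, a concordance is just an index from each word to the places it occurs. A minimal sketch of generating one automatically is below; it is not how Open Source Shakespeare builds theirs, and the function name `build_concordance` and the simple word pattern are assumptions chosen for illustration.

```python
import re
from collections import defaultdict


def build_concordance(lines):
    """Map each lowercased word to the 1-based line numbers where it appears."""
    index = defaultdict(list)
    for line_no, line in enumerate(lines, start=1):
        # Treat runs of letters and apostrophes as words; punctuation is dropped.
        for word in re.findall(r"[a-z']+", line.lower()):
            index[word].append(line_no)
    return dict(index)


text = [
    "To be, or not to be,",
    "that is the question.",
]
concordance = build_concordance(text)
print(concordance["be"])        # line numbers for "be"
print(concordance["question"])  # line numbers for "question"
```

A word that occurs twice on one line is listed twice, which keeps occurrence counts recoverable from the same index.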
  • here [userfriendly.org]
  • Re:Why PDF? (Score:5, Informative)

    by david duncan scott (206421) on Monday October 03 2005, @04:25PM (#13707653)
    10 years down the road when everything is in PDF format, who's to stop them from charging us to view material in their format?

    The fact that it's an open, documented [adobe.com] format?

    Adobe has made their money the old-fashioned way, by making tools that work well, rather than by locking people into a format. GhostScript, among others, will read those PDF's with or without Adobe.

    • Re:Why PDF? by amliebsch (Score:2) Monday October 03 2005, @07:05PM
      • Re:Why PDF? by david duncan scott (Score:2) Tuesday October 04 2005, @12:46PM
  • Re:Dupe (Score:1)

    by JordanL (886154) <jordan.ledoux@g[ ]l.com ['mai' in gap]> on Monday October 03 2005, @04:42PM (#13707765)
    Wow... I never checked out the pricing... I retract that. Quite the bargain actually.
    • Re:Dupe (Score:4, Funny)

      by Nuttles1 (578165) on Monday October 03 2005, @04:47PM (#13707793)
      You must not be a true /.er because you know that if you were you would read up on every bit of documentation about anything that we do....Like how we always RTFA...errr....wait, scratch that
      • Re:Dupe by JordanL (Score:1) Monday October 03 2005, @04:49PM
        • Re:Dupe by Nuttles1 (Score:1) Monday October 03 2005, @04:51PM
          • Re:Dupe by JordanL (Score:1) Monday October 03 2005, @04:57PM
  • by expro (597113) on Monday October 03 2005, @05:03PM (#13707882)

    If there was ever anything we need competition in, it is search engines. Whether project Gutenberg needed any competition is another question.

    I don't see a lot of similarity between this project and the one Google is doing. Open versus proprietary. Free (free as in speech) information versus non-free information.

    In the case of other search engines Google has put out of business (Altavista, although the web site still exists, no longer exists as the more-advanced search engine it was using the facilities of others), the competition did not make them improve at all, beyond their insight to make searching a popularity contest instead of an accurate search.

  • Re:its to see... (Score:4, Insightful)

    by twiddlingbits (707452) on Monday October 03 2005, @05:05PM (#13707900)
    PDFs of "public domain" or donated works will always be available. Amazon has gotten enough sh*t about the excerpts that they publish to entice the reader to buy the book. Google "e-book" and you'll see Yahoo! is nowhere near the only source. There is even an open-source e-book idea at Open eBook - http://www.openebook.org/ [openebook.org] -- Information on the publication specification for electronic books that will allow compatibility between different e-book devices.

    I just wonder how Yahoo! will make $$$ off this very small market of public domain works, or, if they DO get repro rights to other books, what the price model is to download them. Or will you just see advertisements in your e-books? The authors are not going to give up their $$$, nor is Yahoo, so somebody is going to have to pay for this content.

  • Re:Google/Yahoo (Score:1, Interesting)

    by crashelite (882844) on Monday October 03 2005, @05:11PM (#13707943)
    Google isn't evil so they can't become a monopoly like M$...
  • Re:that one thing (Score:1)

    by ValuJet (587148) on Monday October 03 2005, @07:56PM (#13708999)
    repeat after me...

    Library
    Library
    Library
  • Re:Why PDF? (Score:2)

    by TTK Ciar (698795) on Monday October 03 2005, @08:20PM (#13709106)
    (http://www.ciar.org/ttk | Last Journal: Monday October 15, @05:30PM)

    You're right, sorta. The djvu [freshmeat.net] format is better than PDF for scanned books in most respects. Looks better, compresses better (and compresses by default), decompresses + renders faster while using less memory, more easily transformed to/from other formats due to availability of high-quality open source and free tools, etc. The Internet Archive's books collection has several books archived in djvu format.

    The downside is that most users do not have a djvu reader installed on their computers, and even though it's trivial to download and install djview for free, most people will not bother. The Internet Archive more or less solves this problem with a java applet which turns users' web browsers into djvu readers. This should work for other content providers as well, except nobody knows about it, so everyone stops at "oh no, nobody has a viewer installed". The end.

    On a slightly different note, though, PDF isn't that bad. It's an open format, and even though most people seem to think Acrobat is the only viewer, there are others like xpdf, which is faster, more stable, and easier to use than Acrobat (though not as fully-featured).

    -- TTK
