
Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft (wired.com)

Harvard University announced Thursday it's releasing a high-quality dataset of nearly one million public-domain books that could be used by anyone to train large language models and other AI tools. From a report: The dataset was created by Harvard's newly formed Institutional Data Initiative with funding from both Microsoft and OpenAI. It contains books scanned as part of the Google Books project that are no longer protected by copyright.

Around five times the size of the notorious Books3 dataset that was used to train AI models like Meta's Llama, the Institutional Data Initiative's database spans genres, decades, and languages, with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries. Greg Leppert, executive director of the Institutional Data Initiative, says the project is an attempt to "level the playing field" by giving the general public, including small players in the AI industry and individual researchers, access to the sort of highly refined and curated content repositories that normally only established tech giants have the resources to assemble. "It's gone through rigorous review," he says.

Leppert believes the new public domain database could be used in conjunction with other licensed materials to build artificial intelligence models. "I think about it a bit like the way that Linux has become a foundational operating system for so much of the world," he says, noting that companies would still need to use additional training data to differentiate their models from those of their competitors.



  • by etash ( 1907284 ) on Thursday December 12, 2024 @03:44AM (#65007411)
    but where is the actual dataset for download? the wired link doesn't provide one
    • Re:link? (Score:5, Interesting)

      by ISayWeOnlyToBePolite ( 721679 ) on Thursday December 12, 2024 @04:42AM (#65007471)

      but where is the actual dataset for download? the wired link doesn't provide one

      The paywalled article ( you can read it at https://archive.is/DrzFn [archive.is] ) states:

      The exact way the books dataset will be released is not settled. The Institutional Data Initiative has asked Google to work together on public distribution, but the search giant hasn’t publicly agreed to host it yet, though Harvard says it’s optimistic it will. (Google did not respond to WIRED’s requests for comment.)

      The collective minds of Harvard University, OpenAI, Microsoft and Google being unable to figure out a way to distribute a public domain dataset is the only noteworthy part of the article.

      • Re: (Score:2, Troll)

        "It's releasing" is a squirrelly use of present progressive.

        "It intends to release" would have been clear, accurate, and generated fewer clicks.

        I took a look at Wired's twitter feed a few weeks ago, coincidentally, and they are such fake news.

        Not even a Columbia Journalism student would have made that mistake accidentally.

        I enjoyed their magazine in the 90's.

      • Torrent it?

        It will save a bunch of bandwidth costs (no doubt Google has bandwidth to spare, but just in case it's a cost issue), and it will remain available pretty much forever as long as there is at least one seed somewhere.

      • by Rei ( 128717 )

        This is annoying. Make the press release when it's out. What am I supposed to do, set calendar reminders to check to see if it's out yet? Just wait until you're actually ready.

    • bittorrent or IPFS would be perfect for this.

      Using btfs (the BitTorrent filesystem) one could mount the torrent as a filesystem, even if it is huge, and download data only when it is requested, instead of downloading all the data in the torrent up front.

      • Need this to be available in a searchable GitHub-style repository with commit logs, history, versions, etc.
        It also needs to be indexed by author, date, subject, etc.

        The Library of Congress, British Library and other national libraries should link to the repository.

  • by chthon ( 580889 ) on Thursday December 12, 2024 @03:46AM (#65007413) Journal

    Not to berate old books (I am a fan of Project Gutenberg), but after training you end up with a model built on information from old books. There might be things worthwhile for historic reasons, there might be things that are finished (like the books of Dickens), but you will also have a whole lot of data that has since been superseded.

    At first pass, don't you just have a glorified index? That might be the nicest use case for the current state of AI.

    This index can be split up into two parts, STEM and literature. The STEM part might be of interest for STEM historians. The literature part, for training LLMs on what? Writing texts and books in the style of certain authors? If you can't be creative, an LLM will not help with that.

    • Victorian AI (Score:4, Insightful)

      by Roger W Moore ( 538166 ) on Thursday December 12, 2024 @08:48AM (#65007673) Journal

      after training you end up with a dataset of information based upon old books

      You'll end up with an AI model that acts like an old, well-educated relative. It will have impeccable grammar, a great knowledge of both culture and geography, albeit from 100 years ago, and every so often it will say something that will trigger every SJW in reach while also being triggered itself if you tell it some of the things happening today. It would actually be interesting to see how a group of these bots would interact - it's probably the closest we'll ever get to knowing what it was like to have a conversation in Victorian times.

      • Re:Victorian AI (Score:4, Interesting)

        by Rei ( 128717 ) on Thursday December 12, 2024 @09:25AM (#65007759) Homepage

        I can only speak for myself, but it's not normal to train on a single dataset. For non-translation tasks I have parsers for something like 8 different dataset types (some of which parse many different datasets), while for translation tasks I have dozens of datasets (of varying numbers of languages and varying quality).

        Historic data is good to have in the mix.

        As for the tone, that doesn't come from the foundation - that comes from the finetune.

        Speaking of old datasets, someone the other day released "SPQR-LLM", which involves the use of two Pleias strict RAG models (only use info from queried sources) to query a database of ancient Roman and Greek texts, so its knowledge is stuck in antiquity (but still responds in English). I was chatting with the author the other day and he's thinking about not just RAG, but doing a heavyweight many-epoch finetune from all ancient texts on an existing model (there's not enough data to train a model from scratch, but one should be able to really heavily alter the preexisting foundation), so that the model itself has only an ancient understanding - and then doing an English/Latin/Greek instruct finetune atop it, so you can chat with an "ancient chatbot".
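
        A rough sketch of what "strict RAG" means in practice: the model may only answer from retrieved passages, never from its own parametric knowledge. The corpus, keyword scorer, and function names below are toy stand-ins for illustration, not the actual Pleias or SPQR-LLM code.

```python
# Minimal sketch of "strict RAG": answers may only draw on retrieved
# sources, never on the model's own knowledge. A trivial keyword scorer
# stands in for the retrieval model; the corpus is a toy example.

# Toy corpus of "ancient texts", keyed by source citation.
CORPUS = {
    "Caesar, De Bello Gallico 1.1": "Gallia est omnis divisa in partes tres.",
    "Suetonius, Divus Iulius 37": "Veni, vidi, vici.",
    "Plato, Apology 21d": "I know that I know nothing.",
}

def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Rank passages by naive keyword overlap with the query."""
    terms = set(query.lower().split())
    scored = [
        (len(terms & set(text.lower().split())), source, text)
        for source, text in CORPUS.items()
    ]
    scored.sort(reverse=True)
    return [(source, text) for score, source, text in scored[:k] if score > 0]

def answer(query: str) -> str:
    """Strict RAG: quote retrieved passages with citations, or refuse."""
    hits = retrieve(query)
    if not hits:
        return "No supporting source found."
    return "\n".join(f"[{source}] {text}" for source, text in hits)

print(answer("In how many partes is Gallia divisa?"))
```

        The refusal branch is the part that makes it "strict": with no matching source, the system declines rather than letting the underlying model improvise, which is how such a chatbot's knowledge stays pinned to its document collection.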

      • something that will trigger every SJW in reach while also being triggered itself

        Interesting. I would have expected you, instead, to complain that works of minorities and women will be over-represented in this terrible, woke training set, given the lower likelihood that the copyright on such works survived the prior registration-and-renewal regime.

        Just didn't think of it? Or were you expressing a preference for belittling your current political opponents rather than extending your own reactionary thinking?
        • What political opponents? I was merely musing that an AI trained on old books would likely have a point of view consistent with the world of a century ago, given the time it takes for copyright to expire nowadays, and would therefore likely get into trouble very quickly in certain circles. Registration and renewal, I believe, was something specific to US copyright law, so I did not give it any thought since I'm not American.
  • Books from a century or two ago are often full of purple prose: flowery verbosity.

    • Re:SwirlyGPT (Score:4, Interesting)

      by Anonymous Coward on Thursday December 12, 2024 @04:59AM (#65007493)

      Books from a century or two ago are often full of purple prose: flowery verbosity.

      The bespoke hypocrisy coming from the social media generation is gilded with only the finest narcissism.

      (We at least know the books from a century ago were written by humans. As opposed to whatever they will be re-written by in the future to serve a purpose yet unknown to us.)

    • by quenda ( 644621 ) on Thursday December 12, 2024 @07:18AM (#65007575)

      fix by training telegram archives stop

  • by Mirnotoriety ( 10462951 ) on Thursday December 12, 2024 @09:46AM (#65007819)
    One A.I. to rule them all
    One OS to track them
    One Framework to bring them all
    and in the LLM bind them
    In the land of Redmond where the shadows lie
  • Harvard has turned into a DEI hell hole and their data is tainted
  • Should LLMs choose naught but correspondents in the publik domain for their edification, t'would perforce make perceiving AI Slop a facile and certain enterprise.
