
Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft (wired.com)

Harvard University announced Thursday it's releasing a high-quality dataset of nearly one million public-domain books that could be used by anyone to train large language models and other AI tools. From a report: The dataset was created by Harvard's newly formed Institutional Data Initiative with funding from both Microsoft and OpenAI. It contains books scanned as part of the Google Books project that are no longer protected by copyright.

Around five times the size of the notorious Books3 dataset that was used to train AI models like Meta's Llama, the Institutional Data Initiative's database spans genres, decades, and languages, with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries. Greg Leppert, executive director of the Institutional Data Initiative, says the project is an attempt to "level the playing field" by giving the general public, including small players in the AI industry and individual researchers, access to the sort of highly refined and curated content repositories that normally only established tech giants have the resources to assemble. "It's gone through rigorous review," he says.

Leppert believes the new public domain database could be used in conjunction with other licensed materials to build artificial intelligence models. "I think about it a bit like the way that Linux has become a foundational operating system for so much of the world," he says, noting that companies would still need to use additional training data to differentiate their models from those of their competitors.



  • by etash ( 1907284 ) on Thursday December 12, 2024 @03:44AM (#65007411)
    but where is the actual dataset for download? the wired link doesn't provide one
    • Re:link? (Score:5, Interesting)

      by ISayWeOnlyToBePolite ( 721679 ) on Thursday December 12, 2024 @04:42AM (#65007471)

      but where is the actual dataset for download? the wired link doesn't provide one

      The paywalled article ( you can read it at https://archive.is/DrzFn [archive.is] ) states:

      The exact way the books dataset will be released is not settled. The Institutional Data Initiative has asked Google to work together on public distribution, but the search giant hasn’t publicly agreed to host it yet, though Harvard says it’s optimistic it will. (Google did not respond to WIRED’s requests for comment.)

      The collective minds of Harvard University, OpenAI, Microsoft and Google being unable to figure out a way to distribute a public domain dataset is the only noteworthy part of the article.

      • Re: (Score:2, Troll)

        "It's releasing" is a squirrelly use of present progressive.

        "It intends to release" would have been clear, accurate, and generated fewer clicks.

        I took a look at Wired's twitter feed a few weeks ago, coincidentally, and they are such fake news.

        Not even a Columbia Journalism student would have made that mistake accidentally.

        I enjoyed their magazine in the 90's.

      • Torrent it?

        It will save a bunch of bandwidth costs (no doubt Google has bandwidth to spare, but just in case it's a cost issue), and it will remain available pretty much forever as long as there is at least one seed somewhere.

      • by Rei ( 128717 )

        This is annoying. Make the press release when it's out. What am I supposed to do, set calendar reminders to check to see if it's out yet? Just wait until you're actually ready.

    • bittorrent or IPFS would be perfect for this.

      Using btfs (the BitTorrent filesystem) one could mount the torrent as a filesystem, even if it is huge, and download data only when it is requested, instead of downloading all the data in the torrent up front.

      • Need this to be available in a searchable GitHub-style repository with commit logs, history, versions, etc.
        It also needs to be indexed by author, date, subject, etc.

        The Library of Congress, British Library and other national libraries should link to the repository.

  • by chthon ( 580889 ) on Thursday December 12, 2024 @03:46AM (#65007413) Journal

    Not to berate old books (I am a fan of Project Gutenberg), but after training you end up with a model built on information from old books. There might be things worthwhile for historic reasons, there might be things that are finished (like the books of Dickens), but you will also have a whole lot of data that has since been superseded.

    At first pass, don't you just have a glorified index? That might be the nicest use case for the current state of AI.

    This index can be split up into two parts, STEM and literature. The STEM part might be of interest for STEM historians. The literature part, for training LLMs on what? Writing texts and books in the style of certain authors? If you can't be creative, an LLM will not help with that.

    • Victorian AI (Score:4, Insightful)

      by Roger W Moore ( 538166 ) on Thursday December 12, 2024 @08:48AM (#65007673) Journal

      after training you end up with a dataset of information based upon old books

      You'll end up with an AI model that acts like an old, well-educated relative. It will have impeccable grammar, a great knowledge of both culture and geography, albeit from 100 years ago, and every so often it will say something that will trigger every SJW in reach while also being triggered itself if you tell it some of the things happening today. It would actually be interesting to see how a group of these bots would interact - it's probably the closest we'll ever get to knowing what it was like to have a conversation in Victorian times.

      • Re:Victorian AI (Score:4, Interesting)

        by Rei ( 128717 ) on Thursday December 12, 2024 @09:25AM (#65007759) Homepage

        I can only speak for myself, but it's not normal to train on a single dataset. For non-translation tasks I have parsers for something like 8 different dataset types (some of which parse many different datasets), while for translation tasks I have dozens of datasets (of varying numbers of languages and varying quality).

        Historic data is good to have in the mix.

        As for the tone, that doesn't come from the foundation - that comes from the finetune.

        Speaking of old datasets, someone the other day released "SPQR-LLM", which involves the use of two Pleias strict RAG models (only use info from queried sources) to query a database of ancient Roman and Greek texts, so its knowledge is stuck in antiquity (but still responds in English). I was chatting with the author the other day and he's thinking about not just RAG, but doing a heavyweight many-epoch finetune from all ancient texts on an existing model (there's not enough data to train a model from scratch, but one should be able to really heavily alter the preexisting foundation), so that the model itself has only an ancient understanding - and then doing an English/Latin/Greek instruct finetune atop it, so you can chat with an "ancient chatbot".
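
        A rough sketch of what "strict RAG" means in practice: the model may only answer from retrieved passages, never from its own parametric knowledge. The corpus, keyword scorer, and function names below are toy stand-ins for illustration, not the actual Pleias or SPQR-LLM code.

```python
# Minimal sketch of "strict RAG": answers may only draw on retrieved
# sources, never on the model's own knowledge. A trivial keyword scorer
# stands in for the retrieval model; the corpus is a toy example.

# Toy corpus of "ancient texts", keyed by source citation.
CORPUS = {
    "Caesar, De Bello Gallico 1.1": "Gallia est omnis divisa in partes tres.",
    "Suetonius, Divus Iulius 37": "Veni, vidi, vici.",
    "Plato, Apology 21d": "I know that I know nothing.",
}

def retrieve(query: str, k: int = 2) -> list[tuple[str, str]]:
    """Rank passages by naive keyword overlap with the query."""
    terms = set(query.lower().split())
    scored = [
        (len(terms & set(text.lower().split())), source, text)
        for source, text in CORPUS.items()
    ]
    scored.sort(reverse=True)
    return [(source, text) for score, source, text in scored[:k] if score > 0]

def answer(query: str) -> str:
    """Strict RAG: quote retrieved passages with citations, or refuse."""
    hits = retrieve(query)
    if not hits:
        return "No supporting source found."
    return "\n".join(f"[{source}] {text}" for source, text in hits)

print(answer("In how many partes is Gallia divisa?"))
```

        The refusal branch is the part that makes it "strict": with no matching source, the system declines rather than letting the underlying model improvise, which is how such a chatbot's knowledge stays pinned to its document collection.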

      • something that will trigger every SJW in reach while also being triggered itself

        Interesting. I would have expected you, instead, to complain that works of minorities and women will be over-represented in this terrible, woke training set, given the lower likelihood that the copyright on such works survived the prior registration-and-renewal regime.

        Just didn't think of it? Or were you expressing a preference for belittling your current political opponents rather than extending your own reactionary thinking?
        • What political opponents? I was merely musing that an AI trained on old books would likely have a point of view consistent with the world of a century ago, given the time it takes for copyright to expire nowadays, and would therefore likely get into trouble very quickly in certain circles. Registration and renewal, I believe, was something specific to US copyright law, so I did not give it any thought since I'm not American.
  • Books from a century or two ago are often full of purple prose: flowery verbosity.

    • Re:SwirlyGPT (Score:4, Interesting)

      by Anonymous Coward on Thursday December 12, 2024 @04:59AM (#65007493)

      Books from a century or two ago are often full of purple prose: flowery verbosity.

      The bespoke hypocrisy coming from the social media generation is gilded with only the finest narcissism.

      (We at least know the books from a century ago were written by humans. As opposed to whatever they will be re-written by in the future to serve a purpose yet unknown to us.)

    • by quenda ( 644621 ) on Thursday December 12, 2024 @07:18AM (#65007575)

      fix by training telegram archives stop

  • by Mirnotoriety ( 10462951 ) on Thursday December 12, 2024 @09:46AM (#65007819)
    One A.I. to rule them all
    One OS to track them
    One Framework to bring them all
    and in the LLM bind them
    In the land of Redmond where the shadows lie
  • Harvard has turned into a DEI hell hole and their data is tainted
  • Should LLMs choose naught but correspondents in the publik domain for their edification, t'would perforce make perceiving AI Slop a facile and certain enterprise.
