Text Mining the New York Times

Roland Piquepaille writes "Text mining is a computer technique for extracting useful information from unstructured text, and it's a difficult task. But now, using a relatively new method named topic modeling, computer scientists from the University of California, Irvine (UCI) have analyzed 330,000 stories published by the New York Times between 2000 and 2002 in just a few hours. They were able to automatically isolate topics such as the Tour de France, prices of apartments in Brooklyn, or dinosaur bones. This technique could soon be used not only by homeland security experts and librarians, but also by physicians, lawyers, real estate agents, and even by yourself. Read more for additional details and a graph showing how the researchers discovered links between topics and people."
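The writeup doesn't include the UCI group's code, but here is a minimal sketch of what topic modeling does in practice, assuming scikit-learn's LatentDirichletAllocation as a stand-in and a toy corpus that echoes the topics mentioned above:

```python
# Toy topic-modeling sketch; not the UCI team's code. scikit-learn's
# LDA implementation and the sample documents are stand-ins for brevity.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "cyclists climbed the alps in the tour de france",
    "brooklyn apartment prices rose again this quarter",
    "paleontologists unearthed dinosaur bones in montana",
    "the tour de france peloton crossed the pyrenees",
    "rents and apartment sales in brooklyn keep climbing",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(counts)

# Print the highest-weight words for each discovered topic.
words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[-4:][::-1]]
    print(f"topic {k}: {', '.join(top)}")
```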
  • by Anonymous Coward on Saturday July 29, 2006 @07:41AM (#15805152)

    An artificial intelligence [earthlink.net] could maybe use these new methods to grok all human knowledge contained in all textual material all over the World Wide Web.

    Technological Singularity -- [blogcharm.com] -- here we come!

  • by Anonymous Coward on Saturday July 29, 2006 @08:22AM (#15805231)
    Text modeling is mostly viewed as an unsupervised machine learning problem (since nobody will go through thousands of articles and tag each and every word, i.e. assign a topic to it). However, support vector machines are very good classifiers for supervised data, e.g. digit recognition: you train your SVM on a sample of pictures of 9's tagged as a 9, and the SVM should then return the correct class for a new digit.
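    A quick sketch of that supervised setting, using scikit-learn's bundled digits dataset purely for illustration (nothing here comes from the article):

    ```python
    # Supervised classification: an SVM trained on labeled digit images,
    # then asked to classify digits it has not seen. Illustrative only.
    from sklearn import datasets, svm
    from sklearn.model_selection import train_test_split

    digits = datasets.load_digits()
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data, digits.target, test_size=0.2, random_state=0)

    clf = svm.SVC(kernel="rbf", gamma=0.001, C=10.0)
    clf.fit(X_train, y_train)  # learn from labeled examples
    print("held-out accuracy:", clf.score(X_test, y_test))
    print("prediction for the first test image:", clf.predict(X_test[:1]))
    ```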

    The problem with this new method (called LDA, introduced by Blei, Ng and Jordan in 2003) is, besides other issues, the so-called inference step, as it is analytically intractable. Blei et al. solved this by means of variational methods, i.e. simplifying the model, motivated by averaging-out phenomena. Another method (which as far as I understand was applied by Steyvers) is sampling, in this case Gibbs sampling. Usually the variational methods are superior to sampling approaches, as one needs quite a lot of samples for the whole thing to converge.
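    To make the inference step concrete, here is a bare-bones collapsed Gibbs sampler for LDA. It is a toy sketch of the standard model, not the UCI group's implementation; real systems add burn-in, hyperparameter estimation, and heavy optimization.

    ```python
    # Collapsed Gibbs sampling for LDA: resample each word's topic from
    # its full conditional, keeping count tables up to date. Toy-sized.
    import numpy as np

    def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, iters=200):
        rng = np.random.default_rng(0)
        z = [rng.integers(n_topics, size=len(doc)) for doc in docs]
        ndk = np.zeros((len(docs), n_topics))   # document-topic counts
        nkw = np.zeros((n_topics, vocab_size))  # topic-word counts
        nk = np.zeros(n_topics)                 # total words per topic
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
        for _ in range(iters):
            for d, doc in enumerate(docs):
                for i, w in enumerate(doc):
                    k = z[d][i]
                    ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                    # full conditional: p(z_i = k | everything else)
                    p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + beta * vocab_size)
                    k = rng.choice(n_topics, p=p / p.sum())
                    z[d][i] = k
                    ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
        return ndk, nkw  # estimate doc-topic and topic-word distributions from these

    # Example: two tiny documents over a six-word vocabulary (word ids 0..5).
    doc_topic, topic_word = gibbs_lda([[0, 1, 2, 0], [3, 4, 5, 3]],
                                      n_topics=2, vocab_size=6)
    ```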
  • by soapbox ( 695743 ) * on Saturday July 29, 2006 @09:18AM (#15805373) Homepage
    Phil Schrodt at the U of Kansas has been doing something similar for years using The Kansas Event Data System [ku.edu] (and its new update, TABARI [ku.edu]). He started using Reuters news summaries to feed the KEDS engine back in the 1990s.

    Following Schrodt's work, Doug Bond and his brother, both recently of Harvard, produced the IDEAS database [vranet.com] using machine-based coding.

    These types of data can be categorized by keywords or topic, though the engines don't try to generate links. The resulting data can also be used for statistical analysis in a certain slashdotter's dissertation research...
  • by docl ( 601856 ) on Saturday July 29, 2006 @10:07AM (#15805550)
    Right, and unsupervised learning can be useful in some areas. Does anybody know how Google News works? It seems to work reasonably well, and it seems to be solving the same problem.
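    Nobody outside Google has said exactly how Google News groups stories, but a generic unsupervised approach looks something like this sketch: vectorize the articles and cluster them, with no labels involved (scikit-learn, made-up headlines).

    ```python
    # Unsupervised grouping of headlines: TF-IDF vectors plus k-means.
    # A guess at the general approach, not Google's actual pipeline.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    headlines = [
        "Landis leads Tour de France after mountain stage",
        "Tour de France winner faces doping questions",
        "Brooklyn apartment prices hit new high",
        "Manhattan and Brooklyn real estate markets cool slightly",
    ]

    tfidf = TfidfVectorizer(stop_words="english").fit_transform(headlines)
    labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(tfidf)
    for label, headline in zip(labels, headlines):
        print(label, headline)
    ```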

    Also note, however, that for most purposes classification is becoming less of a big deal. Read Clay Shirky's article [shirky.com] to understand why. Shirky talks about ontologies specifically, but the gist is the same: tagging each and every word isn't as crazy an idea if the end goal is just "I want to find something related", which is the most common case.
  • Why is this news? (Score:4, Informative)

    by Lam1969 ( 911085 ) on Saturday July 29, 2006 @04:23PM (#15807217)
    This is interesting, but the idea has been around for more than 50 years, and it has been practiced using computers (as opposed to human coders) since the 1960s. Lerner and de Sola Pool came up with the idea of using "themes" to analyze political texts at Stanford in 1954, and hundreds or even thousands of studies using automated text analysis tools have been performed since then. You can download a free text analysis tool called Yoshikoder [yoshikoder.org], which will perform frequency counts of all words in a text, as well as dictionary analysis and several other functions. So why is this news now? I think the press release is really leaving out some key information. The more relevant question that should have been addressed in the original release is how the text was prepared for analysis, because most websites and online databases of news articles (LexisNexis, Factiva, etc.) don't allow batch downloads of huge amounts of news text in XML or some other format that can be easily parsed by text analysis programs.
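    The kind of analysis described above (word frequency counts plus counts against hand-built category dictionaries) fits in a few lines of standard-library Python; the file name and category word lists below are made up for illustration, and Yoshikoder itself is a separate tool.

    ```python
    # Word frequencies and simple dictionary analysis for one text file.
    # "article.txt" and the category word lists are hypothetical examples.
    from collections import Counter
    import re

    text = open("article.txt", encoding="utf-8").read().lower()
    words = re.findall(r"[a-z']+", text)

    freq = Counter(words)
    print(freq.most_common(10))  # raw frequency counts

    categories = {
        "conflict": {"war", "attack", "troops", "violence"},
        "economy": {"market", "prices", "trade", "inflation"},
    }
    for name, vocab in categories.items():
        print(name, sum(freq[w] for w in vocab))
    ```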
  • by jrtom ( 821901 ) on Sunday July 30, 2006 @01:30AM (#15809478) Homepage
    I'm a PhD student in the research group that worked on this. My research is somewhat different (machine learning and data mining on social network data sets), but I've gone to a lot of meetings and presentations on this work, and I've used the model they're describing in my own research. Certainly people have worked on document classification before, but the posters suggesting that this isn't new don't understand what this method accomplishes. For example:
    • basically, the model assigns a probability distribution over topics to each document
      i.e., documents aren't assigned to a single topic (as in latent semantic analysis (LSA)); the sketch at the end of this comment makes that distinction concrete
    • topics are learned from the documents automatically, not pre-defined
      this means, incidentally, that they're not automatically labeled, although a list of the top 5 words for a topic generally characterizes it pretty well.
    • the technique can learn which authors are likely to have written various pieces of a given document, or which cited documents are likely to have contributed most to this document
      side benefit: you can also discover misattributions (e.g., authors with the same name)
    For a good high level description of what these models are doing, see Mark Steyvers' research page [uci.edu] (MS is one of the authors); that page also has links to a number of the preceding papers. Those interested in seeing what the output of a related model looks like might like to check out the Author-Topic Browser [uci.edu].
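    A minimal illustration of that first bullet, with scikit-learn's LDA standing in for the group's own software and two made-up documents: each row of the output is one document's mixture over the learned topics (it sums to 1), rather than a single hard label.

    ```python
    # Per-document topic mixtures: soft assignment, not one label per doc.
    # scikit-learn and the toy documents are stand-ins for illustration.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "the tour de france stage ended in a sprint finish",
        "brooklyn apartment prices and the cycling boom in new york",
    ]
    counts = CountVectorizer(stop_words="english").fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

    print(lda.transform(counts))  # one row per document, one column per topic
    ```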
  • Re:Homeland security (Score:1, Informative)

    by Anonymous Coward on Sunday July 30, 2006 @01:35PM (#15811979)
    We did this two years ago and filed patents. We have a real-time implementation at http://wizag.com/ [wizag.com] in the form of TopicClouds and TopicMaps. It is applied to hundreds of thousands of news articles and blogs (including Slashdot). Both the nodes and the links in the TopicMaps are clickable. Once you create an account, the system creates a personalized TopicCloud for each user.
