Text Mining the New York Times
Roland Piquepaille writes "Text mining is a computer technique for extracting useful information from unstructured text, and it's a difficult task. But now, using a relatively new method named topic modeling, computer scientists from the University of California, Irvine (UCI) have analyzed 330,000 stories published by the New York Times between 2000 and 2002 in just a few hours. They were able to automatically isolate topics such as the Tour de France, the price of apartments in Brooklyn, and dinosaur bones. This technique could soon be used not only by homeland security experts and librarians, but also by physicians, lawyers, real estate agents, and even by you. Read more for additional details and a graph showing how the researchers discovered links between topics and people."
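The UCI team's own system aside, topic modeling of this flavor is easy to sketch. The toy example below uses scikit-learn's LatentDirichletAllocation on a handful of invented headlines and pulls out word clusters reminiscent of the Tour de France, Brooklyn apartments, and dinosaur bones topics mentioned above; the corpus and parameter choices are purely illustrative:

# fit LDA on a tiny invented corpus and print each topic's top words
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "Armstrong wins Tour de France stage in the mountains",
    "Brooklyn apartment prices rise as buyers return to the market",
    "paleontologists unearth dinosaur bones at a Montana dig",
    "cyclists prepare for the Tour de France time trial",
    "real estate agents report record apartment sales in Brooklyn",
    "museum displays newly found dinosaur fossil skeleton",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)                  # document-term count matrix
lda = LatentDirichletAllocation(n_components=3, random_state=0).fit(X)

terms = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[::-1][:5]          # 5 highest-weight words
    print(f"topic {k}:", ", ".join(terms[i] for i in top))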
Artificial intelligence implications? (Score:2, Informative)
An artificial intelligence [earthlink.net] could maybe use these new methods to grok all the human knowledge contained in textual material across the World Wide Web.
Technological Singularity -- [blogcharm.com] -- here we come!
Re:Support Vector Machine? (Score:4, Informative)
The problem with this new method (called LDA, introduced by Blei, Ng, and Jordan in 2003) is (besides other issues) the so-called inference step, as it is analytically intractable. Blei et al. solved this by means of variational methods, i.e. simplifying the model, motivated by averaging-out phenomena. Another method (which, as far as I understand, was applied by Steyvers) is sampling, in this case Gibbs sampling. Usually the variational methods are superior to sampling approaches, as one needs quite a lot of samples for the whole thing to converge.
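For the curious, collapsed Gibbs sampling for LDA (the approach Griffiths and Steyvers describe in "Finding scientific topics") fits in a few lines of Python. This is only an illustrative sketch with made-up hyperparameters and toy documents, not Steyvers' actual code:

import numpy as np

def gibbs_lda(docs, K, V, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampler: docs are lists of integer word ids."""
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))          # doc-topic counts
    n_kw = np.zeros((K, V))                  # topic-word counts
    n_k = np.zeros(K)                        # tokens assigned to each topic
    z = []                                   # topic assignment per token
    for d, doc in enumerate(docs):           # random initialization
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                  # remove the current assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # full conditional p(z = k | all other assignments)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k                  # put it back with the new topic
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return n_kw + beta                       # unnormalized topic-word weights

# two toy documents over a 4-word vocabulary:
weights = gibbs_lda([[0, 1, 0, 2], [2, 3, 3, 1]], K=2, V=4)

The variational route instead optimizes a simpler, factorized approximation to the same posterior, which is typically faster per update but, as the parent says, involves simplifying the model.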
Earlier modes of text mining (Score:5, Informative)
Following Schrodt's work, Doug Bond and his brother, both recently of Harvard, produced the IDEAS database [vranet.com] using machine-based coding.
These types of data can be categorized by keyword or topic, though the engines don't try to generate links between them. The resulting data can also be used for statistical analysis in a certain Slashdotter's dissertation research...
Re:Support Vector Machine? (Score:2, Informative)
Also note that for most purposes, however, classification is becoming less of a big deal. Read Clay Shirky's article [shirky.com] to understand why. Shirky talks about ontologies specifically, but the gist is the same: tagging each and every word isn't as crazy an idea if the end goal is just "I want to find something related," which is the most common case.
Why is this news? (Score:4, Informative)
brief explanation of the method (Score:4, Informative)
the model treats each document as a mixture of topics; i.e., documents aren't assigned to a single topic (as in latent semantic analysis (LSA)); see the sketch after these notes
this means, incidentally, that the topics aren't automatically labeled, although a list of the top 5 words for a topic generally characterizes it pretty well.
side benefit: you can also discover misattributions (e.g., authors with the same name)
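A toy illustration of the mixture point above (illustrative scikit-learn usage with an invented corpus, not the poster's method): fit two topics and print each document's topic distribution. A story touching both themes ends up spread across topics rather than dropped into a single bucket.

# each row of transform(X) is a distribution over topics, not a hard label
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cyclist won the race in the mountains",
    "apartment prices in the city keep rising",
    "a story about a cyclist buying an apartment",  # mixes both themes
]
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

for doc, mix in zip(docs, lda.transform(X)):
    print(f"{mix.round(2)}  {doc}")  # the mixed story gets weight on both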
Re:Homeland security (Score:1, Informative)