
Text-Mining Technique Intelligently Learns Topics

Posted by ScuttleMonkey
from the sound-of-google-knocking-on-your-door dept.
Grv writes "Researchers at University of California-Irvine have announced a new technique they call 'topic modeling' that can be used to analyze and group massive amounts of text-based information. Unlike typical text indexing, topic modeling attempts to learn what a given section of text is about without clues being fed to it by humans. The researchers used their method to analyze and group 330,000 articles from the New York Times archive. From the article, 'The UCI team managed this by programming their software to find patterns of words which occurred together in New York Times articles published between 2000 and 2002. Once these word patterns were indexed, the software then turned them into topics and was able to construct a map of such topics over time.'"
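As a rough illustration of what "patterns of words which occurred together" can mean, here is a minimal Python sketch that counts how many documents each pair of words shares. This is not the UCI team's method, which the summary does not describe in detail. The function name `cooccurrence_counts` and the three sample headlines are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each unordered word pair, how many documents contain both words.

    Hypothetical helper for illustration only; real topic models are
    probabilistic and considerably more involved than raw pair counts.
    """
    counts = Counter()
    for text in documents:
        # Lowercase, split on whitespace, and drop repeats within a document.
        # Sorting makes each pair appear in a consistent (alphabetical) order.
        words = sorted(set(text.lower().split()))
        counts.update(combinations(words, 2))
    return counts


# Made-up sample "articles"
docs = [
    "stock market shares fell",
    "market shares rose",
    "team won the game",
]
pairs = cooccurrence_counts(docs)

# Pairs that co-occur in more than one document hint at a shared topic.
frequent = [pair for pair, n in pairs.items() if n > 1]
print(frequent)
```

Word pairs that keep turning up together across many articles are the raw signal that a topic-learning method could then group into topics. The grouping step itself is left out here.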

  • by NickFitz (5849) <slashdot@nOSPAm.nickfitz.co.uk> on Wednesday August 02, 2006 @07:27PM (#15835964) Homepage

    "Time flies like an arrow, fruit flies like a banana."

    I wonder how well it can deal with a query relating to "flies" ;-)

  • by mapkinase (958129) on Wednesday August 02, 2006 @07:35PM (#15836015) Homepage Journal
    Elementary, Watson: programs understand that "flies" can be a verb or a noun and correctly parse this info out of a sentence.

  • by ctr2sprt (574731) on Wednesday August 02, 2006 @08:04PM (#15836182)

    No, programs don't understand anything, which is the GP's point. You are glossing over the tremendous amount of work required to design a program that is capable of distinguishing between verbs and nouns and behaving appropriately. Human brains are incredibly complex, and we have constant exposure to language. Science indicates that our language is closely tied somehow to the way we think - language shapes brain development, vice versa, or both - and most of us still have trouble with it at times. It took me two passes to make syntactic sense of the GP's example sentence, even though I'd seen it before.

  • by QuantumFTL (197300) * <`moc.liamg' `ta' `kciw.nitsuj'> on Wednesday August 02, 2006 @08:22PM (#15836266)
    Last time this was posted, there were a few [slashdot.org] stupid [slashdot.org] posts [slashdot.org] that seem to assert that this type of thing is trivial.

    There are three main problems in this area of research (or pretty much any other part of CS):
    1. Defining the problem.
    2. Getting an accurate result.
    3. Getting it as fast as possible.
    Their research seems to deal mostly with the third problem, which is one of the biggest barriers to use in real life. Many of these types of problems are NP-hard, or the algorithms used on them require ridiculous amounts of (expensive) labeled data to train from. There are also problems with generalization and overfitting. There is no freeware that can compete with this type of algorithm under these conditions - over 300,000 articles in just a few hours.

    Another thing is that UCI is well known for hosting the UCI Machine Learning Repository [uci.edu]. This has become the gold standard for testing new machine learning algorithms in the academic community; these guys really know what they are about. Back when I was a grad student at Cornell, my research used their data sets to evaluate new ways of creating ensemble classifiers from pre-trained classifiers according to modified Bayesian reasoning, and the sets are useful because they contain a large, diverse set of problems that need to be modeled.

    All that being said, I'm waiting for the paper, along with more technical specifics, to be released so I can really see what this is about - the press release did not contain enough technical data. But rest assured, freeware and/or AdWords does not use this kind of technique, and this is a big step towards mining the massive amount of human and biologically generated data out there.
  • Re:A shameful dupe (Score:3, Interesting)

    by Mr. Underbridge (666784) on Wednesday August 02, 2006 @09:10PM (#15836467)
    That's OK. This technique isn't even new; it's been done - and better than this - for years. Hell, I do it myself.
