Text-Mining Technique Intelligently Learns Topics

Posted by ScuttleMonkey on Wednesday August 02, 2006 @07:20PM from the sound-of-google-knocking-on-your-door dept.

Grv writes "Researchers at University of California-Irvine have announced a new technique they call 'topic modeling' that can be used to analyze and group massive amounts of text-based information. Unlike typical text indexing, topic modeling attempts to learn what a given section of text is about without clues being fed to it by humans. The researchers used their method to analyze and group 330,000 articles from the New York Times archive. From the article, 'The UCI team managed this by programming their software to find patterns of words which occurred together in New York Times articles published between 2000 and 2002. Once these word patterns were indexed, the software then turned them into topics and was able to construct a map of such topics over time.'"
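As a rough illustration of "finding patterns of words which occurred together," here is a minimal Python sketch that counts how many documents each pair of words shares. This is not the UCI team's software or their model; the function name `cooccurrence_counts` and the three toy documents are made up for the example, and a real topic model would do far more than count pairs.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count, for each unordered word pair, the number of documents containing both words."""
    counts = Counter()
    for text in documents:
        # Use the set of distinct words so a pair is counted once per document.
        words = sorted(set(text.lower().split()))
        counts.update(combinations(words, 2))
    return counts

# Toy documents (hypothetical, not from the New York Times archive).
docs = [
    "tour de france cycling stage",
    "cycling stage win in france",
    "stock market shares fall",
]
pairs = cooccurrence_counts(docs)
print(pairs[("cycling", "france")])  # pairs are stored in alphabetical order
```

Pairs that show up together across many documents are the raw material a topic-learning method could build on; pairs that never co-occur simply have a count of zero.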

This discussion has been archived. No new comments can be posted.

Re:Can it deal with the canonical problem? (Score:5, Insightful)

by NickFitz ( 5849 ) writes: <slashdot@nickfitz.co . u k> on Wednesday August 02, 2006 @07:47PM (#15836088) Homepage

Ah, but the point of the example is that the system must either understand or otherwise be able to derive the fact that there are animals called "fruit flies" but not animals called "time flies", that "like" can be a verb or an adverb depending on the context, and most importantly, that in the first case the relationship between subject and object is metaphorical, and in the second, factual. It's how the programs "understand that flies can be a verb or a noun and correctly parse this info out from a sentence" that makes the difference between yet another failed attempt and a meaningful breakthrough. In fact, your reply begs the question - a correct use of that phrase, for a change :-)

Use... (Score:3, Insightful)

by posterlogo ( 943853 ) writes: on Wednesday August 02, 2006 @07:59PM (#15836160)

Ironically, sites like the New York Times already use tagging to help group and link article topics... which is something /. is apparently experimenting with. The tagging function here hasn't been very useful, and I suspect many other places suffer from human laziness. Perhaps this AI approach is the way to go.

Topic modeling to the rescue (Score:5, Insightful)

by alienmole ( 15522 ) writes: on Wednesday August 02, 2006 @08:15PM (#15836244)

Perhaps topic modeling could be used to analyze Slashdot to detect dupes before they're posted?

Ants and topics (Score:3, Insightful)

by Randym ( 25779 ) writes: on Wednesday August 02, 2006 @10:53PM (#15836937)

What this article shows is that probabilistic topic-based modeling in text analysis -- an NP-hard area -- works better than the old ways. This is not surprising: the probabilistic "ant" model developed by the Italians turned out to be a clever way to solve the Traveling Salesman problem. What these both show is the applicability of probabilistic modeling to NP-hard problems.
I'd like to see someone apply this technique to the articles and comments making up the Slashdot corpus. CmdrTaco might be able to find a more focused set of topics. It might even be possible to tease out who on /. are the most interesting and/or informative posters, whether over the entire corpus or within any given topic.

Re:Can it deal with the canonical problem? (Score:2, Insightful)

by navarroj ( 907499 ) writes: on Thursday August 03, 2006 @04:38AM (#15838034) Homepage

"Time flies like an arrow, fruit flies like a banana."

I wonder how well it can deal with a query relating to "flies" ;-)

As far as I understand, this approach is not trying to extract any meaning from sentences, paragraphs or whatever. You don't even "query" the system, so your 'canonical problem' is not relevant here.

The system uses some sort of statistical text analysis (no semantics, no meaning) in order to group together news articles that seem to be talking about the same topic.
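To make "statistical, no semantics" concrete, here is a small Python sketch that groups articles purely by word overlap, using cosine similarity between word-count vectors. It is a much simpler stand-in and not the method from the paper; the names `word_counts`, `cosine`, `group_articles`, the 0.3 threshold, and the sample articles are all assumptions for illustration.

```python
import math
from collections import Counter

def word_counts(text):
    """Bag-of-words vector: word -> number of occurrences."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def group_articles(articles, threshold=0.3):
    """Greedily put each article in the first group whose first member is similar enough."""
    vectors = [word_counts(t) for t in articles]
    groups = []  # each group is a list of article indices
    for i, vec in enumerate(vectors):
        for group in groups:
            if cosine(vec, vectors[group[0]]) >= threshold:
                group.append(i)
                break
        else:
            # No existing group matched, so start a new one.
            groups.append([i])
    return groups

# Hypothetical sample articles.
articles = [
    "cycling stage win in france",
    "france cycling tour stage results",
    "stock market shares fall sharply",
    "shares fall as stock market slides",
]
groups = group_articles(articles)
print(groups)
```

Nothing here knows what "flies" or "like" means; articles land in the same group only because they share words, which is the point being made above.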

RTFP: Re:Can it deal with the canonical problem? (Score:3, Insightful)

by Phreakiture ( 547094 ) writes: on Thursday August 03, 2006 @09:28AM (#15839050) Homepage

Read The Fine Paper that these folks wrote. It will reveal that they used the Perl module Lingua::EN::Tagger to parse the English language content into parts of speech. You can then download and install that module and experiment with it yourself.

I just did the experiment myself, and the result I get is that it identifies "time", "arrow", "fruit" and "banana" as nouns (incorrectly identifying "time" as a proper noun), and both instances of "flies" as a verb and both instances of "like" as prepositions.

In other words, no.
