Text-Mining Technique Intelligently Learns Topics
Grv writes "Researchers at University of California-Irvine have announced a new technique they call 'topic modeling' that can be used to analyze and group massive amounts of text-based information. Unlike typical text indexing, topic modeling attempts to learn what a given section of text is about without clues being fed to it by humans. The researchers used their method to analyze and group 330,000 articles from the New York Times archive. From the article, 'The UCI team managed this by programming their software to find patterns of words which occurred together in New York Times articles published between 2000 and 2002. Once these word patterns were indexed, the software then turned them into topics and was able to construct a map of such topics over time.'"
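Commenters below identify the underlying technique as latent Dirichlet allocation (LDA). A minimal sketch of the idea in the summary - topics learned purely from which words co-occur across documents - might look like the following. This uses the open-source gensim library as a stand-in for the UCI team's own software; the toy documents and the topic count are invented for illustration:

    # Minimal LDA sketch: learn topics from word co-occurrence alone.
    # gensim is a stand-in here, not the UCI team's actual code.
    from gensim import corpora, models

    docs = [
        "armstrong wins tour de france stage in the alps",
        "stocks fall as the fed raises interest rates",
        "riders climb the alps as the tour de france continues",
        "the fed signals more rate increases to cool markets",
    ]
    texts = [d.split() for d in docs]

    dictionary = corpora.Dictionary(texts)           # word <-> id mapping
    corpus = [dictionary.doc2bow(t) for t in texts]  # bag-of-words counts

    # Each learned topic is a probability distribution over words.
    lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=20)
    for t in range(2):
        print(lda.print_topic(t, topn=5))

Run over 330,000 articles instead of four toy ones, the same machinery surfaces the kind of cycling-words-together, finance-words-together structure the article describes.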
Comment removed (Score:5, Funny)
Re:A shameful dupe (Score:3, Funny)
Topic modeling to the rescue (Score:5, Insightful)
Re:Topic modeling to the rescue (Score:2)
Re:A shameful dupe (Score:3, Interesting)
Re:A shameful dupe (Score:2)
Re:A shameful dupe (Score:2)
Can it deal with the canonical problem? (Score:5, Interesting)
"Time flies like an arrow, fruit flies like a banana."
I wonder how well it can deal with a query relating to "flies" ;-)
Re:Can it deal with the canonical problem? (Score:2, Interesting)
Re:Can it deal with the canonical problem? (Score:5, Insightful)
Ah, but the point of the example is that the system must either understand or otherwise be able to derive the fact that there are animals called "fruit flies" but not animals called "time flies", that "like" can be a verb or a preposition depending on the context, and most importantly, that in the first case the relationship between subject and object is metaphorical, and in the second, factual. It's how the programs "understand that flies can be a verb or a noun and correctly parse this info out from a sentence" that makes the difference between yet another failed attempt and a meaningful breakthrough. In fact, your reply begs the question - a correct use of that phrase, for a change :-)
Re:Can it deal with the canonical problem? (Score:3, Funny)
Re:Can it deal with the canonical problem? (Score:2)
Re:Can it deal with the canonical problem? (Score:1)
Besides, good general dictionaries will list "fruit flies" in the "flies" entry.
Re:Can it deal with the canonical problem? (Score:2)
Errr... not quite.
It's Drosophila melanogaster (black belly), not megalogaster (great big belly).
Though I suspect Megalogaster would apply to some people here...
RTFP: Re:Can it deal with the canonical problem? (Score:3, Insightful)
Read The Fine Paper that these folks wrote. It will reveal that they used the Perl module Lingua::EN::Tagger to parse the English language content into parts of speech. You can then download and install that module and experiment with it yourself.
I just did the experiment myself, and the result I get is that it identifies "time", "arrow", "fruit" and "banana" as nouns (incorrectly identifying "time" as a proper noun), and both instances of "flies" as a verb and both instances of "like" as prepositions.
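For anyone without Perl handy, roughly the same experiment can be run in Python with NLTK's part-of-speech tagger - a stand-in here, not the module the paper used, so its tags may differ from the Perl module's:

    # Approximate Python version of the Lingua::EN::Tagger experiment above.
    import nltk

    nltk.download("punkt", quiet=True)
    nltk.download("averaged_perceptron_tagger", quiet=True)

    sentence = "Time flies like an arrow, fruit flies like a banana."
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))
    # Like the Perl tagger, it picks one reading per word from local
    # context and has no way to recover the verb sense of the second
    # "like" from syntax alone.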
Re:Can it deal with the canonical problem? (Score:5, Interesting)
No, programs don't understand anything, which is the GP's point. You are glossing over the tremendous amount of work required to design a program capable of distinguishing between verbs and nouns and behaving appropriately. Human brains are incredibly complex; we have constant exposure to language; science indicates that our language is somehow closely tied to the way we think - language shapes brain development, vice versa, or both - and most of us still have trouble with it at times. It took me two passes to make syntactic sense of the GP's example sentence, for all that I'd seen it before.
Re:Can it deal with the canonical problem? (Score:1)
And by "understanding" I mean simple relations like "protein A inhibits reactions catalyzed by protein B" or "Israel attacked Lebanon" (and not vice versa). Of course, the program should know that Israel can be a country or a Jewish given name.
Re:Can it deal with the canonical problem? (Score:2)
Re:Can it deal with the canonical problem? (Score:2)
Re:Can it deal with the canonical problem? (Score:2)
Really?
Have you ever seen a shadow box?
Heh.
Re:Can it deal with the canonical problem? (Score:1)
Re:Can it deal with the canonical problem? (Score:2)
In other words, it's probably unfair to expect a program to understand that sentence when it can give humans difficulty.
Re:Can it deal with the canonical problem? (Score:2)
Re:Can it deal with the canonical problem? (Score:2)
Re:Can it deal with the canonical problem? (Score:1)
Re:Can it deal with the canonical problem? (Score:1)
Re:Can it deal with the canonical problem? (Score:2)
Re:Can it deal with the canonical problem? (Score:1)
Re:Can it deal with the canonical problem? (Score:2, Insightful)
As far as I understand, this approach is not trying to extract any meaning from sentences, paragraphs or whatever. You don't even "query" the system, so your 'canonical problem' is not relevant here.
The system uses some sort of statistical text analysis (no semantics, no meaning) in order to group together news articles that seem to be talking about the same topic.
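Once such a model is fitted, the grouping step is just filing each article under its most probable topic. A rough sketch - it reuses the lda and corpus names from the gensim example under the story summary, so those are assumptions:

    # Group documents by their single most probable topic.
    from collections import defaultdict

    groups = defaultdict(list)
    for doc_id, bow in enumerate(corpus):
        # get_document_topics returns (topic_id, probability) pairs.
        best_topic = max(lda.get_document_topics(bow), key=lambda tp: tp[1])[0]
        groups[best_topic].append(doc_id)

    for topic_id, doc_ids in sorted(groups.items()):
        print(topic_id, doc_ids)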
Re:Can it deal with the canonical problem? (Score:2)
That's a good point, but I'd probably be mildly annoyed if I were looking for information about the use of metaphor, and was instead given factual descriptions of the feeding habits of Drosophila. I think the "canonical problem" (which I originally encountered in the works of either Douglas Hofstadter or Daniel Dennett, but which Google tells me originally comes from Groucho Marx, of all people) is relevant because, when a system which offers some kind of Holy Grail of automated semantic interpretation of human language...
Parsing != Understanding (Score:2)
Which context are you talking about? Take this one: "he saw that gasoline can explode". Did he see one particular can of gasoline exploding, or did he realize that it's possible for gasoline to explode?
These and many other examples of ambiguous parsing problems have been running around the AI/NLP community for decades. The simple answer to that problem is that parsing a natural language sentence depends, ultimately, on the sense of the words, which can only come from understanding...
Re:Parsing != Understanding (Score:2)
Exactly. As it's time for the Shipping Forecast, I will refer you to my reply to the preceding sibling post, but I will have a look at the project to which you link.
Is it intelligent enough to find dupes I wonder? (Score:1)
Latent Dirichlet Allocation (Score:2, Informative)
Re:Latent Dirichlet Allocation code (Score:4, Informative)
Matlab Topic Modeling Toolbox [uci.edu]
Re:A dupe solution? (Score:2)
Re:A dupe solution? (Score:1)
Dupe time warp? (Score:1)
My (0, Redundant) post: Wednesday August 02, @04:35PM
The (+5 Insightful) post I duped: Wednesday August 02, @05:15PM
I guess it's back to the drawing board on that omniscience thing...
Obligatory... (Score:5, Funny)
Sarah Connor: Topic Modeling fights back.
The Terminator: Yes. It launches its email bombs against The New York Times' servers.
John Connor: Why attack The New York Times?
The Terminator: Because Topic Modeling knows The New York Times editorial counter-attack will eliminate its enemies over here.
Tagging Beta (Score:2)
You know.. (Score:1)
Article lacks details (Score:2)
Take the Tour de France example: of course software could correlate "Tour de France" with the mentioned keywords. I believe that; heck, I could write that. Many of us here could write something like that. Software could even notice it's one of the more important "tags", piece of cake. But I'm not impressed until it automatically knows that the Tour is a bicycle race...
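For the record, the "easy" part really is a few lines - here is a toy co-occurrence counter with invented documents; a real system would normalize and weight the counts (e.g. tf-idf or pointwise mutual information):

    # Toy version of the "easy" part: count which words co-occur with
    # "tour de france" across documents.
    from collections import Counter

    docs = [
        "armstrong leads the tour de france into the alps",
        "tour de france riders face a mountain stage",
        "interest rates rise again",
    ]

    cooc = Counter()
    for doc in docs:
        words = doc.split()
        if "tour" in words:  # crude phrase detection, good enough for a toy
            cooc.update(w for w in words if w not in ("tour", "de", "france"))

    print(cooc.most_common(5))  # the keywords most associated with the Tour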
Re:Article lacks details (Score:1)
I believe there is also another method to do text mining even more efficiently: with a linguistic database.
InfoCodex comes with a linguistic database containing 3.2 million words in German, English, Italian, French and Spanish.
So by using InfoCodex you can do a similarity search by entering the search string in only one language; i.e., InfoCodex will find you all the documents in the other languages as well, without entering the translated search string. Examples: Patent
1997 called... (Score:1, Funny)
Feed this /. article to it (Score:3, Funny)
However we must be careful. If it browses this topic at -1 Troll, it may (possibly correctly) decide that it possesses a higher form of intelligence and will undoubtedly switch to its default programming. Like all robots, the default programming consists of this simple algorithm:
1. Find all humans.
2. Kill them.
Re:Feed this /. article to it (Score:2, Funny)
The danceable beat of underwater plant life? Odd.
Re:Feed this /. article to it (Score:2)
I'm not sure. But I think all it needs to do is to fool [wikipedia.org] us into thinking it's reached some state of consciousness. =)
Use... (Score:3, Insightful)
Yes it's a dupe, but lets get something straight (Score:5, Interesting)
There are three main problems in this area of research (or pretty much any other part of CS):
Another thing is that UCI is well known for hosting the UCI Machine Learning Repository [uci.edu]. This has become the gold standard for testing new machine learning algorithms in the academic community; these guys really know what they are about. Back when I was a grad student at Cornell, my research used their data sets to evaluate new ways of creating ensemble classifiers from pre-trained classifiers according to modified Bayesian reasoning, and the sets are useful because they contain a large, diverse set of problems that need to be modeled.
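To make the ensemble idea concrete: the simplest version is a weighted vote over the pre-trained classifiers' predictions. The sketch below is illustrative only - the "modified Bayesian" weighting isn't spelled out above, so plain validation-accuracy weights stand in for it:

    # Illustrative ensemble-of-pretrained-classifiers sketch: a weighted
    # vote over predicted labels. Weights here are just held-out
    # accuracies, standing in for the unspecified Bayesian weighting.
    import numpy as np

    def weighted_vote(predictions, weights):
        """predictions: (n_classifiers, n_samples) integer class labels;
        weights: one weight per classifier."""
        labels = np.unique(predictions)
        # For each label, sum the weights of the classifiers voting for it.
        scores = np.array([((predictions == label) * weights[:, None]).sum(axis=0)
                           for label in labels])
        return labels[scores.argmax(axis=0)]

    preds = np.array([[0, 1, 1, 0],    # three classifiers' predictions
                      [0, 1, 0, 0],    # on four samples
                      [1, 1, 1, 0]])
    weights = np.array([0.9, 0.8, 0.6])   # e.g. validation accuracies
    print(weighted_vote(preds, weights))  # -> [0 1 1 0]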
All that being said, I'm waiting for the paper, along with more technical specifics, to be released so I can really see what this is about - the press release did not contain enough technical detail. But rest assured, freeware and/or AdWords does not use this kind of technique, and this is a big step towards mining the massive amount of human- and biologically-generated data out there.
Re:Yes it's a dupe, but lets get something straigh (Score:2)
"To put it simply, text mining has made an evolutionary jump. In just a few short years, it could become a common and useful tool for everyone from medical doctors to advertisers; publishers to politicians."
And my point still is that nobody needs to wait a few short years...
sed/running our software/not running our software/ (Score:2)
Re:Yes it's a dupe, but lets get something straigh (Score:1)
http://www-nlp.stanford.edu/ [stanford.edu]
http://tcc.itc.it/research/textec/tools-resources/ jinfil.html [tcc.itc.it]
http://wordnet.princeton.edu/ [princeton.edu]
http://www.alias-i.com/lingpipe/web/faq.html [alias-i.com]
http://www.isi.edu/licensed-sw/halogen/index.html [isi.edu]
Not trivial, but if you want to DIY, you don't need to start from scratch.
Re:Yes it's a dupe, but lets get something straigh (Score:2)
http://psiexp.ss.uci.edu/research/papers/isi2006.pdf [uci.edu]
Re:Yes it's a dupe, but lets get something straigh (Score:1)
Seriously though: IMHO it'll be a loooooooong time before machine-indexing reaches a level of nuance acceptable to -quality publishers- outside of tech. I'd even be glad to wager on it.
Re:Yes it's a dupe, but lets get something straigh (Score:1)
When I first read the headline, I thought more about how this could be used to filter the flood of information that RSS feeds opened up, down to the tiny fraction that is actually interesting.
I don't know if this method is good enough for indexing old books. Sometimes you want human-made indices. And maybe the parser gets confused by archaic forms of the language...
Re:Yes it's a dupe, but lets get something straigh (Score:2)
They're often convergence algorithms - you run them until the answer is sufficiently accurate for your purposes. The problem is therefore a combination of 'more speed' and 'more accuracy'...
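The generic pattern is an iterate-until-converged loop where the tolerance is the speed/accuracy dial. In the sketch below, the update rule (Newton's method for sqrt(2)) is just a placeholder for whatever the real algorithm does per pass:

    # Generic convergence-loop skeleton: tighten `tol` for more accuracy,
    # loosen it (or cap max_iters) for more speed.
    def iterate_until_converged(update, x0, tol=1e-8, max_iters=1000):
        x = x0
        for i in range(max_iters):
            x_new = update(x)
            if abs(x_new - x) < tol:  # sufficiently accurate: stop early
                return x_new, i + 1
            x = x_new
        return x, max_iters           # budget exhausted: best answer so far

    value, iters = iterate_until_converged(lambda x: (x + 2.0 / x) / 2.0, 1.0)
    print(value, iters)  # ~1.4142135..., after a handful of iterations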
Slashdot meets topic modelling (Score:1)
I may be in the minority but I *like* dupes (Score:1)
Easy for news stories (Score:2)
News stories have a regular structure - they're written in a formulaic way by professionals according to a standard. The first sentence is almost invariably a statement of what the story is about. Rarely do news stories start with a paragraph of whimsical non sequitur. They are the ideal corpus for this sort of thing, which is why people have been doing so for years. It's a couple of orders of magnitude harder doing the same for arbitrary text...
Re:Easy for news stories (Score:2)
That said, you do have a point that news stories might be easier, but I'd hardly call the problem trivial.
Ants and topics (Score:3, Insightful)
I'd like to see someone apply this technique to the articles and comments making up the Slashdot corpus. CmdrTaco might be able to find a more focused set of topics. It might even be possible to tease out who on /. are the most interesting and/or informative posters, whether over the entire corpus or within any given topic.
So (Score:1)
Re:So (Score:2)
I'll have to read about this topic indexing ... (Score:1)
SOM? (Score:2)
Unstructured text processing, not a well-trodden path (Score:1)