
Text-Mining Technique Intelligently Learns Topics

Posted by ScuttleMonkey
from the sound-of-google-knocking-on-your-door dept.
Grv writes "Researchers at University of California-Irvine have announced a new technique they call 'topic modeling' that can be used to analyze and group massive amounts of text-based information. Unlike typical text indexing, topic modeling attempts to learn what a given section of text is about without clues being fed to it by humans. The researchers used their method to analyze and group 330,000 articles from the New York Times archive. From the article, 'The UCI team managed this by programming their software to find patterns of words which occurred together in New York Times articles published between 2000 and 2002. Once these word patterns were indexed, the software then turned them into topics and was able to construct a map of such topics over time.'"
This discussion has been archived. No new comments can be posted.

  • by davidoff404 (764733) on Wednesday August 02, 2006 @06:27PM (#15835959)
    Not only is this story a dupe [slashdot.org], but you were beaten to the punch by Roland Piquepiquepiquepaille.

    Oh the shame...
  • by NickFitz (5849) <slashdot@@@nickfitz...co...uk> on Wednesday August 02, 2006 @06:27PM (#15835964) Homepage

    "Time flies like an arrow, fruit flies like a banana."

    I wonder how well it can deal with a query relating to "flies" ;-)

    • Elementary, Watson, programs understand that flies can be a verbs or a noun and correctly parse this info out from a sentence.

      • by NickFitz (5849) <slashdot@@@nickfitz...co...uk> on Wednesday August 02, 2006 @06:47PM (#15836088) Homepage

        Ah, but the point of the example is that the system must either understand or otherwise be able to derive the fact that there are animals called "fruit flies" but not animals called "time flies", that "like" can be a verb or a preposition depending on the context, and most importantly, that in the first case the relationship between subject and object is metaphorical, and in the second, factual. It's how the programs "understand that flies can be a verbs or a noun and correctly parse this info out from a sentence" that makes the difference between yet another failed attempt and a meaningful breakthrough. In fact, your reply begs the question - a correct use of that phrase, for a change :-)

        • Time's fun when you're having flies.
        • That's actually a relatively simple problem if it has a list of types of flies (through being given it or having mined it). If it doesn't, it would struggle just as much as a human who wasn't aware of the existence of fruit flies would.
        • In my experience, you have to have some sort of specialized dictionary. In this case, it should also include the fact that "fruit flies" are Drosophila melanogaster. The program can cover only one step up, from rich dictionary to the subject, but not both steps.

          Besides, good general dictionaries will list "fruit flies" in the "flies" entry.
        • Read The Fine Paper that these folks wrote. It will reveal that they used the Perl module Lingua::EN::Tagger to parse the English language content into parts of speech. You can then download and install that module and experiment with it yourself.

          I just did the experiment myself, and the result I get is that it identifies "time", "arrow", "fruit" and "banana" as nouns (incorrectly identifying "time" as a proper noun), and both instances of "flies" as a verb and both instances of "like" as prepositions.
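A toy sketch of why a tagger has to choose here at all. It does not use Lingua::EN::Tagger; the lexicon below is a hand-written assumption, and the function names are made up for illustration. It only enumerates every part-of-speech sequence the lexicon allows, which is the set a real tagger must then narrow down using context.

```python
from itertools import product

# Hypothetical hand-written lexicon: each word maps to the parts of speech
# it may take. A real tagger would learn these (with probabilities) from data.
LEXICON = {
    "time": ["NOUN", "VERB"],
    "fruit": ["NOUN"],
    "flies": ["NOUN", "VERB"],
    "like": ["VERB", "PREP"],
    "a": ["DET"],
    "an": ["DET"],
    "arrow": ["NOUN"],
    "banana": ["NOUN"],
}


def candidate_taggings(sentence):
    """Return every (word, tag) sequence the lexicon permits for the sentence."""
    words = sentence.lower().split()
    options = [LEXICON[word] for word in words]
    return [list(zip(words, tags)) for tags in product(*options)]


def count_taggings(sentence):
    """Number of tag sequences a tagger would have to choose between."""
    return len(candidate_taggings(sentence))
```

With this lexicon, both the "fruit flies (noun) like (verb)" reading and the "flies (verb) like (preposition)" reading appear among the candidates; picking between them needs information the lexicon alone does not carry.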

      • by ctr2sprt (574731) on Wednesday August 02, 2006 @07:04PM (#15836182)

        No, programs don't understand anything, which is the GP's point. You are glossing over the tremendous amount of work required to design a program which is capable of distinguishing between verbs and nouns and behaving appropriately. Human brains are incredibly complex, we have constant exposure to language, science indicates that our language is closely tied somehow to the way we think - language shapes brain development, vice versa, or both - and most of us still have trouble with it at times. It took me two passes to make syntactic sense of the GP's example sentence for all that I'd seen it before.

        • General parsers that can recognise parts of speech and understand subject and object in sentences have existed for quite some time now.

          And by "understanding" I mean simple relations like "protein A inhibits reactions catalyzed by protein B", "Israel attacked Lebanon" (and not vice versa). Of course, the program should know that Israel can be a country and a Jewish name.
        • Indeed, there is an anthropological school of thought that posits a "language singularity" that gave rise to human consciousness. The development of language was a "speciation event". The study of this imagined event is called Generative Anthropology [wikipedia.org]. It's probably heavy sledding for most slashdotters, who I imagine would find it obtuse and boring, but it's really quite interesting stuff for those that like to really think about AI. If you've got some anthro background, or familiarity with post structuralis
    • Damnit, stop with that example. Hell, it took *me* a few minutes to parse. Fruit flies like BANANAS.
    • Maybe it's because I am an Aspie, but... isn't the second clause valid with either interpretation? "[Fruit] [flies] like a banana," yes, fruit also flies like an apple, or an orange, depending on which sort of fruit happens to be flying about. "[Fruit flies] like a banana," and I imagine they would, being fruit flies.
      • The funny thing is that I didn't actually see the "correct" interpretation until I read the responses. I first read it and thought "Who would ever comment that a piece of fruit flew like a banana?"

        In other words, it's probably unfair to expect a program to understand that sentence when it can give humans difficulty.
      • It's the verbal equivalent of one of those pictures that represent two different things, depending on which part you are perceiving as figure and which part you are perceiving as ground. The most common and simple example is the picture [btinternet.com] that can either be a vase or two people seen in profile facing each other.
    • I think the answer is 'no'. This software needn't distinguish between the grammatical status of "flies", it would be sufficient to spot both the concepts of chronological speed and the animals and that could be done based on the phrases "time flies" and "fruit flies" and their respective correlation to articles of either nostalgic or biological nature.
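A minimal sketch of the phrase-level idea in the comment above, assuming a made-up four-sentence corpus and a hand-picked stopword list (both hypothetical). It counts which words share a document with each phrase, without looking at grammar at all.

```python
from collections import Counter

# Made-up miniature corpus; real input would be full articles.
DOCS = [
    "time flies when we remember the old school years",
    "how time flies since those summer years at school",
    "fruit flies are used in genetics lab experiments",
    "the lab bred fruit flies for genetics research",
]

# Hand-picked function words to ignore (an assumption for this toy corpus).
STOPWORDS = {"the", "we", "when", "how", "are", "in", "for",
             "since", "those", "at", "used"}


def context_words(phrase, docs):
    """Count words that share a document with the phrase, excluding the phrase itself."""
    counts = Counter()
    phrase_words = set(phrase.split())
    for doc in docs:
        if phrase in doc:  # plain substring match, good enough for a sketch
            for word in doc.split():
                if word not in phrase_words and word not in STOPWORDS:
                    counts[word] += 1
    return counts
```

On this corpus the two phrases end up with disjoint context vocabularies (school and years on one side, genetics and lab on the other), which is the kind of correlation the comment describes. Real text would be far noisier.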
    • I bet a very complex, well-trained, and correctly structured neural network could (theoretically) handle this...
      • You know, I'm not sure about that. I don't think there has been success in this area - that's why this technique - and other techniques that rely heavily on statistics (like probabilistic latent semantic analysis) have generated such interest among those interested in text mining. Since humans are interacting with the results, the fact that we humans can distinguish fruit flies vs time flies is enough - my understanding is that this approach summarizes the data based on relevance/proximity of significant/
    • Time flies don't like an arrow, fruit does not fly like a banana.
    • "Time flies like an arrow, fruit flies like a banana."

      I wonder how well it can deal with a query relating to "flies" ;-)

      As far as I understand, this approach is not trying to extract any meaning from sentences, paragraphs or whatever. You don't even "query" the system, so your 'canonical problem' is not relevant here.

      The system uses some sort of statistical text analysis (no semantics, no meaning) in order to group together news articles that seem to be talking about the same topic.
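To make "statistical, no semantics" concrete, here is a rough sketch of grouping articles purely by word counts. It is not the UCI method: it uses plain bag-of-words cosine similarity with a greedy grouping pass, and the function names and the 0.3 threshold are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Word-count vector for a text; word order and grammar are ignored."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_articles(texts, threshold=0.3):
    """Greedy grouping: each article joins the first group whose first
    member is similar enough, otherwise it starts a new group."""
    groups = []
    for index, text in enumerate(texts):
        vector = bag_of_words(text)
        for group in groups:
            representative = bag_of_words(texts[group[0]])
            if cosine(vector, representative) >= threshold:
                group.append(index)
                break
        else:
            groups.append([index])
    return groups
```

Nothing here knows what a cyclist or a stock market is; articles land together only because they share vocabulary. A topic model goes further by inferring shared hidden topics rather than comparing articles pairwise.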

      • That's a good point, but I'd probably be mildly annoyed if I was looking for information about the use of metaphor, and was instead given factual descriptions of the feeding habits of Drosophila. I think the "canonical problem" (which I originally encountered in the works of either Douglas Hofstadter or Daniel Dennett, but which Google tells me originally comes from Groucho Marx, of all people) is relevant because, when a system which offers some kind of Holy Grail of automated semantic interpretation of h

    • "Time flies like an arrow, fruit flies like a banana."

      In which context are you talking? Take this one: "he saw that gasoline can explode". Did he see one particular can of gasoline exploding or did he realize that it's possible for gasoline to explode?

      These and many other examples of ambiguous parsing problems have been running around the AI/NLP community for decades. The simple answer to that problem is that parsing a natural language sentence depends, ultimately, on the sense of the words, which can only

  • by Anonymous Coward
    Here's the source code: Latent Dirichlet Allocation [princeton.edu]
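For readers who want to see the shape of the model without reading the linked code, here is a minimal pure-Python sketch of Latent Dirichlet Allocation fitted by collapsed Gibbs sampling. It is not the linked implementation and makes no claim about how the UCI team fitted their model; the function name, default hyperparameters (alpha, beta) and iteration count are assumptions picked for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised documents.

    docs is a list of token lists. Returns (doc_topic, topic_word, vocab),
    where doc_topic[d][k] counts tokens of document d assigned to topic k
    and topic_word[k][w] counts assignments of vocab[w] to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic for every token.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                w = word_id[word]
                k = assignments[d][i]
                # Remove this token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Weight of topic t is proportional to
                # (n(d,t) + alpha) * (n(t,w) + beta) / (n(t) + V * beta).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                # Record the newly sampled assignment.
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    return doc_topic, topic_word, vocab
```

Sorting a row of topic_word and mapping the indices back through vocab gives the most heavily assigned words per topic. On a corpus of hundreds of thousands of articles this naive loop would be far too slow, which is where the engineering effort in real implementations goes.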
  • by Stormwatch (703920) <rodrigogirao&hotmail,com> on Wednesday August 02, 2006 @06:41PM (#15836055) Homepage
    The Terminator: The Topic Modeling Funding Bill is passed. The system goes on-line August 4th, 1997. Human decisions are removed from strategic defense. Topic Modeling begins to learn at a geometric rate. It becomes self-aware at 2:14 a.m. Eastern time, August 29th. In a panic, they try to pull the plug.

    Sarah Connor: Topic Modeling fights back.

    The Terminator: Yes. It launches its emailbombs against The New York Times' servers.

    John Connor: Why attack The New York Times?

    The Terminator: Because Topic Modeling knows The New York Times editorial counter-attack will eliminate its enemies over here.
  • I wonder if it can replace Slashdot's tagging beta....
  • It's like Digg, but automated.
  • The article seriously lacks any details. I still don't know if there's any innovation here and what this new method actually does so much better than other stuff.

    Take the Tour de France example: of course software could correlate "Tour de France" to the mentioned keywords, I believe that, heck, I could write that. Many of us here could write something like that. Software could even notice it's one of the more important "tags", piece of cake. But I'm not impressed until it automatically knows that the tour i
    • I totally agree. Check this out:

      I believe there is also another method to do text mining even more efficiently; with a linguistical database.

      InfoCodex comes with a linguistical database containing 3.2 Mio words in German, English, Italian, French and Spanish.

      So by using InfoCodex you can do a similarity search by entering the search string only in one language, i.e. InfoCodex will find you all the documents in the other languages as well, without entering the translated search string. Examples: Patent
  • by Anonymous Coward
    They want their information retrieval back.
  • by roman_mir (125474) on Wednesday August 02, 2006 @06:58PM (#15836144) Homepage Journal
    and see if it figures out that we are talking about it. If it can identify itself to itself from a 3rd person point of view, then does it mean it reached some state of consciousness?

    However we must be careful. If it browses this topic at -1 Troll, it may (possibly correctly) decide that it possesses a higher form of intelligence and will undoubtedly switch to its default programming. Like all robots, the default programming consists of this simple algorithm:
    1. Find all humans.
    2. Kill them.
  • Use... (Score:3, Insightful)

    by posterlogo (943853) on Wednesday August 02, 2006 @06:59PM (#15836160)
    Ironically, sites like the New York Times already use tagging to help group and link article topics...which is something /. is experimenting with apparently. The tagging function here hasn't been very useful, and I suspect many other places suffer from human laziness. Perhaps this AI approach is the way to go.
  • by QuantumFTL (197300) * <justin.wickNO@SPAMgmail.com> on Wednesday August 02, 2006 @07:22PM (#15836266)
    Last time this was posted, there were a few [slashdot.org] stupid [slashdot.org] posts [slashdot.org] that seem to assert that this type of thing is trivial.

    There are three main problems in this area of research (or pretty much any other part of CS):
    1. Defining the problem.
    2. Getting an accurate result.
    3. Getting it as fast as possible.
    Their research seems to deal mostly with the third problem, which is one of the biggest barriers to use in real life. Many of the algorithms used on these types of problems are NP, or require ridiculous amounts of (expensive) labeled data to train from. Also there are problems with generalization and overfitting. There is no freeware software that can compete with this type of algorithm under these conditions - over 300,000 articles in just a few hours.

    Another thing is that UCI is well known for hosting the UCI Machine Learning Repository [uci.edu]. This has become the gold standard for testing new machine learning algorithms in the academic community; these guys really know what they are about. Back when I was a grad student at Cornell, my research used their data sets to evaluate new ways of creating ensemble classifiers from pre-trained classifiers according to modified Bayesian reasoning, and the sets are useful because they contain a large, diverse set of problems that need to be modeled.

    All that being said, I'm waiting for the paper, along with more technical specifics, to be released so I can really see what this is about - the press release did not contain enough technical data, but rest assured, freeware and/or adwords does not use this kind of technique, and this is a big step towards mining the massive amount of human and biologically generated data out there.
  • Now that's what it was missing so /. could get along without the dupes!
  • This is pretty easy stuff when applied to news stories, and has been around for decades.

    News stories have a regular structure - they're written in a formulaic way by professionals according to a standard. The first sentence is almost invariably a statement of what the story is about. Rarely do news stories start with a paragraph of whimsical non sequitur. They are the ideal corpus for this sort of thing, which is why people have been doing so for years. It's a couple of orders of magnitude harder doing t

    • Yes, most news stories follow the inverted pyramid, 5-Ws model (or however many Ws there are), but stories that don't follow the formula are far from rare.

      That said, you do have a point that news stories might be easier, but I'd hardly call the problem trivial.
  • Ants and topics (Score:3, Insightful)

    by Randym (25779) on Wednesday August 02, 2006 @09:53PM (#15836937)
    What this article shows is that probabilistic topic-based modeling in text analysis -- an NP-hard area -- works better than the old ways. This is not surprising: the probabilistic "ant" model developed by the Italians turned out to be a clever way to solve the Traveling Salesman problem. What these both show is the applicability of probabilistic modeling to NP-hard problems.

    I'd like to see someone apply this technique to the articles and comments making up the Slashdot corpus. CmdrTaco might be able to find a more focused set of topics. It might even be possible to tease out who on /. are the most interesting and/or informative posters, whether over the entire corpus or within any given topic.

  • How does this have an advantage over normal text indexing? If I search for something I just enter relevant keywords. Seriously, why does it matter if the computer knows what the article is about, if a human is the one who will be parsing it anyway?
  • ... right after I check out the latest topics at Google News.
  • Isn't that the whole principle behind Self-organizing maps [wikipedia.org] and other methods of unsupervised neural networks? I mean it has been solved for a couple decades now.
  • Activities like automated classification (or topic modelling) are feasible. For example when building a news database for an online information company, in early 90s, my company found rule bases, some information science, a thesaurus and customized software could accurately classify. So this isn't new. The trouble is that it's not a relational database so few programmers or managers have a background in the field. Hence people can issue glib press releases like the one quoted. If they had said "new data
