News for nerds, stuff that matters

Text Mining the New York Times

Posted by Zonk on Sat Jul 29, 2006 05:29 AM
from the good-place-to-mine dept.
Roland Piquepaille writes "Text mining is a computer technique for extracting useful information from unstructured text. And it's a difficult task. But now, using a relatively new method named topic modeling, computer scientists from the University of California, Irvine (UCI), have analyzed 330,000 stories published by the New York Times between 2000 and 2002 in just a few hours. They were able to automatically isolate topics such as the Tour de France, prices of apartments in Brooklyn or dinosaur bones. This technique could soon be used not only by homeland security experts or librarians, but also by physicians, lawyers, real estate people, and even by yourself. Read more for additional details and a graph showing how the researchers discovered links between topics and people."
  • Homeland security (Score:4, Insightful)

    by Anonymous Coward on Saturday July 29 2006, @05:42AM (#15805041)
    For every time homeland security is mentioned as benefiting from a new technology, you should get a swift kick to the nuts. Goddam, there is more than just terrorism in this world.
    • Re:Homeland security by mrogers (Score:2) Saturday July 29 2006, @07:21AM
    • Re:Homeland security by Gli7ch (Score:2) Saturday July 29 2006, @07:31AM
    • Re:Homeland security by 1u3hr (Score:3) Saturday July 29 2006, @07:54AM
      • Re:Homeland security by damian cosmas (Score:1) Saturday July 29 2006, @08:37AM
      • Homeland Aftosa (Score:5, Interesting)

        by Lord Balto (973273) on Saturday July 29 2006, @08:43AM (#15805465)
        (http://lordbalto.com/)
        As William Burroughs suggested, the goal of the Aftosa Commission is not to rid the world of bovine aftosa. Its goal is to justify its existence and continue to enlarge its budget and its manpower until the world understands that bovine aftosa is such a critical issue that there needs to be a cabinet-level Office of Bovine Aftosa with a budget only surpassed by that of the military. No one in government ever does anything that could conceivably put them out of business. This is why relying on the military and the "defense" contractors to bring peace is such a dangerous activity.
    • Re:Homeland security by LS (Score:2) Saturday July 29 2006, @08:38AM
    • Re:Homeland security by Fordiman (Score:2) Sunday July 30 2006, @01:16AM
    • Re:Homeland security by Anonymous Coward (Score:1) Sunday July 30 2006, @12:35PM
  • Plus some other words (Score:5, Funny)

    by stimpleton (732392) on Saturday July 29 2006, @05:45AM (#15805050)
    For example, the model generated a list of words that included "rider," "bike," "race," "Lance Armstrong" and "Jan Ullrich."

    From this, researchers were easily able to identify that topic as the Tour de France.


    I imagine "testosterone", "doping", and "supportive mother" would have found the Tour de France topic even faster.
  • Funny (Score:1, Insightful)

    by vllbs (991844) on Saturday July 29 2006, @05:48AM (#15805055)
    A relatively new method? A difficult task? Sorry, but these are almost laughable, even for a poor Spaniard like me.
    • Re:Funny by kfg (Score:2) Saturday July 29 2006, @06:01AM
      • Re:Funny by vllbs (Score:1) Saturday July 29 2006, @06:49AM
        • Re:Funny by tsa (Score:2) Saturday July 29 2006, @08:01AM
        • Re:Funny by kfg (Score:1) Saturday July 29 2006, @01:06PM
      • Re:Funny by vllbs (Score:1) Saturday July 29 2006, @09:27AM
      • Re:Funny by andrewman327 (Score:2) Tuesday August 08 2006, @12:55PM
  • Mining? (Score:5, Funny)

    by Eudial (590661) on Saturday July 29 2006, @05:53AM (#15805065)
    "Home at last after another long day in the salt^H^H^H^Htext mines.

    We lost four more miners today, bless their souls. The foreman kept insisting they'd dig another tunnel between bicycling and Tour de France. They told him it was too dangerous, but no... he never listens. One of these days... They've got us working 20 hour shifts in the abyss that is the text mines, barely pay us enough to afford the rent, I'm telling you, one of these days..."
    • Re:Mining? by The_Wilschon (Score:2) Saturday July 29 2006, @05:06PM
  • by liuyunn (988682) on Saturday July 29 2006, @05:58AM (#15805074)
    If this can be implemented into research in academia, is searching through decades of articles and abstracts finally going to be more efficient? Provided that they are electronic of course. Poor citations, inaccurate keyword tags, obscure sources...ahh reminds me of grad school.
  • by Anonymous Coward on Saturday July 29 2006, @06:03AM (#15805079)
    But does it also ditch the ads?
  • Support Vector Machine? (Score:5, Interesting)

    by Uruviel (772554) on Saturday July 29 2006, @06:12AM (#15805098)
    (http://www.uruviel.com/)
    I thought this was fairly easy to do with a Support Vector Machine (http://en.wikipedia.org/wiki/Support_Vector_Machine), or even with simple decision trees that set a threshold for certain words (http://en.wikipedia.org/wiki/Decision_tree).
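    A minimal sketch of the "threshold for certain words" idea, assuming a hand-picked keyword list and a cutoff of two hits (both hypothetical, not taken from the article). It is a one-node decision rule rather than a learned tree, and it only works once someone has already chosen the topic and its keywords.

```python
import re

# Hypothetical hand-picked keywords for one topic; not from the article.
TOUR_KEYWORDS = {"rider", "bike", "race", "armstrong", "ullrich"}

def keyword_hits(text, keywords):
    """Count how many tokens in `text` appear in `keywords` (case-insensitive)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(1 for tok in tokens if tok in keywords)

def is_on_topic(text, keywords=TOUR_KEYWORDS, threshold=2):
    """One-node decision rule: on topic if there are at least `threshold` hits."""
    return keyword_hits(text, keywords) >= threshold

print(is_on_topic("Armstrong won the race on his bike."))        # three keyword hits
print(is_on_topic("Apartment prices in Brooklyn keep rising."))  # no keyword hits
```

    A rule like this needs a separate keyword list per topic, which is the manual step an unsupervised method would try to avoid.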
    • Re:Support Vector Machine? by NoTheory (Score:2) Saturday July 29 2006, @06:41AM
    • Re:Support Vector Machine? (Score:4, Informative)

      by Anonymous Coward on Saturday July 29 2006, @07:22AM (#15805231)
      Text modeling is mostly viewed as an unsupervised machine learning problem (nobody will go through thousands of articles and tag each and every word, i.e. assign a topic to it). Support vector machines, however, are very good classifiers for supervised data, e.g. digit recognition (you train your SVM on a sample of pictures of 9s tagged as 9, and the SVM should then return the correct class for a new digit).

      The problem with this new method (called LDA, introduced by Blei, Ng and Jordan in 2003) is (besides other issues) the so-called inference step, as it is analytically intractable. Blei et al. solved this by means of variational methods, i.e. simplifying the model, motivated by averaging-out phenomena. Another method (which as far as I understand was applied by Steyvers) is sampling, in this case Gibbs sampling. Usually the variational methods are superior to sampling approaches, as one needs quite a lot of samples for the whole thing to converge.
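      To make the sampling route concrete, here is a rough sketch of collapsed Gibbs sampling for LDA in plain Python. It follows the textbook update rule and is not necessarily what Steyvers and colleagues ran; the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, and the toy corpus are all assumptions made for illustration.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns (doc_topic_counts, topic_word_counts).
    """
    rng = random.Random(seed)
    n_dk = [[0] * num_topics for _ in docs]                # topic counts per document
    n_kw = [[0] * vocab_size for _ in range(num_topics)]   # word counts per topic
    n_k = [0] * num_topics                                 # total tokens per topic
    z = []                                                 # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Take the current token out of the counts.
                n_dk[d][k] -= 1
                n_kw[k][w] -= 1
                n_k[k] -= 1
                # Unnormalised p(topic t | all other assignments).
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                # Resample the token's topic and put it back into the counts.
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kw[k][w] += 1
                n_k[k] += 1
    return n_dk, n_kw

# Made-up corpus: word ids 0-2 stand for "cycling" words, 3-5 for "housing" words.
docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 4, 3, 5]]
doc_topics, topic_words = lda_gibbs(docs, num_topics=2, vocab_size=6)
print(doc_topics)
```

      Every token is resampled on every sweep, which is where the "quite a lot of samples" cost comes from on a corpus of 330,000 articles.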
  • You mean clusty.com? (Score:3, Insightful)

    by SirStanley (95545) on Saturday July 29 2006, @06:21AM (#15805111)
    (http://slashdot.org/)
    You mean they can group data by topic? Like clusty.com does when you search?

    I just read the stub of the article... because it seemed like it does exactly what clusty does and I don't care to read anymore.
  • in other news (Score:4, Funny)

    by tompee (967105) on Saturday July 29 2006, @06:21AM (#15805113)
    Google buys the University of California computer science school
  • Hello Newman..... (Score:1)

    by ActiveMatx (990128) on Saturday July 29 2006, @06:24AM (#15805121)
    "This research work has been presented by Newman and his colleagues during the IEEE Intelligence and Security Informatics Conference" .... Hello Newman....
  • by ThePengwin (934031) on Saturday July 29 2006, @06:34AM (#15805140)
    (http://www.pengwin.net/)
    Has anyone realised that English is one of the most screwed up, stupid languages ever created? It's just been stretched and modified in any way possible and some aspects of it are practically useless. Maybe the world would be better off inventing a better language than analysing a horrible one :P
    • Re:Has anyone realized this (Score:4, Interesting)

      by rgravina (520410) on Saturday July 29 2006, @06:48AM (#15805166)
      Yeah I agree :). Linguists have tried to develop new international languages to replace English (e.g. Esperanto) that have less cruft and fewer exceptions, but unfortunately very few people bother with them in practice, and keep using English :).

      Wouldn't it be cool if we all spoke a language which was expressive but at the same time had a machine-parsable grammar and had absolutely no silly exceptions or odd concepts like the masculine/feminine nouns that French and Italian have?

      I'm no expert on this, but I think linguists will tell you that we tend to modify/evolve language to suit our culture and circumstances, so any designed language (and even existing natural ones) will be modified into many different dialects as it is used by various cultures around the world.

      Still yeah, I am glad I'm a native speaker of English since it would be a pain to learn as a second language! Imagine all the special cases you'd have to memorise! Spelling, grammar exceptions that may not fit the definition you learned but native speakers use anyway etc.
    • Re:Has anyone realized this by gclef (Score:2) Saturday July 29 2006, @07:05AM
    • A solution already exists by dino213b (Score:1) Sunday July 30 2006, @02:42AM
    • Re:Has anyone realized this by im_thatoneguy (Score:2) Sunday July 30 2006, @02:50PM
  • Interesting (Score:5, Interesting)

    by glowworm (880177) on Saturday July 29 2006, @06:35AM (#15805143)
    (Last Journal: Thursday May 04 2006, @10:41PM)
    I have available to me quite a large database of historical research spanning back to 1991, being freeform copies of emails between researchers and academics on a wide variety of subjects related to a specific topic from the 15th century. Dry stuff, but a very exciting topic.

    At the moment the data is mined with wildcard text searching, which means you need to know the subject before you can participate. It's a very valuable resource, but it's also not used to its potential due to the clunky methods of interfacing with it.

    It will be quite interesting applying this technique to the dataset to see if unknown relationships become apparent or known relationships become clearer.

    Looking at the paper and samples would indicate this tool (if it does what it promises) might be able not only to work out the correlations between data but also to create visual diagrams linking people, places and events quite well. A handy tool for my dataset.

    I'm now sitting here crystal ball gazing: if we were to expand this to a 3D map, say by displaying a resulting chart and allowing a researcher to hotlink to the data underneath, it would be an interesting way to navigate a complex topic, more so than a text-based wildcard or fuzzy search. Of course I won't know if this is possible until I look into the program more, and I won't be able to look into the program more until I massage the dataset again ;) but it does open up some interesting possibilities.

    Click on the Anthony Ashcam box and see the hotlinking and unfolding of data specific to him. Drill in more... then more... and eventually get to a specific fact.

    The only problem will be that I would need to pre-compute all the charts. Oh well, one day ;)
  • Artificial intelligence implications? (Score:2, Informative)

    by Anonymous Coward on Saturday July 29 2006, @06:41AM (#15805152)

    An artificial intelligence [earthlink.net] could maybe use these new methods to grok all human knowledge contained in all textual material all over the World Wide Web.

    Technological Singularity -- [blogcharm.com] -- here we come!

  • by Anonymous Crowhead (577505) on Saturday July 29 2006, @06:54AM (#15805181)
    We're doomed! DOOOOOOOOOOMED!
  • by AJ_Levy (700911) on Saturday July 29 2006, @06:58AM (#15805189)
    (http://amishthrasher.blogspot.com/)
    So how is this not simply automated discourse analysis? [wikipedia.org]
  • Hahahaha (Score:1)

    by cj5 (795058) on Saturday July 29 2006, @07:00AM (#15805196)
    I have to agree with the first response (swift kick in the nuts to whoever came up with that). It's called Google or Regex, whatever you want to use to strip unwanted content from a search.
  • They're late to the game. (Score:4, Insightful)

    by alcohollins (64804) on Saturday July 29 2006, @07:09AM (#15805212)
    Not revolutionary. In fact, they're late.

    Google's AdSense network has done this for years to serve contextually relevant text ads across thousands of websites. Yahoo does now, too.

  • grep? (Score:2, Funny)

    by muftak (636261) on Saturday July 29 2006, @07:11AM (#15805218)
    (http://www.muftak.net/)
    Wow, they figured out how to use grep!
  • Text mining is... (Score:5, Funny)

    by SlashSquatch (928150) on Saturday July 29 2006, @07:47AM (#15805289)
    (http://cryptostenchies.com/)
    ...a load of grep.
  • Hard to do? (Score:1)

    by accurrent (985697) on Saturday July 29 2006, @08:05AM (#15805335)
    How is this hard to do? It seems like this could be done with relatively simple algorithms.
  • Earlier modes of text mining (Score:5, Informative)

    by soapbox (695743) * on Saturday July 29 2006, @08:18AM (#15805373)
    Phil Schrodt at the U of Kansas has been doing something similar for years using The Kansas Event Data System [ku.edu] (and its new update, TABARI [ku.edu]). He started using Reuters news summaries to feed the KEDS engine back in the 1990s.

    Following Schrodt's work, Doug Bond and his brother, both recently of Harvard, produced the IDEAS database [vranet.com] using machine-based coding.

    These types of data can be categorized by keywords or topic, though the engines don't try to generate links. The resulting data can also be used for statistical analysis in a certain slashdotter's dissertation research...
  • by romka1 (891990) on Saturday July 29 2006, @08:42AM (#15805460)
    (http://www.madtorrent.com/)
    The new method that they figured out was
    "site:newyorktimes.com "Tour de France" "
  • by Aeomer (990057) on Saturday July 29 2006, @08:42AM (#15805461)
    We were doing this in 1989 with long free-form respondent answers to marketing questions to gain information about their actual preferences. Full natural language processing. We didn't patent the technique because we thought it was obvious - and we were too dumb to know how difficult a thing we achieved. It worked wonderfully. Ours worked in Japanese, German, and Thai, too - I bet theirs only works in English, and American English at that. Of course it took us several months to teach it the decoding matrix for each language. I always think of this as the coolest computer related thing I ever did.
  • by saddino (183491) on Saturday July 29 2006, @08:59AM (#15805522)
    The demonstration is significant because it is one of the earliest showing that an extremely efficient, yet very complicated, technology called text mining is on the brink of becoming a tool useful to more than highly trained computer programmers and homeland security experts.

    On the brink? Q-Phrase [q-phrase.com] has desktop software that does this exact type of topic modeling on huge datasets - and it runs on any Windows or OS X box. [Disclaimer: I work there] And there are a number of companies (e.g. Vivisimo/Clusty) that use these techniques as well.

    Going beyond the pure mechanics (this article speaks of research that is only groundbreaking in its speed of mining huge data sets), there are more interesting uses for topic modeling such as its application to already loosely correlated data sets. A prime example: mining the text from the result pages that are returned from a typical Google search. One of our products, CQ web [q-phrase.com] does exactly this (and bonus: it's freeware [q-phrase.com]):

    Using the example from the story: in CQ web, text mining the top 100 results from a Google search of "tour de france" takes about 20 seconds (via broadband) and produces topics such as:
    floyd landis
    lance armstrong
    yellow jersey
    time trial


    And going beyond simple topic analysis: using CQ web's "Dig In" feature (which provides relevant citations from the raw data) on floyd landis returns "Floyd landis has tested positive for high leves of testosterone during the tour de france." as the most relevant sentence from over 100 pages of unstructured text.

    So, while this is a somewhat interesting article, fact is, anyone can download software today that accomplishes much of this "groundbreaking" research and beyond.

  • by drsquare (530038) on Saturday July 29 2006, @09:09AM (#15805560)
    330,000 articles at $3 each comes to $990,000, almost a million dollars for their data mining experiment. No wonder tuition costs are so high when this is what they're spending their money on!
  • by Animats (122034) on Saturday July 29 2006, @10:29AM (#15805961)
    (http://www.animats.com)
    Google News does a rather good job of associating all the stories on the same topic. I'd thought this was a solved problem.
  • Do Try This At Home! (Score:2, Interesting)

    by ejoe (198565) on Saturday July 29 2006, @12:57PM (#15806613)
    It doesn't come bundled with an analysis engine, but if you're looking to build your own corpus of material (e.g., by automating searches or harvesting large volumes of your research web pages) and you're on MacOSX, check out Anthracite web mining desktop toolkit [metafy.com]... It makes it easy to build spidering and scraping systems, structure the output and feed it into a database like MySQL...all without requiring you to write a single line of code. Take that output and feed it into any number of the analysis and search systems on SourceForge or Freshmeat and you're going to get comparable results without all the fuss, although you should definitely write a press release about it! The Google API and regex support are built-in, and you can even run the data through any UNIX command (e.g., grep or Perl) without leaving the program if you need even more. As for speed, the new release is going to feature a throttle because a few customers are getting overwhelmed by the URL loading throughput. Yes, by way of full disclosure, I wrote the software and that's why I'm always busy promoting it.
  • Chomsky Anyone? (Score:1)

    by TheStonepedo (885845) on Saturday July 29 2006, @02:08PM (#15806928)
    (http://thestonepedo.dyndns.org/ | Last Journal: Friday March 17 2006, @03:32AM)
    Edward Herman and Noam Chomsky may or may not have had a fancy computerized search system, but association of loaded keywords was a major topic in Manufacturing Consent (ISBN 0375714499), where the influence of commercial interests on the media and government was analyzed using the New York Times. The great improvement in the rate at which text can be analyzed should make for an excellent third edition.
  • weird (Score:2)

    by m874t232 (973431) on Saturday July 29 2006, @03:03PM (#15807136)
    Out of the thousands of papers published on this subject every year, Roland Piquepaille picks this one.
  • Why is this news? (Score:4, Informative)

    by Lam1969 (911085) on Saturday July 29 2006, @03:23PM (#15807217)
    This is interesting, but the idea has been around for more than 50 years, and practiced using automated computers (as opposed to human coders) since the 1960s. Lerner and de Sola Pool came up with the idea of using "themes" to analyze political texts at Stanford in 1954, and hundreds or even thousands of studies using automated text analysis tools have been performed since then. You can download a free text analysis tool called Yoshikoder [yoshikoder.org], which will perform frequency counts of all words in a text, as well as dictionary analysis, and several other functions. So why is this news now? I think the press release is really leaving out some key information. I think the more relevant question that should have been addressed in the original release is how the text was prepared for analysis, because most websites and online databases of news articles (LexisNexis, Factiva, etc.) don't allow batch downloads of huge amounts of news text in XML or some other format that can be easily parsed by text analysis programs.
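    For readers who have not seen this older style of analysis, a small sketch of word frequency counts plus dictionary analysis follows. This is not Yoshikoder's interface; the function names, the two categories, and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Frequency count of every word in `text`, case-insensitive."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def dictionary_analysis(freqs, dictionary):
    """Total the frequencies of the words listed under each category."""
    return {cat: sum(freqs[w] for w in words) for cat, words in dictionary.items()}

# Hypothetical content-analysis dictionary; categories and words are made up.
DICTIONARY = {
    "conflict": ["war", "attack", "threat"],
    "economy": ["price", "market", "budget"],
}

text = "The budget debate turned on the war: the price of war, and the threat to the market."
freqs = word_frequencies(text)
print(freqs.most_common(3))
print(dictionary_analysis(freqs, DICTIONARY))
```

    The categories here are fixed in advance by whoever writes the dictionary, so the counts can only answer questions the analyst already thought to ask.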
  • brief explanation of the method (Score:4, Informative)

    by jrtom (821901) on Sunday July 30 2006, @12:30AM (#15809478)
    (http://www.ics.uci.edu/~jmadden)
    I'm a PhD student in the research group that worked on this. My research is somewhat different (machine learning and data mining on social network data sets) but I've gone to a lot of meetings and presentations on this work, and I've used the model they're describing in my own research. Certainly people have worked on document classification before, but posters that are suggesting that this isn't new don't understand what this method accomplishes. For example:
    • basically, the model assigns a probability distribution over topics to each document
      i.e., documents aren't assigned to a single topic (as in latent semantic analysis (LSA))
    • topics are learned from the documents automatically, not pre-defined
      this means, incidentally, that they're not automatically labeled, although a list of the top 5 words for a topic generally characterizes it pretty well.
    • the technique can learn which authors are likely to have written various pieces of a given document, or which cited documents are likely to have contributed most to this document
      side benefit: you can also discover misattributions (e.g., authors with the same name)
    For a good high level description of what these models are doing, see Mark Steyvers' research page [uci.edu] (MS is one of the authors); that page also has links to a number of the preceding papers. Those interested in seeing what the output of a related model looks like might like to check out the Author-Topic Browser [uci.edu].
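    The first two points above can be illustrated with a few lines of Python. The counts are made up to look like what a fitted topic model produces, and the helper names and the smoothing value `alpha` are assumptions, not the group's code.

```python
def topic_distribution(topic_counts, alpha=0.1):
    """Smoothed per-document topic probabilities from topic assignment counts."""
    total = sum(topic_counts) + alpha * len(topic_counts)
    return [(c + alpha) / total for c in topic_counts]

def top_words(word_counts, vocab, n=5):
    """The n most frequent words in one topic, used as an informal label."""
    ranked = sorted(range(len(vocab)), key=lambda w: word_counts[w], reverse=True)
    return [vocab[w] for w in ranked[:n]]

# Made-up counts of the kind a fitted topic model produces.
vocab = ["rider", "bike", "race", "rent", "apartment", "brooklyn"]
topic_word_counts = [
    [40, 35, 30, 0, 1, 0],   # topic 0
    [0, 2, 1, 25, 30, 20],   # topic 1
]
doc_topic_counts = [9, 1]    # one document: 9 tokens in topic 0, 1 token in topic 1

theta = topic_distribution(doc_topic_counts)
print(theta)
print(top_words(topic_word_counts[0], vocab, n=3))
```

    The document gets a probability for every topic rather than a single label, and the topic itself is only "named" by reading off its most frequent words.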
  • I wonder how the classifier program would cope with text like that in the parent post... probably sprain its parser, or something.