Text Mining the New York Times
Roland Piquepaille writes "Text mining is a computer technique for extracting useful information from unstructured text, and it's a difficult task. But now, using a relatively new method named topic modeling, computer scientists from the University of California, Irvine (UCI) have analyzed 330,000 stories published by the New York Times between 2000 and 2002 in just a few hours. They were able to automatically isolate topics such as the Tour de France, prices of apartments in Brooklyn, or dinosaur bones. This technique could soon be used not only by homeland security experts or librarians, but also by physicians, lawyers, real estate people, and even by yourself. Read more for additional details and a graph showing how the researchers discovered links between topics and people."
Homeland security (Score:4, Insightful)
Re:Homeland security (Score:2, Insightful)
The graph also shows links between US_Military and AL_QAEDA, and between ARIEL_SHARON and Mid_East_Conflict. If only they'd had this technology when they were trying to justify the invasion of Iraq.
"Look, Saddam Hussein has links to Al Qaeda! You can see it on the graph!"
"Uh, Mister Vice-Pr
Re:Homeland security (Score:2)
"... That's my job!"
Re:Homeland security (Score:2)
I doubt the "real" terrorists would speak in regular English, either. First, different languages have different grammatical rules and idioms. Second, they wouldn't talk openly about "BOMBING THE WHITEHOUSE"; they'd probably say it more discreetly in a semi-sophisticated code. This will just be another arms race--a [tele]communications one--and civilian casualties will be the main result.
Unless I'm wrong, of course, and terrorists write like NY Times writers.
Re:Homeland security (Score:1)
Re:Homeland security (Score:2)
Re:Homeland security & NY Times (Score:1)
Re:Homeland security (Score:2)
Good sir, I wish I had some mod points left for you
Seriously, every time you mention homeland security, every time you watch a special report on terrorism on your local current affairs program - that means the terrorists are winning.
...You don't support terrorism now do you?
Re:Homeland security (Score:3, Insightful)
The drunk replied, "I'm looking for my car keys."
The Officer looked around in the lamplight, then asked the drunk, "I don't see any car keys. Are you sure you lost them here?"
The drunk replied, "No, I lost them over there", and pointed to an area of the sidewalk deep in shadow.
The police officer asked, "Then why are you looking for them over here?" The drunk replied, "Because the light is better here."
Re:Homeland security (Score:1)
Unless you have a metric crapload of intercepted communications to sort through for information that might be useful. Especially since the NSA is listening to everything.
Remember that the darling of the Left, John Kerry, insisted that terrorism was a law enforcement problem, not a military problem. A large part of law enforcement is digging through all available information from the comfort of your...
Re:Homeland security (Score:2)
Did I suggest carpet bombing as an alternative? I think legwork is the only likely method. Real terrorists don't live their lives online; you might fill up Gitmo with idiots who spouted "Jihad" on some website. Osama gave up using his satellite phone years ago; they're well aware the NSA is snooping on every form of telephone or Internet communication. My...
Homeland Aftosa (Score:5, Interesting)
Re:Homeland security (Score:2)
Re:Homeland security (Score:2)
Re:Homeland security (Score:1, Informative)
Plus some other words (Score:5, Funny)
From this, researchers were easily able to identify that topic as the Tour de France.
I imagine "testosterone", "doping", and "supportive mother" would have found the Tour de France topic even faster.
Re:Plus some other words (Score:1)
Texas!
KFG
Funny (Score:1, Insightful)
Re:Funny (Score:2, Funny)
KFG
Re:Funny (Score:1)
Re:Funny (Score:2)
Re:Funny (Score:1)
KFG
Re:Funny (Score:1)
Re:Funny (Score:2)
Mining? (Score:5, Funny)
"We lost four more miners today, bless their souls. The foreman kept insisting they'd dig another tunnel between bicycling and Tour de France. They told him it was too dangerous, but no... he never listens. One of these days... They've got us working 20-hour shifts in the abyss that is the text mines, barely pay us enough to afford the rent, I'm telling you, one of these days..."
Re:Mining? (Score:2)
Sounds like an alternative to cross-referencing (Score:3, Interesting)
Re:Sounds like an alternative to cross-referencing (Score:2)
Already done. It's pretty cool to see how the psychology topic over time turns into the AI topic:
http://www.cs.cmu.edu/~lemur/science/ [cmu.edu]
I guess it's one way to avoid registering. (Score:1, Interesting)
Re:I guess it's one way to avoid registering. (Score:1)
Support Vector Machine? (Score:5, Interesting)
Re:Support Vector Machine? (Score:2)
I'm not entirely sure what the novel component of this is. I think it might be the time it takes to process the bodies of text (I should RTF papers to find out, I suppose). Latent Semantic Analysis is really computationally expensive.
Re:Support Vector Machine? (Score:4, Informative)
The problem with this new method (LDA, introduced by Blei, Ng, and Jordan in 2003) is (besides other issues) the so-called inference step, as it is analytically intractable. Blei et al. solved this by means of variational methods, i.e. simplifying the model, motivated by averaging-out phenomena. Another method (which, as far as I understand, was applied by Steyvers) is sampling, in this case Gibbs sampling. Usually the variational methods are superior to sampling approaches, as one needs quite a lot of samples for the whole thing to converge.
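For reference, the quantity the Gibbs sampler draws each token's topic from is the full conditional -- the collapsed form popularized by Griffiths and Steyvers. In my own notation, with all counts excluding the current token:

$$p(z_i = k \mid z_{-i}, \mathbf{w}) \;\propto\; (n_{d,k} + \alpha)\,\frac{n_{k,w_i} + \beta}{n_k + V\beta}$$

where $n_{d,k}$ is the number of tokens in document $d$ assigned to topic $k$, $n_{k,w_i}$ is the number of times word $w_i$ is assigned to topic $k$ corpus-wide, $n_k$ is topic $k$'s total count, and $V$ is the vocabulary size.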
Re:Support Vector Machine? (Score:2, Informative)
Also note that for most purposes, however, classification is becoming less of a big deal. Read Clay Shirky's article [shirky.com] to understand why. Shirky talks about ontologies specifically, but the gist is the same -- basically, tagging each and every word isn't as crazy an idea if the end goal is just "I want to find something related..."
Re:Support Vector Machine? (Score:3, Interesting)
Well, even in variational inference, you have the problem of convergence. You have a huge EM algorithm and you're trying to maximize the complete likelihood of the data you have. Gibbs sampling doesn't have the same nice properties, but usually works pretty well in practice. Gibbs sampling is nice because it's usually easier to do, requires less memory (in variational methods you basically have to create a new probability model where everything is decoupled), and it's far easier to debug.
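To make the trade-off concrete, here's a minimal collapsed Gibbs sampler for LDA in Python -- a toy sketch of the sampling approach described above, not the UCI code; the corpus format (lists of integer word ids) and hyperparameters are my own assumptions:

```python
import numpy as np

def lda_gibbs(docs, n_topics, n_vocab, n_iter=200, alpha=0.1, beta=0.01):
    """Collapsed Gibbs sampling for LDA. docs: list of lists of word ids."""
    n_dk = np.zeros((len(docs), n_topics))  # doc-topic counts
    n_kw = np.zeros((n_topics, n_vocab))    # topic-word counts
    n_k = np.zeros(n_topics)                # per-topic totals
    z = []                                  # topic assignment for every token
    for d, doc in enumerate(docs):          # random initialization
        zd = np.random.randint(n_topics, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                 # forget this token's assignment
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # full conditional p(z = k | all other assignments)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + n_vocab * beta)
                k = np.random.choice(n_topics, p=p / p.sum())
                z[d][i] = k                 # resample and restore counts
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return n_dk, n_kw
```

After burn-in, each row of n_kw (normalized) is a topic's word distribution, which is where the "Tour de France"-style word lists come from. Note how little state it keeps compared to a variational implementation: just three count arrays and the assignments.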
You mean clusty.com? (Score:3, Insightful)
I just read the stub of the article... because it seemed like it does exactly what clusty does and I don't care to read anymore.
in other news (Score:4, Funny)
Re:in other news (Score:1)
Re:in other news (Score:1)
Hello Newman..... (Score:1)
Has anyone realized this (Score:2)
Re:Has anyone realized this (Score:4, Interesting)
Wouldn't it be cool if we all spoke a language which was expressive but at the same time had a machine-parsable grammar and absolutely no silly exceptions or odd concepts like the masculine/feminine nouns that French and Italian have?
I'm no expert on this, but I think linguists will tell you that we tend to modify/evolve language to suit our culture and circumstances, so any designed language (and even existing natural ones) will be modified into many different dialects as it is used by various cultures around the world.
Still, yeah, I'm glad I'm a native speaker of English, since it would be a pain to learn as a second language! Imagine all the special cases you'd have to memorise! Spelling, grammar exceptions that may not fit the definition you learned but native speakers use anyway, etc.
Re:Has anyone realized this (Score:2, Interesting)
Actually, Esperanto was created by an ophthalmologist [wikipedia.org]. In general, linguists don't attempt to replace languages with "better" ones. They recognize that linguistic change is natural and unavoidable. And, like other sciences, linguistics is largely occupied with observing and recording phenomena. They do not, as a rule, take a prescriptive point of view.
Re:Has anyone realized this (Score:1)
Others (like yourself) realize this shouldn't be difficult for computers. You are correct. In truth, computers have little trouble keeping track of nouns, verbs, subjects, predicates... even most of the exceptions.
BUT, the insurmountable part is giving the computer any kind of useful understanding of...
Re:Has anyone realized this (Score:2)
Re:Has anyone realized this (Score:1)
Technically, I was wrong there. He actually contributed a great deal to the philosophy of language, which is not at all the same thing as linguistics (though there is overlap).
Context-sensitive adaptive parsing (Score:2)
english subset (Score:2)
Re:Has anyone realized this (Score:1)
It was originally designed to study the Sapir-Whorf hypothesis http://en.wikipedia.org/wiki/Sapir-Whorf_hypothesis [wikipedia.org], but has since developed a rich following among computer scientists as a potential human-computer interface tool. Err, at least that's why THIS computer scientist is interested in it.
Re:Has anyone realized this (Score:2)
Re:Has anyone realized this (Score:1)
A solution already exists (Score:1)
http://www.kli.org/ [kli.org]
Re:Has anyone realized this (Score:2)
Interesting (Score:5, Interesting)
At the moment the data is mined with wildcard text searching, which means you need to know the subject before you can participate. It's a very valuable resource, but it's also not used to its potential due to the clunky methods of interfacing with it.
It will be quite interesting to apply this technique to the dataset and see if unknown relationships become apparent or known relationships become clearer.
Looking at the paper and samples indicates this tool (if it does what it promises) might be able not only to work out the correlations between data points but to create visual diagrams linking people, places, and events quite well. A handy tool for my dataset.
I'm now sitting here crystal-ball gazing: if we were to expand this to a 3D map, say by displaying a resulting chart and allowing a researcher to hotlink to the data underneath, it would be an interesting way to navigate a complex topic, more so than a text-based wildcard or fuzzy search. Of course I won't know if this is possible until I look into the program more, and I won't be able to look into the program more until I massage the dataset again...
Click on the Anthony Ashcam box and see the hotlinking and unfolding of data specific to him. Drill in more... then more... and eventually get to a specific fact.
The only problem will be that I would need to pre-compute all the charts. Oh well, one day
Artificial intelligence implications? (Score:2, Informative)
An artificial intelligence [earthlink.net] could maybe use these new methods to grok all human knowledge contained in all textual material all over the World Wide Web.
Technological Singularity -- [blogcharm.com] -- here we come!
OMG SOMEONE INVENTED TEH SEARCH ENGINE! (Score:1)
Discourse Analysis? (Score:1)
Hahahaha (Score:1)
They're late to the game. (Score:4, Insightful)
Google's AdSense network has done this for years to serve contextually relevant text ads across thousands of websites. Yahoo now, too.
Re:They're late to the game. (Score:1)
grep? (Score:2, Funny)
Text mining is... (Score:5, Funny)
Hard to do? (Score:1)
Earlier modes of text mining (Score:5, Informative)
Following Schrodt's work, Doug Bond and his brother, both recently of Harvard, produced the IDEAS database [vranet.com] using machine-based coding.
These types of data can be categorized by keywords or topic, though the engines don't try to generate links. The resulting data can also be used for statistical analysis in a certain slashdotter's dissertation research...
The new method discovered (Score:2)
"site:newyorktimes.com "Tour de France" "
Teaching Granny to Suck Eggs... (Score:1)
Text Mining freeware already does this (Score:5, Interesting)
On the brink? Q-Phrase [q-phrase.com] has desktop software that does this exact type of topic modeling on huge datasets - and it runs on any Windows or OS X box. [Disclaimer: I work there] And there are a number of companies (e.g. Vivisimo/Clusty) that use these techniques as well.
Going beyond the pure mechanics (this article speaks of research that is only groundbreaking in its speed of mining huge data sets), there are more interesting uses for topic modeling, such as its application to already loosely correlated data sets. A prime example: mining the text from the result pages that are returned from a typical Google search. One of our products, CQ web [q-phrase.com], does exactly this (and bonus: it's freeware [q-phrase.com]):
Using the example from the story: in CQ web, text mining the top 100 results from a Google search of "tour de france" takes about 20 seconds (via broadband) and produces topics such as:
floyd landis
lance armstrong
yellow jersey
time trial
And going beyond simple topic analysis: using CQ web's "Dig In" feature (which provides relevant citations from the raw data) on floyd landis returns "Floyd landis has tested positive for high levels of testosterone during the tour de france." as the most relevant sentence from over 100 pages of unstructured text.
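CQ web's actual phrase extraction isn't public here, but a crude approximation of this kind of result-page mining is easy to sketch in Python -- the URL list is a hypothetical stand-in for the top Google hits, and plain bigram counts stand in for real phrase detection:

```python
# Crude result-page mining: fetch each hit, strip HTML tags, count bigrams.
import re
from collections import Counter
from urllib.request import urlopen

urls = ["http://example.com/"]  # stand-in for the top-100 result URLs

counts = Counter()
for url in urls:
    html = urlopen(url).read().decode("utf-8", errors="ignore")
    text = re.sub(r"<[^>]+>", " ", html).lower()  # naive tag stripping
    words = re.findall(r"[a-z]+", text)
    counts.update(zip(words, words[1:]))          # adjacent word pairs

for (w1, w2), n in counts.most_common(10):
    print(f"{w1} {w2}: {n}")
```

On a real result set, pairs like "floyd landis" and "yellow jersey" would presumably float to the top simply because they co-occur so often.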
So, while this is a somewhat interesting article, fact is, anyone can download software today that accomplishes much of this "groundbreaking" research and beyond.
How much did that cost? (Score:2)
That's what Google News does (Score:2)
Do Try This At Home! (Score:2, Interesting)
Chomsky Anyone? (Score:1)
weird (Score:2)
Why is this news? (Score:4, Informative)
brief explanation of the method (Score:4, Informative)
i.e., documents aren't assigned to a single topic (as in latent semantic analysis (LSA))
this means, incidentally, that they're not automatically labeled, although a list of the top 5 words for a topic generally characterizes it pretty well (see the toy sketch below).
side benefit: you can also discover misattributions (e.g., authors with the same name)
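To make the top-5-words labeling concrete, here's a toy sketch using scikit-learn's off-the-shelf LDA implementation (not the UCI code; the four-document mini-corpus and the parameters are invented for illustration):

```python
# Label each discovered topic by its five highest-probability words.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "armstrong wins tour de france mountain stage",
    "tour de france riders battle through the alps",
    "brooklyn apartment prices rise in hot real estate market",
    "manhattan and brooklyn real estate rents keep climbing",
]
vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

words = vec.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top5 = [words[i] for i in topic.argsort()[::-1][:5]]
    print(f"topic {k}: {', '.join(top5)}")
```

On this toy corpus the two topics should separate into a cycling-flavored word list and a real-estate-flavored one; nobody labels them by hand, but the word lists make the labels obvious -- the same effect the UCI group reports at 330,000-article scale.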
Re:brief explanation of the method (Score:1)
Re:brief explanation of the method (Score:2)
Re:brief explanation of the method (Score:1)
Re:taxed minding what their dough is being spent o (Score:2)