Paraphrasing Sentences With Software 203
prostoalex writes "Cornell University researchers are making progress in paraphrasing and "understanding" complete sentences in a software application. Analyzing sentences on the semantic level allows the software application to treat two sentences, expressing similar thoughts and ideas, but written in a different manner, as a single semantic unit. Significant achievements in this area could revolutionize the information searching field."
This translation just got out (Score:2, Funny)
Re:This translation just got out (Score:1, Redundant)
Imagine John Ashcroft, Admiral Poindexter, and the National Security Agency using a Beowulf cluster of these to scan everybody's email.
I wonder if there's a Bayesian filter that picks out atheists, free-thinkers, commies, anti-war activists, and Democrats.
Pass that list through a geo-locator, and the thought police can be at your door by midnight. (According to Solzhenitsyn, they always knock on your door at midnight.)
Re:This translation just got out (Score:1)
Hmm... Why not just incriminate anyone sending emails with a subject different from "FW: fw: fw: FW: READ THIS!!! FW: fw: Something cute"
The problem is... (Score:4, Insightful)
Re:The problem is... (Score:1, Interesting)
I doubt that any system designed to deal with idioms would be programmed with every idiom. More likely, they would take a huge corpus of text and do tons of statistical manipulation on it, such that idioms would come out roughly equivalent to non-idiomatic phrases expressing the same concept.
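A rough sketch of that statistical idea (my own toy illustration, not the Cornell method): if an idiom and a literal phrase keep showing up in the same contexts across a corpus, their context vectors end up close together, so they can be treated as equivalent. The corpus and phrases below are invented.

from collections import Counter
import math

corpus = [
    "the old man kicked the bucket last night",
    "the old man died last night",
    "sadly the dog kicked the bucket yesterday",
    "sadly the dog died yesterday",
]

def context_vector(phrase, sentences):
    # count the words that co-occur with the phrase, as a crude "context"
    vec = Counter()
    phrase_words = set(phrase.split())
    for s in sentences:
        if phrase in s:
            for w in s.split():
                if w not in phrase_words:
                    vec[w] += 1
    return vec

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

idiom = context_vector("kicked the bucket", corpus)
literal = context_vector("died", corpus)
print(cosine(idiom, literal))  # high similarity -> treat the two as paraphrases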
Re:The problem is... (Score:1)
Re:The problem is... (Score:1)
Re:The problem is... (Score:5, Informative)
Rubbish - Ever heard of Machine Learning?
There has been much work on resolving coreference and named-entity recognition problems has been ongoing for several years, with the aim being to lead on to full NLP. This research seems interesting in that it takes work from another field (genetic sequence matching) and applies it to an NLP problem. What links them all is that in almost every case the research involves machine learning at some point... it makes no sense to hand-code millions of case-specific rules when a machine can learn them faster and better...
Read their paper [cornell.edu] and you'll see that indeed it's an unsupervised learning approach - even nicer in that it doesn't require you to label training examples for the algorithm...
~D
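For anyone curious what "unsupervised" means in practice, here's a toy sketch (assuming scikit-learn is installed, and emphatically not the algorithm from the paper): sentences get grouped by similarity with no labeled examples at all.

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "Stocks rose after the Fed cut interest rates.",
    "Winners outpaced losers after Greenspan cut rates again.",
    "A suicide bomber blew himself up in the city centre.",
    "A bomber detonated explosives in the centre of the city.",
]

vectors = TfidfVectorizer().fit_transform(sentences)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
print(labels)  # sentences about the same event land in the same cluster, with no labels given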
Re:The problem is... (Score:1)
And if only I spent as much time on my English usage as on my research....
Obviously, I meant:
There has been much work on resolving coreference and named-entity recognition problems in recent years,
~D
Yes. (Score:5, Funny)
Re:Yes. (Score:2)
Then there are the English women puzzled by the expressions they get from Americans when they say "Knock me up the next time you're in town".
Re:Yes. (Score:2)
First use of this technology (Score:5, Funny)
Think about the possibilities...
Of course, the biggest problem with that is that there wouldn't be nearly as many cool articles to read!
Re:First use of this technology (Score:2)
Re:First use of this technology (Score:2)
example of dupes:
id Says 60fps Is Enough For Doom III [slashdot.org]
DOOM III to be capped at 60 fps [slashdot.org]
an example of what I mean is:
a story [slashdot.org] regarding Doom III at QuakeCon is posted; later, a story [slashdot.org] about a specific feature in Doom III is discussed.
The second example may not be the best one, but it gives an idea of what I actually mean.
Re:First use of this technology (Score:2)
Re:First use of this technology (Score:2, Insightful)
Re:Second use of this technology (Score:2)
Schoolkids (Score:3, Interesting)
Google translator already let my sister-in-law "cheat" on a German paper, but the translation was "too good," so she got caught. Paraphrasing that's excellent (it would obviously take a while, but what the hell, we can play Apple II games on a Palm not 20 years later....) could get really messy.
Re:First use of this technology (Score:2)
This reminds me of the Infocom classics (Score:5, Interesting)
There is a mailbox here.
Re:This reminds me of the Infocom classics (Score:4, Insightful)
Yes. I can't be the only one who is disappointed that text adventure development essentially died. The great limiting factors always used to be memory (with no disc drives, the whole game had to be stored in a very limited amount of memory) and processing speed. Now that we have both of these in abundance, it should be possible to write a real "interactive novel", but I guess that will never happen. Shame, as it's a great format for cell phones and PDAs.
Re:This reminds me of the Infocom classics (Score:5, Informative)
An interactive novel, at least the kind you're probably thinking about with deeply implemented characters and so forth, is probably AI-complete. It's not about the disk space and processor speed, it's about the inherent trickiness.
Re:This reminds me of the Infocom classics (Score:2)
Yes, I know about the stuff you are talking about.
it's generally acknowledged that the quality of modern works has surpassed that of Infocom.
That's the problem... The modern games have only just surpassed games that were created for machines from 12 years ago.
It's not about the disk space and processor speed, it's about the inherent trickiness.
Not today, but it was an extremely limiting factor when you were trying to get a whole game into 32 KB of memory.
Yes, it is a
Re:This reminds me of the Infocom classics (Score:3, Informative)
For example, right now most of the languages accept sentences of the form [VERB] [DIRECT OBJECT] [PREPOSITION] [INDIRECT OBJECT]. Occasionally someone suggests, "Why not add adverbs?" The general consensus is that doing so suddenly requires th
Douglas Adams (Score:3, Funny)
I would have LOVED to see him tackle a 'text message adventure' along the lines of the old Infocom classics. He has written a number of pieces (some of which are collected in The Salmon of Doubt) about how much he enjoyed this marriage of writing and computing. The flexibility and restrictions of the medium would have led to something pretty neat, I'm guessing. Of course - then he'd h
Re:Douglas Adams (Score:2)
He did - a game called Starship Titanic was written by Adams, in conjunction with a game developer (Simon & Schuster? can't remember...)
It combined a text adventure interface with some nice 3D graphics that would move around above the text box, in a Mystian sort of way. The game itself was very funny, had some beautiful designs and ideas, and was almost totally impossible. In other words it w
Re:Douglas Adams (Score:2)
Re:This reminds me of the Infocom classics (Score:3, Insightful)
> open mailbox
Re:This reminds me of the Infocom classics (Score:2, Informative)
Infocom's parser was much better. "Put the big bunch of keys in the blue box under the table." can be parsed by it, for example.
As the OP said, this isn't near the level of what's mentioned in the article, but it's certainly better than you imply.
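For illustration, a toy parser in the spirit of the [VERB] [DIRECT OBJECT] [PREPOSITION] [INDIRECT OBJECT] form mentioned elsewhere in this thread. This is my own sketch, not Infocom's actual code, and the preposition list is made up.

PREPOSITIONS = {"in", "on", "under", "with", "to", "at"}

def parse(command):
    words = command.lower().rstrip(".").split()
    verb, rest = words[0], words[1:]
    # split the remainder at the first clause-level preposition
    for i, w in enumerate(rest):
        if w in PREPOSITIONS:
            return {"verb": verb,
                    "direct_object": " ".join(rest[:i]),
                    "preposition": w,
                    "indirect_object": " ".join(rest[i + 1:])}
    return {"verb": verb, "direct_object": " ".join(rest)}

print(parse("open mailbox"))
print(parse("put the big bunch of keys in the blue box under the table"))
# The second command splits sensibly here only because "of" isn't treated as a
# clause-level preposition and "in" appears before "under"; deciding whether
# "under the table" modifies the box or the whole action is the genuinely hard
# part that Infocom's parser handled.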
Re:This reminds me of the Infocom classics (Score:2)
Does anyone have any insight into the algorithm they used?
Re:This reminds me of the Infocom classics (Score:2)
Not a coincidence.. ? (Score:3, Funny)
Auto Greeter Machine: I welcome you to our country, and greet you with open arms. Please enjoy your stay - we have a fine range of tourist facilities, restaurants, bars and so forth. And on a personal note, may I say that you are likely to be eaten by a grue.
comments? (Score:1, Interesting)
Re:comments? (Score:2, Funny)
but the day the mods will be replaced by parsers, I think I'll get one to post instead of me.
google? (Score:4, Interesting)
Re:google? (Score:3, Interesting)
Re:google? (Score:2)
Re:google? (Score:2)
Re:google? (Score:5, Informative)
Re:google? (Score:2, Interesting)
It's enabled by default - if you want exact word matches (the way it was a month ago) you need to search for: +keyword
Re:google? (Score:2)
css ~help
and you'll get sites with tutorials, guides, support, etc
how it can be useful (Score:5, Interesting)
However, after reading the article, I wonder whether the research can be applied to Latin languages, as they did the research on semantic languages.
Correctly paraphrasing is a difficult problem (Score:1)
after reading the article, I wonder whether the research can be applied to Latin languages, as they did the research on semantic languages
...is a good example :)
Hrm (Score:4, Interesting)
Okay, maybe I exaggerate a bit here. I did read the article, and while the summary isn't that far off from what these guys are doing...
Re:Hrm (Score:2)
So for the 1% summarisation of the article "The sentence-based paraphrasing system could improve machine translation, according to Barzilay".
Google News? (Score:5, Interesting)
Re:Google News? (Score:1)
Anyone know?
Re:Google News? (Score:5, Informative)
No, but Regina Barzilay, who is the researcher featured in the article, worked (with me) on the Newsblaster [columbia.edu] project at Columbia University, where she indeed applied these techniques to multidocument summarization. Newsblaster gathers and clusters news like Google News, but produces more sophisticated summaries.
Translation software? (Score:3, Informative)
Fascinating read (Score:1)
I wonder what its application could be, other than to detect duplicates... Perhaps a tool to suggest ways of rewriting sentences? Or maybe part of a more advanced grammar check?
Re:Fascinating read (Score:5, Insightful)
My first thought was translation tools. GOOD translation tools that understand the grammar of the source language and use the grammar of the destination language to form the resulting sentence.
There has been some work on solving this problem by translating a phrase in language A into a special "universal" intermediate code, and then from that into language B. The developers would then only need to make the translator convert each language to and from the universal code. The universal code could be whatever makes it easiest for the software to preserve the "meaning" of the sentence.
However, if this is done, the problem could change from this:
Source: I love hot dogs.
Destination: Ich liebe heiße Hunde. (i.e. a literal translation, from Altavista Babelfish)
to this:
Source: I love hot dogs.
Destination: Ich liebe Nahrung. ("I love food")
in case the universal language wasn't advanced enough and the English -> universal conversion was "lossy". So we might exchange our current problem of mangled grammar for a problem of lost information.
Here's a web site [mundo-r.com] about it, and I'm sure there are many more.
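A toy sketch of that interlingua idea (the dictionaries and concept codes below are invented for illustration; real interlingua systems are far richer), showing how a too-coarse universal code loses meaning:

ENGLISH_TO_UNIVERSAL = {
    "hot dog": "FOOD.SAUSAGE_IN_BUN",   # fine-grained concept
    "sausage": "FOOD.SAUSAGE",
    "food":    "FOOD",                  # coarse concept
}
UNIVERSAL_TO_GERMAN = {
    "FOOD.SAUSAGE_IN_BUN": "Hotdog",
    "FOOD.SAUSAGE":        "Wurst",
    "FOOD":                "Nahrung",
}

def translate(phrase):
    concept = ENGLISH_TO_UNIVERSAL.get(phrase, "FOOD")  # lossy fallback
    return UNIVERSAL_TO_GERMAN[concept]

print(translate("hot dog"))   # Hotdog  (interlingua is rich enough)
print(translate("corn dog"))  # Nahrung (interlingua too coarse -> meaning lost)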
Re:Fascinating read (Score:2)
Cool,
Re:Fascinating read (Score:1, Interesting)
Re:Fascinating read (Score:3, Interesting)
Fascinating (Score:3, Insightful)
I'd note that this is a novel approach, and, for better or for worse, it goes about doing things very differently from the way our minds do.
Actually, though, it's closer to how humans understand writing (stringing together atomic words/phrases in an implicit context) than previous statistical methods.
RD
Paraphrased version (Score:1, Interesting)
Cornell University researchers could revolutionize the information searching field by analyzing sentences on the semantic level to allow a software application to treat two sentences, expressing similar thoughts and ideas but written in a different manner, as a single semantic unit.
So... (Score:1)
Who will be first to post the paraphrased article so I don't have to RTFA?
Does this mean... (Score:1)
Re:Does this mean... (Score:2)
It's been done (Score:3, Interesting)
I looked again and whaddayaknow? I asked the paperclip about AutoSummarize and it is still there in the Tools menu after all! Looks like I don't have that feature installed, though.
I have it installed (Score:2)
It's been done by CanadaDave (544515) on Thu December 04, 9:20
Microsoft Word had AutoSummarize in Word 97, or was it 2000? Anyhow it seems to be absent in Word XP.
-----
Fantastic bit of programming there, Bill.
Not really the same thing Mr. Dave.
Who didn't think of Reginald Barclay? (Score:1)
Speaking of natural language recognition, I parsed this sentence from the article as reading, "Two ideas led to the system, said Reginald Barclay [startrek.com]
Someone help me out here (Score:5, Funny)
My take on this (Score:2)
Re:My take on this (Score:1)
Would you prefer it if we all spoke some sort of language governed strictly by some computer-linguistic grammar? I'll get started on the Yacc code right away...
~D
Re:My take on this (Score:3, Interesting)
Binary identity does not imply semantic equivalence. It all depends on how the data is interpreted.
Re:My take on this (Score:2)
> "It's immense".
> "It's massive".
> "It's huge".
Damn! I have GOT to remember to close the shades before I undress
Japanese manuals (Score:3, Funny)
Simon
Goodbye, Cliff Notes... (Score:4, Funny)
Hello, automatic paraphrasing of literature.
P.S. Just joking, kids. Stay in school!
What about... (Score:3, Funny)
Another Killer App (Score:5, Funny)
Of course, millions of lawyers worldwide would lose their jobs, but I, having been bitten by them, just take it as an added benefit.
It has to be said ... (Score:1)
(from the I'll-Paraphrase-YOU! department)
I get it. (Score:1)
Significant achievements [GOOD] in this area could revolutionize [IS] the information searching field. [THIS].
yo.
Finally ... (Score:5, Funny)
Forget Research! (Score:1)
Paraphrase of the article. (Score:5, Informative)
At Cornell University, researchers decided to avail themselves of two different sources of the same news and use computational biology methods to make it possible for computers to automatically paraphrase input sentences. Their first step was to compare the two sources.
Eventually, it is hoped that this research will have benefits in computer processing of natural-language queries, translation engines, and in assisting people with certain types of reading disabilities.
The project began when two ideas came together, said Regina Barzilay, one of the researchers, who is an assistant professor of computer science at the Massachusetts Institute of Technology.
The vast amount of duplicated content online is a valuable resource for computer systems learning to paraphrase. Many reporters report the same news using different wording. Because the same basic facts are reported in each, these redundant sources help the system learn the different ways one piece of information can be phrased. So with multiple sources, you can sort out the noise, get the facts, and then work out different ways of stating those facts.
Even with similar styles of writing, paraphrasing sentences is more than just working out and substituting synonyms. The researchers provide a couple of common business phrases to illustrate this:
After the latest Fed rate cut, stocks rose across the board.
Winners strongly outpaced losers after Greenspan cut interest rates again.
The next step was to use computational biology techniques to determine how much two sentences had in common and how closely they were related. The technique is similar to the one biologists use to see how close two sets of genes are that may have started from the same seed but then evolved: they are different but have a degree of similarity.
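To make the gene-alignment analogy concrete, here is a rough sketch using Python's difflib rather than the researchers' actual alignment algorithm; the two sentences are invented, modeled on the template quoted later in this summary:

from difflib import SequenceMatcher

a = "Palestinian suicide bomber blew himself up in Jerusalem on Tuesday killing 7 people".split()
b = "Palestinian suicide bomber blew himself up in Haifa on Sunday killing 19 people".split()

# align the two word sequences; shared stretches and variable "slots" fall out
for op, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
    if op == "equal":
        print("SHARED:", " ".join(a[i1:i2]))
    else:
        print("SLOT  :", a[i1:i2], "vs", b[j1:j2])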
The important thing was to compare news sources that were written differently but covered the same event. This generated a whole set of word patterns that were roughly the same, which was exactly the core data needed to inform a computer paraphrasing technique.
The Reuters and AFP news sources were used to test the system. News was selected from English articles produced between September 2000 and August 2002.
The system developed by the researchers performs two groupings; firstly comparing articles from the same source:
Word-based clustering methods were used to identify sets of text with a high degree of overlapping words. This method identified articles that reported distinct acts of violence occurring in Israel and the Palestinian territories.
Computational biology techniques were then used on these sets of articles to generate lattices, or sentence templates, for the computer to use. Each lattice contains a number of sets of words that occur in parallel, plus empty slots where arguments such as locations, numbers of fatalities, times, and dates can be inserted.
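As a rough sketch of the word-overlap clustering that feeds the lattice step just described (the threshold and articles are invented; the actual method is more involved):

def jaccard(a, b):
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

articles = [
    "suicide bomber kills 7 in jerusalem market attack",
    "bomber attacks jerusalem market killing 7",
    "fed cuts interest rates stocks rally",
]

clusters = []
for art in articles:
    for cluster in clusters:
        if jaccard(art, cluster[0]) > 0.3:   # high word overlap -> same event
            cluster.append(art)
            break
    else:
        clusters.append([art])
print(clusters)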
The challenge was to sort out which lattices were indeed due to different events and which were due to writing variability.
The researchers were thus able to identify common templates used by journalists to describe similar events. I.e., journalists who take the same article and change or take out a word, add a detail, reverse the sentence, and so on are hereby busted.
One of the templates, or lattices, read: Palestinian suicide bomber blew himself up in NAME on DATE killing NUMBER (other) people and injuring/maiming NUMBER. In addition to the injuring/maiming variable, there are several variables within the name argument: settlement of, coastal resort of, center of, southern city, or garden cafe.
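One possible way to represent such a lattice in code, purely as an illustration (the slot fillers like "Netanya" are invented; the parallel word variants are the ones quoted above):

lattice = [
    ("words", ["Palestinian suicide bomber blew himself up in"]),
    ("words", ["settlement of", "coastal resort of", "center of",
               "southern city", "garden cafe"]),   # parallel variants inside NAME
    ("slot",  "NAME"),
    ("words", ["on"]),
    ("slot",  "DATE"),
    ("words", ["killing"]),
    ("slot",  "NUMBER"),
    ("words", ["people and injuring", "people and maiming"]),
    ("slot",  "NUMBER"),
]

def realize(lattice, args):
    """Fill the slots to generate one paraphrase; pick the first word variant."""
    out = []
    slots = iter(args)
    for kind, value in lattice:
        out.append(next(slots) if kind == "slot" else value[0])
    return " ".join(out)

print(realize(lattice, ["Netanya", "Sunday", "4", "20"]))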
The system thus discovered 43 AFP and 32 Reuters templates. The researchers then cross-compared these lattices.
They compared the
Pleasure-ism (Score:2, Funny)
Oh wait...
Mrs. G: Johnny, come here for a second.
Johnny: Yes Mrs. G?
Mrs. G: What did you mean by "Shrub claimed that Basket Hamper and the Hatchets of Sin will be blown out" in your current events report?
Johnny: Oh, whoops! What I meant to say there was, "Bush says Bin Laden and the Axes of Evil will be defeated." Sorry about that. Darn that defective spell-ch
Obligatory Paraphrases (Score:3, Funny)
How do you paraphrase Slashdot ?
Ans : Dupes for nerds, stuff that matters again and again.
How do you paraphrase Microsoft Innovation ?
Ans :
Re:Obligatory Paraphrases (Score:2)
Apple?
But could it..... (Score:2, Funny)
Better idea (Score:3, Insightful)
Not to mention the increased ability to quickly spot "re-written" bought term papers.
Interesting (Score:5, Funny)
The output of LSA has been shown to be roughly equivalent to that of human scorers when examining summary essays produced in tests.
Point is, by combining this here paraphrasing algorithm with LSA, we can have computers summarizing text and other computers grading them on it. This takes students and teachers out of the equation entirely. Saves us big bucks and gets public education back on its feet!
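For the curious, a rough sketch of LSA-style scoring (my own toy example built with scikit-learn, not the CU system): project the texts into a low-rank "semantic" space and grade a summary by its similarity to the source.

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "The Fed cut interest rates and stocks rose across the board.",  # source
    "Greenspan cut rates again and winners outpaced losers.",
    "A new species of beetle was discovered in Brazil.",
    "Stocks rallied after the interest rate cut.",          # on-topic summary
    "Scientists found an insect in South America.",         # off-topic summary
]

tfidf = TfidfVectorizer().fit_transform(documents)
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

source, good, bad = lsa[0:1], lsa[3:4], lsa[4:5]
print(cosine_similarity(source, good))  # expected to be the higher "grade"
print(cosine_similarity(source, bad))   # expected to be the lower "grade"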
Re:Interesting (Score:2)
It's coloradO.edu
SCO Analysis (Score:5, Funny)
"Pass me the crackpipe, man!"
Proudly karma-whoring since the turn of the millennium
Re:SCO Analysis (Score:2)
Like all other disasters, it seemed like a good idea at the time. :-)
Patenting fake .sigs since the turn of the millennium
extracting & searching on memes (Score:2)
Neat trick if they can pull it off. Then Google results would really improve.
LOLITA? (Score:3, Interesting)
Re:LOLITA? (Score:2)
Spamfilter (Score:3, Interesting)
Already been done (Score:2)
Advances in Automatic Text Summarization (Score:5, Informative)
Call Infocom! (Score:3, Interesting)
For the lazy, or interested, a summary via OS X! (Score:5, Informative)
At a roughly 10% size:
At a quarter size:
Re:For the lazy, or interested, a summary via OS X (Score:2)
Select the text.
Choose from the Application (e.g. Safari) menu in the menubar Services...Summarize.
The Summary tool pops up. Hooray! The sad part is they demoed it at MacWorld Boston '97, and released it in Jaguar, IIRC.
Re:For the lazy, or interested, a summary via OS X (Score:2)
I think Apple uses the service internally in their file indexing and search feature, too!
Online Machinese syntax parser (Score:2)
http://www.connexor.com/demos/syntax_en.html
How I do this in my product (Score:4, Interesting)
I first classify the text into a category, then weight every word in the text based on how much it contributed to this classification. I then output as a "summary" the one or two sentences in the original text that contribute most to the classification of the entire text.
Not really summarization, but useful.
-Mark
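A minimal sketch of the approach Mark describes, with invented word weights standing in for whatever his classifier actually produces:

category_word_weights = {   # contribution of each word to the chosen category
    "paraphrase": 3.0, "sentences": 2.0, "semantic": 2.5,
    "software": 1.5, "cornell": 1.0,
}

text = ("Cornell researchers built software that can paraphrase sentences. "
        "The work treats semantically equivalent sentences as one unit. "
        "It was a rainy day in Ithaca.")

def summarize(text, weights, n=1):
    # score each sentence by the total weight of its words, keep the top n
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    def score(sentence):
        return sum(weights.get(w.lower(), 0.0) for w in sentence.split())
    return sorted(sentences, key=score, reverse=True)[:n]

print(summarize(text, category_word_weights))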
Translations (Score:2)
Imagine using a computer to translate from one language to another and ending up with a grammatically correct result. That would be amazing...
Re:Finally! (Score:2)
He says a lot while he's there, but after they run it through some sort of language processor they find out that he said exactly *zip*.
Aren't weasel-words fun?