Computers Summarize the News
oily_ants writes "I get sick and tired of reading the same story on different web sites. That's why I like slashdot so much. Good (??) summaries of all of the stuff out there on the net. Now there is a project at Columbia University by the nlp group that attempts to generate computer summaries of all of those news articles on different web sites. The project is called Newsblaster and the summaries are excellent. You can read about the project on regular news sites like Online Journalism Review or USA Today."
Also try... (Score:5, Informative)
Re:Also try... (Score:2, Interesting)
Okay, the timezone is different (I'm in Europe). But two weeks...
There you can also easily see how differently different news agencies report the same story. For example, a few days ago there was this story about depleted uranium and it being dangerous:
news Agency #1: new study finds connection between depleted uranium and heightened risk of cancer.
news Agency #2: Soldiers exposed to depleted uranium bullets at risk of cancer.
news Agency #3: Children in Yugoslavia might get cancer because of NATO's depleted uranium bullets
Talk about reporting the facts objectively...
It would have been interesting to see the summary generated from these three... Might have been a bit schizophrenic...
Re:Also try... (Score:4, Interesting)
I've seen a couple of occasions where it's had an article on a completely different subject under the header, but it's not the norm. It's always up to date. My only gripe is that it doesn't have an "Older Stories" link. I've gone back to try and find something I've seen before only to find that it had been pushed off.
They also keep a list of links to news sources and current relevant resources at http://www.google.com/news/ [google.com].
Re:Also try... (Score:1)
Re:Also try... (Score:2)
Of course, they might have a completely different formula, or they may have a large body of optimizations they can do, but that's one possibility.
Re:Also try... (Score:2)
As they add new sources to the service I reckon it will start to truly rock.
If I have a criticism it's that I have to go to the original site to view the story - all I get at google is the headline. This means I have to wait about a MONTH for a CNN page to load when I want to know some more...
Re:Also try... (Score:1)
Re:Also try... (Score:1)
That's what I get for forgetting the http part.
Re:What Will Google Do Next? (Score:4, Informative)
One More Time! (Score:2, Funny)
So you read Slashdot, where they are happy to post the same story over and over, on different days?
Hello -- copyright issues? (Score:2)
Say what? (Score:5, Funny)
I'm sure most will agree with me when I say that this makes ABSOLUTELY NO SENSE.
Re:Say what? (Score:4, Funny)
Google kind of does the same thing (Score:1)
Re:Google kind of does the same thing (Score:2)
Re:Google kind of does the same thing (Score:1)
I could be wrong though...
Re:Google kind of does the same thing (Score:1)
If it's the AP feed that they're all picking up, then sure, it's the same story.
What's interesting is when the story is different - if CNN says one thing, the BBC says something a little different, and the Times of India has yet a third viewpoint, then it's interesting to speculate on the editorial biases that are leading to such divergent viewpoints.
Well, there you go (Score:3, Interesting)
Well, there's the answer to the Ask Slashdot from a couple of days ago [slashdot.org].
Re:Well, there you go (Score:1)
Oh well. The search continues.....
Google already does this (Score:1)
As others have pointed out, they've also just launched a beta news summary service.
Now to ask... (Score:3, Insightful)
Re:Now to ask... (Score:3, Informative)
We've been doing that for ages. (Score:2, Interesting)
Re:We've been doing that for ages. (Score:2, Informative)
Newsbot (Score:4, Funny)
Re:Newsbot (Score:2)
Re:Newsbot (Score:2)
Re:Newsbot (Score:2)
I'm more interested in whether the thing makes any humorous errors. That and whether it can eventually out-rotten Daily Rotten [rotten.com].
Slashdot's financial problems are SOLVED! (Score:2, Funny)
Although, it will only be possible to replace slashdot's editors with the newsblaster program if they can implement some sort of misspelling and false information algorithm.
Re:Slashdot's financial problems are SOLVED! (Score:2)
Alternative to my.yahoo.com? (Score:1)
Does anyone know of an alternative? The newsfeed in the article looks promising, but not exactly what I'm looking for.
Re:Alternative to my.yahoo.com? (Score:1)
Also follow stocks, check the weather, see if I have Yahoo e-mail, and chat with the same application. And there's only one minor ad at the top, which I don't notice anymore. I'm not saying you should change from ICQ or AIM or whatever, but Yahoo Messenger is a pretty cool app for keeping up with the world outside your cube.
-FF
Do I want one? (Score:1)
I think Newshub is better (Score:3, Interesting)
Impressive (Score:5, Informative)
To tell you the truth, at first I thought the summaries were TOO good; I was suspicious that it wasn't really automated.
But after looking at a few more stories, it looks like it just pulls sentences out of the stories that seem to have a different point to make, and strings them together.
Sometimes you see some redundancy and some non-sequiturs, but I have to admit the illusion is pretty good.
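The behavior guessed at above (pull out sentences that each make a different point, then string them together) can be sketched roughly as follows. This is only an illustration of that guess, not Newsblaster's actual method; the function names, the word-overlap measure, and the 0.5 cutoff are all assumptions made for the example.

```python
import re

def words(sentence):
    """Lowercased word set for a sentence (crude tokenizer)."""
    return set(re.findall(r"[a-z0-9']+", sentence.lower()))

def jaccard(a, b):
    """Overlap between two word sets, from 0.0 (disjoint) to 1.0 (identical)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def extract_summary(stories, max_sentences=3, max_overlap=0.5):
    """Greedily keep sentences that differ enough from those already chosen."""
    chosen = []
    for story in stories:
        # Split on whitespace that follows sentence-ending punctuation.
        for sentence in re.split(r"(?<=[.!?])\s+", story.strip()):
            if not sentence:
                continue
            if all(jaccard(words(sentence), words(c)) <= max_overlap for c in chosen):
                chosen.append(sentence)
            if len(chosen) == max_sentences:
                return " ".join(chosen)
    return " ".join(chosen)
```

Because nothing here checks that adjacent sentences belong together, a picker like this would produce exactly the kind of non-sequiturs mentioned above.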
Read the papers (Score:3, Informative)
I'm not sure if they've done anything really novel. I skimmed through one of the more recent papers, on sentence ordering, but that seems to only operate on the same event. There's research like this going on at a lot of major universities like CMU [cmu.edu] and MIT [mit.edu]. That said, it does look impressive.
Re:Read the papers (Score:5, Informative)
The "single event" summarizer is novel though. It uses a clustering component to cluster the sentences, then for each cluster it takes the intersection of the sentences (yes, we need to parse the text to do this, and we do) and RE-GENERATES (does not extract) a sentence that synthesizes the information from the cluster.
There's a lot of other stuff going on as well: we're using a text categorization system that we developed here, a text clustering system, our own system for categorizing the images that come with the articles (you'll be able to browse by image categories soon as well) and some other stuff.
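As a very rough picture of the "cluster the sentences, then intersect each cluster" idea, here is a toy sketch. It works on bags of words rather than parse trees and it does not regenerate a sentence, so it is far simpler than what is described above; the greedy clustering, the 0.4 threshold, and all names are assumptions for illustration only.

```python
import re

def tokens(sentence):
    """Lowercased word set for a sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))

def cluster_sentences(sentences, threshold=0.4):
    """Greedy single pass: join the first cluster whose seed sentence is similar enough."""
    clusters = []
    for s in sentences:
        t = tokens(s)
        for cluster in clusters:
            seed = tokens(cluster[0])
            union = t | seed
            if union and len(t & seed) / len(union) >= threshold:
                cluster.append(s)
                break
        else:
            # No existing cluster matched, so this sentence starts a new one.
            clusters.append([s])
    return clusters

def shared_terms(cluster):
    """Words that appear in every sentence of the cluster."""
    return set.intersection(*(tokens(s) for s in cluster))
```

The shared terms of a cluster are the raw material a generator would then have to turn back into a fluent sentence, which is the hard part this sketch leaves out.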
And we'll be ready (Score:1)
breadth vs depth (Score:5, Insightful)
Re:breadth vs depth (Score:1)
Newsblaster doesn't do that. What it does is grab news stories from a selection of different sites, search through the stories for certain words, phrases, or sentences, create a summary of the story, put it in a heading under the "hits" it made, and provide the link.
This actually unseated /. from its throne as my home page. =]
Re:breadth vs depth (Score:2)
Sure you do, when you're trying to figure out the gist of the story in overview mode. The problem you mention with CNN HN is because TV is a non-interactive medium, and you can't find out/they can't provide you with more information about the story in question. With this, you can.
The summarizer has to get its information from somewhere, from the full news story...this is just a way of giving you the executive version (akin to browsing slashdot with a comment threshold to 5), so you can find out the basics before delving into the details.
Postmodern, postliterate Americans (Score:2)
This new averaged, filtered, genericized "news" is exactly the kind of crap suited to a society that spawned "Judge Judy" and "A Current Affair". Sure, it's a nice piece of technical wizardry, but all things clever are not useful or worthwhile.
Re:breadth vs depth (Score:1)
Having said that, my preferred source for real-world news is Radio 4 [bbc.co.uk] (especially Today [bbc.co.uk] and PM [bbc.co.uk]), which is especially good for listening to as I'm waking up or going to sleep, apart from the moments when a politician says something so outrageous that my blood boils...
Still Some Work To Be Done... (Score:4, Funny)
Check out this odd story about incarcerated Browns [columbia.edu]. The summarizer could apparently still use some manual supervision.
Seems nice enough... (Score:4, Insightful)
Filtered news (Score:1)
Copyright Infringement - Fair Use Doctrine -NOT! (Score:3, Interesting)
Every one of these paraphrasers lifts large chunks of syntax.
I would maintain that this is still plagiarism or a copyright violation unless it is done really well.
And it never will be done really well unless NeuralNetwork chips are common and mankind has advances in Artificial Intelligence research. Five years away at best.
I dare the commercial services to hit Encyclopedia Britannica. Or I dare them to routinely slurp the New York Times and boast that they digest the New York Times.
A massive Civil Suit is awaiting some of these early adopters planning on creating a business out of this.
And they deserve it.
It is just "Word Twiddling", however useful.
If the twiddling is done live, once, per user client, then maybe it's OK, but none of these business models are set up THAT way.
Re:Copyright Infringement - Fair Use Doctrine -NOT (Score:1)
I do think this service would be a lot more useful if it were done so that one didn't have to go to a separate page for the summary.
Re:Copyright Infringement - Fair Use Doctrine -NOT (Score:2)
Slurping a sentence or two from a 5-25 paragraph article and quoting it with attribution is considered fair use, right?
I'm not clear on if they're quoting and attributing it sufficiently to meet a legal challenge however. IANAL. But it's not the open and shut case you make it out to be as far as I can tell.
--LP
Re:Copyright Infringement - Fair Use Doctrine -NOT (Score:2)
The service claims to be a computerized summary. However, in terms of copyright, a summary is something that expresses the same idea using different words. Therefore, using exact quotes and labelling them as a summary is a textbook case of plagiarism.
Nathan
Re:Copyright Infringement - Fair Use Doctrine -NOT (Score:3, Insightful)
As Bitlaw [bitlaw.com] points out, under the Copyright Act, four factors are to be considered in order to determine whether a specific action is to be considered a "fair use." These factors are as follows:
1) the purpose and character of the use, including whether such use is of commercial nature or is for nonprofit educational purposes;
2) the nature of the copyrighted work;
3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
4) the effect of the use upon the potential market for or value of the copyrighted work.
Re:Copyright Infringement - Fair Use Doctrine -NOT (Score:3, Insightful)
Attempting to apply the four factors there, while some could be argued either way, I can see that on balance, you both might be right. I could probably make a stronger case that it doesn't qualify as fair use than that it does, based on those four factors. I think I was focusing over-much on the "amount taken" criterion and overlooking the others.
--LP
Re:Copyright Infringement - Fair Use Doctrine -NOT (Score:2)
Nathan
copyright/legality? (Score:5, Interesting)
Re:copyright/legality? (Score:1)
Paraphrasing is legal under copyright law, so long as the sources are cited and it's not just a cut-and-paste of the entire selection.
Re:copyright/legality? (Score:2)
Sure, their main page just briefly quotes, which is probably ok, but all the links point to local copies of the copyrighted news articles.
(In the USA) there are four criteria for judging what is a fair use [benedict.com] of copyrighted material.
The purpose and character of their use isn't academic or educational, it's a news service just like the original sites they got the text from. The fact that it's hosted from a
The amount and substantiality of the portion used in relation to the copyrighted work as a whole is darn close to 100% of the copyrighted material.
The effect of the use upon the potential market for or the value of the copyrighted work is particularly bad... if people can easily get the news from this convenient summary site, why would they bother to visit the original site (and thus be an audience for their advertising, become "loyal" readers, etc).
Now the nature of the copyrighted work is informational news, and not really expressive (like songs, movies, etc), so at least they've sort-of got one of the four criteria for fair-use.
Direct NewsBlaster link (Score:3, Informative)
www.cs.columbia.edu/nlp/newsblaster/ [columbia.edu]
although I found some of the summaries slightly shallow, they are not bad.
The problem is that it becomes an average of opinion, when you sometimes need that longer insightful article. This easily could become the news of sheep everywhere.
This could be bad when facts come in to contradict initial impressions.
oops
my God Man (Score:1)
Let computers be computers, humans be humans (Score:1)
Occasionally, the summary will juxtapose two sentences (it's just ripping exemplar sentences from different stories) that, when put together, screw up the meaning [columbia.edu]:
"Now that David Letterman is staying at CBS, ABC s corporate bosses took steps to mend fences with"Nightline"host Ted Koppel on Tuesday. And that ' s appealing to beer companies..."
Doh!
I think a more fruitful avenue of research is new methods of presenting information so that humans can decide what to read. Instead of using tricks to simulate a computer understanding the meaning of an article, this uses the same tricks to simply assist reading the article.
Apple's research group did some interesting work [acm.org] in that area in the 90's.
Used by U.S.A. Today? (Score:2, Funny)
This is old 'news'... (Score:1)
Hi, I send you this news to ask for your advice!
Along with all kinds of pertinent documents...
Slashdot has something they don't have (Score:1)
Grammar, even (Score:1, Flamebait)
Oops! (Score:1)
So they grab news from the Washington Post, Reuters and the BBC (amongst others), but leave out the National Enquirer [nationalenquirer.com]? Why can't I have all my "Space Aliens Abducted Britney Spears" stories in one place?
How does it work? (Score:3, Interesting)
Also, who else thought "neuro-linguistic programming" for at least a moment when they saw "nlp"?
Is that really possible? (Score:1)
Re:Is that really possible? (Score:1)
In the TIDES project, we will develop a practical, multilingual and multidocument information tracking and summarization system. Our design features the integration of robust, statistical techniques, shallow linguistic approaches and machine learning to achieve scalability within languages and portability across languages. To realize these goals, we will develop methods for summarization across documents using information fusion and identification of key differences, summarization across languages relying on identification and translation of terms, and new methods for identification, expansion and translation of terms. Unlike most other approaches, rather than relying on sentence extraction, our work uses information fusion of similar information, merging together repetitive phrases into a single phrase allowing dramatic reduction of information across many articles. Our work will focus on characterizing types of differences to include in a summary, which is an unexplored direction in multi-document summarization. We will develop difference operators to identify new information, contradictions, trends, multiple perspectives, and different topics. Our approach will minimize reliance on full machine translation, instead using identification, expansion and translation of terms where possible. We will begin work with a language such as Spanish, but quickly expand to include Asian languages and other non Indo-European languages.
From the look of it, NLP (natural language processing) seems to be evolving nicely. It used to be that NLP required processing the entire document and understanding the sentences by mapping hierarchies of valence/word order.
Still a few bugs in the summarizer (Score:1)
Here is one borderline-incoherent Newsblaster summary [columbia.edu]:
Will Hollywood's 'Tomb' be a box office 'Raider?'
Summary:
After the success of last year ' s Lara Croft: Tomb Raider and high expectations for Resident Evil, out Friday, studios are booting up for more. The game: Alice in Wonderland gets a twisted remake in American McGee ' s Alice, a gothic horror version of the classic tale; based on the game by Electronic Arts Studio: Dimension Status: Horror master Wes Craven directs. The game: A sunglasses-wearing all-American hero blows away bad guys with machine guns; 3D Realms Studio: Dimension Status: In limbo. Star Angelina Jolie was attached to the sequel before the original Tomb Raider opened. The game: Amateur taxi drivers take to the sidewalks and crowded streets, picking up customers and delivering them to their destinations unscathed; Sega Studio: No distributor yet. Brothers Jon and Erich Hoeber(Montana) currently are writing the screenplay.
Evolution (Score:1)
the future. god bless it.
Sounds like the "DJ 3000" (Score:2)
What. a. bunch. of. clowns.
Re:Sounds like the "DJ 3000" (Score:1)
What. a. bunch. of. clowns.
Heh heh heh... How does it keep up with the news like that?
Re:Sounds like the "DJ 3000" (Score:1)
newslinx (Score:1)
Irony....on a Friday?? (Score:1)
Where's today's news? (Score:2)
O'Reilly Network's Meerkat (Score:1)
My newsfeed (Score:2)
2. Fark. I get my good share of the weird news, and of course NewsFlash articles, which are just links to other news sites. I'm happy.
One benefit... (Score:2)
I noticed the "blandness" of the summaries too, but I think that's a benefit-- reading CNN stories can get really tiring after a few minutes since everything has to have as much punch as possible.
Here are some papers (Score:2, Informative)
If you don't like the same story over and over... (Score:1)
I assume that means you only read the stories here, and not the posts.
Summarizing... (Score:1)
Compare the summary with the opening sentences of the first and last articles. Maybe we should wait a bit before speculating on the business impact of this "technology".
hmm. (Score:1)
someone should make line graphs.
Our product does the same thing, and copyright iss (Score:1)
Our product works by first categorizing text articles, then identifying which phrases most effectively support the categorization of the article.
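To make the two steps concrete (categorize first, then find the text that best supports that category), here is a toy sketch. It is not the product described above: the keyword lists, category names, and scoring by simple keyword counts are all invented for the example, and it ranks whole sentences rather than phrases.

```python
import re

# Hypothetical keyword lists; a real system would presumably learn these from labeled data.
CATEGORY_KEYWORDS = {
    "health": {"cancer", "uranium", "study", "risk"},
    "entertainment": {"film", "studio", "game", "sequel"},
}

def categorize(text):
    """Pick the category whose keywords occur most often in the text."""
    found = re.findall(r"[a-z]+", text.lower())
    scores = {cat: sum(w in kws for w in found) for cat, kws in CATEGORY_KEYWORDS.items()}
    return max(scores, key=scores.get)

def supporting_sentences(text, top_n=1):
    """Return the category plus the sentences that contribute most keyword hits to it."""
    category = categorize(text)
    keywords = CATEGORY_KEYWORDS[category]
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    def support(sentence):
        return sum(w in keywords for w in re.findall(r"[a-z]+", sentence.lower()))

    ranked = sorted(sentences, key=support, reverse=True)
    return category, ranked[:top_n]
```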
subject of copyright infringement: several people have pointed out that the linked site may go beyond fair use of text on the original news sites.
I bet that the university obtained permission like I did. I sent about 10 news web sites copious documentation on what my system does, and three gave me permission to use their sites. As is usual in life, it helps to ask politely!
-Mark Watson www.markwatson.com
Wanna watch it crash? (Score:2)
I have yet to figure out exactly what his point is in a story, so it would be really interesting to see software try and handle it.
Of course, it did say it summarizes NEWS stories.
Re:Wanna watch it crash? (Score:1)
Let the users vote on the articles (Score:1)
a) interesting
b) well-written
c) in the right column
or similar.
So the majority of people who visit these sites won't have to worry about "bad-news"
Those articles that get downvoted get off the frontpage. This process repeats whenever an article is added and assures a high-quality.
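A minimal sketch of that voting idea, assuming a simple up-minus-down score and a cutoff of zero (both are assumptions, as are the class and function names):

```python
from dataclasses import dataclass

@dataclass
class Article:
    title: str
    up: int = 0
    down: int = 0

    @property
    def score(self):
        """Net votes: upvotes minus downvotes."""
        return self.up - self.down

def front_page(articles, min_score=0, limit=10):
    """Articles at or above min_score, best first. Re-run whenever an article is added."""
    visible = [a for a in articles if a.score >= min_score]
    return sorted(visible, key=lambda a: a.score, reverse=True)[:limit]
```

Calling front_page again after each new article or vote is what makes downvoted stories drop off the front page.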
comp.sci
Bias a good thing? (Score:1)
*I'm not trying to take a side in that, but I've recently found American news sources *seem* to skew that kind of news a little.
A Few Kinks, A few Comments (Score:3, Interesting)
While I have to agree with some people that this isn't in-depth reporting I do think that it is pretty interesting AI. When it comes down to it the problem is not that a computer might be summarizing our news. The problem is twofold.
Firstly, people are not always inclined to look beyond summaries. When faced with typical time constraints, people prefer to look at summaries because they do not have time to search across a dozen sources and articles. This is why USA Today became big in the first place. Nothing there is more than 1 column long. (Incidentally, did anybody else find it hilarious that this system "summarizes" USA Today, who themselves summarize other news sources?)
Secondly much of the news is the same. News is big business and most major news media tell the stories that sell. Because they are all targeting the same markets they tell the same stories and in the same ways. Therefore there is little difference between CNN, the NY Times, etc in terms of tone and "facts". Especially since much of "their news" comes from the same wire services such as Reuters [reuters.com]. Fox News is different but that is because they have abandoned the mantle of impartiality and become all conservative all the time.
In essence this system is perfect for the internet news style. Brief summaries of facts followed by more "in-depth" leads that we may peruse as we wish. The real question is, when will this begin drawing on sites like Indymedia [indymedia.org], The Register [theregister.co.uk] and
I'm afraid to Slashdot a great site, but... (Score:4, Funny)
Basically, it looks at the headlines on Yahoo/Reuters, and finds sentences that scan as 5/7/5, and uses Perl cleverness to present them as little news haikus (or senryu, if you wanna be picky). It's great stuff:
I'm hooked :)
They have archives going back to the beginning of 2001, with only a few holes (e.g. the days after September 11), and they talk about how they are doing everything. Bonus points: you can have the haiku headlines mailed to you automagically every day. I just hope they have the bandwidth (etc) to withstand Slashdot....
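The 5/7/5 scan mentioned above can be approximated in a few lines. The site reportedly uses Perl; this sketch is Python, and the syllable counter is a crude vowel-group heuristic chosen for the example, so it will miscount plenty of real English words.

```python
import re

def syllables(word):
    """Rough estimate: count vowel groups, minus a silent final 'e'."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return 0
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def as_haiku(sentence, pattern=(5, 7, 5)):
    """Return three lines if the words break exactly at 5/7/5 syllables, else None."""
    words = sentence.split()
    lines, i = [], 0
    for target in pattern:
        line, total = [], 0
        while i < len(words) and total < target:
            line.append(words[i])
            total += syllables(words[i])
            i += 1
        if total != target:
            return None  # a word straddles the line break, or the words ran out
        lines.append(" ".join(line))
    # Reject sentences with words left over after the third line.
    return lines if i == len(words) else None
```

Requiring line breaks to fall on word boundaries is why only a small fraction of headlines would qualify.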
How wrong can you get? (Score:3, Insightful)
Besides the fact that
Every news site has some kind of slant to it. CNN, NPR,
I read news from about 10 sources a day and if I see multiple articles that I'm not interested in they're easy to skip. If I am interested in them I read them on all sites. You get much much more information that way.
Though you do need to pick your sites. If you look at CNN, MSNBC and Salon and all three are merely parroting Reuters then you know you're not doing yourself any good.
One problem... (Score:1)
LOL (Score:1)
This is doubly funny because last spring I took a class in computer graphics by the man behind augmented reality [slashdot.org], which is again on the front page today because of the street sign article. [slashdot.org]
This means:
another site... (Score:2)
www.newsnow.co.uk does similar stuff I guess, but is the summary builder the thing here?
Another good news site (Score:2, Informative)
uh... this could be bad.. (Score:2, Insightful)
Tried the new google news search service yet? (Score:2, Interesting)
http://news.google.com/ [google.com]
It indexes a huge array of news sites several times a day for fresh stories - enter a search term and it will bring up all the headlines it can find for that subject. Best of all, it uses an algorithm to identify alternative coverage of any one story and lists these links in a block beneath the main search results. That way you get links to several different accounts of the same story (although in practice they end up being pretty similar due to using the same news agencies) without having to hunt around for them yourself.
They're still working on the algorithm and are requesting as much feedback as possible - read more here [google.com].
Now it just needs SOAP (Score:3, Interesting)
I was writing about this in response to a post in a user's journal the other day: even better would be to make a story content P2P system where you could allow story distribution. You might place a limit and only allow the summary to drive people to your site, but it could still help with bandwidth issues. This would basically be like an enhanced RDF/RSS type system, but over a P2P type network you wouldn't even really have to host your own feeds for people. Add in some sort of DB persistence and you could just say "get new headlines and summaries from site x"--the system would bring in all the new content. Anyway, that is just a dream I have and probably will never happen the way some people feel about their content.
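The "get new headlines and summaries from site x" part, minus the P2P layer, could look something like the sketch below. It assumes a plain RSS 2.0 feed without XML namespaces (RDF-style RSS 1.0 would need namespace handling), takes the feed text as a string rather than fetching it, and uses an in-memory set where the post imagines database persistence. The function name is made up for the example.

```python
import xml.etree.ElementTree as ET

def new_items(rss_xml, seen_links):
    """Return (title, description, link) for feed items whose link has not been seen yet."""
    root = ET.fromstring(rss_xml)
    fresh = []
    for item in root.iter("item"):
        link = item.findtext("link", default="")
        if link and link not in seen_links:
            seen_links.add(link)  # remember it so the next poll skips this story
            fresh.append((
                item.findtext("title", default=""),
                item.findtext("description", default=""),
                link,
            ))
    return fresh
```

Swapping the set for a database table keyed on the link would give the persistence described above.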
Already done - Newshub (Score:2, Informative)
newshub.com [newshub.com]
The real magic isn't in the summaries... (Score:2, Insightful)
...but rather in identifying multiple documents that appear to be talking about the same thing. Summarization is a well-researched (but not well-perfected) NLP topic, but finding inter-document similarities is quite a bit more challenging. This is easy for me and you to do when we read something, but think about what it takes to get a machine to do this. Take a look at some of the examples--you'll find that although large chunks may be verbatim from document to document (especially ones that rehash standard news feeds like Reuters and AP), most articles have a different wording or spin on each idea.
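For a feel of the simplest possible baseline for "are these two documents about the same thing", here is a bag-of-words cosine similarity sketch. It is not how Newsblaster or any particular system does it; the 0.5 threshold and the function names are assumptions, and as noted above, articles that reword the same idea will defeat a measure this shallow.

```python
import math
import re
from collections import Counter

def term_counts(text):
    """Word frequency table for a document."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-count tables (0.0 to 1.0)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def same_story(doc_a, doc_b, threshold=0.5):
    """Guess whether two documents cover the same story."""
    return cosine(term_counts(doc_a), term_counts(doc_b)) >= threshold
```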
Re:Dude. (Score:1)
Heh, it wasn't in the initial summary of the slashback (and nothing in the summary interested me).
Oh well, I'm on a bad roll lately. It's just karma, I guess...