Open Source Automated Text Summarization?

TrebleJunkie writes "I've spent some time recently looking for open source projects dealing with Automated Text Summarization -- automatically generating detailed summaries from longer documents -- to no avail. I can find a lot of research papers and several commercial projects, but no open source code or projects. Does anyone out there know of any?"
  • by Anonymous Coward
    is there a non-free program that does this?
    • is there a non-free program that does this?

      Microsoft Word.

      It doesn't do it all that well, from what I've seen, but it does it. It's called "AutoSummarize".
      • Mac OS X has a built-in Summarize Service that works much better than the one in Word, IMO. Sorry, I can't think of any open source ones.
      • "doesn't do it all that well" is being kind. My wife tried it on a 5 page letter she'd written, and the results were...bizarre. Yes, the text was from the document. Far from the most relevant parts, seemingly grabbed at random.

        My best guess at what it does is some sort of word-frequency count, ignoring common words like 'the', then including the top N% of sentences, plus those adjacent to them that contain the most frequent words. Also, give a higher weighting to material at the beginning and end, since papers following the classic form tend to say what they're going to say, say it, then say what they've just said.
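
        A minimal sketch of that guess (a hypothetical script, not how Word actually does it): count non-stopword frequencies, score each sentence by the frequency of its words, boost sentences near the beginning and end, and print the top 20% in document order. The stopword list and thresholds are arbitrary placeholders.

        #!/usr/bin/perl
        # Toy frequency-based summarizer: scores sentences by how many
        # frequent (non-stopword) terms they contain, with a positional boost.
        use strict;
        use warnings;

        my %stop = map { $_ => 1 } qw(the a an and or of to in is it that for on with as);

        local $/;                                   # slurp the whole document
        my $text = <>;

        # Very rough sentence split on ., ! or ? followed by whitespace.
        my @sentences = split /(?<=[.!?])\s+/, $text;
        exit unless @sentences;

        # Word frequencies, ignoring stopwords.
        my %freq;
        for my $s (@sentences) {
            $freq{$_}++ for grep { !$stop{$_} } map { lc } $s =~ /\w+/g;
        }

        # Score each sentence; weight the first and last 10% of the document higher.
        my @scored;
        for my $i (0 .. $#sentences) {
            my $score = 0;
            $score += $freq{lc $_} // 0 for $sentences[$i] =~ /\w+/g;
            my $pos = @sentences > 1 ? $i / $#sentences : 0;
            $score *= 1.5 if $pos < 0.1 || $pos > 0.9;
            push @scored, [ $i, $score ];
        }

        # Keep the top 20% of sentences, then print them in document order.
        my $keep = int(@sentences * 0.2) || 1;
        my @top  = sort { $a <=> $b }
                   map  { $_->[0] }
                   (sort { $b->[1] <=> $a->[1] } @scored)[0 .. $keep - 1];
        print "$sentences[$_]\n" for @top;

        Scoring whole sentences by term frequency is basically what extraction-based summarizers do; note that nothing here composes new text, it just picks existing sentences.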

    • Check out ArchiText [yellowbrix.com] from YellowBrix.

      Having looked at their demos and so on, I can say they have some great summary software.

      It is most certainly NOT free, but perhaps by looking at the summaries generated and the documents pulled from, you could get some idea how to reverse-engineer the process.

    • The company I work with has a summarization library that does this. Pricing depends upon how you use it. I know that they've made fairly good deals for educational uses. It was more designed for writing automated abstracts, but it does an amazingly good job on news sources as well.

      Obvious caveats apply -- i.e., I work for them and helped write the thing. However, if you need that sort of thing, or something more particular, contact Lextek [lextek.com].

  • I know this isn't exactly what you are looking for, but I remember SAT prep books that teach you to read the first line of every paragraph to get a quick summary. Granted, it works better for the SATs than it does IRL, but it often works pretty well and it's better than nothing. You could whip up a simple perl script to extract the first line of each paragraph in no time (a rough sketch follows at the end of this thread).
    • A local TV talk show host once revealed that if he didn't have time to read the book of an author/guest he would read the first chapter, the first page of each chapter, and the last chapter.

      Yeah, it's off-topic, but it's not redundant! Stupid moderators -- meta-mod will bite you back!
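
    For what it's worth, a rough sketch of the perl script mentioned above (a hypothetical one-liner, nobody's shipping code): read the input in paragraph mode with -00 and print the first sentence of each paragraph. The sentence-boundary regex is a crude placeholder.

    perl -00 -ne 's/\s+/ /g; print "$1\n" if /^\s*(.+?[.!?])(?:\s|$)/' document.txt

    Drop the s/// and match up to the first newline instead if you literally want the first line rather than the first sentence.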

  • Why not provide us with the research you found? Maybe one of us would be willing to hack up a quick and dirty prototype based on it.
  • First off, I'm doubtful that there are any open-source programs that do this well, as it's a very difficult problem! It has to do with understanding a document, which computers really can't do.

    So I'd like to take a moment to point out a good resource for some existing summaries, at bookaminute [rinkworks.com].
  • by cabalamat2 ( 227849 ) on Thursday March 14, 2002 @07:43AM (#3161813) Homepage Journal

    I know it's not open source, but have you tried the Summarize feature in Microsoft Word? I fed it the entire contents of the GNU website [gnu.org] and it came back with:

    GNU is rubbish. Don't use the viral GPL! Bill is your friend. You love Bill. Microsoft software is the best.
  • Check out Alembic (Score:1, Interesting)

    by Anonymous Coward
    http://www.mitre.org/technology/alembic-workbench/
    Might do exactly what you want. You probably have to train it first but it works quite nicely.

    Mike
  • Summarisers (Score:2, Interesting)

    by Exeter Bun ( 566451 )
    I think one of the problems is that such a piece of software would be big business. I think I found something in the Natural Language Processing Software Registry: http://registry.dfki.de/ -- check under sections->written language->summarization. Another poster described systems that simply filter through relevant sentences; they're also sometimes known as abridgers. You might want to include that term in any keyword search you're doing.
  • by Bazman ( 4849 ) on Thursday March 14, 2002 @09:39AM (#3162024) Journal
    perl -ane 'foreach(@F){print $_." " if (rand()>.9)}'

    Try it on man pages:

    man awk | perl -ane 'foreach(@F){print $_." " if (rand()>.9)}'

    and it still makes sense! :)

  • I developed NetOwl Summarizer 1.5 [netowl.com] (way at the bottom), and there's a lot that needs to be done. You need to score enough documents, and need to have a good entity extraction mechanism (which NetOwl Extractor does) and you need a good on-line learning system. It's a lot of work, and even still, we don't get very good results, only good results. Microsoft's text summarizer does far worse, actually, but neither of us is perfect.
  • by f00zbll ( 526151 ) on Thursday March 14, 2002 @10:24AM (#3162191)
    I did some research into this for a pet project of my own. I wanted to write an application to crawl the web and get information. After a couple months of research, I realized how big of a problem it is.

    1. the application needs to be able to determine the relevance of the provided text
    2. to do so, it needs to determine the relative importance of the sentences and words
    3. it has to be able to compose new sentences to write a summary
    4. not all documents follow good structure or grammar
    5. how do you account for spelling/grammar mistakes?

    From my research, there appear to be two primary methods of performing this kind of processing:

    1. natural language parsing
    2. statistical parsing

    Of the two, statistical parsing is more popular these days because it doesn't require a knowledge base, expert system shells, grammar modeling, or an extensive dictionary. One of the primary methods of determining the relative importance of words in a sentence is valence. The main challenge with both natural language parsing and statistical techniques is that they depend on the training dataset. The more specific the dataset is, the better the system will perform.

    Statistical analysis can also use expert system shells and other AI technologies to improve accuracy, but it doesn't have to.

    From my understanding (which is limited), it stems from a principle from linguistics. By counting the frequency of words, or more specifically nouns, the program is able to rate each noun's importance. Once that's done, it can pick the sentence that best describes the document by comparing the most important words against where those words appear in the sentences. I remember this from my literature and linguistics classes. Cognitive science has also attempted to solve this problem, but it is very difficult.

    In either case, if you're dealing with well-structured documents, your best bet is to grab the first 3 paragraphs, assuming the author followed standard thesis/essay structure. If you're planning on summarizing news articles, it might not be that hard if the author followed the inverted pyramid, which many do not. One of the big tools of natural language parsing in the early days was Prolog; it is still used a lot in academic settings for natural language processing. Your best bet is to get an intern to read and summarize for you.

    • Systems I've seen are also capable of extracting noun phrases and verb phrases, and weighing the relevance of those. If 'the lazy sleeping dog' occurs a couple of times (especially across paragraph boundaries) it will score high as a candidate for an overall summary of the document. Ferreting out the parts of a verb phrase can be quite a bit more difficult, because they can be bigger, containing noun phrases and prepositional phrases, they can be nested, yada yada. You need a lexicon so the system can pick out a word and classify it as a noun, adjective, adverb, or verb, along with the hierarchical relationships they are allowed to have with one another. And after all that, you have to remember that in English (and a lot, but not all, other languages) word position is syntactically significant, and that there's more than one (syntactically relevant) way to say the same thing. Which is why it would be hard (a toy sketch of the noun-phrase counting follows below).

      But it wouldn't be impossible. There's a company in Canada that does software like this (in English, German, and French, I believe) called Nstein. I've seen a demo and it's very impressive.
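
      To make that concrete, here's a toy sketch of lexicon-based noun-phrase counting. The hand-built lexicon is a stand-in for the real part-of-speech dictionary (or tagger) a production system would need, and the chunk pattern (optional determiner, any adjectives, one or more nouns) is deliberately simplistic.

      #!/usr/bin/perl
      # Toy noun-phrase counter: tag words via a tiny hand-built lexicon,
      # chunk DET? ADJ* NOUN+, and count how often each phrase recurs.
      use strict;
      use warnings;

      my %lexicon = (
          the => 'DET', a => 'DET', lazy => 'ADJ', sleeping => 'ADJ',
          old => 'ADJ', dog => 'NOUN', house => 'NOUN', summary => 'NOUN',
      );

      my %np_count;
      $/ = '';                                   # read paragraph by paragraph
      while (my $para = <>) {
          my @words = map { lc } $para =~ /[A-Za-z']+/g;
          my @tags  = map { $lexicon{$_} // 'UNK' } @words;

          my $i = 0;
          while ($i < @words) {
              my $start = $i;
              $i++ if    ($tags[$i] // '') eq 'DET';
              $i++ while ($tags[$i] // '') eq 'ADJ';
              my $noun_start = $i;
              $i++ while ($tags[$i] // '') eq 'NOUN';
              if ($i > $noun_start) {
                  $np_count{ join ' ', @words[$start .. $i - 1] }++;
              } else {
                  $i = $start + 1;               # no noun phrase here; move on
              }
          }
      }

      # Phrases that recur are candidate key concepts for a summary.
      for my $np (sort { $np_count{$b} <=> $np_count{$a} } keys %np_count) {
          printf "%3d  %s\n", $np_count{$np}, $np;
      }

      Tracking whether a phrase recurs across paragraph boundaries, as the parent post suggests, would just mean keeping a per-paragraph seen-set alongside the counts; verb phrases and the word-order rules described above are where the genuinely hard part lives.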

  • Sherlock (Score:3, Informative)

    by maggard ( 5579 ) <michael@michaelmaggard.com> on Thursday March 14, 2002 @12:55PM (#3162901) Homepage Journal
    Apple's Sherlock [apple.com] application does this.

    It's not open source, but it is scriptable, comes at no additional cost, and is available on a Unix OS (Mac OS X). Indeed, through Apple's Open Scripting Architecture (OSA [apple.com]), one can use any number of scripting languages such as Python, Perl, and even JavaScript to interact with the application.

    Feed it a document, tell it to summarize and back will come a generally useful précis. For folks directly on a Mac (MacOS 8.6 or newer incl. X) simply highlight a document or portion of text and select "Summarize" from the contextual menu.

  • by Tom7 ( 102298 )
    It's a pretty hard problem; people are still actively researching this and the best results are only so-so.

    The best way to find code would be to e-mail the authors of the papers you've found. They probably have implementations, and academics are usually willing to share under something like a BSD license or the GPL.
