Journal Journal: Finding repeated phrases in MySQL text fields 4
Being that Slashdot is the biggest audience of computer geeks that I know, this should be the right place to ask a question that stumps me.
Some of you know that I am the owner/publisher/programmer of freeinternetpress.com. I was playing with the "tag cloud" idea, but it doesn't quite satisfy what I want.
I wrote a script that looks at the 100 most recent news stories, pulls all of the words from the text and subject, splits it on spaces (and other delimiting characters), and gives me a nice list of words by frequency on the page. The rough equivalent from the command line would be:
cat story.txt | sed -e s/\
It then shows them in tag cloud format, sized for frequency. Each word is linked to a script that finds the most recent story with that word in it, and send you directly to that.
Mine is all done with SQL queries and a little array magic in PHP, not shell commands, I swear.
What I can't quite figure out is, how do I do the same thing for phrases? If John Smith made the news, there may be plenty of people with the first name "John" making the news, so John may show up frequently. Smith may also show up with some sort of frequency (in an obscure world where there are only 4 common last names). But, if John Smith goes on a shooting rampage, it would be reasonable to think that "John Smith kills" would show up in multiple news stories. They may say "John Smith kills 14 in mystery rampage" or "John Smith kills coworkers at super spook spy shack". You never know what will come up, but it would be amazingly advantageous to have that phrase.
While I can't think that we'll cover every breaking news story, I can think that the hundreds of RSS feeds that we're aggregating would. If this was applied to the RSS feeds, we would then have a beautiful resource. Think Google News automated and unfiltered. Yes, Google News filters their news, and does adjust what is shown based on who it thinks are "good" sources, and some big news simply doesn't show up.
In thinking about this, I thought about the brute force method. Find every word, go back and find the word before and check that against the database. go back and find the word after and check against the database. Continue this to up to 5 word phrases.
On just our own 100 most recent stories, there are 19374 words. Of those, there are 6176 unique words. I run this against a "stopwords" table, so common words (like "and" "the" "or" "I" "he" "she", etc). We're using about 1000 stopwords. Even with this, there are 5676 unique words.
Does anyone have any suggestions?