Fighting Spam with DNA Sequencing Algorithms
Christopher Cashell writes "According to this article from NewScientist, IBM's Anti-Spam Filtering Research Project has started testing a new spam filtering algorithm, an algorithm originally designed for DNA sequence analysis. The algorithm has been named Chung-Kwei (after a feng-shui talisman that protects the home against evil spirits). Justin Mason, of SpamAssassin, is quoted as saying that it looks promising. A paper is available on the algorithm, too (PDF)."
Wordfilter (Score:3, Insightful)
Even with training, isn't this just some regexps searching for particular strings?
And what about short messages that don't use as many words: is the spam score relative or absolute? The article is a little low on details. Can anybody point to some more informative articles?
Mozilla Firefox (Score:2, Insightful)
With the nature of new spam messages that look like real emails, the only person who can really tell if something is spam is the recipient.
Misnomer, it's not "fighting spam"... (Score:1, Insightful)
Re:Mozilla Firefox (Score:3, Insightful)
the newest version has been doing better so far.
I think my problem is that my rate of email is quite low, so it's difficult to train. I'd like it if there could be a database where, if a subject header is reported as spam by one user, it affects other users' scoring.
Re:hm (Score:5, Insightful)
Of course. Spam is a moving target. Given that it is cheaper to create spam than to block spam, it will always be an uphill battle.
Lately, much of the spam I have been getting in my Inbox (squirrelmail/spamassassin) has been email that has no typos, no random text, no blatant "click here" lines and looks like normal mail. Except they are trying to sell me something.
Works until the Spammers get a copy of it (Score:5, Insightful)
For example, the article mentions the software accepts a message that is long but has a few "spammy" sequences. This suggests an immediate countermeasure of adding bulk to spam -- appending a copy of some news article to the spammy payload (some already do this).
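To make the padding countermeasure concrete, here is a minimal sketch comparing a length-relative score with an absolute one. The marker list and both scoring rules are hypothetical stand-ins chosen for illustration; nothing here is taken from how Chung-Kwei actually scores a message.

```python
# Hypothetical marker words and scoring rules, for illustration only.
# This is NOT the Chung-Kwei algorithm.
SPAMMY = {"viagra", "lottery", "winner"}


def relative_score(text: str) -> float:
    """Fraction of the words in the message that are marker words."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in SPAMMY)
    return hits / len(words)


def absolute_score(text: str) -> int:
    """Raw count of marker words, independent of message length."""
    return sum(1 for w in text.lower().split() if w in SPAMMY)


payload = "cheap viagra lottery winner"
padding = " ".join(["news"] * 96)  # stand-in for an appended news article
padded = payload + " " + padding

print(relative_score(payload))  # 0.75
print(relative_score(padded))   # 0.03
print(absolute_score(payload), absolute_score(padded))  # 3 3
```

Under these assumptions the relative score collapses once bulk is appended, while the absolute count is unchanged, which is why the relative-versus-absolute question matters.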
Personally, I've always thought that a simple spell check would do a good job as another layer of filtering. It would place spammers in a no-win situation -- either the keyword filter or the spell check filter would get them.
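A rough sketch of the spell-check layer, assuming a tiny hard-coded word list and an arbitrary 50% threshold (both hypothetical; a real filter would load a full dictionary and tune the cutoff):

```python
import re

# Tiny stand-in word list, assumed for illustration only.
KNOWN_WORDS = {"buy", "cheap", "pills", "now", "the", "meeting", "is", "at", "noon"}


def misspelled_fraction(text: str) -> float:
    """Fraction of alphabetic tokens that are not in the word list."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    unknown = sum(1 for w in words if w not in KNOWN_WORDS)
    return unknown / len(words)


def looks_obfuscated(text: str, threshold: float = 0.5) -> bool:
    """Flag a message when at least `threshold` of its words are unknown."""
    return misspelled_fraction(text) >= threshold


print(looks_obfuscated("buy chaep pilz nwo"))      # True
print(looks_obfuscated("the meeting is at noon"))  # False
```

A deliberately misspelled keyword slips past a keyword list but raises the unknown-word fraction; the obvious cost is that honest but sloppy writers get flagged too.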
Interesting... Electronic evolution... (Score:5, Insightful)
First, there's a constant tuning of both predator and prey (Anti-spam tools and spam).
Second, there seems to be some sort of equilibrium which is inevitably achieved, and
Third, there are occasional discreet major developments which change the game. This would be an example. Now, spam is going to be forced to majorly adapt.
I could see the 'Quality' of spam improving a lot as a result of tools like this. No more letters from my long lost benefactors in Nigeria, and no one-liners about 'Gushing like a firehose' (My coworkers and I got a good chuckle out of that one), but, as the story said, if you have keywords in a long email, it gets far less penalized. OK. Attach verses from Dante's Inferno, or Joyce's Dubliners to the email. Problem solved. You can't block words like viagra altogether or Pfizer researchers are going to have a hell of a time getting anything through.
Another concern is that if this forces spammers to make up new and compelling spam, people will be more likely to check it out. While my parents are probably pretty confident they didn't win a secret lottery 3 or 4 times last week, they might possibly believe new and creative stories.
Perhaps evolution of email readers is just plain going to be a necessary part of the solution...
Corrections... (Score:3, Insightful)
Re:Misnomer, it's not "fighting spam"... (Score:5, Insightful)
People have been improving filtering, and the spammers just pump up the volume. As filtering improves, the delivery rate goes down, but so does the complaint rate so they end up being able to pump more spam before they're detected.
I've been watching this arms race for almost a decade, and the advantage is still on the spammer's side. At the moment I'm blocking between 10,000 and 20,000 connections a day just on the basis of their IP address (including blocks against entire countries), another 3,000-5,000 using a greylist/honeypot app I'm working on, and I'm still getting one or two hundred messages per day hitting my procmailrc. A few years back, when I was getting a few hundred spams a day without all those RBLs and personal blacklists, people were all excited about how Bayesian filters were gonna make spam uneconomical... and I made the same comment back then. Now I'm filtering a couple of hundred times more efficiently and effectively and I'm still getting almost the same volume.
I don't see anything different this time. You can't fight spam with filters, all you can do is adapt to it.
Re:Feng Shui hardware (Score:2, Insightful)
Re:Works until the Spammers get a copy of it (Score:3, Insightful)
How so?
1) install software
2) treat as black box
3) spam spam spam
4) see what gets through
5) study, enhance
6) goto 3)
Just because you can't see how it works, doesn't mean you can't teach yourself how to get around it.
Re:Stop This B\/llsh!t Filtering Crap (Score:3, Insightful)
If, after reading the E-mail, you still don't know what product the spam is advertising, then the spammers are losing, since those E-mails will not lead to a sale, and the spammers are simply wasting their own bandwidth.
Re:This is all bull -- Change the law (Score:3, Insightful)
You'll be buying all your doors without locks from now on, I take it, since burglary is a social/legal problem and the government has passed laws against it. Let us know how that goes.
Re:Interesting... Electronic evolution... (Score:3, Insightful)
Absolutely. Unfortunately, as most predator-prey models will tell you, neither population ever goes to zero unless something catastrophic happens. And in this case, catastrophe is precisely what we want to happen to the prey.
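For the curious, the textbook Lotka-Volterra model shows the behavior described: both populations cycle but stay above zero. The sketch below uses a simple Euler step, and the parameter values, starting populations, and step size are arbitrary assumptions picked only for illustration.

```python
def lotka_volterra(prey0, pred0, alpha, beta, delta, gamma, dt, steps):
    """Euler integration of dx/dt = alpha*x - beta*x*y, dy/dt = delta*x*y - gamma*y."""
    prey, pred = [prey0], [pred0]
    x, y = prey0, pred0
    for _ in range(steps):
        # Compute both derivatives from the current state before updating.
        dx = (alpha * x - beta * x * y) * dt
        dy = (delta * x * y - gamma * y) * dt
        x, y = x + dx, y + dy
        prey.append(x)
        pred.append(y)
    return prey, pred


# Arbitrary illustrative parameters (assumed, not fitted to anything).
prey, pred = lotka_volterra(
    prey0=10.0, pred0=5.0,
    alpha=1.0, beta=0.1, delta=0.075, gamma=1.5,
    dt=0.001, steps=20000,
)
print(min(prey) > 0, min(pred) > 0)
```

In this idealized model neither side is ever driven extinct by the other alone, which matches the point about needing something catastrophic from outside the system.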
(If they'd simply implement my proposed scheme of a bullet to the head of every spammer, no mercy, no appeal, it'd be easy. But noooo, "spammers are human beings no matter how useless and harmful they are," waaaaah.)
Um. "Discrete" is the word you want. Spammers are anything but discreet. :-)
Re:Works until the Spammers get a copy of it (Score:2, Insightful)
Then 3/4 of slashdotters wouldn't be able to get their messages through to anybody
Serious methodological flaws (Score:4, Insightful)
The first of these calls their sensitivity result into question. If they classify their training data perfectly, then the 4.4% false negative rate they quote needs to be doubled to 8.8% -- almost one false negative in every eleven messages scanned.
The second of these calls their false positive rate into question: training with an unrealistically thorough set leads to better categorization, ceteris paribus. They need to show the trend with a variety of different training set sizes to support any claims about performance.
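The doubling can be checked with a few lines of arithmetic. This sketch assumes that half of the evaluation messages were also training messages and that none of those were missed; the one-half fraction is an assumption implied by the doubling, and the function name is hypothetical.

```python
def adjusted_fn_rate(reported_rate: float, seen_fraction: float) -> float:
    """False negative rate on unseen messages only.

    Assumes zero misses on the `seen_fraction` of the evaluation set that
    was also used for training, so every miss falls in the unseen remainder.
    """
    if not 0.0 <= seen_fraction < 1.0:
        raise ValueError("seen_fraction must be in [0, 1)")
    return reported_rate / (1.0 - seen_fraction)


rate = adjusted_fn_rate(0.044, 0.5)
print(f"{rate:.1%}")            # 8.8%
print(f"1 in {1 / rate:.1f}")   # 1 in 11.4
```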
This sounds like a fully buzzword compliant non-result to me.