Fighting Spam with DNA Sequencing Algorithms

Christopher Cashell writes "According to this article from NewScientist, IBM's Anti-Spam Filtering Research Project has started testing a new spam filtering algorithm, an algorithm originally designed for DNA sequence analysis. The algorithm has been named Chung-Kwei (after a feng-shui talisman that protects the home against evil spirits). Justin Mason, of SpamAssassin, is quoted as saying that it looks promising. A paper is available on the algorithm, too (PDF)."

  • Wordfilter (Score:3, Insightful)

    by bert.cl ( 787057 ) on Sunday August 22, 2004 @09:18AM (#10037267)
    While the numbers are impressive, this just looks like a filter that does combined wordsearches?

    Even with training, isn't this just some regexps searching for particular strings?

    And what about short messages that don't use as many words: is the spam score relative or absolute? The article is a little low on details; can anybody point to some more informative articles?

  • Mozilla Firefox (Score:2, Insightful)

    by nycsubway ( 79012 ) on Sunday August 22, 2004 @09:21AM (#10037275) Homepage
    I have to say the adaptive spam filter in Firefox works pretty darn well. I have tried other adaptive spam filters as plugins in Outlook and they work pretty darn well too.

    With the nature of new spam messages that look like real emails, the only person who can really tell if something is spam is the recipient.

  • This isn't "fighting spam", it's "adapting to spam".
  • Re:Mozilla Firefox (Score:3, Insightful)

    by rokzy ( 687636 ) on Sunday August 22, 2004 @09:30AM (#10037306)
    I've had mixed results with Thunderbird. in the beginning it seemed to work great, then I noticed it was junking all my legitimate email too. then I fixed that but it started letting through blatantly obvious stuff.

    the newest version has been doing better so far.

    I think my problem is my rate of email is quite low so it's difficult to train. I'd like it if there could be a database where if a subject header is reported as spam by one user it affects other users' scoring.
  • Re:hm (Score:5, Insightful)

    by Pigbot ( 797016 ) on Sunday August 22, 2004 @09:35AM (#10037330)
    wonder what the spammers will come up with to get around this...

    Of course. Spam is a moving target. Given that it is cheaper to create spam than to block spam, it will always be an uphill battle.

    Lately, much of the spam I have been getting in my Inbox (squirrelmail/spamassassin) has been email that has no typos, no random text, no blatant "click here" lines and looks like normal mail. Except they are trying to sell me something.
  • by G4from128k ( 686170 ) on Sunday August 22, 2004 @09:53AM (#10037389)
    This is interesting and promising technology. But like all antispam techniques, spammers will find a way around it. Once spammers get a copy of the software, they can create and test countermeasures in the comfort of their own sleazy lairs.

    For example, the article mentions the software accepts a message that is long but has a few "spammy" sequences. This suggests an immediate countermeasure of adding bulk to spam -- appending a copy of some news article to the spammy payload (some already do this).

    Personally, I've always thought that a simple spell check would do a good job as another layer of filtering. It would place spammers in a no-win situation -- either the keyword filter or the spell check filter would get them.
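
To make the padding countermeasure described above concrete, here is a minimal Python sketch. It is not Chung-Kwei's actual scoring, which the article does not detail. The pattern list and both scoring functions (`absolute_score`, `relative_score`) are hypothetical names invented for illustration. The point is only that a score normalized by message length drops when harmless text is appended, while a raw hit count does not.

```python
import re

# Hypothetical patterns for illustration only; not taken from Chung-Kwei.
SPAM_PATTERNS = [r"v[i1]agra", r"click here", r"free money"]


def absolute_score(text: str) -> int:
    """Count pattern hits, ignoring how long the message is."""
    lowered = text.lower()
    return sum(len(re.findall(p, lowered)) for p in SPAM_PATTERNS)


def relative_score(text: str) -> float:
    """Fraction of the message's characters covered by pattern hits."""
    lowered = text.lower()
    if not lowered:
        return 0.0
    covered = sum(
        len(m.group(0))
        for p in SPAM_PATTERNS
        for m in re.finditer(p, lowered)
    )
    return covered / len(lowered)


# A short spammy payload, and the same payload padded with innocuous prose.
payload = "Click here for free money and V1agra!"
padding = " It was the best of times, it was the worst of times." * 20
padded = payload + padding

for label, message in (("payload", payload), ("padded", padded)):
    print(label, absolute_score(message), round(relative_score(message), 3))
```

Under these assumptions the hit count is identical for both messages, but the length-normalized score of the padded message is far lower. A filter that thresholds on the relative score would therefore let the padded copy through, which is the weakness the comment points at.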
  • by dnaboy ( 569188 ) on Sunday August 22, 2004 @10:09AM (#10037443)
    I think it's really interesting to watch the literal evolution of spam and spam filters. There are really amazing parallels to biological evolution.

    First, there's a constant tuning of both predator and prey (Anti-spam tools and spam).

    Second, there seems to be some sort of equilibrium which is inevitably achieved, and

    Third, there are occasional discreet major developments which change the game. This would be an example. Now, spam is going to be forced to majorly adapt.

    I could see the 'Quality' of spam improving a lot as a result of tools like this. No more letters from my long lost benefactors in Nigeria, and no one liners about 'Gushing like a firehose' (My coworkers and I got a good chuckle out of that one), but, as the story said, if you have keywords in a long email, it gets far less penalized. OK. Attach verses from Dante's Inferno, or Joyce's Dubliners to the email. Problem solved. You can't block words like Viagra altogether or Pfizer researchers are going to have a hell of a time getting anything through.

    Another concern is that if this forces spammers to make up new and compelling spam, people will be more likely to check it out. While my parents are probably pretty confident they didn't win a secret lottery 3 or 4 times last week, they might possibly believe new and creative stories.

    Perhaps evolution of email readers is just plain going to be a necessary part of the solution...

  • Corrections... (Score:3, Insightful)

    by littlewild ( 733743 ) on Sunday August 22, 2004 @10:26AM (#10037498) Homepage
    Chung-Kwei is a Chinese semi-deity that wards off evil. He isn't some kind of talisman.
  • As more and more people begin to use spam filtering (especially on the server level), spam's effectiveness will decrease.

    People have been improving filtering, and the spammers just pump up the volume. As filtering improves, the delivery rate goes down, but so does the complaint rate so they end up being able to pump more spam before they're detected.

    I've been watching this arms race for almost a decade, and the advantage is still on the spammer's side. At the moment I'm blocking between 10,000 and 20,000 connections a day just on the basis of their IP address (including blocks against entire countries), another 3-5,000 using a greylist/honeypot app I'm working on, and I'm still getting one or two hundred messages per day hitting my procmailrc. A few years back, when I was getting a few hundred spams a day without all those RBLs and personal blacklists, people were all excited about how bayesian filters were gonna make spam uneconomical... and I made the same comment back then. Now I'm filtering a couple of hundred times more efficiently and effectively and I'm still getting almost the same volume.

    I don't see anything different this time. You can't fight spam with filters, all you can do is adapt to it.
  • by DNS-and-BIND ( 461968 ) on Sunday August 22, 2004 @10:40AM (#10037542) Homepage
    It's hardly appropriate that such superstition should be given encouragement in this day and age. Penn & Teller did a great bit on "feng shui" on their show, "Bullshit!". They had 3 different feng shui consultants come in to a house, and each one recommended different changes for different reasons. Some discipline.
  • by Tim C ( 15259 ) on Sunday August 22, 2004 @10:54AM (#10037586)
    in theory, closed-source software that isn't available for free download and in open-source version should be more effective against spam.

    How so?

    1) install software
    2) treat as black box
    3) spam spam spam
    4) see what gets through
    5) study, enhance
    6) goto 3)

    Just because you can't see how it works, doesn't mean you can't teach yourself how to get around it.
  • by mikael ( 484 ) on Sunday August 22, 2004 @11:50AM (#10037827)
    Hell, spam has gotten so sophisticated that sometimes even after reading the whole message I still don't know if the e-mail is a legitimate one from my bank, stock broker, etc.

    If after reading the E-mail, you still don't know what product the spam is advertising, then the spammers are losing, since those E-mails will not lead to a sale, and the spammers are simply wasting their own bandwidth.
  • by koreth ( 409849 ) on Sunday August 22, 2004 @02:54PM (#10038786)
    This isn't going to work -- you simply can't solve a social / legal problem with technology.

    You'll be buying all your doors without locks from now on, I take it, since burglary is a social/legal problem and the government has passed laws against it. Let us know how that goes.

  • by devphil ( 51341 ) on Sunday August 22, 2004 @04:56PM (#10039365) Homepage


    First, there's a constant tuning of both predator and prey

    Absolutely. Unfortunately, as most predator-prey models will tell you, neither population ever goes to zero unless something catastrophic happens. And in this case, catastrophe is precisely what we want to happen to the prey.

    (If they'd simply implement my proposed scheme of a bullet to the head of every spammer, no mercy, no appeal, it'd be easy. But noooo, "spammers are human beings no matter how useless and harmful they are," waaaaah.)

    there are occasional discreet major developments

    Um. "Discrete" is the word you want. Spammers are anything but discreet. :-)
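
The claim above about predator-prey models can be illustrated with the classic Lotka-Volterra equations. The sketch below is a toy forward-Euler simulation in Python. The function name `simulate`, the parameter values, and the starting populations are all assumptions chosen for illustration and are not fitted to any spam data. Following the analogy in the thread, "prey" stands for spam and "predators" for anti-spam tools.

```python
def simulate(prey, predators, steps=20000, dt=0.001,
             a=1.0, b=0.1, c=1.5, d=0.075):
    """Forward-Euler integration of the Lotka-Volterra equations.

    d(prey)/dt      = a * prey - b * prey * predators
    d(predators)/dt = -c * predators + d * prey * predators

    All parameter values are illustrative assumptions.
    """
    history = []
    for _ in range(steps):
        # Evaluate both growth rates from the current state, then step.
        prey_rate = (a - b * predators) * prey
        predator_rate = (-c + d * prey) * predators
        prey += prey_rate * dt
        predators += predator_rate * dt
        history.append((prey, predators))
    return history


history = simulate(prey=10.0, predators=5.0)
prey_values = [p for p, _ in history]
predator_values = [q for _, q in history]

print("prey range:", min(prey_values), max(prey_values))
print("predator range:", min(predator_values), max(predator_values))
```

In this model each population is multiplied by a growth factor at every step, so both can shrink toward zero but neither reaches it; they cycle instead. That is the behavior the comment describes: without some outside shock, the model gives no path to eliminating the prey.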

  • by Tablizer ( 95088 ) on Sunday August 22, 2004 @06:12PM (#10039672) Journal
    Personally, I've always thought that a simple spell check would do a good job as another layer of filtering.

    Then 3/4 of slashdotters wouldn't be able to get their messages through to anybody :-)
  • by YU Nicks NE Way ( 129084 ) on Sunday August 22, 2004 @08:56PM (#10040626)
    It sounds like a great paper until you get down into the guts of their materials and methods. They trained their system on half of their total data, and did not then test on separate data. That captures the two classic no-nos of data driven techniques: they inflate their results by including their training data in the results, and, worse, their training data comprises a larger sample of their total data than would be seen in the real world.

    The first of these calls their sensitivity result into question. If they classify their training data perfectly, then the 4.4% false negative rate they quote needs to be doubled to 8.8% -- almost one false negative in every eleven messages scanned.

    The second of these calls their false positive rate into question: training with an unrealistically thorough set leads to better categorization, ceteris paribus. They need to show the trend with a variety of different training set sizes to support any claims about performance.

    This sounds like a fully buzzword compliant non-result to me.
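
The doubling from 4.4% to 8.8% in the comment above follows from simple arithmetic, sketched here in Python. The sketch takes the commenter's reading of the paper at face value (half the data used for training, training data classified perfectly) and further assumes that spam is split evenly between the two halves. The function name `held_out_fn_rate` is a hypothetical helper, not something from the paper.

```python
def held_out_fn_rate(reported_rate, train_fraction, train_fn_rate=0.0):
    """Back out the false negative rate on held-out data.

    Assumes the reported rate is a weighted average over both parts:
        reported = train_fraction * train_fn_rate
                 + (1 - train_fraction) * held_out_rate
    """
    test_fraction = 1.0 - train_fraction
    if test_fraction <= 0:
        raise ValueError("no held-out data to attribute errors to")
    return (reported_rate - train_fraction * train_fn_rate) / test_fraction


# Commenter's scenario: 4.4% reported, half the data used for training,
# and (by assumption) zero false negatives on the training half.
rate = held_out_fn_rate(reported_rate=0.044, train_fraction=0.5)
print(f"held-out false negative rate: {rate:.1%}")
print(f"roughly one miss in every {1 / rate:.1f} spam messages")
```

If the training half contributes no misses, every reported false negative must come from the other half, so the rate on unseen data is twice the quoted figure.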
