
Fighting Spam with DNA Sequencing Algorithms

Christopher Cashell writes "According to this article from NewScientist, IBM's Anti-Spam Filtering Research Project has started testing a new spam filtering algorithm, an algorithm originally designed for DNA sequence analysis. The algorithm has been named Chung-Kwei (after a feng-shui talisman that protects the home against evil spirits). Justin Mason, of SpamAssassin, is quoted as saying that it looks promising. A paper is available on the algorithm, too (PDF)."
This discussion has been archived. No new comments can be posted.

  • hm (Score:0, Interesting)

    by Anonymous Coward on Sunday August 22, 2004 @09:12AM (#10037240)
    wonder what the spammers will come up with to get around this...
  • High tech for what ? (Score:3, Interesting)

    by Ozh ( 514694 ) on Sunday August 22, 2004 @09:21AM (#10037276)
    Funny how some people develop more and more sophisticated stuff to fight something as simple as sending out emails to random addresses... and so simple that it will never stop :/
  • Re:Wordfilter (Score:3, Interesting)

    by rokzy ( 687636 ) on Sunday August 22, 2004 @09:23AM (#10037283)
    91% detection is far from impressive. AFAIK the better filters today are 99.9% successful. the benefit of this one is its low false-positive rate.

    personally I'd prefer a much better set of filter tools e.g. being able to say "I only speak English, I NEVER use this account for commerce, and the people I email are professionals so score spelling mistakes much higher as probable spam".

    can someone point me in the direction of such a filter?
  • Re:Mozilla Firefox (Score:2, Interesting)

    by danharan ( 714822 ) on Sunday August 22, 2004 @09:33AM (#10037321) Journal
    I think you mean Thunderbird.

    My experience with it has been rather disappointing. Why I need to tag as spam two messages from the same sender or with the exact same subject is a mystery to me. After the 10th "Make $\d+ in XX days" type message one has to wonder just how effective this thing is.

    This method is promising because it uses spell-checking and a better way to identify spammy string sequences, something neither of the two main camps of spam filters has seemed keen to do until now.
  • by Rahga ( 13479 ) on Sunday August 22, 2004 @09:46AM (#10037366) Journal
    It looks like much of the spam I'm receiving today consists of either nearly blank e-mails or e-mails containing news articles that seem designed to pass through content filters just so users will send them back to their admins as spam, essentially making it easier for Bayesian filters and such to mark legitimate e-mail as spam... though honestly, it's more of an annoyance for me, as it makes it easier for users to say "The spam filter isn't working, what are you doing wrong?"
  • Wrong title, I guess (Score:5, Interesting)

    by stm2 ( 141831 ) <sbassi@genes d i g i t a l e s .com> on Sunday August 22, 2004 @09:47AM (#10037368) Homepage Journal
    According to the /. title, it seems they used an algorithm for DNA sequencing, when in fact they used an algorithm for DNA analysis (or DNA sequence analysis, which is the same thing), more specifically, gene-finding techniques. As you may know, most DNA in a genome is not translated into protein (some people still call it junk, but most of it is not junk at all). So there are programs to sort genes out from the rest of the DNA.
    I think we will see more and more applications like this with the growing cross-pollination between Biology and CS.

  • by Donny Smith ( 567043 ) on Sunday August 22, 2004 @10:01AM (#10037416)
    Good point - that's why, in theory, closed-source software that isn't available for free download or in an open-source version should be more effective against spam.

    Spell checker as anti-spam filter - that would create huge problems for most Americans :-)
    Otherwise it's a good idea.
  • Re:Mozilla Firefox (Score:3, Interesting)

    by littlem ( 807099 ) on Sunday August 22, 2004 @10:21AM (#10037482)
    My experience with it has been rather disappointing. Why I need to tag as spam two messages from the same sender or with the exact same subject is a mystery to me. After the 10th "Make $\d+ in XX days" type message one has to wonder just how effective this thing is.

    This shouldn't be all that surprising - Bayesian filtering is all based on probabilities. The reason "Outlook message rules" is so bad is because a friend of mine might send me a joke about Viagra, which I don't want to have deleted indiscriminately as spam. False positives are infinitely more annoying than false negatives, so I'd much rather have conservative filtering that let a bit of spam through.

    I'm not saying Bayesian algorithms are perfect yet (though they'll improve) - my personal experience has been SpamAssassin, which got 97% of spam, and I've been experimenting with Thunderbird for a week, which gets 85%-90% and will no doubt get much much better as I train it in the next couple of weeks - but ultimately Bayesian filtering is enough to beat enough spam to make spamming not worthwhile (if everyone did it...)
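
    To make the "all based on probabilities" point concrete, here is a minimal sketch of a naive Bayes-style word scorer. It is not how SpamAssassin or Thunderbird actually implement their filters; the function names, the add-one smoothing, and the tiny training corpus are all assumptions made for illustration. It shows why one spammy word in a friend's joke need not condemn the whole message: the other words pull the odds back toward legitimate mail.

```python
import math
from collections import Counter


def train(messages):
    """Count, per class, how many messages contain each word.

    `messages` is an iterable of (text, is_spam) pairs (hypothetical format).
    """
    words = {True: Counter(), False: Counter()}
    docs = Counter()
    for text, is_spam in messages:
        docs[is_spam] += 1
        # set(): a word counts once per message, however often it repeats
        words[is_spam].update(set(text.lower().split()))
    return words, docs


def spam_probability(text, words, docs):
    """Combine per-word evidence into P(spam | text) using log-odds.

    Assumes at least one spam and one non-spam message were seen in training.
    """
    vocab = set(words[True]) | set(words[False])
    log_odds = math.log(docs[True] / docs[False])  # prior odds
    for w in set(text.lower().split()):
        if w not in vocab:
            continue  # never-seen words carry no evidence either way
        # Add-one smoothing so a word absent from one class is not fatal
        p_spam = (words[True][w] + 1) / (docs[True] + 2)
        p_ham = (words[False][w] + 1) / (docs[False] + 2)
        log_odds += math.log(p_spam / p_ham)
    return 1 / (1 + math.exp(-log_odds))


corpus = [
    ("buy viagra now", True),
    ("cheap viagra pills", True),
    ("make money fast", True),
    ("funny viagra joke from bob", False),
    ("meeting notes attached", False),
    ("lunch with bob tomorrow", False),
]
words, docs = train(corpus)
joke = spam_probability("viagra joke from bob", words, docs)
pitch = spam_probability("buy cheap viagra", words, docs)
```

    With this toy corpus the joke scores well below 0.5 and the sales pitch well above it. Picking a threshold above 0.5 is one way to get the conservative filtering described above, trading a few false negatives for fewer false positives.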

  • They'll.. (Score:3, Interesting)

    by aussie_a ( 778472 ) on Sunday August 22, 2004 @10:24AM (#10037492) Journal
    To get around this spammers will use DNA algorithms to create spam that gets around the blockers ;)
  • by syrinje ( 781614 ) on Sunday August 22, 2004 @10:29AM (#10037510)
    Congratulations /.

    By now, all the patent-trollster-lurkers who passively phish in the /. pool must be rushing with suitably edited claims to their friendly neighborhood USPTO.

    Can anyone who works in IP (intellectual property, NOT Internet Protocol) post a list of known trollster companies that are full of lawyers who acquire patents (by any means) and make patent litigation their primary business model?
  • Re:Mozilla Firefox (Score:3, Interesting)

    by toxic666 ( 529648 ) on Sunday August 22, 2004 @10:39AM (#10037539)
    "I" being the key word in your assessment. Fine for the home user, not so good for a business.

    Maintaining an enterprise mail system based upon user-controlled spam filtering software is not practical. That small percentage of users with consistent ID 10T errors adds up fast. Try correcting false positives for a user-configured filter. It's time-consuming.

    The better approach from an administrative standpoint is controlling spam at the MTA- and MDA- levels of the mail server. I use postfix checks with MDA-level Bayesian filtering with reasonable success. The spam mbox is comprised of user-submitted and administratively approved mail. The user submits it, and the admin checks for things like filter poisoning text before moving it to the real spam mbox.

    Most importantly, my false-positive rate is extremely low -- probably tens of thousandths of a percent.
  • by slashname3 ( 739398 ) on Sunday August 22, 2004 @11:26AM (#10037697)
    This will make another nice tool to identify spam. But why not use greylisting at all the ISPs' MTAs to simply refuse 99% of the spam that is being sent right now?

    Seriously, greylisting implemented on all the ISPs' MTAs would overnight block 99% of the spam being sent. Most spam at the moment is being sent from armies of bots run on unsuspecting users' systems connected to cable and DSL service. The programs used are unsophisticated; they churn through a list of addresses, spewing messages out by the thousands. They do not queue messages or retry them if they get an error. Greylisting uses this to great effect and blocks spam while letting legitimate MTAs deliver messages.

    True, it is not 100% effective; a small number of spam messages get through, since some spam goes through legitimate MTAs and the message is retried. But once you remove the bulk of the spam, those can be tracked down and shut down or blocked at the firewalls.

    If the ISPs would implement this, spam would become a non-issue overnight. Email would once again become a mostly useful tool. But I guess the problem is that the ISPs have no vested interest in solving this problem. None of them will listen or implement this simple solution, which does not block any legitimate email. With 70% of the email on the network being spam (the number may be higher than that at this time), I would think they would jump at a solution that would reduce the loads on their servers. But I guess they make too much money from spammers to implement such a simple solution.
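
    The greylisting idea described in this comment can be sketched in a few lines. This is a toy model under stated assumptions, not the configuration of any real MTA: the `Greylist` class, the 300-second delay, and the choice of (client IP, sender, recipient) as the key are all illustrative. An unknown triplet gets a temporary SMTP failure; a legitimate MTA queues the message and retries later, while a fire-and-forget bot never comes back.

```python
import time


class Greylist:
    """Toy greylist keyed on the (client IP, sender, recipient) triplet."""

    def __init__(self, min_delay=300, now=time.time):
        self.min_delay = min_delay  # seconds a new triplet must wait (assumed value)
        self.now = now              # injectable clock, handy for testing
        self.first_seen = {}        # triplet -> time of first delivery attempt

    def check(self, client_ip, sender, recipient):
        """Return an SMTP-style reply: 451 (try again later) or 250 (accept)."""
        triplet = (client_ip, sender.lower(), recipient.lower())
        t = self.now()
        # Remember the first attempt; later attempts reuse that timestamp
        first = self.first_seen.setdefault(triplet, t)
        if t - first < self.min_delay:
            return "451 4.7.1 Greylisted, please retry later"
        return "250 OK"


# Simulated clock so the example does not actually wait five minutes
clock = [0.0]
greylist = Greylist(min_delay=300, now=lambda: clock[0])

first_try = greylist.check("192.0.2.1", "alice@example.com", "bob@example.org")
clock[0] = 100.0
too_soon = greylist.check("192.0.2.1", "alice@example.com", "bob@example.org")
clock[0] = 301.0
retry = greylist.check("192.0.2.1", "alice@example.com", "bob@example.org")
```

    A production setup would also expire old entries and whitelist known-good senders. Spam relayed through a retrying MTA still passes, as the comment notes.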
  • by Hao Wu ( 652581 ) on Sunday August 22, 2004 @12:09PM (#10037909) Homepage
    Funny how some people develop more and more sophisticated stuff to fight something as simple as sending out emails to random addresses...

    This is just like your own immune system, which uses such things as "V-D-J" recombination (and other tricks) to create billions of somewhat random, different epitopes to attack potential unknown pathogens. Those cells must then be further educated not to attack "self" in your own body.

    If only computer geeks took some lessons from biologists, perhaps they could get a grip on principles to stop SPAM.

  • by mcrbids ( 148650 ) on Sunday August 22, 2004 @12:16PM (#10037933) Journal
    It's my belief that the most likely source of the birth of Artificial Intelligence will be the SPAM filter.

    Think about it - we now have software that "learns" what you like. [nuclearelephant.com]

    Sorry, but anything that "learns" fits a definition of intelligence - using past results to predict future outcomes. Note that I'm not saying "self aware" or "conscious", simply "intelligence".

    As we move forward, we'll see more and more intelligence on the part of the spammers, and the warring factions of intelligence will likely provide massive financial and political impetus to build ever more intelligent solutions - thus AI is born.

    The problem with other vehicles for developing AI is simply the budget. With SPAM, everybody has a direct, financial incentive to develop it, so development will definitely happen!

  • by Ungrounded Lightning ( 62228 ) on Sunday August 22, 2004 @01:54PM (#10038463) Journal
    That should work for virus and worm detection, too!

    Even more so, since viruses are much more a compilation of a set of previous constructions with a few mods than a new composition not necessarily based on the wording of old scams.

    And viruses and worms (especially worms) are more constrained by their environment, requiring an exploit of a vulnerability and the installation of work-doing code. Though gene-shuffling techniques might be able to bury much of the code, the basic exploit must continue to be some sort of match to the vulnerability's "receptor".
  • Re:hm (Score:3, Interesting)

    by ca1v1n ( 135902 ) <snook.guanotronic@com> on Sunday August 22, 2004 @02:16PM (#10038554)
    The great thing about the similarity matching algorithms is that they read with noise filtering the same way that humans do. They also allow for like-character matching without any added computational overhead. This means that you can make a table of Unicode characters that are similar to certain ASCII characters that gets incorporated into the similarity matrix. By the power of these properties combined, your spam filter can recognize that c;al_is is intended to look like cialis, without a lot of expensive extra computations.

    Now that we've neutralized that form of message garbling, we're left to dealing with bayes filter poisoning. This is something that entropy-based filtering deals with quite well.

    All spam filtering techniques have weaknesses, but if you use a few different methods in concert, preferably within the same package to spare the poor user from having to set up a whole lot, you can get just about all of it.

    Even using a few of these different methods together, I still get a few ads from companies I've done business with that have screwed up my communication preferences. This sucks, but most of these companies are clueless rather than malicious. Threatening to take my business elsewhere has never failed to correct these problems.
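
    The like-character matching from the first paragraph of this comment can be illustrated with a small local-alignment scorer. To be clear, this is a Smith-Waterman-style sketch chosen for illustration, not the Chung-Kwei algorithm from the article; the `LOOKALIKES` table, the scores, and the gap penalty are all assumed values. The point it shows is that the lookalike table lives inside the substitution score, so tolerating c;al_is costs a dictionary lookup rather than an extra pass over the message.

```python
# Hypothetical lookalike table: each obfuscating character maps to the
# letter it is meant to resemble. "\u0456" is the Cyrillic letter that
# looks like a Latin "i".
LOOKALIKES = {";": "i", "!": "i", "1": "l", "0": "o", "@": "a", "$": "s", "\u0456": "i"}

MATCH, LOOKALIKE, MISMATCH, GAP = 2, 1, -1, -1  # assumed scoring values


def sub_score(a, b):
    """Similarity-matrix entry for aligning character a with character b."""
    a, b = a.lower(), b.lower()
    if a == b:
        return MATCH
    if LOOKALIKES.get(a) == b or LOOKALIKES.get(b) == a:
        return LOOKALIKE
    return MISMATCH


def local_alignment_score(pattern, text):
    """Best local alignment score of pattern against any region of text."""
    prev = [0] * (len(text) + 1)
    best = 0
    for p in pattern:
        cur = [0]
        for j, t in enumerate(text, 1):
            score = max(
                0,                             # start a fresh alignment here
                prev[j - 1] + sub_score(p, t),  # align p with t
                prev[j] + GAP,                  # skip a pattern character
                cur[j - 1] + GAP,               # skip a text character (noise)
            )
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


obfuscated = local_alignment_score("cialis", "c;al_is")
clean = local_alignment_score("cialis", "cialis")
unrelated = local_alignment_score("cialis", "hello world")
```

    Here the obfuscated spelling scores close to the clean one and far above the unrelated text, so a threshold on the score relative to `MATCH * len(pattern)` would flag it. The running time is proportional to the pattern length times the text length, with or without the lookalike table.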
  • by Anonymous Coward on Sunday August 22, 2004 @06:05PM (#10039639)
    You are 40 years behind the times. While it's chic to filter your spam using naive Bayesian text classifiers, don't kid yourself. Machine learning and text classification have been around since the 1960s.
