Fighting Spam with DNA Sequencing Algorithms 142

Christopher Cashell writes "According to this article from NewScientist, IBM's Anti-Spam Filtering Research Project has started testing a new spam filtering algorithm, an algorithm originally designed for DNA sequence analysis. The algorithm has been named Chung-Kwei (after a feng-shui talisman that protects the home against evil spirits). Justin Mason, of SpamAssassin, is quoted as saying that it looks promising. A paper is available on the algorithm, too (PDF)."
This discussion has been archived. No new comments can be posted.

  • Thunderbird (Score:2, Informative)

    by bert.cl ( 787057 ) on Sunday August 22, 2004 @09:23AM (#10037282)
    I think you mean Mozilla Thunderbird?
  • Re:Wordfilter (Score:4, Informative)

    by Incadenza ( 560402 ) on Sunday August 22, 2004 @09:33AM (#10037320)

    Personally, I'd prefer a much better set of filter tools, e.g. being able to say "I only speak English, I NEVER use this account for commerce, and the people I email are professionals, so score spelling mistakes much higher as probable spam".

    Can someone point me in the direction of such a filter?

    How about SpamAssassin?
    Just add the following to /etc/mail/spamassassin/local.cf:

    ok_languages en

    And increase the score for BIZ_TLD and other tests you find more important than others. Scoring per test is fully configurable, complete list of tests here [apache.org].
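    Putting the two suggestions together, a minimal `local.cf` along these lines would do it (the score values here are illustrative, not recommendations; `BIZ_TLD` is the test mentioned above, and `required_score` is SpamAssassin's overall spam threshold):

    ```
    # /etc/mail/spamassassin/local.cf
    ok_languages en       # tag mail written in any other language
    ok_locales en         # likewise for non-English character-set locales
    score BIZ_TLD 3.0     # raise the weight of the .biz-TLD test
    required_score 5.0    # total score at which mail is tagged as spam
    ```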

  • Love SA... (Score:5, Informative)

    by ajs ( 35943 ) <{ajs} {at} {ajs.com}> on Sunday August 22, 2004 @09:41AM (#10037352) Homepage Journal
    You have to love SpamAssassin for its very Perlish approach to spam filtering... "hey, there's a cool new way to filter spam... throw it in!"

    I love this mostly because it means that SA is a moving target. Spammers can figure out how to defeat pieces of it, but it deploys a wide range of static, dynamic, network-based and user-driven tests that change so much that spammers simply can't afford to keep up.
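    The "wide range of tests" approach described above is, at its core, a weighted rule engine: every test that fires contributes its score, and the sum is compared against a threshold. A minimal sketch, with made-up rule names, patterns, and weights (these are not actual SpamAssassin tests):

    ```python
    import re

    # Hypothetical rules (name, pattern, score) -- illustrative only.
    # SpamAssassin ships hundreds of these, mixing static, dynamic,
    # network-based and user-driven checks.
    RULES = [
        ("SHOUTY_SUBJECT",  re.compile(r"^Subject: [A-Z !]{10,}$", re.M), 1.2),
        ("MENTIONS_VIAGRA", re.compile(r"viagra", re.I),                  2.5),
        ("CLICK_HERE",      re.compile(r"click here", re.I),              1.0),
    ]
    REQUIRED_SCORE = 3.0  # analogous to SpamAssassin's required_score

    def score_message(text):
        """Sum the scores of every rule whose pattern fires on the message."""
        return sum(s for _, pat, s in RULES if pat.search(text))

    def is_spam(text):
        return score_message(text) >= REQUIRED_SCORE
    ```

    The moving-target property falls out of the data-driven design: adding a new detection technique is just appending another entry to the rule list and letting the score combination absorb it.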
  • Re:hm (Score:2, Informative)

    by Proud like a god ( 656928 ) on Sunday August 22, 2004 @09:45AM (#10037361) Homepage
    Lately, much of the spam I have been getting in my Inbox (SquirrelMail/SpamAssassin) has been email that has no typos, no random text, no blatant "click here" lines, and looks like normal mail. Except they are trying to sell me something.

    You lucky g*t! :-P
  • by BJH ( 11355 ) on Sunday August 22, 2004 @10:02AM (#10037420)
    If I'm not mistaken, Chung Kwei is the figure known as Shouki in Japanese. He's usually described in English as the "Demon Queller", which seems a suitable-enough symbol for an anti-spam program.

    I mean, come on - don't anti-spam programs have the coolest names? SpamAssassin, Vipul's Razor...
  • by gvc ( 167165 ) on Sunday August 22, 2004 @10:18AM (#10037471)
    Notwithstanding accepted wisdom espoused above, random words cannot defeat current statistical spam filters, and it is difficult to defeat such filters even if you have access to the algorithm and the recipient's mailbox.

    John Graham-Cumming presented a talk, "Beating Bayesian Filters," at the 2004 Spam Conference [spamconference.org] detailing these results. A video recording is available; alas, no paper.

    In conducting a recent spam filter evaluation [uwaterloo.ca] I observed (but did not report) that the statistical filter attacks were not particularly effective. The only attack that worked sometimes was to make the entire body of the message a current news item or joke, with only a URL linking to the spam payload.
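    One way to see why random-word padding fails against statistical filters is to look at a Paul Graham-style Bayesian combination, which keeps only the tokens whose spam probability is furthest from neutral (0.5) before combining them. Neutral padding tokens multiply the spam and ham products equally and cancel out. A sketch with made-up token probabilities:

    ```python
    def combined_probability(token_probs, n_extreme=15):
        """Paul Graham-style combination: keep the n_extreme tokens
        furthest from 0.5, then combine them with the naive-Bayes
        product rule."""
        extreme = sorted(token_probs, key=lambda p: abs(p - 0.5),
                         reverse=True)[:n_extreme]
        prod_spam, prod_ham = 1.0, 1.0
        for p in extreme:
            prod_spam *= p
            prod_ham *= 1.0 - p
        return prod_spam / (prod_spam + prod_ham)

    spam_tokens = [0.99, 0.97, 0.95, 0.92]  # e.g. "viagra", "unsubscribe", ...
    padding = [0.5] * 200                   # a word-salad of neutral tokens

    p_plain = combined_probability(spam_tokens)
    p_padded = combined_probability(spam_tokens + padding)
    ```

    Because every 0.5-probability token scales both products by the same factor, `p_padded` equals `p_plain` exactly: the word salad changes nothing.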

  • Re:hm (Score:3, Informative)

    by great_snoopy ( 736076 ) on Sunday August 22, 2004 @10:20AM (#10037479)
    In fact, they did. The last spams I received were composed of two parts: the spammy part, and a longer part that is usually a news paragraph taken from a public news site like news.google.com or CNN. The second part usually carries little or no spammy fingerprint, cloaking the first, spammy part.
  • Re:Mozilla Firefox (Score:3, Informative)

    by Technonotice_Dom ( 686940 ) on Sunday August 22, 2004 @11:43AM (#10037786)
    I'd like it if there could be a database where, if a subject header is reported as spam by one user, it affects other users' scoring.

    There are a few databases out there that take hashes of spam e-mails (either sent to spam traps or reported) and use them for spam tagging. SpamAssassin can use their client programs to help tag messages as well. I don't know if there's an extension or anything for Thunderbird; I don't use it.

    The three that come to mind are DCC [rhyolite.com], Razor [sourceforge.net] and Pyzor [sourceforge.net].

    All have their advantages and disadvantages, but you have to remember that you're relying on somebody else's judgement. I think it's DCC that you can easily configure so that you need x reports of a message before you class it as spam, which gives you more control. But you only need one person who doesn't use it correctly to ruin the system and introduce lots of false positives.
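    The report-count idea attributed to DCC above can be sketched as a shared table of message checksums, where a message only counts as bulk once at least x independent reports have accumulated. The fuzzy hash here is reduced to case and whitespace normalisation; real systems (DCC, Razor, Pyzor) use far more robust checksums:

    ```python
    import hashlib
    from collections import defaultdict

    REQUIRED_REPORTS = 3         # the "x reports" knob mentioned above
    _reports = defaultdict(int)  # checksum -> number of reports seen

    def checksum(body):
        """Crude stand-in for a fuzzy hash: collapse case and whitespace
        before hashing, so trivially re-wrapped copies of the same
        message collide."""
        normalised = " ".join(body.lower().split())
        return hashlib.sha256(normalised.encode()).hexdigest()

    def report_spam(body):
        _reports[checksum(body)] += 1

    def is_bulk(body):
        return _reports[checksum(body)] >= REQUIRED_REPORTS
    ```

    The single-threshold design also shows the failure mode mentioned above: one careless reporter can push a legitimate newsletter over `REQUIRED_REPORTS` for everyone who trusts the database.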

    You could always set up SpamAssassin on your local machine and proxy messages through that.
  • by po8 ( 187055 ) on Sunday August 22, 2004 @01:08PM (#10038237)

    As someone who's done some research on machine learning for spam filtering, this sure looks to me from their 8-page paper like yet another simplistic ML algorithm advocated by folks who don't know the field and tested using techniques of questionable sensitivity. Their "novel" method sounds an awful lot like feature set construction by clustering, a method that is widely used in the spam filtering literature, but with a somewhat novel clustering technique from biology.

    Message filtering starts by throwing away line breaks for no obvious reason, then optionally removing the known ham from the training set for no obvious reason. Message headers are then thrown away, for no obvious reason.

    No general method is given for corpus allocation. In the experiment reported later, the original corpus appears to have been split roughly in half. (For unreported reasons, none of these splits are exact. No rationale is given for the various corpus allocations.) The training corpus is then split into ham and spam, and the ham portion is split in half. The spam training corpus is used for "positive training": determining a complex feature set as described below. One half of the ham training corpus is then used for "negative training": filtering out complex features that are common in ham. The remainder of the ham corpus is used as a validation set to select thresholds described below. No justification is given for the validation set's failure to include any spam messages, and the procedure is vague on this point.

    The description of the key "positive training" phase is difficult to follow: it seems to assume the pre-existence of the "SPAM vocabulary" [sic] being constructed. The key idea seems to be to use positional index of words within the body as base features, and construct complex features by using a pattern recognition algorithm to find correspondences between sets of base features across spam messages. Patterns that appear across many spam messages are treated as indicating spam.
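    The positive/negative training split described above can be caricatured with word n-grams standing in for the far more general patterns the paper's discovery algorithm mines. This is a sketch under that simplifying assumption, not the paper's method: keep n-grams shared by several spam messages, then drop any that also occur in ham.

    ```python
    from collections import Counter

    def word_ngrams(text, n=3):
        """All contiguous n-word phrases in the message, lowercased."""
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def mine_patterns(spam_msgs, ham_msgs, n=3, min_support=2):
        """'Positive training': keep n-grams appearing in at least
        min_support spam messages. 'Negative training': drop any that
        also occur in ham."""
        counts = Counter()
        for msg in spam_msgs:
            counts.update(word_ngrams(msg, n))
        ham_grams = set()
        for msg in ham_msgs:
            ham_grams |= word_ngrams(msg, n)
        return {g for g, c in counts.items()
                if c >= min_support and g not in ham_grams}
    ```

    The real algorithm discovers gapped patterns rather than fixed contiguous phrases, which is what makes the biology connection (sequence motifs) interesting; fixed n-grams are much easier for a spammer to perturb.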

    The final training step is to set thresholds for (1) minimum number of complex features in the spam message and (2) fraction of the message text covered by the complex features. One would expect these two criteria to be highly correlated: no effort appears to have been made to enforce or explore their orthogonality.

    The classification phase proceeds by simply counting the number of patterns in a given test message and the percent coverage of the message by the patterns. If the result exceeds both thresholds, the message is classified as spam.
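    As described, the classification phase is a simple dual-threshold test. A minimal sketch (the threshold values are placeholders; the paper selects them on its validation set, and this version only marks the first occurrence of each pattern when computing coverage):

    ```python
    def classify(message, patterns, min_hits=2, min_coverage=0.5):
        """Flag as spam only if the message contains at least min_hits
        mined patterns AND those patterns cover at least min_coverage
        of its characters."""
        text = message.lower()
        if not text:
            return False
        hits = 0
        covered = [False] * len(text)
        for pat in patterns:
            start = text.find(pat)
            if start != -1:
                hits += 1
                for i in range(start, start + len(pat)):
                    covered[i] = True
        coverage = sum(covered) / len(text)
        return hits >= min_hits and coverage >= min_coverage
    ```

    The correlation worry raised above is visible here: a message matching many patterns will usually also have high coverage, so the two thresholds rarely disagree.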

    For the empirical evaluation, the corpus used seems to have consisted of approximately 130,000 messages, roughly 1/4 ham and 3/4 spam. No details of the construction or acquisition of this large corpus were given. Because of its volume, one would suspect a synthetic corpus from high volume sources. The details of this corpus construction are critical to the evaluation of the method, so no useful conclusions can really be drawn from the empirical evaluation other than that, like most machine learning methods, this method works well on some problem set.

    The claimed accuracies from the technique are at a level that is highly suspect from previous experience: there are fundamental bounds on how well any ML algorithm can do in real situations that don't appear to be met here. Indeed, messages found to be misclassified as spam in the test corpus were manually reclassified, but no effort seems to have been made to identify messages that were "correctly" classified by the algorithm but misclassified in the corpus. The accuracy before manual manipulation of the results (!) appears to be about 97%, which is well within the normal expected range. Computational efficiency appears to be good.

    The vocabulary used in the paper is not particularly consistent with the vocabulary normally used in the spam filtering or machine learning literature. A few spam filtering and machine learning papers are cited, but not many: citations are primarily from the
