Proving Which Spam Filters work Best (263 comments)

Posted by samzenpus
from the get-rid-of-it dept.
pirateninja writes "Dr. Gord Cormack decided to find and prove what the best spam filter is. In his study he looked at the major spam filters (DSPAM, SpamAssassin, etc.) along with those submitted by various academics. The results are quite surprising, with a previously unheard-of spam filter, which uses ideas from various compression algorithms, performing the best overall. He recently presented the results and methodology used in a presentation titled 'Spam Filters, Do they Work? and Can you prove it?'" Note that this is a video of his presentation.

  • Not at 400 (Score:0, Informative)

    by Anonymous Coward on Thursday August 03, 2006 @12:19AM (#15837325)
    400 Megs that is.......
  • In my experience... (Score:5, Informative)

    by vivin (671928) <vivin.paliath@nospAm.gmail.com> on Thursday August 03, 2006 @12:24AM (#15837340) Homepage Journal
    ... the ones which have worked best (for me) are Bayesian Spam Filters [wikipedia.org] (A Plan for Spam [paulgraham.com], SpamBayes - a free filter [sourceforge.net]) and CRM114 [sourceforge.net] The Controllable Regex Mutilator (Paul Graham mentions it here [paulgraham.com]). I've always had a very high success rate with these.

  • by coffeeisclassy (991791) on Thursday August 03, 2006 @12:42AM (#15837417)
    It's round-robin mirrored across a whole bunch of different servers, so if you're only getting 8kb/s you could try cancelling and downloading again to see if it goes faster.
  • by emag (4640) <slashdot@gurski.MENCKENorg minus author> on Thursday August 03, 2006 @12:45AM (#15837432) Homepage
    And turn off SMTP VRFY. Either that, or having Windows systems @ my ISP managed to get the address associated with my account on spam lists. This is an address that's *only* used internally by my ISP (I use pobox or my own domain whenever someone asks for an address). Even that wasn't enough to prevent it from getting harvested. :-(
  • by saha (615847) on Thursday August 03, 2006 @12:46AM (#15837437)
    We use Brightmail [brightmail.com] on our campus and our users love it for its very low false positive rate and pretty accurate flagging of SPAM. Another campus uses DSPAM and some people are up in arms at the prospect of losing their Brightmail to switch to DSPAM. Personally, DSPAM isn't nearly as good and has flagged many legitimate messages and sent them to the Junk folder.

    I also echo a gripe of other posters. It's nice to have a video, but a 500MB video file is a bit much. A 50KB pie chart or bar graph would have been nice.
  • Flaw in the test (Score:5, Informative)

    by lheal (86013) <lheal1999@yahoo . c om> on Thursday August 03, 2006 @12:48AM (#15837444) Journal
    The spammers actively try to subvert the more popular filters. That gives a lesser-known one a decided advantage, one which will go away as it becomes more popular.

    As with most choices like this, factors such as ease of use, speed, and resource efficiency can overshadow selectivity. No system is perfect, so it's perfectly reasonable to go with a system that's pretty good if you already are using it, rather than switching to the latest cool thing.

    I have found that using two dissimilar systems in a chain is quite effective.
  • by martin-boundary (547041) on Thursday August 03, 2006 @01:13AM (#15837527)
    For those who don't relish downloading 400MB worth of video (why can't somebody cut out the audio as a standalone MP3?), the material of the talk is also available in text mode.

    The official tests of spam filters were done at last year's TREC conference; you can read the writeup here [uwaterloo.ca] (or pdf overview [uwaterloo.ca]).

    You can duplicate those tests yourself if you download the evaluation toolkit (GPL) [uwaterloo.ca]. It's a modular system where you can add a mail corpus (either one of the public TREC ones, or you can make your own trivially), and add a spamfilter package (there are 10 or so to download from the web, or create your own as per documentation).

    There's also a video talk [researchchannel.org] given at Microsoft research which should cover pretty much the same ground, if text mode is slashdotted :).

    There's a new scheduled test towards the end of the year at TREC 2006.

  • Re:I have one word: (Score:3, Informative)

    by Jeffrey Baker (6191) on Thursday August 03, 2006 @01:38AM (#15837598)
    I hope you also have another word, because the Postini service is incredibly bad. I had it enabled on my account at acm.org, and the Postini system was generating roughly one false positive for every 10 true positives. I disabled the Postini filtering and started using Spamassassin. Both the false positive and false negative rates are much improved. Among the traffic that Postini was flagging as spam were the Wikipedia article of the day, my daily email from musicbrainz.org, all messages to the BATN mailing list, many replies to my items for sale on craigslist, and other kinds of completely legitimate traffic. Among the mail they chose to deliver were messages in Korean, Cyrillic, other scripts I can't read, and known viruses.

    Their main problem is the system doesn't learn. Using their web interface, I look through the spam folder and request delivery of all the false positives. The next day, nearly-identical mails are still generating false positives. You'd think it would be easy these days to design a filter that learns from negative reinforcement.
  • by Red Alastor (742410) on Thursday August 03, 2006 @02:00AM (#15837652)
    I like popfile because it's a Bayesian filter that sorts into any arbitrary categories you want, not just spam and ham.

    http://popfile.sourceforge.net/ [sourceforge.net]

  • by sciop101 (583286) on Thursday August 03, 2006 @02:35AM (#15837755)
    On-line Supervised Spam Filter Evaluation
    Gordon Cormack and Thomas Lynam

    Full Text, May 29, 2006 - PDF Format

    http://plg.uwaterloo.ca/~gvcormac/spamcormack.html / [uwaterloo.ca]

  • by prandal (87280) on Thursday August 03, 2006 @04:09AM (#15837975)
    This paper's a complete waste of time.

    He tested spamassassin 2.3 - that's ancient! I'd imagine the other tools are similarly obsolete.

    We currently use SA 3.1.4 with a well-trained Bayes database and Razor, Pyzor, and DCC.

    Throw in a few custom rules and a selection of rules from http://www.rulesemporium.com/ [rulesemporium.com] and the results are outstanding.

    With the new sa-update feature the core rules are updated between point releases, which came in useful this week dealing with the new image spams which seemed to be designed to avoid detection by spamassassin. Thanks Theo.

    And the folk on the spamassassin-users mailing list really rock.
  • Re:MS Anti Spam... (Score:3, Informative)

    by KiloByte (825081) on Thursday August 03, 2006 @04:37AM (#15838030)
    A false positive rate of 1:100
    No, better than 1:100 - that's what <1% means. It's actually around the 1:500
    And thus still 200 times worse than the acceptable rate.
    Usually, anti-spam solutions which give more than 1:100000 are considered worthless
    Got links, or is that just your opinion?
    There was a massive flamefest on debian-devel about spam filtering recently, but false positive ratios in that range were something commonly used by most participants in the discussion. I don't have the time to find a bunch of such posts right now, but the most recent thread is "greylisting on debian.org". This particular thread deals mostly with acceptable delays, but it does include quite a bit of statistics.

    However, note that we are talking about two separate scenarios:

    • a home server for a user with no responsibilities
    • a project/ISP-wide mail server
    In the former, delaying mail for weeks may be acceptable -- but even then, I wouldn't touch something with a 1:500 false positive ratio with a long stick.
  • Re:Why do they try? (Score:1, Informative)

    by Anonymous Coward on Thursday August 03, 2006 @06:05AM (#15838238)
    Because many clueless morons have email spam filters administered by the clued;
    Not making any judgements but the "clued" category includes Gmail, Yahoo Mail,
    AOL, corporate IT managers and university mail server admins.
  • Torrent (Score:4, Informative)

    by vivin (671928) <vivin.paliath@nospAm.gmail.com> on Thursday August 03, 2006 @06:55AM (#15838360) Homepage Journal
    Here [vivin.net] is a torrent I made of the xvid file. It should work (I hope).
  • Dspam floats my boat (Score:3, Informative)

    by Zzeep (682115) <kenneth@nOSpAm.vangrinsven.com> on Thursday August 03, 2006 @07:20AM (#15838419) Homepage
    I receive (no kidding) around 600 spam mails per day, versus approximately 30 real e-mails. I've been using dspam for over a year now (with very faithful training), and there is maybe 1 false positive every few weeks (less than 1 in 10,000) and every few days a few (usually "new") spam mails get through, which I of course immediately train, to never see that kind again. So I am very, very positive about dspam. What I do miss, though, is something like a good and reliable service (better than the RBLs I know) that can block SMTP clients on the fly (like DSL home users and such) to reduce the immense load on our mailservers (I work for an ISP) caused by all the spam (that also has to go through a virus scanner, clamav).
  • Re:Torrent (Score:2, Informative)

    by jsharkey (975973) on Thursday August 03, 2006 @07:54AM (#15838529)

    Go get VideoLAN client and you can stream download the OGG version. Just open the URL as a Network Stream:

    http://www.csclub.uwaterloo.ca/media/files/cormack-spam.ogg [uwaterloo.ca]

    Very handy use of VLC! :)

  • by gvc (167165) on Thursday August 03, 2006 @09:18AM (#15838974)
    I assume the paper that you are describing is the 2004 study [uwaterloo.ca]. The paper described in the talk (which was given 6 months ago or so) described results of the TREC 2005 Spam Track [uwaterloo.ca] which took place in November 2005. It included a test of SpamAssassin 3.x, not 2.3.

    TREC 2006 [nist.gov] evaluations are now underway [uwaterloo.ca].

    While it is reasonable to conjecture that spam has changed so as to defeat spam filtering techniques, or will change so as to defeat the PPM technique that did well at TREC, the historical evidence does not support this conjecture. In particular:

    • The spam filters tested in 2004 give pretty well exactly the same performance on 2005 and 2006 data.
    • New versions of the filters are a little bit better, but not by leaps and bounds, and also get about the same results over the last 2.5 years of data.
    • There is no evidence that "Bayesian poisoning" is a viable technique for defeating statistical spam filters in anything but a very artificial laboratory environment where the poisoner has access to the recipient's inbox.
    The subject of the paper -- and the talk -- is primarily about testing methodology and the need for controlled scientific investigation. So I hesitate to endorse the simplistic notion of a "winner" of the TREC evaluation. However the technique that did very well [ai.ijs.si] was indeed quite novel, so here's a characterization.
    Andrej Bratko used PPM -- a well-known data compression technique -- to compress ham and spam separately. Well, actually he didn't compress them but just built the statistical model necessary to compress them. Then he simply (tentatively) added the unknown message to each model and chose the one that compressed it best. The general technique of using compression has been mentioned here and elsewhere, but Bratko used a much stronger compression scheme and was somewhat clever about it.
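
A minimal sketch of the compress-and-compare idea, for readers who want to see its shape. It is not Bratko's filter: it swaps in zlib (DEFLATE) from the Python standard library for PPM, recompresses the whole corpus instead of keeping an adaptive model, and uses tiny made-up corpora. The names `extra_bytes`, `classify`, `HAM` and `SPAM` are hypothetical.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed data (stand-in for a PPM model)."""
    return len(zlib.compress(data, 9))


def extra_bytes(corpus: bytes, message: bytes) -> int:
    """How many more compressed bytes the message costs when appended to a corpus."""
    return compressed_size(corpus + message) - compressed_size(corpus)


def classify(message: str, ham_corpus: str, spam_corpus: str) -> str:
    """Pick the class whose corpus compresses the message best (ties go to ham)."""
    msg = message.encode("utf-8")
    ham_cost = extra_bytes(ham_corpus.encode("utf-8"), msg)
    spam_cost = extra_bytes(spam_corpus.encode("utf-8"), msg)
    return "spam" if spam_cost < ham_cost else "ham"


# Toy training data, invented for illustration only.
HAM = "\n".join([
    "Can we move the project meeting to Tuesday afternoon?",
    "Attached are the minutes from the last committee review.",
    "Thanks for the patch, I will merge it after the tests pass.",
])
SPAM = "\n".join([
    "Cheap pills, buy now, limited offer, click here!",
    "You have won a free prize, click here to claim it now!",
    "Lowest mortgage rates, limited offer, act now!",
])

if __name__ == "__main__":
    print(classify("click here to claim it now!", HAM, SPAM))
    print(classify("I will merge it after the tests pass.", HAM, SPAM))
```

Breaking ties in favour of ham is a deliberate choice here, since a false positive costs more than a missed spam. A real filter would use far larger corpora and a proper adaptive model.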

    I later reproduced Bratko's results using DMC -- a compression scheme that I invented 20 years ago -- and got some interesting results. We have a journal article in press describing it and also an evaluation paper at CEAS 2006 [www.ceas.cc].

    Bratko A., Cormack G. V., Filipic B., Lynam T. R. and Zupan B., Spam Filtering Using Statistical Data Compression Models [uwaterloo.ca]

  • by gvc (167165) on Thursday August 03, 2006 @09:45AM (#15839146)
    Bogofilter works great. Or SpamAssassin but only if you force-feed it its own judgements [uwaterloo.ca]. In both cases you have to correct classification errors.

    Fidelis Assis (who has now gone solo after having participated in the CRM114 project) shows great results for his recent solo effort: OSBF-lua [luaforge.net]. Bratko's PPM spam filter [ai.ijs.si] -- the one that did great at TREC -- is not yet packaged as a drop-in filter. Same for my DMC spam filter [www.ceas.cc].

    The actual TREC 2005 tests referred to in TFA are here. [uwaterloo.ca]

  • by gvc (167165) on Thursday August 03, 2006 @10:36AM (#15839587)
    Here are the slides from the 400MB video presentation. [uwaterloo.ca]
  • by hacker (14635) <hacker@gnu-designs.com> on Thursday August 03, 2006 @11:14AM (#15839886)
    Personally, DSPAM isn't nearly as good and has flagged many legitimate messages and sent them to the Junk folder.

    And what happened when you retrained those false positives as ham? Did you see future mails of the same/similar type get caught again? I bet you didn't.

    I've been using dspam for a very long time for my users, and they love it. They love having zero spam in their mailbox, they love the simplicity of the user interface. They love how it treats users on a per-user basis, not globally (i.e. some users WANT html emails, some do not. Each can mark them as they see fit.)

    Here's an example of my own stats..

    hacker: TP True Positives: 122601
    TN True Negatives: 124711
    FP False Positives: 211
    FN False Negatives: 1046
    SC Spam Corpusfed: 3708
    NC Nonspam Corpusfed: 456
    TL Training Left: 0
    SHR Spam Hit Rate: 99.15%
    HSR Ham Strike Rate: 0.17%
    OCA Overall Accuracy: 99.49%
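
The three percentages can be reproduced from the four counts. The sketch below assumes the usual definitions: hit rate as TP/(TP+FN), strike rate as FP/(FP+TN), and accuracy as correct/total. Those formulas are an assumption that happens to match the posted figures, not something taken from DSPAM's documentation, and the function name is hypothetical.

```python
def rates_from_counts(tp: int, tn: int, fp: int, fn: int):
    """Return (spam hit rate, ham strike rate, overall accuracy) as fractions.

    Assumed definitions, not taken from DSPAM itself:
      spam hit rate    = spam caught / all spam
      ham strike rate  = ham wrongly flagged / all ham
      overall accuracy = correct decisions / all decisions
    """
    spam_hit_rate = tp / (tp + fn)
    ham_strike_rate = fp / (fp + tn)
    overall_accuracy = (tp + tn) / (tp + tn + fp + fn)
    return spam_hit_rate, ham_strike_rate, overall_accuracy


# Counts copied from the stats above.
shr, hsr, oca = rates_from_counts(tp=122601, tn=124711, fp=211, fn=1046)

if __name__ == "__main__":
    print(f"SHR {shr:.2%}  HSR {hsr:.2%}  OCA {oca:.2%}")
```

By the same arithmetic, 211 false positives out of 124922 legitimate messages is roughly 1 in 590.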

  • by ajs (35943) <ajs AT ajs DOT com> on Thursday August 03, 2006 @11:23AM (#15839942) Homepage Journal
    In my experience, the commercial offerings (such as mail frontier) aren't too bad. As far as open source stuff, my personal setup of choice is:
    • Spamhaus SBL/XBL filtering (hard SMTP-time DNSBLing) based on my experience with them and their consistent listing of VIOLATORS, not just anyone who shares a netblock with a spammer (i.e. they may not catch as much as some others, but they don't have the FP rate that others do)
    • Greylisting. This is controversial because many people can't tolerate the delay it introduces. I found a radical decrease in spam when using it (because honeypots have already located a spammer by the time they try again), and only marginal headaches introduced by the delays of new senders. YMMV, and I wouldn't use it in a production environment.
    • SpamAssassin. I tweak the RBL settings (I *never* want to even score SORBS, for example), and configure razor, but otherwise pretty much leave it in its default configuration, and it works great!
    • Thunderbird mail filtering. I use Evolution and Thunderbird. I don't bother turning on mail filtering in Evolution, since it uses SpamAssassin, and there's no point using SA twice on the same message. I *do* use Thunderbird's filtering as yet another layer of filtering when I'm using that, and it does a good job of classifying what little spam is left.


    YMMV. Good luck.
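
For anyone unfamiliar with the greylisting step in the list above, here is a rough sketch of the decision logic. It assumes the common scheme of keying on the (client IP, sender, recipient) triplet and temporarily rejecting until a minimum delay has passed. The class name, the 300-second delay and the in-memory dict are illustrative assumptions, not the configuration of any particular mail server.

```python
from dataclasses import dataclass, field


@dataclass
class Greylist:
    """Toy greylist: defer unknown triplets, accept them on a later retry."""

    min_delay: float = 300.0  # assumed minimum wait in seconds before a retry is accepted
    first_seen: dict = field(default_factory=dict)  # triplet -> time of first attempt

    def check(self, client_ip: str, sender: str, recipient: str, now: float) -> str:
        """Return "defer" (SMTP temporary failure) or "accept" for this attempt."""
        key = (client_ip, sender, recipient)
        seen = self.first_seen.get(key)
        if seen is None:
            # First contact: remember it and ask the client to try again later.
            self.first_seen[key] = now
            return "defer"
        if now - seen < self.min_delay:
            # Retried too quickly; still inside the greylisting window.
            return "defer"
        return "accept"


if __name__ == "__main__":
    gl = Greylist()
    triplet = ("192.0.2.1", "alice@example.com", "bob@example.org")
    print(gl.check(*triplet, now=0.0))
    print(gl.check(*triplet, now=60.0))
    print(gl.check(*triplet, now=400.0))
```

A legitimate mail server queues the message and retries after the deferral; much spam software never retries, which is where the reduction comes from. The delay for new senders is the trade-off mentioned above.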
  • Re:Torrent (Score:3, Informative)

    by wayne (1579) <wayne@schlitt.net> on Thursday August 03, 2006 @11:27AM (#15839981) Homepage Journal
    Your tracker is still 440'ing, so I have put up an alternative tracker [schlitt.net]. As I write this, I only have about 9% of the avi downloaded, so if someone else can seed the complete cormack-spam-xvid.avi file, I would greatly appreciate it.
  • Re:Harder! (Score:2, Informative)

    by Anonymous Coward on Thursday August 03, 2006 @11:57AM (#15840209)
    Hey, we can't help it if people decide to post our videos to ./ and Digg!
    [/innocence]

    Here are UW's traffic stats, in case anyone's interested:
    http://noc.uwaterloo.ca/cgi-bin/14all.cgi?log=cn-rtext_gi2&cfg=cn-rtext.cfg [uwaterloo.ca]

    Also note the spikes on Monday and Tuesday from when we posted our last [slashdot.org] two [slashdot.org] talks.
  • Re:Harder! (Score:3, Informative)

    by cruachan (113813) on Thursday August 03, 2006 @12:01PM (#15840241)
    True, but as I said below, there are literally mounds of baked clay tablets because they are so indestructible. Apparently they used to get shovelled into foundations and the like. The estimate I heard was that at current rates it will take scholars several hundred years to translate what we've found already. Compare that to parchment records where the discovery of even a few new scraps is a major event (http://news.bbc.co.uk/1/hi/sci/tech/5235894.stm and particularly http://news.bbc.co.uk/1/hi/world/europe/5216320.stm [bbc.co.uk]). Point is, in the race for the most successful long-term storage mechanism, cuneiform on baked clay is way ahead of the field; nothing else comes close.

    Excellent 'In Our Time' programme on Babylon and its Literature here - http://www.bbc.co.uk/radio4/history/inourtime/inourtime_20040603.shtml [bbc.co.uk]
