
Proving Which Spam Filters Work Best (263 comments)

Posted by samzenpus
from the get-rid-of-it dept.
pirateninja writes "Dr. Gord Cormack decided to find and prove what the best spam filter is. In his study he looked at the major spam filters (DSPAM, SpamAssassin, etc.) along with those submitted by various academics. The results are quite surprising, with a previously unheard-of spam filter, which uses ideas from various compression algorithms, performing the best overall. He recently presented the results and methodology used in a presentation titled 'Spam Filters, Do they Work? and Can you prove it?'" Note that this is a video of his presentation.
This discussion has been archived. No new comments can be posted.

  • Easier? (Score:3, Insightful)

    by Ec|ipse (52) on Thursday August 03, 2006 @12:22AM (#15837338)
    Isn't there an easier way to display the results, like a chart or something? 400MB per file download is a bit extreme.
    • Harder! (Score:5, Funny)

      by Profane MuthaFucka (574406) <busheatskok@gmail.com> on Thursday August 03, 2006 @12:53AM (#15837466) Homepage Journal
      I uuencoded the video file, translated it into Sumerian cuneiform, and pressed it into a billion little clay tablets. They are cooking in my oven right now. Now, the Internet is NOT some kind of truck you can just dump stuff onto, so if you want to get the data you're going to have to come to my house.
      • Re:Harder! (Score:5, Funny)

        by rts008 (812749) on Thursday August 03, 2006 @01:05AM (#15837509) Journal
        I can't come to your house, you insensitive clod!, teh tubes are clogged with clay tablets!

        I won't be able to download my internet until Friday now!

        Turn that crap down, and get off of my lawn! Damn kids!
        • Re:Harder! (Score:5, Insightful)

          by Squalish (542159) <Squalish AT hotmail DOT com> on Thursday August 03, 2006 @08:30AM (#15838679) Journal
          Am I the only one that read the means of presentation as a hilarious attack on a university policy of blocking bittorrent? Given that adding 470MB doesn't really add any usable information to a discussion about spam filters over a piece of text, and all.

          Your college doesn't like bandwidth-efficient delivery? Flood them with a Slashdot effect on a 500mb file, an extra $500 in bandwidth charges, and maybe they'll change their tune.
      • Re:Harder! (Score:2, Funny)

        by Cylix (55374)
        Excellent...

        By chance, are you nearby?

        I have a wonderful set of wikipedia tablets I made and I'm eager to offload them...er I mean... trade them.

        It's the updates you see, I've been having a bit of a nightmare trying to keep them all in sync.
      • Now, the Internet is NOT some kind of truck you can just dump stuff onto, so if you want to get the data you're going to have to come to my house.

        No, I understand the internet is actually a series of tubes, and there will be hell to pay if they get "full".
      • Re:Harder! (Score:5, Insightful)

        by cruachan (113813) on Thursday August 03, 2006 @05:43AM (#15838190)
        Don't knock it, cuneiform on baked clay is the single most successful format for long-term storage ever invented - 3000 years and counting. Heck, most of our modern storage formats can't even manage 30 - tried to read an 8" floppy recently?
        • Re:Harder! (Score:2, Insightful)

          by Jartan (219704)
          I'm not going to knock it but your statement is very far from the truth. Determining the "most successful" long term storage method invented would require waiting till the year 5xxx something to see if something we've currently invented beats cuneiform. Even then it's pretty hard to prove one way or another, since a lot of the cuneiform we have today is being carefully taken care of to prolong its lifetime, I'd suspect (though I have no confirmation of that part).
          • Re:Harder! (Score:2, Insightful)

            by ozmanjusri (601766)
            I'm not going to knock it but your statement is very far from the truth.

            Yep, you're right. The best long-term information storage media ever invented is poetry.

          • Presumably the tablets in museums are, but I was listening to an archeologist recently who was saying that the advantage of clay tablets is that they are virtually indestructible unless you purposely take a hammer to them. Compare that to papyrus and parchment, which can be preserved for a long time but need special circumstances, such as being buried in a bog, to stop decay.

            Apparently, because of this, there are vast amounts of Sumerian and related texts awaiting translation (the language was only deciphere
            • Re:Harder! (Score:3, Funny)

              by Crayon Kid (700279)
              Bwahaha, I'm moving my blog to clay tablets. They will undoubtedly survive the next Ice Age and the people of year 5000 will be forced to read about my cat, how I hate Emo's and that guy at work who doesn't wash. But first I'll change my blog nick to "Earth Imperial Overlord Supreme", just to fuck with them future dudes.
    • They could have just e-mailed it to everyone with a gmail account.
  • In my experience... (Score:5, Informative)

    by vivin (671928) <<vivin.paliath> <at> <gmail.com>> on Thursday August 03, 2006 @12:24AM (#15837340) Homepage Journal
    ... the ones which have worked best (for me) are Bayesian Spam Filters [wikipedia.org] (A Plan for Spam [paulgraham.com], SpamBayes - a free filter [sourceforge.net]) and CRM114 [sourceforge.net] The Controllable Regex Mutilator (Paul Graham mentions it here [paulgraham.com]). I've always had a very high success rate with these.

    • What's surprising is, while Bayesian spam filters work well in his tests, the one that performs the best was never really heard of before.... I wonder how long it will be before we see something using the methods available. Who wants to bet OpenSource will beat closed source to implementing this?
      • by 1u3hr (530656)
        What's surprising is, while Bayesian spam filters work well in his tests, the one that performs the best was never really heard of before

        Well, the spammers have heard of the other methods too and try to subvert them. So give them time and see how it performs if and when it becomes more commonly used and the spammers are trying to beat it.

        • yep but good luck trying to defeat different algorithms and still retaining some sense, let alone a convincing message. Unless people are going to trust a sender named "Honey bee furufuru", which unfortunately is still entirely possible.
          • by jank1887 (815982)
            Hello. welcome to the internet.
            First, spam does not need to make sense to make money. Here's some of my latest received headlines:
            • placing LEDhas
            • pJapans mission
            • capture Todays architect shared
            • 6MZ

            and the body text (with an attached image):

            -----
            malware

            USDA databases crop

            entente cordial: admission relation contract GB giveaway andd

            studios another page:

            ... (etc.,etc.)
            -------
            AND IT STILL MAKES MONEY!!!
            spam is funded by idiots. we will never run out of idiots on the net. Thus, spam will

    • by ozmanjusri (601766) <aussie_bob&hotmail,com> on Thursday August 03, 2006 @12:30AM (#15837378) Journal
      I've always had a very high success rate with these.

      I haven't tested this one myself, Barrett Filter [barrettrifles.com] but I understand it is 100% effective at reducing spam from known sources. False positives may be a problem, however.

    • SpamAssassin uses Bayesian Filtering as well as other methods.
    • by Red Alastor (742410) on Thursday August 03, 2006 @02:00AM (#15837652)
      I like popfile because it's a bayesian filter that sorts into any arbitrary categories you want, not just spam and ham.

      http://popfile.sourceforge.net/ [sourceforge.net]

    • by ajs (35943)
      In my experience, the commercial offerings (such as mail frontier) aren't too bad. As far as open source stuff, my personal setup of choice is:
      • Spamhaus SBL/XBL filtering (hard SMTP-time DNSBLing) based on my experience with them and their consistent listing of VIOLATORS, not just anyone who shares a netblock with a spammer (i.e. they may not catch as much as some others, but they don't have the FP rate that others do)
      • Greylisting. This is controversial because many people can't tolerate the delay it introdu
  • by shotgunefx (239460) on Thursday August 03, 2006 @12:25AM (#15837343) Journal
    400MB?

    Why not just douse the server in gas if you want to see it melt.
  • At work we've set up a combination of SpamAssassin and Spamhaus. Personally I've gone from about 10 spams per day to about 1 every two weeks.
    • Bah. We use Spamassassin, multiple DNSBLs, and I still get hundreds per day, most of them to addresses published on websites (unavoidable).

      The key is still: don't give out your address. Once you've done that, you're going to be screwed eventually.

      • DUL = DialUp List... a bit of a misnomer as it commonly refers to all dynamic hosts. My spam went down dramatically after starting to use Trend's DUL (formerly MAPS). Alas, it's a pay service, but it all comes down to your pain threshold. Mine is low relative to my income.
      • And turn off SMTP VRFY. Either that, or having windows systems @ my ISP managed to get the address associated with my account on spam lists. This is an address that's *only* used internally by my ISP (I use pobox or my own domain whenever someone asks for an address). Even that wasn't enough to prevent it from getting harvested. :-(
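
For anyone who hasn't looked under the hood of a DNSBL such as the lists discussed above: the check is just a DNS query on the connecting IP with its octets reversed, under the list's zone. Below is a minimal sketch, not any particular list's client. The function names and the zone passed in are assumptions for illustration; substitute the zone your list provider documents.

```python
import ipaddress
import socket


def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name a DNSBL lookup queries for an IPv4 address."""
    addr = ipaddress.IPv4Address(ip)  # raises ValueError on malformed input
    reversed_octets = ".".join(reversed(str(addr).split(".")))
    return f"{reversed_octets}.{zone}"


def is_listed(ip: str, zone: str) -> bool:
    """Return True if the zone publishes an A record for this address.

    Simplification: any resolver failure is treated as "not listed".
    """
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:
        return False
```

A mail server would call something like `is_listed(client_ip, zone)` at connection time and reject or tag the message on a hit. Real deployments usually also inspect the returned 127.0.0.x code, which this sketch ignores.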

        • And turn off SMTP VRFY.

          SMTP VRFY (or recipient-checking at the SMTP level in general) being disabled is pointless. Given a choice between allowing people to not send mail to invalid addresses or having to deal with bounce-back scatter and getting your MX server blacklisted for third-party spam, I'll take the former any day.

          And I'd wager anyone who's had to admin a qmail server and decide which (if any) recipient-checking patch to use would feel the same way.

          It's far less load on the servers to have a more e
      • Heh, even if you are reasonably diligent in protecting your email address, 9/10 it will still get out(though maybe not as bad). All it takes is one recipient with a compromised windows box and your address can be all over the spammers lists in no time.
        Or, as in my case, you could assume that a university you apply to will not send out a giant mass email to all the incoming graduate students inviting them to the graduate orientation. So now I have the email address of every grad student entering the Univ
      • The key is still: don't give out your address. Once you've done that, you're going to be screwed eventually.


        Nah, that's such a half measure. The real solution is to not have an email address at all.
    • At work we've set up a combination of SpamAssassin and Spamhaus. Personally I've gone from about 10 spams per day to about 1 every two weeks.

      Amazing! - We've been using that combo for a long time and I get about 5-10 spams AN HOUR coming through the filters (and about the same amount caught). This is all personalized spam sent to one specific email address. That address was used in the past for a few newsgroup postings, a few technical forums and it was listed on a webpage some time ago. No spam sent to it
  • by _vSyncBomb (50710) on Thursday August 03, 2006 @12:30AM (#15837375) Journal

    Hey Slashdot, what's up, man! Dude, I read your thing and like totally agree about Best Work Proving Spam Site Work! Dude, that's awesome!

    Bro, in the same vein, I was totally checking out this dope ass site [microsoft.com] which you might wanna check out [doubleclick.net] too man. Guys like us that dig Spam Which Proving and Best work Filters will be all over this before long...

    OK, man take care until I see you this Friday at the dinner thing, Slashdot!

    Cheers,
    John

    • by Anonymous Coward on Thursday August 03, 2006 @12:55AM (#15837473)
      Hey _vSyncBomb,

        Having trouble pleasing your woman? I've got something Very Interesting And Generally Really Amusing that you could try!!!

      Your buddy,
      _vAnoymousCoward
    • by patio11 (857072) on Thursday August 03, 2006 @04:15AM (#15837987)
      I ran your message through a perl script to mail it to me for giggles (I do research on spam filtering at ye olde day job). Regretfully, you didn't make it through. Aside from header garbage, which was a mixed bag (half spam tokens, half "known-good automated email" tokens), you ran into problems with dope, ass, wanna, and... work*. Which is just as well, as I have no desire to speak to anyone who uses those words. * Last 15 occurrences in my mailbox are all of the "Make l0ads of $$$ work @ h0m3!" variety.
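
The per-token scoring described in the comment above can be sketched roughly as follows. This is a generic naive Bayes token filter with add-one smoothing, not the poster's script and not any named product's algorithm. The class name, the tokenizer regex, and the training strings are all assumptions made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into crude word-like tokens."""
    return re.findall(r"[a-z0-9$@']+", text.lower())


class TokenFilter:
    """Toy naive Bayes spam filter with add-one (Laplace) smoothing."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.counts[label].update(tokenize(text))
        self.messages[label] += 1

    def spam_probability(self, text):
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        spam_total = sum(self.counts["spam"].values())
        ham_total = sum(self.counts["ham"].values())
        # Start from the smoothed prior odds, then add one term per token.
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for tok in tokenize(text):
            p_spam = (self.counts["spam"][tok] + 1) / (spam_total + vocab)
            p_ham = (self.counts["ham"][tok] + 1) / (ham_total + vocab)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so the logistic function below cannot overflow.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))
```

After a few `train(text, "spam")` and `train(text, "ham")` calls, `spam_probability(message)` returns a value between 0 and 1. Words that have only ever shown up in spam push the score up, which is why an innocent word like "work" can count against a message.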
  • RTFA? (Score:5, Insightful)

    by glowworm (880177) on Thursday August 03, 2006 @12:39AM (#15837408) Journal
    So, how are we supposed to RTFA when the FA is over 470MB and a video file? Why not just a nice simple text summary, Mr Submitter? But nooooo, that would just be too easy!
  • Not surprising... (Score:4, Insightful)

    by RealGrouchy (943109) on Thursday August 03, 2006 @12:45AM (#15837433)
    Although I haven't WTFV (watched the video), it doesn't seem surprising that spam filters which use techniques that aren't used widely would be most successful.

    If they aren't used widely, it would either be because they don't work, or they do work but they haven't caught on [yet].

    It's like any other fad. As an example, when the original Survivor series came out, it was really popular because it achieved its goal (attracting viewers) in a way that was original. Heck, even I watched the original one. Now that all the networks are doing the reality TV thing, it has become hackneyed, and each successive version of survivor does a worse job of achieving its goal. And I've given up watching TV.

    With antispam, new techniques are effective, but as they become more popular and more widely used, spammers will find equally innovative ways of getting around them.

    I've noticed that at any given time, there will be a particular style of (non-blank) spam that manages to get through Gmail's filters fairly consistently, but every now and then Gmail adapts its spam filters to block the successful spam type of the season, and eventually a new type will make its way through.

    - RG>
    • Spam is easy to take care of, well, 99% of it. The rest isn't a big deal, so who cares.

      My office went from 2000 spam mails a day to about 10, across 15 employees. Who gives a crap about the 10 emails remaining...

      I only wish it could be taken care of further upstream to shut those pricks down. But for the end user, from an admin's perspective, most systems are pretty easy to deal with (particularly small offices).
  • by saha (615847) on Thursday August 03, 2006 @12:46AM (#15837437)
    We use Brightmail [brightmail.com] on our campus and our users love it, with its very low false positive rate and pretty accurate flagging of SPAM. Another campus uses DSPAM and some people are up in arms at the prospect of losing their Brightmail to switch to DSPAM. Personally, DSPAM isn't nearly as good and has flagged many legitimate messages and sent them to the Junk folder.

    I also echo a gripe of other posters. It's nice to have a video, but a 500MB video file is a bit much. A 50KB pie chart or bar graph would have been nice.
    • Personally, DSPAM isn't nearly as good and has flagged many legitimate messages and sent them to the Junk folder.

      And what happened when you retrained those false positives as ham? Did you see future mails of the same/similar type get caught again? I bet you didn't.

      I've been using dspam for a very long time for my users, and they love it. They love having zero spam in their mailbox, they love the simplicity of the user interface. They love how it treats users on a per-user basis, not globally (i.e. so

  • Flaw in the test (Score:5, Informative)

    by lheal (86013) <lheal1999@@@yahoo...com> on Thursday August 03, 2006 @12:48AM (#15837444) Journal
    The spammers actively try to subvert the more popular filters. That gives a lesser-known one a decided advantage, one which will go away as it becomes more popular.

    As with most choices like this, factors such as ease of use, speed, and resource efficiency can overshadow selectivity. No system is perfect, so it's perfectly reasonable to go with a system that's pretty good if you already are using it, rather than switching to the latest cool thing.

    I have found that using two dissimilar systems in a chain is quite effective.
    • Excellent point.

      And that applies to spam filtering techniques as well - it's like anti-biotics. For serious stuff, a spread attack is a good idea.

      I've found that using RBLs, SpamAssassin, and Bayesian filters prevents 99.5% of spam with essentially no false positives. And that means, by my day-to-day experience with addresses spammed for a full 10 years now, that instead of getting 100 spam and one real mail, I get 1 real mail, and once every couple of days a spam that gets through.

      Except for earlier this ye
      • by Jeffrey Baker (6191)
        The problem with the spam filters, which you have stated, is that eventually a spammer figures out how to craft a spam which avoids the feature detection systems. Right now there's some zombie network sending around a stock market scam, of which I am getting roughly 300 copies per hour, even though spamassassin correctly classifies virtually all other unwanted mail.

        Lately, I've been thinking about this problem a lot. The classic method of computer classification systems (Bayes, SVM, whatever) are all base
        • Right now there's some zombie network sending around a stock market scam, of which I am getting roughly 300 copies per hour, even though spamassassin correctly classifies virtually all other unwanted mail.

          If you're talking about spam with the pump & dump message in an image, and random-words text, I'm getting about a dozen of those a day. They're one of three types that's getting through my filters currently. 300 copies per hour would make me just about ready to kill somebody.

          I have long avoided webs
        • by perlchild (582235)
          A web of trust will work only until the computer of someone you trust gets subverted. The zombie network you mentioned doesn't happen by itself. Now, the smaller and more technically proficient the web of trust, the less likely it is to be subverted, but it's still vulnerable to someone you trust having their computer hijacked.
  • It may not be coincidence that a little-known filter algorithm produces the best results; many spammers probably test their spew on the more popular filters to try and fool them. If this new filter becomes more popular you may see its reliability decay.
    • This is very true.
      I have a successful spamfilter deployed at work. It uses SpamAssassin for the backend filtering, but that part has to do very little.
      The bulk of the rejecting is done in the dedicated SMTP engine that receives the mail. There is a lot of information to be deduced from the SMTP transaction itself, which is normally not used by spamfilters.
      Close adherence to RFC standards is something that most SMTP servers have achieved quite well, and the tools the spammers use are very bad at it.
      I know
  • by Ossifer (703813) on Thursday August 03, 2006 @12:55AM (#15837474)
    And I printed out every frame so I could scan them. I'll be posting the TIFFs on my website shortly...
  • by martin-boundary (547041) on Thursday August 03, 2006 @01:13AM (#15837527)
    For those who don't relish downloading 400MB worth of video (why can't somebody cut out the audio as a standalone MP3?), the material of the talk is also available in text mode.

    The official tests of spamfilters were done in last year's TREC conference, you can read the writeup here [uwaterloo.ca] (or pdf overview [uwaterloo.ca]).

    You can duplicate those tests yourself if you download the evaluation toolkit (GPL) [uwaterloo.ca]. It's a modular system where you can add a mail corpus (either one of the public TREC ones, or you can make your own trivially), and add a spamfilter package (there are 10 or so to download from the web, or create your own as per documentation).

    There's also a video talk [researchchannel.org] given at Microsoft research which should cover pretty much the same ground, if text mode is slashdotted :).

    There's a new scheduled test towards the end of the year at TREC 2006.

  • Is there any filter that doesn't give false positives? I don't mean "almost none", I mean zero. It isn't a matter of "holding out for perfect". Some of us simply can't afford to have a key email discarded as "spam".
    • There is no classification system with zero real risk, except for delivering all mail to the Inbox. Sorry.

      If your mail is that important, you should be using couriers instead of email.
      • How many spam do you get a day? I get hundreds. Half of them are not in my native language (much like half the mail in my inbox), which means it takes more than a split-second glance to figure out what is going on. I'd guess my accuracy in split-second decisions is probably on the order of 95%, which if I were a spam filter would earn me a D-. Paul Graham, who probably has more typical email habits when compared with the average Slashdotter, says he misses about 3 per 2,000. http://www.paulgraham.com/w [paulgraham.com]
      • Delivering all mail to the inbox has a real risk: human classification error, which AFAIR tends to run at about 0.1%. This is higher than some automated systems.

        Eivind.

    • Well,

      You could have it filtered completely only if its suspect rating is high enough, and otherwise just tag it if the rating is below a certain point.

      That said... white lists are your friends.

      Funny thing though... someone forwarded me some "funny" e-mail and usually they are not that humorous. I was so damned pleased when it was filtered out.

      That said, I haven't moved to deletion just yet. I just tag the mail and sort it later. As soon as I'm sufficiently happy with the system highly suspect mails can
    • Cloudmark's safetybar product (http://www.cloudmark.com/ - lousy name, SpamNet which it was before was far better) is just about perfect for me. I get an average of about 20 spam emails a day and it has a false positive result of 0% and has had for months. In fact I've been using the product for several years now and I think the last time I saw a false positive was a couple of years back.

      On the efficiency side it has a hit rate of nearly 100%. I would have said it was 100% a couple of months back, but ju
  • by Anonymous Coward on Thursday August 03, 2006 @01:34AM (#15837587)
    Dear Slashdot,
    At the university where I work, they have recently adopted a pesky policy banning the use of bitTorrent.
    What can I do to fix [uwaterloo.ca] this ?
    Yours faithfully,
    Dr. Gord Cormack
  • by abh (22332)
    A 400mb video file? Is this a joke? WTF is everyone thinking that everything on the web needs to be on video all of a sudden. I just blogged about this today: http://www.anotherblogger.com/2006/08/02/please-no -more-gratuitous-videoblogging/ [anotherblogger.com]
  • "In his study he looked at the major spam filters ( DSPAM, SpamAssasian"

    Spam about asian donkeys is a new one on me, though.

  • I use the built in Spam filter in Exchange 2k3 set to level 8. All "filtered" e-mails are archived. I get maybe 3 or 4 a day (on a "bad" day) that make it through. Once a week (or more if I can be bothered) I view the archive and send on any that aren't spam (<1%) on and those that are spam get junked. I do this using a little tool I wrote that displays the From, To and subject of all these e-mails. If I can't tell from these fields whether the e-mail is a SPAM or not (and it generally is anyway) th
    • Er, what? A false positive rate of 1:100!?!?

      Usually, anti-spam solutions which give more than 1:100000 are considered worthless. What you're quoting is beyond words.
      • A false positive rate of 1:100

        No, better than 1:100 - that's what <1% means. It's actually around the 1:500

        Usually, anti-spam solutions which give more than 1:100000 are considered worthless

        Got links, or is that just your opinion?
        • Re:MS Anti Spam... (Score:3, Informative)

          by KiloByte (825081)

          A false positive rate of 1:100

          No, better than 1:100 - that's what <1% means. It's actually around the 1:500

          And thus still 200 times worse than the acceptable rate.

          Usually, anti-spam solutions which give more than 1:100000 are considered worthless

          Got links, or is that just your opinion?

          There was a massive flamefest on debian-devel about spam filtering recently, but false positive ratios in that range were something commonly used by most participants in the discussion. I don't have the time to

  • Anyone care to post a link?
  • by bgog (564818) * on Thursday August 03, 2006 @02:33AM (#15837752) Journal
    Why exactly should we give any weight to anything from an organization so ignorant as to disallow bittorrent? It takes someone pretty darn ignorant to disallow a protocol because some use it to transport illegal content. Why haven't they banned TCP? It is an evil technology used every day to violate copyright.

    This guy should spend his time educating the fools at his institution.
  • by sciop101 (583286) on Thursday August 03, 2006 @02:35AM (#15837755)
    On-line Supervised Spam Filter Evaluation
    Gordon Cormack and Thomas Lynam

    Full Text, May 29, 2006 - PDF Format

    http://plg.uwaterloo.ca/~gvcormac/spamcormack.html / [uwaterloo.ca]

  • GMail Spam Filter (Score:5, Interesting)

    by foxylad (950520) on Thursday August 03, 2006 @02:48AM (#15837790) Homepage

    I use greylisting (gld to be specific) which works wonderfully. A couple of customers wanted even better filtering...

    First I tried DSPAM, but they refused to train it so the results weren't good. Then I tried SpamAssassin, which also let through a surprising amount of spam - a lot more than my personal account on Gmail.

    So I set up accounts on Gmail for them, and forwarded their mail to those accounts (after greylisting - don't want to burden GMail too much!). Gmail lets you set up forwarding, so I simply forwarded all the filtered mail back to a second account on my mailserver for the customer to pick up. Finally I wrote a python script that logs in to Gmail once a week to prevent the account being closed due to non-use.

    A tad involved, but it works like a dream. Yet again Google comes out on top, this time in a market it doesn't even know it's in!
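
For readers unfamiliar with the greylisting step mentioned at the start of this comment, here is a minimal sketch of the general idea: temporarily reject mail from an unseen (client IP, sender, recipient) triplet and accept it once the sending server retries after a delay. This is not how gld itself is implemented. The class name, the 300-second default, and the reply strings are assumptions for illustration.

```python
import time


class Greylist:
    """Toy greylist keyed on (client IP, envelope sender, envelope recipient)."""

    def __init__(self, min_delay=300, clock=time.time):
        self.min_delay = min_delay  # seconds a new triplet must wait
        self.clock = clock          # injectable so the logic can be tested
        self.first_seen = {}

    def check(self, client_ip, sender, recipient):
        """Return an SMTP-style reply for this delivery attempt."""
        key = (client_ip, sender.lower(), recipient.lower())
        now = self.clock()
        # Remember when the triplet was first seen; later attempts reuse it.
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.min_delay:
            return "250 OK"
        return "451 4.7.1 Greylisted, please try again later"
```

Legitimate mail servers queue and retry after a temporary 4xx failure, so their mail arrives a few minutes late. Much spamware of the fire-and-forget kind never retries, which is where the filtering comes from. A real deployment would also expire old triplets and whitelist known senders, both omitted here.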

    • Re:GMail Spam Filter (Score:2, Interesting)

      by sd.fhasldff (833645)

      This is actually something Google could sell. Access to their mail filter. I do realize that they have "corporate email", but that still smacks a lot of GMail and some businesses would rather avoid that. Instead, they could provide a simple access to their spam filter. Yes, requiring all email to be piped through a Google server if they don't want to make the filter available as a binary (presumably updated regularly).

      To minimize bandwidth consumption and (partly, at least) allay privacy / corporate secre

  • So Which One Won? (Score:2, Interesting)

    by ryanisflyboy (202507)
    So which one is the "unheard of spam filter?"

    Wouldn't it make sense to put this in the /. submission (or at least a link).

    Did I miss the obvious "and the winner is..." some place?
  • Cloudmark's SpamNet (Score:3, Interesting)

    by cruachan (113813) on Thursday August 03, 2006 @03:40AM (#15837906)
    I have to push this as it usually gets missed from reviews as it's a hybrid P2P solution and not a straightforward filter, but Cloudmark's safetybar product (http://www.cloudmark.com/) is just about perfect for me. I get an average of about 20 spam emails a day and it has a false positive result of 0% and has had for months. In fact I've been using the product for several years now and I think the last time I saw a false positive was a couple of years back.

    On the efficiency side it has a hit rate of nearly 100%. I would have said it was 100% a couple of months back, but just recently it's been having a bit of a problem with one stock-pushing spam.

    Anyway, that aside it's the best spam filter I've ever seen by a very long way, and I'd highly recommend the service. It costs a few $ a month, but it's probably the best value subscription I have.

    I have no connection with the company, just a very satisfied customer who's been using it since the beta some years ago. I have a publically available email address which I've had for years and must be on many spam lists, without Cloudmark it would be unusable, with it it's no problem at all. I recently installed it for my wife who was starting to get a lot of spam - on that I noticed it took about two weeks to get it trained not to junk a few mailing list emails she was on, but after that it's been just as highly reliable as my installation.
  • IMHO, the criterion for best spam filter is very simple. It is the filter that is able to consistently maintain the highest spam to false positive ratio.

    Feel free to add to it. :D
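
Since false positive rates come up repeatedly in this thread, here is a minimal sketch of how the two numbers people keep quoting can be computed from a labeled test run. The function name and the result format are assumptions for illustration, not the metrics code of any evaluation toolkit.

```python
def filter_metrics(results):
    """Summarize a test run.

    results: iterable of (actual, predicted) pairs, each 'spam' or 'ham'.
    """
    ham_total = spam_total = false_positives = spam_caught = 0
    for actual, predicted in results:
        if actual == "ham":
            ham_total += 1
            false_positives += predicted == "spam"  # good mail junked
        else:
            spam_total += 1
            spam_caught += predicted == "spam"      # spam correctly flagged
    return {
        "false_positive_rate": false_positives / ham_total if ham_total else 0.0,
        "spam_caught_rate": spam_caught / spam_total if spam_total else 0.0,
    }
```

With this definition, "1 in 500" means one legitimate message misfiled per 500 legitimate messages, a rate of 0.002, which is 200 times a 1 in 100000 rate. The two rates trade off against each other, so a single ratio hides which side a filter is erring on.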
  • by prandal (87280) on Thursday August 03, 2006 @04:09AM (#15837975)
    This paper's a complete waste of time.

    He tested spamassassin 2.3 - that's ancient! I'd imagine the other tools are similarly obsolete.

    We currently use SA 3.1.4 with a well-trained Bayes database and Razor, Pyzor, and DCC.

    Throw in a few custom rules and a selection of rules from http://www.rulesemporium.com/ [rulesemporium.com] and the results are outstanding.

    With the new sa-update feature the core rules are updated between point releases, which came in useful this week dealing with the new image spams which seemed to be designed to avoid detection by spamassassin. Thanks Theo.

    And the folk on the spamassassin-users mailing list really rock.
    • by gvc (167165) on Thursday August 03, 2006 @09:18AM (#15838974)
      I assume the paper that you are describing is the 2004 study [uwaterloo.ca]. The paper described in the talk (which was given 6 months ago or so) described results of the TREC 2005 Spam Track [uwaterloo.ca] which took place in November 2005. It included a test SpamAssassin 3.x, not 2.3.

      TREC 2006 [nist.gov] evaluations are now underway [uwaterloo.ca].

      While it is reasonable to conjecture that spam has changed so as to defeat spam filtering techniques, or will change so as to defeat the PPM technique that did well at TREC, the historical evidence does not support this conjecture. In particular:

      • The spam filters tested in 2004 give pretty well exactly the same performance on 2005 and 2006 data.
      • New versions of the filters are a little bit better, but not by leaps and bounds, and also get about the same results over the last 2.5 years of data.
      • There is no evidence that "Bayesian poisoning" is a viable technique for defeating statistical spam filters in anything but a very artificial laboratory environment where the poisoner has access to the recipient's inbox
      The subject of the paper -- and the talk -- is primarily about testing methodology and the need for controlled scientific investigation. So I hesitate to endorse the simplistic notion of a "winner" of the TREC evaluation. However the technique that did very well [ai.ijs.si] was indeed quite novel, so here's a characterization.
      Andrej Bratko used PPM -- a well-known data compression technique -- to compress ham and spam separately. Well, actually he didn't compress them but just built the statistical model necessary to compress them. Then he simply (tentatively) added the unknown message to each model and chose the one that compressed it best. The general technique of using compression has been mentioned here and elsewhere, but Bratko used a much stronger compression scheme and was somewhat clever about it.
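
To make the idea concrete, here is a minimal sketch of compression-based classification. It is not Bratko's filter: it swaps in zlib (DEFLATE) for PPM because zlib ships with Python, and it re-compresses the whole corpus each time instead of tentatively updating a statistical model. The names `compressed_size`, `extra_cost`, `classify`, `HAM` and `SPAM` are made up for illustration, and the two tiny corpora are toy data.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of data after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def extra_cost(corpus: bytes, message: bytes) -> int:
    """Extra compressed bytes needed when message is appended to corpus."""
    return compressed_size(corpus + message) - compressed_size(corpus)


def classify(message: bytes, ham: bytes, spam: bytes) -> str:
    """Pick the class whose corpus compresses the message more cheaply.

    Ties go to "ham", on the assumption that a false positive costs more
    than a missed spam.
    """
    ham_cost = extra_cost(ham, message)
    spam_cost = extra_cost(spam, message)
    return "spam" if spam_cost < ham_cost else "ham"


# Toy corpora, invented for this example only.
HAM = (
    b"Hi all, the meeting has moved to Thursday at ten. "
    b"Please review the attached budget report before the call. "
    b"Thanks for sending the draft, I will read it tonight. "
)
SPAM = (
    b"Buy cheap pills online now! Limited offer, click here for discount pills. "
    b"You have won a free prize, claim your free prize today! "
    b"Lowest prices guaranteed, order now and save money. "
)

if __name__ == "__main__":
    for msg in (
        b"claim your free prize today! click here for discount pills.",
        b"Please review the attached budget report before the call.",
    ):
        print(msg.decode(), "->", classify(msg, HAM, SPAM))
```

One caveat with the stand-in: zlib only looks back 32 KB, so with realistically sized corpora most of the training text would be out of reach of the compressor. A model-based scheme such as PPM or DMC does not have that limitation, which is one reason a stronger compressor matters here.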

      I later reproduced Bratko's results using DMC -- a compression scheme that I invented 20 years ago -- and got some interesting results. We have a journal article in press describing it and also an evaluation paper at CEAS 2006 [www.ceas.cc].

      Bratko A., Cormack G. V., Filipic B., Lynam T. R. and Zupan B., Spam Filtering Using Statistical Data Compression Models [uwaterloo.ca]

  • It is a war (Score:3, Insightful)

    by Alain Williams (2972) <addw@phcomp.co.uk> on Thursday August 03, 2006 @04:42AM (#15838039) Homepage
    Spam is a war between the spammers and the system administrators/spam filters. The spam filters adopt a new technique; the spammers then work round it; the spam filters advance; ...

    By the time that I have downloaded the video the war will have moved on a couple of iterations ...

  • by bytesex (112972)
    It looks like another win for compression algorithms. Not only do they maximize entropy in your data while shortening it, they can also be used successfully to earmark pieces of text as being written in a certain language, or written by a certain author, and now they can be used for spam detection. The usefulness just keeps on coming. Colour me impressed.
  • by dodobh (65811) on Thursday August 03, 2006 @06:57AM (#15838364) Homepage
    See here [vix.com]

    The key paragraph:

    If you'd like a more topical example, consider "spam". People began altering their e-mail "From:" lines in order to make their addresses harder to guess or aggregate; people began doing pattern matching in order to catch known-bad messages and either sideline or reject them. Many defenders used many small tricks to protect their inboxes. The result has not been that less spam is sent or even that less spam is received, on an aggregate basis. Things are worse now than they've ever been. (I say this as co-founder of MAPS LLC, by which I hope to establish my credentials in the spam field for those of you who do not know me.) Today a small number of highly advanced defenders is spam-immune only because they are a small number and their techniques are not widely effective against the attackers; and a small number of highly advanced attackers can "spam at will" a far larger population than ever before. And the trend is that things are getting worse, and getting worse faster than ever before.
  • Dspam floats my boat (Score:3, Informative)

    by Zzeep (682115) <kenneth@vangrinsv[ ]com ['en.' in gap]> on Thursday August 03, 2006 @07:20AM (#15838419) Homepage
    I receive (no kidding) around 600 spam mails per day, versus approximately 30 real e-mails. I've been using dspam for over a year now (with very faithful training), and there is maybe 1 false positive every few weeks (less than 1 in 10,000), and every few days a few (usually "new") spam mails get through, which I of course immediately train on, so that I never see that kind again. So I am very, very positive about dspam. What I do miss, though, is something like a good and reliable service (better than the RBLs I know) that can block SMTP clients on the fly (like DSL home users and such) to reduce the immense load on our mailservers (I work for an ISP) caused by all the spam (which also has to go through a virus scanner, clamav).
  • Sadly, the way this was done, there is no way to test how well Greylisting [puremagic.com] would have helped.

    IMarv
  • by gvc (167165) on Thursday August 03, 2006 @10:36AM (#15839587)
    Here are the slides from the 400MB video presentation. [uwaterloo.ca]
