
News for nerds, stuff that matters

Working Bayesian Mail Filter

Posted by CmdrTaco on Sun Nov 03, 2002 01:05 PM
from the stuff-to-play-with dept.
zonker writes "A real, working honest to god Bayesian spam filter. I've been waiting for something like this for a while (since I first read Paul Graham's research paper on this very topic a few weeks ago). Well here's POPFile, a small but extremely effective Perl script that runs on just about any system Perl does. After just a little training I was able to get very effective filtering out of it. From what I understand the new email client that comes with OS X Jaguar has a feature similar to this, but I don't know if it is true Bayesian. Hopefully this kind of feature will become more prevalent in client software as I see the Google results for it are growing."
This discussion has been archived. No new comments can be posted.
The Fine Print: The following comments are owned by whoever posted them. We are not responsible for them in any way.
  • Whas that? by cos(0) (Score:2) Sunday November 03 2002, @01:08PM
    • Re:Whas that? by Raul654 (Score:1) Sunday November 03 2002, @01:11PM
    • Re:Whas that? (Score:4, Informative)

      by DalTech (575476) on Sunday November 03 2002, @01:16PM (#4589064)
      Bayesian is statistical theory and methods useful in the solution of theoretical and applied problems in science, industry and government. http://www.bayesian.org/
      [ Parent ]
      • 1 reply beneath your current threshold.
    • Bayes Explained by brw215 (Score:1) Sunday November 03 2002, @01:18PM
      • Re:Bayes Explained by Stonehand (Score:1) Sunday November 03 2002, @01:24PM
      • Re:Bayes Explained (Score:5, Informative)

        by johnynek (36948) <boykin@pobox.com> on Sunday November 03 2002, @01:36PM (#4589191) Homepage
        That's /. for you. You guys have modded up to 5 a post that is wrong in both of the equations it posts.

        It should be:

        Pr(h|D) = Pr(D|h) * Pr(h) / Pr(D)

        and:

        Pr("SPAM"|Email) = Pr(Email|"SPAM") * (proportion of spam) / (probability of getting this particular Email)
        [ Parent ]
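The denominator in the corrected formula, the probability of getting this particular Email, is rarely measured directly. Assuming every message is either spam or not (the "HAM" label below is an assumed name for the non-spam class), the law of total probability expands it into quantities a filter can estimate from its training mail:

```latex
\Pr(\text{SPAM} \mid \text{Email}) =
  \frac{\Pr(\text{Email} \mid \text{SPAM}) \, \Pr(\text{SPAM})}
       {\Pr(\text{Email} \mid \text{SPAM}) \, \Pr(\text{SPAM})
        + \Pr(\text{Email} \mid \text{HAM}) \, \Pr(\text{HAM})}
```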
      • Re:Bayes Explained by capt.Hij (Score:2) Sunday November 03 2002, @01:47PM
      • Re:Bayes Explained (Score:4, Informative)

        by Jim Nugent (619564) on Sunday November 03 2002, @02:55PM (#4589647)
        To put this in simpler terms, consider this scenario: 90% of all X-rays from women with breast cancer have a certain feature. That is an easy statistic to compute; you have the X-rays and you follow up with the patients.

        The trick is to derive a statement like: "If an X-ray has this feature, the patient has an NN% chance of having breast cancer." THAT's useful for screening, but it doesn't follow from the first statement (without some serious statistical calculations).

        Bayes' theorem has all sorts of applications in prediction. In the case of E-mail, we can greatly oversimplify and say "We found that X% of spam E-mails have this subject line." "We conclude that an E-mail with this subject line has Y% odds of being spam." Note that these are two very different statements. If we can find Y for the second statement and set a threshold we're comfortable with, say, 95%, then we can create a filter with 95% confidence of correctness; it may well be wrong 5% of the time.

        Other responses have done a good job with the math so I won't repeat it here.
        [ Parent ]
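A small worked example may make the gap between the two kinds of statement concrete. The numbers below are hypothetical, chosen only for illustration: suppose 90% of X-rays from patients with cancer show the feature, 10% of X-rays from healthy patients show it too, and 1% of screened patients have cancer. The function name `posterior` is likewise just an illustrative choice.

```python
def posterior(p_feature_given_cancer, p_feature_given_healthy, p_cancer):
    """Pr(cancer | feature) via Bayes' theorem with two classes."""
    p_healthy = 1.0 - p_cancer
    # Total probability of seeing the feature in any X-ray.
    p_feature = (p_feature_given_cancer * p_cancer
                 + p_feature_given_healthy * p_healthy)
    return p_feature_given_cancer * p_cancer / p_feature


# Hypothetical rates, not real medical data.
p = posterior(p_feature_given_cancer=0.90,
              p_feature_given_healthy=0.10,
              p_cancer=0.01)
print(f"Pr(cancer | feature) = {p:.3f}")  # 0.083
```

With these assumed rates the answer is roughly 8%, not 90%, because healthy patients vastly outnumber sick ones. The same base-rate effect applies to mail: how often a subject line appears in spam says little until it is weighed against how often it appears in legitimate mail.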
      • 1 reply beneath your current threshold.
    • Re:Whas that? (Score:5, Funny)

      by Evil Adrian (253301) on Sunday November 03 2002, @01:19PM (#4589092) Homepage
      If you had just clicked the POPFile [sourceforge.net] link, you would see the explanation.

      Initiative is your friend.

      Hyperlinks are your friend.

      Don't be afraid, just click.
      [ Parent ]
      • Re:Whas that? (Score:5, Informative)

        by sfe_software (220870) on Sunday November 03 2002, @02:02PM (#4589341) Homepage
        If you had just clicked the POPFile link, you would see the explanation.

        I also highly recommend this link [paulgraham.com], as it goes into quite a lot of detail on this filtering technique. After reading it, I am going to give the Perl variation a shot.
        [ Parent ]
      • 1 reply beneath your current threshold.
    • Re:Whas that? (Score:5, Informative)

      by dvk (118711) on Sunday November 03 2002, @01:19PM (#4589094) Homepage
      From what I understand, it is a mail filter which determines what to filter out based on a statistics-based machine learning system called "Bayesian Learning".

      A couple of URLs quickly found on Google:
      http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-7.html [faqs.org]
      http://www.csse.monash.edu.au/courseware/cse5230/assets/images/week09.pdf [monash.edu.au]

      Also, any decent AI/machine learning textbook ought to cover the topic.

      -DVK

      [ Parent ]
    • 2 replies beneath your current threshold.
  • True "Bayesian" and do I care? by Dog and Pony (Score:1) Sunday November 03 2002, @01:08PM
  • spambayes.sf.net (Score:5, Informative)

    by supton (90168) on Sunday November 03 2002, @01:10PM (#4589019)
    Saw this a few weeks back... [sf.net] Spam filter in Python using Naive Bayes.
  • Sure it's promising (Score:4, Insightful)

    by bigberk (547360) <bigberk@users.pc9.org> on Sunday November 03 2002, @01:12PM (#4589042)
    And I'm going to check it out right now :) But one long-standing fear I have with such solutions is spammers adapting to new environments (changing the wording used, making the emails look more professional). Sure, they're dumb shits but they're still humans with brains.
    • Re:Sure it's promising (Score:5, Informative)

      by outlier (64928) on Sunday November 03 2002, @01:26PM (#4589136)
      While spammers will undoubtedly continue to refine the content of their messages, one of the strengths of using a Bayesian filter like this is that it uses the user's own spam and non-spam (ham) as the basis for its calculations. This means that messages are categorized not only by whether they contain spammy words, but also by whether they contain the hammy words from your own messages. So, even if spammers could refrain from using words like "free", "mortgage", "sluts", and "spam", they probably wouldn't use words that discriminate your own ham from others (e.g., if you are a computer scientist, your mail may include hammy words like "algorithm", "compile", "project", or "stargate" that would help distinguish ham from spam). The challenge to the spammer would then be to target you with spam that looks like *your* ham (which is probably different from the ham of others).

      Future systems (assuming faster processors and more HD space) could include semantic analysis (e.g., Latent Semantic Analysis) to do an even better job and go beyond the word level.
      [ Parent ]
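The idea of scoring a message against the user's own spam and ham can be sketched in a few lines. This is a simplified naive Bayes filter, not the code of POPFile or any other tool named in this thread; the class name `WordFilter`, the crude tokenizer, and the training sentences are all assumptions made for illustration. It only scores words that are present in the message and uses add-one smoothing.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())


class WordFilter:
    def __init__(self):
        # Per class: in how many training messages each word appeared.
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.counts[label].update(set(tokenize(text)))
        self.messages[label] += 1

    def spam_log_odds(self, text):
        """log(Pr(spam | text) / Pr(ham | text)), treating words as independent."""
        score = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for word in set(tokenize(text)):
            # Add-one smoothing so an unseen word never yields log(0).
            p_spam = (self.counts["spam"][word] + 1) / (self.messages["spam"] + 2)
            p_ham = (self.counts["ham"][word] + 1) / (self.messages["ham"] + 2)
            score += math.log(p_spam / p_ham)
        return score


f = WordFilter()
f.train("free mortgage rates, act now", "spam")
f.train("free pills, lowest price", "spam")
f.train("the compile step of the algorithm project fails", "ham")
f.train("stargate is on tonight, project meeting moved", "ham")

print(f.spam_log_odds("free mortgage offer") > 0)         # True: leans spam
print(f.spam_log_odds("project algorithm question") < 0)  # True: leans ham
```

Because the ham counts come from one person's mailbox, words such as "project" pull a message toward ham for this user only, which is the property the comment above relies on.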
      • Re:Sure it's promising (Score:4, Informative)

        by rgmoore (133276) <glandauer@charter.net> on Sunday November 03 2002, @01:47PM (#4589262) Homepage

        Another important point is that there are some things that they can't hide, at least not in their current working model. If they're trying to sell you something, they have to describe what that thing is and where you can get it, and those descriptions are unlikely to be in any legitimate email. If they want to advertise a web site, they have to include its URL in the message, and the filter can catch that. If they advertise a physical address or phone number, the system can catch those, too. If they don't repeat the message, it means that there's inherently less spam, because I'm only seeing each ad once.

        It's also not possible to disguise everything in their headers, so things like their posting host (either the one they pay for legitimately or any open relay they're taking advantage of) will wind up being a pointer to who they are. They certainly can't change anything about the headers that's added downstream of their posting host, so as long as they keep using the same one it's likely that there will be characteristic stamps there that the spammers absolutely can't change. I know that analysis of the headers is part of bogofilter [sourceforge.net], another Bayesian filter that I've been using to good effect.

        [ Parent ]
        • by devphil (51341) on Sunday November 03 2002, @06:30PM (#4590909) Homepage


          So, the graduate CS course I'm taking this quarter is Evolutionary Computing, which is all about the convoluted nonlinear multidimensional-search-space problems, and guess what our current homework is? That's right, taking statistics on spam data, and using genetic algorithms to evolve a working spam filter.

          Due to one typo and two thinkos in my fitness evaluation function, my algorithm evolves -- within only a few dozen generations -- a solution which looks like this:

          Ignore the actual contents of the message. 34% of the time, it's spam.

          And it's right.

          [ Parent ]
          • 1 reply beneath your current threshold.
        • Re:Sure it's promising by Daytona955i (Score:1) Sunday November 03 2002, @10:45PM
        • Re:Sure it's promising by Autonomous Crowhard (Score:2) Wednesday November 06 2002, @01:09PM
      • Re:Sure it's promising by marmoset (Score:2) Sunday November 03 2002, @02:47PM
      • Re:Sure it's promising (Score:4, Interesting)

        by Tim Browse (9263) on Sunday November 03 2002, @03:51PM (#4589991)
        One interesting fact that came out of these statistical analyses of spam was from one that was featured a while back on slashdot - the guy was doing word analysis, and was looking for good spam indicators/correlations, and expected "sex" or "teens" to be a good match, but the best word was, surprisingly, "ff0000". This was because so much spam uses HTML mail with red text.

        So if nothing else, it will force spammers to stop using red text - that has to be some kind of victory :-)

        Tim
        [ Parent ]
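A token like "ff0000" can only surface if the tokenizer reads the raw message source, markup included, instead of the rendered text. The sketch below assumes exactly that and ranks tokens by a smoothed spam probability; the function names and the four sample messages are made up for illustration and are not taken from the analysis mentioned above.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z0-9]+")


def tokens_including_markup(raw_message):
    """Distinct lowercase tokens from the raw source, HTML attributes included."""
    return {t.lower() for t in TOKEN.findall(raw_message)}


def rank_by_spamminess(spam_msgs, ham_msgs):
    """Return all tokens, most spam-indicative first."""
    spam_df, ham_df = Counter(), Counter()
    for m in spam_msgs:
        spam_df.update(tokens_including_markup(m))
    for m in ham_msgs:
        ham_df.update(tokens_including_markup(m))

    def score(tok):
        # Smoothed share of spam / ham messages containing the token.
        p_s = (spam_df[tok] + 1) / (len(spam_msgs) + 2)
        p_h = (ham_df[tok] + 1) / (len(ham_msgs) + 2)
        return p_s / (p_s + p_h)

    vocab = set(spam_df) | set(ham_df)
    return sorted(vocab, key=lambda t: (-score(t), t))


# Hypothetical sample messages.
spam = ['<font color="#FF0000">Hot teens</font>',
        '<font color="#ff0000">Refinance now</font>']
ham = ['Are the teens coming to dinner?',
       '<font color="#000000">Meeting notes</font>']

ranked = rank_by_spamminess(spam, ham)
print(ranked[0])  # ff0000
```

In this toy corpus "teens" appears in both spam and ham, so it scores as neutral, while the red color code appears only in spam and ranks first.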
      • Welcome to the future by disarray (Score:3) Sunday November 03 2002, @04:55PM
      • Re:Sure it's promising (Score:5, Funny)

        by Alsee (515537) on Sunday November 03 2002, @07:48PM (#4591298) Homepage
        (e.g., if you are a computer scientist, your mail may include hammy words like "algorithm" "compile" "project" or "stargate" that would help distinguish ham from spam.

        I have a cousin that lives in Nigeria and we regularly discuss tips on penis enlargement. He works at a bank refinancing mortgages and his wife is a professor at an accredited university. I work in a Las Vegas casino producing shows featuring live nude showgirls. He offered to help me pay some bills and get out of debt (a generous offer, but I told him I just found a second part time job working from home earning thousands of dollars per week). My wife is a stock broker and I regularly let my cousin in on hot stock tips. I have an herb garden, I take viagra, and use rogaine. Since we both own the same brand of printer we've been working out the best way to refill the ink cartridges. I've been trying to lose weight, but it comes right back as soon as I quit smoking.

        I don't quite understand this "beysian filter" stuff, but I can't wait to try it out!

        -
        [ Parent ]
      • 2 replies beneath your current threshold.
    • Re:Sure it's promising by bmwm3nut (Score:2) Sunday November 03 2002, @01:26PM
      • Re:Sure it's promising (Score:4, Informative)

        by rgmoore (133276) <glandauer@charter.net> on Sunday November 03 2002, @01:56PM (#4589313) Homepage

        Bogofilter [sourceforge.net] comes close to this. It has an operating mode where each file that it filters is automatically added to the appropriate corpus, either of spam or non-spam. Since it's correct the vast majority of the time, that means that there's very little for the user to do. When it is wrong, you just take the messages that it miscategorized and feed them back into the system with the notation that they were originally marked incorrectly, and it backs out the changes to the wrong category and adds them to the correct category.

        I'm using bogofilter with Evolution [ximian.com], and it works very well. I just have two extra folders, one for false negatives and one for false positives. When I notice mail that's been flagged incorrectly, I drag it into the appropriate folder and run a script that tells bogofilter to correct its mistake. Then I either flush the mail (if it was spam marked as non-spam) or process it normally (if it was non-spam marked as spam). I've only been using it for about two weeks and it already has a nearly zero false positive rate (i.e. incorrectly flagged as spam) and a usefully low false negative rate (i.e. incorrectly flagged as legitimate).

        [ Parent ]
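The correction step described above (back a message out of the wrong category, then add it to the right one) can be sketched as follows. This is not bogofilter's actual implementation or storage format; the `Corpus` class and its method names are hypothetical, and it assumes the message was registered earlier with exactly the same tokens.

```python
from collections import Counter


class Corpus:
    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def register(self, tokens, label):
        """Add a message's tokens to the given category."""
        self.words[label].update(tokens)
        self.messages[label] += 1

    def correct(self, tokens, wrong_label):
        """Undo a wrong registration and re-register under the other label."""
        right_label = "ham" if wrong_label == "spam" else "spam"
        self.words[wrong_label].subtract(tokens)
        # Unary plus drops entries whose count fell to zero or below.
        self.words[wrong_label] = +self.words[wrong_label]
        self.messages[wrong_label] -= 1
        self.register(tokens, right_label)


c = Corpus()
c.register(["cheap", "pills"], "ham")           # auto-filed, but wrongly
c.correct(["cheap", "pills"], wrong_label="ham")
print(c.messages)                                # {'spam': 1, 'ham': 0}
```

Subtracting before re-adding matters: simply training the message as spam would leave its words counted as ham as well, and the two counts would partly cancel out.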
      • Re:Sure it's promising by Helter (Score:1) Sunday November 03 2002, @02:00PM
    • Re:Sure it's promising by Theodore Logan (Score:2) Sunday November 03 2002, @01:27PM
    • Re:Sure it's promising by wheany (Score:1) Sunday November 03 2002, @01:37PM
    • Re:Sure it's promising by chrsbrwn (Score:1) Sunday November 03 2002, @01:40PM
    • by Bob9113 (14996) on Sunday November 03 2002, @01:44PM (#4589237) Homepage
      This may be self-regulating. Consider the Chinese room; if something is capable of perfectly emulating recognition of Chinese, then it can be said to recognize Chinese. Likewise, if a spammer becomes sufficiently skilled at writing undetectable prose, he or she will have reached a skill level at which he or she can pursue more profitable writing ventures. The margins in spam are pretty small. Those spams are being written by morons because morons are cheap.
      [ Parent ]
    • Re:Sure it's promising by Brendan Byrd (Score:2) Sunday November 03 2002, @01:52PM
    • Re:Sure it's promising (Score:4, Insightful)

      by tsg (262138) on Sunday November 03 2002, @03:46PM (#4589960)
      Any solution that requires spammers to be more clever is going to reduce the number of spammers. And that is the end goal.
      [ Parent ]
    • Re:The decimal issue by Spock the Baptist (Score:2) Sunday November 03 2002, @01:58PM
    • Re:Why hex and binary? by Anonymous DWord (Score:1) Sunday November 03 2002, @01:59PM
    • 1 reply beneath your current threshold.
  • Server-side solutions? (Score:3, Interesting)

    by Quixote (154172) on Sunday November 03 2002, @01:12PM (#4589046) Homepage Journal
    Any server-side solutions (MTA==qmail, MDA==procmail) using this (Naive-Bayesian) technique out there?
  • by AT (21754) on Sunday November 03 2002, @01:14PM (#4589055)
    The mozilla mail client is getting a Bayesian mail filter, too. See http://bugzilla.mozilla.org/show_bug.cgi?id=163188 . Unfortunately, it probably won't show up until after version 1.2 is released.
  • That Google search... (Score:4, Insightful)

    by Jugalator (259273) on Sunday November 03 2002, @01:15PM (#4589061) Journal
    Try searching for "bayesian email filter" instead of just "bayes email filter" (as in the news post). You'll get better results and more hits since Google doesn't match "*bayes*" (as one would think) when searching for "bayes", but only the actual word "bayes".
  • by davids-world.com (551216) on Sunday November 03 2002, @01:16PM (#4589062) Homepage
    A true Bayesian filter, wow. Let's face it, statistical classifiers based on Bayes' formula are not really state of the art. They make false assumptions about the data (independence of features).

    More intelligent classification algorithms can solve non-linear problems far better. Check out Kernel Machines [kernel-machines.org] and, somewhat older, Maximum Entropy models.

    Enough nerd talk for today :-)
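For reference, the independence assumption being criticized can be written out. A naive Bayes filter approximates the probability of seeing a whole message, taken here as words w_1 through w_n (notation assumed for this note), by multiplying per-word probabilities as if each word occurred independently of the others given the class:

```latex
\Pr(w_1, \dots, w_n \mid \text{SPAM}) \approx \prod_{i=1}^{n} \Pr(w_i \mid \text{SPAM})
```

Real text violates this (for example, "mortgage" and "refinance" tend to occur together), which is the false assumption the comment refers to.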

  • Forget Bayes (Score:5, Funny)

    by Evil Adrian (253301) on Sunday November 03 2002, @01:16PM (#4589070) Homepage
    We need the Stalin Mail Filter (TM) -- it detects spam, hunts down the spammer, and exiles them to Siberia.
  • *BUT* it's a Perl script... by pilot1 (Score:2) Sunday November 03 2002, @01:17PM
  • I don't get any spam (Score:3, Funny)

    by Istealmymusic (573079) on Sunday November 03 2002, @01:17PM (#4589076) Homepage Journal
    Can someone explain why this filter would be useful to me?
  • bogofilter (Score:4, Informative)

    by stype (179072) on Sunday November 03 2002, @01:18PM (#4589084) Homepage
    This isn't exactly the first Bayesian mail filter out there. I've been using ESR's bogofilter [tuxedo.org] for weeks now, and I must say it works better than I could have ever imagined. Bogofilter however is simply for sorting out spam, while it appears this filter can sort out other things. But honestly, I can set up some simple filters to separate personal emails from work emails, so I'm not entirely sure the extra stuff is that useful.
    • Re:bogofilter by Theodore Logan (Score:2) Sunday November 03 2002, @01:44PM
      • Re:bogofilter by SubtleNuance (Score:2) Sunday November 03 2002, @09:45PM
        • Re:bogofilter by Theodore Logan (Score:2) Monday November 04 2002, @08:40AM
    • 1 reply beneath your current threshold.
  • IMAP by Evil Adrian (Score:2) Sunday November 03 2002, @01:22PM
    • Re:IMAP by LetterJ (Score:2) Sunday November 03 2002, @02:05PM
    • Re:IMAP by uksv29 (Score:1) Sunday November 03 2002, @02:14PM
      • Re:IMAP by uksv29 (Score:1) Sunday November 03 2002, @02:27PM
    • Re:IMAP by RustyTaco (Score:1) Sunday November 03 2002, @02:22PM
      • Re:IMAP by vondo (Score:2) Sunday November 03 2002, @05:22PM
    • Re:IMAP by vondo (Score:2) Sunday November 03 2002, @02:39PM
    • Mail.app by Arker (Score:2) Sunday November 03 2002, @03:05PM
      • 1 reply beneath your current threshold.
    • Re:IMAP by Drakonian (Score:1) Sunday November 03 2002, @06:02PM
  • Mozilla integration by Powerdog (Score:1) Sunday November 03 2002, @01:24PM
    • 1 reply beneath your current threshold.
  • Um. No. by 3-State Bit (Score:1) Sunday November 03 2002, @01:25PM
    • Re:Um. No. by judd (Score:2) Sunday November 03 2002, @01:36PM
    • Re:Um. No. by jjo (Score:2) Sunday November 03 2002, @01:38PM
    • Re:Um. No. by Fastolfe (Score:1) Sunday November 03 2002, @01:43PM
    • Re:Um. No. by rgmoore (Score:2) Sunday November 03 2002, @02:05PM
    • Re:Um. No. by kirkjobsluder (Score:1) Sunday November 03 2002, @02:19PM
    • Re:Um. No. by dvdeug (Score:2) Sunday November 03 2002, @02:30PM
    • Re:Um. Yes by crisco (Score:2) Sunday November 03 2002, @02:54PM
    • 2 replies beneath your current threshold.
  • product of marketrons by hfastedge (Score:2) Sunday November 03 2002, @01:26PM
  • As effective as a well trained secretary by Gribflex (Score:1) Sunday November 03 2002, @01:29PM
  • Not integrated solution by unfortunateson (Score:2) Sunday November 03 2002, @01:32PM
  • You know what I'd kill for? (Score:3, Interesting)

    by Saint Aardvark (159009) on Sunday November 03 2002, @01:34PM (#4589180) Homepage Journal
    A version of this for Outlook Express.

    I work on the helpdesk of a small ISP; I also take care of the spam filtering, and answer abuse@. We recently added SpamAssassin, and God does it rock [dowco.com]. (The big spike you see is me getting MRTG to graph what SA catches now; it's 6-10 times better than what we used to catch.)

    But I still get complaints from our customers about spam that gets through. Just the other day a crapload got through because it was relatively subdued spam (no webbugs, NO LINE OF YELLING, etc); unfortunately, it also advertised pictures of young boys having sex. It's hard to explain why it's very, very hard to filter for this sort of thing, especially when I'm going through the talk for the nth time this week. (I need a good analogy that non-geeks can understand; I'm still looking.)

    The good folks at DeerSoft [deersoft.com] have a version of SpamAssassin for Outlook, and are promising one for OE Real Soon Now. But I would loooooooooooooooooooooooove a good spam program -- this or SA or something else -- that I could point our customers to. Download, double-click, say yes, and bam it's installed. I can figure out how to install this on a Unix box; I could probably, eventually figure out how to do it on a Windows box; there's no way the customers could do it.

    Or am I missing good, free spam filtering for Windows? Can anyone point me in the right direction?

    Slightly OT: There has got to be a huge market for setting up spam filtering for small businesses. My idea: Tell 'em that if they provide the box -- an old Pentium or 486 will do -- I'll set up spam filtering and a firewall on it, set up some maintenance tools (whitelist this, firewall that). They get great mail service, I get $x00.

  • SquirrelMail has a Bayesian plug-in (Score:4, Informative)

    by ptbarnett (159784) on Sunday November 03 2002, @01:37PM (#4589200)
    Plugins - BayesSpam - Intelligent Spam Filter [squirrelmail.org]

    SquirrelMail [squirrelmail.org] is a WebMail client implemented in PHP. I use the client, but not the plugin (I use Razor [sourceforge.net]).

  • Uhmm.. like bogofilter? (Score:3, Informative)

    by Jamuraa (3055) on Sunday November 03 2002, @01:42PM (#4589231) Homepage Journal
    Bogofilter [tuxedo.org] has been out since August, and does this Bayesian spam-stuff in C, which probably will run a bit faster than the Perl or Python versions just because of its compiled-ness. I've never run it myself, but people on Debian lists say it works better [debian.org] or not as well [debian.org] as SpamAssassin [spamassassin.org].
  • Risk management by hansroy (Score:1) Sunday November 03 2002, @01:44PM
  • Statistics are cool. by Fuzzums (Score:1) Sunday November 03 2002, @01:46PM
  • Ximian Evolution? by Namtar (Score:1) Sunday November 03 2002, @01:47PM
  • Spam will be spam by dazdaz (Score:1) Sunday November 03 2002, @01:53PM
  • Staged Categories by irritating environme (Score:2) Sunday November 03 2002, @02:05PM
  • Where's the news? (Score:4, Informative)

    by Roadmaster (96317) <roadmr@@@entropia...com...mx> on Sunday November 03 2002, @02:09PM (#4589364) Homepage Journal
    Just because it's the first one that actually makes the slashdot frontpage it doesn't mean it's the only one.

    Do a freshmeat search for bayespam, bogofilter and spamprobe, they're all working and quite mature bayesian filters (or should we say "paulgrahamian" in order to appease the "true bayesian" crowd). Hell, even a search for "bayes" will turn out a few more hits, like ifilter, which aims to automatically classify mail in different folders, but could be easily tuned to filter out spam.

    Of these, I think spamprobe is becoming the true "swiss army knife" of "bayesian" filtering; I did find both bogofilter and bayespam spartan, but they work well. spamprobe, on the other hand, is very actively maintained, is under constant improvement by the author, Brian Burton, and has given me excellent results getting rid of over 90% of my spam.
  • Good in combination with spamassassin? by Moritz Moeller - Her (Score:2) Sunday November 03 2002, @02:12PM
  • Developers missed this... (Score:3, Insightful)

    by bigberk (547360) <bigberk@users.pc9.org> on Sunday November 03 2002, @02:15PM (#4589381)

    In my testing (over the last 30 mins) I discovered that filtering is employed when the POP3 "RETR" (retrieve entire message) command is used but no filtering is done when the equally useful "TOP" (show me the headers and X lines of the body) command is issued by a client.

    A huge advantage of also doing the filtering for the TOP command would be that mail clients such as The Bat [ritlabs.com], Pimmy [geminisoft.com], JBMail [pc-tools.net] and PocoMail [pocomail.com] will let you preview all headers while leaving mail on the server (or deleting it, whatever) but without actually downloading the full message bodies.
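What handling both commands might look like in a POP3 proxy can be sketched as a small dispatch check. This is not POPFile's code; the function name `should_classify` and the `filter_top` switch are hypothetical. The command shapes follow the POP3 protocol: `RETR msg` takes one message number and `TOP msg n` takes a message number plus a line count, and command keywords are case-insensitive.

```python
def should_classify(command_line, filter_top=True):
    """Decide whether a POP3 client command should trigger classification."""
    parts = command_line.strip().split()
    if not parts:
        return False
    verb = parts[0].upper()  # POP3 keywords are case-insensitive
    if verb == "RETR":
        # RETR <message-number>
        return len(parts) == 2 and parts[1].isdigit()
    if verb == "TOP":
        # TOP <message-number> <body-lines>
        return (filter_top and len(parts) == 3
                and parts[1].isdigit() and parts[2].isdigit())
    return False


print(should_classify("RETR 1"))     # True
print(should_classify("top 3 20"))   # True
print(should_classify("LIST"))       # False
```

One caveat for any proxy that takes this route: a `TOP` response carries only the headers and the first few body lines, so the classifier has fewer words to work with than it would for a full `RETR`.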

  • Easy filter by kraf (Score:1) Sunday November 03 2002, @02:17PM
  • outlook integration by jdkane (Score:1) Sunday November 03 2002, @02:21PM
  • by Rooney444 (617388) on Sunday November 03 2002, @02:29PM (#4589479)
    If this is only intended for client side use then it still doesn't address the issue of all the bandwidth that spam wastes. Wouldn't it just be a better project to help all the idiots close the open relays on their servers? Or maybe require authentication on all SMTP servers?
    • by dzym (544085) on Sunday November 03 2002, @02:38PM (#4589550) Homepage Journal
      Yes, but remember, who runs the SMTP servers?

      The very design of the whole system specifies that anyone can just turn on a machine, hook it up to a network somewhere, and start spewing out messages to smtp ports all over the world.

      It doesn't have to be a sendmail, qmail, or exim server, remember. Some Windows viruses have taken advantage of that loophole to set up mini-SMTP servers in the network stack to continue propagating viruses without needing to connect to anything that provides authenticated external relay.

      [ Parent ]
    • Re:Is this intended for server, client, or both? by tsg (Score:2) Sunday November 03 2002, @04:09PM
  • Image-based spam by Anonymous Coward (Score:1) Sunday November 03 2002, @02:43PM
  • Same approach works in Lotus Notes! by scottme (Score:1) Sunday November 03 2002, @02:45PM
  • What about random misspellings? by archeopterix (Score:2) Sunday November 03 2002, @02:46PM
  • Missing the point? (Score:5, Informative)

    by crisco (4669) on Sunday November 03 2002, @02:46PM (#4589600) Homepage
    I think lots of people here are missing the point of POPFile. Everyone is happy to point out that there are already several assorted solutions to Bayesian mail filtering in many different languages. Nearly all of these work on the mail server. Now lots of us are qualified and interested in setting up our own mail server, customizing the mail processing our own One True Way and happily enjoying an inbox free of spam. But the average windows user has no idea how to set up a mail server. Others could easily do it but feel their time is better spent on other things, not admining a mail server.

    This is what POPFile is for. It's a POP3 proxy server; it sits between your POP3 client and the server and simply adds a classification to the headers (or the subject line for braindead mail clients).

    Currently POPFile is a bit rough on computer newbies: it needs a Perl install and such. However, if you read the forums it is intended to end up as an easily installed executable for Windows users and to remain a nifty little Perl script for the rest of the platforms where it might come in handy. So when those pesky friends and relatives come asking about all the viagra and farmyard spam they get (and you haven't already set them up on your tightly filtered mail server), set up POPFile for them.

    Also, it's not just for spam filtering. Think of what you could do if you could go beyond simple rules for your inbox. Want email you think is important forwarded to your phone? Create a category for important email and go through your archives and feed POPFile email you would have wanted forwarded instantly. Create a new folder to receive those mails and watch it for a few days, retraining POPFile until it is getting reasonably good at putting important mail in there. Now set up your mail system to forward those to your phone. Will it work? I don't know, but based on the results I'm getting, it probably would. How about using it to filter help desk emails?
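The tagging step that such a proxy performs (add a classification header, or mark the subject line for clients that cannot filter on custom headers) can be sketched with Python's standard `email` package. This is an illustration only, not POPFile's Perl code; the function name `tag_message`, the header name `X-Text-Classification`, and the subject format are assumptions.

```python
from email import message_from_string


def tag_message(raw_message, bucket, header="X-Text-Classification",
                tag_subject=False):
    """Return the message text with a classification header added."""
    msg = message_from_string(raw_message)
    del msg[header]            # deleting a missing header is a no-op
    msg[header] = bucket
    if tag_subject:
        # Fallback for clients that can only filter on the subject line.
        subject = msg.get("Subject", "")
        del msg["Subject"]
        msg["Subject"] = f"[{bucket}] {subject}"
    return msg.as_string()


raw = "From: a@example.com\nSubject: Hello\n\nBody text\n"
out = tag_message(raw, "spam", tag_subject=True)
tagged = message_from_string(out)
print(tagged["X-Text-Classification"])  # spam
print(tagged["Subject"])                # [spam] Hello
```

The mail client then needs only one ordinary rule per category, matching on the added header or the subject prefix, to move messages into folders.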

  • Yahoo! Mail by sfe_software (Score:2) Sunday November 03 2002, @02:48PM
  • Bayes (Score:5, Funny)

    by John Garvin (229844) on Sunday November 03 2002, @02:56PM (#4589654) Homepage
    Now we can tell spammers: "All your Bayes are belong to us."
  • SpamOracle by flockofseagulls (Score:1) Sunday November 03 2002, @03:04PM
  • Multi-purpose tool (Score:3, Interesting)

    by B'Trey (111263) on Sunday November 03 2002, @03:10PM (#4589736)
    An interesting idea that I haven't seen discussed is using this concept for more general uses. If we can sort spam from non-spam, how about business from personal? Technical from administrative? All you'd need is multiple databases of word probabilities, the ability to assign emails to multiple categories and a hierarchical method of sorting.
    • 1 reply beneath your current threshold.
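Sorting into more than two categories needs little beyond one table of word counts per category. The sketch below is a flat (not hierarchical) multi-category naive Bayes classifier with add-one smoothing and equal prior weight per category; the class name `BucketClassifier`, the category names, and the training words are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class BucketClassifier:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.totals = Counter()                  # bucket -> total words seen
        self.vocab = set()

    def train(self, bucket, words):
        self.word_counts[bucket].update(words)
        self.totals[bucket] += len(words)
        self.vocab.update(words)

    def classify(self, words):
        """Return the bucket with the highest smoothed log-likelihood."""
        def log_score(bucket):
            denom = self.totals[bucket] + len(self.vocab)
            return sum(math.log((self.word_counts[bucket][w] + 1) / denom)
                       for w in words)
        return max(self.word_counts, key=log_score)


clf = BucketClassifier()
clf.train("work", ["deadline", "server", "meeting"])
clf.train("personal", ["dinner", "weekend", "mom"])
clf.train("spam", ["free", "viagra", "mortgage"])

print(clf.classify(["server", "deadline"]))  # work
print(clf.classify(["free", "mortgage"]))    # spam
```

A hierarchical arrangement, as suggested above, could be layered on top by running one such classifier per level (say, spam versus non-spam first, then business versus personal), though that is a design choice rather than something the technique requires.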
  • this battle cannot be won (Score:4, Insightful)

    by mboedick (543717) <matthewm AT boedicker DOT org> on Sunday November 03 2002, @03:13PM (#4589766) Homepage

    These technologies are interesting, but the problem of spam should be solved at the source. Why should we waste our time, money, CPU and drive space trying to outwit spam with clever software? As has been said before, if you filter spam at the inbox, a lot of resources have already been wasted by the time it arrives.

    Spam is anti-social behavior - a perversion of technology to make a quick buck. It's a cancer, and we should try to kill it. If you try to fight it any other way, you will constantly be playing catch-up, as the spammers have technology on their side too.

    • Re:this battle cannot be won (Score:4, Insightful)

      by shayne321 (106803) on Sunday November 03 2002, @05:48PM (#4590645) Homepage Journal

      These technologies are interesting, but the problem of spam should be solved at the source.

      And how do you propose we solve the problem at its source? Make it illegal? They'll just find loopholes in the law and/or move to a country where it is legal. Hunt them down and murder their wife and kids in front of them then hang them from a tree? Satisfying though it may be, last I checked murder was illegal.

      Techniques like this CAN eventually solve the problem. As others have pointed out, for someone to buy something from a spammer they have to READ the spam. If they send out 1 million spams and 500,000 read them and 20 of them buy something, they'll keep doing it. If they send out 1 million and only 500 people read it and 1 person buys something, they'll lose their source of income and have to find a new line of work.

      Also, for each obstacle we put in their way (checksum databases, open relay databases, filters, etc) it costs them more time, effort and therefore, money to send their crap - all for less income.

      Shayne

      [ Parent ]
    • Re:this battle cannot be won by crucini (Score:3) Sunday November 03 2002, @05:56PM
  • What's the problem? by LS (Score:2) Sunday November 03 2002, @04:20PM
  • Other applications... (Score:3, Funny)

    by Ed Avis (5917) <ed@membled.com> on Sunday November 03 2002, @06:29PM (#4590907) Homepage
    How long until we can set up Bayesian by-word filtering on Slashdot comments?
  • SpamAssassin (Score:3, Interesting)

    by fireboy1919 (257783) <rustyp@@@freeshell...org> on Sunday November 03 2002, @07:59PM (#4591345) Homepage Journal
    This seems to be about using strange approaches to spam filtering, but really...a Bayesian network seems to be a natural step for a system that heretofore was composed of a series of heuristics with no knowledge of which is more important.

    (Why hasn't it been done? Bayesian networks are only taught in AI and statistics classes).

    What really interests me is that SpamAssassin claims to use a genetic algorithm [spamassassin.org] to rate how likely an e-mail is to be spam.
  • Privacy questions... by cerebrum (Score:1) Sunday November 03 2002, @08:04PM
  • I've already got an Outlook (VBA) version by Red Herring (Score:1) Sunday November 03 2002, @08:05PM
  • Spamvertised URL Tracker by herbierobinson (Score:2) Sunday November 03 2002, @11:05PM
  • Several points come to mind by CySurflex (Score:2) Monday November 04 2002, @01:42AM
  • spammers debugging the code right now by SystematicPsycho (Score:1) Monday November 04 2002, @02:07AM
  • Already patented by MicrosofT (Score:3, Informative)

    by barfy (256323) on Monday November 04 2002, @03:29AM (#4592944)
    This whole methodology is already patented by Microsoft. ANY implementation not licensed by Microsoft is going to be a violation... And now that you know, it is treble damages...

    patent 6,161,130 [uspto.gov]
  • Bogofilter with IMAP integration by giggls (Score:1) Monday November 04 2002, @05:05AM
  • Been posted before... by NNland (Score:2) Monday November 04 2002, @12:43PM
    • 1 reply beneath your current threshold.
  • bayesian modslapping by psamuels (Score:2) Tuesday November 05 2002, @12:52AM
  • Spammers Counter Tactics by htaccess (Score:1) Wednesday November 06 2002, @11:52PM
  • Last Post! by alpg (Score:1) Sunday November 17 2002, @12:36PM
  • Re:Those terrorists! by edb (Score:1) Sunday November 03 2002, @01:10PM
    • 1 reply beneath your current threshold.
  • Re:Honest to whom? by jez9999 (Score:1) Sunday November 03 2002, @01:37PM
  • Bad idea by archeopterix (Score:1) Sunday November 03 2002, @02:52PM
  • Re:Honest to whom? by DrPascal (Score:1) Sunday November 03 2002, @03:07PM
  • 18 replies beneath your current threshold.