Working Bayesian Mail Filter 313

Posted by CmdrTaco on Sunday November 03, 2002 @02:05PM from the stuff-to-play-with dept.

zonker writes "A real, working honest to god Bayesian spam filter. I've been waiting for something like this for a while (since I first read Paul Graham's research paper on this very topic a few weeks ago). Well here's POPFile, a small but extremely effective Perl script that runs on just about any system Perl does. After just a little training I was able to get very effective filtering out of it. From what I understand the new email client that comes with OS X Jaguar has a feature similar to this, but I don't know if it is true Bayesian. Hopefully this kind of feature will become more prevalent in client software as I see the Google results for it are growing."

This discussion has been archived. No new comments can be posted.


Whas that? (Score:2, Interesting)

by cos(0) ( 455098 ) writes:

Would anyone care to explain what a "Bayesian" mail filter is?
- Re:Whas that? (Score:4, Informative)
  
  by DalTech ( 575476 ) writes: on Sunday November 03, 2002 @02:16PM (#4589064)
  
  Bayesian statistics is a body of statistical theory and methods useful in solving theoretical and applied problems in science, industry and government. http://www.bayesian.org/
  
- Re:Whas that? (Score:5, Funny)
  
  by Evil Adrian ( 253301 ) writes: on Sunday November 03, 2002 @02:19PM (#4589092) Homepage
  
  If you had just clicked the POPFile [sourceforge.net] link, you would see the explanation.
  
  Initiative is your friend.
  
  Hyperlinks are your friend.
  
  Don't be afraid, just click.
  
  - Re:Whas that? (Score:5, Informative)
    
    by sfe_software ( 220870 ) writes: on Sunday November 03, 2002 @03:02PM (#4589341) Homepage
    
    If you had just clicked the POPFile link, you would see the explanation.
    
    I also highly recommend this link [paulgraham.com], as it goes into quite a lot of detail on this filtering technique. After reading it, I am going to give the Perl variation a shot.
    
- Re:Whas that? (Score:5, Informative)
  
  by dvk ( 118711 ) writes: on Sunday November 03, 2002 @02:19PM (#4589094) Homepage
  
  From what I understand, it is a mail filter which determines what to filter out based on a statistics-based machine learning system called "Bayesian Learning".
  A couple of URLs quickly found on Google:
  http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-7.html [faqs.org]
  http://www.csse.monash.edu.au/courseware/cse5230/assets/images/week09.pdf [monash.edu.au]
  Also, any decent AI/machine learning textbook ought to cover the topic.
  -DVK
  
- - Re:Bayes Explained (Score:5, Informative)
    
    by johnynek ( 36948 ) writes: <boykin@pobox.com> on Sunday November 03, 2002 @02:36PM (#4589191) Homepage
    
    That's /. for you. You guys have modded up to 5 a post that is wrong in both of the equations it posts.
    
    It should be:
    
    Pr(h|D) = Pr(D|h) * Pr(h) / Pr(D)
    
    and:
    
    Pr("SPAM"|Email) = Pr(Email|"SPAM") * (proportion of spam) / (probability of getting this particular Email)
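
The denominator, the probability of getting this particular Email, is rarely measured directly. A sketch of how it is usually expanded, assuming every message is either "SPAM" or "HAM" (the HAM label is borrowed from later comments in this thread and is not part of this post):

```latex
\Pr(\text{SPAM} \mid \text{Email})
  = \frac{\Pr(\text{Email} \mid \text{SPAM})\,\Pr(\text{SPAM})}
         {\Pr(\text{Email} \mid \text{SPAM})\,\Pr(\text{SPAM})
          + \Pr(\text{Email} \mid \text{HAM})\,\Pr(\text{HAM})}
```

Here Pr(SPAM) is the proportion of spam and Pr(HAM) = 1 - Pr(SPAM), so only the two conditional probabilities have to be estimated from mail.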
    
  - Re:Bayes Explained (Score:2)
    
    by capt.Hij ( 318203 ) writes:
    
    Great, now the spammers will hire mathematicians to figure out how to best defeat the common algorithms used to calculate Pr(D|h). It is the same old story. In a war over information only the mathematicians win.
    - Re:Bayes Explained (Score:4, Informative)
      
      by B'Trey ( 111263 ) writes: on Sunday November 03, 2002 @03:07PM (#4589356)
      
      Read the referenced article. The only way to avoid the filter is to make your email sound like a normal message. In essence, the filter recognizes the sales pitch. If you remove the sales pitch to get your spam past the filter, you've removed the whole point of sending the spam.
      
  - Re:Bayes Explained (Score:4, Informative)
    
    by Jim Nugent ( 619564 ) writes: on Sunday November 03, 2002 @03:55PM (#4589647)
    
    To put this in simpler terms, consider this scenario: 90% of all X-rays from women with breast cancer show a certain feature. That is an easy statistic to compute; you have the X-rays and you follow up with the patients.
    
    The trick is to derive a statement like: "If an X-ray has this feature, the patient has an NN% chance of having breast cancer." THAT's useful for screening, but it doesn't follow from the first statement (without some serious statistical calculations).
    
    Bayes' theorem has all sorts of applications in prediction. In the case of E-mail, we can greatly oversimplify and say "We found that X% of spam E-mails have this subject line." "We conclude that an E-mail with this subject line has Y% odds of being spam." Note that these are two very different statements. If we can find Y for the second statement and set a threshold we're comfortable with, say, 95%, then we can create a filter with 95% confidence of correctness; it may well be wrong 5% of the time.
    
    Other responses have done a good job with the math so I won't repeat it here.
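
To make the difference between the two kinds of statement concrete, here is a minimal Python sketch. It assumes the feature shows up in 90% of X-rays from patients who do have cancer; the 1% base rate and the 10% rate among healthy patients are invented purely for illustration, and `posterior` is a hypothetical helper name.

```python
def posterior(p_feature_given_cancer, p_cancer, p_feature_given_healthy):
    """Return P(cancer | feature) using Bayes' theorem."""
    p_healthy = 1.0 - p_cancer
    # Total probability of seeing the feature at all
    p_feature = (p_feature_given_cancer * p_cancer
                 + p_feature_given_healthy * p_healthy)
    return p_feature_given_cancer * p_cancer / p_feature

# Hypothetical numbers, chosen only for illustration
p = posterior(0.90, 0.01, 0.10)
print(f"P(cancer | feature) = {p:.3f}")
```

With these made-up inputs the answer comes out near 8%, far from 90%, because the condition is rare to begin with. The same arithmetic applies with "spam" in place of "cancer" and a subject line in place of the X-ray feature.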
    
spambayes.sf.net (Score:5, Informative)

by supton ( 90168 ) writes: on Sunday November 03, 2002 @02:10PM (#4589019) Homepage

Saw this a few weeks back... [sf.net] Spam filter in Python using Naive Bayes.

Sure it's promising (Score:4, Insightful)

by bigberk ( 547360 ) writes: <bigberk@users.pc9.org> on Sunday November 03, 2002 @02:12PM (#4589042)

And I'm going to check it out right now :) But one long-standing fear I have with such solutions is spammers adapting to new environments (changing the wording used, making the emails look more professional). Sure, they're dumb shits but they're still humans with brains.

- Re:Sure it's promising (Score:5, Informative)
  
  by outlier ( 64928 ) writes: on Sunday November 03, 2002 @02:26PM (#4589136)
  
  While spammers will undoubtedly continue to refine the content of their messages, one of the strengths of using a Bayesian filter like this is that it uses the user's own spam and non-spam (ham) as the basis for its calculations. This means that messages are categorized not only by whether they contain spammy words, but also whether they contain the hammy words from your own messages. So, even if spammers could refrain from using words like "free" "mortgage" "sluts" and "spam", they probably wouldn't use words that discriminate your own ham from others (e.g., if you are a computer scientist, your mail may include hammy words like "algorithm" "compile" "project" or "stargate" that would help distinguish ham from spam). The challenge to the spammer would then be to target you with spam that looks like *your* ham (which is probably different from the ham of others).
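
A minimal Python sketch of the per-user idea described above: word statistics come from one user's own spam and ham, and the per-word probabilities are combined under the naive independence assumption. The combining rule is loosely modeled on the one in Paul Graham's article; the function names, the add-one smoothing, the equal priors and the tiny corpora are all assumptions made for illustration, not POPFile's actual code.

```python
from collections import Counter

def train(messages):
    """Count in how many messages each lowercase word appears."""
    counts = Counter()
    for text in messages:
        counts.update(set(text.lower().split()))
    return counts

def word_spam_prob(word, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | word) with add-one smoothing and equal priors."""
    s = (spam_counts[word] + 1) / (n_spam + 2)
    h = (ham_counts[word] + 1) / (n_ham + 2)
    return s / (s + h)

def spam_score(text, spam_counts, ham_counts, n_spam, n_ham):
    """Combine per-word probabilities as if the words were independent."""
    prod_spam, prod_ham = 1.0, 1.0
    for word in set(text.lower().split()):
        p = word_spam_prob(word, spam_counts, ham_counts, n_spam, n_ham)
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)

# Tiny made-up corpora standing in for one user's spam and ham
spam = ["free mortgage offer", "free offer click now"]
ham = ["compile the project tonight", "new algorithm for the project"]
spam_counts, ham_counts = train(spam), train(ham)

print(spam_score("free mortgage", spam_counts, ham_counts, len(spam), len(ham)))
print(spam_score("project algorithm", spam_counts, ham_counts, len(spam), len(ham)))
```

The first message scores well above 0.5 and the second well below it. Swapping in a different user's ham changes which words pull a message toward the ham side, which is the point made above.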
  
  Future systems (assuming faster processors and more HD space) could include semantic analysis (e.g., Latent Semantic Analysis) to do an even better job and go beyond the word level.
  
  - Re:Sure it's promising (Score:4, Informative)
    
    by rgmoore ( 133276 ) writes: <glandauer@charter.net> on Sunday November 03, 2002 @02:47PM (#4589262) Homepage
    
    Another important point is that there are some things that they can't hide, at least not in their current working model. If they're trying to sell you something, they have to describe what that thing is and where you can get it, and those descriptions are unlikely to be in any legitimate email. If they want to advertise a web site, they have to include its URL in the message, and the filter can catch that. If they advertise a physical address or phone number, the system can catch those, too. If they don't repeat the message, it means that there's inherently less spam, because I'm only seeing each ad once.
    It's also not possible to disguise everything in their headers, so things like their posting host (either the one they pay for legitimately or any open relay they're taking advantage of) will wind up being a pointer to who they are. They certainly can't change anything about the headers that's added downstream of their posting host, so as long as they keep using the same one it's likely that there will be characteristic stamps there that the spammers absolutely can't change. I know that analysis of the headers is part of bogofilter [sourceforge.net], another Bayesian filter that I've been using to good effect.
    
    - Growing a spam filter -- a firsthand experience (Score:4, Interesting)
      
      by devphil ( 51341 ) writes: on Sunday November 03, 2002 @07:30PM (#4590909) Homepage
      
      So, the graduate CS course I'm taking this quarter is Evolutionary Computing, which is all about the convoluted nonlinear multidimensional-search-space problems, and guess what our current homework is? That's right, taking statistics on spam data, and using genetic algorithms to evolve a working spam filter.
      
      Due to one typo and two thinkos in my fitness evaluation function, my algorithm evolves -- within only a few dozen generations -- a solution which looks like this:
      Ignore the actual contents of the message. 34% of the time, it's spam.
      
      And it's right.
      
  - Re:Sure it's promising (Score:2, Informative)
    
    by marmoset ( 3738 ) writes:
    
    Over the last month or so, I've received a few really strangely worded porn spams that seem to be engineered so as not to trip ISP porn filters. They use lots of passive verbs, no exclamation points, no HTML, and dictionary definitions of whatever kink the spammer is selling.
    
    Since I use Jaguar's mail client, I just told it that these were spam too and now it catches them by itself. :)
  - Re:Sure it's promising (Score:4, Interesting)
    
    by Tim Browse ( 9263 ) writes: on Sunday November 03, 2002 @04:51PM (#4589991)
    
    One interesting fact that came out of these statistical analyses of spam was from one that was featured a while back on slashdot - the guy was doing word analysis, and was looking for good spam indicators/correlations, and expected "sex" or "teens" to be a good match, but the best word was, surprisingly, "ff0000". This was because so much spam uses HTML mail with red text.
    
    So if nothing else, it will force spammers to stop using red text - that has to be some kind of victory :-)
    
    Tim
    
  - Welcome to the future (Score:3, Informative)
    
    by disarray ( 108 ) writes:
    
    Future systems (assuming faster processors and more HD space) could include semantic analysis (e.g., Latent Semantic Analysis) to do an even better job and go beyond the word level.
    
    Welcome to the future: the mail client [apple.com] in Mac OS X 10.2 uses latent semantic analysis. (This isn't just marketingspeak--my mail folder includes "LSMMap"--LS as in "latent semantic".)
  - Re:Sure it's promising (Score:5, Funny)
    
    by Alsee ( 515537 ) writes: on Sunday November 03, 2002 @08:48PM (#4591298) Homepage
    
    (e.g., if you are a computer scientist, your mail may include hammy words like "algorithm" "compile" "project" or "stargate" that would help distinguish ham from spam.
    
    I have a cousin that lives in Nigeria and we regularly discuss tips on penis enlargement. He works at a bank refinancing mortgages and his wife is a professor at an accredited university. I work in a Las Vegas casino producing shows featuring live nude showgirls. He offered to help me pay some bills and get out of debt (a generous offer, but I told him I just found a second part time job working from home earning thousands of dollars per week). My wife is a stock broker and I regularly let my cousin in on hot stock tips. I have an herb garden, I take viagra, and use rogaine. Since we both own the same brand of printer we've been working out the best way to refill the ink cartridges. I've been trying to lose weight, but it comes right back as soon as I quit smoking.
    
    I don't quite understand this "beysian filter" stuff, but I can't wait to try it out!
    
- Re:Sure it's promising (Score:2, Interesting)
  
  by bmwm3nut ( 556681 ) writes:
  
  that's the beauty of this approach. the filter learns all the time (or at least you can set it up that way). so if spammers get smart, it doesn't take long until the filter adjusts. what i'd love to see is this filter built into a mail client where you have two buttons for delete. one, just to delete the mail, the other to delete it and mark it as spam. when you press that button the filter would scan the email and update its rules.
  - Re:Sure it's promising (Score:4, Informative)
    
    by rgmoore ( 133276 ) writes: <glandauer@charter.net> on Sunday November 03, 2002 @02:56PM (#4589313) Homepage
    
    Bogofilter [sourceforge.net] comes close to this. It has an operating mode where each file that it filters is automatically added to the appropriate corpus, either of spam or non-spam. Since it's correct the vast majority of the time, that means that there's very little for the user to do. When it is wrong, you just take the messages that it miscategorized and feed them back into the system with the notation that they were originally marked incorrectly, and it backs out the changes to the wrong category and adds them to the correct category.
    I'm using bogofilter with Evolution [ximian.com], and it works very well. I just have two extra folders, one for false negatives and one for false positives. When I notice mail that's been flagged incorrectly, I drag it into the appropriate folder and run a script that tells bogofilter to correct its mistake. Then I either flush the mail (if it was spam marked as non-spam) or process it normally (if it was non-spam marked as spam). I've only been using it for about two weeks and it already has a nearly zero false positive rate (i.e. incorrectly flagged as spam) and a usefully low false negative rate (i.e. incorrectly flagged as legitimate).
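
The register-then-correct workflow described above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, with a hypothetical `Corpus` class and method names; it is not bogofilter's actual implementation or command-line interface.

```python
from collections import Counter

class Corpus:
    """Word counts for two categories, 'spam' and 'ham' (hypothetical layout)."""

    def __init__(self):
        self.words = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def register(self, text, category):
        """Add a message's words to the given category."""
        self.words[category].update(text.lower().split())
        self.messages[category] += 1

    def correct(self, text, wrong, right):
        """Back a misfiled message out of `wrong` and add it to `right`."""
        self.words[wrong].subtract(text.lower().split())
        self.words[wrong] += Counter()  # drop entries whose count fell to zero
        self.messages[wrong] -= 1
        self.register(text, right)

corpus = Corpus()
corpus.register("cheap pills online", "ham")         # the filter guessed wrong
corpus.correct("cheap pills online", "ham", "spam")  # the user fixes it
print(corpus.words["ham"], corpus.words["spam"])
```

Backing the words out of the wrong category matters as much as adding them to the right one; otherwise every mistake would leave spammy words sitting in the ham counts.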
    
    - Re:Sure it's promising (Score:2)
      
      by dvdeug ( 5033 ) writes:
      
      it already has a nearly zero false positive rate
      
      True. I think I've only had one message falsely get pegged as spam.
      
      a usefully low false negative rate
      
      I haven't found this to be true. Maybe it's because I didn't save up all my spam to send through it, but I estimate that it's only catching half my spam. Nigerian spams still get through sometimes, which should have been very easy to catch.
    - - Re:what is the point then? (Score:2, Insightful)
        
        by rgmoore ( 133276 ) writes:
        
        Well, there are potentially three points. One is that hopefully after a while the filter will work well enough that you can develop some real confidence in it and you won't have to check every time to see that it's working right. I'm pretty close to that point with bogofilter; I so rarely see any false positives that I can almost afford to flush the messages without checking. Actually, I assume that what I'll really do is to change the rules a bit so that alleged spam is sent to a waiting folder and doesn't even show up in my main inbox.
        That gets to point two: now I'll be able to check for spam in batch mode. Instead of going through my inbox every time I look for messages, marking some as spam and reading others, I'll be able to read just about everything in my inbox without worrying about spam. Then once a week I can check my spam box and see if there's actually anything legitimate there. This is going to be faster than doing it every time a new message shows up in my inbox.
        I'm not a compulsive mail reader, but for some people this would also be really useful because it would protect them from distractions. They are working on something and then their mailbox beeps them to let them know that a message has arrived. Unfortunately, when they check it out it turns out that their train of thought has been needlessly disrupted by another spam. If they can filter out the spam before the notification while still being alerted promptly when a real message shows up, that's a big win.
- Re:Sure it's promising (Score:2)
  
  by Theodore Logan ( 139352 ) writes:
  
  Well, as the man says in the article:
  
  The Achilles heel of the spammers is their message. They can circumvent any other barrier you set up. They have so far, at least. But they have to deliver their message, whatever it is. If we can write software that recognizes their messages, there is no way they can get around that.
  
  And I think that in this he is correct, almost even provably correct. That's theory, however. In practice no system, short of "real" AI, will be good enough to always recognize spam with a zero false positive rate. It may eventually be good enough, but it won't be perfect. Natural language is just too hard to parse in this way.
  
  But don't despair. If it flunks, there's always spammotel [spammotel.com] and their likes.
- Professional Looking Spam May Be Impossible (Score:4, Insightful)
  
  by Bob9113 ( 14996 ) writes: on Sunday November 03, 2002 @02:44PM (#4589237) Homepage
  
  This may be self-regulating. Consider the Skinner box; if something is capable of perfectly emulating recognition of Chinese, then it can be said to recognize Chinese. Likewise, if a spammer becomes sufficiently skilled at writing undetectable prose, he or she will have reached a skill level at which he or she can pursue more profitable writing ventures. The margins in spam are pretty small. Those spams are being written by morons because morons are cheap.
  
  - Re:Professional Looking Spam May Be Impossible (Score:3, Informative)
    
    by ceswiedler ( 165311 ) writes:
    
    I don't think you're talking about the Skinner box [uni-wuerzburg.de], which is a device used in the psychology of learning, but rather the Chinese room [wustl.edu], which is John Searle's take on AI and the Turing test.
- Re:Sure it's promising (Score:2)
  
  by Brendan Byrd ( 105387 ) writes:
  
  Is there an application of this theory to SpamAssassin? Right now, it's more or less human-edited words and phrases, but applying a real Bayesian method to it would increase its accuracy. I've also considered making a filter that would change the scores of the different SA rules to reduce the false positives, but this would be a long project.
- Re:Sure it's promising (Score:4, Insightful)
  
  by tsg ( 262138 ) writes: on Sunday November 03, 2002 @04:46PM (#4589960)
  
  Any solution that requires spammers to be more clever is going to reduce the number of spammers. And that is the end goal.
  
- - Re:The decimal issue (Score:2)
    
    by Spock the Baptist ( 455355 ) writes:
    
    One of my pet peeves is the obsession that folks have with zeros. An example is the year 2000. In base 10 you get beaucoup zeros whereas with hex you get 7D0, or 11A8 (base 12), or 3720 (octal), or 11111010000 (binary). Zeros are an artifact of both the base and the numeral system used to represent a pure number. Thus, the fact that most humans use the decimal Indo-Arabic numeral system to represent it is the only reason for all those zeros. Use another base, or numeral system to represent 2000, you don't get beaucoup zeros.
    
    The real properties of pure numbers are the relationships that they have with other numbers, and not the symbology used to represent them.
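
Conversions like the ones in this comment are easy to check mechanically. A short Python sketch, where `to_base` and `DIGITS` are helpers written for this illustration:

```python
DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def to_base(n, base):
    """Return the digits of a non-negative integer n in the given base (2..36)."""
    if not 2 <= base <= len(DIGITS):
        raise ValueError("base must be between 2 and 36")
    if n == 0:
        return "0"
    out = []
    while n > 0:
        n, r = divmod(n, base)  # peel off the least significant digit
        out.append(DIGITS[r])
    return "".join(reversed(out))

for base in (10, 16, 12, 8, 2):
    print(base, to_base(2000, base))
```

The number is the same in every row of the output; only the digit string changes with the base.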
Server-side solutions? (Score:3, Interesting)

by Quixote ( 154172 ) writes: on Sunday November 03, 2002 @02:12PM (#4589046) Homepage Journal

Any server-side solutions (MTA==qmail, MDA==procmail) using this (Naive-Bayesian) technique out there?

- Re:Server-side solutions? (Score:2)
  
  by rehannan ( 98364 ) writes:
  
  I've been using PopTray [crause.co.za] (a POP3 email checker for Windows). You have the option of defining "rules" which allow you to delete emails server-side.
  
  It's not a "smart filter" but it works fine for me.
- Re:Server-side solutions? (Score:4, Interesting)
  
  by cmeans ( 81143 ) writes: <chris...a...means@@@gmail...com> on Sunday November 03, 2002 @02:33PM (#4589174) Journal
  
  James [slashdot.org] is a 100% Java Email server (SMTP, POP3, NNTP, and IMAP soon) that supports mail-server extensions via the Mailets API [apache.org]. I developed a Java implementation of the Bayesian rules discussed, so that they could be used in any configuration, but also provided a mailet wrapped implementation so that the filtering (or flagging) could be done at the server side.
  
  - Oops, screwed up the URL... (Score:2)
    
    by cmeans ( 81143 ) writes:
    
    Apache Jakarta James is at http://jakarta.apache.org/james [apache.org].
- Re:Server-side solutions? (Score:4, Interesting)
  
  by koreth ( 409849 ) writes: on Sunday November 03, 2002 @02:44PM (#4589239)
  
  I've been using SpamProbe [sourceforge.net] (which gets invoked from procmail) with excellent results.
  
- Re:Server-side solutions? (Score:2)
  
  by ragnar ( 3268 ) writes:
  
  Yes, my company provides an online service [inbox13.com] to do this sort of thing. We are in beta right now. email me (ragnar@spinweb.net) if you are interested in some more details, as the marketing stuff on the site is a bit lacking.
Mozilla in Process of adding Bayesian filter (Score:5, Interesting)

by AT ( 21754 ) writes: on Sunday November 03, 2002 @02:14PM (#4589055)

The mozilla mail client is getting a Bayesian mail filter, too. See http://bugzilla.mozilla.org/show_bug.cgi?id=163188 . Unfortunately, it probably won't show up until after version 1.2 is released.

- Re:Mozilla in Process of adding Bayesian filter (Score:2)
  
  by Jugalator ( 259273 ) writes:
  
  And it seems likely the SpamBayes project [sourceforge.net] will work as the foundation for their mail filter.
  
  There are a few other applications [sourceforge.net] that use this code as well, such as an Outlook 2000 add-in.
That Google search... (Score:4, Insightful)

by Jugalator ( 259273 ) writes: on Sunday November 03, 2002 @02:15PM (#4589061) Journal

Try searching for "bayesian email filter" instead of just "bayes email filter" (as in the news post). You'll get better results and more hits since Google doesn't match "*bayes*" (as one would think) when searching for "bayes", but only the actual word "bayes".

- Re:That Google search... (Score:2)
  
  by Preposterous Coward ( 211739 ) writes:
  
  Google doesn't match "*bayes*" (as one would think) when searching for "bayes",
  
  Just curious: Why at all would anyone think that "bayes" would match "*bayes*"? Imagine if searching for "cars" also got you "scars", "Johnny Carson", and so on...
  
  It might make sense for a search engine to do limited stemming (cars -> car, eating/eats/ate -> eat), but that's something completely different...
Bayesian? Wow!!! I'm sooo excited. (Irony!) (Score:5, Interesting)

by davids-world.com ( 551216 ) writes: on Sunday November 03, 2002 @02:16PM (#4589062) Homepage

A true Bayesian filter, wow. Let's face it, statistical classifiers based on Bayes' formula are not really state of the art. They make false assumptions about the data (independence of features).
More intelligent classification algorithms can solve non-linear problems far better. Check out Kernel Machines [kernel-machines.org] and, somewhat older, Maximum Entropy models.
Enough nerd talk for today :-)

- Re:Bayesian? Wow!!! I'm sooo excited. (Irony!) (Score:3, Informative)
  
  by JPZ ( 42691 ) writes:
  
  A true Bayesian filter, wow. Let's face it, statistical classifiers based on Bayes' formula are not really state of the art. They make false assumptions about the data (independence of features).
  
  Bullshit. Bayes' formula is exact, and makes no assumption on independence whatsoever. Naive Bayesian approaches make independence assumptions, hence the use of the term naive.
  
  The only inherent drawback in using Bayes' rule in classifiers is that you have to assume the number of classes to be known a priori.
  
  JPZ
- - Re:Bayesian? Wow!!! I'm sooo excited. (Irony!) (Score:3, Interesting)
    
    by Lenbok ( 22992 ) writes:
    
    Actually compression-based techniques don't work particularly well, mainly because they are very sensitive to the amount of training data. If you have a lot of non-spam mail, your non-spam compressor will compress better than your spam compressor.
    
    In the long view, all compression is machine learning anyway :-)
- - Pedantry! (Score:3, Funny)
    
    by Tim Browse ( 9263 ) writes:
    
    that's not irony, it's sarcasm.
    
    Actually, irony is generally considered [dictionary.com] to be "use of words to express something different from and often opposite to their literal meaning".
    Sarcasm [dictionary.com] is often defined as a form of irony (but not necessarily), intended to be cutting/offensive etc.
    So while his comment may have been sarcasm, it was also irony.
    And I'm not pedantic, I'm pernickety. :-)
    Tim
Forget Bayes (Score:5, Funny)

by Evil Adrian ( 253301 ) writes: on Sunday November 03, 2002 @02:16PM (#4589070) Homepage

We need the Stalin Mail Filter (TM) -- it detects spam, hunts down the spammer, and exiles them to Siberia.

- Re:Forget Bayes (Score:3, Funny)
  
  by Galahad2 ( 517736 ) writes:
  
  I tried that, but it was constantly too paranoid about identifying spam. I can't even remember how many of my friends and family ended up in Vladivostok for sending me bad jokes. The problem sort of solved itself though, since the filter program eventually just barricaded itself in my second hard drive and refused to come out. The only drawback is that now I can't save anything on the drive, since the Stalin Filter instantly deletes everything it can.
*BUT* it's a Perl script... (Score:2, Redundant)

by pilot1 ( 610480 ) writes:

Sure it's great that someone made one, but it's a Perl script. We might be able to use Perl, but most "normal" people have never even heard of Perl, let alone know how to run Perl scripts. It would be great if someone ported this to an .exe file or something that everyone could run. It'll probably happen eventually.
- Re:*BUT* it's a Perl script... (Score:2, Funny)
  
  by Niksie3 ( 222515 ) writes:
  
  sure... an .exe file everyone could run... have you had your pills today? a perl script runs on many more platforms than any .exe file.
  - Re:*BUT* it's a Perl script... (Score:2)
    
    by B'Trey ( 111263 ) writes:
    
    Might want to check your own medicine cabinet. Sure, Perl runs on more platforms. So what? How many of the world's actual computers have Perl installed? Even better, how many of the world's computers that are used daily to read email have perl installed? How many of them can run an .exe file? I'd suggest that the latter is orders of magnitude more than the former.
- Re:*BUT* it's a Perl script... (Score:2)
  
  by Elias Israel ( 182882 ) writes:
  
  This is a very good point.
  
  Truth is, to really tackle the problem of spam, a solution is needed that doesn't require the user to be a software engineer.
  
  Plus, another problem with rolling out a Bayesian filter for a large collection of users is that each individual user needs their very own filter database. The statistical analysis of my mail would be nearly useless for anyone else.
  
  OK, cards on the table: I am working on a new solution that will be useful for the general public and overcomes these problems.
  
  Those who care to learn more can sign up to be notified when it becomes available.
  
  Check out www.PureMessaging.com [puremessaging.com]
- perlcc (Score:3, Insightful)
  
  by Camel Pilot ( 78781 ) writes:
  
  I just received the November edition of the TPJ [tpj.com] which included a fine article "perlcc & Compiling Perl Script".
  
  In short, the filter script could be compiled to C and built to a native binary for a variety of platforms, eliminating the need for a Perl interpreter.
- Re:*BUT* it's a Perl script... (Score:2, Informative)
  
  by rgmoore ( 133276 ) writes:
  
  But perl scripts are just as easy to run as .exe files, so long as you have the perl interpreter installed. So now it's just a two step process:
  
  1. Install perl.
  2. Install the perl script.
  
  This is not exactly brain surgery. Perl can be installed on essentially any system you choose to name, with no more trouble than installing any other executable. For those people running Windows, there's an excellent port available from Activestate [activestate.com]. As somebody else pointed out, this means that a perl script is actually available to more people than a .exe would be, because it's truly cross-platform.
- Re:*BUT* it's a Perl script... (Score:2)
  
  by crisco ( 4669 ) writes:
  
  You're absolutely right. I've been closely following POPFile's development (and trying to help with docs) and it is a goal of the developer to create a brainless install that the masses can use, while still retaining the cross platform core that is useful for much more than spam detection. POPFile is under very active development and is only recently getting close to the point where it will be ready to stabilize on a release.
- Re:*BUT* it's a Perl script... (Score:2)
  
  by Jeremy Erwin ( 2054 ) writes:
  
  Lots of libraries allow you to embed a perl interpreter in a C program... I suspect that a number of linux email clients could be altered to run such a script as part of their "retrieve_mail()" functions.
  
  What do you want? a hideous visual basic macro in Outlook? The mere fact that one OS is difficult to use with perl shouldn't be an obstacle to innovation.
- Re:*BUT* it's a Perl script... (Score:3, Informative)
  
  by crucini ( 98210 ) writes:
  
  It would be great if someone ported this, to an .exe file or something that everyone could run.
  
  I don't think an .exe would help much - a Windows user doesn't need a standalone executable. He needs a filter (probably a .dll) coded to the specific filtering API of his mail client. Or does Microsoft have a generic mail filtering API? That way the filter seems to run "inside" the mail client.
  
  In general this illuminates one of the advantages of Unix. Lots of programs are written as filters that read from STDIN (standard input) and write to STDOUT (standard output). My own mail filtering script, for example, does that. I didn't have to learn any mailer-specific API, and my script can be used in different contexts. (Actually my script doesn't write to STDOUT - it saves the message to the appropriate folder.)
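
The everything-is-a-filter idea can be sketched in Python as a function that reads one message from an input stream and writes the tagged message to an output stream. Everything here is an assumption for illustration: the `X-Spam-Flag` header name, the toy `looks_like_spam` rule and the `run_filter` helper are hypothetical and are not taken from the poster's script.

```python
import email
import io

SPAMMY_WORDS = {"free", "mortgage"}  # placeholder rule, not a real classifier

def looks_like_spam(msg):
    """Toy test on the Subject header; a Bayesian score would go here."""
    subject = (msg["Subject"] or "").lower()
    return any(word in subject.split() for word in SPAMMY_WORDS)

def run_filter(infile, outfile):
    """Read one message from infile, tag it, and write it to outfile."""
    msg = email.message_from_string(infile.read())
    msg["X-Spam-Flag"] = "YES" if looks_like_spam(msg) else "NO"
    outfile.write(msg.as_string())

# Demonstration with in-memory streams instead of a real pipe
raw = "From: a@example.com\nSubject: Free mortgage quote\n\nHello\n"
out = io.StringIO()
run_filter(io.StringIO(raw), out)
print(out.getvalue())
```

Used as a real filter, the same function would be called as `run_filter(sys.stdin, sys.stdout)` after `import sys`, so that a delivery agent such as procmail could pipe each message through it. Because the function only sees two streams, it can be tested exactly as shown, without any mail client.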
  
  Windows does not lend itself to the everything-is-a-filter idea because, among other things, process creation is slow and expensive. When a filter is invoked, a process is launched. Unix has more efficient process creation, and Linux has especially efficient and light process creation. Therefore on Windows a mail filter should be implemented as a reusable software component (probably a COM object) that can be called by the mail client.
  
  Also, most mail clients on Unix use the same mail folder format (mbox) which is basically just the literal messages from the network written to a file. Since it is the assumed common language of mail folders, it encourages software to interoperate on the file level, which my script does by writing messages to mail folders. (Unix is file-centric.) Windows mail clients, in contrast, seem to store mail folders in proprietary formats. That's because Windows philosophy is that an application serves as gatekeeper to "its" files - the file is not a unit of interoperability. In our case it means a standalone mail filter probably couldn't write messages to the mail folder.
  
  Unix is a more friendly, efficient development environment because you can write a mail filter as a standalone program and test it without building a test harness.
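
The filter style described in the comment above (read one message from standard input, save it to an mbox folder) can be sketched roughly as follows. This is a minimal illustration using Python's standard `email` and `mailbox` modules, not the commenter's actual script; `choose_folder`, `file_message`, and the subject rule are hypothetical names and logic invented for the example.

```python
import email
import mailbox
from email import policy


def choose_folder(msg):
    """Toy rule standing in for a real classifier (hypothetical)."""
    subject = str(msg.get("Subject", ""))
    return "spam" if "viagra" in subject.lower() else "inbox"


def file_message(raw_bytes, folder_paths):
    """Parse one raw message and append it to the chosen mbox file.

    folder_paths maps a folder name to the path of its mbox file.
    """
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    folder = choose_folder(msg)
    box = mailbox.mbox(folder_paths[folder])  # created if it does not exist
    box.lock()
    try:
        box.add(raw_bytes)
    finally:
        box.close()  # flushes, unlocks, and closes the file
    return folder
```

Run as a filter, the script would call something like `file_message(sys.stdin.buffer.read(), paths)`, so it can be tested from a shell with `python file_mail.py < message.txt` and no mail client involved.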
I don't get any spam (Score:3, Funny)

by Istealmymusic ( 573079 ) writes: on Sunday November 03, 2002 @02:17PM (#4589076) Homepage Journal

Can someone explain why this filter would be useful to me?

- Re:I don't get any spam (Score:4, Funny)
  
  by moosesocks ( 264553 ) writes: on Sunday November 03, 2002 @02:49PM (#4589271) Homepage
  
  Just post your email address, and we'll be happy to tell you.
  
bogofilter (Score:4, Informative)

by stype ( 179072 ) writes: on Sunday November 03, 2002 @02:18PM (#4589084) Homepage

This isn't exactly the first Bayesian mail filter out there. I've been using ESR's bogofilter [tuxedo.org] for weeks now, and I must say it works better than I could have ever imagined. Bogofilter, however, is simply for sorting out spam, while it appears this filter can sort out other things. But honestly, I can set up some simple filters to separate personal emails from work emails, so I'm not entirely sure the extra stuff is that useful.

- Re:bogofilter (Score:2)
  
  by Theodore Logan ( 139352 ) writes:
  You quite obviously haven't checked out bogofilter's README [tuxedo.org]. Let me quote:
  This package implements a fast Bayesian spam filter along the lines suggested by Paul Graham in his article "A Plan For Spam".
  'Nuff said.
IMAP (Score:2, Insightful)

by Evil Adrian ( 253301 ) writes:

Does anyone know of any spam solutions for IMAP? Everything I've seen out there is POP3, but goddammit I like my IMAP folders!!! (Not to mention that the server on which my e-mail resides gets backed up nightly...)
- Re:IMAP (Score:2)
  
  by LetterJ ( 3524 ) writes:
  
  If you use SquirrelMail, you can use a Bayes spam filter from the SquirrelMail plugin page.
- Re:IMAP (Score:2)
  
  by vondo ( 303621 ) writes:
  
  Yep. I wrote IMAPAssassin (on sourceforge [sourceforge.net]).
  It's a Perl script that uses SpamAssassin and runs on any machine as an IMAP client. Spam shows up in your INBOX and disappears shortly thereafter.
  People are working on a Bayesian module for SpamAssassin, which will be promising. The great thing about SA (as many others have said) is that it uses a number of inputs to decide if a mail is spam-like, including auto-whitelists which keep track of the people who send you mail.
- Mail.app (Score:2)
  
  by Arker ( 91948 ) writes:
  
  The Apple mail client, mentioned in the blurb, works very well with IMAP; that's what impressed me enough that I'm actually using it.
product of marketrons (Score:2, Interesting)

by hfastedge ( 542013 ) writes:

I don't know if it is true Bayesian

You know, on this issue, you really depress me. You are clearly not of the academic nature, so your stance toward something that's probably way above your head really frustrates part of me.

As long as you're not developing the idea, it shouldn't matter how it works as long as it works.

I read the original article here as you did too. After all the mumbo jumbo about learning, I picked out one effective tip from the article on filtering my email: filter out HTML.

With 1 line of regex I eliminate 95% of my spam:
match and throw it out.
- Re:product of marketrons (Score:3, Insightful)
  
  by crucini ( 98210 ) writes:
  
  You know, on this issue, you really depress me. You are clearly not of the academic nature, so your stance toward something that's probably way above your head really frustrates part of me.
  
  I think you may have misunderstood that comment. Since Paul Graham started talking about Bayesian filtering, there's been some tendency here to refer to all learning spam filters as Bayesian. Which results in complaints, which results in the designation "pseudo-Bayesian" for the many independently-discovered learning algorithms that don't have a theoretical underpinning.
  
  Put another way: if an algorithm outputs a dimensionless "score", and the author can't set an upper bound on the score, it's at most pseudo-Bayesian. If it outputs a probability that the message meets certain criteria, then it could be "true Bayesian". Additional implication: the "pseudo-Bayesian" filter may have a stack of rules in addition to its table of probabilities.
  
  I don't think we're splitting hairs on some deep statistical issue. I think we're groping for very rough categories in a new field of application software. If you can establish clearer categories, that might help.
  
  With 1 line of regex I eliminate 95% of my spam: match and throw it out.
  
  Graham addresses this in the article. One can identify most spam with a simple rules-based engine. That tends to make one lazy in reading the spam folder, which means false positives can languish unread. Enhancing the rules-based engine becomes an ongoing project as the volume and cleverness of spam increase. Hopefully Bayesian filtering can automate this.
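
The distinction drawn above, between a filter that outputs a bounded probability and one that outputs an open-ended score, can be made concrete with a short sketch. `combine` uses the combining rule from Graham's "A Plan for Spam"; `additive_score` is a hypothetical stand-in for a rules engine that sums weights, and the numbers are made up for illustration.

```python
from math import prod


def combine(token_probs):
    """Combine per-token spam probabilities into one value in [0, 1].

    Assumes each probability is already clamped away from 0 and 1
    (Graham uses 0.01 and 0.99), so the denominator is never zero.
    """
    spam = prod(token_probs)
    ham = prod(1.0 - p for p in token_probs)
    return spam / (spam + ham)


def additive_score(rule_weights):
    """Rule-style scoring: a plain sum with no natural upper bound."""
    return sum(rule_weights)
```

With made-up inputs, `combine([0.99, 0.99, 0.2])` stays below 1 no matter how many tokens are added, while `additive_score([2.5, 3.1, 4.0])` keeps growing with every rule that fires.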
Not integrated solution (Score:2, Insightful)

by unfortunateson ( 527551 ) writes:

What will make this thing work is if it is integrated with the e-mail client.

With this tool, you unfortunately have to manually add a message of a certain classification (work, pr0n, spam, family...) to the program through the Perl script -- very awkward.

A tool like this needs to run as a daemon and 'notice' when a message is added to a folder. Unfortunately, with different formats for e-mail folders, it's a much tougher job.

As it stands, with something like Outlook, I'd have to export each message individually, then run the Perl script. I can probably add a macro to do that (with its own pains -- you add a VBA macro to Outlook and it gripes every time you start up), and possibly even one that responds to filing in a folder.... hmm... maybe I will try this out.
- Re:Not integrated solution (Score:2)
  
  by crisco ( 4669 ) writes:
  
  This tool also has a web interface to reclassify mail. Not as good as client integration but a little easier than the command line for the masses.
You know what I'd kill for? (Score:3, Interesting)

by Saint Aardvark ( 159009 ) writes: on Sunday November 03, 2002 @02:34PM (#4589180) Homepage Journal

A version of this for Outlook Express.
I work on the helpdesk of a small ISP; I also take care of the spam filtering, and answer abuse@. We recently added SpamAssassin, and God does it rock [dowco.com]. (The big spike you see is me getting MRTG to graph what SA catches now; it's 6-10 times better than what we used to catch.)
But I still get complaints from our customers about spam that gets through. Just the other day a crapload got through because it was relatively subdued spam (no webbugs, NO LINE OF YELLING, etc); unfortunately, it also advertised pictures of young boys having sex. It's hard to explain why it's very, very hard to filter for this sort of thing, especially when I'm going through the talk for the nth time this week. (I need a good analogy that non-geeks can understand; I'm still looking.)
The good folks at DeerSoft [deersoft.com] have a version of SpamAssassin for Outlook, and are promising one for OE Real Soon Now. But I would loooooooooooooooooooooooove a good spam program -- this or SA or something else -- that I could point our customers to. Download, double-click, say yes, and bam it's installed. I can figure out how to install this on a Unix box; I could probably, eventually figure out how to do it on a Windows box; there's no way the customers could do it.
Or am I missing good, free spam filtering for Windows? Can anyone point me in the right direction?
Slightly OT: There has got to be a huge market for setting up spam filtering for small businesses. My idea: Tell 'em that if they provide the box -- an old Pentium or 486 will do -- I'll set up spam filtering and a firewall on it, set up some maintenance tools (whitelist this, firewall that). They get great mail service, I get $x00.

- Re:You know what I'd kill for? (Score:3, Informative)
  
  by bstadil ( 7110 ) writes:
  
  You know what I'd kill for?
  It might be smarter to read the article than to kill someone.
  You could have installed the program for Outlook in the time it took you to type your rant, but then you would not get any mod points, would you?
- Re:You know what I'd kill for? (Score:2)
  
  by crisco ( 4669 ) writes:
  
  As others have pointed out, that's exactly what POPFile is. Unfortunately it is not yet a point-and-click kind of install, but that is the direction it is heading.
SquirrelMail has a Bayesian plug-in (Score:4, Informative)

by ptbarnett ( 159784 ) writes: on Sunday November 03, 2002 @02:37PM (#4589200)

Plugins - BayesSpam - Intelligent Spam Filter [squirrelmail.org]
SquirrelMail [squirrelmail.org] is a WebMail client implemented in PHP. I use the client, but not the plugin (I use Razor [sourceforge.net]).

Uhmm.. like bogofilter? (Score:3, Informative)

by Jamuraa ( 3055 ) writes: on Sunday November 03, 2002 @02:42PM (#4589231) Homepage Journal

Bogofilter [tuxedo.org] has been out since August, and does this Bayesian spam stuff in C, which probably will run a bit faster than the Perl or Python versions just because of its compiled-ness. I've never run it myself, but people on Debian lists say it works better [debian.org] or not as good [debian.org] as spamassassin [spamassassin.org].

Staged Categories (Score:2, Interesting)

by irritating environme ( 529534 ) writes:

An advertised false positive rate of 0% is nice, but why not do additional research into the spam, to attempt to categorize it into blatant spam, probable spam, borderline, and non-spam, and see if false positives can be plopped into the borderline categories?

Also, from what I saw in the article, there will already be a next level that spam can take: image-based messages, misspellings of key words (klik, Clic, Clik, etc), using 0xfe0000 for almost-bright-red.
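
The staged categories suggested above could be layered on top of any filter that reports a spam probability. A minimal sketch follows; the function name `stage` and the cutoff values are assumptions chosen only for illustration, not thresholds taken from POPFile or the article.

```python
def stage(p_spam, blatant=0.99, probable=0.90, borderline=0.50):
    """Map a spam probability to one of four staged categories.

    The cutoffs are illustrative defaults; a real deployment would tune
    them by checking where its false positives actually land.
    """
    if p_spam >= blatant:
        return "blatant spam"
    if p_spam >= probable:
        return "probable spam"
    if p_spam >= borderline:
        return "borderline"
    return "non-spam"
```

Only the "borderline" folder would then need regular human review, which is where misclassified legitimate mail is most likely to end up.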
Where's the news? (Score:4, Informative)

by Roadmaster ( 96317 ) writes: on Sunday November 03, 2002 @03:09PM (#4589364) Homepage Journal

Just because it's the first one that actually makes the slashdot frontpage it doesn't mean it's the only one.

Do a freshmeat search for bayespam, bogofilter and spamprobe, they're all working and quite mature bayesian filters (or should we say "paulgrahamian" in order to appease the "true bayesian" crowd). Hell, even a search for "bayes" will turn out a few more hits, like ifilter, which aims to automatically classify mail in different folders, but could be easily tuned to filter out spam.

Of these, I think spamprobe is becoming the true "swiss army knife" of "bayesian" filtering; I did find both bogofilter and bayespam spartan, but they work well. spamprobe, on the other hand, is very actively maintained, is under constant improvement by the author, Brian Burton, and has given me excellent results getting rid of over 90% of my spam.

Good in combination with spamassassin? (Score:2)

by Moritz Moeller - Her ( 3704 ) writes:

I am just about to put bogofilter in my mail filtering system. I am thinking about combining this baby with spamassassin, as described here:
http://www.randomhacks.net/2002/09/23/#using-bogofilter-with-spam-assassin

I will use the pass through option and I can use spamassassin to protect against false positives and to adjust the sensitivity.

BTW: Does anyone know if the numbers of spam and non-spam messages have to be about equal, or is the imbalance accounted for? I have 4000 spam mails in a folder, but just about 500 nonspam mails.
- Re:Good in combination with spamassassin? (Score:2)
  
  by Matts ( 1628 ) writes:
  
  FWIW, SpamAssassin 2.50 will include a statistical filter that works like similar bayesian filters.
  
  It should be pretty cool, in that it will automatically train on spamassassin results, as well as allowing you to add or remove spam and non-spams.
  
  Matt (a spamassassin developer)
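
On the question above about unequal corpus sizes: in Graham's article the per-word estimate divides each count by the number of messages in its corpus, so a 4000-to-500 imbalance does not by itself skew the result. The sketch below is a simplified version of that idea (it omits Graham's doubling of good counts and his minimum-occurrence cutoff); whether bogofilter or SpamAssassin does exactly this is not confirmed here, and the function name is hypothetical.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Estimate P(spam | word) from counts normalized by corpus size.

    spam_hits / ham_hits: occurrences of the word in each corpus.
    n_spam / n_ham: number of messages in each corpus.
    """
    spam_freq = min(1.0, spam_hits / n_spam)
    ham_freq = min(1.0, ham_hits / n_ham)
    if spam_freq + ham_freq == 0:
        return 0.4  # value Graham's article assigns to never-seen words
    p = spam_freq / (spam_freq + ham_freq)
    return max(0.01, min(0.99, p))  # clamp as in Graham's article
```

For a word seen 400 times in 4000 spams and 50 times in 500 good mails, both frequencies are 10%, so the estimate is neutral, whereas comparing raw counts (400 against 50) would wrongly make the word look spammy.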
Developers missed this... (Score:3, Insightful)

by bigberk ( 547360 ) writes: <bigberk@users.pc9.org> on Sunday November 03, 2002 @03:15PM (#4589381)

In my testing (over the last 30 mins) I discovered that filtering is employed when the POP3 "RETR" (retrieve entire message) command is used but no filtering is done when the equally useful "TOP" (show me the headers and X lines of the body) command is issued by a client.

A huge advantage of also doing the filtering for the TOP command would be that mail clients such as The Bat [ritlabs.com], Pimmy [geminisoft.com], JBMail [pc-tools.net] and PocoMail [pocomail.com] will let you preview all headers while leaving mail on the server (or deleting it, whatever) but without actually downloading the full message bodies.
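
To illustrate the point above, a POP3 proxy has to recognize both client commands that return message text. The sketch below only parses the command line, following the `RETR msg` and `TOP msg n` forms defined for POP3; the function name is hypothetical and this is not POPFile's code.

```python
def fetches_message_text(command_line):
    """Return True if a POP3 client command will return message content.

    RETR takes one argument (message number); TOP takes two
    (message number and number of body lines).
    """
    parts = command_line.strip().split()
    if not parts:
        return False
    verb = parts[0].upper()
    if verb == "RETR":
        return len(parts) == 2
    if verb == "TOP":
        return len(parts) == 3
    return False
```

A proxy that classified the server's reply whenever this returns True would tag previews fetched with `TOP 3 20` as well as full downloads with `RETR 3`, at the cost of classifying on less text.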

- Re:Developers missed this... (Score:2)
  
  by crisco ( 4669 ) writes:
  
  That's a good idea. Message classification would get less accurate on just the headers or headers+top of message but that might be enough to avoid downloading spam (biggest drawback to POPFile, you still download the spam, only to delete it).
  - - Re:Developers missed this... (Score:2)
      
      by crisco ( 4669 ) writes:
      
      No, not presently. It seems the author wants to keep it as simple as possible. However, as it matures it might be great to look at all the different ways people use mail and mail clients and start making allowance for what people like to do.
Is this intended for server, client, or both? (Score:3, Insightful)

by Rooney444 ( 617388 ) writes: on Sunday November 03, 2002 @03:29PM (#4589479)

If this is only intended for client side use then it still doesn't address the issue of all the bandwidth that spam wastes. Wouldn't it just be a better project to help all the idiots close the open relays on their servers? Or maybe require authentication on all SMTP servers?

- Re:Is this intended for server, client, or both? (Score:4, Informative)
  
  by dzym ( 544085 ) writes: on Sunday November 03, 2002 @03:38PM (#4589550) Homepage Journal
  
  Yes, but remember, who runs the SMTP servers?
  The very design of the whole system specifies that anyone can just turn on a machine, hook it up to a network somewhere, and start spewing out messages to smtp ports all over the world.
  It doesn't have to be a sendmail, qmail, or exim server, remember. Some Windows viruses have taken advantage of that loophole to set up mini-SMTP servers in the network stack to continue propagating viruses without needing to connect to anything that provides authenticated external relay.
  
What about random misspellings? (Score:2, Interesting)

by archeopterix ( 594938 ) writes:

Hm... what about an anti-anti spam filter that mangles the message inserting random misspellings into the spam-identifying words? The bayesian filter would perceive this as a message consisting of many 'unclassified' words, just like a message in some unknown language. Sure, the short words probably haven't got many possible misspellings (cock, c0ck, coock, cokc - hm... starts to look undecipherable ), so they would probably get classified after some time. And this would hopefully lower the spam success ratio. But the possibility still remains...
- Re:What about random misspellings? (Score:3, Interesting)
  
  by PigleT ( 28894 ) writes:
  
  Dual feedback loops. Every mail that matches spam gets fed back into the system so both the is-spam wordlist AND the is-good wordlists become more "concentrated" over time.
  Ifile does this, bogofilter does this with some wangling in procmail, ...
  
  That way, if someone sends something that's still mostly spam (one or two words in common with spam, enough to tip the balance) then all the neutral words will be tarnished as well.
  - Re:What about random misspellings? (Score:2, Interesting)
    
    by archeopterix ( 594938 ) writes:
    
    Dual feedback loops. Every mail that matches spam gets fed back into the system so both the is-spam wordlist AND the is-good wordlists become more "concentrated" over time. Ifile does this, bogofilter does this with some wangling in procmail, ... That way, if someone sends something that's still mostly spam (one or two words in common with spam, enough to tip the balance) then all the neutral words will be tarnished as well.
    This is clever, but might have some undesirable side effects. Suppose a spammer attaches a long list of neutral words to his e-mail in order to 'dilute' the bad words. This way some innocent words might get assigned positive spam probability thus resulting in false positives later.
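
The feedback loop and the dilution worry discussed above can both be seen in a toy model. The sketch assumes a filter that feeds every judged message back into per-word counts; the class and method names are hypothetical and this is not how ifile or bogofilter is actually implemented.

```python
from collections import Counter


class WordLists:
    """Two word lists that are updated from every classified message."""

    def __init__(self):
        self.spam = Counter()
        self.good = Counter()

    def feed_back(self, tokens, judged_spam):
        """Count each distinct word once for the list the message went to."""
        target = self.spam if judged_spam else self.good
        target.update(set(tokens))

    def spam_share(self, word):
        """Fraction of this word's sightings that were in spam, or None."""
        total = self.spam[word] + self.good[word]
        return self.spam[word] / total if total else None
```

If "meeting" has appeared in three good mails, its spam share is 0.0; after three spams padded with "meeting" are fed back, it rises to 0.5. That is the tarnishing effect described in the parent comment, and also the side effect the reply warns about.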
Missing the point? (Score:5, Informative)

by crisco ( 4669 ) writes: on Sunday November 03, 2002 @03:46PM (#4589600) Homepage

I think lots of people here are missing the point of POPFile. Everyone is happy to point out that there are already several assorted solutions to Bayesian mail filtering in many different languages. Nearly all of these work on the mail server. Now lots of us are qualified and interested in setting up our own mail server, customizing the mail processing to our own One True Way and happily enjoying an inbox free of spam. But the average Windows user has no idea how to set up a mail server. Others could easily do it but feel their time is better spent on other things, not administering a mail server.
This is what POPFile is for. It's a POP3 proxy server: it sits between your POP3 client and the server and simply adds a classification to the headers (or the subject line for braindead mail clients).
Currently POPFile is a bit rough on computer newbies; it needs a Perl install and such. However, if you read the forums, it is intended to end up as an easily installed executable for Windows users and to remain a nifty little Perl script for the rest of the platforms where it might come in handy. So when those pesky friends and relatives come asking about all the viagra and farmyard spam they get (and you haven't already set them up on your tightly filtered mail server), set up POPFile for them.
Also, it's not just for spam filtering. Think of what you could do if you could go beyond simple rules for your inbox. Want email you think is important forwarded to your phone? Create a category for important email and go through your archives and feed POPFile email you would have wanted forwarded instantly. Create a new folder to receive those mails and watch it for a few days, retraining POPFile until it is getting reasonably good at putting important mail in there. Now set up your mail system to forward those to your phone. Will it work? I don't know, but based on the results I'm getting, it probably would. How about using it to filter help desk emails?
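
The tagging step described above (add a classification header, or rewrite the subject for clients that cannot filter on custom headers) might look roughly like this with Python's standard `email` module. The header name `X-Text-Classification`, the `[bucket]` subject prefix, and the function name are assumptions made for the sketch; check POPFile's own documentation for what it actually emits.

```python
import email
from email import policy


def tag_message(raw_bytes, bucket, tag_subject=False):
    """Return the message with a classification header added.

    bucket is the category name chosen by the classifier. When
    tag_subject is True the subject line is prefixed as well.
    """
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    msg["X-Text-Classification"] = bucket  # assumed header name
    if tag_subject:
        subject = str(msg.get("Subject", ""))
        del msg["Subject"]  # Subject may appear only once
        msg["Subject"] = f"[{bucket}] {subject}"
    return msg.as_bytes()
```

The mail client then needs only an ordinary rule such as "header contains spam, move to Junk", which is why this approach works without any client-specific plugin.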

Yahoo! Mail (Score:2)

by sfe_software ( 220870 ) writes:

No one has mentioned it so far, but Yahoo mail has a Bulk Mail folder. SPAM is automatically sent there, and I have yet to see a single false positive (and false negatives are quite rare as well).

The system works surprisingly well. I checked the FAQ and it doesn't go into any detail about how it works, but I wouldn't doubt if something like this is being used.

I've been thinking, and it seems that this could potentially have a lot of use, aside from Spam filtering. Perhaps a mail client could let you categorize email in general (SPAM, Business-related, forwarded stuff from AOL users, etc), and learn how to spot and organize things.

I'm putting this (either POPFile or bogofilter) into place with a modified SquirrelMail, just to give it a good run; I might try and modify it to also categorize other types of email, just to see if something like that could work.

I could easily see a mail client (web-based or otherwise) that lets you drag mail to specific folders, and eventually learns how to do this for you (and of course you can always correct it by simply dragging to another folder, which also contributes to the learning process)...

After reading this article [paulgraham.com] my mind is just spinning with ideas... Bayesian search engines... perhaps speech/voice recognition applications... classifying text/html/doc files... organize songs (processing the lyrics)... ugh, I should stop now :)
Bayes (Score:5, Funny)

by John Garvin ( 229844 ) writes: on Sunday November 03, 2002 @03:56PM (#4589654) Homepage

Now we can tell spammers: "All your Bayes are belong to us."

Multi-purpose tool (Score:3, Interesting)

by B'Trey ( 111263 ) writes: on Sunday November 03, 2002 @04:10PM (#4589736)

An interesting idea that I haven't seen discussed is using this concept for more general uses. If we can sort spam from non-spam, how about business from personal? Technical from administrative? All you'd need is multiple databases of word probabilities, the ability to assign emails to multiple categories and a hierarchical method of sorting.
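
The multi-category idea above (one table of word statistics per category) is essentially a multi-class naive Bayes classifier. The following is a minimal sketch with add-one smoothing and log probabilities; the class name, bucket names, and training words are invented for illustration, and it does not attempt the hierarchical sorting the comment mentions.

```python
import math
from collections import Counter, defaultdict


class MultiBucketClassifier:
    """Naive Bayes over any number of user-defined buckets."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> count
        self.message_counts = Counter()          # bucket -> messages seen

    def train(self, bucket, tokens):
        self.word_counts[bucket].update(tokens)
        self.message_counts[bucket] += 1

    def classify(self, tokens):
        vocab = set().union(*self.word_counts.values())
        total_msgs = sum(self.message_counts.values())
        best_bucket, best_score = None, -math.inf
        for bucket, counts in self.word_counts.items():
            # log prior plus smoothed log likelihood of each token
            score = math.log(self.message_counts[bucket] / total_msgs)
            total_words = sum(counts.values())
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_words + len(vocab)))
            if score > best_score:
                best_bucket, best_score = bucket, score
        return best_bucket
```

After training on a few messages per bucket, `classify(["report", "meeting"])` picks whichever bucket has seen those words most often relative to its size; correcting a wrong guess is just another call to `train`.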

this battle cannot be won (Score:4, Insightful)

by mboedick ( 543717 ) writes: on Sunday November 03, 2002 @04:13PM (#4589766)

These technologies are interesting, but the problem of spam should be solved at the source. Why should we waste our time, money, CPU and drive space trying to outwit spam with clever software? As has been said before, if you filter spam at the inbox, a lot of resources have already been wasted by the time it arrives.

Spam is anti-social behavior - a perversion of technology to make a quick buck. It's a cancer, and we should try to kill it. If you try to fight it any other way, you will constantly be playing catch-up, as the spammers have technology on their side too.

- Re:this battle cannot be won (Score:4, Insightful)
  
  by shayne321 ( 106803 ) writes: on Sunday November 03, 2002 @06:48PM (#4590645) Homepage Journal
  
  These technologies are interesting, but the problem of spam should be solved at the source.
  And how do you propose we solve the problem at its source? Make it illegal? They'll just find loopholes in the law and/or move to a country where it is legal. Hunt them down and murder their wife and kids in front of them then hang them from a tree? Satisfying though it may be, last I checked murder was illegal.
  Techniques like this CAN eventually solve the problem. As others have pointed out, for someone to buy something from a spammer they have to READ the spam. If they send out 1 million spams and 500,000 people read them and 20 of them buy something, they'll keep doing it. If they send out 1 million and only 500 people read it and 1 person buys something, they'll lose their source of income and have to find a new line of work.
  Also, for each obstacle we put in their way (checksum databases, open relay databases, filters, etc) it costs them more time, effort and therefore, money to send their crap - all for less income.
  Shayne
  
- Re:this battle cannot be won (Score:3, Insightful)
  
  by crucini ( 98210 ) writes:
  
  It's all very well to say that spam should be stopped at the source, but how do you plan to do that? Blocklists that pressure the ISP? SPEWS is pretty effective, but Verio, UUNet and Sprint are deeply committed to spam. They won't dislodge their pet spammers until they feel financial pain. Want the government to stop spam at the source? I see lots of problems with that. One of them is the creation of another eternal government responsibility like the war on drugs. They will forever need more funding for "the war on spam" because spammers are getting more clever. These federal agencies develop a symbiotic relationship with the "problems" they're trying to "solve".
  
  In practice, a multipronged approach will work best, combining prosecution, litigation, blocklists, content-based filtering, complaints to upstream providers and education of new users. Graham's article, in fact, shows how attempts to avoid prosecution push spammers into the arms of content-based filtering.
  
  I don't ask for a 100% solution to spam, because any such solution will have awful side effects.
Other applications... (Score:3, Funny)

by Ed Avis ( 5917 ) writes: <ed@membled.com> on Sunday November 03, 2002 @07:29PM (#4590907) Homepage

How long until we can set up Bayesian by-word filtering on Slashdot comments?

Spamassasin (Score:3, Interesting)

by fireboy1919 ( 257783 ) writes: <rustyp&freeshell,org> on Sunday November 03, 2002 @08:59PM (#4591345) Homepage Journal

This seems to be about using strange approaches to spam filtering, but really... a Bayesian network seems to be a natural step for a system that until now was composed of a series of heuristics with no knowledge of which is more important.

(Why hasn't it been done? Bayesian networks are only taught in AI and statistics classes).

What really interests me is that SpamAssassin claims to use a genetic algorithm [spamassassin.org] to rate how likely an e-mail is to be spam.

Already patented by MicrosofT (Score:3, Informative)

by barfy ( 256323 ) writes: on Monday November 04, 2002 @04:29AM (#4592944)

This whole methodology is already patented by Microsoft. ANY implementation not licensed by Microsoft is going to be a violation... And now that you know, it is treble damages...

patent 6,161,130 [uspto.gov]

- Re:Um. No. (Score:2)
  
  by judd ( 3212 ) writes:
  
  I think you have failed to understand how the filter works.
  
  It is "trained" on a corpus of spam, which is compared to a corpus of known good messages. The important part is that YOU, the user, supply the spam corpus and the good messages. Thus in your case, as long as your "good spamlike messages" are in your "known good pile", similar new ones from the same source will not be tagged as spam. This is where the statistical approach shines over simple keyword matching.
  
  Go on, read about how it works. You might learn something.
- Re:Um. No. (Score:2)
  
  by jjo ( 62046 ) writes:
  
  Well, if all spam is indistinguishable from the legitimate spamlike messages you want to see, then no filter will help you.
  
  However, it seems more likely that a large proportion of spam is distinguishable from mail you want to see. It's quite plausible that you don't want to see messages about nympho sluts, or penis enlargement, or breast enlargement (or at least not all three), and that a naive Bayesian filter could easily distinguish these and other spams from mail you do want to see.
- Re:Um. No. (Score:2)
  
  by rgmoore ( 133276 ) writes:
  
  You're wrong, though. The whole point of this kind of filter is that it develops its rules based on the information that you give it, not what somebody else thinks. If you tell it that mails from your legitimate business partners aren't spam, it learns to tell them apart. I use a Bayesian filter on my mail, and it has no trouble telling my legitimate business mail, like messages from Amazon about books I've been waiting for, from illegitimate ones. Some of that is that the legitimate mail is written with a very different style from the illegitimate stuff, but I assume that the filter has also learned that mail with amazon.com as the sender is OK. In any case, I find that it just plain works.
- Re:Um. No. (Score:2)
  
  by dvdeug ( 5033 ) writes:
  
  Does this system know what businesses I've given my credit card to?
  
  Do you understand what a bayesian filter does? It tries to figure out what you consider spam. I don't like dentists sending me advertising junk; bogofilter trashes it. Anything about Esperanto or Project Gutenberg or Linux could probably fly on through, as it's got a lot of words that actually appear in my good email in it. At worst, a couple messages from that business get caught, and then it will recognize that the messages are good based on sender and embedded URL's.
  
  In any case, there tends to be a huge difference between the messages I've got from companies I've given my credit card to and the ones that are sending me spam. Usually, one is quietly informing me of new items for sale, and one is screaming about crap. A bayesian filter can often tell the difference.
- Re:Um. Yes (Score:2)
  
  by crisco ( 4669 ) writes:
  
  Try it, you might be surprised.
  You can separate the newsletters from the businesses you've opted in to from the penile-enlargement spam. That's one of the beautiful things about POPFile: it isn't just about spam vs useful mail. In fact, it seems to be more accurate and learn faster when you define categories for all the different types of mail you receive, not just spam vs inbox.
- Re:As effective as a well trained secretary (Score:2, Insightful)
  
  by bmwm3nut ( 556681 ) writes:
  
  but, unlike your secretary not showing you things, you can just set up the filter to put the spam in a spam folder. you can then periodically look at it and see if there are any false positives. or you can tell the filter to delete things that are 95% spam, but put things that are still most likely spam in a special folder. that's what's great about learning algorithms, they can always adapt to what you want (if you teach them enough).
- Re:Spam will be spam (Score:2)
  
  by acceleriter ( 231439 ) writes:
  
  I try the creative step of prepending common Chinese names, e.g. zhao@chinacenter.com, chen@chinacenter.com, lchen@chinacenter.com, chang@chinacenter.com. Along with a nice "Thank you" for the beautiful picture of the Dalai Lama they sent me, and good wishes that the freedom of information contrary to the PRC's politics continues.
- Re:Ximian Evolution? (Score:2, Informative)
  
  by rgmoore ( 133276 ) writes:
  
  With some cleverness, you can use any outside filter with the most recent version (i.e. the development fork) of Evolution. They've added the ability to pipe incoming messages to an outside program and read back the exit code. So if the program is written using standard Unixisms- i.e. it reads on standard input and returns a different value depending on whether the incoming message is spam or not- it can be used with Evolution. I know that bogofilter [sourceforge.net] can do this because I'm using it with Evolution and it works pretty well.
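
The pipe-and-exit-code mechanism described above can be sketched with Python's `subprocess` module. `run_filter` is a hypothetical helper, and `STAND_IN_FILTER` is a toy command invented so the example runs without any spam filter installed; what each exit status means is defined by the real tool, so its man page should be consulted rather than assuming the convention used here.

```python
import subprocess
import sys

# Toy external filter (assumption for the demo): exits 0 if the message
# mentions "viagra", 1 otherwise. A real setup would name the actual
# filter program here instead.
STAND_IN_FILTER = [
    sys.executable,
    "-c",
    "import sys; data = sys.stdin.buffer.read().lower(); "
    "sys.exit(0 if b'viagra' in data else 1)",
]


def run_filter(command, raw_message):
    """Pipe one raw message to an external program, return its exit status."""
    result = subprocess.run(command, input=raw_message, stdout=subprocess.DEVNULL)
    return result.returncode
```

A mail client doing what the comment describes would then branch on the returned status, for example moving the message to a junk folder when the status matches the filter's "spam" value.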

There may be more comments in this discussion. Without JavaScript enabled, you might want to turn on Classic Discussion System in your preferences instead.
