Slashdot
More on Bayesian Spam Filtering
Posted by
michael
on Tue Sep 17, 2002 03:26 PM
from the snake-eyes dept.
michaeld writes "The "Bayesian" technique for spam filtering recently publicized in Paul Graham's essay A Plan for Spam doesn't actually seem to have anything Bayesian about it, according to Gary Robinson (an expert on collaborative filtering). It is based on a non-Bayesian probabilistic approach. It works well enough, because technology often doesn't have to be 100% perfect to do something that really needs to be done. The problem interested Robinson, and he posted his thoughts on fixing the problems in the Graham approach, including adding an actual Bayesian element to the calculations."
This discussion has been archived.
No new comments can be posted.
How about Machiavellian Spam Filtering (Score:1, Funny)
Spam spam spam (Score:1)
Of course, the 1% of non-spam that accidentally gets filtered out is just collateral damage (except it's normally something really important like a tin of processed peas or something).
I'm going to sit down now and take some more HGH.
spam (Score:1)
I still think passive euthanasia is the best way. (Score:2, Flamebait)
Until politicians get fed up and people actually get SUED for spamming (for once you could have a good reason to sue real bad guys), nothing will change.
Yes, I know it's beginning in SOME states, so for local spam I think legislation will make its way through in a few years, and we'll be able to look in our mailboxes without finding TD Waterhouse spamming us when we already have an account with them, etc.
The other problem now is overseas spamming, especially from China/Taiwan. I mean... I don't read Chinese, I don't plan on buying that #.#" something from overseas, so why do they spam us like that? I never get it, but I'd be all for passive euthanasia (i.e. ban their IPs at the router level), and if this is bad for business or relations or whatever, well, MAYBE they will do something about it.
Here where I work, it's simple: one spam, and I ban a whole class straight off the servers. If one day I get a call because someone couldn't reach us (if they really need to reach us, we have a phone anyway!), I'll be sure to tell him why. Too bad this is not happening at the backbone level, because some people would get their act together fast and apply legislation globally.
Tutorial on Bayesian Inference (Score:5, Informative)
The timing of this article seems impeccable, since I am myself trying to learn about Bayesian Statistics.
I am a Computer Science student [ime.usp.br] studying Computational Biology [ime.usp.br] (more specifically, Sequence Alignments) and while I have a bit of background on Classical Statistics, I was (and still am) completely ignorant about Bayesian Statistics.
It is only now that I'm trying to learn about Hidden Markov Models and their applications to Sequence Alignment that I finally decided to learn the basic hypotheses of Bayesian Statistics and how they differ from the hypotheses made by Classical Statistics.
During my searches for finding introductory material on Bayesian Statistics, I found this course page [arizona.edu] which has some nice introductory notes, including Bayesian Statistics.
I hope that other people find this resource as useful as I did.
Post your results here (Score:5, Interesting)
I'd like to hear about modifications to this system. I removed Graham's doubling of "good" word frequencies, and I trained my filter using digrams. I also tried all the various methods supplied by the program "rainbow", with good results, but the implementation was too slow and clunky to place in the middle of my email delivery system. What are other possible modifications?
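For anyone who wants to try the digram idea, here is a minimal Python sketch. It is not the poster's code: the tokenizer regex and the function name `digrams` are assumptions made purely for illustration.

```python
import re

def digrams(text):
    """Return adjacent word pairs (digrams) from text, lowercased."""
    # Assumed tokenizer: runs of letters, digits, apostrophes,
    # dollar signs, and hyphens count as one word.
    words = re.findall(r"[A-Za-z0-9$'-]+", text.lower())
    # Pair each word with its successor.
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

print(digrams("Click here for FREE money"))
```

Each pair can then be counted exactly as a single word would be, so the rest of a word-frequency filter does not need to change.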
Re:Post your results here (Score:5, Interesting)
You can grab the source here [saturn5.com], but it is specific to the exact way that my mail gets delivered (via offlineimap into maildirs).
The proof of the pudding... (Score:5, Interesting)
We will now have many slashdot posts saying "I've not tested this but I think A (or B, or C, or X)"
Here's where the scientific method comes into its own. Anyone who cares enough can actually test and post their results. I'd be interested in seeing what they look like. I don't have a database of spam to test against (and please don't volunteer to sign me up for some
poor Hotmail users are still in the cold... (Score:4, Funny)
Filter any message without the @ in the address.
Filter Britney, Boobs, Penis, Inches, WIN, ___
Now you only have about 40 spams a day to deal with instead of 100.
Uncheck your information from being in the MSN directory too.
Enjoy
John
Terrible Spam Filters (Score:3, Informative)
It's funny how bad the standard Microsoft spam filter is (the one present in Outlook). It's simply a word lookup: if a listed word or phrase is present, the message is marked as spam. It looks for things like "for free?". You can see the full list here [iirusa.com], near the bottom. It's a little old, but not outdated (I think you can upgrade your spam filters, but I tested these, and the ones I tested work).
The adult filter isn't any better.
Naive Bayesian Learning (Score:2, Interesting)
Let's see (Score:5, Funny)
Now, given that I have prior knowledge that:
P (It will enlarge my penis)
is very low,
and given that, having never encountered anything which enlarges my penis in any permanent way, I have no knowledge of
P (This is Spam | It will enlarge my penis)
and we have the product of one probability which I know is low, and another of which I have no posterior knowledge, so we conclude that P (It is Spam) is also low, and that I must have requested more information on their new penile enlargement technique.
So, that message goes into the keepers.
Meanwhile,
P (It is Spam) = P (It is Spam | Frank is getting married) * P (Frank is getting married)
So, I know Frank is getting married, since he sent me this e-mail I'm considering filtering as Spam, and whether or not it is spam is pretty much independent of whether or not Frank is getting married, so.... it's Spam. Away it goes.
P.S. I've deliberately made a hash of this for a joke. The actual rule is:
P (A & B) = P (A | B) * P (B)
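A quick way to convince yourself of the rule is to check it on counts. The sketch below uses made-up numbers (100 messages, 40 mentioning "free", 30 of those being spam); none of these figures come from the thread.

```python
from fractions import Fraction

# Hypothetical counts, invented for illustration.
total = 100        # all messages
n_b = 40           # messages that mention "free" (event B)
n_a_and_b = 30     # messages that mention "free" AND are spam (A & B)

p_b = Fraction(n_b, total)               # P(B)
p_a_given_b = Fraction(n_a_and_b, n_b)   # P(A | B)
p_a_and_b = Fraction(n_a_and_b, total)   # P(A & B)

# The product rule: P(A & B) = P(A | B) * P(B)
assert p_a_and_b == p_a_given_b * p_b
print(p_a_and_b)
```

Exact fractions are used so the equality holds without floating-point rounding.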
Whatever Jaguar (Mac OS X 10.2) uses works! (Score:1, Interesting)
I, myself, am not sure but the new Mail.app is smart and it does learn. After a week of "learning" it has correctly identified messages as spam more than 99 times out of 100.
filtering not the answer - maybe this is (Score:5, Insightful)
Here is a suggestion for something that might make an impact on spammers: If I open my firewall, I see several attempts a day from people trying to get into my mail server. Of course, I don't have a mail server, but spammers are always looking for open relay points they can spam from. My suggestion: Give them a nice open relay server they can send mail to. Of course, you don't want to piss off your service provider by sending spam, and your upstream speed might limit you to less than you can receive, so rather than run a full mail server let's modify some mail server code to just accept mail and send it to the bit bucket. Maybe we can even misconfigure existing code to do this with no programming changes.
No valid user will be affected, assuming you don't otherwise run a mail server. All that bandwidth you pay for can be used to receive e-mail from spammers before it ever goes out. Eventually their customers will see the response go from .1% to 0% and their business will dry up. This will impact spammers; blocking your own spam after it's been delivered will not.
This need not even impact your own bandwidth. You can run the server when you are done using your system (Might make a nice screen saver - a black screen that just shows how many spammed addresses were prevented from getting spammed). Or you can impose limits on bandwidth at a firewall or router, or even restrict hours of access.
If we set up enough different false open relay servers I think we could have a real impact on the spammers.
Re:filtering not the answer - maybe SPOOFSERVERS (Score:4, Insightful)
BUT, an early spam filter at an ISP worked just like that. The design parameters were 1) that spam filtering require no more resources than actual delivery of the message, and 2) that the filter give no indication to the spammer that the message was not going to be delivered. This gives the spammer no feedback and forces THEM to waste CPU cycles, which will slow them down.
Neural Net Spam Filtering (Score:3, Interesting)
Our approach worked pretty well (95-97% accuracy), and we had to deal with the same issues that the above "Bayesian" approach did. I.e., weighing the neurons so that false positives occur much less frequently than false negatives, etc. We built it using data on spam collected from the UCI machine learning repository.
It ties in with procmail. I'm not really a windows guy, so if anyone knows how to put a filter between an IMAP server and Microsoft Outlook/Netscape Communicator, I'd be interested in hearing how it's done.
The README for it is at: http://www-cse.ucsd.edu/~wkerney/spamfilter.READM
And you can download it at:
http://www-cse.ucsd.edu/~wkerney/spamfilter.
-Bill Kerney
wkerney at ucsd.edu
SpamAssassin - duh (Score:3, Interesting)
With so many people using SpamAssassin these days, I can't see how this is a timely or newsworthy item. More like from the been-there-done-that dept.
How do you pronounce "Bayesian" anyways? (Score:2)
While I love everything there is to love about open source (code and ideas), I kind of worry when I read how well all these new Bayesian/Grahamian filtering techniques work.
Not being a coder or statistician myself, I'm left wondering if the spammers can exploit it for a workaround. Is there something "built in" to these filtering techniques that can be used by spammers to effectively circumvent them?
Well... (Score:2, Informative)
Anyway I hear that the next version of MSN will have a Bayesian filter and that it will be introduced in an upcoming version of Outlook Express (no idea about Exchange and Outlook).
BTW I believe internally MS uses this technique for spam control and that they don't seem to have any spam problems.
Why just spam? (Score:1)
On the other hand, I get hundreds of emails every few days covering a range of topics, which need to be manually sorted into folders.
What I'd like to see, and I suspect I'm not alone here, is similar software that can sort email into any number of categories, not just spam and non-spam.
For example, if I have an email folder called 'fishing', containing emails from fishing buddies, then next time I get an email containing references to 'casting', 'trout' and 'it was *this* long', it should be sorted into that folder automatically.
I'd be curious to know if there's any existing software to do this, and if not, I'd be tempted to have a go at knocking something up to do this.
One tricky bit would be how to integrate it with the email client. I'd imagine that users wouldn't want to switch away from Outlook/Mozilla/Mutt/Whatever merely for this feature, so it would have to be client-agnostic.
I'm thinking that implementing a simple IMAP server would be the easiest option since this allows for server-side folder management. It would then be a case of maintaining word counts (Bayesian or otherwise) for each folder, and classifying mail accordingly.
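The per-folder word-count idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `FolderClassifier`, the tokenizer, and the use of add-one smoothing are all choices made here, not something described in the post, and the IMAP integration is left out entirely.

```python
import math
import re
from collections import Counter, defaultdict

class FolderClassifier:
    """Naive Bayes over word counts, one count table per folder."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> count
        self.message_counts = Counter()          # folder -> messages seen

    @staticmethod
    def tokenize(text):
        # Assumed tokenizer: lowercase runs of letters and apostrophes.
        return re.findall(r"[a-z']+", text.lower())

    def train(self, folder, text):
        self.word_counts[folder].update(self.tokenize(text))
        self.message_counts[folder] += 1

    def classify(self, text):
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_messages = sum(self.message_counts.values())
        best_folder, best_score = None, -math.inf
        for folder, counts in self.word_counts.items():
            # Log prior: the share of training mail filed in this folder.
            score = math.log(self.message_counts[folder] / total_messages)
            folder_total = sum(counts.values())
            for word in self.tokenize(text):
                # Add-one smoothing so unseen words don't zero the score.
                score += math.log((counts[word] + 1) / (folder_total + len(vocab)))
            if score > best_score:
                best_folder, best_score = folder, score
        return best_folder

clf = FolderClassifier()
clf.train("fishing", "great casting today, caught a trout this long")
clf.train("work", "meeting agenda and quarterly budget report attached")
print(clf.classify("the trout were biting, casting was easy"))
```

Working in log space keeps long messages from underflowing to zero, and the same code handles any number of folders, with spam being just one more of them.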
Anyone else had any thoughts along these lines?
Brain exploded (Score:2, Funny)
Bayesian Filtering Works (Score:1)
Could someone tell me... (Score:1)
authorization based email box (Score:1)
keyword matching isn't the answer (Score:2, Interesting)
i don't see why they can't implement some system that scans incoming mail for its users' mailboxes, maybe does a checksum for each message or something, and if it finds that a number of its users are receiving exactly (or nearly exactly) the same message, assume it's spam. nuke the messages, and any new incoming ones.
yeah, if such a system only scans a small number of mailboxes, it may filter out mailing list posts and so on. but it gets more and more reliable the higher number of mailboxes it tracks.
this avoids searching for certain keywords and eliminates false positives. after all, how well would these keyword searching methods do if i were to quote a spam message in an email to a friend?
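a rough sketch of the checksum idea in Python follows. everything in it is an assumption for illustration: the names `fingerprint` and `BulkDetector`, the whitespace/case normalization, and the threshold of three mailboxes. a plain hash only catches copies that are identical after normalization; "nearly exactly the same" would need some kind of fuzzy fingerprint, which is not shown here.

```python
import hashlib
import re
from collections import defaultdict

def fingerprint(body):
    """Hash a normalized body so trivially different copies collide."""
    # Collapse whitespace and ignore case before hashing.
    normalized = re.sub(r"\s+", " ", body.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

class BulkDetector:
    def __init__(self, threshold=3):
        self.threshold = threshold          # hypothetical cutoff
        self.recipients = defaultdict(set)  # fingerprint -> mailboxes seen

    def is_bulk(self, mailbox, body):
        """Record a delivery; report whether enough distinct mailboxes got it."""
        seen = self.recipients[fingerprint(body)]
        seen.add(mailbox)
        return len(seen) >= self.threshold

detector = BulkDetector(threshold=3)
results = [detector.is_bulk(user, "Buy   NOW!\nLimited offer")
           for user in ("alice", "bob", "carol")]
print(results)
```

counting distinct mailboxes rather than raw deliveries means one user receiving the same message repeatedly does not trip the threshold by itself. a legitimate mailing list would still trip it, which is the weakness noted above for small deployments.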
Bayesian vs not isn't really the point (Score:4, Insightful)
I'm not sure why this particular article needed to be posted, as it's just one of several alternative approaches and an untested one at that. On Paul's page, he also lists several published academic papers with other alternatives -- all actually tested, of course.
Gary is basically right in questioning the use of the word "Bayesian". Paul's approach is more about weighing "evidence" as given by the appearance of certain words, rather than figuring out the probability of spam assuming a "prior". See Paul's explanation [paulgraham.com], but if you check the article he references at the end, you'll note that the method Paul uses is only one of several methods to solve an underspecified problem. It's a reasonable guess, not necessarily the only guess.
Looking at another article [lanl.gov] Paul references, given the word independence assumption, the more formal Naive Bayesian approach calculates as follows:
p(spam|word1,...,wordn) = [ p(spam)*p(word1|spam)*...*p(wordn|spam) ] / [ p(spam)*p(word1|spam)*...*p(wordn|spam) + p(!spam)*p(word1|!spam)*...*p(wordn|!spam) ]
This is similar to Paul's approach except for including a "prior" assumption of p(spam) -- the expected probability of any email being spam, calculated from the historically observed frequency of spam. By leaving it out, Paul implicitly assumes that 50% of mail is spam -- that's his "prior" estimate of the spam rate. Given the other adjustments he makes to his sample, that appears to be acceptable in practice. (Paul overweights the spam prior, but also overweights the effects of "good" words.)
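To see how much the prior matters, the formula above can be evaluated directly. The Python sketch below is an illustration only: the function name `naive_bayes_spam` and the per-word likelihoods are invented, not taken from either article.

```python
import math

def naive_bayes_spam(prior_spam, likelihoods):
    """Posterior P(spam | words) under the word-independence assumption.

    likelihoods: list of (P(word | spam), P(word | not spam)) pairs.
    """
    spam = prior_spam * math.prod(ps for ps, _ in likelihoods)
    ham = (1 - prior_spam) * math.prod(ph for _, ph in likelihoods)
    return spam / (spam + ham)

# Hypothetical per-word likelihoods, invented for illustration.
words = [(0.20, 0.01), (0.05, 0.10)]

print(naive_bayes_spam(0.5, words))  # 50% prior, as implied by leaving it out
print(naive_bayes_spam(0.1, words))  # a lower observed spam rate
```

With these made-up numbers the same evidence gives roughly 0.91 under a 50% prior and roughly 0.53 under a 10% prior, so the choice of prior can move a message across a decision threshold.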
I'd personally prefer to overweight the "good" e-mails entirely rather than just put a "good-multiplier" on them like Paul does, but that's just quibbling over small bits.
As to the bit that Gary raises about Paul assuming a spam probability for an unknown word -- Paul originally said .2, then revised to .4, but really should have put it at .5 or just excluded it from all calculations. A new word has no robustness as a predictor (which is why Paul dropped words that didn't appear five times anyway). In practice, a new word at .4 isn't going to be among the 15 most interesting words to make the calculation from, anyway.
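The claim that a .4 word rarely makes the cut is easy to check in code. The sketch below follows the combining step as described in A Plan for Spam (take the 15 probabilities farthest from .5 and combine their products); the function name, the example table, and the assumption that every stored probability is already clamped strictly between 0 and 1 are mine.

```python
import math

UNKNOWN_WORD_PROB = 0.4   # Graham's revised value for never-seen words
MAX_INTERESTING = 15      # number of words used in the final calculation

def combined_spam_probability(message_words, word_probs):
    """Combine the most 'interesting' per-word spam probabilities."""
    # Unknown words get the default; each distinct word counts once.
    probs = [word_probs.get(w, UNKNOWN_WORD_PROB) for w in set(message_words)]
    # "Interesting" means far from the neutral 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    top = probs[:MAX_INTERESTING]
    spam = math.prod(top)
    ham = math.prod(1 - p for p in top)
    return spam / (spam + ham)

# Hypothetical probability table, invented for illustration.
table = {"viagra": 0.99, "free": 0.90, "meeting": 0.05}
print(combined_spam_probability(["free", "viagra", "zzyzx"], table))
```

An unknown word sits only 0.1 away from neutral, so it is used only when the message has fewer than 15 words more extreme than that, which matches the point made above.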
-XDG
not 100% - not good enough (Score:1)
Right. Try that one again after your non-100% effective filter starts filtering out business e-mails. Then where'll ya be? nowhere.
AI people have absolutely no common sense. It's been proven by my neural net.
Discussion in comp.lang.python (Score:1)
How long until we throw out the current e-mail sys (Score:2)
I own my own domain, which makes it easier, but we really need a system designed to filter. And make it easier. This is my uninformed proposal. Perhaps it won't work, but it seems something is needed.
People should have a private/public e-mail address. They should all go to the same "account" and be part of the basic plan for any e-mail user.
privateauthentication~myemail@myhost.com
I know this is important and relevant
publicauthentication~myemail@myhost.com
I gave this person my e-mail address
myemail@myhost.com will go into the crap bin and be deleted eventually. Perhaps some program could be used to alert users of possible important mail pieces there.
Then we could also have some system to CHANGE the private authentication or public authentication that is form based. I.e. This address has been disconnected. Please apply for the new password.
Dictionary spam? (Score:1)
It works well, but some spammers circumvent it (Score:1)
After a couple of weeks I've built up a big enough spambase that Graham's algorithm is pretty close to 100% effective (and no false positives at all).
However, I did run into one problem: Some particularly devious spammers are base64 encoding their email so that it can't be scanned by programs like this. (I can't think of any other reason why they're using base64 encoding for text/plain or text/html messages.)
After I added code to check the email header and decode the message body it worked much better.
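For reference, Python's standard `email` package can do this decoding. The sketch below is not the poster's code; the function name `decoded_text_parts` and the sample message are made up for illustration.

```python
import base64
import email
from email import policy

def decoded_text_parts(raw_message):
    """Yield the decoded text of every text/* part of a raw message."""
    msg = email.message_from_string(raw_message, policy=policy.default)
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            # get_content() undoes base64 / quoted-printable and
            # applies the charset declared in the part's headers.
            yield part.get_content()

# Hypothetical base64-encoded text/plain message.
payload = base64.b64encode(b"Make money fast").decode("ascii")
raw = (
    "From: spammer@example.com\n"
    "Subject: hi\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    + payload + "\n"
)
print(list(decoded_text_parts(raw)))
```

Walking every part also covers multipart messages, where the encoded text may sit beside an HTML alternative or an attachment.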
Jaguar (Score:2)
microsofts trademark (Score:3, Informative)
How is this working exactly? (Score:1)
How does Apple's mail spamfilter work? (Score:2)
A call for suggestions, and coders... (Score:2)
When the first Bayesian spam filtering article was posted, I thought it was a great idea, and this article just reinforces that idea. However, it would be interesting to build some sort of Sendmail module (or whatever MTA you like), but add some additional functionality:
1. Option to return a 550 error if the message is determined to be spam: "550 Delivery blocked; Bayesian filter reports spam probability of nn%"
- Right before reporting this error, wait n seconds or alternately, slow connection to n bps for n minutes.
- After reporting the error, "deliver" the Subject and Body of the email to the spam words database.
2. Inclusion of a whitelist, by IP, reverse DNS, MAIL FROM address, or RCPT TO address, header To: address, header From: address, etc.
3. Configuration of account where spams can be forwarded to, for automatic addition to the database.
- Perhaps this could be combined with the blacklist/whitelist. For example, any emails to spamthis@antispamdomain.com are always added to the DB. The entry could be as follows (similar to the Sendmail access map):
spamthis@antispamdomain.com <tab> BAYESIAN:SILENT
- This would allow for either silent addition to the filter (sender thinks mail was delivered -- good for spam harvesting emails, or for users to send their spam to), or a more "vocal" addition much like item #1 above, where a 550 error is reported... eg, BAYESIAN:550 or perhaps BAYESIAN:REJECT
I realize this would block a lot of mail, but I have my Sendmail currently configured to actually block spam (or what it considers spam) and have had very few issues with valid messages bouncing. Obviously, results may vary, but I'm a firm believer in rejecting spam during the SMTP conversation, not accepting it and then deleting it silently.
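To make the proposal a bit more concrete, here is a small Python sketch of two pieces: parsing the proposed map line and building the 550 reply. Note that `BAYESIAN:SILENT` and friends are the syntax proposed in this post, not an existing Sendmail feature, and the function names and the 0.9 threshold are assumptions for illustration. The tarpit delay and the MTA hook are not shown.

```python
def parse_access_line(line):
    """Split one 'address<TAB>BAYESIAN:ACTION' line into (address, action)."""
    address, rule = line.rstrip("\n").split("\t", 1)
    scheme, _, action = rule.strip().partition(":")
    # Only the actions named in the proposal are accepted.
    if scheme != "BAYESIAN" or action not in {"SILENT", "550", "REJECT"}:
        raise ValueError(f"unrecognized rule: {rule!r}")
    return address.strip(), action

def smtp_reply(spam_probability, threshold=0.9):
    """Return an SMTP reply line for a scored message."""
    if spam_probability >= threshold:
        pct = round(spam_probability * 100)
        return ("550 Delivery blocked; Bayesian filter reports "
                f"spam probability of {pct}%")
    return "250 OK"

print(parse_access_line("spamthis@antispamdomain.com\tBAYESIAN:SILENT"))
print(smtp_reply(0.97))
```

Rejecting inside the SMTP conversation means a legitimate sender's server generates the bounce, so a false positive is at least visible to the sender instead of vanishing silently.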
Does anyone else have any suggestions?
Already Patented by Microsoft... (Score:2)
patent 6,161,130 [uspto.gov]
How is this working exactly? (Score:1)
Bayesian filtering software (Score:2)
Paul's article lists a few of the bayesian spam filters, but here's a short list of the ones I've tried:
Gary Arnold's bayespam [garyarnold.com] is implemented in perl and geared towards qmail using maildir storage.
Brian Burton's spamprobe [sourceforge.net], written in C++, tries to remember already-seen messages, so that you can dump your spams/good mails on separate folders, have spamprobe learn from them, and delete them afterwards. Spamprobe remembers which ones it already processed, and won't reprocess a message if it's already seen it.
Eric Raymond's bogofilter is a typical ESR tool: concise, with a baroquely written man page, and quite simplistic, but does its job and does it well. ESR even uses some funny terms, like "spamicity", and "ham" (the opposite of spam). I don't like its dependency on the Judy libraries for dynamic arrays but what the heck.
Matthew Walker's BayesSpam [squirrelmail.org] plugin for Squirrelmail provides squirrelmail users with bayesian spam filtering capabilities, no longer restricting use of the technique to those with access to procmail/mailfilter systems.
Download it. (Score:2)
It's not from the same guy, but it's definitely derivative work.
Any linux-based POP "proxies" for this? (Score:1)
I have my popmail hosted by my ISP. I usually check my mail from my windows box. I'd like to configure my Linux box to periodically pull the POP3 mail from the server, spam-filter it, and then act as a "local" POP server that I'd just point my windows Eudora at.
Anyone have an easy (relatively speaking) means of doing this? Seems like each of the 3 parts (Getting mail from ISP, filter, and being a POP server) are trivial, but anything out there that would do all this or pieces that play well together?
I'm not keen on trying to deal with SMTP right now. My internet connection is a little too flaky for that...
Thanks for any ideas.
obligatory biology humor (Score:1)
Bayesian Mimicry
(Don't clap, just throw money.)
Method seems easily breakable. (Score:2)
What's stopping the spammers from attaching, say, a random scientific article longer than the spam itself to the end of the spam message? This would give the spam a good (non-spam) score under these Bayesian methods in general, but more so with his normalizing metric.
This is a reinvention... (Score:1)
The application to spam filtering is trivial. Simply take a document set (your inbox for a month), identify the spam set (manually) and the algorithm will generate term weightings for you.
Then apply these term weightings to previously unclassified records (emails) and BINGO!
BugBear
Mozilla (Score:1)
Andrea
filtering on qmail... (Score:1)
Anyone tried TMDA? (Score:2)
The best filter I've used (Score:1)
.. is Bayesspam 2.x [squirrelmail.org] for Squirrelmail [squirrelmail.org]. It's an easily installable plugin for a PHP-based webmail system that uses MySQL to store the Bayesian corpus. It's also got options to limit the size of the messages to be filtered, and displays the spam probability and the 'mark as spam/nonspam' links in each email header.
Re:What happened to IBM/Redhat Article (Score:1)
Re:But what about pr0n spam? (Score:1)
Re:But what about pr0n spam? (Score:1)
Re:spam is already keeping up? (Score:1)
Re:Why filter spam? (Score:2)
Hopefully to something large and hungry.
Re:best method .. is ..to .. (Score:1)
Re:Why is spam still a problem? (Score:1)
My position is that there is already an authorization step for virtually all senders of email - you need to get the recipient's address.
Most people are careful not to publish their address for obvious reasons. To get someone's address, you have to ask them in person somehow.
I use this system, and I haven't had any problems when I ask someone "by the way, what address will you be sending from? I need to add you to my list." Most don't ask why since most people don't ask for someone's address and refuse to give their own - it's stupid anyway, since pretty soon you'll see their from address in their message.
Anyone who doesn't want you to know their address before sending a message is probably malicious and you don't need their email anyway. Anyone who doesn't mind will gladly give it to you when they ask you for your email address.