tepples's Journal: Bayesian Filtering: Is It Doomed?
Bayesian text classification is a statistical method of determining the probability that a message is in a given category. It works by making a database of how often each word occurs in messages from a corpus that are or aren't in the category, looking up this probability for each word in a given new message, and then using Bayes' theorem on the probabilities to predict how likely the message is to be in the category.
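To make the description above concrete, here is a minimal sketch of a naive Bayes classifier in Python. It is not the code of any particular filter: the function names `train` and `spam_probability`, the whitespace tokenizer, and the add-one smoothing are all assumptions made for illustration, and it treats words as independent and looks only at words present in the message.

```python
import math
from collections import Counter

def train(messages):
    """messages: iterable of (text, is_spam) pairs.
    Returns per-category word counts and per-category message counts."""
    word_counts = {True: Counter(), False: Counter()}
    msg_counts = {True: 0, False: 0}
    for text, is_spam in messages:
        msg_counts[is_spam] += 1
        # Count each distinct word once per message.
        word_counts[is_spam].update(set(text.lower().split()))
    return word_counts, msg_counts

def spam_probability(text, word_counts, msg_counts):
    """Estimate P(spam | words in text) with Bayes' theorem.
    Assumes the corpus held at least one spam and one non-spam message."""
    total = msg_counts[True] + msg_counts[False]
    # Start from the prior odds of spam versus not-spam, in log space.
    log_odds = math.log(msg_counts[True] / total) - math.log(msg_counts[False] / total)
    for word in set(text.lower().split()):
        # Add-one smoothing (an assumption) so unseen words never give zero.
        p_word_spam = (word_counts[True][word] + 1) / (msg_counts[True] + 2)
        p_word_ham = (word_counts[False][word] + 1) / (msg_counts[False] + 2)
        log_odds += math.log(p_word_spam) - math.log(p_word_ham)
    # Convert log-odds back to a probability.
    return 1 / (1 + math.exp(-log_odds))
```

Trained on a toy corpus of two spam and two non-spam messages, a message containing "viagra" scores well above 0.5 and one containing "monday" scores well below it. A real filter would need a far larger corpus and a better tokenizer.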
When applied to the corpus "e-mail" and the category "unsolicited bulk e-mail", the method is called Bayesian spam filtering. For example, words such as "Viagra", "mortgage", "Rolex", "Nigeria", and the like are likely to occur in spam, but some other words are more likely not to occur in spam. Many e-mail service providers applied Bayesian filtering to their customers' incoming e-mail and moved likely spam into a separate folder. This worked
After several months, spammers discovered ingenious techniques to defeat filters. First they disguised the operative words by "creatively" spelling them. Some spammers just misspelled key words: "Ciallis", "mortagee". Others randomly replaced letters with near-homoglyphs from l33tsp34k or from foreign languages: "Viagra" became "Wla9ra" or "\/ 1 A G R @", or "porno" might use a Greek omicron or Cyrillic o or replace the 'p' with the Greek rho or Cyrillic er, both of which look like a Latin 'p'. Anti-spam filters eventually began to check for such techniques and flag them specifically.
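One way a filter might check for the homoglyph trick is sketched below. This is an assumption about how such a check could look, not a description of any real filter: the `CONFUSABLES` table is a hypothetical and deliberately tiny mapping, and the script test simply takes the first word of each letter's Unicode character name.

```python
import unicodedata

# Hypothetical, deliberately tiny table of look-alikes mapped to Latin letters.
CONFUSABLES = {
    "\u03bf": "o",  # Greek small letter omicron
    "\u043e": "o",  # Cyrillic small letter o
    "\u03c1": "p",  # Greek small letter rho
    "\u0440": "p",  # Cyrillic small letter er
    "0": "o", "1": "i", "3": "e", "4": "a", "9": "g", "@": "a",
}

def normalize(word):
    """Lowercase the word and fold known look-alikes to Latin letters."""
    return "".join(CONFUSABLES.get(ch, ch) for ch in word.lower())

def scripts(word):
    """Return the set of scripts used by the letters in word, taken from
    the first word of each Unicode character name (e.g. 'LATIN', 'GREEK')."""
    return {unicodedata.name(ch, "UNKNOWN").split()[0] for ch in word if ch.isalpha()}

def looks_disguised(word):
    """Flag a single word whose letters come from more than one script."""
    return len(scripts(word)) > 1
```

With this table, `normalize("V1AGR@")` gives `"viagra"`, and a "porno" spelled with a Cyrillic er in place of the 'p' is both folded back to Latin and flagged as mixed-script. The digit mappings would also mangle legitimate numbers, so a real filter would need something more careful.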
Later, spammers attacked the method by using innocuous words in e-mail in order to fool the filter into thinking that a message is not spam. First they used random sequences of letters. Filters blocked words with too many consonants for the target language. Then they used random dictionary words. Filters blocked too many long words in a row. Then they used sentences from literature, as seen in so-called Gutenberg spam and Hobbit spam. These techniques are intended to increase the spam probability of innocuous words, introducing noise into the database and causing the filter to misclassify messages.
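The first two countermeasures mentioned above (too many consonants, too many long words in a row) could be approximated with heuristics like these. This is only a sketch under stated assumptions: the vowel set is a rough guess for English, and the `min_len=9` cutoff is a made-up value, not one taken from any actual filter.

```python
VOWELS = set("aeiouy")  # rough assumption for English

def consonant_ratio(word):
    """Fraction of the letters in word that are not vowels."""
    letters = [ch for ch in word.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch not in VOWELS for ch in letters) / len(letters)

def longest_run_of_long_words(text, min_len=9):
    """Length of the longest run of consecutive words that are each
    at least min_len characters long (min_len is a hypothetical cutoff)."""
    longest = current = 0
    for word in text.split():
        if len(word) >= min_len:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest
```

A random string such as "xkqzvbt" has a consonant ratio of 1.0, while "banana" comes out at 0.5. Neither heuristic helps against sentences lifted from literature, which is exactly why spammers moved on to them.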
However, not all people have the same words marked as not-spam. For instance, people on a constructed language mailing list are more likely to have linguistic jargon marked as not-spam, while people on a video game mailing list may have video game terminology marked as not-spam. Thus, a spammer could collect addresses from a newsgroup, a public web board, a public mailing list, or the contact page of a public web site, and associate each address with words that appear on the same page as the address. How will Bayesian filters block this? Can it be blocked at all?
Blocking domain names in URLs (Score:2)
Now, this kinda takes human intervention, so it is not automated, but a reasonably big ISP can afford to hire someone to go through spam sent to a honeypot and blacklist domain names that ar
Joe-jobbing domain names in URLs (Score:2)
So how would this prevent joe jobs [wikipedia.org], where spammers list an additional innocuous domain (such as joes.com) in an attempt to introduce noise into the filter? In a usual eBay/PayPal scam message, all the links except for one or two go to the actual eBay or PayPal site.
Email is Doomed! (Score:2)
No, that wouldn't work, because people are human and make mistakes, so there would be false positives, and, as has happened before, innocent people would get blacklisted.
OK, so email in general is screwed.
Personally, I am tempted to just try to whitelist everything I want, and have a very aggressive spam filter for everything else.
But then I will still have problems with other people not g
Re: (Score:2)
What, and go back to getting calls in the middle of dinner again?
Re: (Score:2)
We're talking about mail, not instant messaging. Eat dinner, ignore the phone, and pick up the voice mail once you've cleared the table.
Works for me. (Score:2)
My filter isn't as accurate as I'd like it to be -- BogoFilter is definitely not as good as Dspam was, but Dspam was crashing and bouncing messages, so I had to switch. It's still more than good enough -- I can deal with 5-10 "unsure" messages, almost always spam, much more easily than I can 40-50 spa
SpamBayes (Score:2)