
tepples's Journal: Bayesian Filtering: Is It Doomed?

Bayesian text classification is a statistical method of determining the probability that a message is in a given category. It works by making a database of how often each word occurs in messages from a corpus that are or aren't in the category, looking up this probability for each word in a given new message, and then using Bayes' theorem on the probabilities to predict how likely the message is to be in the category.
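
The procedure above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not how any particular filter is implemented: the class name `NaiveBayesFilter`, the whitespace tokenizer, the add-one smoothing, and the choice to score only the words present in a message are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real filters tokenize more carefully.
    return text.lower().split()


class NaiveBayesFilter:
    """Toy two-category word-presence classifier (hypothetical, for illustration)."""

    def __init__(self):
        # The "database": in how many messages of each category a word occurred.
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.message_counts[label] += 1
        # Count each distinct word once per message.
        self.word_counts[label].update(set(tokenize(text)))

    def _word_prob(self, word, label):
        # Add-one smoothing so an unseen word never yields a zero probability.
        return (self.word_counts[label][word] + 1) / (self.message_counts[label] + 2)

    def spam_probability(self, text):
        total = sum(self.message_counts.values())
        # Smoothed prior odds of spam versus not-spam.
        prior_spam = (self.message_counts["spam"] + 1) / (total + 2)
        prior_ham = (self.message_counts["ham"] + 1) / (total + 2)
        log_odds = math.log(prior_spam) - math.log(prior_ham)
        # Bayes' theorem with the "naive" assumption that words are independent:
        # add up the log likelihood ratio of every distinct word in the message.
        for word in set(tokenize(text)):
            log_odds += math.log(self._word_prob(word, "spam"))
            log_odds -= math.log(self._word_prob(word, "ham"))
        return 1.0 / (1.0 + math.exp(-log_odds))


f = NaiveBayesFilter()
f.train("cheap viagra mortgage", "spam")
f.train("rolex viagra offer", "spam")
f.train("meeting agenda attached", "ham")
f.train("lunch meeting tomorrow", "ham")
print(f.spam_probability("viagra offer"))      # well above 0.5
print(f.spam_probability("meeting tomorrow"))  # well below 0.5
```

Working in log-odds rather than multiplying raw probabilities keeps long messages from underflowing to zero.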

When applied to the corpus "e-mail" and the category "unsolicited bulk e-mail", the method is called Bayesian spam filtering. For example, words such as "Viagra", "mortgage", "Rolex", "Nigeria", and the like are likely to occur in spam, but some other words are more likely not to occur in spam. Many e-mail service providers applied Bayesian filtering to their customers' incoming e-mail and moved likely spam into a separate folder. This worked ... for a while.

After several months, spammers discovered ingenious techniques to defeat filters. First they disguised the operative words by "creatively" spelling them. Some spammers just misspelled key words: "Ciallis", "mortagee". Others randomly replaced letters with near-homoglyphs from l33tsp34k or from foreign languages: "Viagra" became "Wla9ra" or "\/ 1 A G R @", or "porno" might use a Greek omicron or Cyrillic o or replace the 'p' with the Greek rho or Cyrillic er, both of which look like a Latin 'p'. Anti-spam filters eventually began to check for such techniques and flag them specifically.
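
One way a filter might check for such substitutions is to fold look-alike characters back to Latin letters before looking words up. The sketch below is an assumption about how that could be done, not a description of any real filter; the table `HOMOGLYPHS` and the function `normalize_token` are hypothetical names, and the table covers only the few characters mentioned above plus some common l33tsp34k digits.

```python
# Hypothetical look-alike table: each key is a character spammers substitute,
# each value is the Latin letter it imitates.
HOMOGLYPHS = {
    "\u03bf": "o",  # Greek small omicron
    "\u043e": "o",  # Cyrillic small o
    "\u03c1": "p",  # Greek small rho
    "\u0440": "p",  # Cyrillic small er
    "0": "o",
    "1": "i",
    "3": "e",
    "4": "a",
    "@": "a",
}
_TABLE = str.maketrans(HOMOGLYPHS)


def normalize_token(token):
    # Lowercase first so that uppercase look-alikes are folded too.
    return token.lower().translate(_TABLE)


print(normalize_token("p\u043ern\u03bf"))  # porno
print(normalize_token("V1AGR@"))           # viagra
```

A table like this also mangles legitimate tokens that contain digits (a year such as "2010" becomes "2oio"), so a real filter would probably keep both the raw and the normalized form, and would treat mixed-script words as a spam signal in their own right.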

Later, spammers attacked the method by using innocuous words in e-mail in order to fool the filter into thinking that a message is not spam. First they used random sequences of letters. Filters blocked words with too many consonants for the target language. Then they used random dictionary words. Filters blocked too many long words in a row. Then they used sentences from literature, as seen in so-called Gutenberg spam and Hobbit spam. These techniques are intended to increase the spam probability of innocuous words, introducing noise into the database and causing the filter to misclassify messages.
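
The consonant check mentioned above could look something like the following. It is a rough sketch under stated assumptions: the function names are hypothetical, the threshold of five consecutive consonants is a guess at what is unusual in English rather than a figure any filter is known to use, 'y' is counted as a vowel, and only ASCII text is handled sensibly.

```python
VOWELS = set("aeiouy")  # Assumption: treat 'y' as a vowel.


def longest_consonant_run(word):
    # Length of the longest unbroken stretch of consonant letters in the word.
    longest = current = 0
    for ch in word.lower():
        if ch.isalpha() and ch not in VOWELS:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def looks_random(word, max_run=5):
    # Flag words whose consonant run exceeds the (assumed) limit for English.
    return longest_consonant_run(word) > max_run


print(looks_random("xkqzvbtr"))   # True
print(looks_random("strengths"))  # False
```

A heuristic like this only catches the random-letter stage of the arms race; random dictionary words and sentences lifted from literature pass it untouched, which is why each step required a new countermeasure.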

However, not all people have the same words marked as not-spam. For instance, people on a constructed language mailing list are more likely to have linguistic jargon marked as not-spam, while people on a video game mailing list may have video game terminology marked as not-spam. Thus, a spammer could collect addresses from a newsgroup, a public web board, a public mailing list, or the contact page of a public web site, and associate each address with words that appear on the same page as the address. How will Bayesian filters block this? Can it be blocked at all?


  • Back when I was customizing my SpamAssassin scripts (for some reason it stopped working at some point after my ISP did an upgrade; the account comes with shell access, and I read my email by SSHing into the ISP), I would just start blacklisting the domain name of any spam that got through the filter.

    Now, this kinda takes human intervention, so it is not automated, but a reasonably big ISP can afford to hire someone to go through spam sent to a honeypot and blacklist domain names that ar
    • I would just start blacklisting the domain name of any spam that got through the filter.

      So how would this prevent joe jobs, where spammers list an additional innocuous domain in an attempt to introduce noise into the filter? In a usual eBay/PayPal scam message, all the links except for one or two go to the actual eBay or PayPal site.

      • Well, I guess I would have someone on staff checking out the domain and seeing if they are selling what the spam is advertising.

        No, that wouldn't work because people are human and make mistakes, so there would be false positives, and as has happened before, innocent people would get blacklisted.

        OK, so email in general is screwed.

        Personally, I am tempted to just try to whitelist everything I want, and have a very aggressive spam filter for everything else.

        But then I will still have problems with other people not g
        • by Qzukk ( 229616 )
          Maybe we should just go back to the telephone and snail mail? :)

          What, and go back to getting calls in the middle of dinner again? ;)
          • by tepples ( 727027 )
            and go back to getting calls in the middle of dinner again? ;)

            We're talking about mail, not instant messaging. Eat dinner, ignore the phone, and pick up the voice mail once you've cleared the table.

  • I see all sorts of messages like this. Basically, the more they try to obfuscate it, the worse it is for them. When was the last time a legit email had the words "ci" and "al" in it? Not to mention headers.

    My filter isn't as accurate as I'd like it to be -- BogoFilter is definitely not as good as Dspam was, but Dspam was crashing and bouncing messages, so I had to switch. It's still more than good enough -- I can deal with 5-10 "unsure" messages, almost always spam, much more easily than I can 40-50 spa
  • My email has been publicly posted for some years now, and I get about 100 spam a day. I've been using SpamBayes for quite some time now. Only once in that time did a spam get put in "good", and never did a good get put in bad. Every now and then there is a small uptick in "uncertain" classification (maybe up to 10 a day out of 100), but after a week or two these disappear again as the filter gets trained. The added words, misspellings, etc. have not had a big impact on SpamBayes' ability to filter them ou
