Spam, Polluting Bayesian Spamfilters and (Technological) Cen

Journal by acesuares on Saturday September 06, 2003 @12:28PM

Spam is a problem.

Back in the day, I didn't have any spamfilter. My belief was (and, actually, still is) that mail directed to a valid user is none of my business and that the end user needs to deal with it.

Soon enough, that changed, because so many viruses were being sent through mail, and so many MS Windows users were suffering from them. I installed virus scanners, and that worked fine for a while.

These virus scanners sent an email to the address the virus came from, telling the sender that their computer was infected and that their mail had not been delivered. At the same time, the intended recipient got a message that a virus coming from the sender's address had been blocked.

After a short period of time, virus writers changed their ways and started using forged email addresses as the sender's address. Too bad, because the warnings the virus scanners were sending out would either go nowhere and bounce, or end up in the mailbox of some unsuspecting end user, who might or might not be infected with the specific virus.

Thus, I disabled all warning messages, both those to the sender and those to the recipient. My users don't have a clue how many viruses are intercepted. And they also have no clue what exactly is considered a virus.

In fact, I am censoring the mail to my end users based on an unknown set of rules.

How much spam would a spamblocker block if a spamblocker could block nets?

I still didn't have any spamfilters installed at that time. And the amount of spam was still 'acceptable'. At least there was more wanted mail coming in than unwanted mail. Those were the days...

Anyway, I ran into a different problem: my server was blacklisted by some spamblock-servers. This wasn't my fault: users on my server were not spamming, and I didn't have an open relay. But my upstream provider had hosted a notorious spammer (or so it was said) and the spamblockers blocked a whole netblock. That netblock included my server and the servers of some other people who had nothing to do with spamming or spammers. Our only fault was that we had an IP address from a range that was considered 'tainted' by some people that I don't know and have no control over (but it wasn't the government this time).
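To make that collateral damage concrete, here is a minimal Python sketch of a netblock-based blacklist. The address ranges are reserved documentation ranges and the host names are made up for illustration; real blacklists are typically queried over DNS rather than kept in a local list like this.

```python
import ipaddress

# Hypothetical blacklist entry: a whole netblock listed because of one spammer.
BLACKLISTED_NETBLOCKS = [ipaddress.ip_network("192.0.2.0/24")]

# Hypothetical hosts; only the first one ever sent spam.
HOSTS = {
    "spammer.example": "192.0.2.10",
    "my-mailserver.example": "192.0.2.77",
    "unrelated.example": "198.51.100.5",
}

def is_blocked(ip):
    """Return True if the address falls inside any blacklisted netblock."""
    addr = ipaddress.ip_address(ip)
    return any(addr in block for block in BLACKLISTED_NETBLOCKS)

def blocked_hosts():
    """Names of all hosts whose address is covered by the blacklist."""
    return sorted(name for name, ip in HOSTS.items() if is_blocked(ip))

if __name__ == "__main__":
    print(blocked_hosts())
```

The innocent mail server is rejected along with the spammer simply because it shares the /24, which is exactly the situation described above.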

In a heated discussion amongst friends, I took a very strong stand against the way spamblock-servers work (no one agreed, though). The mere idea that some unknown entity can put my servers on a blacklist, and that as a result mail passing through my servers is not delivered to any of the hosts that use those blacklists, is to me an unacceptable form of censorship.

I do realize that spamblock maintainers don't force anyone to use their blacklists, but the fact is that ISPs do use them. And I understand that it does stop a lot of spam, and that some might consider an occasional mistake a small price for getting rid of tons of spam.

But to me, the system in itself is wrong in principle. It defies the idea of Internet mail, by giving third parties the authority to filter out mail with a valid destination address.

Murder Spam!

After a while, spam really got annoying and end users started complaining about it. I needed to do something. I had two choices: issue a press release to my clients telling them how I think Internet mail should work (and that it is the end user's responsibility to handle the mail that is sent to their address), or install a spamfilter. Unfortunately, I chose the latter.

I installed spamassassin, which at the time looked like a good choice. It can mark messages as 'possible spam', so the end user is still responsible for sorting spam from ham by applying filtering rules within their email software. No messages were blocked (except the viruses), and I could go to sleep with a clear conscience. Or so I thought.

But after more time had passed, it turned out that spamassassin lets through a fair amount of spam. Maybe I should update my anti-spam rules more often, but I haven't found an easy way to do that yet, and that might be entirely my fault. Nevertheless, spammers are getting more clever and write spam in such a way that a points-based boolean filter like spamassassin is helpless in some (and nowadays many) cases.
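A points-based filter of the kind described above can be sketched roughly as follows. The rules, scores, and threshold are invented for illustration and are not spamassassin's actual rule set; the point is only to show how a trivially reworded message slips under a fixed set of rules.

```python
import re

# Hypothetical rules as (pattern, points); not taken from any real rule set.
RULES = [
    (re.compile(r"\bviagra\b", re.IGNORECASE), 3.0),
    (re.compile(r"\bfree money\b", re.IGNORECASE), 2.5),
    (re.compile(r"click here", re.IGNORECASE), 1.5),
]
THRESHOLD = 5.0  # assumed cut-off; at or above this a message is tagged

def score(message):
    """Add up the points of every rule that matches the message."""
    return sum(points for pattern, points in RULES if pattern.search(message))

def tag(message):
    """Tag rather than drop, so the end user still decides what to do."""
    return "possible spam" if score(message) >= THRESHOLD else "ham"

obvious = "FREE MONEY and cheap Viagra, click here!"
obfuscated = "FR3E M0NEY and cheap V1agra, cl1ck h3re!"

if __name__ == "__main__":
    for message in (obvious, obfuscated):
        print(score(message), tag(message), "-", message)
```

Each rule either matches or it doesn't, so swapping a few letters for digits is enough to make every rule miss and leave the score at zero.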

At the same time, end users started complaining that every time they download mail, more than 50% of it is spam and only a few messages are real mail. Yes, they use the filters in their mail program, but they still need to download that mail, and the number of messages in their spam folder grows very fast.

I offered to drop the spam before it reaches the end user's mailbox, but did a small survey before putting that into effect. And to my surprise, some end users actually reported false positives: they had looked through their spam folder and found one or more messages that they really would have liked to read. Ham amongst the spam!

So, silently dropping all 'possible spam' would be another form of censorship.

Polluting Bayesian Spamfilters

What next? Lately, the use of Bayesian filters has gained popularity on the net. The theory behind these 'learning filters' looks promising and the results are excellent. And there is a lot of helpful software being written to utilize these new techniques.

I have looked briefly at two of these software packages, and they might be a very good help for the end user. These programs need to be installed on your computer, and proxy the POP3 connection. They 'check your mail before you do' and tag mail as spam, ham, or any other label you want to give it. They have a web interface to help you train the filter effectively. So far so good.

But as far as I can see, training the filter is not a very easy task for the average end user (and certainly not for the below-average end user). In the manuals of both packages, as well as in 'A Plan for Spam' and its sequel, the authors repeatedly state that the filter needs continuous training, and that misclassifying mail will definitely decrease the effectiveness of the filter.

Now, let's say a spammer starts sending out perfectly normal messages, without even a hint of selling goods or naked lunch. And let's say the spammer does that in 2 out of every 3 mailings. What will the Bayesian spamfilter learn from the average end user? That a lot of mail which looks quite common is being classified as spam.
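The effect can be illustrated with a deliberately simplified per-token estimate. This is a sketch under my own assumptions, not the exact formula from 'A Plan for Spam' or from any particular filter: it just compares how often a token appears in mail the user marked as spam with how often it appears in mail marked as ham. The class name and the sample messages are made up.

```python
from collections import Counter

class TokenStats:
    """Simplified per-token statistics; not the formula of any real filter."""

    def __init__(self):
        self.spam_tokens = Counter()
        self.ham_tokens = Counter()
        self.spam_msgs = 0
        self.ham_msgs = 0

    def train(self, message, is_spam):
        """Count each distinct token once per message, in the chosen class."""
        tokens = set(message.lower().split())
        if is_spam:
            self.spam_tokens.update(tokens)
            self.spam_msgs += 1
        else:
            self.ham_tokens.update(tokens)
            self.ham_msgs += 1

    def spam_probability(self, token):
        """Share of the token's per-class frequency that comes from spam."""
        spam_rate = self.spam_tokens[token] / max(self.spam_msgs, 1)
        ham_rate = self.ham_tokens[token] / max(self.ham_msgs, 1)
        if spam_rate + ham_rate == 0:
            return 0.5  # unseen token: no evidence either way
        return spam_rate / (spam_rate + ham_rate)

stats = TokenStats()
for _ in range(10):
    stats.train("meeting tomorrow about the project", is_spam=False)
    stats.train("cheap pills buy now", is_spam=True)
before = stats.spam_probability("meeting")

# The spammer now sends ordinary-looking mail in 2 out of every 3 mailings,
# and the user dutifully marks all of it as spam.
for _ in range(20):
    stats.train("meeting tomorrow about the project", is_spam=True)
for _ in range(10):
    stats.train("cheap pills buy now", is_spam=True)
after = stats.spam_probability("meeting")

if __name__ == "__main__":
    print("before:", before, "after:", after)
```

An everyday word like 'meeting' starts out as a clean ham indicator and drifts towards 'unsure' as the user's own training feeds ordinary vocabulary into the spam counts.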

I believe that in the newer, improved Bayesian spamfilters, where message headers are taken into account more seriously, this effect will be dampened, but I am 'afraid' that a year from now all the training that end users have done on their web interfaces will prove less and less valuable. And more and more ham and spam will be 'unclassified' or 'unsure'.

All that said, the good thing is that mail is not being censored by a third party, and the user is still responsible for handling their mail themselves.

Block or White?

Just before I installed spamassassin, I was pointed to another method of avoiding spam. With this software, a 'whitelist' is kept, and only emails whose sender address is on the whitelist are delivered. Mail that originates from an 'unknown' source is kept in a queue for a number of days. The software then sends a mail back to the sender with a confirmation code. If that mail bounces, the original mail is considered spam and deleted. If the sender does reply to the activation message, the mail is delivered and the address is added to the whitelist.
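The flow described above can be modelled in a few lines. This is a toy sketch with hypothetical names, not the software I was pointed to: it keeps everything in memory, leaves out the 'number of days' expiry, and does not actually send any mail.

```python
import secrets

class ChallengeResponseFilter:
    """Toy model of the whitelist scheme; names and structure are hypothetical."""

    def __init__(self, whitelist=()):
        self.whitelist = set(whitelist)
        self.pending = {}  # confirmation code -> (sender, message)
        self.inbox = []

    def receive(self, sender, message):
        """Deliver mail from known senders, queue everything else."""
        if sender in self.whitelist:
            self.inbox.append((sender, message))
            return None
        code = secrets.token_hex(8)
        self.pending[code] = (sender, message)
        return code  # in a real system this would be mailed to the sender

    def confirm(self, code):
        """The sender replied: deliver the queued mail, whitelist the address."""
        sender, message = self.pending.pop(code)
        self.whitelist.add(sender)
        self.inbox.append((sender, message))

    def bounce(self, code):
        """The challenge bounced: treat the queued mail as spam and delete it."""
        self.pending.pop(code, None)

mail = ChallengeResponseFilter(whitelist={"friend@example.org"})
mail.receive("friend@example.org", "lunch?")
code = mail.receive("stranger@example.net", "hello")
mail.confirm(code)
forged = mail.receive("forged@example.com", "buy now")
mail.bounce(forged)
```

Note that the whitelist lookup is an exact match on the sender address, and that a message whose sender never answers the challenge is never delivered.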

At that time I decided to go for spamassassin instead of the whitelist method. It seemed to me that there could be a number of reasons why the whitelisting software could throw away important mail. There are a zillion legitimate reasons why someone can send an email but cannot receive email (going on a vacation? computer crash? borrowed account?).

And what happens when you send a mail to 'user@company.com' but the reply comes from 'user@department.company.com', and the latter is not on your whitelist?

Last but not least, spammers might set up an email address that handles the activation messages. That address might not exist very long, due to the sheer number of activation messages that arrive at it or the action that the anti-spam community is going to take against such an address, but nevertheless a lot of spam will come through unharmed. And since people tend to whitelist some very well-known email addresses (if you are a client of provider.com, you might well have whitelisted support@provider.com) and mailing list addresses, spammers can forge those as the sender address, so valid, well-known and generic email addresses might become de facto unusable.

Whitelisting is censorship applied by the users themselves - automated whitelisting after replying to an activation message appears to me to be some sort of technological censorship.

Final Countdown

For now, there is no solution that keeps your mailbox free of spam and at the same time leaves Internet mail the way it was intended - a one-to-one or one-to-many or many-to-many communication channel without third-party or technological censorship.

Both manual whitelisting and content filters may help end users sift through their email much more quickly, but automated activation messages, dropping spam based on content, and blacklisting based on IP address are forms of censorship that I don't really feel the need or willingness to implement.

Ace. (2003 09 01)
