Holy crap, that's a lot of spam

Journal by Tet on Thursday July 16, 2009 @08:05AM

Some while back, spam became a sufficiently large problem for me that I decided to do something about it. After looking around at the options, I decided to go with bogofilter. It's small, fast, and written in C. These are all good things. It looks at each message and builds up a database of words and the frequency with which they appear in spam and ham, and uses that to classify each incoming message. So far so good. It seems to work very well. When it's not sure, it says so: it files the message in an "unsure" folder and asks me to classify it manually. It also uses that classification to further learn the difference between ham and spam for future messages. This all sounds ideal.
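To make the word-frequency idea concrete, here is a minimal sketch of a filter that counts how often each word turns up in spam and in ham, scores a message from those counts, and answers "unsure" when the score falls between two cutoffs. It is a toy naive-Bayes-style illustration, not bogofilter's own algorithm; the class name `WordFilter`, the whitespace tokenizer, and the 0.9 and 0.1 cutoffs are all assumptions made up for the example.

```python
import math
from collections import Counter


class WordFilter:
    """Toy word-frequency spam filter (hypothetical, not bogofilter's code)."""

    def __init__(self, spam_cutoff=0.9, ham_cutoff=0.1):
        self.spam_words = Counter()  # word -> number of spam messages containing it
        self.ham_words = Counter()   # word -> number of ham messages containing it
        self.spam_msgs = 0
        self.ham_msgs = 0
        self.spam_cutoff = spam_cutoff
        self.ham_cutoff = ham_cutoff

    @staticmethod
    def tokens(text):
        # Assumed tokenizer: lowercase, split on whitespace, one count per message.
        return set(text.lower().split())

    def train(self, text, is_spam):
        if is_spam:
            self.spam_words.update(self.tokens(text))
            self.spam_msgs += 1
        else:
            self.ham_words.update(self.tokens(text))
            self.ham_msgs += 1

    def score(self, text):
        # Sum per-word log odds, with add-one smoothing so unseen words are neutral.
        log_odds = 0.0
        for word in self.tokens(text):
            p_spam = (self.spam_words[word] + 1) / (self.spam_msgs + 2)
            p_ham = (self.ham_words[word] + 1) / (self.ham_msgs + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp before exponentiating so long messages cannot overflow.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))

    def classify(self, text):
        s = self.score(text)
        if s >= self.spam_cutoff:
            return "spam"
        if s <= self.ham_cutoff:
            return "ham"
        return "unsure"  # left for manual classification, then fed back via train()
```

Training it on a handful of messages and then calling `classify()` on new text gives one of the three answers; a message made up entirely of words it has never seen scores 0.5 and lands in "unsure", which is the case where a manual decision gets fed back in through `train()`.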

The problem is that my main server was starting to spend a significant proportion of its time in disk wait state, while the bogofilter processes tried to access the word database. The load average hit 100 last night, which was an indication that things had gone badly wrong and I needed to do something about it. I don't know how Berkeley DB is implemented internally, but my guess would be that it's vaguely hash-table shaped, and so the size of the database shouldn't significantly affect lookup times. But I don't know that for sure, and my database had grown to a non-trivial 1.5GB in size. Furthermore, bogoutil was failing to read the database file, possibly indicating that it was corrupt, which might have been causing each bogofilter pass to take much longer than necessary, and hence increasing contention for the database file on disk.

So at this point, I decided to scrap the word database and start again. I have a corpus of both ham and spam that I used for the initial training, which I knew wouldn't be perfect, but it was a good start, and I could further train it up over time. Then I left it and went to bed. This morning, I realised just how much work bogofilter had been doing on my behalf. You see, in the 12 hours since I switched over to the new word database, I received 18,000 spam messages that bogofilter was now no longer filtering out for me. I knew the spam problem was bad, but that's at least an order of magnitude worse than I expected.

The next step is to train up bogofilter again so that it filters the majority of those out for me. Given the volume of mail that I'm getting, I think I'm going to need to use bogofilter to feed a Jef Poskanzer-style dynamically updated block list that I can pass to blackmilter. 18,000 messages (plus however many bogofilter blocked) in 12 hours? That's insane.
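A rough sketch of what feeding a block list from the filter could look like, assuming the connecting IP address of each message classified as spam has already been extracted somewhere upstream. The function name `build_blocklist`, the `min_hits` threshold, and the one-address-per-line output are all hypothetical; the format blackmilter actually expects would need checking against its documentation.

```python
from collections import Counter


def build_blocklist(spam_sources, min_hits=3):
    """Return the addresses seen sending spam at least min_hits times.

    spam_sources is an iterable of IP address strings, one per message
    that the filter classified as spam (extraction is assumed done elsewhere).
    """
    hits = Counter(spam_sources)
    return sorted(ip for ip, count in hits.items() if count >= min_hits)


def format_blocklist(addresses):
    # Assumed output format: one address per line, trailing newline.
    return "".join(ip + "\n" for ip in addresses)
```

Regenerating the list on a schedule from recent spam is what would make it "dynamically updated"; the threshold keeps a single misclassified message from getting a legitimate sender blocked.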
