Tet's Journal: Holy crap, that's a lot of spam
The problem is that my main server was starting to spend a significant proportion of its time in disk wait state while the bogofilter processes tried to access the word database. The load average hit 100 last night, which was a clear sign that things had gone badly wrong and that I needed to do something about it. I don't know how Berkeley DB is implemented internally, but my guess would be that it's vaguely hash-table shaped, in which case the size of the database shouldn't significantly affect lookup times. But I don't know that for sure, and my database had grown to a non-trivial 1.5GB. Furthermore, bogoutil was failing to read the database file, possibly indicating that it was corrupt. That might have been causing each bogofilter pass to take much longer than necessary, and hence increasing contention for the database file on disk.
So at this point, I decided to scrap the word database and start again. I have a corpus of both ham and spam that I used for the initial training, which I knew wouldn't be perfect, but it was a good start, and I could train it further over time. Then I left it and went to bed. This morning, I realised just how much work bogofilter had been doing on my behalf. You see, in the 12 hours since I switched over to the new word database, I received 18,000 spam messages that bogofilter was no longer filtering out for me. I knew the spam problem was bad, but that's at least an order of magnitude worse than I expected.
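To put that number in perspective, here is a quick back-of-the-envelope calculation. The function name `spam_rate` is made up for illustration, and the daily figure assumes the overnight rate holds steady across a full day, which it may well not.

```python
def spam_rate(messages: int, hours: float) -> dict:
    """Break a message count over a time window into more intuitive rates."""
    per_hour = messages / hours
    return {
        "per_hour": per_hour,
        "per_minute": per_hour / 60,
        # Average gap between consecutive messages, in seconds.
        "seconds_between": 3600 / per_hour,
        # Extrapolation only: assumes the same rate for a full 24 hours.
        "per_day": per_hour * 24,
    }


rate = spam_rate(18_000, 12)
for name, value in rate.items():
    print(f"{name}: {value:g}")
```

That works out to 1,500 messages an hour, or one arriving roughly every 2.4 seconds.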
The next step is to train up bogofilter again so that it filters the majority of those out for me. Given the volume of mail that I'm getting, I think I'm going to need to use bogofilter to feed a Jef Poskanzer-style dynamically updated block list that I can pass to blackmilter. 18,000 messages (plus however many bogofilter blocked) in 12 hours? That's insane.
Expect to continue retraining... (Score:2)
I use... (Score:1)