It is a tough nut to crack unless you have access to complete mailboxes, for the following reasons:
- Any sort of AI/neural net/Bayesian net is only as good as the sample you train the system on. In most cases it is easy to accumulate spam mails (honeypots, etc.), but it is hard to get ham (good mail). No enterprise customer is going to donate their "good mails" for research purposes.
- Running any sort of optimized neural network on the customer's box (via some sort of toolbar, etc.) doesn't help, because that is the first thing they disable.
- People are more likely to delete a spam mail than report it. Without access to usage data from their mail client, the filter never learns from those deletions, so more spam leaks through.
- Spam is generally targeted regionally. Spam received by a person in the USA is very different from spam received by a person in China. This further restricts the accuracy of any single filter.
(None of these are problems for the likes of Google and Microsoft, who have access to all of this data.)
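To make the first point concrete, here is a minimal naive Bayes classifier sketch (the classic Bayesian approach mentioned above; the training messages are made up for illustration). With plenty of spam but almost no ham to train on, its ham-side probabilities rest on a handful of words, which is exactly the data-imbalance problem described:

```python
import math
from collections import Counter

def train(spam_msgs, ham_msgs):
    """Count word frequencies per class -- the 'training sample'."""
    spam_counts = Counter(w for m in spam_msgs for w in m.lower().split())
    ham_counts = Counter(w for m in ham_msgs for w in m.lower().split())
    return spam_counts, ham_counts

def spam_score(msg, spam_counts, ham_counts):
    """Log-odds that msg is spam, with add-one (Laplace) smoothing."""
    spam_total = sum(spam_counts.values())
    ham_total = sum(ham_counts.values())
    vocab = len(set(spam_counts) | set(ham_counts))
    score = 0.0
    for w in msg.lower().split():
        p_spam = (spam_counts[w] + 1) / (spam_total + vocab)
        p_ham = (ham_counts[w] + 1) / (ham_total + vocab)
        score += math.log(p_spam / p_ham)
    return score  # > 0 leans spam, < 0 leans ham

# Skewed training data: spam is easy to collect (honeypots),
# ham is scarce -- exactly the situation described above.
spam = ["buy cheap pills now", "cheap pills cheap offer"]
ham = ["meeting notes attached"]
sc, hc = train(spam, ham)
print(spam_score("cheap pills", sc, hc))      # positive: leans spam
print(spam_score("project meeting", sc, hc))  # negative: leans ham
```

With only one ham message, almost any unseen vocabulary tips the score toward spam, which is how false positives creep in when good mail is scarce.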
Which leaves only secondary methods of detection:
- Blacklists/pink lists/greylists: these are reactive rather than predictive, so some spam will always get through.
- Rule-based (regex/string matching): needs constant updating and a lot of multilingual people to keep the rules current. Not very scalable.
- Reliance on sender authentication via the likes of libspf, whose adoption is still not as widespread as we'd like.
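The rule-based approach can be sketched as below (the patterns are invented for illustration). The maintenance burden is visible immediately: every new campaign, obfuscation trick, or language means another hand-written rule:

```python
import re

# Hand-maintained rules -- each new spam campaign or language
# needs another entry (these patterns are made up for illustration).
SPAM_RULES = [
    re.compile(r"\bv[i1]agra\b", re.IGNORECASE),
    re.compile(r"100%\s+free", re.IGNORECASE),
    re.compile(r"\bnigerian?\s+prince\b", re.IGNORECASE),
]

def is_spam(message: str) -> bool:
    """Flag a message if any rule matches."""
    return any(rule.search(message) for rule in SPAM_RULES)

print(is_spam("Get V1agra 100% FREE today"))  # True
print(is_spam("Lunch at noon?"))              # False
```

Spammers only need one obfuscation the rules miss ("v-i-a-g-r-a", an image attachment, a new language), while the maintainers have to cover all of them.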
To my knowledge, most email spam engines can easily catch up to 95% of spam, maybe 99% on a good day, but the remaining 1-5% earns them the ire of their customers. In the end it is a labor-intensive job that is just not as rewarding as we'd like.