Wrong. 1 false positive can be acceptable, and in fact is probably better than how things are now.
At USENIX '03 there was a paper presented on artificial intelligence techniques for spam detection. I can't provide a link since only USENIX members can download the paper (at this point, at least). I was a coauthor of that paper.
One of the things we've discovered in our research is that some classes of filters (most notably, the one I have been developing along with a few other individuals) are actually more effective at correctly classifying email than humans are. That is to say, you can train the learning algorithm on mostly-correctly-classified data, then re-run it over the training data, and almost miraculously, it discovers all kinds of email in the training set that was incorrectly classified.
I.e., this filter has discovered mail that I myself incorrectly thought was spam. It's scary, because there's a lot of it.
To assume that a human will always be 100% accurate at classifying their own email isn't just arrogant, it's plain wrong. Newer filters that will be introduced in the near future might possibly be more accurate than you, a frail human, could ever be.
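For anyone who wants to try the "re-run it over the training data" idea, here is a minimal sketch. It assumes a plain word-count naive Bayes classifier, which is a stand-in and not the clustering and neural network approach from the paper; the function names and the whitespace tokenizer are made up for illustration.

```python
import math
from collections import Counter

LABELS = ("spam", "ham")

def tokenize(text):
    # Assumption: lowercase whitespace tokens are good enough for a sketch.
    return text.lower().split()

def train(examples):
    """examples: list of (text, label) pairs, label is "spam" or "ham"."""
    word_counts = {label: Counter() for label in LABELS}
    doc_counts = Counter()
    for text, label in examples:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    return word_counts, doc_counts

def classify(text, word_counts, doc_counts):
    """Return the label with the higher naive Bayes log score."""
    vocab_size = len(set(word_counts["spam"]) | set(word_counts["ham"]))
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in LABELS:
        total_words = sum(word_counts[label].values())
        # Add-one smoothing on the prior and on every word likelihood.
        score = math.log((doc_counts[label] + 1) / (total_docs + len(LABELS)))
        for word in tokenize(text):
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + vocab_size)
            )
        scores[label] = score
    return max(scores, key=scores.get)

def find_suspect_labels(examples):
    """Train on the (possibly noisy) labels, then flag every training
    example whose predicted label disagrees with its given label."""
    word_counts, doc_counts = train(examples)
    return [
        (text, label)
        for text, label in examples
        if classify(text, word_counts, doc_counts) != label
    ]
```

The flagged items are only candidates for a second look by a human; a disagreement can just as easily be the classifier's mistake.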
"To assume that a human will always be 100% accurate at classifying their own email isn't just arrogant, it's plain wrong."
Yeah, OK. It's not like arrogance doesn't show up all over the place anyway.
I use human classification (I guess) of email to identify spam and it is 100% accurate. That's because I let those who are best at identifying spam do it: the spammers. I trap relay spam; only spammers attempt to send through open relays such as the one I fake. Presto: 100% accuracy, and no actual filter
Go ahead: do it, run one. I did for years. How open do you want the relay to be? I must admit mine was truly open only to local users for whom I added a filter rule to let them through (and one remote user, for a while) but I didn't really want a truly open relay, just one open locally. I could have used send-after-receive if I'd had the right MTA software but I didn't have that so I had to improvise.
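Roughly, the decision the fake relay makes could be sketched like this. It is only an illustration of the policy described above, not my actual setup: the network range, the domain, and the function names are all hypothetical, and a real MTA would express this as filter rules.

```python
import ipaddress

# Hypothetical values: the networks whose users may really relay,
# and the domains this server accepts mail for.
LOCAL_NETWORKS = [ipaddress.ip_network("192.168.0.0/16")]
LOCAL_DOMAINS = {"example.org"}

def is_local(client_ip):
    """True if the connecting client sits on one of the local networks."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in LOCAL_NETWORKS)

def handle_relay_attempt(client_ip, recipient_domain):
    """Decide what to do with a message offered to the server.

    "deliver": ordinary inbound mail for our own users
    "relay":   outbound mail from a trusted local client
    "trap":    a relay attempt from outside; accept it so the sender
               thinks it worked, but never forward it
    """
    if recipient_domain in LOCAL_DOMAINS:
        return "deliver"
    if is_local(client_ip):
        return "relay"
    return "trap"
```

Anything that lands in "trap" was, by construction, an attempt to relay through a stranger's server, which is why no content filter is needed to call it spam.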
There's two ways that could work: if you could reliably identify spam or i
No, you "get me" correctly. I've been consistently amazed at the reactionary attitudes people seem to display towards spam. Ranging from "false positives are unacceptable," to "open relays are evil," to "we need tough new laws to stop this," I think these people haven't thought through the issue as fully as they could. If you can mute the spammers in any way possible, consistently, then they will eventually wither away -- and there's no need for us to give up what I view as privileges: the ability to trans
"I think these people haven't thought through the issue as fully as they could."
Amen. It's a programmers' feast: pick an idea and start coding - before any real analysis.
Make that a bad programmers' feast. Come to a hard point to analyze? No problem: make up something plausible (to yourself, at least) and go right on working on what you want to do.
Heck, RFC 2505 said in plain language that securing open relays was not an effective anti-spam method. Did that matter? Humph. "They" proceeded to demon
Is this filter generic enough that it can be used on meta-search engines? Yes, I mean for non-spam-filter purposes.
I'm talking about so-called 'intelligent'/'smart' *cough* search engines. When you are really searching for some information you are already willing to spend more than half an hour on it, so why not teach the computer what you are searching for?
I get these chain emails from my brother. They are always some funky scheme to get money that won't work. I'd love to just delete them...but if I do this, he tells my mom I don't answer his email.
She then laces into me like you would not believe...blah blah blah he's your brother and you should love him. I don't need that grief...so instead I respond with a "not interested, no cash right now." Keeps the family happy.
I could see it being more important than this, though. Your boss sends you direct mail HE received and appends a "Should we do this" to the bottom. Or, worse, your marketing team constructs a direct mailing that fails your spam filter (no comments from the peanut gallery...obviously this is a good thing to find out, but this is not the way to find it out). Missing that one email could make somebody VERY angry and put you in danger. I have had messages from my boss/CEO/etc go into my junk folder and found them when cleaning it out.
It is correct for the spam engine to label these as spam email. It would be incorrect for it to delete them before they got to you. And so I subscribe to the school of thought that a single false positive makes any spam filter absolutely worthless. It is very easy to delete a message that gets through the filter. It is impossible to resurrect a mailing you never even knew you got.
The way it's currently implemented, the spam mail isn't deleted -- it simply drops into a "Spam" folder which can be perused at your leisure.
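In code terms the idea is just "tag and file, never delete". A minimal sketch, assuming Python's standard email library; the X-Spam-Flag header name and the folder names are my own choices for illustration, not necessarily what the filter itself uses:

```python
from email.message import EmailMessage

def file_message(msg, is_spam):
    """Return (folder_name, msg). Suspected spam is tagged and filed
    in a separate folder; nothing is ever thrown away here."""
    if is_spam:
        # Assumption: an X-Spam-Flag header marks the message so that
        # mail clients can also sort on it.
        msg["X-Spam-Flag"] = "YES"
        return "Spam", msg
    return "Inbox", msg

# Usage: a message the filter got wrong is still there to be found.
msg = EmailMessage()
msg["Subject"] = "Should we do this"
msg.set_content("Forwarded mailing from the boss.")
folder, tagged = file_message(msg, is_spam=True)
```

A false positive then costs you a look through the Spam folder rather than the message itself.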
After doing the math, I've come to absolutely trust the filter, even though it occasionally misses a legitimate mail. That's because my rate of classification error is actually higher than the filter's rate of error. The filter is actually better than me at doing it. The fact that it occasionally is wrong is irrelevant.
Yeah, well, good for your rate of classification error. My accuracy is 100%...I can always tell whether I'm reading what I consider to be spam or what I consider to be a real email.
Of course, my rate of classification error for what YOU consider to be spam to ME is different. And it's always going to be variable. You've discovered a filter which YOU consider to be good enough. However, as admins, we are hosts to our users. Not lords over them. Therefore, a broad use spam filter should only be as good as what o
Dunno. My ISP either drops the spam in another POP3 address, throws it away or changes the subject line.
Now once in a while I check my spam POP3 address for legitimate emails, say once a month. 99% of that spam is easily identified; once in a while there is indeed a mailing list message in it, though.
Easily remedied this way.
Warper
BTW I don't want mail from "unknown" IP addresses delivered one hour late. Especially not work-related mail.
Would you be willing to share a pre-print of your paper, or at least tell me the reference? My research focus is on e-mail classification and I try to keep up with what everyone's doing in the area.
If you'd rather talk privately, my e-mail address is public.
I found a copy of the final draft online: Learning Spam: Simple techniques for freely-available software. [pdx.edu] The paper covers several machine learning techniques. The particular one I'm talking about here is the information-theoretic clustering and neural network approach.
The draft linked above is, AFAIK, identical to the published paper. Usenix rules allow preprints by the authors on the authors' web site.
The bottom line is that e-mail gets lost: anyone who acts as if delivery is 100% reliable is in a dream world anyway. Spam filter false positives are just one more way for e-mail to get lost. As long as it happens very infrequently, the probability of the lost message being something important is low for most folks (certainly for me). To a naive first approximation:
1 false positive is not acceptable. (Score:3, Insightful)
"it practically eliminates the main problem of other solutions: the false-positive."
What does 'practically eliminates' mean? If it gives false positives at all, it is just as useless as all those 'other solutions'.
Re:1 false positive is not acceptable. (Score:5, Interesting)
Re:1 false positive is not acceptable. (Score:2)
Re:1 false positive is not acceptable. (Score:2)
But how's the filter supposed to know when I want to see hot, naked teens? ;)
Re:1 false positive is not acceptable. (Score:1)
(If classified with a header and sorted/filtered to different folders, obviously)
Re:1 false positive is not acceptable. (Score:2)
Re:1 false positive is not acceptable. (Score:2)
Finally we'll be able to have anonymous remailers again, without spammers abusing them.
Re:1 false positive is not acceptable. (Score:2)
Re:1 false positive is not acceptable. (Score:2)
Re:1 false positive is not acceptable. (Score:2)
Re:1 false positive is not acceptable. (Score:1)
Re:1 false positive is not acceptable. (Score:4, Insightful)
Re:1 false positive is not acceptable. (Score:2)
Re:1 false positive is not acceptable. (Score:2)
Re:1 false positive is not acceptable. (Score:1)
Reference for that paper (Score:2)
Re:Reference for that paper (Score:3, Informative)
Re:Reference for that paper (Score:2)
Re:1 false positive is not acceptable. (Score:1)