Comment Re:Some comments (Score 1) 197
(SORRY, REPOST FROM FURTHER ON, BUT I WANTED HENRY TO SEE IT)
Wow, I never thought this would generate this much discussion or attention! To everyone that's giving me good feedback (Henry Stern especially), thank you very much as it is appreciated... now for the detractors and script kiddies.
Ok, first off, I was not railing against existing spam filters, so you guys that took it that I was personally attacking your spam filter, get over yourselves. If you don't have a problem with spam, then congratulations, you get a freakin' cookie. Since most users do, I was trying to create a forum for discussion on new techniques and ideologies for addressing it, not lameass quotes like "this guy's an idiot". And most of those comments were because people didn't read the whole article or didn't understand it. I guess that's what happens when you mix script kiddies with democracy.
Next, if other people have tried and tested this method, great, I wasn't trying to steal their work. This is basically a guy, a garage, and a computer type of project, and I don't have access to research done in deep academic circles. If this overlaps with other work, my deepest apologies to those individuals.
Now, for you people that thought I was trying to say this was the next best thing to canned bread, give me a break! If you read the conclusion, I basically state that no one simple algorithm can adequately address spam (this one included!), that it would take a multi-pronged attack, and that we should start treating spam the same way we treat evolving organisms. That was the point of the article. For the algorithm, I was just trying to present a starting point, not an end.
Whew, now that I've got that off my chest, to the other problems with the article.
Henry, you were absolutely right, the training set does need to be randomly reshuffled at the start of every epoch. My fault on that one, guys.
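For anyone wondering what the per-epoch shuffle looks like in practice, here is a minimal sketch. The function name `shuffled_epochs` and the toy data are made up for illustration and are not the code from the article; the point is only that each epoch sees the same samples in a fresh random order.

```python
import random

def shuffled_epochs(samples, n_epochs, seed=None):
    """Yield the training samples in a fresh random order for each epoch."""
    rng = random.Random(seed)
    for _ in range(n_epochs):
        order = list(samples)  # copy, so the caller's list is left untouched
        rng.shuffle(order)     # new permutation every epoch
        yield order

# Hypothetical toy data: (feature_vector, label) pairs, 1 = spam, 0 = ham.
training_set = [([1.0, 0.0], 1), ([0.0, 1.0], 0), ([1.0, 1.0], 1), ([0.0, 0.0], 0)]

for epoch, batch in enumerate(shuffled_epochs(training_set, n_epochs=3, seed=42)):
    for features, label in batch:
        pass  # one backpropagation weight update per sample would go here
```

Passing a seed only makes the runs repeatable for testing; leave it as `None` for normal training.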
Also, I do apologize about the metaphors, I should have used the proper ANN terminology. I used the metaphors to reinforce my point about the similarities between this and biology.
Now, I am not quite sure why it takes so many training cycles to train the network in SpamAssassin, but since I'm using standard BP, it is quite fast, and does not require an extensive training set because of the generalization traits of an MLP.
I do agree about the ability of spammers to exploit the error inherent within the hidden layer of an MLP. But I would think it would be extremely difficult to do this, especially if multiple distributions of the trained MLP are created, each with a different structure within the hidden layer and with different initial weights. Then they would only be able to take advantage of certain traits for a small percentage of filters. Also, I would argue that training should also happen on the user's PC, as a background task, say weekly. This allows the network to adapt to new types of spam, and when initialized with random weights, it should make the error within the hidden layer random among different installations, effectively killing that exploit, as each machine will converge differently.
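To make the "different structure, different initial weights per installation" idea concrete, here is a rough sketch using only the Python standard library. The names `init_mlp` and `forward`, the weight range, and the layer sizes are all assumptions for illustration, not the article's implementation, and the networks here are untrained. Whether this actually defeats the exploit is still my guess, not something the code proves.

```python
import math
import random

def init_mlp(n_inputs, n_hidden, seed=None):
    """Build a one-hidden-layer MLP with small random initial weights."""
    rng = random.Random(seed)  # seed=None draws from OS entropy, so installs differ
    def small():
        return rng.uniform(-0.5, 0.5)
    return {
        # one row of weights per hidden unit, +1 column for the bias input
        "hidden": [[small() for _ in range(n_inputs + 1)] for _ in range(n_hidden)],
        # one weight per hidden unit, +1 for the output bias
        "output": [small() for _ in range(n_hidden + 1)],
    }

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(net, features):
    """Return the network's spam score in (0, 1) for one feature vector."""
    inputs = list(features) + [1.0]  # append bias input
    hidden = [sigmoid(sum(w * x for w, x in zip(row, inputs))) for row in net["hidden"]]
    hidden.append(1.0)               # append bias for the output unit
    return sigmoid(sum(w * h for w, h in zip(net["output"], hidden)))

# Two hypothetical installations: different hidden-layer sizes and different seeds.
install_a = init_mlp(n_inputs=3, n_hidden=4, seed=1)
install_b = init_mlp(n_inputs=3, n_hidden=6, seed=2)

message = [1.0, 0.0, 1.0]  # made-up feature vector
print(forward(install_a, message), forward(install_b, message))
```

The fixed seeds are only there so the example is repeatable. Each installation would then be trained locally from its own starting point, which is where the differing convergence would come from.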
Now, regarding spam messages adapting, I am sure there are many more exploits they can employ, but as of this point, they are running out of options, at least "structural" options. If we can work together to identify the exploitable structures, then we should be able to design a filter that can adapt to the different features within these structures.
That's all I have, thanks again to the folks who gave me constructive feedback on the article. To those whose remarks came out of complete ignorance, **** off.
Thanks, Slashdot!
Shawn Evans