Distributed Checksum Clearinghouse vs Spam
AllSpammedOut writes: "Spam could be more easily detected if everyone were
to compare the mail messages they received. Using the Distributed
Checksum Clearinghouse, MTAs can report the checksums for all messages
they receive and be notified when a checksum has already been reported by many other systems." Obviously there are issues with something like this (especially mailing lists, and worms that do attachments). I suspect spammers would just include a counter to break checksums, though.
at least hackers are smart (Score:2)
Re:Worms? (Score:2)
Relevant but somewhat off-topic question (Score:4)
Re:What's the big deal? (Score:2)
Hello "Don't Spam JeffSketch's hotmail address", what's that address? JeffSketch@hot... hmmm something.com... JefSkatch@hotmail.com? no... that's not it. I wonder why it would be so dangerous to post an email address on a web forum.
Maybe I should forward you the contents of my Hotmail account. It is up to 540 pieces of filtered spam. Only about 50% of my spam gets successfully blocked. This renders my occasional-use Hotmail account nearly useless.
But wait, that's a free account. I guess that means that nobody is paying for it. Neither in my time nor Microsoft's money.
Alas dear troll, if indeed you were not afraid of spam you would not be hiding your email address at all.
Say, I Recognize This (Score:2)
;)
No, I did propose something along these lines on Advogato back in February in a piece entitled "Realtime Worm Filtering System [advogato.org]," but I'm not accusing the author of ripping off my blatantly obvious and not-uncommon idea. That system is intended to stop worms, obviously, and not spam. Worms tend to be easier to stop because they're seldom wholly polymorphic, often retaining enough similarities that collaborative filtering is quite feasible.
-Waldo
Re:Checksums? (Score:2)
And, if so, with cheap storage, why not store the whole SPAM? In case of a high number of checksum matches, a final precise double-check could be made.
--
Re:"Pretty close" checksums? (Score:2)
Here is my method: http://slashdot.org/comments.pl?sid=01/07/30/1444247&cid=108 [slashdot.org]
--
Re:What's the big deal? (Score:2)
Re:What's the big deal? (Score:2)
Here in Norway for example, which is probably about representative, about half the people dial into the Internet with modems, or by ISDN. Flat rate on telephone-calls is uncommon, the vast majority of that half pay about $1 an hour for the connection to the net. That works out as $50 a GB for those on ISDN, and $66 a GB for those on modem.
Even this estimate still assumes that the link is perfectly full, that is, that a person with an ISDN connection downloads email at a rate of 64kbps, which isn't necessarily true (although it should be close for your ISP's local mailserver).
Re:Relevant but somewhat off-topic question (Score:2)
Also, in countries like China, which are currently booming in regard to new businesses going online, there is a very common usage of pirated copies of older versions of Microsoft Exchange which did not have the capability to stop spam, or have it disabled by default. Not being licensed copies, they don't get the latest patches. And they usually don't even have a sysadmin, or if they do, it's one who is incompetent or one who can't read English. Unfortunately, most of the help to close relays is primarily in English. This is bad as English is not really so universal as Yanks and Brits might like to think. Translations to all languages are needed.
Spammers cost money to those who get spammed. Pushing the cost back to spammers and the ISP who (perhaps through inept management) support them, is one way to stop them. Laws will not since this is an international thing.
Re:What's the big deal? (Score:2)
Paper spam has never been as significant a problem as electronic spam, because the sender pays most of the costs for paper spam whereas the receiver pays most of the costs for electronic spam. There is an economic throttle for the sender of paper spam. If we allow electronic spam to simply continue, it will scale up as most businesses would then perceive it to be legitimate. You'd end up having to delete thousands and tens of thousands per day. It would keep growing if there is the perception that it is legitimate and that it cost you nothing to delete.
Electronic spam does cost the receiver time and money. This includes the receiver's ISP. If you are on a dialup line (as most people still are because of the DSL debacle) the spam takes up more time on your mail downloads. As the problem grows it takes more time.
To sum it up, it might not appear to be that much of a problem for you at this moment, but if you scale it up to where it would be if no effort was made to stop it, you would not be able to handle the load. Some of us do understand the scaling issue. If every business in the world sent you ONE message PER YEAR, and somehow this were just evenly spread out in time, you would be deleting this crap every 2 to 3 seconds, 24 hours a day, 7 days a week, all year long. The scale of the internet is simply not suited for spam.
If you really have to get back to work, what do you do? Do you send spam all day, or do you delete it? Or do you just not get much of it?
Re:Relevant but somewhat off-topic question (Score:3)
A network of authenticated mail servers could be very useful. But the effectiveness would be limited unless entry to the network requires agreement to terms to apply strong enforcement against spam, such as:
Re:Just use mail filters (Score:3)
Show me one that works on my mail server without overloading it. Mail comes in at a rate of about 20 per second. It will need to check it all. If you think the problem is solved at the client, you misunderstand the problem.
Re:Spam Hunters (Score:2)
Lorenzo Lamas stars in e-Renegade!
Reno Raines is back! After being forced at gunpoint to break RSA's strongest encryption while getting a blow-job, Reno is wanted by the Financial Businessmen Incorporated, the FBI, for violation of the DMCA! On the run from bought-and-paid-for law enforcement, Reno has changed his identity and now works for his Native American friend, Robbie Spamkiller.
Chasing down unlicensed spammers, Reno searches for the evidence that will clear his name, bring justice to those who "blew" his career and reputation, and let him marry Robbie's sister, Cheyenne "Shy" Phillipshead.
--
Phone #==goatse.cx (Score:2)
This is Customer Service for Comcast Cable in Indianapolis. I would guess it's as close as you can come on the phone.
--
The way I heard it (Score:2)
Who knows if this actually happened.. It's really too bad that AI professors can't get their own material. I'm sure EVERY compsci student who took a software engineering class heard the anecdote about the computer-controlled radiation/x-ray machine, that killed a patient by giving them like 10,000 times the normal dose. This error was traced to a lack of bounds checking in software.
--
Re:Just Because they would counter it. (Score:2)
A lot of spam I get already has a unique identifying ID included in it. I assume this is to track valid e-mail addresses of people stupid enough to try to be "removed" from their lists.
--
Re:Checksums? (Score:2)
Well, with a CRC I guess a slightly changed message will only have a slightly different checksum. But there is a good chance that 2 dissimilar messages will have the same sum. You'd need something like a large md5 sum to make sure your false positives are low. But the problem with md5 is that just changing 1 byte greatly affects the sum. So there would be no fuzzy matching.
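To make the md5 point concrete, here is a minimal Python sketch. The message text and the `id=` counter format are invented for illustration; the only real API used is the standard `hashlib` module. It hashes two bodies that differ only in a trailing counter and counts how many of the 128 digest bits differ.

```python
import hashlib

def md5_bits(text: str) -> int:
    """Return the MD5 digest of text as a 128-bit integer."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest(), "big")

def differing_bits(a: str, b: str) -> int:
    """Count the digest bits that differ between two messages."""
    return bin(md5_bits(a) ^ md5_bits(b)).count("1")

# Hypothetical spam body; only the trailing counter changes.
msg1 = "Make money fast! Reply now. id=0001"
msg2 = "Make money fast! Reply now. id=0002"

print(differing_bits(msg1, msg2), "of 128 digest bits differ")
```

The two digests are unrelated even though the bodies are nearly identical, so an exact-match lookup on the raw md5 sees them as two different messages.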
--
Re:I can't see this working (Score:2)
I can't see this working (Score:5)
A more robust method of spam detection, IMHO, would be to develop an algorithm that would take emails and encode them in a way that they could be input to a neural network. The output of the network would be 0=not spam/1=spam.
If anyone with some machine learning experience wants to work on a project like this with me, send me an email!
Re:Relevant but somewhat off-topic question (Score:2)
This can do a bounce back on spam saying that your user doesn't exist. This is for linux, I couldn't find any windows applications that could do this.
Checksums? (Score:4)
This sounds like a terrible plan. As mentioned, a simple counter would blow this thing out immediately.
However, a number that represented how closely related an incoming email and a known spam message are would be a useful metric. Then you could have fuzzy filters that determined how close you would want to be before outright rejecting a similar message, or maybe just relocating it to a separate inbox.
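One way such a closeness number could be sketched in Python is with the standard library's `difflib.SequenceMatcher`. This is only an illustration of the idea, not anything DCC does; the function names and the 0.9 / 0.6 thresholds are assumptions picked for the example.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a 0.0-1.0 similarity ratio between two message bodies."""
    return SequenceMatcher(None, a, b).ratio()

def classify(incoming: str, known_spam: list[str],
             reject_at: float = 0.9, quarantine_at: float = 0.6) -> str:
    """Compare a message with known spam and pick an action (thresholds are guesses)."""
    best = max((similarity(incoming, s) for s in known_spam), default=0.0)
    if best >= reject_at:
        return "reject"
    if best >= quarantine_at:
        return "quarantine"   # e.g. file it in a separate inbox
    return "deliver"

known = ["Make money fast! Reply now. id=0001"]
print(classify("Make money fast! Reply now. id=0002", known))
print(classify("Lunch on Thursday?", known))
```

Comparing every incoming message against every stored spam body is quadratic work, so on a busy server this would need some cheaper pre-filter in front of it.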
Comment removed (Score:3)
Re:I can't see this working (Score:2)
You'd have to also ensure that ISPs are clued up enough to turn off this feature for abuse@ mailboxes, for obvious reasons!
--
Re:Checksums? (Score:2)
You'd effectively be forcing the spammer to send every email, i.e. they could no longer rely on simply feeding a relay machine a string of RCPT TO commands.
Thus spamming becomes far more difficult.
Re:Checksums? (Score:2)
An alternative would be to send everything with public key encryption. Though you'd need to devise a DNS like mechanism for distributing public keys. (You also want to cut out as much relaying as possible, since a third party relay will never have access to the private key.)
Re:Relevant but somewhat off-topic question (Score:2)
A certain set of software requires a third party relay to work at all. It's quite possible for those setting up such relays to create an open relaying situation (especially with complex networks.)
Re:Relevant but somewhat off-topic question (Score:2)
How often was SMTP over UUCP (and the like) used anyway
That changed once TCP/IP became the norm, but relays were still necessary for the transition phase.
MX records came into existence in the late 1980s...
Even today, there are still people whose mailboxes aren't connected 24/7 that require a relay service, though they are definitely a minority.
What they actually need is one or more (off site) secondary MX records.
Which is totally transparent to any MTA which follows the spec.
A depressing number of sites require that email come from the "correct" IP address (your From: address must have the same MX record as your IP address) which means your ISP must maintain a relay for your use, though it doesn't have to be an "open" one.
This is mixing up two things. The first is something like the DUL which requires use of an ISP provided third party relay. The second is ISP provided relays having restrictions on what they will relay based on the MAIL FROM: command.
The actual major reason ISPs provide third party relays is that software such as Netscape Communicator and Outlook Express simply won't work without one.
With most ISPs, it's easy to bypass relays and send email directly to port 25 on the target machine, so blocking open relays wouldn't help much, it would just push the problem back one step.
Actually it helps a lot. A problem with all relays is that they can be used in the mode of send one message and a list of recipients and the relay machine will do the work of sending out N copies. Remove all relays and the spammer has to actually send every message themselves.
Re:Relevant but somewhat off-topic question (Score:2)
Requires a rather complex algorithm to work this out. Also it would cause problems with machines on a dialup running proper MTAs attempting to process their mail queue on connection.
A simpler method would be to start dropping packets at random if all (or more than a certain portion) of the traffic from an IP address consists of outgoing TCP connections to port 25.
The only thing which needs examining is IP and TCP headers.
Re:Worms? (Score:2)
This also makes spamming "hit and run". By the time the spam starts arriving the spammer has gone.
Most spammers don't have the bandwidth that it takes to send each user a personalized message, because they are almost always on a throwaway dialup.
They also need processing power to do the personalisation, software which understands the full SMTP spec (rather than that required to get by sending to a relay) and can handle identd requests.
Only the professionals can afford to send unique messages, because they often have a DSL line and a pink contract with their ISP (which permits them to continue spamming).
They also need a frequently changing IP address...
Re:Countermeasures (Score:2)
Forcing spammers to customise each email would make spamming considerably more expensive. Because they then have to actually send each email, rather than being able to use third party relay machines to duplicate their junk.
Re:bulk-mail should be refused by default (Score:2)
Problem is that ISP provided third party relays render this method useless...
Re:Relevant but somewhat off-topic question (Score:2)
Re:Relevant but somewhat off-topic question (Score:3)
Unfortunately, there are always systems where the sysadmin hasn't updated for years, because it's not causing him any problems.
Re:hmm (Score:2)
Re:Issues... (Score:4)
Because everybody knows that Orange rinds offer better memory density than banana peels. And orange peels are more resistant to the excess steam from the CPU. Banana peels would just disintegrate with even a minimal amount of overclocking.
Re:False Positives (Score:2)
While the system could be broken by using counters, this could be countered by parsing only certain portions of the mail or counting the frequency of certain words. Would work very well on pure text spam, but not on attachment stuff.
Actually, that technique works reasonably well.
I used to administer the trouble ticket system for a very large ISP that got so many complaints that they became unmanageable. (Not all their fault, but that's another story.) Anyway, we had software that would take the bodies of the emails being complained about, remove whitespace and anything that wasn't in the dictionary, sort it, uniq it and generate an MD5 of the list of words that came out. I never studied it over the long haul, but tests on live data showed a match rate of about 90%.
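A rough Python sketch of that normalise-then-hash pipeline follows. It is a reconstruction from the description above, not the ISP's actual software: the tiny `DICTIONARY` set stands in for a real word list, and the function name and tokenising regex are assumptions.

```python
import hashlib
import re

# Stand-in for a real dictionary file; an assumption for this sketch.
DICTIONARY = {"make", "money", "fast", "reply", "now", "free", "offer"}

def normalised_digest(body: str, dictionary: set[str] = DICTIONARY) -> str:
    """MD5 of the sorted, de-duplicated dictionary words found in a message body."""
    words = re.findall(r"[a-z]+", body.lower())           # drop whitespace, digits, punctuation
    kept = sorted({w for w in words if w in dictionary})  # "sort it, uniq it"
    return hashlib.md5(" ".join(kept).encode("ascii")).hexdigest()

a = normalised_digest("Make money FAST!! reply now xq7zk")
b = normalised_digest("reply now: make   money fast id=0042")
print(a == b)
```

Because counters, random strings and reordered words all disappear before hashing, two variants of the same pitch end up with the same digest, which is what lets an exact-match MD5 lookup still work.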
The real flaw in DCC is that it doesn't protect early recipients of the spam, because it won't have built up enough hits to be considered bulky. The only way to make it work would be to submit the checksum and hold the letter for some amount of time to see how bulky it gets. Most people would probably not like the lag time they'd get on legitimate mail.
Re:Cell phones are great (Score:2)
Forget comparing spam! How about universal naming? (Score:2)
A database of image characteristics (like those used by D'peg! [somewareonthe.net]) would make this less likely. People would be discouraged from changing the file's originally agreed-upon universal name.
Publishers could upload their image characteristics into the database, along with a tag like "Originally from somepornsite.com". So if I someday come across an image I really like, I could check the database and see where to get the rest of the series. This would supersede obnoxious watermarking to indicate the source of an image.
This could of course be used for mp3's too, which are all-too-often renamed incorrectly. Checksums would be enough for a particular song encoded by a particular encoder with particular parameters, but audio fingerprinting would be necessary to accommodate different encoders. I don't think that's a deal-killer.
By the way, D'peg! is really neat, but it's amazingly slow the first time if you have a lot of images. (As in: My win98 uptime record is 11 days. Dpeg's projected completion time was 34. Good thing it can resume after a crash.)
Re:Just use mail filters (Score:2)
Plus, mail filters have the benefit of not breaking in the face of a trivial change to the body (like a counter).
--
I have no fin
no wing no stinger
no claw no camouflage
I have no more to say...
Just use mail filters (Score:3)
--
I have no fin
no wing no stinger
no claw no camouflage
I have no more to say...
Filter messages before checksumming (Score:2)
In addition to the raw message checksum, possible filters include:
Re:The problem is... (Score:2)
Btw, check out
http://www.picante.com/~gtaylor/spam/
to read about someone's efforts to get rid of spam via a slew of techniques, including an automated classification system (Naive Bayes).
Jason
Re:Cell phones are great (Score:2)
---
man diff (Score:2)
diff message1 message2 | wc -l
--
Re:Why go through that much trouble to detect SPAM (Score:2)
Re:Hashed bigrams count (Score:2)
There is no such thing as a "common histogram". They will all be different. However, two identical messages will have identical histograms. Two almost identical messages will have almost identical histograms (while two almost identical messages usually have very different checksums).
The reverse is usually true (of course, there's no absolute guarantee): two almost identical histograms are very likely to come from two almost identical messages. The more you increase N (the bound for the hash result and size of the histogram), the more accurate the result. Also, using trigrams would likely be more accurate.
While it is possible for spammers to vary their messages, they cannot send thousands of messages that are really different one from the other and this is why this technique should work almost all the time. Of course, you'd need to get rid of headers and any html tags and garbage before computing the histograms.
Re:Hashed bigrams count (Score:2)
Once again, you're assuming there is such a thing as a "normal histogram". Remember that we're not checking whether the "histogram" is normal or not. We're checking to see if this particular histogram (from a spam e-mail) has been seen more than x times before. Even if they manage to get a piece of spam to match the exact same histogram as a valid e-mail, the piece of spam will still be rejected, with the unfortunate side effect that the valid message might be rejected too (but since they cannot read your mail, they cannot get one of your e-mails rejected).
As for the CPU time, sure you don't want to make N too large...
Re:Hashed bigrams count (Score:2)
Manhattan distance, aka L1 norm of the difference.
And the reason I said it should work is that I have already tried that a while ago for a slightly different task. The only thing I'm not too sure about is CPU time.
As for histogram randomness, even if the N-dimension (N ~ 1000) vectors (histograms) don't have a uniform distribution in the 1000-D space, you'd have to be very unlucky to get the same (or approx.) value for all of the 1000 bins.
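The histogram-plus-Manhattan-distance scheme discussed in these replies could be sketched in Python roughly as follows. Several details are assumptions, since the thread does not pin them down: character bigrams (rather than word bigrams), `zlib.crc32` as the hash, and N = 1000 bins.

```python
import zlib

def bigram_histogram(text: str, n_bins: int = 1000) -> list[int]:
    """Hash every character bigram into one of n_bins buckets and count hits."""
    hist = [0] * n_bins
    for i in range(len(text) - 1):
        bigram = text[i:i + 2].encode("utf-8")
        hist[zlib.crc32(bigram) % n_bins] += 1
    return hist

def manhattan(h1: list[int], h2: list[int]) -> int:
    """L1 norm of the difference between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

# Hypothetical messages that differ only in a trailing counter.
h1 = bigram_histogram("Make money fast! Reply now. id=0001")
h2 = bigram_histogram("Make money fast! Reply now. id=0002")
print(manhattan(h1, h2))
```

Changing the last character replaces a single bigram, so the distance between the two histograms is at most 2, whereas a plain checksum of the two bodies would have nothing in common. As the replies note, headers and HTML tags would need to be stripped before building the histogram.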
Hashed bigrams count (Score:5)
Re:Relevant but somewhat off-topic question (Score:2)
Who ever thought people would ABUSE this sort of stuff?
Hell, at one point it was an accepted practice of being a good net citizen to have guest accounts on your machines too.
These are, of course, all legacy attitudes. Sorry to see them go, of course. Would be great to live in that world, wouldn't it?
-Steve
Re:Checksums? (Score:2)
Pretty impressive procedure!
I would have thought that going to the next level of spam filtering would require shoving messages of dubious origin into some delayed-delivery hopper that would be scrutinized carefully against the results of incoming messages from throw-away spam-gathering accounts on other machines.
Your system of historical analysis makes it possible to defer the date when we will be forced to resort to multi-account inbox comparisons to filter out spam.
Re:My Life as a Spammer (Score:2)
The checksum is fuzzy (Score:5)
Download the source tarball [rhyolite.com], uncompress, untar and read
On a deeper note, it's sad that so many Slashdot readers, including apparently CmdrTaco, underestimate others so severely. Do you really think someone put in the effort to make something like dcc and never thought about how a message could be varied to evade the checksum? And why not read the linked document first? You would have found: Summary: read before you criticize, and recognize that others probably thought the same thing you're thinking.
hmm (Score:3)
So ...
What's the big deal? (Score:2)
After all, e-mail is checked when I want to check it and when I see any subject asking me what the state of my sexual arousal is or offering me a university diploma or just something from 348djkea23@yahoo.com I know I can easily delete it. It's not like a phone call where I don't know who's calling me and I kind of have to answer it right then. I do have caller id, but that's an additional service I have to pay for and most of my friends are out of state so they show up as 'unavailable' along with all the other salesmen.
For unsolicited mail, I have to handle it no matter what, I can't just leave it in my mail box forever. But with e-mail I never really have to see it and I can delete it without having to ever give it a second thought and it's gone gone and not just taking up space in my trash can or recycle bin.
Perhaps someone here can enlighten me.
p.s. I'm sure I have more to say on this topic, but I really need to be getting back to work
--BEGIN SIG BLOCK--
I'd rather be trolling for goatse.cx [slashdot.org].
Re:Cell phones are great (Score:5)
Check out http://www.scambusters.org/809Scam.html if you don't know what I'm talking about.
randomised strings (Score:2)
Add invalid HTML tags (Score:2)
Re:"Pretty close" checksums? (Score:2)
http://slashdot.org/comments.pl?sid=01/07/30/1444
Re:Add invalid HTML tags (Score:2)
Re:Add invalid HTML tags (Score:2)
Re:Checksums? (Score:3)
i tried that, had very good success. read more about it at:
http://www.blackant.net/code/oth/random/nlp-spamfilter.php [blackant.net]
i collected a sample of 30-plus spam messages as well as 30-plus not spam messages and ran some word and phrase frequency counts on each group, then threw that data into a couple mysql tables. Next i match the phrase and word frequency counts to new mail that arrives, and depending on how closely the new mail matches the known groups, i can tell whether or not the mail is spam.
by tweaking the exact amount needed to be determined as spam or not-spam, i had very, very good success rate - out of 32 messages checked using this method, all were appropriately identified as either spam or not-spam.
I've been meaning to continue with this line of spam detection, increasing the size of the db and testing it on a larger sample of mail (read: all my mail) and then seeing if the results were still as good, but...
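A stripped-down Python sketch of the frequency-matching approach described above is given here. It is not the poster's code: it uses in-memory `Counter` objects instead of MySQL tables, single words rather than words and phrases, and the scoring rule and sample messages are assumptions made for illustration.

```python
from collections import Counter
import re

def tokens(text: str) -> list[str]:
    """Lower-case word tokens of a message."""
    return re.findall(r"[a-z']+", text.lower())

def train(messages: list[str]) -> Counter:
    """Word frequency counts over a group of messages."""
    counts = Counter()
    for m in messages:
        counts.update(tokens(m))
    return counts

def score(message: str, counts: Counter) -> float:
    """Sum of the relative frequencies of the message's words in one group."""
    total = sum(counts.values()) or 1
    return sum(counts[w] / total for w in tokens(message))

def is_spam(message: str, spam: Counter, ham: Counter, margin: float = 0.0) -> bool:
    """Spam if the message matches the spam group more closely than the other group."""
    return score(message, spam) - score(message, ham) > margin

# Tiny made-up training groups.
spam_counts = train(["make money fast", "free money offer now"])
ham_counts = train(["lunch meeting on thursday", "see the attached report"])
print(is_spam("free money", spam_counts, ham_counts))
```

The `margin` parameter plays the role of the "exact amount needed" that the poster tweaked; with samples this small the result says little about how it would hold up on a full mailbox.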
-f
Re:Personalised spam (Score:2)
You could even improve this when you've received several of the same by cross-comparing them and working out which paragraphs change and which stay the same. You could then combine the individual paragraph checksums into a single checksum, and only check that part of the message - that'll save on storage of lots of checksums.
The only trouble I can see is when this is one of those three-line ones that just says "Feeling horny? Go to here for XXX" or whatever. If those added some destination-specific heading, it would be difficult to set the filter tolerances tight enough so that genuine emails with one or two sentences that match don't get filtered.
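The cross-comparison idea above might look like this in Python. It is only a sketch: splitting paragraphs on blank lines, using MD5 per paragraph, and the function names are all assumptions, and it needs at least one copy to work with.

```python
import hashlib

def paragraph_checksums(body: str) -> list[str]:
    """One MD5 checksum per non-empty paragraph (paragraphs split on blank lines)."""
    paras = [p.strip() for p in body.split("\n\n") if p.strip()]
    return [hashlib.md5(p.encode("utf-8")).hexdigest() for p in paras]

def stable_signature(copies: list[str]) -> str:
    """Combine the checksums of paragraphs present in every copy into one checksum."""
    per_copy = [paragraph_checksums(c) for c in copies]
    common = set(per_copy[0]).intersection(*per_copy[1:])
    ordered = [c for c in per_copy[0] if c in common]   # keep original paragraph order
    return hashlib.md5("".join(ordered).encode("ascii")).hexdigest()

# Hypothetical personalised copies of the same spam.
a = "Dear alice,\n\nBuy our pills today.\n\nClick here now."
b = "Dear bob,\n\nBuy our pills today.\n\nClick here now."
print(stable_signature([a, b]))
```

Only the single combined checksum needs to be stored; the personalised greeting drops out because it never appears in every copy.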
Grab.
Countering Counters (Score:2)
Eg if we assume that much of the spam problem is from open relays, then recognising that >N% of local users have gotten a message mailed through a given relay may be enough to flag it suspicious.
Doesn't help the mailing list problem of course.
I think the best anti-spam measure is simply to divide email into high quality and low quality lists based on the sender and have the user say which senders should be treated as high quality in future. If people you sent mail to were added to the high quality list by default that would take much of the work out of it. Since this way you are trying to pick out good stuff rather than remove spam, it is harder to counter.
Add to that a magic word system. Messages with the magic word in the subject are tagged as high quality. Then you can give people you really want to hear from the magic word along with your email address. Change the word regularly and old information won't come back to spam you.
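As a sketch of how the high-quality list and the magic word could fit together, here is a small Python example. The function names, the sample addresses and the word "heron" are all hypothetical; the logic just follows the two rules described above.

```python
def is_high_quality(sender: str, subject: str,
                    trusted: set[str], magic_word: str) -> bool:
    """High quality if the sender is trusted or the subject carries the magic word."""
    if sender.lower() in trusted:
        return True
    return magic_word.lower() in subject.lower().split()

def note_outgoing(recipient: str, trusted: set[str]) -> None:
    """People you send mail to are added to the high-quality list by default."""
    trusted.add(recipient.lower())

trusted: set[str] = set()
note_outgoing("Friend@example.org", trusted)

print(is_high_quality("friend@example.org", "hi", trusted, "heron"))        # trusted sender
print(is_high_quality("stranger@example.net", "heron: quote", trusted, "heron"))  # no match: "heron:" != "heron"
print(is_high_quality("stranger@example.net", "heron quote", trusted, "heron"))   # magic word present
```

Matching on whole subject words keeps the check simple but means punctuation attached to the magic word defeats it; a real filter would probably strip punctuation first.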
_O_
spambouncer works great for me (Score:3)
I guess this doesn't solve the problem of server resources getting stolen, but it certainly saves me from having to look at the crap.
-matthew
the ultimate spam filter (Score:2)
Just Because they would counter it. (Score:5)
One thing bothers me though, as I was clearing out a large 'stuck' email for one of our dial-up customers the other day, I happened to casually mention "Wow, you sure do get a lot of spam!" to which they replied "What's that?" "You know, junk email" "Junk e-mail? I read it all" People like that are why our boxes receive such garbage. You fire enough bullets and SOMEone is going to die.
Re:hmm (Score:2)
Or something like that, depending on what you use for filtering news and email. For me, it's got to be GNUS Score files and Procmail.
Duh... (Score:5)
Leave it to the Slashdot crowd to make things a million times more complex than they need to be...
Surprisingly, that can work! (Score:2)
However, multiple checksums of subsets of the email would not usually all be changed by one or a few changes/counters, and checksums will be sufficiently discriminating to screen emails and can do a very good job of detecting any widespread junk emails.
It would be difficult to vary a junk email enough that the checksums of ALL of its substrings of a particular length (say 20 characters) come out different.
(Checksums that checksum all the strings for a particular length are not difficult to generate as a matter of fact; little more than a circular buffer is required.)
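A minimal Python sketch of checksumming every fixed-length window with a circular buffer is shown below. The polynomial rolling hash, its `BASE` and `MOD` constants, and the `overlap` measure are choices made for this illustration; the comment above does not specify which checksum to use.

```python
from collections import deque

BASE = 257
MOD = (1 << 61) - 1   # large prime modulus, chosen for this sketch

def window_checksums(text: str, width: int = 20) -> set[int]:
    """Rolling checksum of every width-byte window in text."""
    data = text.encode("utf-8")
    if len(data) < width:
        return set()
    top = pow(BASE, width - 1, MOD)   # weight of the byte leaving the window
    window = deque(maxlen=width)      # the circular buffer
    h = 0
    sums = set()
    for byte in data:
        if len(window) == width:
            h = (h - window[0] * top) % MOD   # drop the oldest byte
        window.append(byte)
        h = (h * BASE + byte) % MOD           # shift in the new byte
        if len(window) == width:
            sums.add(h)
    return sums

def overlap(a: str, b: str, width: int = 20) -> float:
    """Fraction of a's window checksums that also occur in b."""
    sa, sb = window_checksums(a, width), window_checksums(b, width)
    return len(sa & sb) / len(sa) if sa else 0.0

# Hypothetical spam that differs only in a trailing counter.
a = "Make money fast! Reply now to claim your prize. id=0001"
b = "Make money fast! Reply now to claim your prize. id=0002"
print(overlap(a, b))
```

Each step costs constant work regardless of the window width, which is why a counter at the end of the message only disturbs the few windows that actually contain it.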
Re:"Pretty close" checksums? (Score:2)
Yeah, there are. 'Rdist' does this as a way of trying to only send the minimum set of changes necessary to keep two ftp/web sites synchronised.
Actually to be precise, the checksum isn't imprecise, as rdist relies on checksums of subsets of the documents they are trying to synchronise.
This neatly sidesteps the counter issue...
Re:"Pretty close" checksums? (Score:2)
When anybody receives an email, they can check a handful of random checksums against the checksum server, if enough of them match, then do a few more to be sure and deal with the email according to any settings by the user.
Still, there are issues. What happens if the email marketeers start appending random web pages to their email to dilute it down? What percentage of similarity is enough? There are some fixes- I think to be successful junk mail has to be fairly short- people rarely page down to cut to the chase; but adjusting the checksum points to emphasise the beginning and end of the email is probably a good thing.
Re:hmm (Score:2)
A grand per message. Nice.
Issues... (Score:2)
I submitted a story about building a steam-powered microprocessor with RAM made out of banana peels, and that didn't get posted--why this?
Re:Spam Hunters (Score:2)
Well there is this classic [segfault.org] from a couple years ago on Segfault [segfault.org]:
Mafia Don Announces New Anti-Spam Venture
Posted on Fri 02 Apr 19:25:26 1999 PST
As the NSA and FBI fear, traditional crime organizations have been incorporating high-tech communication into their organizations. Although Janet Reno was quoted stating "This is law enforcement's worst nightmare.", techies around the world are sure to be pleased with one New York Syndicate's new venture.
It all started when Don Dominiqi signed onto his AOL account last Monday morning. His inbox was filled with "Make Money Fast", "Viagra On-Line", and "Teenybopper Web Sex" ads. Lost amidst the drivel was an important note detailing a non-taxed shipment of Marlboros, which were later confiscated by the BATF. Little did he know, as he shouted "Bring me the left hand of this f*cking gutterslime!" what would become of it all.
Later that same day, Billy "Run!" Brutekowski and Larry "My Eyes!" Plucker cornered the pasty-faced offender of the Family in a small cyber cafe in Greenwich Village. "This was by far the creepiest place the Boss has ever sent us." stated Billy, who only spoke on condition of anonymity. "Everyone in this place looked pale and sickly, like they had already been 'spoken to'. We asked for this punk, and several people quickly pointed him out. Most of the scum we find in gin joints aren't so quick to finger one of their own," Billy continued.
"He must not watch much TV, because this sh*t didn't even flinch when we came to the corner he was hiding in," Larry proceeded to relate. "We dropped this sheet of paper the Boss had given us on his table and he says 'So you guys want to make money fast, eh?' He puts out his hand and says to give him $20. This scrawny little dirtball tells me to give him $20!" Larry was quite agitated at this part in his story, and his description of how Sammy Spammer's hand fell off was quite garbled.
Billy continued, "Up till now, this was a routine visit. We was just being playful. The weird sh*t began when we tried to leave." "This pimply faced kid blocks the door as we try to leave, and I'm thinking to myself 'Great, a f*cking Karate Kid hero.' He just stands there, and then he hands me a $5 bill." Billy pulls out the $5, and holds it like it is his first quarter from his favorite grandmother. "They lined up after that, and we had $175 in 'tips' when we left the joint."
Later that day the Don himself visited the café, unwilling to believe the story. Although the details are unclear, sources at the café indicate that the Don has hired them to build and host a new Anti-Spam site. Through a SSL transaction system, the site will accept spam complaints and credit card donations towards 'solutions to problems'. Multiple complaints against the same spammer are added to the total until an acceptable solution has been found.
Larry tells us that a typical $250 solution is a broken hand, and for $2000 all anyone ever sees again of 'the problem' are his shoes.
The URL is to be announced next week, and the cyber café's phones have been jammed with requests for more information.
Spam Hunters (Score:4)
Spammers need to be licensed (preferably with an ear tag, but i'll consider substitutes) and fully identified. all spam needs to have a spam license number in the header someplace.
Fees can then be and need to be collected by your favorite government agencies (I think the IRS, the NSA, and BATF will do for now). ISPs and users need to be able to bill spammers some amount for the spam processed and received. Fees need to be large enough that it is worthwhile to go after them, and then we can have bounty hunters. Fees can be high enough to reduce the cost of access. Penalties for abuse can be heavy (20 years in jail, for example).
Then we can have spam hunters who will go out and collect from the spammers for you in exchange for a percentage.
Re:I can't see this working (Score:2)
I attended a talk Dr.Landauer did on it a couple years ago, and one of the more interesting uses for the system is text categorization (They were using it to mark term papers...this paper is similar to an A paper...this paper is similar to a D, that sort of thing...they actually got a fairly high correlation with human markers)
Anyway, I started to wonder if it could be applied to spam hunting...Should I ever get the system to a useable state ('training' the system requires some rather large matrix manipulations...and my poor dual Celeron just couldn't handle it...27-42 hours worth of processing time on the small samples I was working with for a term paper at the time)
Now that I've upgraded to a significantly faster machine, if I were to take some time to optimize the code I might be able to get it down to the point where I could start training it on my ever-growing "Library Of Spam".
Of course, I'm probably one of the few ppl on the planet who actually COLLECTS spam...and my friends tell me I need a gf!
Anyway, at the point it's at now, it's still at just a 'hey, wouldn't it be neat if' stage...I honestly haven't a clue how well it will work...
Who knows? The analysis might make an interesting master's thesis some day...It would certainly be handy to have a research-class number cruncher to handle the matrices involved...
Re:hmm (Score:2)
Pseudonymity provides more continuity (there are some Slashdot posters whom I recognize by name), gives people less incentive to be stupid ("FIRST POST! Natalie Portman and hot grits!"), means that the poster is more likely to catch a reply, and generally says, "I was willing to at least go through the trouble of getting a throw-away hotmail account so I could register on Slashdot." Is it a cure all? No. Are there worthwhile AC posts? Yes. But for the most part, it isn't worth the effort to wade through the garbage to catch the good ones. Besides, some of the good ones'll get caught by moderators, anyway.
And, if you want accountability, don't go to usenet, or stay in moderated groups.
Great! I propose a solution that doesn't stop anyone from posting, but allows me to selectively filter what I read, yet some genius AC declares, "If you don't like the way it is, go somewhere else." ...and yet he still wonders why I feel superior to the ACs of Slashdot.
(As an aside, I'll generally read AC messages that reply directly to posts that I make. But more and more often, I wonder why I even bother.)
Re:hmm (Score:3)
M`A.K,E M:O'N"E,Y F.A`S'T
*shudder* Every now and again, I wish we would have optional accountability in Usenet, similar to how I can set my default read-level on Slashdot high enough that J. Random Anonymous Coward never shows up. Couple that with a clause in the ISP's contract that allows them to assess significant fines against spammers, and we'd be (theoretically) set.
Then I wake up and realize that people'll just steal accounts or even use litigation [paetec.net] to block the ISP from cutting them off for spamming. That's when I wish we could just train those kids who want to go on school shooting rampages to just take out spammers instead, killing two birds with one stone.
Cell phones are great (Score:2)
Fingerprinting required, not checksumming (Score:2)
The intellectual property protection people have been thinking about this sort of problem for a long while now. Just as they want to be able to detect when something has been copied, the spam-haters want to detect when something is a copy. Both want to be successful in the presence of countermeasures. It's the same problem!
There's a vast amount of literature available out there. Any half-way decent search engine should throw up more than you can read in a reasonable time.
Paul
Re:What's the big deal? (Score:2)
Re:Laws about PRON (Score:4)
Re:"Pretty close" checksums? (Score:2)
Re:I can't see this working (Score:4)
A few years ago, during the big push for a "smart army", millions of dollars were poured into having individual tanks recognize enemy tanks on the battlefield. Well, it turns out they did it with a neural network, and after quite a bit of training they got it to reliably recognize enemy tanks as such.
Then the day finally arrived when the general showed up, and they had to give the demo. As you can probably predict, it crashed and burned. Why? Well, the system was trained on bright, sunny days in the middle of the desert (real sun!), and the demo was on the first overcast day in a year, and the neural net had trained itself to recognize the *shadow* of a tank, not the tank itself.
Caveat neural-net-user.
False Positives (Score:3)
While the system could be broken by using counters, this could be countered by parsing only certain portions of the mail or counting the frequency of certain words. Would work very well on pure text spam, but not on attachment stuff.
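For what it's worth, the word-frequency idea can be sketched in a few lines of Python. This is only a rough illustration, not how DCC works: the function names, the sample messages, and the choice of cosine similarity as the comparison are all assumptions made for the example.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text and count alphabetic words; digits (e.g. counters) are ignored."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (near 1.0 = same word mix)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical sample messages: two copies of a spam that differ only in a counter,
# and one unrelated memo.
spam_1 = "Make money fast! Work from home and make money today. Ref 48213"
spam_2 = "Make money fast! Work from home and make money today. Ref 90577"
memo = "The budget meeting has moved to Thursday at noon in room four."

sim_spam = cosine_similarity(word_counts(spam_1), word_counts(spam_2))
sim_memo = cosine_similarity(word_counts(spam_1), word_counts(memo))
print(f"spam vs spam: {sim_spam:.2f}, spam vs memo: {sim_memo:.2f}")
```

Because the counter is made of digits, it never enters the word counts, so the two spam copies compare as the same message while the memo does not.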
What would be funny would be to see the false positives of such a system. Many mails I get from the administration all look the same; I wonder if they would be considered spam. They are quite similar to spam: useless and too numerous...
Re:Add invalid HTML tags (Score:2)
Now discard all articles and very common words (ones that don't convey information and can't be used to form whole sentences; don't eliminate any verbs). You're left with the bare essence of what the email conveys, and anything that's not in this can't be in the original. Then CRC this one. Heheh, try to get around that, spammer.
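A minimal sketch of that last step in Python, assuming a tiny made-up stopword list (a real one would be far longer) and a hypothetical helper name `essence_crc`:

```python
import re
import zlib

# Hypothetical, deliberately tiny stopword list, for illustration only.
STOPWORDS = {"a", "an", "the", "of", "to", "and", "or", "in", "on", "at", "for", "very", "really"}

def essence_crc(text):
    """CRC-32 of the message after lowercasing and dropping punctuation and stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    kept = [w for w in words if w not in STOPWORDS]
    return zlib.crc32(" ".join(kept).encode("utf-8"))

a = essence_crc("Buy the BEST pills at a very low price!!!")
b = essence_crc("buy best pills -- at the really low price")
c = essence_crc("Buy the best pills at a high price")
print(a == b, a == c)
```

The first two messages reduce to the same word list ("buy best pills low price") and so share a CRC; the third changes a content word and gets a different one.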
Er, actually, one thing I notice is that I didn't address "random" spacing. My system wouldn't realize that "random" is a word there. Solution: don't split on white space. Remove all white space, then use a dictionary that lets you see how close something is to being a word; add letters until you're farther from being a word than you were before, and pop that off as a separate word. You can look ahead slightly, so that you don't pop "nation" just because it's more of a word than "nationa" is, if the letters afterward are "lity".
Sound good?
--
Re:Add invalid HTML tags (Score:2)
b) I've already figured out a way around it! As a spammer, have your spam engine combine your sentences in arbitrary order. What about sentence matching? Set it so it adds removable phrases ("I repeat, you will never be charged") with modifiers like "seriously", and "we're not kidding", and even "very", "extremely", etc.
Your "Spam Engine Markup for Interception-Neutralization and -Avoidance Language" (Seminal!) can have special tags telling you where you can put filler phrases. At the end, you can include a lot of random words from a news site or whatever, to throw off word-frequency analysis.
The idea is that it's a lot easier for a spammer to change things around in random order than it is for a mail server to order them back again for comparison. So, plan no-go.
--
Re:Spam Hunters (Score:2)
maybe spammers could be branded with a giant S with a hot-iron like they did with cattle in the old west....
Re:Cell phones are great (Score:2)
Wow, sounds like you have a great cell phone plan. Do you get local calls free too ? ;-)
Re:What's the big deal? (Score:4)
When you get a telemarketing call, they pay their long distance company for the right to call you. It doesn't cost you a penny to pick up the phone. When you get junk (snail) mail, the marketer had to pay the postal service to send mail out to each and every address. Not only does it not cost you anything, but in the case of the U.S. Postal Service these bulk rates actually lower the cost of you sending mail, since they use it to subsidize part of the cost of personal mail.
Bulk E-mail on the other hand is a different thing. First off, if you're not on a land-based U.S. phone line, odds are you're paying per-minute for your connection -- which sucks since you have to pay to get spam dumped in your E-mail program's inbox.
Even if you have a flat rate connection, you're still inevitably paying for spam mail, whether or not it's directly. Bandwidth isn't free -- take a 5k spam mail message and multiply it by 10 million messages, both of which are probably conservative estimates, and you're talking about 50 gigabytes each time a spam is sent out. If you get 3 spam messages a day, that's 150 gigabytes of bandwidth just for the mailings that reached you -- which is only a tiny fraction of all the spam sent out in a day. Multiply 50 gigabytes by the countless number of mailings, and that's a lot of bandwidth going up in smoke daily.
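Working the multiplication through with the figures in this post (5 KB per message, 10 million recipients per mailing, 3 mailings a day), and assuming decimal units (1 KB = 1000 bytes, 1 GB = 1000^3 bytes):

```python
KB = 1000          # bytes, decimal units assumed
GB = 1000 ** 3     # bytes

message_size = 5 * KB        # one spam message
recipients = 10_000_000      # messages per mailing
mailings_per_day = 3         # mailings that reach a given inbox each day

bytes_per_mailing = message_size * recipients
gb_per_mailing = bytes_per_mailing / GB
gb_per_day = gb_per_mailing * mailings_per_day

print(f"{gb_per_mailing:.0f} GB per mailing, {gb_per_day:.0f} GB per day for three mailings")
```

That comes to 50 GB for a single mailing and 150 GB for three.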
Guess who's paying for it? Hint: with spammers usually using stolen ISP accounts and fake credit card numbers, probably not them. Another hint: when ISPs' bandwidth costs go up, they pass it on to the users.
Not to mention the fact that spammers shoving millions of messages through creaky mail servers can take them down. So even excluding the monetary damage, what's it worth if a piece of E-mail sent to/from you was on that server when it went down in flames? Your message may be delayed, or it may never show up at all.
"Pretty close" checksums? (Score:3)
Does someone have a link?
-- I had a female crustacean once, but I lobster...
Counters? Already do. (Score:2)
There's also the spam that includes customized URLs in the message (image downloads that, say, have your email address embedded in the query string -- sneaky little "live address" confirmation technique).
Re:I can't see this working (Score:3)
It depends. If we think of cryptographic hash functions, you are right. They are designed that way in order to avoid collisions and forging of messages that are mapped to a given value by a particular function.
But if we think of error correcting codes, the situation is different. They are designed with the opposite goal in mind -- changing gracefully when certain errors (i.e., small changes for some definition of "small") occur, to allow for reconstruction of the original data.
Usually both the checksum and the corrupted data (or the corrupted data + checksum string, to be precise) are needed in the case of error correcting codes. But perhaps concepts from both -- closely related -- fields could be combined to create something usable for spam detection under hostile spammer conditions?
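To make the contrast concrete, here is a small Python sketch. The first half uses a real cryptographic hash (SHA-256). The second half is not an error correcting code; it swaps in a much simpler stand-in, Jaccard similarity over 4-character shingles, purely to show what "changing gracefully" under a small edit looks like. The sample text and helper names are made up for the example.

```python
import hashlib

def sha256_hex(text):
    """Cryptographic digest: any change to the input gives an unrelated output."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def shingles(text, k=4):
    """Set of all overlapping k-character substrings of the text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(a, b):
    """Shared shingles divided by total distinct shingles (1.0 = identical sets)."""
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

original = "Make money fast with this one simple trick. Reply now to claim your prize."
tweaked = original + " #8841"   # a spammer's per-message counter

same_digest = sha256_hex(original) == sha256_hex(tweaked)
similarity = jaccard(original, tweaked)
print(f"digests equal: {same_digest}, shingle similarity: {similarity:.2f}")
```

The digests no longer match once the counter is appended, while the shingle similarity stays close to 1, which is the kind of behavior a spam fingerprint would need.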
Re:Relevant but somewhat off-topic question (Score:2)
Sadly, relays are still needed today because of spam blockers. A depressing number of sites require that email come from the "correct" IP address (your From: address must have the same MX record as your IP address) which means your ISP must maintain a relay for your use, though it doesn't have to be an "open" relay.
With most ISPs, it's easy to bypass relays and send email directly to port 25 on the target machine, so blocking open relays wouldn't help much; it would just push the problem back one step.
Re:What's the big deal? (Score:2)
This is of course, bullshit.
Email is so cheap that for most people the cost of throwing away the junk mail they receive is greater than the cost of downloading the spam. If you figure bandwidth at $10 / gigabyte, which is very high, then a 10K email costs a hundredth of a penny.
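The per-message figure checks out if you assume decimal units (10K = 10,000 bytes, 1 gigabyte = 10^9 bytes):

```python
price_per_gb = 10.00             # dollars per gigabyte, the "very high" figure above
email_bytes = 10 * 1000          # a 10K email, decimal units assumed
bytes_per_gb = 1000 ** 3

cost_dollars = price_per_gb * email_bytes / bytes_per_gb
cost_cents = cost_dollars * 100

print(f"${cost_dollars:.4f} per email, i.e. {cost_cents:.2f} cents")
```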
The true cost of spam is the time wasted reading the crap. And if people weren't up in arms about it, there would be a lot more of it in your email box. It's sort of like flaming people for bad posts on usenet - it's not that the posts/spam is so bad, it's that if we don't do it, they'll just get worse and worse.
How I filter spam (Score:4)
Now, the interesting thing is what I do once I've decided to filter the mail. Since my rules catch legitimate mail, I don't just throw it away. I wrote a small collection of Perl scripts (which I'll release to the world someday soon, but they need documentation) that maintain a whitelist of sender addresses.
If a filtered message is from an address that's marked valid, it's delivered. If it's from an address that's marked invalid, it's discarded. If it's from an unknown address, the message is put in a holding area and an autoreply is sent back to the sender from a magic address asking them to reply in order to validate themselves.
The magic address is unique per filtered message -- it uses qmail's address extension mechanism -- and mail to the magic address never gets delivered to me, so I don't care if it gets added to spam lists. The Perl script behind the magic address does a quick check to make sure it's not processing a bounce, then marks the sender of the original message as valid and delivers the original message (or messages if more than one arrived while awaiting validation).
Held messages are cleaned out by a cron job when they get too old.
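The real thing is a set of Perl scripts hooked into qmail; what follows is only a rough Python sketch of the same flow, with everything kept in memory. The class name, method names, and return values are invented for illustration, and the cron cleanup of old held messages is left out.

```python
import secrets

VALID, INVALID = "valid", "invalid"

class ChallengeFilter:
    """In-memory model of the whitelist-plus-autoreply flow for filtered mail."""

    def __init__(self):
        self.senders = {}   # sender address -> VALID or INVALID
        self.held = {}      # magic token -> (sender, message), in arrival order

    def receive(self, sender, message):
        """Handle a message the filter rules flagged; returns (action, magic_token)."""
        status = self.senders.get(sender)
        if status == VALID:
            return ("deliver", None)
        if status == INVALID:
            return ("discard", None)
        # Unknown sender: hold the message and challenge from a per-message magic address.
        token = secrets.token_hex(8)
        self.held[token] = (sender, message)
        return ("challenge", token)

    def reply_to_magic(self, token, is_bounce=False):
        """Process mail sent to a magic address; returns the messages to deliver."""
        if is_bounce or token not in self.held:
            return []
        sender = self.held[token][0]
        self.senders[sender] = VALID
        # Release everything held from this sender, not just the challenged message.
        released = [m for s, m in self.held.values() if s == sender]
        self.held = {t: sm for t, sm in self.held.items() if sm[0] != sender}
        return released

f = ChallengeFilter()
_, token = f.receive("friend@example.org", "first message")
f.receive("friend@example.org", "second message")
print(f.reply_to_magic(token))                      # both held messages come out
print(f.receive("friend@example.org", "third"))     # now delivered directly
```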
This is sort of similar in concept to the password mechanism of SpamBouncer or (a closer cousin) SpamCop's whitelist feature, but it doesn't require senders to retransmit their messages, which I always thought was pretty annoying to ask people to do since not everyone saves their outgoing mail. Granted, asking them to do anything is kind of annoying, but at least this is less so since they can just hit "reply" and "send".
This setup is cool because it allows friends to Bcc me on stuff without my "I must be listed as a recipient" rule trashing their messages, even if they've just switched E-mail addresses. It is admittedly based on the assumption that spammers don't read replies to their mail and/or wouldn't go to the effort of unlocking themselves; I have yet to see a spammer do that, and given the economics of spamming I think that'll be a safe assumption for the foreseeable future, unless this approach gets so popular that spammers start writing automated unlock bots!
For USENET! (Score:3)
Anyway, just my 2 cents...
Worms? (Score:3)
Is there some hidden reason why we would want millions of copies of an email worm's attachment to get through? This could actually be part of the solution to two problems.
Also, do note that a common method of spamming is to connect to an open relay and have the relay take care of sending out thousands of identical messages by simply sending thousands of "RCPT TO:" commands. Checksumming spam would completely break this spamming method and would force the spammer to retransmit the entire message for every recipient in order to vary it, thus making the process more costly.
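A toy model of why identical relayed copies are easy to catch and varied ones are not. This is not DCC's actual protocol: the `Clearinghouse` class, the SHA-256 choice, and the threshold of 3 are all assumptions for the sake of the example.

```python
import hashlib
from collections import Counter

class Clearinghouse:
    """Counts how many times each exact message body has been reported."""

    def __init__(self, threshold=3):
        self.threshold = threshold   # hypothetical cutoff for "bulk"
        self.reports = Counter()     # body digest -> number of reports

    def report(self, body):
        """Record one sighting of this body; return True once it looks like bulk mail."""
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        self.reports[digest] += 1
        return self.reports[digest] >= self.threshold

hub = Clearinghouse(threshold=3)

# One body fanned out by a relay to four recipients: every copy has the same checksum.
relayed = [hub.report("Buy now!") for _ in range(4)]

# Four individually varied bodies: each must be transmitted separately, and none matches.
varied = [hub.report(f"Buy now! id={i}") for i in range(4)]

print(relayed, varied)
```

The relayed copies trip the threshold on the third report; the varied ones never do, but the spammer has to send each one in full.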
-all dead homiez