
Distributed Checksum Clearinghouse vs Spam 216

Posted by CmdrTaco on Monday July 30, 2001 @11:43AM from the something-to-think-about dept.

AllSpammedOut writes: "Spam could be more easily detected if everyone were to compare the mail messages they received. Using the Distributed Checksum Clearinghouse, MTAs can report the checksums for all messages they receive and be notified when a checksum has already been reported by many other systems." Obviously there are issues with something like this (especially mailing lists, and worms that do attachments). I suspect spammers would just include a counter to break checksums tho.

This discussion has been archived. No new comments can be posted.


at least hackers are smart (Score:2)

by Anonymous Coward writes:

so why don't we have a spammers vs. hackers war? they could fight over who's the most annoying, winner take all. spammers spam the crap outta hackers' sites and mailboxes, while hackers launch DoS attacks on the spammers' service providers. it might just keep both sides busy enough to buy the rest of us a little peace and quiet.
Re:Worms? (Score:2)

by Anonymous Coward writes:

This is true. It is claimed that over 90% of spam [wired.com] is sent through open relays, meaning that the spammer uses multiple RCPT TO commands and sends the identical message to each recipient. Most spammers don't have the bandwidth that it takes to send each user a personalized message, because they are almost always on a throwaway dialup. Only the professionals can afford to send unique messages, because they often have a DSL line and a pink contract with their ISP (which permits them to continue spamming).
Relevant but somewhat off-topic question (Score:4)

by Have Blue ( 616 ) writes: on Monday July 30, 2001 @08:01AM (#2183175) Homepage

Why do open relays exist? Is there some beneficial use for them that I'm not aware of? Is this a relay's default state and the sysadmin is too busy or dumb to lock it down? Why doesn't everyone just secure their mail servers and cut off spam before it gets out?

Re:What's the big deal? (Score:2)

by Dr. Evil ( 3501 ) writes:

Hello "Don't Spam JeffSketch's hotmail address", what's that address? JeffSketch@hot... hmmm something.com... JefSkatch@hotmail.com? no... that's not it. I wonder why it would be so dangerous to post an email address on a web forum.
Maybe I should forward you the contents of my Hotmail account. It is up to 540 pieces of filtered spam. Only about 50% of my spam gets successfully blocked. This renders my occasional-use Hotmail account nearly useless.
But wait, that's a free account. I guess that means that nobody is paying for it. Neither in my time nor Microsoft's money.
Alas dear troll, if indeed you were not afraid of spam you would not be hiding your email address at all.
Say, I Recognize This (Score:2)

by waldoj ( 8229 ) writes:

This certainly looks familiar.

;)

No, I did propose something along these lines on Advogato back in February in a piece entitled "Realtime Worm Filtering System [advogato.org]," but I'm not accusing the author of ripping off my blatantly obvious and not-uncommon idea. That system is intended to stop worms, obviously, and not spam. Worms tend to be easier to stop because they're seldom wholly polymorphic, often retaining enough similarities that collaborative filtering is quite feasible.

-Waldo
Re:Checksums? (Score:2)

by Pig Hogger ( 10379 ) writes:

However, a number that represented how closely related an incoming email and a known spam message are would be a useful metric.

Not really. You could break each SPAM in 3 to 5 parts, and have a checksum on each part. Unless the "counter" spans two parts, only one of the checksums would be different.
And, if so, with cheap storage, why not store the whole SPAM? In case of a high number of checksum matches, a final precise double-check could be made.
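
A minimal sketch of the per-part checksum idea, in Python. The function names, the use of MD5, the four-way split and the sample messages are assumptions made for illustration; the comment does not specify any of them.

```python
import hashlib


def part_checksums(body: str, parts: int = 4) -> list[str]:
    """Split the body into `parts` roughly equal slices and checksum each one."""
    data = body.encode("utf-8")
    size = max(1, -(-len(data) // parts))  # ceiling division, at least 1 byte
    return [
        hashlib.md5(data[i:i + size]).hexdigest()
        for i in range(0, len(data), size)
    ]


def matching_parts(a: str, b: str, parts: int = 4) -> int:
    """Count the slice positions at which both messages have the same checksum."""
    pairs = zip(part_checksums(a, parts), part_checksums(b, parts))
    return sum(x == y for x, y in pairs)


# Two copies of the same spam that differ only in a trailing counter.
spam_1 = "BUY NOW! " * 20 + "id=0001"
spam_2 = "BUY NOW! " * 20 + "id=0002"
print(matching_parts(spam_1, spam_2), "of 4 parts match")
```

With a fixed-width counter at the end of the body, only the last part changes. A counter that changes the message length near the top would shift every later slice, so a real scheme would probably split on line or paragraph boundaries instead of byte offsets.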

--
Re:"Pretty close" checksums? (Score:2)

by Pig Hogger ( 10379 ) writes:

Me too!!!
Here is my method: http://slashdot.org/comments.pl?sid=01/07/30/1444247&cid=108 [slashdot.org]

--
Re:What's the big deal? (Score:2)

by Eivind ( 15695 ) writes:

Actually, 5K times 10 million is 50 gigabytes, not 50 megabytes. So it's a lot worse than you state above.
Re:What's the big deal? (Score:2)

by Eivind ( 15695 ) writes:

$10 a GB is ridiculously low for most people on the Internet. It's possibly true for those with a flat-rate high-bandwidth connection, but if you think that's the majority, then you're in for a surprise.
Here in Norway for example, which is probably about representative, about half the people dial into the Internet with modems, or by ISDN. Flat rate on telephone-calls is uncommon, the vast majority of that half pay about $1 an hour for the connection to the net. That works out as $50 a GB for those on ISDN, and $66 a GB for those on modem.
Even this estimate still assumes that the link is perfectly full, that is, that a person with an ISDN connection downloads email at a rate of 64 kbps, which isn't necessarily true (although it should be close for your ISP's local mailserver).
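
The arithmetic behind a per-gigabyte figure can be sketched as follows. This is a rough illustration only: the $1 per hour rate comes from the comment, while the 33.6 kbps modem speed, the decimal gigabyte and the assumption of a fully saturated link with no protocol overhead are mine.

```python
def dollars_per_gb(link_kbps: float, dollars_per_hour: float = 1.0) -> float:
    """Cost of moving 1 GB (10**9 bytes) over a fully saturated link."""
    bytes_per_hour = link_kbps * 1000 / 8 * 3600
    hours_needed = 1e9 / bytes_per_hour
    return hours_needed * dollars_per_hour


# Line rates are assumptions; real throughput is lower, which raises the cost.
for name, kbps in [("ISDN, one 64 kbps channel", 64), ("modem at 33.6 kbps", 33.6)]:
    print(f"{name}: about ${dollars_per_gb(kbps):.0f} per GB")
```

Under these assumptions the modem case lands close to the $66 quoted above, while a saturated 64 kbps channel comes out nearer $35 than $50. Either way the cost is several times the $10 a GB being disputed, and any real-world overhead only pushes it higher.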
Re:Relevant but somewhat off-topic question (Score:2)

by Skapare ( 16644 ) writes:

Also, in countries like China, which are currently booming in regard to new businesses going online, there is a very common usage of pirated copies of older versions of Microsoft Exchange which did not have the capability to stop spam, or have it disabled by default. Not being licensed copies they don't get the latest patches. And they usually don't even have a sysadmin, or if they do, it's one who is incompetent or one who can't read English. Unfortunately, most of the help to close relays is primarily in English. This is bad as English is not really so universal as Yanks and Brits might like to think. Translations into all languages are needed.

Spammers cost money to those who get spammed. Pushing the cost back to spammers and the ISP who (perhaps through inept management) support them, is one way to stop them. Laws will not since this is an international thing.
Re:What's the big deal? (Score:2)

by Skapare ( 16644 ) writes:

Paper spam has never been as significant a problem as electronic spam, because the sender pays most of the costs for paper spam whereas the receiver pays most of the costs for electronic spam. There is an economic throttle for the sender of paper spam. If we allow electronic spam to simply continue, it will scale up as most businesses would then perceive it to be legitimate. You'd end up having to delete thousands and tens of thousands per day. It would keep growing if there is the perception that it is legitimate and that it cost you nothing to delete.

Electronic spam does cost the receiver time and money. This includes the receiver's ISP. If you are on a dialup line (as most people still are because of the DSL debacle) the spam takes up more time on your mail downloads. As the problem grows it takes more time.

To sum it up, it might not appear to be that much of a problem for you at this moment, but if you scale it up to where it would be if no effort was made to stop it, you would not be able to handle the load. Some of us do understand the scaling issue. If every business in the world sent you ONE message PER YEAR, and somehow this were just evenly spread out in time, you would be deleting this crap every 2 to 3 seconds, 24 hours a day, 7 days a week, all year long. The scale of the internet is simply not suited for spam.

If you really have to get back to work, what do you do? Do you send spam all day, or do you delete it? Or do you just not get much of it?
Re:Relevant but somewhat off-topic question (Score:3)

by Skapare ( 16644 ) writes: on Monday July 30, 2001 @08:54AM (#2183196) Homepage
A network of authenticated mail servers could be very useful. But the effectiveness would be limited unless entry to the network requires agreement to terms to apply strong enforcement against spam, such as:
- Limit each dynamic IP host to not more than 1 email message every 2 minutes.
- Require dedicated network owners to agree to the same anti-spam agreement in writing to be allowed access to port 25 outbound or to access unthrottled mail servers.
- Require legitimate bulk mailers to agree to certain terms such as using only opt-in lists even though the law otherwise permits them to use an opt-out list.
- Must provide a contact address and/or telephone number for reporting abuse. Abuse reports from the general public must have a human response within 24 hours. Abuse reports from a member administrator/manager/engineer must have a human response within 2 hours.
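
The first rule in the list above (one message per two minutes per dynamic IP host) could be enforced with something like the sketch below. The class name, the in-memory dictionary and the injectable clock are assumptions for illustration; a real mail server would need persistent, shared state.

```python
import time


class PerHostThrottle:
    """Allow each host at most one accepted message per `interval` seconds."""

    def __init__(self, interval: float = 120.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.last_accepted: dict[str, float] = {}

    def allow(self, host: str) -> bool:
        now = self.clock()
        last = self.last_accepted.get(host)
        if last is not None and now - last < self.interval:
            return False  # too soon since this host's last accepted message
        self.last_accepted[host] = now
        return True


throttle = PerHostThrottle()
print(throttle.allow("192.0.2.10"))  # first message from this host
print(throttle.allow("192.0.2.10"))  # immediate second attempt
```

Rejected attempts do not reset the timer here, so a host that keeps retrying still gets one message through every two minutes. Whether retries should be penalised is a policy choice the comment leaves open.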
Re:Just use mail filters (Score:3)

by Skapare ( 16644 ) writes: on Monday July 30, 2001 @09:11AM (#2183197) Homepage

Show me one that works on my mail server without overloading it. Mail comes in at a rate of about 20 per second. It will need to check it all. If you think the problem is solved at the client, you misunderstand the problem.

Re:Spam Hunters (Score:2)

by sharkey ( 16670 ) writes:

New this fall on FOX:

Lorenzo Lamas stars in e-Renegade!

Reno Raines is back! After being forced at gunpoint to break RSA's strongest encryption while getting a blow-job, Reno is wanted by the Financial Businessmen Incorporated, the FBI, for violation of the DMCA! On the run from bought-and-paid-for law enforcement, Reno has changed his identity and now works for his Native American friend, Robbie Spamkiller.

Chasing down unlicensed spammers, Reno searches for the evidence that will clear his name, bring justice to those who "blew" his career and reputation, and let him marry Robbie's sister, Cheyenne "Shy" Phillipshead.

--
Phone #==goatse.cx (Score:2)

by sharkey ( 16670 ) writes:

(317) 872-2225

This is Customer Service for Comcast Cable in Indianapolis. I would guess it's as close as you can come on the phone.

--
The way I heard it (Score:2)

by CaptainSuperBoy ( 17170 ) writes:

The way I heard it, they trained it using pictures of tanks, and pictures that weren't tanks. Of course, the pictures of tanks were taken in broad daylight, while the control group pictures were taken later the same day, when it wasn't as bright.
Who knows if this actually happened.. It's really too bad that AI professors can't get their own material. I'm sure EVERY compsci student who took a software engineering class heard the anecdote about the computer-controlled radiation/x-ray machine, that killed a patient by giving them like 10,000 times the normal dose. This error was traced to a lack of bounds checking in software.

--
Re:Just Because they would counter it. (Score:2)

by Neon Spiral Injector ( 21234 ) writes:

Actually they are already countering it without even knowing about it.

A lot of spam I get already has a unique identifying ID included in it. I assume this is to track valid e-mail addresses of people stupid enough to try to be "removed" from their lists.

--
Re:Checksums? (Score:2)

by Neon Spiral Injector ( 21234 ) writes:

However, a number that represented how closely related an incoming email and a known spam message are would be a useful metric. Then you could have fuzzy filters that determined how close you would want to be before outright rejecting a similar message, or maybe just relocating it to a separate inbox.

Well with a CRC I guess a slightly changed message will only have a slightly different checksum. But there is a good chance that 2 dissimilar messages will have the same sum. You'd need something like a large md5 sum to make sure your false positives are low. But the problem with md5 is that changing just 1 byte changes the sum almost entirely. So there would be no fuzzy matching.
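
The MD5 behaviour described above is easy to check. The sketch below, with made-up sample messages, counts how many digest bits change when one character of the input changes. CRC-32 is included only for comparison: a CRC is linear, so a given one-bit change always produces the same difference pattern, but that pattern is not small and does not measure how similar two messages are.

```python
import hashlib
import zlib


def bit_difference(a: bytes, b: bytes) -> int:
    """Number of bit positions at which two equal-length digests differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Two hypothetical spam bodies that differ only in the last digit of a counter.
msg_1 = b"Make money fast! Offer #1000"
msg_2 = b"Make money fast! Offer #1001"

md5_diff = bit_difference(hashlib.md5(msg_1).digest(), hashlib.md5(msg_2).digest())
crc_diff = bit_difference(
    zlib.crc32(msg_1).to_bytes(4, "big"),
    zlib.crc32(msg_2).to_bytes(4, "big"),
)
print(f"MD5:    {md5_diff} of 128 bits differ")
print(f"CRC-32: {crc_diff} of 32 bits differ")
```

For a good hash, roughly half the bits are expected to flip, which is why neither checksum on its own gives a usable "distance" between messages.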

--
Re:I can't see this working (Score:2)

by x mani x ( 21412 ) writes:

that's funny, my first AI prof told us the exact same anecdote. It seems to be pretty popular in AI circles, as I've seen it on several machine learning websites as well. :)
I can't see this working (Score:5)

by x mani x ( 21412 ) writes: <mghase@cs[ ]gill.ca ['.mc' in gap]> on Monday July 30, 2001 @08:03AM (#2183206) Homepage

Checksums do not change gracefully given different inputs. As in, if there's the slightest change in a spam email, let's say the date and sendto in the email header change, the entire checksum will appear completely different. Therefore the checksums will only apply to specific spam messages, and not entire classes of similar spam emails (this would be the desirable solution). And most spam mails these days are smart enough to put your name or something in the email subject and body.

A more robust method of spam detection, IMHO, would be to develop an algorithm that would take emails, and encode them in a way that they could be input to a neural network. The output of the network would be 0=not spam/1=spam ... there's definitely enough examples out there for it to learn from. The hardest part, as usual, would be to find a way to encode the emails. So let's say you receive an email. Your client then encodes it, and sends the encoding to a local or remote server with the trained neural net. It returns with the results, and your client either dumps the email to your inbox or your spam folder.

If anyone with some machine learning experience wants to work on a project like this with me, send me an email!

Re:Relevant but somewhat off-topic question (Score:2)

by MindStalker ( 22827 ) writes:

The best thing I can find is a program called blackhole: http://freshmeat.net/projects/blackhole/
This can do a bounce back on spam saying that your user doesn't exist. This is for Linux; I couldn't find any Windows applications that could do this.
Checksums? (Score:4)

by Matt2000 ( 29624 ) writes: on Monday July 30, 2001 @07:51AM (#2183208) Homepage

This sounds like a terrible plan. As mentioned, a simple counter would blow this thing out immediately.

However, a number that represented how closely related an incoming email and a known spam message are would be a useful metric. Then you could have fuzzy filters that determined how close you would want to be before outright rejecting a similar message, or maybe just relocating it to a separate inbox.

Comment removed (Score:3)

by account_deleted ( 4530225 ) writes: on Monday July 30, 2001 @08:41AM (#2183210)

Comment removed based on user account deletion

Re:I can't see this working (Score:2)

by IIH ( 33751 ) writes:

Your client then encodes it, and sends the encoding to a local or remote server with the trained neural net. It returns with the results, and your client either dumps the email to your inbox or your spam folder
You'd also have to ensure that ISPs are clued up enough to turn off this feature for abuse@ mailboxes, for obvious reasons!
--
Re:Checksums? (Score:2)

by mpe ( 36238 ) writes:

This sounds like a terrible plan. As mentioned, a simple counter would blow this thing out immediately.

You'd effectively be forcing the spammer to send every email individually, i.e. they could no longer rely on simply feeding a relay machine a string of RCPT TO commands.
Thus spamming becomes far more difficult.
Re:Checksums? (Score:2)

by mpe ( 36238 ) writes:

I believe Ron Rivest had an idea about how to handle spam: make anyone who sends email to you perform a small computational task in order for the message to get through. The task would be something like factoring an N-bit number, with N tweaked to adjust the difficulty.
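
A minimal sketch of the "small computational task" idea. Note the swap: instead of factoring an N-bit number, this uses a hash puzzle in the style of hashcash, because it fits in a few lines and is cheap to verify. The function names, SHA-256 and the way the nonce is appended are assumptions for illustration.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest: bytes) -> int:
    """Count the zero bits at the start of a digest."""
    value = int.from_bytes(digest, "big")
    return len(digest) * 8 - value.bit_length()


def solve(message: bytes, difficulty: int) -> int:
    """Sender side: search for a nonce whose hash has `difficulty` leading zero bits."""
    for nonce in count():
        digest = hashlib.sha256(message + str(nonce).encode()).digest()
        if leading_zero_bits(digest) >= difficulty:
            return nonce


def verify(message: bytes, nonce: int, difficulty: int) -> bool:
    """Receiver side: one hash is enough to check the sender's work."""
    digest = hashlib.sha256(message + str(nonce).encode()).digest()
    return leading_zero_bits(digest) >= difficulty


msg = b"To: alice@example.com\n\nHello"
nonce = solve(msg, difficulty=12)  # about 2**12 hashes on average
print(verify(msg, nonce, difficulty=12))
```

Each extra bit of difficulty doubles the sender's expected work while the receiver's check stays at one hash, which is the asymmetry the proposal relies on.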

An alternative would be to send everything with public key encryption. Though you'd need to devise a DNS like mechanism for distributing public keys. (You also want to cut out as much relaying as possible, since a third party relay will never have access to the private key.)
Re:Relevant but somewhat off-topic question (Score:2)

by mpe ( 36238 ) writes:

Why do open relays exist? Is there some beneficial use for them that I'm not aware of?

A certain set of software requires a third party relay to work at all. It's quite possible for those setting up such relays to create an open relaying situation (especially with complex networks.)
Re:Relevant but somewhat off-topic question (Score:2)

by mpe ( 36238 ) writes:

Open relays mainly exist because of legacy. Once upon a time we needed them, because most systems weren't connected 24/7

How often was SMTP over UUCP (and the like) used anyway?

That changed once TCP/IP became the norm, but relays were still necessary for the transition phase.

MX records came into existence in the late 1980s...

Even today, there are still people whose mailboxes aren't connected 24/7 that require a relay service, though they are definitely a minority.

What they actually need is one or more (off site) secondary MX records.
Which is totally transparent to any MTA which follows the spec.

A depressing number of sites require that email come from the "correct" IP address (your From: address must have the same MX record as your IP address) which means your ISP must maintain a relay for your use, though it doesn't have to be an "open" one.
This is mixing up two things. The first is something like the DUL which requires use of an ISP provided third party relay. The second is ISP provided relays having restrictions on what they will relay based on the MAIL FROM: command.
The actual major reason ISPs provide third party relays is that software such as Netscape Communicator and Outlook Express simply won't work without one.

With most ISPs, it's easy to bypass relays and send email directly to port 25 on the target machine, so blocking open relays wouldn't help much, it would just push the problem back one step.

Actually it helps a lot. A problem with all relays is that they can be used in the mode of send one message and a list of recipients and the relay machine will do the work of sending out N copies. Remove all relays and the spammer has to actually send every message themselves.
Re:Relevant but somewhat off-topic question (Score:2)

by mpe ( 36238 ) writes:

Limit each dynamic IP host to not more than 1 email message every 2 minutes.

Requires a rather complex algorithm to work this out. Also it would cause problems with machines on a dialup running proper MTAs attempting to process their mail queue on connection.
A simpler method would be to start dropping packets at random if all (or more than a certain portion) of the traffic from an IP address consists of outgoing TCP connections to port 25.
The only thing which needs examining is IP and TCP headers.
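
The random-drop idea could look roughly like this. It is a sketch only: the 90% share, the 20-connection minimum, the 50% drop probability and the class name are all invented for illustration, and a real implementation would live in a router or firewall and age out old counts.

```python
import random
from collections import Counter


class MailOnlyHostThrottle:
    """Randomly drop SMTP connections from hosts whose traffic is almost all port 25."""

    def __init__(self, smtp_share=0.9, min_connections=20, drop_probability=0.5,
                 rng=random.random):
        self.smtp_share = smtp_share
        self.min_connections = min_connections
        self.drop_probability = drop_probability
        self.rng = rng
        self.total = Counter()  # outgoing TCP connections seen per source IP
        self.smtp = Counter()   # of those, connections to destination port 25

    def should_drop(self, src_ip: str, dst_port: int) -> bool:
        """Record one outgoing connection and decide whether to drop it."""
        self.total[src_ip] += 1
        if dst_port != 25:
            return False
        self.smtp[src_ip] += 1
        if self.total[src_ip] < self.min_connections:
            return False  # not enough history to judge this host yet
        if self.smtp[src_ip] / self.total[src_ip] < self.smtp_share:
            return False  # mixed traffic, looks like an ordinary user
        return self.rng() < self.drop_probability
```

Only the source address and destination port are consulted, which matches the point that nothing beyond IP and TCP headers needs examining.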
Re:Worms? (Score:2)

by mpe ( 36238 ) writes:

It is claimed that over 90% of spam is sent through open relays, meaning that the spammer uses multiple RCPT TO commands and sends the identical message to each recipient.

This also makes spamming "hit and run". By the time the spam starts arriving the spammer has gone.

Most spammers don't have the bandwidth that it takes to send each user a personalized message, because they are almost always on a throwaway dialup.

They also need processing power to do the personalisation, software which understands the full SMTP spec (rather than that required to get by sending to a relay) and can handle identd requests.

Only the professionals can afford to send unique messages, because they often have a DSL line and a pink contract with their ISP (which permits them to continue spamming).

They also need a frequently changing IP address...
Re:Countermeasures (Score:2)

by mpe ( 36238 ) writes:

re: Countermeasures: the spammer would integrate something random into the message that would foul identification. There is simply no way around this. So the question becomes: at what point does the countermeasure become so expensive and difficult that the spam itself reaches the point of diminishing returns?

Forcing spammers to customise each email would make spamming considerably more expensive. Because they then have to actually send each email, rather than being able to use third party relay machines to duplicate their junk.
Re:bulk-mail should be refused by default (Score:2)

by mpe ( 36238 ) writes:

What you are describing is basically a "Teergrube" (German for tar pit).

Problem is that ISP provided third party relays render this method useless...
Re:Relevant but somewhat off-topic question (Score:2)

by gorilla ( 36491 ) writes:

Yes, but this is still symptomatic of the original problem - originally SMTP servers normally acted as relays, and it's only the more recent versions which don't by default.
Re:Relevant but somewhat off-topic question (Score:3)

by gorilla ( 36491 ) writes: on Monday July 30, 2001 @08:06AM (#2183221)

They exist because up until the early 90's, almost all SMTP servers were open relays. It wasn't until spam started that the MTA authors started putting in anti-relay code, and people started installing the new versions.
Unfortunately, there are always systems where the sysadmin hasn't updated for years, because it's not causing him any problems.

Re:hmm (Score:2)

by csbruce ( 39509 ) writes:

What you really need is some generic mail-message pattern-matching and a complaint & moderation system. You don't really need automatic detection of spam, since there would probably be plenty of people willing to complain if there was an effective place to complain to, and if mail clients as well as mail servers could consult the spam-detection service to eliminate confirmed spam before it reaches your eyeballs.
Re:Issues... (Score:4)

by Flounder ( 42112 ) writes: on Monday July 30, 2001 @07:52AM (#2183223)

I submitted a story about building a steam-powered microprocessor with RAM made out of banana peels, and that didn't get posted--why this?
Because everybody knows that Orange rinds offer better memory density than banana peels. And orange peels are more resistant to the excess steam from the CPU. Banana peels would just disintegrate with even a minimal amount of overclocking.

Re:False Positives (Score:2)

by Blrfl ( 46596 ) writes:

Matthias Wiesmann writes:
While the system could be broken by using counters, this could be countered by parsing only a certain portion of the mail or counting the frequency of certain words. Would work very well on pure text spam, but not on attachment stuff.
Actually, that technique works reasonably well.
I used to administer the trouble ticket system for a very large ISP that got so many complaints that they became unmanageable. (Not all their fault, but that's another story.) Anyway, we had software that would take the bodies of the emails being complained about, remove whitespace and anything that wasn't in the dictionary, sort it, uniq it and generate an MD5 of the list of words that came out. I never studied it over the long haul, but tests on live data showed a match rate of about 90%.
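
The pipeline described above (keep dictionary words, sort, uniq, MD5) can be sketched in a few lines of Python. The tiny word list and the function name are stand-ins; the original software is not shown in the comment, and a real system would load a full dictionary file.

```python
import hashlib
import re

# Stand-in word list for illustration; not the ISP's actual dictionary.
DICTIONARY = {"buy", "cheap", "limited", "now", "offer", "pills", "today"}


def word_set_digest(body: str, dictionary: set[str] = DICTIONARY) -> str:
    """MD5 of the sorted, de-duplicated dictionary words found in the body."""
    words = re.findall(r"[a-z]+", body.lower())
    kept = sorted({w for w in words if w in dictionary})  # sort | uniq
    return hashlib.md5(" ".join(kept).encode("ascii")).hexdigest()


# Same pitch, different counter, casing, punctuation and random filler.
a = "Buy CHEAP pills now!!!  Offer #48213 xqzv"
b = "buy cheap pills NOW - offer #99317 kjwq"
print(word_set_digest(a) == word_set_digest(b))
```

Counters and random gibberish are not dictionary words, so they never reach the hash. The scheme still breaks if the spammer pads messages with varying real words.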
The real flaw in DCC is that it doesn't protect early recipients of the spam, because it won't have built up enough hits to be considered bulky. The only way to make it work would be to submit the checksum and hold the letter for some amount of time to see how bulky it gets. Most people would probably not like the lag time they'd get on legitimate mail.
Re:Cell phones are great (Score:2)

by QuoteMstr ( 55051 ) writes:

What is the phone equivalent of goatse.cx?
Forget comparing spam! How about universal naming? (Score:2)

by Myself ( 57572 ) writes:

I, for one, am always pissed off when I spend hours on my dialup leeching pr0n from some newsgroup, only to discover that I already had it on my drive under a different name. Somewhere along the line, somebody renamed the series.

A database of image characteristics (like those used by D'peg! [somewareonthe.net]) would make this less likely. People would be discouraged from changing the file's originally agreed-upon universal name.

Publishers could upload their image characteristics into the database, along with a tag like "Originally from somepornsite.com". So if I someday come across an image I really like, I could check the database and see where to get the rest of the series. This would supersede obnoxious watermarking to indicate the source of an image.

This could of course be used for mp3's too, which are all-too-often renamed incorrectly. Checksums would be enough for a particular song encoded by a particular encoder with particular parameters, but audio fingerprinting would be necessary to accommodate different encoders. I don't think that's a deal-killer.

By the way, D'peg! is really neat, but it's amazingly slow the first time if you have a lot of images. (As in: My win98 uptime record is 11 days. Dpeg's projected completion time was 34. Good thing it can resume after a crash.)
Re:Just use mail filters (Score:2)

by yellowstone ( 62484 ) writes:

Thus spake Skapare

Show me one that works on my mail server without overloading it.

Well, simple mail filters aren't going to overload your mail server any more than computing a checksum on each piece of email, and then querying some database to see if it matches the checksum for known spam.
Plus, mail filters have the benefit of not breaking in the face of a trivial change to the body (like a counter).

-- I have no fin no wing no stinger no claw no camouflage I have no more to say...
Just use mail filters (Score:3)

by yellowstone ( 62484 ) writes: on Monday July 30, 2001 @08:11AM (#2183233) Homepage Journal
I've found that a handful of simple mail filters takes care of much of the spam I receive:
- Junk anything that comes BCC (preceded by a white-list of subscribed mailing lists). This takes care of 70-80% of the spam that comes my way.
- Filter out by keywords in the subject (like "marketing", "webmaster", and "viagra"). This takes care of a good chunk of the rest.
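
The two rules above can be sketched with Python's standard `email` module. The address, the white-listed sender and the idea of recognising a mailing list by its From: header are assumptions for illustration; the keywords are the ones named in the comment.

```python
from email import message_from_string
from email.utils import getaddresses

MY_ADDRESS = "me@example.com"              # hypothetical recipient
LIST_SENDERS = {"list@lists.example.org"}  # hypothetical white-list
SUBJECT_KEYWORDS = ("marketing", "webmaster", "viagra")


def is_junk(raw: str) -> bool:
    msg = message_from_string(raw)
    senders = {addr.lower() for _, addr in getaddresses(msg.get_all("From", []))}
    if senders & LIST_SENDERS:
        return False  # white-listed mailing list, checked before the BCC rule
    visible = msg.get_all("To", []) + msg.get_all("Cc", [])
    recipients = {addr.lower() for _, addr in getaddresses(visible)}
    if MY_ADDRESS not in recipients:
        return True  # not named in To/Cc, so this copy arrived via BCC
    subject = (msg.get("Subject") or "").lower()
    return any(word in subject for word in SUBJECT_KEYWORDS)


sample = "From: x@spam.example\nTo: undisclosed@example.net\nSubject: hi\n\nbody"
print(is_junk(sample))
```

The order matters: the white-list runs first because list traffic is also delivered without the subscriber's address in To: or Cc:.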
-- I have no fin no wing no stinger no claw no camouflage I have no more to say...
Filter messages before checksumming (Score:2)

by hamjudo ( 64140 ) writes:
Checksumming the raw message isn't of much value. It's an arms race. We'll have to have a way of dynamically updating filters.
In addition to the raw message checksum, possible filters include:
- checksum paragraphs individually
- ignore whitespace, punctuation and capitalization.
- drop HTML tags
- drop numbers
- drop all non-dictionary words.
Then analyze what gets by and add new filters as appropriate.
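
Several of the filters listed above can be chained before checksumming. The sketch below covers HTML tags, numbers, punctuation, capitalization and whitespace, then checksums each paragraph; the dictionary filter is left out because it needs a word list. Function names, the regular expressions and the choice of MD5 are assumptions.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Apply the cheap filters: tags, numbers, punctuation, case, whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)    # drop HTML tags
    text = re.sub(r"\d+", " ", text)        # drop numbers
    text = re.sub(r"[^\w\s]|_", " ", text)  # drop punctuation
    return " ".join(text.lower().split())   # fold case, collapse whitespace


def paragraph_checksums(body: str) -> list[str]:
    """Checksum each blank-line-separated paragraph after normalization."""
    paragraphs = re.split(r"\n\s*\n", body)
    normalized = (normalize(p) for p in paragraphs)
    return [hashlib.md5(n.encode("utf-8")).hexdigest() for n in normalized if n]


a = "<b>FREE</b> offer!!! Call 555-1234 now.\n\nUnsubscribe: ref 8841"
b = "free   OFFER, call 555-9876 NOW\n\nunsubscribe ref 17"
print(paragraph_checksums(a) == paragraph_checksums(b))
```

Because each filter is a separate step, adding a new one as spam evolves means adding one line to `normalize`, which is the dynamic-update property the comment asks for.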
Re:The problem is... (Score:2)

by jrennie ( 79374 ) writes:

Naive Bayes is a damn good text classifier that has already proven to be a good spam identifier. The problem is that no such automated classifier system will ever be able to get rid of most spam without throwing away a few non-spam messages too. It's a fact of life.
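
For readers who have not seen it, a bare-bones Naive Bayes text classifier looks roughly like this. It is a teaching sketch with invented training data and a hypothetical class name, not the classifier behind the link below, and it needs at least one training example per label before it can score anything.

```python
import math
from collections import Counter


class NaiveBayes:
    """Multinomial Naive Bayes over whitespace-separated words."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()

    def train(self, text: str, label: str) -> None:
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def log_score(self, text: str, label: str) -> float:
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        counts = self.word_counts[label]
        total = sum(counts.values())
        # Start from the log prior: the fraction of training mail with this label.
        score = math.log(self.doc_counts[label] / sum(self.doc_counts.values()))
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the probability.
            score += math.log((counts[word] + 1) / (total + len(vocab)))
        return score

    def classify(self, text: str) -> str:
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


nb = NaiveBayes()
nb.train("buy cheap pills now", "spam")
nb.train("cheap pills offer", "spam")
nb.train("lunch meeting tomorrow", "ham")
nb.train("see you at lunch", "ham")
print(nb.classify("cheap pills"), nb.classify("lunch tomorrow"))
```

The classifier always returns its best guess, so legitimate mail that happens to use spammy words gets misfiled. That is the false-positive trade-off described above.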

Btw, check out

http://www.picante.com/~gtaylor/spam/

to read about someone's efforts to get rid of spam via a slew of techniques, including an automated classification system (Naive Bayes).

Jason
Re:Cell phones are great (Score:2)

by spectro ( 80839 ) writes:

This would work great if "caller pays" like cellphones work in South America. Here, however, they would suck all your minutes.

---
man diff (Score:2)

by mrogers ( 85392 ) writes:

For plain text, you could just measure the length of the diff between the two messages. A simple counter would only change one line of the message.
diff message1 message2 | wc -l
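
The same measurement can be taken inside a filter without shelling out, using Python's standard `difflib`. The function name and sample messages are illustrative. The count is not identical to what `diff | wc -l` prints, since plain diff output also includes hunk headers and separator lines.

```python
import difflib


def changed_lines(message_1: str, message_2: str) -> int:
    """Count lines that differ between two messages (removals plus additions)."""
    diff = difflib.ndiff(message_1.splitlines(), message_2.splitlines())
    return sum(1 for line in diff if line.startswith(("+ ", "- ")))


spam_1 = "Dear friend,\nMake money fast!\nref: 1001\n"
spam_2 = "Dear friend,\nMake money fast!\nref: 1002\n"
print(changed_lines(spam_1, spam_2))
```

A counter on one line shows up as one removed line plus one added line, so a small threshold relative to the message length would catch it.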

--
Re:Why go through that much trouble to detect SPAM (Score:2)

by Ryu2 ( 89645 ) writes:

Check out the services of spamcop.net. It lets you submit spam mail, extracts the IPs from the header, discarding the bogus ones, allows you to automatically send a note to the abuse department of the offending ISP, and tells you exactly how many people have submitted the same message, and how many times that ISP has been responsible for messages that generated a SpamCop complaint. Very cool.
Re:Hashed bigrams count (Score:2)

by jmv ( 93421 ) writes:

certain histogram patterns would be common in non-spam email messages

There is no such thing as a "common histogram". They will all be different. However, two identical messages will have identical histograms. Two almost identical messages will have almost identical histograms (while two almost identical messages usually have very different checksums).

The reverse is usually true (of course, there's no absolute guarantee): two almost identical histograms are very likely to come from two almost identical messages. The more you increase N (the bound for the hash result and size of the histogram), the more accurate the result. Also, using trigrams would likely be more accurate.

While it is possible for spammers to vary their messages, they cannot send thousands of messages that are really different one from the other and this is why this technique should work almost all the time. Of course, you'd need to get rid of headers and any html tags and garbage before computing the histograms.
Re:Hashed bigrams count (Score:2)

by jmv ( 93421 ) writes:

They could look at the histogram of a bunch of regular emails and just send the spam messages whose histograms are close to a lot of the histograms of the regular emails. This assumes that spammers would have access to the hash function though.

Once again, you're assuming there is such a thing as a "normal histogram". Remember that we're not checking whether the histogram is normal or not. We're checking to see if this particular histogram (from a spam e-mail) has been seen more than x times before. Even if they manage to get a piece of spam to match the exact same histogram as a valid e-mail, the piece of spam will still be rejected, with the unfortunate side effect that the valid message might be rejected too (but since they cannot read your mail, they cannot get one of your e-mails rejected).

As for the CPU time, sure you don't want to make N too large...
Re:Hashed bigrams count (Score:2)

by jmv ( 93421 ) writes:

what similarity function would you use?

Manhattan distance, aka L1 norm of the difference.

And the reason I said it should work is that I have already tried that a while ago for a slightly different task. The only thing I'm not too sure about is CPU time.

As for histogram randomness, even if the N-dimensional (N ~ 1000) vectors (histograms) don't have a uniform distribution in the 1000-D space, you'd have to be very unlucky to get the same (or approximately the same) value for all of the 1000 bins.
Hashed bigrams count (Score:5)

by jmv ( 93421 ) writes: on Monday July 30, 2001 @08:34AM (#2183254) Homepage

One way that would be much more effective is to take pairs of words (e.g. in this sentence: "One way", "way that", "that would", ...) and apply a hash function that returns a number between 0 and N (N usually between 1000 and 100000). You then compare the histogram (how many of each hash value) of a mail to the database. If the histogram is too close to that of a known spam message, you delete it.
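
A minimal sketch of the hashed-bigram histogram, compared with the Manhattan (L1) distance mentioned elsewhere in this thread. CRC-32 stands in for the unspecified hash function because Python's built-in `hash()` is randomized between runs; the function names, N = 1000 and the sample messages are assumptions.

```python
import zlib
from collections import Counter


def bigram_histogram(text: str, n_bins: int = 1000) -> Counter:
    """Hash each pair of adjacent words into one of `n_bins` buckets and count them."""
    words = text.lower().split()
    pairs = (f"{a} {b}" for a, b in zip(words, words[1:]))
    return Counter(zlib.crc32(p.encode("utf-8")) % n_bins for p in pairs)


def manhattan_distance(h1: Counter, h2: Counter) -> int:
    """L1 norm of the difference between two histograms."""
    return sum(abs(h1[k] - h2[k]) for k in h1.keys() | h2.keys())


spam_a = bigram_histogram("make money fast from home with this one simple trick id 1001")
spam_b = bigram_histogram("make money fast from home with this one simple trick id 2002")
other = bigram_histogram("see you at lunch tomorrow")

print(manhattan_distance(spam_a, spam_b))  # near-duplicates: small distance
print(manhattan_distance(spam_a, other))   # unrelated mail: larger distance
```

Changing one word disturbs at most two bigrams, so the distance between near-duplicates stays small no matter how long the message is. Headers and HTML would have to be stripped first, as noted in the replies.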

Re:Relevant but somewhat off-topic question (Score:2)

by TheCarp ( 96830 ) writes:

Furthermore, it was considered good "netiquette" to have your relays be open to the world. It simplified things. MTA gets a message, it sees that it's not a local delivery, so it is nice and tries to forward it to the right place.

Who ever thought people would ABUSE this sort of stuff?

Hell, at one point it was an accepted practice of being a good net citizen to have guest accounts on your machines too.

These are, of course, all legacy attitudes. Sorry to see them go, of course. Would be great to live in that world, wouldn't it?

-Steve
Re:Checksums? (Score:2)

by 4of12 ( 97621 ) writes:

Pretty impressive procedure!

I would have thought that going to the next level of spam filtering would require shoving messages of dubious origin into some delayed-delivery hopper that would be scrutinized carefully against the results of incoming messages from throw-away spam-gathering accounts on other machines.

Your system of historical analysis makes it possible to defer the date when we will be forced to resort to multi-account inbox comparisons to filter out spam.
Re:My Life as a Spammer (Score:2)

by crucini ( 98210 ) writes:

I think you want Behind Enemy Lines [freewebsites.com].
The checksum is fuzzy (Score:5)

by crucini ( 98210 ) writes: on Monday July 30, 2001 @01:16PM (#2183259)

Many posters seem to be naively assuming that dcc uses a checksum such as md5 which would change radically for a minor change in input. Dcc does in fact use md5 as a component but the actual checksum is adapted to the requirement.
Download the source tarball [rhyolite.com], uncompress, untar and read /dcclib/ckfuz1.c. This checksum is clearly designed to be resilient to minor changes.
On a deeper note, it's sad that so many Slashdot readers, including apparently CmdrTaco, underestimate others so severely. Do you really think someone put in the effort to make something like dcc and never thought about how a message could be varied to evade the checksum? And why not read the linked document first? You would have found:

Because simplistic checksums of spam would not be very effective, the main DCC checksum is fuzzy and ignores various aspects of messages. The fuzzy checksum will need to be changed as spam evolves.
Summary: read before you criticize, and recognize that others probably thought the same thing you're thinking.
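
For anyone who wants to see the "changes radically" behaviour of a plain md5 for themselves, here is a small illustration. It is not DCC's fuzzy checksum, only a demonstration of why a bare cryptographic hash would be a poor fit; the helper name and the sample strings are made up.

```python
import hashlib


def md5_bits_changed(a, b):
    """Number of differing bits between the MD5 digests of two strings."""
    da = int.from_bytes(hashlib.md5(a.encode("utf-8")).digest(), "big")
    db = int.from_bytes(hashlib.md5(b.encode("utf-8")).digest(), "big")
    return bin(da ^ db).count("1")


original = "Make money fast! Reply today."
tweaked = original + " 0001"          # the same message with a counter appended
print(md5_bits_changed(original, tweaked), "of 128 bits differ")
```

On average about half of the 128 bits flip for any change to the input, which is why a counter defeats a naive checksum and why a fuzzy one ignores the parts of a message that are cheap to vary.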

hmm (Score:3)

by Troed ( 102527 ) writes: on Monday July 30, 2001 @07:49AM (#2183260) Homepage Journal

This system already exists on news-servers and clients, and the spammers have already countered with random data appended to the spam (and random numbers in the subject headers)
So ...

What's the big deal? (Score:2)

by Mr. Sketch ( 111112 ) writes:

I haven't figured out why the online community is so uptight about getting unsolicited e-mails and having companies selling out their e-mail addresses to people. About 80% of the mail I get at my house is unsolicited and 95% of the phone calls I get are salesmen. How did they get my number/address? Most likely the phone company (or credit card company) sold it to them and this is a very common practice. I guess I just don't see what the big deal is when e-mail is so much easier to delete/avoid than unsolicited real mail and phone calls.

After all, e-mail is checked when I want to check it and when I see any subject asking me what the state of my sexual arousal is or offering me a university diploma or just something from 348djkea23@yahoo.com I know I can easily delete it. It's not like a phone call where I don't know who's calling me and I kind of have to answer it right then. I do have caller id, but that's an additional service I have to pay for and most of my friends are out of state so they show up as 'unavailable' along with all the other salesmen.

For unsolicited mail, I have to handle it no matter what, I can't just leave it in my mail box forever. But with e-mail I never really have to see it and I can delete it without having to ever give it a second thought and it's gone gone and not just taking up space in my trash can or recycle bin.

Perhaps someone here can enlighten me.

p.s. I'm sure I have more to say on this topic, but I really need to be getting back to work :).

--BEGIN SIG BLOCK--
I'd rather be trolling for goatse.cx [slashdot.org].
Re:Cell phones are great (Score:5)

by zulux ( 112259 ) writes: on Monday July 30, 2001 @08:07AM (#2183267) Homepage Journal

Just leave a message, and tell them your phone number is one of those Bahama-$20-a-second numbers. Wheee!

Check out http://www.scambusters.org/809Scam.html if you don't know what I'm talking about.

randomised strings (Score:2)

by 13013dobbs ( 113910 ) writes:

Most spammers use some sort of random character string in both the subject and body to get around filters that look for identical messages being sent to the same system. I don't think checksums are going to do any better than the current filters that look for dupes. Sure, you could just look at the first N lines, but spammers are also inserting invalid HTML tags in their messages to foil pattern matching. Since the tags are invalid, people don't see them (considering that most people use some sort of HTML-enabled mail reader).
Add invalid HTML tags (Score:2)

by 13013dobbs ( 113910 ) writes:

All a spammer would have to do is add invalid HTML tags all over his/her spam. Most users use some sort of HTML-based mail reader and the invalid tags would not show. Look at the HTML source of this post to see for yourself. They can even put the tags in the middle of words, to be an even bigger bastard/bitch.
Re:"Pretty close" checksums? (Score:2)

by 13013dobbs ( 113910 ) writes:

I have already posted a way to get around that. Look here [slashdot.org]. For the goatsecx paranoid here is the link to cut and paste:
http://slashdot.org/comments.pl?sid=01/07/30/1444247&cid=48
Re:Add invalid HTML tags (Score:2)

by 13013dobbs ( 113910 ) writes:

Please read what I said again. Checking the entire message would be useless due to the fact that there may be hundreds of random invalid HTML tags in the message. These tags would still show up in the message, but would be ignored by the mail reader. The tags would still be visible to the MTA.
Re:Add invalid HTML tags (Score:2)

by 13013dobbs ( 113910 ) writes:

Sounds good, but what kind of processing power are you going to need to do all that? If you had a hundred or so users, it may not be that bad, but for large ISPs, it might be horrible.
Re:Checksums? (Score:3)

by friscolr ( 124774 ) writes: on Monday July 30, 2001 @09:15AM (#2183278) Homepage

However, a number that represented how closely related an incoming email and a known spam message were would be a useful metric. Then you could have fuzzy filters
i tried that, had very good success. read more about it at:
http://www.blackant.net/code/oth/random/nlp-spamfilter.php [blackant.net]
i collected a sample of 30-plus spam messages as well as 30-plus not spam messages and ran some word and phrase frequency counts on each group, then threw that data into a couple mysql tables. Next i match the phrase and word frequency counts to new mail that arrives, and depending on how closely the new mail matches the known groups, i can tell whether or not the mail is spam.
by tweaking the exact amount needed to be determined as spam or not-spam, i had a very, very good success rate - out of 32 messages checked using this method, all were appropriately identified as either spam or not-spam.
I've been meaning to continue with this line of spam detection, increasing the size of the db and testing it on a larger sample of mail (read: all my mail) and then seeing if the results were still as good, but...
-f
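
A toy version of the word-frequency idea might look like the following. This is not the linked code: it skips phrase counts and the mysql tables, uses tiny invented samples, and the scoring rule (spam frequency minus not-spam frequency, summed over the words of the message) is just one simple assumption about how "how closely the new mail matches" could be measured.

```python
from collections import Counter


def word_freqs(messages):
    """Relative frequency of each lowercase word across a list of messages."""
    counts = Counter(w for m in messages for w in m.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def spam_score(message, spam_freqs, ham_freqs):
    """Positive when the words look more like the spam sample than the other one."""
    words = message.lower().split()
    return sum(spam_freqs.get(w, 0.0) - ham_freqs.get(w, 0.0) for w in words)


spam_sample = ["make money fast", "free money click here", "click here for free diploma"]
ham_sample = ["meeting moved to friday", "patch attached please review", "lunch on friday"]

spam_freqs = word_freqs(spam_sample)
ham_freqs = word_freqs(ham_sample)

THRESHOLD = 0.0  # hypothetical cut-off; the post describes tuning this by hand
print(spam_score("free money here", spam_freqs, ham_freqs) > THRESHOLD)       # True
print(spam_score("please review friday", spam_freqs, ham_freqs) > THRESHOLD)  # False
```

With samples this small the result says nothing about real-world accuracy; it only shows the shape of the approach.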

Re:Personalised spam (Score:2)

by Grab ( 126025 ) writes:

You could fix that by checksumming individual paragraphs. If more than 95% of an email's paragraphs match the checksums of a known spam, it can safely be rejected. This will require more storage, but the processing time won't be significantly longer (the longest time is calculating the checksums, which will take the same time for individual paragraphs as for the whole message, since it's a per-character time).

You could even improve this when you've received several of the same by cross-comparing them and working out which paragraphs change and which stay the same. You could then combine the individual paragraph checksums into a single checksum, and only check that part of the message - that'll save on storage of lots of checksums.

The only trouble I can see is when this is one of those three-line ones that just says "Feeling horny? Go to here for XXX" or whatever. If those added some destination-specific heading, it would be difficult to set the filter tolerances tight enough so that genuine emails with one or two sentences that match don't get filtered.

Grab.
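
A minimal sketch of the per-paragraph idea, assuming paragraphs are separated by blank lines and using md5 purely as a convenient checksum; the function names and the sample messages are invented for illustration.

```python
import hashlib


def paragraph_sums(message):
    """MD5 hex digest of each non-empty paragraph (split on blank lines)."""
    paragraphs = [p.strip() for p in message.split("\n\n") if p.strip()]
    return [hashlib.md5(p.encode("utf-8")).hexdigest() for p in paragraphs]


def matching_fraction(message, known_spam_sums):
    """Share of the message's paragraphs whose checksum is already known."""
    sums = paragraph_sums(message)
    if not sums:
        return 0.0
    return sum(s in known_spam_sums for s in sums) / len(sums)


spam = "Dear friend,\n\nEarn cash at home.\n\nNo experience needed.\n\nReply now."
personalised = spam.replace("Dear friend,", "Dear Bob,")

known = set(paragraph_sums(spam))
print(matching_fraction(personalised, known))   # 0.75
```

The example also shows the short-message problem raised above: with only four paragraphs, a single personalised line already drops the match to 75%, well under a 95% threshold.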
Countering Counters (Score:2)

by R.Caley ( 126968 ) writes:

To avoid the problem of trivial changes to the message one would need to check the bits of the message they don't have control of. The middle bit of the Received: list would seem like a candidate.
Eg if we assume that much of the spam problem is from open relays, then recognising that >N% of local users have gotten a message mailed through a given relay may be enough to flag it suspicious.
Doesn't help the mailing list problem of course.
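
As a sketch of the relay check, assuming something else has already pulled (relay, recipient) pairs out of the Received: headers, and simplifying "a message" to "any mail seen in some time window". The function name, hostnames and the 50% threshold are placeholders.

```python
from collections import defaultdict


def suspicious_relays(deliveries, local_users, threshold=0.5):
    """Relays that have delivered mail to more than `threshold` of local users.

    `deliveries` is an iterable of (relay_host, local_recipient) pairs,
    assumed to have been extracted from Received: headers elsewhere.
    """
    reached = defaultdict(set)
    for relay, recipient in deliveries:
        if recipient in local_users:
            reached[relay].add(recipient)
    return {relay for relay, users in reached.items()
            if len(users) / len(local_users) > threshold}


local_users = {"ann", "bob", "cy", "dee"}
deliveries = [
    ("open.example.net", "ann"),
    ("open.example.net", "bob"),
    ("open.example.net", "cy"),
    ("mail.example.org", "ann"),
]
print(sorted(suspicious_relays(deliveries, local_users)))   # ['open.example.net']
```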
I think the best anti-spam measure is simply to divide email into high quality and low quality lists based on the sender and have the user say which senders should be treated as high quality in future. If people you sent mail to were added to the high quality list by default that would take much of the work out of it. Since this way you are trying to pick out good stuff rather than remove spam, it is harder to counter.
Add to that a magic word system. Messages with the magic word in the subject are tagged as high quality. Then you can give people you really want to hear from the magic word along with your email address. Change the word regularly and old information won't come back to spam you.
_O_
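
The high/low quality split with a magic word could be sketched like this. The function names, addresses and the word "plugh" are all placeholders, and a real filter would need somewhere to persist the trusted list.

```python
def classify(sender, subject, trusted_senders, magic_word):
    """Return "high" or "low" quality for an incoming message."""
    if sender.lower() in trusted_senders:
        return "high"
    if magic_word.lower() in subject.lower().split():
        return "high"
    return "low"


def note_outgoing(recipient, trusted_senders):
    """Anyone the user writes to is treated as high quality by default."""
    trusted_senders.add(recipient.lower())


trusted = set()
note_outgoing("alice@example.org", trusted)

print(classify("Alice@example.org", "hi", trusted, "plugh"))                      # high
print(classify("stranger@example.com", "re: plugh question", trusted, "plugh"))   # high
print(classify("stranger@example.com", "MAKE MONEY FAST", trusted, "plugh"))      # low
```

Changing the magic word is just a matter of passing a different string, so stale copies of the old word stop working on their own.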
spambouncer works great for me (Score:3)

by misleb ( 129952 ) writes: on Monday July 30, 2001 @10:22AM (#2183283)

I am running the Spambouncer [spambouncer.org] procmail filter on my shell/IMAP account. I used to get 10 SPAMS a day. Now I don't get ANY. It's pretty intelligent.
I guess this doesn't solve the problem of server resources getting stolen, but it certainly saves me from having to look at the crap.
-matthew

the ultimate spam filter (Score:2)

by aozilla ( 133143 ) writes:

Someone needs to collect all these ideas together and make a nice pluggable framework for it. I'm not sure how it does it, but hotmail's spam filter has stopped 100% of my spam so far, with no false positives. If they can do it, so can we.
Just Because they would counter it. (Score:5)

by BiggestPOS ( 139071 ) writes: on Monday July 30, 2001 @07:51AM (#2183287) Homepage

Doesn't mean we shouldn't do it. It's an arms race, with each side consistently and constantly upping the ante. We really need to send the spammers a message that we DO still care.
One thing bothers me though. As I was clearing out a large 'stuck' email for one of our dial-up customers the other day, I happened to casually mention "Wow, you sure do get a lot of spam!" to which they replied "What's that?" "You know, junk email." "Junk e-mail? I read it all." People like that are why our boxes receive such garbage. You fire enough bullets and SOMEone is going to die.

Re:hmm (Score:2)

by peccary ( 161168 ) writes:

That one is easy. (\w\W){5,99}
Or something like that, depending on what you use for filtering news and email. For me, it's got to be GNUS Score files and Procmail.
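
For those not using GNUS or Procmail, the same pattern can be tried out in Python's re module. The wrapper function is invented here; the regex is the one from the post.

```python
import re

# Five or more consecutive (word character, non-word character) pairs,
# which is what punctuation sprinkled between letters looks like.
OBFUSCATED = re.compile(r"(\w\W){5,99}")


def looks_obfuscated(line):
    """True if the line contains a run of alternating word/non-word characters."""
    return OBFUSCATED.search(line) is not None


print(looks_obfuscated("M`A.K,E M:O'N\"E,Y F.A`S'T"))   # True
print(looks_obfuscated("Make money fast"))              # False
```

It will also fire on legitimate text such as a row of single letters separated by spaces, so it probably works better as a scoring rule than as an outright kill rule.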
Duh... (Score:5)

by ErikTheRed ( 162431 ) writes: on Monday July 30, 2001 @09:57AM (#2183293) Homepage

All you have to do is filter on the words "This e-mail is not spam!"

Leave it to the Slashdot crowd to make things a million times more complex than they need to be...

Surprisingly, that can work! (Score:2)

by WolfWithoutAClause ( 162946 ) writes:

The big issue is counters and other subtle changes to the emails that would destroy a naive checksum.

However, multiple checksums of subsets of the email would not usually all be changed by one or a few changes or counters, so they would still be sufficiently discriminating to screen emails and could do a very good job of detecting any widespread junk email.

It would be difficult for a spammer to make every substring of a particular length (say 20 characters) different enough that ALL of the checksums differ between copies of the junk email.

(Checksums that checksum all the strings for a particular length are not difficult to generate as a matter of fact; little more than a circular buffer is required.)
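
Here is one way the "checksum every string of a given length" idea could be sketched, using a polynomial rolling hash (similar in spirit to the rolling checksum in rsync) and a deque as the circular buffer. The base, modulus, function name and sample messages are all choices made for this example.

```python
from collections import deque

BASE = 257
MOD = (1 << 61) - 1


def window_checksums(text, width=20):
    """Set of rolling-hash values, one per `width`-byte window of the text."""
    data = text.encode("utf-8")
    top = pow(BASE, width - 1, MOD)   # weight of the byte about to leave the window
    window = deque()
    h = 0
    sums = set()
    for byte in data:
        if len(window) == width:
            h = (h - window.popleft() * top) % MOD   # drop the oldest byte
        h = (h * BASE + byte) % MOD                  # shift in the new byte
        window.append(byte)
        if len(window) == width:
            sums.add(h)
    return sums


spam = "Earn money at home today with our amazing offer, click here now!"
tagged = spam + " xyzzygx"            # same spam with a random counter appended
letter = "Hi Bob, are we still meeting for lunch on Friday at noon?"

known = window_checksums(spam)
print(len(known & window_checksums(tagged)) / len(known))   # 1.0
print(len(known & window_checksums(letter)))
```

Because every window of the original still appears in the tagged copy, the appended counter changes nothing about the match. A counter inserted mid-message would break only the windows that straddle it.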
Re:"Pretty close" checksums? (Score:2)

by WolfWithoutAClause ( 162946 ) writes:

>Aren't there algorithms that will report messages that are pretty close?

Yeah, there are. 'Rdist' does this as a way of trying to only send the minimum set of changes necessary to keep two ftp/web sites synchronised.

Actually to be precise, the checksum isn't imprecise, as rdist relies on checksums of subsets of the documents they are trying to synchronise.
This neatly sidesteps the counter issue...
Re:"Pretty close" checksums? (Score:2)

by WolfWithoutAClause ( 162946 ) writes:

Impossible? I don't think so. All you have to do is each time somebody receives a junk email they mark it as junk email, the mail software can calculate one checksum starting at a random place in the file, and upload it to a checksum server. For any frequently received junk email the server will fairly quickly get enough checksums that the whole document will be covered.

When anybody receives an email, they can check a handful of random checksums against the checksum server, if enough of them match, then do a few more to be sure and deal with the email according to any settings by the user.

Still, there are issues. What happens if the email marketeers start appending random web pages to their email to dilute it down? What percentage of similarity is enough? There are some fixes- I think to be successful junk mail has to be fairly short- people rarely page down to cut to the chase; but adjusting the checksum points to emphasise the beginning and end of the email is probably a good thing.
Re:hmm (Score:2)

by andyh1978 ( 173377 ) writes:

Couple that with a clause in the ISPs contract that allows them to assess significant fines against spammers

The ISP I use for my website has such a clause:

19. You will not use the Service to send unsolicited commercial messages, Unsolicited Junk Messages, SPAM or any other bulk message to a recipient who has not expressly requested to receive that message. This shall apply to messages sent via electronic mail, USENET news postings or any other medium which may be intrusive. If you breach this Condition of Use you agree that you will pay us compensation of no less than one thousand pounds sterling plus interest at 8% above the base lending rate of the Bank of England at the date you breach this condition from that date. You agree that you will pay this compensation in respect of each recipient address of each message sent in breach of this Condition. You agree that you will not run an "open mail relay" on any computer system connected to the service. You will not seek to use the facilities offered as part of your account to run an email service using our equipment.

A grand per message. Nice.
Issues... (Score:2)

by bribecka ( 176328 ) writes:

If there are so many issues with this, and it seems like an idea that probably won't work, why is this posted?
I submitted a story about building a steam-powered microprocessor with RAM made out of banana peels, and that didn't get posted--why this?
Re:Spam Hunters (Score:2)

by Alien54 ( 180860 ) writes:

They're just a bigger version of the mafia, and the Don requires his tithe for you to do business on his turf.
Well there is this classic [segfault.org] from a couple years ago on Segfault [segfault.org]:
Mafia Don Announces New Anti-Spam Venture Posted on Fri 02 Apr 19:25:26 1999 PST
As the NSA and FBI fear, traditional crime organizations have been incorporating high-tech communication into their organizations. Although Janet Reno was quoted stating "This is law enforcement's worst nightmare.", techies around the world are sure to be pleased with one New York Syndicate's new venture.

It all started when Don Dominiqi signed onto his AOL account last Monday morning. His inbox was filled with "Make Money Fast", "Viagra On-Line", and "Teenybopper Web Sex" ads. Lost amidst the drivel was an important note detailing a non-taxed shipment of Marlboros, which were later confiscated by the BATF. Little did he know, as he shouted "Bring me the left hand of this f*cking gutterslime!" what would become of it all.

Later that same day, Billy "Run!" Brutekowski and Larry "My Eyes!" Plucker cornered the pasty-faced offender of the Family in a small cyber cafe in Greenwich Village. "This was by far the creepiest place the Boss has ever sent us." stated Billy, who only spoke on condition of anonymity. "Everyone in this place looked pale and sickly, like they had already been 'spoken to'. We asked for this punk, and several people quickly pointed him out. Most of the scum we find in gin joints aren't so quick to finger one of their own," Billy continued.

"He must not watch much TV, because this sh*t didn't even flinch when we came to the corner he was hiding in," Larry proceeded to relate. "We dropped this sheet of paper the Boss had given us on his table and he says 'So you guys want to make money fast, eh?' He puts out his hand and says to give him $20. This scrawny little dirtball tells me to give him $20!" Larry was quite agitated at this part in his story, and his description of how Sammy Spammer's hand fell off was quite garbled.

Billy continued, "Up till now, this was a routine visit. We was just being playful. The weird sh*t began when we tried to leave." "This pimply faced kid blocks the door as we try to leave, and I'm thinking to myself 'Great, a f*cking Karate Kid hero.' He just stands there, and then he hands me a $5 bill." Billy pulls out the $5, and holds it like it is his first quarter from his favorite grandmother. "They lined up after that, and we had $175 in 'tips' when we left the joint."

Later that day the Don himself visited the café, unwilling to believe the story. Although the details are unclear, sources at the café indicate that the Don has hired them to build and host a new Anti-Spam site. Through a SSL transaction system, the site will accept spam complaints and credit card donations towards 'solutions to problems'. Multiple complaints against the same spammer are added to the total until an acceptable solution has been found.

Larry tells us that a typical $250 solution is a broken hand, and for $2000 all anyone ever sees again of 'the problem' are his shoes.

The URL is to be announced next week, and the cyber café's phones have been jammed with requests for more information.
Spam Hunters (Score:4)

by Alien54 ( 180860 ) writes: on Monday July 30, 2001 @07:56AM (#2183306) Journal

I still think that we have to make it profitable for folks to go after spammers.
Spammers need to be licensed (preferably with an ear tag, but i'll consider substitutes) and fully identified. all spam needs to have a spam license number in the header someplace.
Fees can then be and need to be collected by your favorite government agencies (I think the IRS, the NSA, and BATF will do for now). ISPs and users need to be able to bill spammers some amount for the spam processed and received. Fees need to be large enough that it is worthwhile to go after them, and then we can have bounty hunters. Fees can be high enough to reduce the cost of access. Penalties for abuse can be heavy (20 years in jail, for example)
Then we can have spam hunters who will go out and collect from the spammers for you in exchange for a percentage.

Re:I can't see this working (Score:2)

by hal200 ( 181875 ) writes:

Actually, I've been slowly plunking away at a spam recognition tool based on Thomas Landauer's work on Latent Semantic Analysis. (Try http://lsa.colorado.edu)

I attended a talk Dr.Landauer did on it a couple years ago, and one of the more interesting uses for the system is text categorization (They were using it to mark term papers...this paper is similar to an A paper...this paper is similar to a D, that sort of thing...they actually got a fairly high correlation with human markers)

Anyway, I started to wonder if it could be applied to spam hunting...Should I ever get the system to a useable state ('training' the system requires some rather large matrix manipulations...and my poor dual Celeron just couldn't handle it...27-42 hours worth of processing time on the small samples I was working with for a term paper at the time)

Now that I've upgraded to a significantly faster machine, if I were to take some time to optimize the code, I might be able to get down to the point where I could start training it on my ever-growing "Library Of Spam".

Of course, I'm probably one of the few ppl on the planet who actually COLLECTS spam...and my friends tell me I need a gf! ;)

Anyway, at the point it's at now, it's still at just a 'hey, wouldn't it be neat if' stage...I honestly haven't a clue how well it will work...

Who knows? The analysis might make an interesting master's thesis some day...It would certainly be handy to have a research-class number cruncher to handle the matrices involved...
Re:hmm (Score:2)

by Erasmus Darwin ( 183180 ) writes:

Why do you feel so superior, exactly ?
Pseudonymity provides more continuity (there are some Slashdot posters whom I recognize by name), gives people less incentive to be stupid ("FIRST POST! Natalie Portman and hot grits!"), means that the poster is more likely to catch a reply, and generally says, "I was willing to at least go through the trouble of getting a throw-away hotmail account so I could register on Slashdot." Is it a cure all? No. Are there worthwhile AC posts? Yes. But for the most part, it isn't worth the effort to wade through the garbage to catch the good ones. Besides, some of the good ones'll get caught by moderators, anyway.
And, if you want accountability, don't go to usenet, or stay in moderated groups.
Great! I propose a solution that doesn't stop anyone from posting, but allows me to selectively filter what I read, yet some genius AC declares, "If you don't like the way it is, go somewhere else." ...and yet he still wonders why I feel superior to the ACs of Slashdot.
(As an aside, I'll generally read AC messages that reply directly to posts that I make. But more and more often, I wonder why I even bother.)
Re:hmm (Score:3)

by Erasmus Darwin ( 183180 ) writes: on Monday July 30, 2001 @07:59AM (#2183309)

the spammers have already countered with random data appended to the spam (and random numbers in the subject headers)
...and the worst of the bunch -- randomly inserting punctuation in the entire message:
M`A.K,E M:O'N"E,Y F.A`S'T
*shudder* Every now and again, I wish we would have optional accountability in Usenet, similar to how I can set my default read-level on Slashdot high enough that J. Random Anonymous Coward never shows up. Couple that with a clause in the ISPs contract that allows them to assess significant fines against spammers, and we'd be (theoretically) set.
Then I wake up and realize that people'll just steal accounts or even use litigation [paetec.net] to block the ISP from cutting them off for spamming. That's when I wish we could just train those kids who want to go on school shooting rampages to just take out spammers instead, killing two birds with one stone.

Cell phones are great (Score:2)

by AintTooProudToBeg ( 187954 ) writes:

My cell phone offers free long distance. So I call the number on every piece of spam that I get. Mostly you get an answering machine, so I request a call back. This costs the spammers time plus hopefully a little money for the call back. Mostly they're semi-pathetic business-type people who really don't know anything about computers and are somewhat apologetic/embarrassed. I did get one asshole who hung up on me when I started asking where he got my email address from... so I called back (CallerId is great!). Anyways, call those spammers!
Fingerprinting required, not checksumming (Score:2)

by Xilman ( 191715 ) writes:

I've now read a whole bunch of comments saying that checksumming is useless because adding junk/serial numbers/whatnot will defeat the spam detectors. True, but irrelevant.
The intellectual property protection people have been thinking about this sort of problem for a long while now. Just as they want to be able to detect when something has been copied, the spam-haters want to detect when something is a copy. Both want to be successful in the presence of countermeasures. It's the same problem!
There's a vast amount of literature available out there. Any half-way decent search engine should throw up more than you can read in a reasonable time.
Paul
Re:What's the big deal? (Score:2)

by atheos ( 192468 ) writes:

About 80% of the mail I get at my house is unsolicited and 95% of the phone calls I get are salesmen.
95% of your phone calls are salesmen???? You must be one big sucker!
Re:Laws about PRON (Score:4)

by atheos ( 192468 ) writes: on Monday July 30, 2001 @07:59AM (#2183314) Homepage

Ya, this same argument is used when discussing censoring the entire internet. Ever thought about running for office? Spammers aren't the only ones I blame. I run a small mail server (less than 1k messages a day), and every night I e-mail ISPs informing them of open relays, and dialup customers abusing their systems. I have received a few auto-replies, and not ONE god damn response from someone who cares. I'd like to assume that most people are way too busy fixing the problem, but the same culprits keep showing up in my mail log. When discussing legal action against spammers, I think the same legal repercussions should be directed to ISPs who don't know/care how to run a mail server.

Re:"Pretty close" checksums? (Score:2)

by exploder ( 196936 ) writes:

One problem I can see immediately with these "blockwise" checksums is that the spammers could easily insert not only text with random content but also random length. Do any of these "pretty close" methods handle offsets appropriately as well?
Re:I can't see this working (Score:4)

by 11223 ( 201561 ) writes: on Monday July 30, 2001 @08:24AM (#2183318)

A neural net anecdote from a teacher of mine:
A few years ago, during the big push for a "smart army", millions of dollars were poured into having individual tanks recognize enemy tanks on the battlefield. Well, it turns out they did it with a neural network, and after quite a bit of training they got it to reliably recognize enemy tanks as such.
Then, the eventual day when the general shows up arrived, and they had to give the demo. As you can probably predict, it crashed and burned. Why? Well, the system was trained on bright, sunny days in the middle of the desert (real sun!), and the demo was on the first overcast day in a year, and the neural net had trained itself to recognize the *shadow* of a tank, not the tank itself.
Caveat neural-net-user.

False Positives (Score:3)

by Matthias Wiesmann ( 221411 ) writes: on Monday July 30, 2001 @08:01AM (#2183325) Homepage Journal

While the system could be broken by using counters, this could be countered by parsing only certain portions of the mail or counting the frequency of certain words. Would work very well on pure text spam, but not on attachment stuff.

What would be funny would be to see the false positives of such a system. Many mails I get from the administration all look the same, I wonder if they would be considered as spam - they are quite similar to spam: useless and too numerous...

Re:Add invalid HTML tags (Score:2)

by 3-State Bit ( 225583 ) writes:

and, to make your point compreHENSIBLE (you're just not expressing yourself, dobbs), the html tags would differ from mailing to mailing. Thus, the seen text is the same, but the unseen text is different enough to mess any crc up. Simple solution: exclude all punctuation and html tags. Make all lowercase. Split the results on whitespace. Foreach(word), spell-check and accept the first suggestion of whatever spell-checker you're using (as long as it's deterministic, heh). Replace each word with a deterministic thesaurus's suggestion for what the most common word is that is sometimes its synonym. (This way simple thesaurusing can't mess us up). It doesn't matter if the 'whittled-down' version we're now working with doesn't make sense in English--as long as we can always get to it deterministically.
Now discard all articles and very common words (ones that don't convey information and can't be used to form whole sentences. Don't eliminate any verbs). You're left with the bare essence of what the email conveys, and anything that's not in this can't be in the original. Then crc this one. Heheh, try to get around that, spammer.

Er, actually, one thing I notice is that I didn't address "random" spacing. My system wouldn't realize that "random" is a word there. Solution: don't split on white space, remove all white-space and then use a dictionary that lets you see how close something is to being a word, then add letters until you're now farther from being a word than you originally were, and pop that off as a separate word. You can look ahead slightly, so that you don't pop "nation" just because it's more of a word than "nationa" is, if the letters afterward are "lity".

Sound good?

--
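
A cut-down sketch of the normalise-then-crc pipeline described above. It covers only the tag stripping, punctuation removal, lowercasing and common-word steps; the spell-check, thesaurus and re-segmentation steps are left out, and the stop-word list, function names and sample messages are invented for the example.

```python
import re
import zlib

# Tiny hypothetical stop-word list; no verbs, per the post.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "it"}


def essence(message):
    """Strip tags and punctuation, lowercase, and drop very common words."""
    text = re.sub(r"<[^>]*>", "", message)        # remove anything tag-shaped
    text = re.sub(r"[^\w\s]", "", text).lower()   # remove punctuation
    return [w for w in text.split() if w not in STOP_WORDS]


def essence_crc(message):
    """CRC32 of the whittled-down word list."""
    return zlib.crc32(" ".join(essence(message)).encode("utf-8"))


copy_1 = "Make <x91>money fast, in the com<q7>fort of your home!"
copy_2 = "MAKE money<zz3> fast in the comfort of your <b0gus>home."
print(essence(copy_1))
print(essence_crc(copy_1) == essence_crc(copy_2))   # True
```

Both copies reduce to the same six words, so the crc matches even though the raw messages differ in case, punctuation and junk tags.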
Re:Add invalid HTML tags (Score:2)

by 3-State Bit ( 225583 ) writes:

a) You're right. quite a bit of processing.
b) I've already figured out a way around it! As a spammer, have your spam engine combine your sentences in arbitrary order. What about sentence matching? Set it so it adds removable phrases, I repeat you will never be charged, with modifiers like "seriously", and "we're not kidding", and even "very", "extremely", etc.

Your "Spam Engine Markup for Interception-Neutralization and -Avoidance Language" (Seminal!) can have special tags telling you where you can put filler phrases. At the end, you can include a lot of random words from a news site or whatever, to throw off word-frequency analysis.
The idea is that it's a lot easier for a spammer to change things around in random order than it is for a mail server to order them back again for comparison. So, plan no-go :(

--
Re:Spam Hunters (Score:2)

by eclectro ( 227083 ) writes:

Spammers need to be licensed (preferably with an ear tag, but i'll consider substitutes)

maybe spammers could be branded with a giant S with a hot-iron like they did with cattle in the old west....
Re:Cell phones are great (Score:2)

by tmark ( 230091 ) writes:

My cell phone offers free long distance.
Wow, sounds like you have a great cell phone plan. Do you get local calls free too ? ;-)
Re:What's the big deal? (Score:4)

by DeadMeat (TM) ( 233768 ) writes: on Monday July 30, 2001 @09:12AM (#2183331) Homepage

The big difference is who pays for it.
When you get a telemarketing call, they pay their long distance company for the right to call you. It doesn't cost you a penny to pick up the phone. When you get junk (snail) mail, the marketer had to pay the postal service to send mail out to each and every address. Not only does it not cost you anything, but in the case of the U.S. Postal Service these bulk rates actually lower the cost of you sending mail, since they use it to subsidize part of the cost of personal mail.
Bulk E-mail on the other hand is a different thing. First off, if you're not on a land-based U.S. phone line, odds are you're paying per-minute for your connection -- which sucks since you have to pay to get spam dumped in your E-mail program's inbox.
Even if you have a flat rate connection, you're still inevitably paying for spam mail, whether or not it's directly. Bandwidth isn't free -- take a 5k spam mail message and multiply it by 10 million messages, both of which are probably conservative estimates, and you're talking about 50 gigabytes each time a spam is sent out. If you get 3 spam messages a day, that's 150 gigabytes of bandwidth just for the mailings that reached you -- which is only a tiny fraction of all the spam sent out in a day. Multiply 50 gigabytes by the countless number of mailings, and that's a lot of bandwidth going up in smoke daily.
Guess who's paying for it? Hint: with spammers usually using stolen ISP accounts and fake credit card numbers, probably not them. Another hint: when ISPs' bandwidth costs go up, they pass it on to the users.
Not to mention the fact that spammers shoving millions of messages through creaky mail servers can take them down. So even excluding the monetary damage, what's it worth if a piece of E-mail sent to/from you was on that server when it went down in flames? Your message may be delayed, or it may never show up at all.

"Pretty close" checksums? (Score:3)

by geekplus ( 248023 ) writes: on Monday July 30, 2001 @07:56AM (#2183332)

Aren't there algorithms that will report messages that are pretty close, i.e., within N arbitrary bits of each other, as the same checksum? Or at least something approximating a checksum..., i.e. two different checksums that nonetheless return true when passed to an equals(cs1, cs2) method?
Does someone have a link?
-- I had a female crustacean once, but I lobster...
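
One family of techniques that behaves roughly this way is locality-sensitive hashing. The sketch below is a minimal SimHash-style fingerprint in Python. It is not the scheme the DCC actually uses; the function names `simhash`, `hamming` and `nearly_equal`, the word-level features, the 64-bit width, and the 8-bit threshold are all assumptions chosen for illustration.

```python
import hashlib


def simhash(text: str, bits: int = 64) -> int:
    """Return a SimHash-style fingerprint: similar texts tend to share most bits."""
    weights = [0] * bits
    for word in text.lower().split():
        # Hash each word to a stable integer (MD5 is used only for bit mixing).
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # Each output bit is the majority vote of that bit across all words.
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def nearly_equal(cs1: int, cs2: int, max_bits: int = 8) -> bool:
    """Hypothetical equals(cs1, cs2): true when the fingerprints differ in few bits."""
    return hamming(cs1, cs2) <= max_bits
```

Two copies of a long message that differ only by a short random string would be expected to land within a few bits of each other, but how few depends on message length, so the threshold would need tuning against real mail.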

Counters? Already do. (Score:2)

by J'raxis ( 248192 ) writes:

Haven't you gotten the spam that says something like:

EARN $$$$$ AT HOME!!! xyzzygx

In each copy, that xyzzygx is a different string of crap. I think this technique was originally developed to be a filter-foiler (you see this in Usenet a lot more than in email), but that'd do it.
There's also the spam that includes customized URLs in the message (image downloads that, say, have your email address embedded in the query string -- sneaky little "live address" confirmation technique).
Re:I can't see this working (Score:3)

by Sven Tuerpe ( 265795 ) writes: <svenNO@SPAMgaos.org> on Monday July 30, 2001 @08:58AM (#2183343) Homepage

Checksums do not change gracefully given different inputs.

It depends. If we think of cryptographic hash functions, you are right. They are designed that way in order to avoid collisions and forging of messages that are mapped to a given value by a particular function.

But if we think of error correcting codes, the situation is different. They are designed with the opposite goal in mind -- changing gracefully when certain errors (i.e., small changes for some definition of "small") occur, to allow for reconstruction of the original data.

Usually both the checksum and the corrupted data (or the corrupted data + checksum string, to be precise) are needed in the case of error correcting codes. But perhaps concepts from both -- closely related -- fields could be combined to create something usable for spam detection under hostile spammer conditions?
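
The first point, that cryptographic hash functions do not change gracefully, is easy to see for oneself. Here is a minimal sketch in Python using SHA-256 from the standard library; the helper name `differing_bits` and the sample strings are illustrative assumptions. Changing one character of the input should flip roughly half of the 256 output bits.

```python
import hashlib


def differing_bits(msg_a: bytes, msg_b: bytes) -> int:
    """Count the bits that differ between the SHA-256 digests of two messages."""
    da = int.from_bytes(hashlib.sha256(msg_a).digest(), "big")
    db = int.from_bytes(hashlib.sha256(msg_b).digest(), "big")
    return bin(da ^ db).count("1")


original = b"EARN $$$$$ AT HOME!!! xyzzygx"
variant = b"EARN $$$$$ AT HOME!!! xyzzygy"  # only the last character changed

# Identical input always gives an identical digest ...
print(differing_bits(original, original))
# ... while a one-character change scrambles the digest almost completely.
print(differing_bits(original, variant))
```

The distance between digests says nothing about the distance between messages, which is why an exact-match checksum is easy to defeat with a random suffix.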

Re:Relevant but somewhat off-topic question (Score:2)

by AnotherBlackHat ( 265897 ) writes:

Open relays mainly exist because of legacy. Once upon a time we needed them, because most systems weren't connected 24/7, and just routing traffic was a major issue. That changed once TCP/IP became the norm, but relays were still necessary for the transition phase. Even today, there are still people whose mailboxes aren't connected 24/7 and who require a relay service, though they are definitely a minority.
Sadly, relays are still needed today because of spam blockers. A depressing number of sites require that email come from the "correct" IP address (your From: address must have the same MX record as your IP address) which means your ISP must maintain a relay for your use, though it doesn't have to be an "open" relay.
With most ISPs, it's easy to bypass relays and send email directly to port 25 on the target machine, so blocking open relays wouldn't help much; it would just push the problem back one step.
Re:What's the big deal? (Score:2)

by AnotherBlackHat ( 265897 ) writes:

The usual reply is that I'm paying for it instead of the spammer.
This is of course, bullshit.
Email is so cheap that for most people the cost of throwing away the junk mail they receive is greater than the cost of downloading the spam. If you figure bandwidth at $10 / gigabyte, which is very high, then a 10K email costs a hundredth of a penny.
The true cost of spam is the time wasted reading the crap. And if people weren't up in arms about it, there would be a lot more of it in your email box. It's sort of like flaming people for bad posts on usenet - it's not that the posts/spam is so bad, it's that if we don't do it, they'll just get worse and worse.
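
The per-message figure is easy to check. A quick sketch in Python, taking the numbers above as given ($10 per gigabyte, a 10K message) and assuming decimal units (1 GB = 1,000,000 KB):

```python
# Figures from the comment above; decimal units are an assumption.
PRICE_PER_GB_USD = 10.0
MESSAGE_SIZE_KB = 10
KB_PER_GB = 1_000_000

cost_usd = PRICE_PER_GB_USD * MESSAGE_SIZE_KB / KB_PER_GB
cost_cents = cost_usd * 100

print(f"{cost_cents:.2f} cents per message")  # 0.01 cents per message
```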
How I filter spam (Score:4)

by koreth ( 409849 ) writes: on Monday July 30, 2001 @01:04PM (#2183353)
I do a few things that are extremely effective in filtering out spam. I have procmail rules to do the following:
- Mail that doesn't list one of my addresses, or the address of a mailing list I know I'm on, in the To: or Cc: lines gets filtered. This alone catches a solid 85-90% of my spam flow, though it seems to be getting less effective as time goes on.
- Mail that's from a free E-mail service (Hotmail, Angelfire, etc.) gets filtered.
- Mail that contains certain keyphrases (e.g. "free" in all caps, or "this is not spam" or "S.1618") gets filtered.
- Mail that has passed through a .cn or .tw or .kr host gets filtered. Those countries seem to have an abundance of open relays. At some point I hope to change this to check against ORBS/DUL instead.
Now, the interesting thing is what I do once I've decided to filter the mail. Since my rules catch legitimate mail, I don't just throw it away. I wrote a small collection of Perl scripts (which I'll release to the world someday soon, but they need documentation) that maintain a whitelist of sender addresses.
If a filtered message is from an address that's marked valid, it's delivered. If it's from an address that's marked invalid, it's discarded. If it's from an unknown address, the message is put in a holding area and an autoreply is sent back to the sender from a magic address asking them to reply in order to validate themselves.
The magic address is unique per filtered message -- it uses qmail's address extension mechanism -- and mail to the magic address never gets delivered to me, so I don't care if it gets added to spam lists. The Perl script behind the magic address does a quick check to make sure it's not processing a bounce, then marks the sender of the original message as valid and delivers the original message (or messages if more than one arrived while awaiting validation).
Held messages are cleaned out by a cron job when they get too old.
This is sort of similar in concept to the password mechanism of SpamBouncer or (a closer cousin) SpamCop's whitelist feature, but it doesn't require senders to retransmit their messages, which I always thought was pretty annoying to ask people to do since not everyone saves their outgoing mail. Granted, asking them to do anything is kind of annoying, but at least this is less so since they can just hit "reply" and "send".
This setup is cool because it allows friends to Bcc me on stuff without my "I must be listed as a recipient" rule trashing their messages, even if they've just switched E-mail addresses. It is admittedly based on the assumption that spammers don't read replies to their mail and/or wouldn't go to the effort of unlocking themselves; I have yet to see a spammer do that, and given the economics of spamming I think that'll be a safe assumption for the foreseeable future, unless this approach gets so popular that spammers start writing automated unlock bots!
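
The decision flow described above can be sketched roughly as follows. This is Python rather than Perl, and it is not the actual scripts: the class name `Whitelist`, the `handle` and `confirm` methods, the magic-address format, and the in-memory storage are all assumptions made for illustration. The cron cleanup of old held messages is left out.

```python
import uuid


class Whitelist:
    """Toy model of the validate-by-reply scheme; all names are hypothetical."""

    def __init__(self, domain: str = "example.org"):
        self.domain = domain
        self.status = {}   # sender address -> "valid" or "invalid"
        self.held = {}     # sender address -> list of held messages
        self.magic = {}    # magic address -> sender awaiting validation

    def handle(self, sender: str, message: str) -> str:
        """Process one message that the filter rules flagged."""
        state = self.status.get(sender)
        if state == "valid":
            return "deliver"
        if state == "invalid":
            return "discard"
        # Unknown sender: hold the message and issue a one-off reply address.
        self.held.setdefault(sender, []).append(message)
        magic = f"me-confirm-{uuid.uuid4().hex}@{self.domain}"
        self.magic[magic] = sender
        return f"challenge via {magic}"

    def confirm(self, magic: str, is_bounce: bool = False) -> list:
        """Called when mail reaches a magic address; returns the released messages."""
        sender = self.magic.pop(magic, None)
        if sender is None or is_bounce:
            return []  # unknown address or a bounce: validate nobody
        self.status[sender] = "valid"
        return self.held.pop(sender, [])
```

Because each challenge gets its own address, a reply to any one of them releases every message held for that sender, which matches the "or messages" case described above.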
For USENET! (Score:3)

by gnovos ( 447128 ) writes: <gnovosNO@SPAMchipped.net> on Monday July 30, 2001 @09:14AM (#2183362) Homepage Journal

An idea similar to this could and should be tried to bring the USENET back into the hands of the masses. Having some sort of k5 style moderation used on USENET message ids could potentially end spam as we know it. The simplest approach would be to have a few groups of competing "moderation" servers that you could query and rate messages by their message id and then build in some client plugins to filter based on a given threshold. Of course to really get the system to work, some thought would have to be put into authentication (say only 5 moderations allowed per IP per day, or even have an actual login process to moderate) to keep spammers from moderating up their own posts. If we have a loose network of many of these moderation servers, they could all use different ways to pick out the good posts, and user preference would dictate which system works best.

Anyway, just my 2 cents...
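
For what it's worth, a single moderation server of this kind could look something like the Python sketch below. Everything in it is hypothetical: the class name `ModerationServer`, the +1/-1 ratings, the per-IP quota of 5 per day taken from the suggestion above, and the client-side `visible` filter.

```python
from collections import defaultdict


class ModerationServer:
    """Toy rating server keyed by Usenet Message-ID; names are hypothetical."""

    def __init__(self, daily_limit: int = 5):
        self.daily_limit = daily_limit
        self.scores = defaultdict(int)   # message id -> running score
        self.used = defaultdict(int)     # (ip, day) -> moderations spent

    def moderate(self, ip: str, day: str, message_id: str, delta: int) -> bool:
        """Apply a +1/-1 rating unless this IP has used up its daily quota."""
        if delta not in (1, -1) or self.used[(ip, day)] >= self.daily_limit:
            return False
        self.used[(ip, day)] += 1
        self.scores[message_id] += delta
        return True

    def score(self, message_id: str) -> int:
        """Current score of a message; unrated messages score 0."""
        return self.scores.get(message_id, 0)


def visible(server: ModerationServer, message_ids, threshold: int = 0):
    """Client-side filter: keep only messages rated at or above the threshold."""
    return [m for m in message_ids if server.score(m) >= threshold]
```

A per-IP quota is only a weak form of authentication, so a real deployment would probably need the login process mentioned above.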

Worms? (Score:3)

by All Dead Homiez ( 461966 ) writes: on Monday July 30, 2001 @07:53AM (#2183368)

Obviously there are issues with something like this (especially mailing lists, and worms that do attachments)
Is there some hidden reason why we would want millions of copies of an email worm's attachment to get through? This could actually be part of the solution to two problems.
Also, do note that a common method of spamming is to connect to an open relay and have the relay take care of sending out thousands of identical messages by simply sending thousands of "RCPT TO:" commands. Checksumming spam would completely break this spamming method and would force the spammer to retransmit the entire message for every recipient in order to vary it, thus making the process more costly.
-all dead homiez
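
To make the idea concrete, here is a minimal sketch in Python of the clearinghouse bookkeeping. It is not the DCC's actual protocol or checksum; the class name `Clearinghouse`, the SHA-256 body hash, and the threshold of 3 reports are assumptions for illustration.

```python
import hashlib
from collections import Counter


class Clearinghouse:
    """Toy checksum counter; not the real DCC protocol."""

    def __init__(self, bulk_threshold: int = 3):
        self.bulk_threshold = bulk_threshold
        self.reports = Counter()  # checksum -> number of times reported

    def report(self, body: bytes) -> bool:
        """Record one sighting of a message body; return True once it looks bulk."""
        checksum = hashlib.sha256(body).hexdigest()
        self.reports[checksum] += 1
        return self.reports[checksum] >= self.bulk_threshold


ch = Clearinghouse()
spam = b"EARN $$$$$ AT HOME!!!"

# One body relayed to many recipients hashes identically every time.
verdicts = [ch.report(spam) for _ in range(4)]
print(verdicts)  # [False, False, True, True]

# A spammer who varies each copy defeats the exact-match count.
varied = [ch.report(spam + b" " + str(i).encode()) for i in range(4)]
print(varied)  # [False, False, False, False]
```

The second case is the cost argument in miniature: the copies only escape the count if each one is transmitted separately with its own body.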


