New Kind of Spam 'Un-Training' Filters? - Slashdot

Posted by ScuttleMonkey on Wednesday August 09, 2006 @12:53PM from the battle-lines-being-drawn dept.

Zaphod2016 writes to tell us the Wall Street Journal is reporting that email in-boxes are under a new kind of spam attack. This new spam has confused many people due to its lack of advertising, viruses, or request for personal information. One popular theory is that these innocuous blocks of text, often drawn from popular literature, are being used to "un-train" spam filters to allow more malicious spam through in the future.

This discussion has been archived. No new comments can be posted.

NPR article (Score:2, Informative)

by Anonymous Coward writes: on Wednesday August 09, 2006 @01:02PM (#15874868)

I heard an interview yesterday on NPR about this.

http://www.npr.org/templates/story/story.php?storyId=5624749 [npr.org]

Un-training? Hardly. (Score:5, Informative)

by pclminion ( 145572 ) writes: on Wednesday August 09, 2006 @01:03PM (#15874879)

Bayesian and other filters do not rely on "spammy" words alone -- they also rely on "unspammy" words, and spammers have no idea what those words are because each person receives different email.

A scenario, with made up (but plausible) numbers: Suppose you're a developer of a Linux driver for the Bozodrive 1000. The majority of your legitimate email comes from Linux driver development mailing lists. A full 50% of those emails contain the word "IRQ." 99% of the emails contain the word "driver," and 15% contain the word "Johannsen" which is in the signature of one of your friends. And precisely 0% of the emails containing any of these terms have ever been found to be spam.

Any decent spam filter will give a huge weight to the presence of these "unspammy" words, because of the extremely high probability of emails containing them to be non-spam. The presence of randomly selected confusion words in empty spams is not going to affect these frequency counts.

In order to defeat a filter by confusing it, the spammer must guess what the SPECIFIC non-spam words for that PARTICULAR email user are, and then produce bogus, spam messages containing those words in the appropriate frequencies. This will cause the classification counts for those words to become more equalized, and the value of those words in determining spammyness to be greatly reduced. However, this is an impossible task unless the spammer has access to the actual emails of the target.

Perhaps the intent of the empty spams is to confuse the filters, but whoever devised the method has no understanding of how these things actually work, whatsoever.
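
To make the argument above concrete, here is a minimal Python sketch with made-up data. It assumes a simplified per-token estimate (the fraction of spam containing a token, divided by the sum of the spam and ham fractions); real filters add smoothing and combine many tokens. The function names and the toy messages are hypothetical.

```python
from collections import Counter

def train(messages):
    """Count, per token, how many messages contain it."""
    counts = Counter()
    for msg in messages:
        counts.update(set(msg))
    return counts

def token_spam_prob(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate P(spam | token); unseen tokens get a neutral 0.5."""
    s = spam_counts[token] / n_spam if n_spam else 0.0
    h = ham_counts[token] / n_ham if n_ham else 0.0
    if s + h == 0:
        return 0.5
    return s / (s + h)

# Hypothetical mailbox: every legitimate message mentions "driver".
ham = [{"irq", "driver", "patch"}, {"driver", "johannsen"}, {"driver", "irq"}]
spam = [{"viagra", "cheap"}, {"cheap", "pills"}]
ham_counts = train(ham)

before = token_spam_prob("driver", train(spam), ham_counts, len(spam), len(ham))

# Case 1: the user flags three "empty" literary spams as spam.
poison = [{"dickens", "landscape"}, {"pickwick", "servant"}, {"trollope", "lawsuit"}]
spam_poisoned = spam + poison
after_poison = token_spam_prob(
    "driver", train(spam_poisoned), ham_counts, len(spam_poisoned), len(ham)
)

# Case 2: the spammer somehow guesses this user's non-spam words.
targeted = [{"driver", "irq"}] * 3
spam_targeted = spam + targeted
after_targeted = token_spam_prob(
    "driver", train(spam_targeted), ham_counts, len(spam_targeted), len(ham)
)

print(before, after_poison, after_targeted)
```

In this toy setup, the random literary words leave the score for "driver" untouched, while the targeted messages pull it toward neutral. That is the distinction drawn above: confusion only works if the spammer knows the target's own vocabulary.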

Re:Not very effective and may be easy to work arou (Score:5, Informative)

by pclminion ( 145572 ) writes: on Wednesday August 09, 2006 @01:12PM (#15874970)

By having a Bayesian filter forget over time, it also helps shrink down the database and helps it adapt as the contents of spam change over time.

Having the filter forget is the ONLY effective policy. In statistical filtering, it is certainly NOT true that more data == better results. You want a sample of data that most accurately represents the sort of content you are receiving RIGHT NOW. I completely purge my Firefox Bayesian database every couple of months and retrain on recent emails only. The result is ALWAYS an increase in accuracy, particularly a reduction in false positives.
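
A rough sketch of the "forget and retrain" policy, in Python. It is only an illustration: the 60-day window stands in for "every couple of months", and the message tuple layout and function name are hypothetical, not how any particular mail client stores its training data.

```python
from collections import Counter
from datetime import datetime, timedelta

def retrain_recent(messages, now, window_days=60):
    """Rebuild token counts from scratch using only messages inside the window.

    `messages` is an iterable of (received, is_spam, tokens) tuples
    (a hypothetical layout chosen for this example).
    """
    cutoff = now - timedelta(days=window_days)
    spam_counts, ham_counts = Counter(), Counter()
    for received, is_spam, tokens in messages:
        if received < cutoff:
            continue  # forgotten: too old to reflect current mail
        (spam_counts if is_spam else ham_counts).update(set(tokens))
    return spam_counts, ham_counts

now = datetime(2006, 8, 9)
mail = [
    (datetime(2006, 1, 5), True, ["mortgage", "rates"]),  # old spam, dropped
    (datetime(2006, 7, 20), True, ["stock", "alert"]),    # recent spam, kept
    (datetime(2006, 8, 1), False, ["driver", "irq"]),     # recent ham, kept
]
spam_counts, ham_counts = retrain_recent(mail, now)
```

Rebuilding from scratch rather than decaying old counts matches the "completely purge" approach described above; a decaying counter would be another way to get a similar effect.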

No, unless people send that text to you. (Score:5, Informative)

by khasim ( 1285 ) writes: <brandioch.conner@gmail.com> on Wednesday August 09, 2006 @01:13PM (#15874973)

I still flag crap like this as spam, so it seems like it'd train my spam filter to have more false positives, no?
No, unless the people you usually correspond with also include blocks of the same text.

The only way to increase the false positives is to get the spam filter to learn the words that usually appear in your legitimate messages.

Since the spammers have no way of knowing what those words are, there is no way they can bypass your filters ... and still be effective in getting through anyone else's filters.

Re:Not everybody develops Linux drivers (Score:4, Informative)

by pclminion ( 145572 ) writes: on Wednesday August 09, 2006 @01:16PM (#15875001)

Take my dad for instance; he isn't on any mailing list; 99% of his email is along the lines of "how are you" and "give my love" etc; pretty run of the mill stuff.
People who ask those sorts of things usually sign their name to their email. Those names will become strong non-spam keywords. ANYTHING your dad talks about specifically will help -- hobbies, places he usually goes, etc. You'd be surprised how much specific, intelligent content even the most "ordinary" of people will produce.

Re:Other way around? (Score:5, Informative)

by TubeSteak ( 669689 ) writes: on Wednesday August 09, 2006 @01:24PM (#15875072) Journal

My limited experience is that whatever filtering Hotmail uses has been allowing lots of Spam to slip through in the last few weeks.

Anyone else?
How's Yahoo & G-Mail been doing?

Re:Other way around? (Score:3, Informative)

by fbjon ( 692006 ) writes: on Wednesday August 09, 2006 @01:45PM (#15875224) Homepage Journal

I recommend greylisting. It's a somewhat dubious way of dealing with it, but I can't remember the last time I received a spam-ish mail, must be more than a year ago. I really have no idea how big a problem spam is these days because I just don't get any, even though my address can be found by googling.

Re:I just thought they were weird. (Score:5, Informative)

by CohibaVancouver ( 864662 ) writes: on Wednesday August 09, 2006 @02:00PM (#15875368)

be interested to know how many people put up money for products / services they were spammed with.
Quite a few, apparently.
I read one article which claimed that one spammer in particular "received 10,000 credit card orders in one month [snip] each for $39.95 US."
So that's nearly $400,000 per month. Nice work if you can get it.
Source:
http://www.cbc.ca/story/business/national/2005/04/08/spam-050408.html [www.cbc.ca]

Re:The text comes from the Gutenberg Project (Score:5, Informative)

by Ed Avis ( 5917 ) writes: <ed@membled.com> on Wednesday August 09, 2006 @02:02PM (#15875376) Homepage

If the spammers are now sending round Gutenberg texts, this is entirely appropriate. Project Gutenberg caused probably the first ever spam, when Michael Hart launched the project by trying to mail everyone on ARPANET with the U.S. Declaration of Independence. (source [lwn.net])

Re:The text comes from the Gutenberg Project (Score:4, Informative)

by letxa2000 ( 215841 ) writes: on Wednesday August 09, 2006 @02:02PM (#15875377)

think that is the point. They want to either poison those words so you get more false positives or they want to push other REAL spam related words out of the "this is spam" dictionaries. Maybe both. If these messages had some common theme, they would all get blocked and would have no net effect. They need you to click "this is spam" to poison your filters. Question is, does it work?

Answer is: No, it won't. At least not with Bayesian. The only way to mess up a Bayesian filter is if they can send you messages that are heavy in words/terms that often appear in your good email. And that's going to vary from user to user. Unless you're sending me the exact words that I use in my daily emails, adding a plethora of other words is not going to make my filter any less accurate or create more false positives. It will either let my filter recognize your "poison" as spam itself or, at worst, be neutral.
My Bayesian filter, among other things, considers an excessive number of infrequently/never used terms as a characteristic that is itself subject to Bayesian classification. So while the "poison words" have no statistical effect on my filter, the fact that a bunch of unusual words are found in a message is going to increase the chance that my filter correctly recognizes the message as spam.
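
One way to picture the "unusual words as a feature" idea is a synthetic token added to the message before classification, sketched here in Python. The threshold, the marker name, and the function are all assumptions made for illustration; the actual mechanism in my filter is not spelled out here.

```python
def add_rarity_feature(tokens, known_tokens, threshold=0.5):
    """Return the token set, plus a synthetic marker when most tokens are unfamiliar.

    The marker is then trained and scored like any other token, so the
    filter can learn that "mostly unfamiliar words" correlates with spam.
    """
    tokens = set(tokens)
    if not tokens:
        return tokens
    unknown = sum(1 for t in tokens if t not in known_tokens)
    if unknown / len(tokens) > threshold:
        return tokens | {"*many-unknown-words*"}
    return tokens

# Hypothetical vocabulary already seen in this user's mail.
known = {"driver", "irq", "patch", "meeting", "lunch"}

literary = add_rarity_feature(["pickwick", "chesterton", "gissing", "lunch"], known)
normal = add_rarity_feature(["driver", "patch", "lunch"], known)
```

Here the literary message picks up the marker and the ordinary one does not, so the padding words end up counting against the sender.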
My spam was constantly growing through about December of last year. This year, it seems to have leveled off. Sure, I'm still getting just under 20,000 per month which sucks, but I see almost none of them and according to my spam stats, the spam has leveled off. Hopefully this is the plateau before it falls. :)
I still want to know: Who are the idiots who BUY spammed products???

Re:How to be smarter (Score:2, Informative)

by maird ( 699535 ) writes: on Wednesday August 09, 2006 @02:11PM (#15875463) Homepage

I use ASSP as my spam filter: http://assp.sourceforge.net/ [sourceforge.net]. It always filtered spam very well for me, but the latest version added an interesting technique that has reduced the amount of spam even reaching the filters to near zero. Since SMTP is considered "unreliable", a sending server will retry on failure. Apparently, spammers tend not to bother retrying. ASSP builds tables using an identity triplet (I can't remember the three message/source attributes it uses). On first sight of a given triplet, ASSP responds with an SMTP error suggesting the source retry later. ASSP tables the triplet and allows that traffic to pass later on a retry. The triplet expires after some period. I'm not aware of any false rejections, and the number of messages hitting the dump mailbox has dropped from around 10 a day to a couple a week. I suppose one might argue that it increases packet traffic, and I assume spammers will work around it, but I suspect the extra packet traffic is far exceeded by the spam that I would otherwise handle, and it handles the spammers for now. Sentience unnecessary perhaps.
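
The retry trick can be sketched in a few lines of Python. This is not ASSP's code. It assumes the triplet commonly used by greylisting implementations (client IP, envelope sender, envelope recipient), which may or may not be what ASSP uses, and the delay and expiry values are made up.

```python
import time

TEMPFAIL = "451 Temporary failure, please retry later"

class Greylist:
    """Minimal greylisting sketch keyed on an assumed (ip, sender, recipient) triplet."""

    def __init__(self, min_delay=300, expiry=86400, clock=time.time):
        self.min_delay = min_delay  # seconds a new triplet must wait before a retry counts
        self.expiry = expiry        # seconds after which a triplet is forgotten
        self.clock = clock          # injectable clock, handy for testing
        self.first_seen = {}        # triplet -> time of first delivery attempt

    def check(self, client_ip, mail_from, rcpt_to):
        """Return an SMTP-style reply line: a 4xx temporary failure or 250."""
        now = self.clock()
        key = (client_ip, mail_from.lower(), rcpt_to.lower())
        seen = self.first_seen.get(key)
        if seen is None or now - seen > self.expiry:
            self.first_seen[key] = now  # first sighting (or expired): table it
            return TEMPFAIL
        if now - seen < self.min_delay:
            return TEMPFAIL             # retried too quickly
        return "250 OK"                 # a genuine retry: let it through

# Usage with a fake clock so the example runs instantly.
fake_now = [0.0]
gl = Greylist(clock=lambda: fake_now[0])
first = gl.check("192.0.2.10", "a@example.org", "me@example.net")
fake_now[0] = 600.0
retry = gl.check("192.0.2.10", "a@example.org", "me@example.net")
print(first, "|", retry)
```

A well-behaved sending server queues the message on the 4xx reply and tries again later, at which point it passes; software that never retries simply never gets in.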

Re:Other way around? (Score:5, Informative)

by badasscat ( 563442 ) writes: <basscadet75@@@yahoo...com> on Wednesday August 09, 2006 @02:13PM (#15875470)

How's Yahoo & G-Mail been doing?

Here are actual samples of emails that Gmail and Yahoo have let through to my inbox over the past couple days. First, Gmail:
Wells, who has had a rather similar historyand who obviously owes something to Dickens as novelist. In some ways his outlook is verysimilar to Dickenss. No one who is really involved in the landscape ever sees thelandscape. To Chesterton the poor means small shopkeepers andservants. There is nothing psychologically false in this, either. No one who is really involved in the landscape ever sees thelandscape. It is easy to imagine what the young woman would have said to this inreal life. And given the FACT ofservitude, the feudal relationship is the only tolerable one. Theother point is that Dickenss early experiences have given him a horrorof proletarian roughness. They, and the men, always spoke of me as the younggentleman. It is one of the stockjokes of English literature, from Malvolio onwards. Buthe is remarkably free from the idiocy of regarding nations asindividuals. So were all the characteristic English novelists of thenineteenth century. The last thing anyone ever remembers about the books is theircentral story. Nevertheless hislist of most hated types is like enough to Wellss for the similarity tobe striking. A change of heart is in fact THE alibi of peoplewho do not wish to endanger the STATUS QUO. There is nothing psychologically false in this, either. Pickwick and the servant should be Sam Weller. It is noticeable thatDickens hardly writes of war, even to denounce it. Therewere no labour-saving devices, and there was huge inequality of wealth. In Dickenss novels anything in the nature of work happens off-stage. And, on the whole, his attacks on good society are ratherperfunctory. But byorigins and upbringing Thackeray happens to be somewhat nearer to theclass he is satirizing. Here perhaps Gissing is influenced by his own love of classical learning. In a rather different sense his attitude to life is extremely unphysical. It is usual to claim him as a popularwriter, a champion of the oppressed masses. Dickens would be quite incapable of this. 
Compare any lawsuit in Dickens with the lawsuit inORLEY FARM, for instance. I do consider the young ooman, sir, said Sam. Here the contrast between Dickens and, say, Trollopeis startling. It is true that not all his novelsare alike in this. He getshimself arrested in order to follow Mr. Progressis not an illusion, it happens, but it is slow and invariablydisappointing. If his palms are hard from work, they let him in; if his palms aresoft, out he goes. It is perhaps more significant that he shows noprejudice against Jews. At first sight this statement looks flatly untrueand it needs some qualification. A modern manservant would neverthink of doing either. There arepractically no friendly pictures of the landowning class, for instance. If one wants a modern equivalent,the nearest would be H.

Attached to the above was an image file that contained an obvious ad. So to Gmail, this apparently looks like a regular text email that happens to have an attached image.

(You can argue about how effective this is, since Gmail thumbnails all images, meaning you'd need to click a separate link to open it and read it.)

Now Yahoo, where I get approximately 1,000 messages to my bulk folder per day - this is the only one that's gotten through to my inbox in the last day:
FROM THE DESK OF Mrs Queen Adams
BANK OF AFRICA [BOA]
OUAGADOUGOU, BURKINA FASO.

DEAR FRIEND,

I AM HOPEFUL THAT THIS MAIL WILL REACH YOU IN GOOD CONDITION OF
HEALTH.I AM MRS QUEEN ADAMS A STAFF OF BANK OF AFRICA AND A BURKINABE RESIDENT
IN BURKINA FASO ALSO.IN THE BANK WHERE I WORK AS AN AUDITOR,I
DISCOVERED AN ABANDONED SUM OF MONEY AMOUNTING TO 15.2MILLION DOLLARS BELONGING
TO DR GEORGE BRUMLEY WHO UNFORTUNATELY DIED IN THE PLANE CRASH OF UNION
TRANSPORT AFRICAN FLIGHT BOEING 727 IN KENYA, EAST AFRICA ON SUNDAY
Read the rest of this comment...

Re:I wonder if a spam can might be a good idea. (Score:2, Informative)

by Paco103 ( 758133 ) writes: on Wednesday August 09, 2006 @02:34PM (#15875657)

It's been done. Still going, and you can help. Don't know how effective it is, but read up
http://www.projecthoneypot.org/ [projecthoneypot.org]

*yawn* (Score:3, Informative)

by SCHecklerX ( 229973 ) writes: <greg@gksnetworks.com> on Wednesday August 09, 2006 @02:55PM (#15875797) Homepage

I doubt these would ever get by my greylisting. If they did, they would then have to get through the rudimentary checks (which most spam totally fails on), before finally being passed to spamassassin, where they would be properly classified and /dev/nulled.

Mimedefang has these things set up on my home server:
Reject if in spamhaus block list (it's easy to get yourself off of that one)
Reject if helo is not FQDN or IP address
Reject if sender tries to spoof as an address on my domain
Reject if sending SMTP server tries to issue a helo that is on my domain
Reject all RFC1918 helos from untrusted nets
Reject senders not in the lists they are trying to send to.
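
A few of the HELO checks in the list above can be sketched in Python as follows. This is not MIMEDefang configuration (MIMEDefang filters are written in Perl); it only illustrates the logic. The domain `example.net`, the function names, and the choice to apply the own-domain check only to untrusted clients are assumptions.

```python
import ipaddress
import re

RFC1918 = [ipaddress.ip_network(n)
           for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]
# Loose FQDN shape check: dot-separated labels ending in an alphabetic TLD.
FQDN_RE = re.compile(
    r"^(?=.{1,253}$)([a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?\.)+[a-z]{2,63}$", re.I)

def parse_ip(text):
    """Return an IPv4Address for '1.2.3.4' or '[1.2.3.4]', else None."""
    try:
        return ipaddress.IPv4Address(text.strip("[]"))
    except ipaddress.AddressValueError:
        return None

def helo_verdict(helo, client_ip, my_domain="example.net", trusted_nets=()):
    """Return a rejection reason, or None if the HELO passes these checks."""
    helo = helo.rstrip(".").lower()
    ip = parse_ip(helo)
    if ip is None and not FQDN_RE.match(helo):
        return "HELO is neither an FQDN nor an IP address"
    client = ipaddress.ip_address(client_ip)
    trusted = any(client in ipaddress.ip_network(n) for n in trusted_nets)
    if not trusted:
        if ip is None and (helo == my_domain or helo.endswith("." + my_domain)):
            return "HELO claims to be on our own domain"
        if ip is not None and any(ip in net for net in RFC1918):
            return "RFC 1918 address in HELO from untrusted network"
    return None

print(helo_verdict("localhost", "203.0.113.5"))
print(helo_verdict("mail.example.org", "203.0.113.5"))
```

These checks are cheap string and address comparisons, which is why it pays to run them before anything as expensive as content filtering.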

Between the mimedefang rules and the greylisting, spamassassin and my bayes filters rarely even have to process anything. This becomes very important as you scale a corporate system to thousands of users.

At work we also parse the headers to see if we are getting idiotic 'bounces' from misconfigured antispam vendors replying to spoofed mail.

We also implement SPF records.

Re:How to be smarter (Score:2, Informative)

by maird ( 699535 ) writes: on Wednesday August 09, 2006 @03:14PM (#15875924) Homepage

I suspect that you are not observing retries but, rather, attempts to deliver multiple messages. The technique I'm describing doesn't, as I understand it, rely on source IP address. So, the same IP address could attempt to deliver 50 messages and each one would be an independent candidate for the technique. That could explain both your observations and mine. You probably did the right thing to block the actual traffic given the amount of it anyway. Your observations make me consider adding a log of smtp connects to my firewall rules just so that I can satisfy my curiosity about the traffic.

Re:On a related tangent... (Score:2, Informative)

by erichschubert ( 96206 ) writes: on Wednesday August 09, 2006 @03:31PM (#15876035) Homepage

Been there, done that. Actually that was tried years ago. Doesn't work.

How do you expect the spammers to receive the error message? As you might know, the sender is faked.

Their software is flawed: it will even send the email body after you've said the recipient doesn't exist, when it should just go away. So they obviously don't even parse your return code... These zombies are dumb as shit.

And do you think they'll care?

They probably bought some DVDs with email addresses. They're read-only anyway. And after some months they'll just buy new ones.

If spammers (or more precisely, email harvesting companies, which are probably separate companies and might not even be violating the CAN-SPAM Act) are testing whether email addresses are alive, they are most likely to use a "legitimate-looking" email with hidden web bugs (!), which is one more reason not to use Outlook and similar software that loads web bugs. Or they use proper-looking unsubscribe links, which is one more reason not to click on them.

Re:Other way around? (Score:3, Informative)

by winnabago ( 949419 ) writes: on Wednesday August 09, 2006 @03:47PM (#15876139) Homepage

I know it's basic, but I'd like to add that if you have control of the HTML of the page that you are posting your email to, you can use a simple tool to confuse the mining bots. It doesn't work on forums like Slashdot, but a good scrambler that I've had success with is Enkoder [automaticlabs.com].
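
For a feel of what address scrambling looks like, here is a much simpler stand-in written in Python: encoding every character as an HTML numeric character reference. This is not what Enkoder does (as far as I know it emits JavaScript that rebuilds the link in the browser), and it is a weaker technique, since a harvester that decodes entities will still see the address. The function names are hypothetical.

```python
import html

def entity_encode(text):
    """Encode every character as a decimal HTML character reference."""
    return "".join(f"&#{ord(ch)};" for ch in text)

def mailto_link(address, label="email me"):
    """Build a mailto link whose href contains no literal address text."""
    href = entity_encode("mailto:" + address)
    return f'<a href="{href}">{html.escape(label)}</a>'

link = mailto_link("user@example.com")
print(link)
```

Browsers decode the references and show a normal clickable link, while a bot doing a plain text search for something@something in the page source finds nothing.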

I've wondered why more sites don't use Craigslist's method of temporary forwarding from an anonymous, random address that can be easily filtered if need be. Bandwidth?

Re:Other way around? (Score:3, Informative)

by Omestes ( 471991 ) writes: <omestes@gmail . c om> on Wednesday August 09, 2006 @05:14PM (#15876724) Homepage Journal

I've been using Spamgourmet.com for a couple of years now, with no complaints. It pretty much does what you describe: you create a temporary throw-away address with a limited forward amount, and everything after that is eaten. You can also make senders "trusted", and set your throw-away address to reply, if it is legitimate communication.

I get very little spam thanks to this (about 10 per week), while Spamgourmet has blocked 47,378 of 1,802 messages. The only problem is that the addresses are sometimes not allowed for online registrations, and it is a pain in the ass to write on real world forms, plus keeping track of 200+ message prefixes is a pain.

For example: slashdotDEMO.10.omestes@xoxy.net. This address will forward 10 messages to me; after that they all go into the void, so it can be added to any list, or whatnot, with no pain to me. Combined with my 3 spam filters (Gmail's, JunkMatcher, and Mail.app's), this means only about 1 spam per month reaches my inbox, with about 1 false positive per 3 months.
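
The counting behaviour of such an address can be sketched in Python. The `word.N.user@domain` layout is read off the example address above; the class and function names are hypothetical, and features like trusted senders are left out. This is not Spamgourmet's actual code.

```python
from collections import defaultdict

def parse_disposable(address):
    """Split 'word.N.user@domain' into (word, limit, user).

    Raises ValueError if the local part does not have three dot-separated
    fields or the middle field is not an integer.
    """
    local, _, _domain = address.partition("@")
    word, limit, user = local.split(".", 2)
    return word, int(limit), user

class DisposableForwarder:
    """Forward the first N messages sent to each disposable address, then eat the rest."""

    def __init__(self):
        self.received = defaultdict(int)  # address -> messages seen so far

    def handle(self, address):
        _word, limit, _user = parse_disposable(address)
        key = address.lower()
        self.received[key] += 1
        return "forward" if self.received[key] <= limit else "eat"

fw = DisposableForwarder()
results = [fw.handle("slashdotDEMO.10.omestes@xoxy.net") for _ in range(12)]
print(results.count("forward"), results.count("eat"))
```

Because the limit is carried in the address itself, a new address can be made up on the spot when filling in a form, with nothing to configure beforehand.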

Re:The text comes from the Gutenberg Project (Score:4, Informative)

by crabpeople ( 720852 ) writes: on Wednesday August 09, 2006 @05:30PM (#15876822) Journal

"Project Gutenberg caused probably the first ever spam,"

Close but incorrect. I believe it was an ad for some kind of seminar a guy was giving on the west coast. He was from the east coast and had no contacts to sell this product in the west, so he manually typed in hundreds of addresses. I don't know if I can find a link, but I remember reading about it.

OK, apparently googling for "first spam ever" yields this article [templetons.com]:

"The sender is identified as Gary Thuerk, an aggressive DEC marketer who thought Arpanet users would find it cool that DEC had integrated Arpanet protocol support directly into the new DEC-20 and TOPS-20 OS. I spoke with him to get his reflections on the event.

DEC was mostly an east coast company, and he had lots of contacts on the east coast to push the new Dec-20 to customers there. But with less presence on the west coast, he wanted to hold some open houses and reach all the people there. In those days, there was a printed directory of all people on the Arpanet. Gary spoke to his technical associate, and arranged to have all the addresses in the directory on the west coast typed in, and then added some customer contacts in other locations, including people at ARPA headquarters who did not, according to Thuerk, complain.

The engineer, Carl Gartley, was an early employee at DEC who had been called in to help with promoting the new Decsystem-20. They worked on the message for a few days, going through a few rewrites. Finally, on May 3, Gartley logged on to Gary's account to send the mail. "

So there you go. First spam: May 3, 1978. There's a reply to it from RMS too (his initial reaction was pro-spam, heh).

Re:Spam is dying (Score:5, Informative)

by dodobh ( 65811 ) writes: on Wednesday August 09, 2006 @07:42PM (#15877478) Homepage

I work for a fairly large email service provider. Spam isn't dying by any means. We just doubled production hardware last week to have enough smtp listener processes to be able to accept email. Bayesian is nice for the single user. For an ISP, it isn't. ISPs are bearing the brunt of the expense right now. The day I fear is when ISPs start to go under, or start charging for spam filtering, or simply stop.

Those boxes are running at sustained loads of 40+ and are CPU bound. That's a bit rare in the email world, as you would know if you have ever run a non trivial system in production.

That the spammers will send more spam is something that we have been observing in reality. I have seen AOL's numbers, and they are merely two orders of magnitude bigger than ours at the moment.

Re:I wonder if a spam can might be a good idea. (Score:3, Informative)

by LWATCDR ( 28044 ) writes: on Wednesday August 09, 2006 @08:37PM (#15877760) Homepage Journal

I am a native speaker, but I am dyslexic. Also, I am not really feeling well. And yes, you were being a sh*t. Good grief, this is a stinking message base, not an English exam or a resume. Judge the content and not the grammar or spelling.
Making fun of my typos is right up there with making fun of a blind guy tripping.

Re:Other way around? (Score:1, Informative)

by Anonymous Coward writes: on Thursday August 10, 2006 @06:42AM (#15879274)

You really have to post the whole exchange somewhere. In http://www.419eater.com/ [419eater.com], for instance.

Re:Other way around? (Score:3, Informative)

by Omestes ( 471991 ) writes: <omestes@gmail . c om> on Thursday August 10, 2006 @12:59PM (#15882197) Homepage Journal

how long did it take for the spam bots to send 10 messages to this address

Oddly, no spam yet. It does take a bit of discipline to begin with, but after a while it becomes habitual to use it on web forms and such, though there are lapses, which explains the amount of spam I do get. As for dictionary mailers, the solution is easy: use an obscure word that probably isn't in them. My address, with spam blocking, is above, and it really is not a common word (without me, there are about 20 hits on Google), and it is rather easy to tell via word of mouth (unlike, say, anthroporraistes@emailaddress.com, which would be a pain in the ass).

And then there are a few after-the-fact moves, such as the ever-so-handy bounce feature. Right now I don't trust server-side filtering, though; I want spam to get to my mailbox (at least Google's) so I can make sure I don't miss anything, and to better train my filters.
