Spam

Fighting Spam with DNA Sequencing Algorithms

Posted by CmdrTaco
from the crushing-the-mouse-with-a-mallet dept.
Christopher Cashell writes "According to this article from NewScientist, IBM's Anti-Spam Filtering Research Project has started testing a new spam filtering algorithm, an algorithm originally designed for DNA sequence analysis. The algorithm has been named Chung-Kwei (after a feng-shui talisman that protects the home against evil spirits). Justin Mason, of SpamAssassin, is quoted as saying that it looks promising. A paper is available on the algorithm, too (PDF)."

  • by simp (25997) on Sunday August 22, 2004 @09:12AM (#10037241)
    Excellent! This will go well with my Feng Shui compliant wall of rocks that I use as a firewall.
    • by Anonymous Coward
      Excellent! This will go well with my Feng Shui compliant wall of rocks that I use as a firewall.
      Make sure you have some moss or other greenery to balance its hardness, and ideally some water too. For a fully integrated experience, use a themed wallpaper like Stonehenge on your desktop.
    • by Pigbot (797016) on Sunday August 22, 2004 @09:54AM (#10037392)
      Considering how much spam I get trying to sell me Viagra or porn, I have reservations about using someone's DNA to fight spam. It just sounds dirty. And sticky. Like someone should at least buy me dinner first.
    • by BJH (11355) on Sunday August 22, 2004 @10:02AM (#10037420)
      If I'm not mistaken, Chung Kwei is the figure known as Shouki in Japanese. He's usually described in English as the "Demon Queller", which seems a suitable-enough symbol for an anti-spam program.

      I mean, come on - don't anti-spam programs have the coolest names? SpamAssassin, Vipul's Razor...
    • It's hardly appropriate that such superstition should be given encouragement in this day and age. Penn & Teller did a great bit on "feng shui" on their show, "Bullshit!". They had 3 different feng shui consultants come in to a house, and each one recommended different changes for different reasons. Some discipline.
    • "We put the CPU in the center, because that is the chi, or life force for the entire board. A centered chi provides better performance." Now don't you want one? [bbspot.com]
  • Wordfilter (Score:3, Insightful)

    by bert.cl (787057) on Sunday August 22, 2004 @09:18AM (#10037267)
    While the numbers are impressive, this just looks like a filter that does combined wordsearches?

    Even with training, isn't this just some regexps searching for particular strings?

    And what about short messages that don't use as many words? Is the spam score relative or absolute? The article is a little low on details; can anybody point to some more informative articles?

    • Re:Wordfilter (Score:3, Interesting)

      by rokzy (687636)
      91% detection is far from impressive. AFAIK the better filters today are 99.9% successful. the benefit of this one is its low false-positive rate.

      personally I'd prefer a much better set of filter tools e.g. being able to say "I only speak English, I NEVER use this account for commerce, and the people I email are professionals so score spelling mistakes much higher as probable spam".

      can someone point me in the direction of such a filter?
      • Re:Wordfilter (Score:4, Informative)

        by Incadenza (560402) on Sunday August 22, 2004 @09:33AM (#10037320)

        personally I'd prefer a much better set of filter tools e.g. being able to say "I only speak English, I NEVER use this account for commerce, and the people I email are professionals so score spelling mistakes much higher as probable spam".

        can someone point me in the direction of such a filter?

        How about spamassassin?
        Just add the following to /etc/mail/spamassassin/local.cf:

        ok_languages en

        And increase the score for BIZ_TLD and other tests you find more important than others. Scoring per test is fully configurable, complete list of tests here [apache.org].

    • My sentiment: Regex schmegex, so long as it works, and keeps working.

      But really- have a new algorithm that's not perfect? Work on it. More algorithms to choose from cannot mean anything but better antispam solutions.

  • Mozilla Firefox (Score:2, Insightful)

    by nycsubway (79012)
    I have to say the adaptive spam filter in Firefox works pretty darn well. I have tried other adaptive spam filters as plugins in Outlook and they work pretty darn well too.

    With the nature of new spam messages that look like real emails, the only person who can really tell if something is spam is the recipient.

    • Thunderbird (Score:2, Informative)

      by bert.cl (787057)
      I think you mean Mozilla Thunderbird?
    • Re:Mozilla Firefox (Score:3, Insightful)

      by rokzy (687636)
      I've had mixed results with Thunderbird. in the beginning it seemed to work great, then I noticed it was junking all my legitimate email too. then I fixed that but it started letting through blatantly obvious stuff.

      the newest version has been doing better so far.

      I think my problem is my rate of email is quite low so it's difficult to train. I'd like it if there could be a database where if a subject header is reported as spam by one user it affects other users' scoring.
      • My biggest issue with Thunderbird is the bounce messages. A fair amount of people forge addresses which bounce to me (I'll be putting up SPF Real Soon Now, but that doesn't even mean everyone will read it). As a result, I get some legit bounce messages and some with spam in 'em. If I mark the ones with spam as Junk, I risk throwing away the ones without spam. If I mark the ones with spam as not-junk, I get spam which is similar to them thrown into my Inbox.
      • I'd like it if there could be a database where if a subject header is reported as spam by one user it affects other users' scoring.

        There are a few databases out there that take hashes of spam e-mails (either sent to spam traps or reported) and use them for spam tagging. SpamAssassin can use their client programs to help tag messages also - I don't know if there's an extension or anything for Thunderbird, I don't use it.

        The three that come to mind are DCC [rhyolite.com], Razor [sourceforge.net] and Pyzor [sourceforge.net].

        All have their advantages
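        A rough sketch of the shared-hash idea in Python, for anyone who hasn't seen it. Assumptions, loudly: this uses a plain SHA-1 digest over a whitespace-normalised body and an in-memory dict standing in for the shared server. The real services use their own fuzzier checksums and network protocols, so `body_digest`, `SharedSpamDB`, `report` and `is_spam` are made-up names for illustration only.

```python
import hashlib
import re


def body_digest(body: str) -> str:
    """Hash a message body after collapsing whitespace and lowercasing.

    Assumption: a plain SHA-1 digest; real services use fuzzier checksums.
    """
    normalised = re.sub(r"\s+", " ", body).strip().lower()
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()


class SharedSpamDB:
    """Hypothetical in-memory stand-in for a shared spam-hash server."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold  # reports needed before a hash counts as spam
        self.reports = {}           # digest -> number of reports received

    def report(self, body: str) -> None:
        """Record one user's (or one spam trap's) report of this body."""
        digest = body_digest(body)
        self.reports[digest] = self.reports.get(digest, 0) + 1

    def is_spam(self, body: str) -> bool:
        """True once enough independent reports share this body's digest."""
        return self.reports.get(body_digest(body), 0) >= self.threshold
```

        The threshold is the design choice worth noticing: requiring more than one report keeps a single mistaken (or malicious) report from tagging everyone else's mail.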
      • I'd like it if there could be a database where if a subject header is reported as spam by one user it affects other users' scoring.

        One of my accounts is a catch all for a domain which has gotten addresses misentered into both legitimate mailing lists and as the erroneous e-mail address of people who are copied and sometimes even directly addressed by genuine personal e-mails. But to me they are all equivalent to spam, so if I was reporting spam to some authoritative list there would likely be an outbreak

    • Re:Mozilla Firefox (Score:2, Interesting)

      by danharan (714822)
      I think you mean Thunderbird.

      My experience with it has been rather disappointing. Why I need to tag as spam two messages from the same sender or with the exact same subject is a mystery to me. After the 10th "Make $\d+ in XX days" type message one has to wonder just how effective this thing is.

      This method is promising because it uses spell-checking and a better way to identify spammy string sequences, something neither of the two main camps of spam filters has seemed keen to do until now.
      • Re:Mozilla Firefox (Score:3, Interesting)

        by littlem (807099)

        My experience with it has been rather disappointing. Why I need to tag as spam two messages from the same sender or with the exact same subject is a mystery to me. After the 10th "Make $\d+ in XX days" type message one has to wonder just how effective this thing is.

        This shouldn't be all that surprising - Bayesian filtering is all based on probabilities. The reason "Outlook message rules" is so bad is because a friend of mine might send me a joke about Viagra, which I don't want to have deleted indiscri

    • I really like the programs, but I get their names confused... I meant Mozilla Thunderbird in the above post.

    • by aussie_a (778472) on Sunday August 22, 2004 @10:31AM (#10037512) Journal
      I agree. The Mozilla Firefox spam filter works great for me. I no longer go to all those goatse sites that people link to thanks to the plugin :) But I have to keep uninstalling and reinstalling it, because after 2 days it says slashdot is spam.
    • Re:Mozilla Firefox (Score:3, Interesting)

      by toxic666 (529648)
      "I" being the key word in your assessment. Fine for the home user, not so good for a business.

      Maintaining an enterprise mail system based upon user-controlled spam filtering software is not practical. That small percentage of users with consistent ID 10T errors adds up fast. Try correcting false positives for a user-configured filter. It's time-consuming.

      The better approach from an administrative standpoint is controlling spam at the MTA- and MDA- levels of the mail server. I use postfix checks with
    • The problem I've been having is that spammers have stopped, well, spamming. They have a subject reading "Get new Vi'agra" or whatever, and the body is filled with those random words - I couldn't find any advertising whatsoever.

      I mean, how are these twats going to get even the most floppy, lazy, frustrated 99 year old to buy their product by telling him "rankin decisionmake portraiture approval slothful clamber teutonic activism alcoa tofu wakeful polonaise burt afghan lad sedimentary pennyroyal aristotelea

  • High tech for what ? (Score:3, Interesting)

    by Ozh (514694) on Sunday August 22, 2004 @09:21AM (#10037276)
    Funny how some people develop more and more sophisticated stuff to fight against something that is just as simple as sending out emails to random addresses... and so simple that it will never stop :/
    • Funny how some people develop more and more sophisticated stuff to fight against something that is just as simple as sending out emails to random addresses...

      This is just like your own immune system, which uses such things as "V-D-J" recombination (and other tricks) to create billions of somewhat random, different antigen receptors to attack potential unknown pathogens. The cells carrying them must then be further educated not to attack "self" in your own body.

      If only computer geeks took some lessons from biologists, perhaps they could get

      • If only computer geeks took some lessons from biologists, perhaps they could get a grip on principles to stop SPAM.

        Doesn't Bayesian filtering work somewhat like the immune system? After being exposed to the "environment" it learns what is "self" and what is "pathogen" and starts distinguishing one from the other pretty reliably. I currently use a server-side Bayes filter on my email and I get 99.5% accuracy with very little manual intervention. And it gets more and more accurate the longer you use it.. unlik

  • This isn't "fighting spam", it's "adapting to spam".
    • Not really. As more and more people begin to use spam filtering (especially on the server level), spam's effectiveness will decrease.
      • As more and more people begin to use spam filtering (especially on the server level), spam's effectiveness will decrease.

        People have been improving filtering, and the spammers just pump up the volume. As filtering improves, the delivery rate goes down, but so does the complaint rate so they end up being able to pump more spam before they're detected.

        I've been watching this arms race for almost a decade, and the advantage is still on the spammer's side. At the moment I'm blocking between 10,000 and 20,000 connections a day just on the basis of their IP address (including blocks against entire countries), another 3-5,000 using a greylist/honeypot app I'm working on, and I'm still getting one or two hundred messages per day hitting my procmailrc. A few years back, when I was getting a few hundred spams a day without all those RBLs and personal blacklists, people were all excited about how bayesian filters were gonna make spam uneconomical... and I made the same comment back then. Now I'm filtering a couple of hundred times more efficiently and effectively and I'm still getting almost the same volume.

        I don't see anything different this time. You can't fight spam with filters, all you can do is adapt to it.
      • Wrong.

        The effectiveness of the spam that's blocked decreases, but the potency of the spam that gets through skyrockets, since it stands alone. This alone is motivation for spammers to triple their efforts. I'm sure the more talented spammers out there nearly jizz themselves as they run their latest crafted email through their local "test servers", see it pass through all the filters with ease, and hit the SEND button.

        Until there are new methodologies to prevent the "ability" to spam, period, everything else

    • I totally agree. Allowing unwanted files onto your system just because 'they' know the address is USER ERROR. These FILTERS are ASKING FOR SPAM!!!

      This middle-market-merchandising-madness has to stop. Bill Gates and attendant remora-ware are getting richer and richer each and every day.

      I guess if politicians can't figure out that their own computers aren't safe, or how to tax internet transactions, then we can't bloody rely on them to stop consumer gouging either can we?
      1. Acquire domain and setup your
      • yours is closest to the best idea, IMO. All email-in should be blocked by default, and only whitelist allowed in through the filter. You can use a form on a web page for a first contact.

        I'd also like to see email addys be treated exactly the same as a snail mail street address addy or a telephone number, ie, make them cost to get, so they are treated correctly. We register domains, why not email addys? If it cost 10$ a year (something like that) to register an email addy, there would be no incentive for th
        • If it cost 10$ a year . . . to register an email addy, there would be no incentive for the spammers to throw the dictionary at domains, and conversely, the spammers couldn't/wouldn't want to create thousands of email addys to spam from.

          I had not heard that angle before. That rocks! You'd think it would be the sort of thing a politician could wield in court too.

          It's strange to me that there are a whole slew of laws concerning other modes of communication, but the internet is slow to be regulated. I
      • Right now, just requiring a keyword on your subject line is more than enough protection to effectively block all spam that's not forged from your whitelisted addresses.

        Yes, spammers do successfully guess whitelisted addresses, by stealing people's address books and mailboxes through viruses and guessing that if you're in someone's address book or they've got mail from you then you're whitelisted from them.

        So, it's an effective filtering mechanism for now, but eventually you'll have to require something be
  • by Admiral Justin (628358) on Sunday August 22, 2004 @09:30AM (#10037308) Homepage Journal
    For now, Bayesian filtering still gets the job done most of the time, so I think we shouldn't get too excited.

    Besides, you have to ask yourself some questions...

    "What happens if you try to filter spam with RNA?"

    "Just how good can ACT and G manage spam?"

    and, most important of all...

    "Are you sure this spam filter uses no portion of Keanu Reeves' genetic code?"
  • Love SA... (Score:5, Informative)

    by ajs (35943) <ajs@ a j s . c om> on Sunday August 22, 2004 @09:41AM (#10037352) Homepage Journal
    You have to love SpamAssassin for its very Perlish approach to spam filtering... "hey, there's a cool new way to filter spam... throw it in!"

    I love this mostly because it means that SA is a moving target. Spammers can figure out how to defeat pieces of it, but it deploys a wide range of static, dynamic, network-based and user-driven tests that change so much that spammers simply can't afford to keep up.
  • by Rahga (13479) on Sunday August 22, 2004 @09:46AM (#10037366) Homepage Journal
    It looks like much of the spam I'm receiving today consists of either nearly blank e-mails or e-mails containing news articles that seem designed to pass through content filters just so users can send them back to their admins as spam, essentially making it easier for Bayesian filters and such to mark legitimate e-mail as spam... though honestly, it's more of an annoyance for me, as it makes it easier for users to say "The spam filter isn't working, what are you doing wrong?"
  • Wrong title, I guess (Score:5, Interesting)

    by stm2 (141831) <sbassi&genesdigitales,com> on Sunday August 22, 2004 @09:47AM (#10037368) Homepage Journal
    According to the /. title, it seems they used an algorithm for DNA sequencing, when in fact they used an algorithm for DNA analysis (or DNA sequence analysis, which is the same thing), more specifically, gene-finding techniques. As you may know, most DNA in a genome is not translated into protein (some people still call it junk, but most of it is not junk at all). So there are programs to sort genes out from the rest of the DNA.
    I think we will see more and more applications like this with the growing cross-pollination between Biology and CS.

  • I'd love to meet the scientist that thought this up. It probably went something like this:
    Boss: Well, we've made promising gains in the DNA research project. Now what applications could this be used for?
    Engineer: The possibilities are limitless! We could cure cancer! We could invent a super puppy that combines the abilities of a lovable puppy and Tux, the friendly Linux penguin! We could use it to regenerate limbs for amputees!
    Marketing: Let's use it to get rid of spam emails!
    Boss: Great idea! Let's
    • Actually, given the current climate it would be more profitable to cause spammers cancer or remove their limbs automatically for each spam sent or kill their puppies or something like that.
  • by G4from128k (686170) on Sunday August 22, 2004 @09:53AM (#10037389)
    This is interesting and promising technology. But like all antispam techniques, spammers will find a way around it. Once spammers get a copy of the software, they can create and test countermeasures in the comfort of their own sleazy lairs.

    For example, the article mentions the software accepts a message that is long but has a few "spammy" sequences. This suggests an immediate countermeasure of adding bulk to spam -- appending a copy of some news article to the spammy payload (some already do this).

    Personally, I've always thought that a simple spell check would do a good job as another layer of filtering. It would place spammers in a no-win situation -- either the keyword filter or the spell check filter would get them.
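    To make the no-win situation concrete, here is a minimal Python sketch of the two layers stacked together. It assumes a toy word list in place of a real dictionary, and the names `KNOWN_WORDS`, `SPAMMY_WORDS`, `misspelled_ratio` and `looks_like_spam`, as well as the 30% threshold, are invented for illustration rather than taken from any existing filter.

```python
import re

# Assumption: a toy word list; a real layer would load a full dictionary.
KNOWN_WORDS = {"buy", "cheap", "viagra", "now", "the", "meeting", "is", "at", "noon"}
SPAMMY_WORDS = {"viagra"}


def tokenize(text: str) -> list:
    """Split text into lowercase runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def misspelled_ratio(text: str) -> float:
    """Fraction of tokens that are not in KNOWN_WORDS (0.0 for empty text)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    unknown = sum(1 for t in tokens if t not in KNOWN_WORDS)
    return unknown / len(tokens)


def looks_like_spam(text: str, max_misspelled: float = 0.3) -> bool:
    """Flag a message if either the keyword layer or the spelling layer trips."""
    has_keyword = bool(set(tokenize(text)) & SPAMMY_WORDS)
    return has_keyword or misspelled_ratio(text) > max_misspelled
```

    Spelling the keyword correctly trips the first layer, and mangling it into something like "v1agra" pushes up the misspelling ratio and trips the second. The obvious weakness is the one already mentioned above: bulk up the message with correctly spelled filler and the ratio drops.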
    • Good point - that's why, in theory, closed-source software that isn't available for free download or in an open-source version should be more effective against spam.

      Spell checker as anti-spam filter - that would create huge problems for most Americans :-)
      Otherwise it's a good idea.
      • in theory, closed-source software that isn't available for free download or in an open-source version should be more effective against spam.

        How so?

        1) install software
        2) treat as black box
        3) spam spam spam
        4) see what gets through
        5) study, enhance
        6) goto 3)

        Just because you can't see how it works, doesn't mean you can't teach yourself how to get around it.
        • Or... (Score:3, Funny)

          by sean.peters (568334)
          1) Acquire software
          2) Decompile
          3) Study code
          4) Develop countermeasure
          5) spam spam spam

          It's not like spammers care about the EULA that says they can't look at the code. Oh, and before I forget...

          6) ???
          7) Profit!

          Sean
          • ... but I think the combination of parent and grandparent let me finally see the light on the issue of what these three question-marks are supposed to be:

            1) Collect underpants
            2) Goto 1
            3) Profit

            See, it does make sense!

    • Notwithstanding accepted wisdom espoused above, random words cannot defeat current statistical spam filters, and it is difficult to defeat such filters even if you have access to the algorithm and the recipient's mailbox.

      John Graham-Cumming presented a talk Beating Bayesian Filters at the 2004 Spam Conference [spamconference.org] detailing these results. A video recording is available; alas, no paper.

      In conducting a recent spam filter evaluation [uwaterloo.ca] I observed (but did not report) that the statistical filter attacks were not

      • That's slightly incorrect. It depends on the filter algorithm used.

        Some statistical algorithms only pick a small number of tokens according to some rationale or other (e.g. most extreme scores). For such algorithms, the padding attack is a very good idea, as with enough random words, one or more of these should have a sufficiently extreme score (so that it replaces a more legitimate token in the list of considered tokens), although whether an extreme score can be synthesised randomly would depend on the c
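        A small Python sketch of why the padding attack can work against that kind of algorithm. It assumes a Graham-style filter that keeps only the n tokens whose spam probability is farthest from 0.5 and combines them with a naive product formula. The function names, the default of 0.4 for unknown tokens and the choice of n are my own assumptions for illustration, not any particular filter's code.

```python
from math import prod


def most_extreme(tokens, spam_prob, n=3, unknown=0.4):
    """Return the n token probabilities farthest from the neutral 0.5.

    Tokens missing from the table get the assumed default `unknown`.
    """
    probs = [spam_prob.get(t, unknown) for t in set(tokens)]
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]


def spam_score(tokens, spam_prob, n=3):
    """Combine only the selected probabilities into one score.

    Assumes no probability is exactly 0.0 or 1.0, so the sum below is nonzero.
    """
    probs = most_extreme(tokens, spam_prob, n)
    spammy = prod(probs)
    hammy = prod(1.0 - p for p in probs)
    return spammy / (spammy + hammy)
```

        With a hypothetical table where "viagra", "pills" and "cheap" score 0.99, 0.95 and 0.90 and where "thesis", "seminar" and "advisor" score 0.01, 0.02 and 0.01, the message "cheap viagra pills" scores above 0.9, while the same message padded with those three words drops to well under 0.1, because two of the hammy words push "pills" and "cheap" out of the top three. Truly random padding only has this effect if some of the random words happen to carry extreme ham scores for that recipient.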

    • Personally, I've always thought that a simple spell check would do a good job as another layer of filtering.

      Then 3/4 of slashdotters wouldn't be able to get their messages through to anybody :-)
  • OT, anyone half decent knows 'feng-shui' is a fake thing /* like astrology, tarot-card reading, ... */. ... it's a belief system, and as long as you believe in it your mind will make it real for you ... no *real* scientific studies back them up *AFAIK*.

    and btw, WAKE up ppl. 'Filtering' won't make SPAM *ever* go away. As long as you keep on filtering, I guess, it'll act as a cure/remedy that 'relieves pain', but it isn't a cure/remedy that'll kill 'cancer' for good.

    And from a different sidenote, 'Filter

    • ...no *real* scientific studies back them up *AFAIK*.

      Let me get this straight... You are claiming that feng-shui is fake because science doesn't back it up, then you disclaim that claim by saying you don't really know if science backs it up or not. Ok.

      SPAM eats like *what was it* 60-80% of the total broadband (world wide) now?!

      This recent article [msn.com] says that about 80% of the e-mail in the US is SPAM... but e-mail is just a small portion of all internet traffic, less than 5% in many locations such as u
      • >Let me get this straight... You are claiming that feng-shui is fake because science doesn't back it up, then you disclaim that claim by saying you don't really know if science backs it up or not. Ok.

        no, I think you misinterpreted me. I claim (from what I've read (scientific or otherwise )) that feng-shui is a fake thing. Hence the "AFAIK".

        I don't claim I know more than I know, and if you know you know more than I know, then by all means, let me know. I sure would like to know as much as you know,

  • This isn't going to work -- you simply can't solve a social / legal problem with technology. The only way we are going to get rid of spam is if the U.S. makes it a crime, but there is no sign of that: The new law in fact has done nothing less than legalize it. Don't get your hopes up for a new one: Congress gets too much money from industry and too few Americans care to vote that it is a no-brainer for it to support the spam-makers over the citizens -- I'm sorry, the correct word these days is "consumers",
    • No offense, but there are plenty of examples of (at least partial) technological solutions to social problems. For instance, the ignition lock on my car prevents people from casually stealing it.

      This might not solve the social problem of people wanting to steal cars, but is a decent try at solving the technological problem of people being able to easily do it.

    • This isn't going to work -- you simply can't solve a social / legal problem with technology.

      You'll be buying all your doors without locks from now on, I take it, since burglary is a social/legal problem and the government has passed laws against it. Let us know how that goes.

      • And the locks alone without the laws would have solved the problem of burglary? I kind of doubt that...

        The law alone will of course not make the spam magically go away completely, but it will make sure that sending spam becomes a pretty risky business instead of a completely risk-free one, so people might think twice before sending out a million spam mails. Sure, this won't stop people from other countries, but reducing spam from the USA would be a pretty good start.

    • The US now has Federal laws against spam, as well as a number of state laws. The CAN-SPAM law theoretically legalized some forms of spam, but in practice it had no effect - one well-reported study says that about 3% of spam made a pretense of compliance when it first came out, but it's now down to 1% or so, and I saw less effect from California's anti-spam laws. Scotty Richter's OptInRealBig made that pretense, and they're gone, but the pretense was really just to slow down the process of getting kicked o
  • by dnaboy (569188) on Sunday August 22, 2004 @10:09AM (#10037443)
    I think it's really interesting to watch the literal evolution of spam and spam filters. There are really amazing parallels to biological evolution.

    First, there's a constant tuning of both predator and prey (Anti-spam tools and spam).

    Second, there seems to be some sort of equilibrium which is inevitably achieved, and

    Third, there are occasional discrete major developments which change the game. This would be an example. Now, spam is going to be forced to majorly adapt.

    I could see the 'Quality' of spam improving a lot as a result of tools like this. No more letters from my long lost benefactors in Nigeria, and no one-liners about 'Gushing like a firehose' (My coworkers and I got a good chuckle out of that one), but, as the story said, if you have keywords in a long email, it gets penalized far less. OK. Attach verses from Dante's Inferno, or Joyce's Dubliners to the email. Problem solved. You can't block words like viagra altogether or Pfizer researchers are going to have a hell of a time getting anything through.

    Another concern is that if this forces spammers to make up new and compelling spam, people will be more likely to check it out. While my parents are probably pretty confident they didn't win a secret lottery 3 or 4 times last week, they might possibly believe new and creative stories.

    Perhaps evolution of email readers is just plain going to be a necessary part of the solution...

    • First, there's a constant tuning of both predator and prey

      Absolutely. Unfortunately, as most predator-prey models will tell you, neither population ever goes to zero unless something catastrophic happens. And in this case, catastrophe is precisely what we want to happen to the prey.

      (If they'd simply implement my proposed scheme of a bullet to the head of every spammer, no mercy, no appeal, it'd be easy. But noooo, "spammers are human beings no matter how useless and harmful they are," waaaaah.)

    • "Gushing like a firehose." That's good, but can it compare to "Scientists find new black hole!" I thought I was getting the weekly mailing from Nature.
  • Corrections... (Score:3, Insightful)

    by littlewild (733743) on Sunday August 22, 2004 @10:26AM (#10037498) Homepage
    Chung-Kwei is a Chinese semi-deity that wards off evil. He isn't some kind of talisman.
  • Congratulations /.

    By now, all the patent-trollster-lurkers who passively phish in the /. pool must be rushing with suitably edited claims to their friendly neighborhood USPTO.

    Can anyone who works in IP (intellectual property, NOT Internet Protocol) post a list of known trollster companies that are full of lawyers who acquire patents (by any means) and make patent litigation their primary business model?
  • This will make another nice tool to identify spam. But why not use greylisting at all the ISPs' MTAs to simply refuse 99% of the spam that is being sent right now?

    Seriously, greylisting implemented on all the ISPs' MTAs would overnight block 99% of the spam being sent. Most spam at the moment is being sent from armies of bots run on unsuspecting users' systems connected to cable and DSL service. The programs used are unsophisticated, they churn through a list of addresses spewing messages out by the thous
    • Seriously, greylisting implemented on all the ISPs' MTAs would overnight block 99% of the spam being sent. Most spam at the moment is being sent from armies of bots run on unsuspecting users' systems connected to cable and DSL service. The programs used are unsophisticated, they churn through a list of addresses spewing messages out by the thousands. They do not queue messages or retry them if they get an error. Greylisting uses this to great effect and blocks

      • But as soon as they write their code to queue the message and retry it after the delay period we pull another little trick out. During that delay period the spammer is sending out spam to lots of other sites, including a few spam traps. The spam traps add the spammer address to an RBL. When they get back to your system after the delay period you check the RBL list and drop the message now that it is showing up there.
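        Roughly, the combined check could look like the Python sketch below. It is only an illustration under stated assumptions: the greylist is keyed on the usual (client IP, sender, recipient) triplet, the RBL is a plain set of IPs that something else (the spam traps) fills in, and the class name, method name and 300-second delay are all made up for the example.

```python
class Greylist:
    """Toy greylist keyed on (client IP, sender, recipient) triplets."""

    def __init__(self, delay_seconds=300, rbl=None):
        self.delay = delay_seconds
        self.first_seen = {}  # triplet -> time of the first delivery attempt
        # Assumed stand-in for an RBL lookup: a set of blacklisted client IPs.
        self.rbl = rbl if rbl is not None else set()

    def check(self, ip, sender, recipient, now):
        """Return 'reject', 'defer' or 'accept' for one delivery attempt."""
        if ip in self.rbl:
            return "reject"  # listed by a spam trap while we made it wait
        triplet = (ip, sender, recipient)
        first = self.first_seen.setdefault(triplet, now)
        if now - first < self.delay:
            return "defer"   # temporary failure: a real MTA will retry later
        return "accept"
```

        A dump-and-spew bot never comes back after the first "defer", and one that does learn to retry has spent the delay period hitting spam traps, so its second attempt runs into the RBL check instead.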

        In over a year the spammers have not done anything different but dump and spew. You s
    • If greylisting were such a magic bullet solution, lots more ISPs would be using it. While the most important cost of spam is the wasted time of the recipients, the most direct economic costs are to companies that provide mailboxes for users (i.e. ISPs and email outsourcers), and they'd not only love to avoid the direct costs, they'd love to have a big competitive advantage over other providers. So if it were easy to implement and worked really really well, they'd jump at it.

      That doesn't mean it's not a h

  • Summary
    1) Make your PC face the North, whenever you are checking Email.
    2) Hang a metal windchime above your workstation.
    It is important that the rods of the windchime be hollow, so that the auspicious Chi can rise up the chimes.
    3) Add a user account for the Dragon Turtle & make him the admin.
  • by mcrbids (148650) on Sunday August 22, 2004 @12:16PM (#10037933) Journal
    It's my belief that the most likely source of the birth of Artificial Intelligence will be the SPAM filter.

    Think about it - we now have software that "learns" what you like. [nuclearelephant.com]

    Sorry, but anything that "learns" fits a definition of intelligence - using past results to predict future outcomes. Note that I'm not saying "self aware" or "conscious", simply "intelligence".

    As we move forward, we'll see more and more intelligence on the part of the spammers, and the warring factions of intelligence will likely provide massive financial and political impetus to build ever more intelligent solutions - thus AI is born.

    The problem with other vehicles for developing AI is simply the budget. With SPAM, everybody has a direct, financial incentive to develop it, so development will definitely happen!

    • by Anonymous Coward
      You are 40 years behind the times. While it's chic to filter your spam using naive Bayesian text classifiers, don't kid yourself. Machine learning and text classification have been around since the 1960s.
  • I think over the next 2 decades, we'll come to a greater understanding of life - and I think that we'll discover a unique aspect of life - that life is truly information technology.

    Each cell in your body contains approximately 20 GB of data. Consider the redundancy and sheer massive size of information storage capacity your body consists of! Compare THAT to an Oracle cluster...

    So, given the incredible need to process information in order to understand life itself (which could be considered a form of self-rep

  • I have tried just about every single anti-spam software out there, so I have some experience. After being fed up with getting false positives and having to deal with tons of spam getting past the spam filters I tried out Cloudmark's Spamnet - a community based approach to fighting spam. So far it has been 95-99% effective with 0 false positives which is the most important factor for me.

    In the past couple of months it has blocked 19,221 spam messages. I don't even bother to send spam to a Spam folder a
  • I just went through a couple of rounds of interviews with a spam filtering company about doing something similar. The problem these days is that spammers have figured out that "V1AGRA" can be spelled in a number of ways which fool word-based spam filters. There is also a lot of hidden information, such as HTML and URLs, which may be significant, but is difficult to identify with exact string matching.

    The approach used to be:

    1. Find features (usually well-delimited words) in the message.
    2. Look up the
  • by po8 (187055) on Sunday August 22, 2004 @01:08PM (#10038237)

    As someone who's done some research on machine learning for spam filtering, this sure looks to me from their 8-page paper like yet another simplistic ML algorithm advocated by folks who don't know the field and tested using techniques of questionable sensitivity. Their "novel" method sounds an awful lot like feature set construction by clustering, a method that is widely used in the spam filtering literature, but with a somewhat novel clustering technique from biology.

    Message filtering starts by throwing away line breaks for no obvious reason, then optionally removing the known ham from the training set for no obvious reason. Message headers are then thrown away, for no obvious reason.

    No general method is given for corpus allocation. In the experiment reported later, the original corpus appears to have been split roughly in half. (For unreported reasons, none of these splits are exact. No rationale is given for the various corpus allocations.) The training corpus is then split into ham and spam, and the ham portion is split in half. The spam training corpus is used for "positive training": determining a complex feature set as described below. One half of the ham training corpus is then used for "negative training": filtering out complex features that are common in ham. The remainder of the ham corpus is used as a validation set to select thresholds described below. No justification is given as to the failure of the validation set to include spam messages, and the procedure is vague on this point.

    The description of the key "positive training" phase is difficult to follow: it seems to assume the pre-existence of the "SPAM vocabulary" [sic] being constructed. The key idea seems to be to use positional index of words within the body as base features, and construct complex features by using a pattern recognition algorithm to find correspondences between sets of base features across spam messages. Patterns that appear across many spam messages are treated as indicating spam.

    The final training step is to set thresholds for (1) minimum number of complex features in the spam message and (2) fraction of the message text covered by the complex features. One would expect these two criteria to be highly correlated: no effort appears to have been made to enforce or explore their orthogonality.

    The classification phase proceeds by simply counting the number of patterns in a given test message and the percent coverage of the message by the patterns. If the result exceeds both thresholds, the message is classified as spam.
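
    The two-threshold classification step described above can be sketched roughly as follows. This is only an illustration of my reading of it, not the paper's code: it assumes the learned patterns are plain substrings (the real patterns are more elaborate), and the names `find_spans`, `coverage`, `is_spam`, `min_patterns` and `min_coverage` are made up for the example.

```python
def find_spans(text, patterns):
    """Return (number of distinct patterns present, list of (start, end) spans).

    Assumption: every pattern is a non-empty plain substring.
    """
    matched = 0
    spans = []
    for pat in patterns:
        start = text.find(pat)
        if start == -1:
            continue
        matched += 1
        # Record every occurrence of this pattern, including overlapping ones.
        while start != -1:
            spans.append((start, start + len(pat)))
            start = text.find(pat, start + 1)
    return matched, spans


def coverage(text, spans):
    """Fraction of the characters in text covered by at least one span."""
    if not text:
        return 0.0
    covered = 0
    last_end = 0
    for start, end in sorted(spans):
        start = max(start, last_end)  # skip the part already counted
        if end > start:
            covered += end - start
            last_end = end
    return covered / len(text)


def is_spam(text, patterns, min_patterns, min_coverage):
    """Flag a message only if BOTH thresholds are reached."""
    matched, spans = find_spans(text, patterns)
    return matched >= min_patterns and coverage(text, spans) >= min_coverage
```

    Since a message containing many patterns will usually also have high coverage, a sketch like this makes it easy to see why the two thresholds are unlikely to be independent.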

    For the empirical evaluation, the corpus used seems to have consisted of approximately 130,000 messages, roughly 1/4 ham and 3/4 spam. No details of the construction or acquisition of this large corpus were given. Because of its volume, one would suspect a synthetic corpus from high volume sources. The details of this corpus construction are critical to the evaluation of the method, so no useful conclusions can really be drawn from the empirical evaluation other than that, like most machine learning methods, this method works well on some problem set.

    The claimed accuracies from the technique are at a level that is highly suspect from previous experience: there are fundamental bounds on how well any ML algorithm can do in real situations that don't appear to be met here. Indeed, messages found to be misclassified as spam in the test corpus were manually reclassified, but no effort seems to have been made to identify messages that were "correctly" classified by the algorithm but misclassified in the corpus. The accuracy before manual manipulation of the results (!) appears to be about 97%, which is well within the normal expected range. Computational efficiency appears to be good.

    The vocabulary used in the paper is not particularly consistent with the vocabulary normally used in the spam filtering or machine learning literature. A few spam filtering and machine learning papers are cited, but not many: citations are primarily from the

    • >P.S.---I can't believe that the banner ad at the top of my browser window as I write this is actually blinking at me. Thanks, /. editors. Do me a favor, folks, and don't buy anything from Server Beach.

      why? *just curious, as from the post you seem like a bright person ...*

  • Why can't we start filtering based on the URLs in spam? There would need to be some verification process (otherwise valid URLs would be blocked), but wouldn't it increase the cost to spam since spammers would need to register even more domains? After a while, this should also give us a list of spam-friendly hosting providers who should be banned from the rest of the internet.
    • Why can't we start filtering based on the URLs in spam?

      ActiveState PureMessage has been doing this for years.

      Also now available for free via SURBL [surbl.com]

      Just when you thought you had a new idea, it turns out to be older than the hills...
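
      For the curious, the general idea of URL-based filtering can be sketched in a few lines. This is not how PureMessage or SURBL actually work internally (SURBL is queried over DNS); it is a toy that assumes you already have a local set of listed domains, and `extract_hosts`, `listed_hosts` and the blocklist contents are hypothetical names for the example.

```python
import re
from urllib.parse import urlsplit

# Rough pattern for http/https URLs in a message body (an assumption, not a full URL grammar).
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def extract_hosts(body):
    """Return the set of lowercased hostnames found in URLs in the body."""
    hosts = set()
    for url in URL_RE.findall(body):
        try:
            host = urlsplit(url).hostname
        except ValueError:
            continue  # malformed URL, e.g. a broken IPv6 literal
        if host:
            hosts.add(host.lower())
    return hosts


def listed_hosts(body, blocklist):
    """Return hosts that equal, or are subdomains of, a blocklisted domain."""
    hits = set()
    for host in extract_hosts(body):
        labels = host.split(".")
        # Check the host itself and each parent domain, but never a bare TLD.
        for i in range(len(labels) - 1):
            if ".".join(labels[i:]) in blocklist:
                hits.add(host)
                break
    return hits
```

      Matching on the parent domain is what raises the spammer's cost: a fresh subdomain is free, a fresh registered domain is not.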

      • Also available in Vipul's Razor:

        NAME Changes - razor-agents [sourceforge.net] 2.61 (July 06, 2004) * Introduced the Whiplash signature scheme. Whiplash signatures are based on canonical domain names present in URLs embedded in spam messages. A Whiplash signature is also a function of the length of the spam message. It's important to note that not all whiplashes are used as classifiers. The Whiplash engine is augmented by sophisticated logic on the Razor2 backend to select the Whiplashes that are used to filter

  • by Ungrounded Lightning (62228) on Sunday August 22, 2004 @01:54PM (#10038463) Journal
    That should work for virus and worm detection, too!

    Even more so, since viruses are much more a compilation of a set of previous constructions with a few mods than a new composition not necessarily based on the wording of old scams.

    And viruses and worms (especially worms) are more constrained by their environment, requiring an exploit of a vulnerability and the installation of work-doing code. Though gene-shuffling techniques might be able to bury much of the code, the basic exploit must continue to be some sort of match to the vulnerability's "receptor".
  • I am no expert on mail or spam, but I recall reading various good ideas of how to re-implement email to make it nearly impossible for spammers. Like that "ticket" system, where the mail actually sits on the sender's mail server until you collect it. I'm sure there are dozens of good ideas that, just with simple logistics, make it nearly impossible / unfeasible to mass mail random people.

    Given that, why can't there just be a proposal, adopted (like a DVD format, etc) by some huge players (Microsoft, OpenSou

    • A penny or tenth of a cent would be unnoticeable to the average email user, but would break the spammer's bank.
  • My fast, efficient, method is very light on system resources and attacks spam by detecting one or more common attributes of spam and taking the appropriate action.

    Complete details here. [slashdot.org]

    Bryan Taylor
    iamcf13@hotpop.com
    SpamByte code: 7
    (see http://www.cf13.com/game-over-spammers.htm )
    http://www.cf13.com/press-release.htm
    All email containing unwanted content will be summarily deleted or reported as spam.
  • I just installed greylistd [debian.org] by Tor Slettnes about 24 hours ago, and haven't received a single spam yet (down from 20-30 per day before). I only have a 5 minute greylist delay, meaning there's almost no downside to this method. Assuming my correspondents don't use broken mail servers (and that's their problem if they do) there are no false positives and no maintenance with this system. I use no other spam filters of any kind. I guess they just aren't patient enough to wait 5 minutes :)
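
    For anyone wondering what greylisting boils down to, here is a minimal sketch of the idea. It is not greylistd's actual code: the `Greylist` class, its `check` method and the in-memory dict are assumptions for illustration, and a real implementation would also expire old entries and persist its state.

```python
import time


class Greylist:
    """Toy greylist: defer the first delivery attempt, accept retries after a delay."""

    def __init__(self, delay_seconds=300):
        self.delay = delay_seconds
        # (client_ip, sender, recipient) -> timestamp of the first attempt
        self.first_seen = {}

    def check(self, client_ip, sender, recipient, now=None):
        """Return "accept", or "defer" (meaning: answer with a temporary failure)."""
        if now is None:
            now = time.time()
        key = (client_ip, sender.lower(), recipient.lower())
        first = self.first_seen.setdefault(key, now)
        if now - first >= self.delay:
            return "accept"
        return "defer"
```

    A well-behaved mail server queues the message and retries later, so it gets through on a retry after the delay; most spamware of the fire-and-forget kind never comes back.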

    And if they start

  • More clever filters and pattern matchers are not going to work. Just like encryption, the more something is used, the more likely it is to be hacked around. Maybe early adopters will benefit, as the spammers have not had the time or target size to catch up yet. But on a grander scale, it is a no-win cat and mouse game.

    The solution is same one that reduces paper junk mail: postage fees. Charge 5 cents or so per message, and spam will greatly shrink.
  • They're talking about IBM's

    (((Anti-Spam) Filtering) Research) Project

    This is not the same as the

    ((Anti-(Spam Filtering)) Research) Project

    Nor is it the

    (Anti-((Spam Filtering) Research)) Project

    I'm not sure, but I think the last two are run by AT&T [slashdot.org].
  • by YU Nicks NE Way (129084) on Sunday August 22, 2004 @08:56PM (#10040626)
    It sounds like a great paper until you get down into the guts of their materials and methods. They trained their system on half of their total data, and did not then test on separate data. That captures the two classic no-nos of data driven techniques: they inflate their results by including their training data in the results, and, worse, their training data comprises a larger sample of their total data than would be seen in the real world.

    The first of these calls their sensitivity result into question. If they classify their training data perfectly, then the 4.4% false negative rate they quote needs to be doubled to 8.8% -- almost one false negative in every eleven messages scanned.
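
    The doubling is just arithmetic, sketched here under the assumptions stated above (half the evaluated data was training data, and the training half is classified perfectly). The function name `held_out_rate` and its parameters are made up for the example.

```python
def held_out_rate(reported_rate, train_fraction, train_rate=0.0):
    """Error rate on unseen data implied by a rate reported over train + test combined.

    reported = train_fraction * train_rate + (1 - train_fraction) * held_out
    """
    if not 0.0 <= train_fraction < 1.0:
        raise ValueError("train_fraction must be in [0, 1)")
    return (reported_rate - train_fraction * train_rate) / (1.0 - train_fraction)


# 4.4% reported, half of the evaluated data was training data classified perfectly.
rate = held_out_rate(0.044, 0.5)
print(f"{rate:.1%}, about one in {1 / rate:.1f} messages")
```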

    The second of these calls their false positive rate into question: training with an unrealistically thorough set leads to better categorization, ceteris paribus. They need to show the trend with a variety of different training set sizes to support any claims about performance.

    This sounds like a fully buzzword compliant non-result to me.
