
Asynchronous Programming for Spam Elimination (63 comments)

Posted by timothy
from the you-do-this-while-they-do-that dept.
ttul writes "Stas Bekman (formerly the maintainer of mod_perl) has been quietly building an asynchronous programming framework to build high performance network applications in Perl. His recent Perl.com article describes how he has used the Event::Lib module (that lives on top of the popular libevent library) to write a traffic-shaping email proxy to get rid of spam. Asynchronous programming is challenging at the best of times. Read on to find out how to do it the easy way in Perl."
This discussion has been archived. No new comments can be posted.

  • by frenetic3 (166950) * <houston.alum@mit@edu> on Thursday October 12, 2006 @10:04PM (#16417723) Homepage Journal
    so they wrote an asynchronous proxy that slows down connections. cool trick, but not any kind of scalable solution.

    the core assumption, and the only thing that makes this work, is that botnet spam software will _always_ just give up after 30 seconds; if this throttling technique ever became commonplace, spammers would just write their own asynchronous mailer -- it's not THAT hard. windows has the same kind of async networking support (either through the winsock API and/or IO completion ports, or what have you) and i'm sure the spam/botnet software authors have no qualms about holding open a couple thousand sockets on the rooted windows machine (times a few hundred thousand machines.) furthermore, i bet there are some shitty legitimate MTAs that would just give up too, causing actual mail to get discarded :)

    (that, and they shoulda used twisted [twistedmatrix.com] or something :) -- using a pool of apache/mod_perl instances to handle connections is grossly inefficient.)

    ok, ok, maybe this sounds overly critical. it's a clever, thinking-out-of-the-box idea, but certainly not the panacea we're looking for to stop spam.

    -fren
    • Forget async io (completion stuff) in Windows...

      They can just make the SPAM program multithreaded and start a new thread for each new connection (each using *synchronous* IO).

      There's no interprocess communication involved, it should be trivial.
    • by A beautiful mind (821714) on Friday October 13, 2006 @12:52AM (#16419149)
      It's an arms race. Graylisting, higher MX spam traps, etc.

      They all rely on the "we only have to be better than the neighbour's mailserver" principle. Until everyone starts doing it, these things work, and then new methods get invented to combat spam. Not that surprising, but saying no to this approach is basically silly. There is NO good way to eliminate spam, because stupid people exist. So people hack around the problem.
      • by Halo1 (136547)
        They all rely on the "we only have to be better than the neighbour's mailserver" principle.
        Greylisting in combination with blacklisting also has another advantage: by the time the message is no longer greylisted, there is a higher chance that the spamming server is already blacklisted.
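
For readers unfamiliar with the mechanism this subthread keeps returning to, here is a minimal Python sketch of greylisting. It is not taken from any particular mail server: the `Greylist` class, the 300-second delay, and the reply strings are illustrative assumptions. An unseen (client IP, sender, recipient) triplet gets a temporary 4xx reply, and a retry after the delay has passed is accepted.

```python
import time

class Greylist:
    """Toy greylist keyed on (client_ip, mail_from, rcpt_to) triplets."""

    def __init__(self, delay=300.0, clock=time.time):
        self.delay = delay        # seconds a new triplet must wait (assumed value)
        self.clock = clock        # injectable clock, handy for testing
        self.first_seen = {}      # triplet -> timestamp of the first attempt

    def check(self, client_ip, mail_from, rcpt_to):
        """Return an SMTP-style reply string for one delivery attempt."""
        triplet = (client_ip, mail_from, rcpt_to)
        now = self.clock()
        # Remember when this triplet was first seen; later attempts reuse it.
        first = self.first_seen.setdefault(triplet, now)
        if now - first < self.delay:
            # Temporary failure: a compliant MTA queues the mail and retries.
            return "451 4.7.1 Greylisted, please try again later"
        return "250 OK"
```

A sender that never retries is dropped at no cost to the receiver, while a sender that does retry (as caseih reports further down) gets through once the delay has elapsed. The waiting period is also what gives blacklists time to catch up.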
    • by ttul (193303) * on Friday October 13, 2006 @02:57AM (#16419857) Homepage
      [full disclosure and shillery alert: I work with Stas at MailChannels]

      You make some very good points -- and these are all concerns we had when we set out to build this software.
      Fortunately for the world, these concerns have turned out to be unwarranted. Furthermore, our experience in actually deploying this technology has been far more breathtaking than we had imagined -- both in terms of spam mitigation and improvements in scalability.

      > the core assumption, and the only thing that makes this work, is that botnet spam software will _always_ just
      > give up after 30 seconds;

      I have a theory that spammers will always be impatient. I believe this theory for several reasons:

      1. Spam campaigns are now recognized by anti-spam companies in minutes or hours. New campaigns therefore have a very short life expectancy and have to be completed as fast as possible. If mail can't get delivered fast, it's time to move on to a new domain to get it moving again. With collaborative filters like Cloudmark recognizing campaigns in less than 60 seconds, spammers obviously have to move traffic fast.

      2. Botnets are not unlimited in their size or bandwidth capacity. Typically, botnets these days are between 1,000 and 10,000 hosts. Any larger and the command and control channels are very quickly noticed and shut down by service providers. Botnets cost money too -- $250/hour for a 10K botnet is typical.

      3. Spammers' raison d'être is to send lots of mail and hope that a small percentage of recipients buy something. The only way to make the business profitable is to send huge amounts of mail. If all zombie traffic in the world was magically being slowed down, spamming would no longer be profitable and spammers would tend to focus more on things like highly targeted phishing instead. Not surprisingly, we're already starting to see this.

      4. Because #3 isn't going to happen any time soon, and in light of the technical constraints (1 and 2), spammers have no choice but to abort their connections within a very short time frame. It's just the nature of the economic beast. Hanging on is just for posterity. It doesn't make economic sense.

      5. It works. And it's very very scalable. By slowing down traffic and multiplexing what remains, mail server load drops by 90%. In big installations, that means no more being paged in the middle of the night because your cluster of 4-way Xeons with 8GB of RAM is borked by a distributed spam burst.

      Oh -- and of course you can't just slow everything down. It's important to be very selective so as not to delay everything.

      > if this throttling technique ever became commonplace, spammers would just write their
      > own asynchronous mailer -- it's not THAT hard...

      Actually, it is that hard. Even Stas got a headache working on this project.

      But even if it was easy, it would be pointless for a spammer to launch more than one connection per zombie. If a sender is marked as suspicious, the sender's concurrency is severely limited. One connection per zombie, at 5 bytes per second -- that's just not economic.
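
To make the "one connection per zombie, at 5 bytes per second" figure concrete, here is a rough Python sketch of a per-sender concurrency cap combined with a drip-rate limiter. It is not the MailChannels implementation: the `Throttle` class, its method names, and the `send` callback are hypothetical, and only the two numbers (one connection, 5 bytes per second) come from the comment above.

```python
import asyncio
from collections import defaultdict

def seconds_to_transfer(n_bytes, bytes_per_sec):
    """How long n_bytes take to cross a link throttled to bytes_per_sec."""
    return n_bytes / bytes_per_sec

class Throttle:
    """Hypothetical limiter applied to senders already marked suspicious."""

    def __init__(self, max_conns=1, bytes_per_sec=5):
        self.max_conns = max_conns
        self.bytes_per_sec = bytes_per_sec
        self.active = defaultdict(int)   # sender IP -> open connection count

    def admit(self, ip):
        """Accept a new connection only while the sender is under its cap."""
        if self.active[ip] >= self.max_conns:
            return False
        self.active[ip] += 1
        return True

    def release(self, ip):
        """Call when an admitted connection closes."""
        self.active[ip] -= 1

    async def drip(self, data, send, sleep=asyncio.sleep):
        """Pass data to send() in bytes_per_sec-sized slices, one per second."""
        for i in range(0, len(data), self.bytes_per_sec):
            send(data[i:i + self.bytes_per_sec])
            await sleep(1)   # yields to the event loop, so other clients proceed
```

At that rate `seconds_to_transfer(10240, 5)` is 2048 seconds, so 10 KB of throttled traffic keeps one zombie busy for roughly 34 minutes. In a real asyncio server `send` would typically be a `StreamWriter.write` followed by `await writer.drain()`.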

      > furthermore, i bet there are some shitty legitimate MTAs that would just give up too, causing actual
      > mail to get discarded :)

      Let's just say the gap between the patience of spammers and the patience of legitimate MTAs is very large indeed. And by carefully fingerprinting and assessing sender reputation, this problem can be minimized to the point where it is a far smaller problem than content filter false positives.

      I also want to point out that this technology does not make email suck by slowing it down. It in fact speeds up delivery of legitimate mail in most cases because the load is so reduced on the rest of the infrastructure.

      Just talk to our customers. One of them was running four 4-way Xeon boxes with 8GB of RAM each -- all this to service the spam filtering needs of just 10,000 end users. He told us he hadn't slept a full night in months because of load-based outages. Since installing the software Stas built, the only alert he's received is a notification that the load level dropped below the panic threshold!
      • by caseih (160668) on Friday October 13, 2006 @12:42PM (#16425175)
        Unfortunately I've seen a marked decrease in the effectiveness of grey-listing lately, which is similar in intent to your ideas. What I'm finding is that a lot of spam is now coming from RFC-compliant mail servers. Stock spams in particular always come through after faithfully waiting out the greylist timeout. So obviously some spammers are able to wait, even up to 45 minutes, to send their spam to me. So despite your arguments spammers will find a way to still economically spam while tolerating delays, holding connections open, etc.
        • by SEAL (88488)
          I've seen a marked decrease in the effectiveness of grey-listing lately

          Agreed. My ISP *finally* added greylisting this year. This is the account I use on my domain registrations, so the email address shows up in whois. It therefore gets an insane amount of spam. After testing out the greylisting for a couple of weeks, I saw no perceptible difference in the amount of spam I was receiving.

          When you greylist, you're basically using SMTP rules to tell the sender "try again later". As this became more common
        • Stock spams in particular always come through after faithfully waiting out the greylist timeout.

          That penny stock spam is the most successful I've seen. More than half of it gets past gmail's filters and into my in box, and then more than half of that gets past my own filters. It's just about the only spam that makes it through, but I get several of those a week. (I also checked the stocks, and not a single one has risen significantly, despite the spam's assurances ;-)

        • by hawg2k (628081)
          I was just about to post the exact same thing. Mod the parent up.

          I don't even use greylisting anymore because it gets in the way of me troubleshooting mail problems, and has negligible effect on SPAM anymore.
    • So how does this compare to OpenBSD's spamd, which does tar-pitting (and things like setting the TCP window size to 1 so you can really slow things down), but is designed for very low resource usage? This presentation [openbsd.org] by the spamd guys last year should, I think, address some of your questions about the long-term effectiveness of greylisting. In summary; spammers adapt, but so does spamd.
      • by ttul (193303) *
        [shillery notice: I am CEO at MailChannels [mailchannels.com]]

        spamd gave us our initial inspiration. I talked with Bob Beck at the Cansecwest security conference [cansecwest.com] after he presented on spamd and was -- to put it mildly -- blown away.

        It's important to understand that spamd does not actually deliver mail. It just responds r e a l l y s l o w l y and then returns a 400-series code to force the sender to try again. After the first time, a packet filter rule is added that redirects that sender to a real MTA, which receives the mes
  • by thrillseeker (518224) on Thursday October 12, 2006 @10:08PM (#16417769)
    Sometimes I sits and programs, and sometimes I just sits ...
  • by frenetic3 (166950) * <houston.alum@mit@edu> on Thursday October 12, 2006 @10:10PM (#16417779) Homepage Journal
    Your post advocates a

    (X) technical ( ) legislative ( ) market-based ( ) vigilante

    approach to fighting spam. Your idea will not work. Here is why it won't work. (One or more of the following may apply to your particular idea, and it may have other flaws which used to vary from state to state before a bad federal law was passed.)

    ( ) Spammers can easily use it to harvest email addresses
    (X) Mailing lists and other legitimate email uses would be affected
    ( ) No one will be able to find the guy or collect the money
    ( ) It is defenseless against brute force attacks
    (X) It will stop spam for two weeks and then we'll be stuck with it
    ( ) Users of email will not put up with it
    ( ) Microsoft will not put up with it
    ( ) The police will not put up with it
    ( ) Requires too much cooperation from spammers
    (X) Requires immediate total cooperation from everybody at once
    ( ) Many email users cannot afford to lose business or alienate potential employers
    ( ) Spammers don't care about invalid addresses in their lists
    ( ) Anyone could anonymously destroy anyone else's career or business

    Specifically, your plan fails to account for

    ( ) Laws expressly prohibiting it
    ( ) Lack of centrally controlling authority for email
    ( ) Open relays in foreign countries
    ( ) Ease of searching tiny alphanumeric address space of all email addresses
    ( ) Asshats
    ( ) Jurisdictional problems
    ( ) Unpopularity of weird new taxes
    ( ) Public reluctance to accept weird new forms of money
    (X) Huge existing software investment in SMTP
    ( ) Susceptibility of protocols other than SMTP to attack
    ( ) Willingness of users to install OS patches received by email
    (X) Armies of worm riddled broadband-connected Windows boxes
    (X) Eternal arms race involved in all filtering approaches
    ( ) Extreme profitability of spam
    ( ) Joe jobs and/or identity theft
    ( ) Technically illiterate politicians
    ( ) Extreme stupidity on the part of people who do business with spammers
    ( ) Dishonesty on the part of spammers themselves
    ( ) Bandwidth costs that are unaffected by client filtering
    ( ) Outlook

    and the following philosophical objections may also apply:

    ( ) Ideas similar to yours are easy to come up with, yet none have ever been shown practical
    ( ) Any scheme based on opt-out is unacceptable
    ( ) SMTP headers should not be the subject of legislation
    ( ) Blacklists suck
    ( ) Whitelists suck
    ( ) We should be able to talk about Viagra without being censored
    ( ) Countermeasures should not involve wire fraud or credit card fraud
    (X) Countermeasures should not involve sabotage of public networks
    ( ) Countermeasures must work if phased in gradually
    ( ) Sending email should be free
    ( ) Why should we have to trust you and your servers?
    ( ) Incompatibility with open source or open source licenses
    ( ) Feel-good measures do nothing to solve the problem
    ( ) Temporary/one-time email addresses are cumbersome
    ( ) I don't want the government reading my email
    (X) Killing them that way is not slow and painful enough

    Furthermore, this is what I think about you:

    (X) Sorry dude, but I don't think it would work.
    ( ) This is a stupid idea, and you're a stupid person for suggesting it.
    ( ) Nice try, asshole! I'm going to find out where you live and burn your house down!
    • I've seen the checklist used many, many times, and it's typically funny. But I'm not sure you've selected the correct values in this instance. Please provide details of why you selected the following:

      (X) Mailing lists and other legitimate email uses would be affected
      (X) Requires immediate total cooperation from everybody at once

      Specifically, your plan fails to account for

      (X) Huge existing software investment in SMTP
      (X) Armies of worm riddled broadband-connected Windows boxes

      and the following philosophical

    • by gurps_npc (621217)
      Your answers were bullcrap. Here are my counters.

      (X) Mailing lists and other legitimate email uses would be affected

      And your point is? If I have to give up 'mailing lists', or (far more likely) force mailing lists to change so that they are NOT so similar to spam that they get caught by anti-spam stuff that is not a real issue. We do NOT owe Mailing Lists the right to exist if they can't change to deal with the reality of a spam-free world, tough luck. The effect on other legitimate email uses would b

  • by Chuck Chunder (21021) on Thursday October 12, 2006 @10:26PM (#16417927) Homepage Journal
    Evidently:
    Asynchronous Programming for Spam Elimination 4 of 1 comment
  • Asynchronous Programming = programming with futures

    must we rename everything every time that someone "discovers" it?
    • AJaX (Score:2, Informative)

      by tepples (727027)
      Asynchronous Programming = programming with futures

      Except "asynchronous programming" is already a well-known term among many web developers:

      Asynchronous Programming with
      JavaScript, HTML DOM,
      and
      XMLHttpRequest

    • by RAMMS+EIN (578166)
      ``must we rename everything every time that someone "discovers" it?''

      Yes, because, that way, you get publicity. If you just quietly sat and implemented it, it would be every bit as great, but nobody would hear about it.
  • by 0kComputer (872064) on Thursday October 12, 2006 @10:50PM (#16418199)
    This guy goes and makes it multithreaded... Great just what we need.
    • It's easy to do threads [cpan.org] in perl 5.8.

      • Re: (Score:2, Informative)

        by ttul (193303) *
        [full disclosure: I work with Stas at MailChannels]

        We looked at using the new Perl threads, but Perl 5.8 threads suffer from a few severe limitations.

        1. When you create a new thread, a complete copy of the interpreter is made. The new thread makes use of this new interpreter instance and cannot communicate with the original thread except via the threads::shared module or some traditional IPC mechanism. In short, they're no better than forking a new process and in many ways, they are far worse than this.

        2. P
        • Re: (Score:2, Interesting)

          by Ed Avis (5917)
          Did you consider some event-driven thing using POE [perl.org]?
          • by ttul (193303) *
            Yes, we looked at using POE. We concluded that POE is just far more than we needed for this application.
            It would have been too difficult to make POE rock performance-wise in addition to ensuring that POE used an efficient event library like libevent.
            And in this kind of application, you need awesome performance. We profiled the app with strace for weeks to get rid of unnecessary system calls.
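
As a rough illustration of why an event loop suits this workload, the Python sketch below runs many mostly idle "connections" on a single thread. It uses the standard-library `asyncio` purely as a stand-in for Event::Lib/libevent and is not the code from the article. The names `slow_client` and `serve_many` are made up for this example.

```python
import asyncio
import time

async def slow_client(delay):
    """Stand-in for one connection that spends its life waiting on I/O."""
    await asyncio.sleep(delay)   # while this one waits, the loop runs the others
    return delay

async def serve_many(n, delay):
    """Run n idle 'connections' concurrently on one thread; return (count, seconds)."""
    start = time.monotonic()
    results = await asyncio.gather(*(slow_client(delay) for _ in range(n)))
    return len(results), time.monotonic() - start
```

Calling `asyncio.run(serve_many(1000, 0.05))` should finish in a small fraction of the 50 seconds that handling the same connections one after another would take, because the waits overlap. That overlap is what lets a single process hold thousands of deliberately slowed connections open.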
        • by kimanaw (795600) on Friday October 13, 2006 @10:28AM (#16423043)
          Yes, we could have used Python. Or Ruby. Both these languages have better threading support by leaps and bounds.

          Er, how ? Because they don't really use threads ? Sure, they're fast and lightweight...but since they don't use the underlying OS's threads implementation (ie, kernel-compatible threads), they're only marginally useful on multiCPU and/or multicore systems.

          2. Perl threads are still quite unstable.

          What's your basis for that statement ? Have you tested the latest versions of the threads [cpan.org] and threads::shared [cpan.org] modules ? Some significant effort has been applied in the past year to improve stability, as well as reduce footprint...you might want to give it a look...

          Perhaps if your org can get some funding, you might throw some money at the TPF to get iCOW implemented ? Which should vastly improve thread startup and reduce footprint. threads::shared remains a bit of a challenge, but that issue can be addressed by some carefully crafted XS (which I'm told Stas is pretty good at ;^).

          • stable perl threads? you must be kidding...

            it works for basic light things, but if anything complex is used, like mod_perl, it segfaults all over and if it doesn't it takes dozens of seconds to start a new thread under heavily loaded machine (due to lack of CoW as you've mentioned, but even then I doubt it'd be much of help, since it'll still need to copy a lot of data)

            And yes, someone needs to work on fixing those and a TPF grant would be very helpful.
    • by losec (642631)
      Actually, threads are something perl has got right, compared to most other languages.
      Perl threads are also very easy to understand.
      Simply put, nothing is shared between threads.
      If you want to share data between perl threads you must explicitly say so:
      use threads; use threads::shared;
      my $foo : shared = 1;

      though if you're stuck with perl version 5.6.0, don't use threads.
  • Isn't that an oxymoron?
    • by FreeIX (1011833)
      Nah, Perl is very easy to do things in...the first time. Unfortunately what is not so easy is understanding what you did six months ago.
      • by chromatic (9471)

        Consider this an opportunity to learn how to write maintainable code.

        • by FreeIX (1011833)
          Assuming for the sake of argument that I don't know how, perhaps I'd rather use a language that doesn't by its very motto make it difficult to learn how.
          • by chromatic (9471)

            There's more than one way to do it, in Perl, so choose the most maintainable. Problem solved.

            Before you counter "But I have to maintain code written by monkeys, and it's hard to read," consider not hiring monkeys to write code you care about. Not even Haskell or Java or Ruby prevents monkeys from writing bad code. The problem is, they're monkeys, not that they're using the wrong language.

      • by hondo77 (324058)

        Unfortunately what is not so easy is understanding what you did six months ago.

        If a programmer cannot go back into code he wrote six months ago and figure out what is going on, the blame rests with the programmer. The language is irrelevant.

        • Re: (Score:1, Insightful)

          by Anonymous Coward
          He didn't say it couldn't be done, he said it was not easy with Perl. The language is entirely relevant to this assertion.
  • Clever, but... (Score:2, Interesting)

    by deepb (981634)
    The article is correct - mail servers do not mind waiting a few minutes/hours/days to deliver their mail. Unfortunately, end-users do mind. The inherent delays for just about every message would be particularly painful for business email users, but even residential ISP customers are constantly opening tickets when they observe a delay (I work closely with several large ISPs, which is how I know).

    Delays aside, I just can't buy into network-layer rate limiting when it comes to email. The metric for anti-s
    • by ttul (193303) *
      [full disclosure: I work with Stas at MailChannels]

      > The inherent delays for just about every message would be particularly painful for business email users, but
      > even residential ISP customers are constantly opening tickets when they observe a delay (I work closely
      > with several large ISPs, which is how I know).

      That would be a problem if every single message was slowed down, but it's not. The system uses sender reputation and behaviour to ensure that only malicious senders are slowed down. Our cus
      • by deepb (981634)

        That would be a problem if every single message was slowed down, but it's not. The system uses sender reputation and behaviour to ensure that only malicious senders are slowed down.

        I don't recall any mention of that in the article, but I guess it may have been a bit outside the scope. Either way, I didn't realize that - makes sense.

        One way or another, you have to delay some of the traffic. You either do it up front and selectively -- applying the pain to the bad senders -- or you do it after the messages

  • Most if not all mail transfer agents no longer operate as open relays by default, a problem which used to be the main contribution to spam. People blamed the complexity of Sendmail for that and other problems, so many distros moved to other mail transfer agents for their default. A few years ago Sendmail was still about 65% of the mail servers.

    What is Sendmail's current market share, and how common are others like Exim, qmail, and Postfix?

    • by ttul (193303) *
      I can actually comment on that. We've surveyed 400,000 mail servers at organizations around the world and have found that Sendmail still holds on to 13% of the market.
      • by foxylad (950520)
        Care to comment on the other mail servers? Sendmail at only 13% is a big surprise. Hopefully this statistic will help persuade our dinosaur sysadmin that we should switch to postfix.
  • Perl is good for scripting but 24/7 high performance apps?
    Don't make me laugh. Something this CPU and I/O intensive should
    be written in C/C++ or even assembler at a push , not a scripting
    language. Seems to me this project has been written in perl for
    the sake of writing it in perl , not because it confers any
    advantages over doing it in a lower level language.
    • Re: (Score:2, Insightful)

      by chromatic (9471)

      Remember kids, if your process is IO-bound, you want the fastest possible code ever to make sleeping on those system calls as efficient as possible!


    • You must be really amazing to be able to determine that a given application can't possibly be usable when written in language X and would be much better in language Y without any data or firsthand experience using the application.

      Sometimes, things work just fine even though they'd be 20ms faster if written in C/C++.

      - Roach
      • by Viol8 (599362)
        If you're dealing with large data dumps you want something that can
        process that data fast.
    • Re: (Score:3, Insightful)

      by angel'o'sphere (80593)
      You seem to have never used PERL? PERL is in most regards in a speed range of > 85% of C/C++. For the stuff PERL is optimized for it is nearly at 95% of C++ with a FAR shorter development cycle.

      At least your comment is the silliest I have ever seen. What will a mail filter/forwarder do 90% of its time? NOTHING, being blocked listening on a socket. It really does not matter if the listening process is written in assembler (granted, which is very portable from sparc to i386 to PowerPC) and just waits "
  • We implemented greylisting. It is the answer. I watch as tens of thousands of emails per day are bounced away into oblivion. At first, ham had to wait a while, but now that the database is built, no one waits anymore. Not only that, server CPU is negligible because Spamassassin doesn't run on resent mail that has been marked as ham. Combine this with a few scripts that do some basic purging of spam addresses from the database, and we're good to go. Let's not reinvent the wheel. Why don't we just build greyl
