What follows is more-or-less Pat "BSD-Pat" Lynch's account of the DDoS... Pat is our super 31337 BSD Junkie sysadmin. He wants everyone to know that the timeline below is a little screwy, but things are more or less in sequential order. Things might not be exactly perfect, but hey, what do you expect after 30 hours without sleep?
Having moved the day before, none of us were truly familiar with exactly how the new hardware would handle the full burden of being 'slashdot.org'. The cluster (known affectionately as The Matrix) had handled its premiere day with flying colors, but we didn't really have an accurate feel for how things would react. Combine this with a couple of extremely high traffic stories posted on both Thursday and Friday, and it took us a while to determine that the problems were external, and not a flaw in some new component in the cluster.
The attacks began Thursday morning. Most of them came in the form of SYN floods, from obvious /16's no less, and some /24's. We didn't have any zombie-killing software or a firewall installed because of certain network topology issues. Later on, a second wave came, this one closer to 8 or 9pm, and the load balancer (an Arrowpoint CS-100) died under the load.
The DDoS, as far as I could see, was a lot of SYN and zero-port packets coming from various /16's and /24's, as well as a bunch of RFC1918 reserved addresses (10.0.0.0/8, 172.16.0.0/12 and 192.168.0.0/16). At one point we reached 109 Mbit/s worth of traffic into our network.
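Spoofed RFC1918 sources are the easy part to spot: those ranges should never show up as source addresses on packets arriving from the outside world. A minimal sketch of that check, in Python purely for illustration (the function name is ours, the ranges are the three listed above):

```python
import ipaddress

# The RFC1918 private ranges -- legitimate Internet traffic never has
# these as source addresses, so inbound packets carrying them are spoofed.
RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]

def is_spoofed_private(src: str) -> bool:
    """Return True if src falls inside an RFC1918 range (an obvious
    spoof when seen inbound on a public-facing interface)."""
    addr = ipaddress.ip_address(src)
    return any(addr in net for net in RFC1918)
```

A border filter doing exactly this would have silently eaten a good chunk of the flood.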
Liz and I went back to Exodus and rebooted the Arrowpoint, then the site seemed "ok" for a bit. By 3 in the morning, Liz decided that the PIX (Cisco's firewall) could simply not do what it was supposed to do, so we went back and started building a FreeBSD box as a bridging firewall.
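For the curious: a bridging firewall on FreeBSD of that era meant compiling BRIDGE and IPFIREWALL into the kernel, flipping a few sysctls, and dropping the garbage with ipfw. The sketch below is illustrative only, not our actual config — the fxp0/fxp1 interface names and rule numbers are assumptions:

```
# sysctls -- turn the box into a filtering bridge
net.link.ether.bridge=1              # enable bridging
net.link.ether.bridge_ipfw=1         # run bridged packets through ipfw
net.link.ether.bridge_cfg=fxp0,fxp1  # interfaces to bridge (assumed names)

# ipfw rules -- drop the obviously spoofed RFC1918 sources at the edge
ipfw add 100 deny ip from 10.0.0.0/8 to any in
ipfw add 110 deny ip from 172.16.0.0/12 to any in
ipfw add 120 deny ip from 192.168.0.0/16 to any in
```

The nice thing about a bridge is it needs no IP addresses of its own, so there's nothing on it for the attackers to aim at.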
Just before we went to plug it in, I tried to ssh into the vpn-gate and noticed that nothing was working right: while the site worked, outgoing traffic and source groups on the Arrowpoint were screwed. As if that wasn't enough, two ports on it had already died!
At some unknown point (time blurs after 30 hours straight!) Martin and PatG show up (thank the gods!) and force us to go to sleep. They bring the site up outside the Arrowpoint while Liz and I watch from a hotel room.
As of Friday morning, the site is semi-working, but the adsystem can't be updated and we have no access to the backend servers. I scream bloody murder to Arrowpoint, who eventually show up to blame the router: a Cisco 6509 switch with two RSM/MSFCs.
Liz and I do packet dumps and determine it's not the router; the little CS-100 had died the night before, and that's where it all started. The Arrowpoint guy insists we did something to make the Arrowpoint not work. (CT: Explicit description of precisely where Liz and Pat wanted to store the newly deceased Arrowpoint removed to keep things rated PG.) By 7 the CS-800 CSS is up and we're almost done for the day, but we stay to make sure. By 10pm we're exhausted but stable, although we're running 4 servers on round-robin DNS while the new load balancer waits.
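Round-robin DNS here just means publishing one A record per web server and letting the name server rotate the order of the answers, so clients spread themselves across the boxes with no load balancer in the path. A sketch of what such a zone fragment looks like — the addresses are placeholders from the 192.0.2.0/24 documentation range, not our real ones — with a short TTL so a dead box can be yanked quickly:

```
; BIND zone fragment: four A records for the same name.
; The server rotates the answer order, spreading clients across boxes.
www     300     IN      A       192.0.2.10
www     300     IN      A       192.0.2.11
www     300     IN      A       192.0.2.12
www     300     IN      A       192.0.2.13
```

It's crude — no health checks, and clients cache whichever answer they got — but it keeps a site limping along while the real load balancer is down.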
Netops (Liz, Martin and I) regroup, reintegrate the new Arrowpoint CS-800, and install a new FreeBSD firewall box in place of the PIX during Saturday afternoon. Slashdot returns to normal. Sysadmins get well-deserved sleep.
So that was the story. It was a pretty hellish weekend for everyone involved, but thanks again to those that helped get our ducks back in a row. Again, Part #2 to this (which originally was gonna run last Thursday, but got pushed aside with all this DDoS stuff) is a fairly detailed description of the new Slashdot setup at Exodus, complete with all the changes mentioned above. Fun for the whole family, if your family is really into clusters of web servers.