Server Wackiness, Carnivale, G5

Journal by CmdrTaco on Wednesday October 22, 2003 @10:49AM

Several problems continue to plague Slashdot, so consider this a status report, and a request to not continue reporting problems we're working on ;)

Our logging database was having problems. The most notable repercussion of this was a dramatic drop in moderation points flowing into the system. The effects are obvious: we dropped to about 20% of our usual moderation. We have fixed the broken piece of hardware, so logging is now happening quickly again, and moderation points are flowing back into the system at a more normal rate. We're still not up to 100% tho: we can't simply dump the full load back into the system, because then we'd have an obnoxious see-saw problem where a spike of points would get used all at once, followed by a lull in moderation. I expect that in the next 2-3 days we'll be back up to capacity. Another very visible problem was meta moderation. Many people noticed blank meta moderation pages. This was because only about 20% of the usual mod points needed M2, so we were caught up. This too will return as moderation returns to normal.
Users have been reporting 500 errors, hung pages, and other difficult-to-diagnose problems since we replaced the webheads and load balancer. We're still not exactly sure what is causing these, and since they seem to occur randomly, we're struggling to resolve them. In the process of trying to fix this, we found what appears to be a nasty memory leak in a couple of different pages in slash. But that likely isn't really the problem; we just didn't notice it sooner because our old load balancer seemed to work much better. What seems to be causing the hurt is some screwy combination of pound/lvs/mod_gzip/lingerd. The problem didn't exist on the old load balancer, so the new one seems to be a likely candidate. It's a hard bug to duplicate tho: it isn't specific to any particular page on the site, but occasionally an apache fork just gives up. All our code executes properly, but the apache fork fails to get its next request. We're also seeing unusual tcp synack stuff (which is, for the trivia-minded among you, what ultimately required us to scrap the Alteon load balancer that we used in '99 and 2000). Anyway, eventually all the apache forks on a webhead stop handling requests, and we have to kill/restart apache. We'll probably have to kludge in some sort of angel script to detect this condition and automate the restart. We have 16 webheads right now, and this would probably mean restarting only a few of them every few hours, so users probably wouldn't notice. Or at least, they'd notice it less than the server hangs that they are getting now.
It's also worth noting that despite these problems, traffic is up substantially. We've been doing 3M+ pages on weekdays for the last few weeks, no doubt a combination of our usual fall traffic growth and improved response times. This is what caused the logging database problem mentioned above, but that's the price of success I guess ;)
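
For the curious, the angel script idea mentioned above could look roughly like the sketch below. This is only an illustration under stated assumptions, not the script actually running on the webheads: the probe URL, the timeout, the failure threshold, the check interval, and the use of `apachectl restart` are all made-up placeholders. The idea is simply to poll the local apache, count consecutive missed probes, and restart once the count crosses a threshold.

```python
"""Hypothetical 'angel' watchdog: restart apache when it stops answering."""
import http.client
import subprocess
import time
import urllib.error
import urllib.request

# Assumptions, not from the journal entry: every value below is a placeholder
# that would need tuning on a real webhead.
PROBE_URL = "http://127.0.0.1/"
PROBE_TIMEOUT = 10      # seconds before a request counts as hung
MAX_FAILURES = 3        # consecutive failed probes before restarting
CHECK_INTERVAL = 60     # seconds between probes
RESTART_CMD = ["apachectl", "restart"]


def probe(url=PROBE_URL, timeout=PROBE_TIMEOUT):
    """Return True if the web server answers without a 5xx or a hang."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError as err:
        # The server answered; only 5xx responses count as unhealthy.
        return err.code < 500
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, and malformed replies land here.
        return False


def update_failures(failures, healthy):
    """Count consecutive failed probes; a healthy probe resets the count."""
    return 0 if healthy else failures + 1


def watch(check=probe, restart=None, max_failures=MAX_FAILURES,
          interval=CHECK_INTERVAL, rounds=None, sleep=time.sleep):
    """Probe repeatedly; restart after max_failures misses in a row.

    rounds=None loops forever; a number bounds the loop (handy for testing).
    Returns how many restarts were triggered.
    """
    if restart is None:
        def restart():
            subprocess.run(RESTART_CMD, check=False)
    failures = restarts = done = 0
    while rounds is None or done < rounds:
        failures = update_failures(failures, check())
        if failures >= max_failures:
            restart()
            restarts += 1
            failures = 0
        done += 1
        sleep(interval)
    return restarts
```

Calling `watch()` with no arguments would loop forever on a webhead. Requiring several misses in a row before restarting is a deliberate choice in this sketch, so that one slow page doesn't bounce a server that is otherwise fine.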

Carnivale is a really excellent show. I had to play catch-up, and watched all 6 episodes in the last week or so, but it really has been worth it. Besides featuring a host of freaks and legitimately interesting flawed characters, it has really cool cinematography. It's just really interesting stuff, and if you have HBO you should check it out.

I finally got my greedy little mitts on a G5 this week. It hauls some serious ass... upon first boot, the system connected to Apple and downloaded the 10.2.8 G5 Update, which promptly fried the ethernet adapter. Yay. Installing a 10.3 beta solved the problem, but still, way to go Apple!

Our office is physically moving a few weeks from now. The best part of this is all the heavy lifting. Oh, and getting walls again. Between you and me, that CowboyNeal is a chatterbox... having some drywall between us will save at least one of us from the trouble of hiring a hitman ;)
