Slashdot Log In
Blow-by-Blow Account of the OSDN Outage
from the warts-and-all dept.
Our network operations staff was shorthanded; one of our most knowledgable people had quit recently to go into business with a friend and had not yet been replaced. Another was in the hospital, ill and unreachable. A third's cell phone was on the kitchen counter, unhearable from the bedroom, and the fourth one's cell phone battery had fallen out. It was a frustrating comedy of errors, and an unusual one. Our netops staff is typically "on the bounce" 24/7.
Dave Olszewski, an OSDN programmer who is not technically part of our netops staff and is not trained in our equipment setup, happened to be on IRC at the time. He doesn't live far from the Exodus facility in Waltham, MA, where our server cage lives, so he went there immediately. Kurt Gray, lead programmer, who we dragged out of bed, was not far behind. Hemos and others were awake by then, growing frantic as we found that not only Slashdot, but also NewsForge, freshmeat, OSDN.com, ThinkGeek, and QuestionExchange were down, along with our old -- but still popular -- MediaBuilder and AnmationFactory sites. Arrgh!
This is Kurt's "on the scene" report from Exodus:
Headshaking all around. Meanwhile, about 11:40 a.m. Yazz Atlas woke up and got his cell phone reunited with its battery. He picked up his voice mail messages, tossed on clothes, and hustled over to Exodus.Walk into our cage at Exodus and it seems harmless enough but try to learn what everything is doing and where the wires are all going in less than an hour and you could go insane. You're standing in a nice, clean, uncomfortably air-conditioned facility with 150 of VA's FullOn and various other servers humming away. Greeting you at the door is "Big Gay Al" our Cisco 6509, which contains two redundant router modules: Kyle and Stan. If Stan dies, Kyle takes over and vice-versa. Across the cage are two Arrowpoint CS800 load balancing switches: one is racked and idle (as a hot spare) and the other is live and balancing the load for most of our OSDN web sites. Between the Cisco 6509 and the Arrowpoint is a bridging FreeBSD firewall using ipfw rules to block stuff like ping just to drive everyone nuts basically."I can't ping your site!"
"Yeah, we know."
Just to make things interesting we've added ports to the 6509 by cascading to a Foundry Fast Iron II and also a Cisco 3500. We've got piles of printouts and documetation of all sorts, drawings and spreadsheets, helping us keep track of every IP and machine in this cage, yet it doesn't seem to get any clearer unless you've either built it yourself (only one person who did still works here and wasn't available this weekend) or if you've had the joyful opportunity of spending a night trying to trace through it all under pressure of knowing that the minutes of downtime are piling up and the answer is not jumping out at you.
At this point if you know anything about networking you'll demand an explanantion for why we're using each piece of equipment in the cage and not a WhizBang 9000 SuperRouter like the one you've been using flawlessly that even washes your dishes for you and makes food taste better too... I can only tell you that I'm not the networking design person here, I didn't chose this equipment or configure it but I'm told it's very good hardware as long as you know what you're doing, but as CowboyNeal once said, "You can take everything I know about Cisco, put it in a thimble and throw it away."
So Dave takes a look, can't ping the gateway, can't ping anything. Reboot the firewall. Didn't help. Still can't ping outside. OK, reboot the Arrowpoint. No difference. Hold your wallet... reboot the 6509... rebooting... rebooting... no difference. This is not good.
"Did you reboot the firewall?" I asked Dave.
"I rebooted everything," he said. "I think's it's the Cisco."
So we console into the Cisco 6509. What a mess. Neither of us understand how this switch was configured and what it is trying to do. We don't fully understand why you can get a console connection on Stan but not Kyle (turns out the standby module doesn't have active console, that's normal).
Yazz says, "When I arrived at Exodus, Kurt and Dave were trying every combination of things to do to get the 6509 back. But neither they nor I even knew the Cisco Passwords." The op who was supposed to be on duty (the one whose phone was out of hearing) was still nowhere to be found. They called their hospitalized coworker and got the Cisco passwords.
But, says Yazz, "Since the Cisco was rebooted there were no logs to look at. We could ping something on the inside but not everything. On some VLANs we could ping the gateway and others not. The outside world could ping one of the IPs the 6509 handles but not the other. From the inside we could not ping the IP that the outside world could ping. We could ping the one that they couldn't...very frustrating..."
Kurt again:
The next day, Monday, Kurt talked to Exodus network engineers and asked them why our uplink settings were so confusing to Cisco engineers. Instead of getting an answer from Exodus and running to Cisco with it, and then back again, he got Cisco and Exodus engineers to talk directly to each other and work it out. He conferenced an Exodus network engineer to Barnaby at Cisco and, Kurt says, "they talked alien code about VLANs, standby IPs, HSRP, multihoming, etc. etc., and they came to an agreement: our switch config was a mess... but at least Barnaby knew what the settings were supposed to be and an Exodus engineer agreed with him."Several hours of this sort of network debugging went on until 3:00 AM Sunday. By then we had called Cisco for help. They couldn't help us until they saw the switch config and got a chance to review it. We were spent. We had to go to bed and stay down for the night.Next morning we're back at Exodus and the situation hasn't changed -- our network is unreachable to the outside world. I was hoping that during the wee hours of the morning the Cisco 6509 had become sentient and fixed its own configuration or perhaps a friendly hacker had cracked into it and fixed it for us, or perhaps ball lighting would travel down a drain spout and shock our cage back to life like those heart paddles paramedics use... "It's a miracle!" No such luck.
So I called Cisco tech support. I wish had done this sooner. I was amazed first of all by how you can talk to a qualified Cisco tech immediately... we're talking an 800 number that you dial and within less than a minute you are talking to a technician... doesn't Cisco realize how shocking this is to technical people, to actually be able to talk to qulified technicians immediately who say things other than, "Well, it works on my computer here..."? Do they not know that tech support phone numbers are supposed to be 900 numbers that require you to enter your personal information and product license number, then forward you to unthinking robots who put you on hold for hours, then drop your call to the Los Angeles Bus Authority switchboard... does Cisco not understand that if you do not put people on hold for at least 10 minutes they might pass out in shock for being able to talk to a human too soon? Apparently not.
So I asked the Cisco technician, Scott, to telnet into our switch and take a look at the config. I figured he'd balk and say, "No I can't do that," because of course this is a tech support number I called so he's going to tell me to give the phone to my mommy if she's there and ask her to log into the switch because, since I don't have a lot of experience with IOS, I must be some kind of idiot to even call tech support without knowing what my HSRP configuration is on VLAN 4. Instead he says, "OK, what's the login password?" I can't believe this... I must have dialed the wrong number, he's not going to just go into our switch and sort this out for me right here and now, is he?
So he's in the switch and he's disgusted and horrified by how we have it configured, and I'm sure he's right. So I ask him, "Well, can you change all that?" I figure he'd say, "No, this your equipment, you fix it yourself," but he doesn't, he says, "Sure, what's the config password?" You gotta be kidding me, I must have dialed the wrong number here... this cannot be a tech support line... you can't actually get a tech support rep on a toll-free number simply to log in and fix your router setup while you whine at him on the phone... this is not real.
So he's in the switch config and he's having a great time pointing out everything some of our people warned us about months ago. He tells me this is wrong, we shouldn't be doing this or that... "Well, then change it if you don't mind," I tell him. "Switch broke. Me dumb. You fix." ...so at one moment Scott wanted to undo some changes. He bounces the switch... copy startup-config running-config ... the switch resets itself... then email starts streaming into my inbox... then I can ping our sites all of a sudden... we're back online! Everything is back! Weird.
Ok, that's all fine, but Scott is still freaked out about how we have the switch configured. Soon I get a call from Barnaby, another hot shot Cisco tech rep. He just logged into our switch and he's horrified too. He wants to walk me through a total switch upgrade and cleanup right now. "Not tonight", I tell him, "I'm burnt and I need to consult some some network people over here before we mess with this any further."
Before moving on to the (short) Tuesday outage, here are a few more notes from Yazz:
Tuesday was router reconfig day. It was originally only supposed to cause "about five minutes" of downtime, so it didn't seem worth posting any kind of notice that it was going to happen. Why the middle of the day instead of a low-traffic post-midnight time? Because this way, if there was any trouble lots of people at Exodus and Cisco would be awake and around to help. And it was a good thing this choice was made. Kurt picks up the story:The one card going bad wouldn't have been such a big deal if the config in both were set up correctly. It was meant to flop to the other interface if the primary card died, which it did, but not with all the info it needed... AKA it was misconfigured...Exodus really wasn't set up to handle the type of failover the 6509 was meant to do. Thats what the Cisco folks said basically, and the Exodus people are no longer supporting this type of Cisco in their setups. Half the VLANs were only stored on one unit and the other half of them on other. So when one died it only knew half of the full setup and couldn't route things correctly since the VLANs it wanted weren't there... Fun!!!
This has not been OSDN's finest week. But we thought it was better to give you the full rundown than try to pretend we're perfect. At least we've learned a lot from the experience -- like to call for help from specialists right away instead of trying to gut things out, and just how valuable good tech support can be. If nothing else, perhaps this story can help others avoid some of the mistakes we made. We certainly aren't going to make the same ones again! (~.*)Tuesday 11:00 a.m. we're back in the cage. Barnaby is logged into our switch while he's talking to me on my cell phone (which disconnects every 5 minutes just to make my day more challenging), helping us by upgrading the Cisco 6509 firmware, then he's going to clean up the config. First step was getting the firmware patches onto a TFTP server near the switch (had to be less 3 hops from the switch, TFTP doesn't work over longer hops). Yazz took care of that. From there Barnaby patched the firmware, had me reboot the switch, and we should be down for just 5 minutes. Unfortunately 5 minutes turned into 2 hours.After the switch reboot part of our network was unreachable again, much like Saturday's episode only this time with a Cisco rep on the phone helping us work it out. Again we started tracing cables all over the cage, pinging every corner of the matrix. Barnaby got an Arrowpoint tech rep, Jim, on the line and into our Arrowpoint. But this is tech support, Jim isn't just going to log into our Arropoint and debug for it for us, right? Wrong, this is Cisco tech support: Jim logs into our Arrowpoint and works with Barnaby to trace packets and debug our network.
For a while we put a cross-over cable in place of the firewall just to be sure the firewall box wasn't jamming us. Nope. Didn't help. Barnaby and Jim are mapping hardware addresses to IP addresses to figure out where each packet is going. Finally Yazz and I are staring at this other switch cascading off of the 6509, this little out-of-the-way Cisco 3500 just sitting there... is this thing connected? We look at the link light leading it to the 6509. It's dark. "Uh Barnaby... can you check port 1 on module 2?"
"Hold on," he says over the phone to me. Then the light goes green, and after a few seconds of routers correcting their spantrees we're back online. Everything is back online. All this time it was this little interface to an ignored switch that none of us bothered to account for. Make a big note about in the network documentation, please.
After we came back online Barnaby went ahead and cleaned up our switch configuration, put things the way they ought to be, made our conections sane and stable.
Re:Cisco Support (Score:5)
Many people who call don't understand how the system works internally so here's a summary: We have cases in 4 groups, priorities 1 through 4, 1 being the most important. The designation of the priority of the case is entirely up to you as a customer. All cases are P3s by default which more or less means they need resolution within 72 hours. If your network is down and you need help right now, today with no waiting we'll elevate to a P2. If you are in a serious network situation like the one described in the article then it's a P1 and literally everything else stops, a bell goes off and everyone crowds around the tech w/ the problem (unless it's a softball case).
There are TACs all over the world but for English-speaking customers what usually happens is the US TACs roll over to the Australian TACs in the early evening who in turn roll over to Belgium and then back to the US. P1s get worked 24 hours until they're resolved, and if they're not fixed in less than 4 hours it's not so good for us.
We have to close about 5 of these cases a day which is sometimes cake (I can't ping my interface which is shut down) and sometimes nasty (redistribution 12 times over).
Also, those little surveys you get everytime you work with us (Bingos) are very important. If you'll recall you can rate us from 1 through 5 in 8 to 10 different categories. Anyone who doesn't maintain an average of at least 4.59 is not long for the TAC, 2 or 3 months tops.
The pay is actually kind of crap but there's no better place in the world to prep for your CCIE. I don't think anyone views the TAC as a long-term environment. Too much stress honestly.
Re:Grr... (Score:5)
Not everything is a conspiracy folks.
Re:cisco tech support are badasses (Score:4)
- Robin
Re:Eeep - scary moderators! (Score:5)
Within a minute ??? (Score:5)
Was anyone else waiting for the "*clickity-click* Wow, it looks like your entire root directory was deleted!" punchline? :-)
I forgot to add one thing (Score:3)
Was this configuration ever tested?! It sounds like it was put together, prayed over and sent out into the world.
it would have been simple to test too... pull out one of the uplinks... then the other... now try pulling out some of the webservers... and so on.
I just have a couple of questions (Score:5)
By 7 a.m. it was obvious that this was not a typical, easily-fixed, reboot-the-database problem.
Reboot the database?? WTF? You just proved my point as to why MySQL is NOT ready for primetime. Reboot the fscking database??
So Dave takes a look, can't ping the gateway, can't ping anything. Reboot the firewall. Didn't help. Still can't ping outside. OK, reboot the Arrowpoint. No difference. Hold your wallet... reboot the 6509... rebooting... rebooting... no difference. This is not good.
Guys, this isn't Windows -- Rebooting is an absolute last resort and if it works then you have discovered a problem, either in hardware or software and it needs fixed, not just a "oh well, a reboot fixed it, life goes on." Bastions of professionalism you're not.
I don't normally flame people for this kind of thing but the Slashdot crew are especially keen on bashing Windows, yet you resort to their exact tactics whenever a problem comes up.
Reboot the database?? I still can't believe I read that. Sorry.
Cisco Systems have some wonderful systems -- Hell I just recently found out about their stack trace analyzer... feed it a "sh stack" and it emails you back a list of IOS and/or hardware bugs which likely caused the crash. That is just plain old SCHWEEEET. Or being able to read their memory mappings to find out what is causing a bus crash... Ideal. You don't just randomly reboot the damn shit to try and get it to work. If it isn't working something is causing it. Embedded systems are generally pretty good at throwing up the red flags; you just need to look for them (logs, stack traces, extensive use of the debugging facilities...) Use the tools at hand instead of the big red button!
First step was getting the firmware patches onto a TFTP server near the switch (had to be less 3 hops from the switch, TFTP doesn't work over longer hops).
Unless this is something specific to the IOS or router, that's bullshit. I just upgraded 5 AS5248s to IOS 12.1(9) with a TFTP server that is 8 hops away. I'm not aware of any TTL issues with TFTP.
Finally Yazz and I are staring at this other switch cascading off of the 6509, this little out-of-the-way Cisco 3500 just sitting there... is this thing connected? We look at the link light leading it to the 6509. It's dark. "Uh Barnaby... can you check port 1 on module 2?"
You mention that your network documentation is shitty -- I sure as hell hope you'll push to have it upgraded and maintained with a high degree of readability. Even complex systems do not have to be undocumented just because they're complex. Use pictures, use words. I haven't found anything in IT which cannot be explained by a combination of both. And throw in a glossary for the non-techies like yourself who are called upon to fix it. :-)
Don't get me wrong; I'm glad you're back up. But this could have been prevented. Very easily from the sounds of it. I hope you did fire your cisco admin; it sounds like s/he didn't have a clue and was too terrified of losing his/her job that s/he didn't ask for help. Cisco has mailing lists, tons of documentation and there are many IRC channels to ask for help.
Re:OSDN, Audit ALL of your systems NOW. (Score:5)
Point 1./ Why do you allow TELNET in to your routing/switching equipment from the outisde world? If a CISCO tech' with the password can do it then a hacker without the password likely can too.
Up until recently you had no choice but to telnet to Cisco equipment. I came up with a quick solution: deny telnet from anywhere but a same-segment computer (in our case, it's our RADIUS authentication box). Now ssh to the server and telnet from there to the NAS. Problem solved. :-)
Point 2./ If you are connected to the Internet in any way NEVER replace your firewall with a cross over cable. Basically at that stage you have your pants around your ankles, are bent over, with a big "Do Me Now!!!!!" sign on your butt!
While I usually agree, sometimes it is necessary to do a quick check. Even with the number of blackhats out there the chances of them doing anything signficant (or anything at all) for the 2-5 minutes you have the firewall out are insignficantly small.
Re:Cisco Support (Score:5)
I can confirm this. I've been a network consultant for almost a decade, primarily as a Cisco router/switch jock. I've dealt with the TAC (Technical Assistance Center) too many times to count.
Hold times can vary, depending on time of day, but are never as bad as the stories from other companies. In most cases, you are on the phone with a real, live engineer within 5 minutes.
90% of the time, the engineer you are transferred to will be able to get your problem corrected. On the few occassions where they have not been able to help me, Cisco has moved mountains to get the right people invloved. I had an issue with Serial SNA - DLSW+ encapulation last year that was escalated to the point where the guy that wrote that portion of the code for IOS was on the phone, and was prepared to come to my client's site (True, they had purchased about $8M dollars in hardware...).
You do, typically, have to have a Smartnet contract, but as other posters have pointed out, if the problem is not hardware related, they will generally help you straighten out your configurations even without the contract.
Alot of people like to make comparisons between Cisco and Microsoft. Anyone who has dealt with the two will be quick to dispell any similarities. Cisco is a first-rate organization, with first-rate support, and I've made a career out of working with their products.
Re:Cisco Support (Score:3)
Caveat: Cisco basically does not have first level support (i.e. "'Is the router plugged in?' 'What's a router?') - you are supposed to have second level knowledge and have completed the first level troubleshooting before you call TAC.
But - I have been out of the office and had brand-new network techs call Cisco with a problem, and they did help out even then.
sPh
Re:blocking ping, btw, is STUPID. (Score:3)
Slashdot outtage - graphs and stuff (Score:3)
Go to the monitoring system page. [sysorb.com]
Click the www.slashdot.org link
Select services
This will give you some graphs showing the outtage.
Consider the time (Score:4)
Well, in this case Slashdot was down. That can explain the instant response.
__
Re:Kenny (Score:3)
Szo
And in financial news... (Score:4)
"It's the first the the Slashdot effect has been a productive one", said an unnamed Cisco official, pausing briefly to dodge a large bag of cash sailing through a nearby window.
Jay (=
Often true. Maybe usually. Not always. (Score:3)
It is clear that they were out of their depth. It is clear that they didn't know what they were doin. They knew that they didn't know what they were doing. But the experts were unreachable. So they tried something that sometimes works. I really don't see how you can fault them for that. It would, of course, have been better if they had know what their choices and options were, but they didn't.
I wouldn't have either. Probably most of us wouldn't have.
Caution: Now approaching the (technological) singularity.
Re:OSDN, Audit ALL of your systems NOW. (Score:3)
Bah, you're talking without knowing the parameters. For all you know, they could've enabled the telnet access on the outbound interface specifically for the checking/cisco rep, disabling it afterwards.
Secondly -- if I remember correctly you can have pretty damn long passwords on ciscoequipment. We do not know the length of the password, but its highly probable that the password is 10+ characters. A bruteforce-attack is pretty damn difficult when you have to check 64^10 possibilities. According to my bc:
arcade@lux:~$ echo 64^10 | bc
1152921504606846976
Now, that is a pretty impressive number of queries you've got to make to exhaust that pwd-space. To be quite frank -- I don't see the problem.
Point 2./ If you are connected to the Internet in any way NEVER replace your firewall with a cross over cable. Basically at that stage you have your pants around your ankles, are bent over, with a big "Do Me Now!!!!!" sign on your butt!
Oh, yes of course. If you don't have a firewall You are phooked!!
Ehh? Excuse me? Why the fsck do a properly configured serverfarm need firewalls _at all_? Please, enlighten us with your wisdom oh dimwit.
Firewalls _are not needed_ if you're not running services that _should not be running_ on servers for the internet.
--
SlashCrash? (Score:4)
It definetly enjoyed reading this article and I am sure that it will be bookmarked by a fair few techie minded network admins, just in case.
Re:Microsoft Support (Score:3)
Of course, it depends greatly on who you are talking to. The platforms team does have a huge slant toward NT/2000 because that's what they support and allegedly like. Those of us in Exchange support (I'll leave it to you to figure out what part of Exch. support I'm in) handle calls where Unix servers are relays, Pix firewalls sit between systems and load-balances continually send packets off into the woods. If you *don't* know non-Microsoft stuff, aren't prepare to acknowledge that non-MS works and works well, or just can't handle the idea of public standards, you are fucked in that group.
It all comes down to who you get on the phone. If you don't like who you are dealing with, ask to speak with their manager or technical lead. Get it straightened out with them or request another support tech. You're paying for it, get what you are paying for.
(As always, my comments are my own and my employer doesn't take any responsibility for them. Like they would want to anyway.)
---
Re:Cisco Support (Score:3)
I don't normally swear, but if someone asks me if Cisco support is good, I have to reply: "Abso-fucking-lutely". They are easily the tightest organization out there, bar none. I don't think anyone: UPS, the Military, Wall Street, runs as good an operation as they do.
And I've sat with two engineers at 1:00am through to 11:00am as they fixed my small gateway to an ISP, not a big ticket item. At one point, they did an engineer transfer, connecting me to a different part of the world, and spent thirty minutes overlapped, with the engineers working together to make sure that the new engineer knew what the first had tried. As it turned out, the firmware storage was flakey, and the config corrupted itself semi-randomly.
Years later, I watched Cisco do the exact same thing - only this time, they correctly identified that the problem wasn't them, but in some Bay routing equipment, *and* they told us the exact commands to fix it (I was a outside consultant just watching, but I believe they even offered to telnet in and fix it themselves).
So, yes. Cisco is the only brand I will buy, no matter how expensive they are. Think of the extra expense as insurance. You *may* not need it, but it sure pays for itself if you do.
--
Evan
Re:Eeep - scary moderators! (Score:4)
A quick Google search for "Anne Tomlinson" returns an orchestra conductor and someone in a retirement community.
If it was a real post, CmdrTaco probably would have ignored it. His good humored response makes me think it was a troll.
Is there any evidence that it was real?
-B
Re:Cisco Support (Score:4)
Re:Microsoft Support (Score:5)
I beg to differ.
That article details calling the 900 line, but even with support contracts, most MS tech support reps toe the company line in a distressing fashion.
"Unplug all the unix servers, that'll fix it"
"Upgrade everything to Win2k Adv Serv, that'll fix it"
"Upgrade to SQL Server (from Oracle), that'll fix it."
They seem to have no ability to distinguish which network components could be involved in a problem and are unwilling to accept that you've already localized the problem.
Case in point, there was a problem where two WinNT boxes wouldn't see each other. They both had IPs, they could both ping everything else. They were connected via a 100mbps switch.
We made sure each properly had an IP, that it could reach other machines, that the switch worked, and then swapped ports with two machines that were working just fine. We also tried isolating these two machines on their own switch, to avoid potential IP conflicts.
When we called the support number we honestly described the situation to the tech. He asked what else was on the network. We explained that it was in a different IP range, but on the same switches as a bunch of Linux machines, an Open BSD (firewall for the desktop machines), and a couple Suns (doing something for the other department, dunno what.)
He then proceeded to tell us that it was the other computers, despite our telling him that we had isolated the NT boxes in question on their own switch and we still had the problem, but when we put a third computer on, both of the NT boxes could reach it just fine.
We eventually lied to him, telling him that yes, we had unplugged all the unix machines, etc. (Like we're going to just unplug out company on the say-so of a moron, and like two junior techs would have the authority to do so anyway.) So now jim-bob starts to help, by telling us that Win2k is so much better, etc, that we wouldn't have these problems with it, etc.
When we flat-out refuse to "upgrade" to fix this bug, his advice is that we format the drives and reinstall. ARGH!
We finally convince him that these machines are somewhat important and we can't just wipe them everytime there's a small problem.
After over an hour with this jack-off, we hang-up, problem unresolved.
We get permission from the boss to call someone in... So we look through our list of contacts and grab someone whose card says they deal with networking and windows. Call him up. As we're describing the problem he listens quietly, grunts affirmatively when we describe how we isolated the problem, agrees that it couldn't be any of the other machines.
Then he says, "It sounds like it's an issue with a bad route, type 'route
He said that it, whatever it was, was a very common problem where the machines basically forget how to get from A to B. That command zeroed the routing (which didn't show any bad routes) and the reboot brought it back up.
Cost, a 15-minute phone consultation. $45
Microsoft tech support was basically a sales department, staffed with the marketing rejects.
So, don't EVER believe it if someone tells you that MS supports their products. Any company whose line is "Format and reinstall" has no business calling a product "Server", let alone claiming they're in the enterprise level.
Schon, earlier in this thread, said "Rebooting doesn't solve the problem!!" I wonder what he'd say about formatting and reinstalling.
Re:Eeep - scary moderators! (Score:5)
No, we don't have a right to know. Ms. Tomlinson's departure is between her and her employer; not some tabloid expose for a bunch of overly curious rumor mongering conspiracy theorists. I wouldn't be surprised if the people who blurted this out on a public forum haven't been seriously bitch slapped by HR.
As a community it would be best to let the matter drop. I'm sure if you were in Anne's position you'd be severely pissed. A little perspective and some empathy would be appropriate.
Re:You know you've been using windows too long whe (Score:3)
And who's to say that the problem that's being experienced will be fixed by a reboot?
We had a server running, one of the things it did was SMB sharing - one of the drives (the one dedicated to non-critical SMB shares, in fact) died.. This box was doing MUCH more than SMB - it was also our internal DHCP, and DNS server
I was out, and one of our MS guys decided "I don't know what all these error messages mean, but I can't see my windows drives, so I'll just reboot it." Because the drive was dead, the machine wouldn't boot. He took the WHOLE DAMN DEPARTMENT OUT - nobody had DNS, and when people's windows machines stopped working, the solution was (guess what?) REBOOT them - so THEY stop talking to the network altogether.
Now, the kicker is that the drives in this machine were hot pluggable. If the reboot hadn't happened, I could have swapped in a new drive, restored from last night's tape backup, and people could have continued working. Instead, because the machine was rebooted the whole department was down for several hours.
The mantra stands - REBOOTING WILL NOT FIX THE PROBLEM. And if you reboot before you know what the problem is, then not only don't you know if it will help at all, but you also don't know if it will make the situation worse.
sometimes getting back online as fast as possible is more important.
That's the trap - there is no guarantee that rebooting will do this - and you might just be screwing it even worse.
Getting back online as fast as possible involves solving the problem first - REBOOTING WILL NOT FIX THE PROBLEM.
Re:You know you've been using windows too long whe (Score:4)
Even though you think you're saying the opposite of what I said, you've hit the nail squarely on the head - rebooting never fixes any problem.
It may temporarily fix the symptom, but the problem is still there.
It is possible for routers, Linux boxes, etc to crash.
Yes, it is. But if they crash, it's for a reason - perhaps there is a bug in the configuration, or firmware; or perhaps it's hardware.. but what's important is that rebooting will not actually fix the problem, all it will do is temporarily alleviate the symptom.
If the problem is with the configuration, then you fix the configuration. If there is a bug in your software, you fix that. If it's hardware, you replace the faulty hardware. If it's firmware, you upgrade the firmware (or replace the unit with a different model, from a manufacturer who actually does quality testing.)
But you do not just blindly reboot - if a reboot is required, you do it after you've discovered WHY the machine has crashed, and you've fixed it. Once again, the mantra is "Rebooting will not fix the problem."
You know you've been using windows too long when.. (Score:5)
But, says Yazz, "Since the Cisco was rebooted there were no logs to look at."
You fell into the classic "Windows" trap.. this is what I tell the Jr. tech guys here when one of the servers goes wonky: "If it doesn't work, there is a reason; something is wrong. Rebooting will not fix the problem."
They usually respond with "but I didn't know what else to do."
To which I answer "Repeat after me - REBOOTING WILL NOT FIX THE PROBLEM."
"But I didn't know what else to do."
"Then call someone who does - REBOOTING WILL NOT FIX THE PROBLEM."
Thank you. (Score:3)
BTW: Are you going to plan any redundancy/failover drills as a result of this?
Really full disclosure? (Score:5)
Was Rob just popping off at random, or was that little bit removed trying to cover
Jes' wondering...
Re:You know you've been using windows too long whe (Score:5)
You can make a case that valuable troubleshooting info is lost when systems are rebooted. I agree, but counter that all good systems should have detailed event logging. Leaving the system online and intact is the best way to root cause a bug. But, sometimes getting back online as fast as possible is more important.
Let this be a lesson..... (Score:3)
Don't slack. When you slack it bites you in the ass. Maybe not today, maybe not tomarrow, but someday, someday soon, it will.
Test your failover configs. How? By actually making them fail. During the maintaince window, power that primary router/firewall/load balancer down hard and see if the fail over works. It's like testing back ups, kids. You have to know they work before you need them.
Realistically develop on call strategies. OSDN didn't really have a net ops staff of four. One had quit (why are they counted?), one was in hospital, and two had weak "couldn't reach my cell phone" excuses. That just don't work in the real world. If you are on call, you are on call. The "phone too far away" and "battery fell out" just don't cut it in the adult world of professional net ops. Get a satellite pager, and if you are on call, make sure it's on, and near you so you can hear it.
Don't bash your employees/ former employees, particularly during a heated situation. Shows no class. Besides, if you are such a hot shit. grab that console and fix it. Otherwise, keep your mouth shut. Besides, who is in charge of making sure the people that are hired are qualified? Hmmm?
Document your shit. It's not that hard. Visio can do much of it for you. I'm going to break an NDA here, but the Exodus Service Agreement states that all machines and cables are to be labeled. That is so when the dude (or dudette) has to leave the NOC and enter your cage to reboot your lame box, they know what is going on. Also works well for when you net ops staff is too concerned with getting drunk or laid and your poor programmers have to go in to fix the network.
Some folks really went above and beyond, but it seems to me that the management severely dropped the ball.
Is VA really ready to abandon the hardware market for software services? One has to wonder.
Dave
been there before...
Re:Responsive techies (Score:5)
Rumour has it the conversation went a little something like this:
[Kurt] Hi, cisco tech support?
[TAC] Yes
[Kurt] this is Kurt at slashdot...
[TAC] Oh my god, its about time you called us. You've been offline for nearly 24 hours, we're all going through withdrawls. Hang on a sec, our top techs are dying to help.
I talked to a friend in cisco TAC (Brussels) who said that they regularly lurk on
Since summer weather had come to Europe, I, personally, did not notice the outage. But I promise in the futur to not have a life.
the AC
[Note to Kurt and company, make sure you return your customer satisfaction survey. Those TAC folks live and die based on keeping a very high level of sat scores. I think they need a 4.85 (on scale of 1 to 5) just to keep their jobs within cisco, and a 4.89 to get a raise. So 5's across the board, and in the comments put a link to this
Disorganized? (Score:3)
I've seen this scenario over and over again... one guy who knows and understands the network, ten people standing around at the equipment trying various silly commands to fix it when it's down...
Here's some suggestions -- you probably already realize that 90% of your pain was avoidable, but everyone has to learn "the hard way" the first time, right?
We've got piles of printouts and documetation of all sorts, drawings and spreadsheets, helping us keep track of every IP and machine in this cage, yet it doesn't seem to get any clearer...
That's called bad documentation that no one ever reads.
Get your networking guys to document TROUBLESHOOTING techniques and to teach the programmers how the network is acutally set up and why. You have plenty of talent capable of understanding how it all works there.
Get more than one way (cell phone) to reach your most important network engineers. Pop for a guaranteed delivery text pager and ask them to carry that as well as the cell phone.
Yazz says, "When I arrived at Exodus, Kurt and Dave were trying every combination of things to do to get the 6509 back. But neither they nor I even knew the Cisco Passwords."
Paper. Wallet. Put them there. Better yet, PGP encrypted password escrow somewhere that anyone can get access to, and a locked cheap fire safe at the office with the public and private PGP keys on a CD-R inside -- for just this type of scenario.
So I asked the Cisco technician, Scott, to telnet into our switch...
Bad bad bad... telnet = bad. Good network security always goes out the window when the network's down...
So he's in the switch and he's disgusted and horrified by how we have it configured...
This is probably the most important hint during your entire outage... your network people either don't know what they're doing, or you're not ALLOWING them to do their jobs, or they're understaffed, or whatever other excuses can be made up ... your call, but don't forget this -- if Cisco's "horrified" by your configs, there's a serious issue you need to find and correct somewhere in your organization. Everything from training, to documentation, to troubleshooting procedures needs a serious walk-through.
The one card going bad wouldn't have been such a big deal if the config in both were set up correctly. It was meant to flop to the other interface if the primary card died, which it did, but not with all the info it needed... AKA it was misconfigured...
DO FAIL-OVER TESTING. If you'd have done a fail-over test of this config you'd have known it didn't work correctly during a nice scheduled time when your network engineers are available and at the equipment, instead of the middle of the night during an outage with all of them MIA. This is so easy to avoid.
Exodus really wasn't set up to handle the type of failover the 6509 was meant to do. Thats what the Cisco folks said basically, and the Exodus people are no longer supporting this type of Cisco in their setups.
Nice of them to tell you. Who is the customer here again?
Put a $20/month POTS line in your cabinet for goodness sake!
That's enough... I'm appalled, but hopefully you will straighten out some things now that the site was down for an extended period. Done properly, network downtime should be a rare event, usually caused by human error, not by bad configuration.
Many outages are unavoidable, your outage sounds like it was avoidable, and certain steps could have been taken to minimize the length of the outage.
Re:More Writeups Needed (Score:5)
What you said.
I did a bit of (very junior-level) sysadminning back in my day.
First thing the BOFH told me was "Buy a hard-cover notebook. Not spiral-bound. Not softcover. Write down everything you do. Feel free to doodle and write obscenities if you like. Someday you'll thank me for this".
I was a bit befuddled, and then he showed me his notebooks. Five years of dramatic fuckups and even more dramatic recoveries. His own personal "deja.google.com" (but it was 1992, and long-term USENET searching hadn't been invented yet, hell our office was using UUCP!) for everything he'd had to work out from first principles on his own.
And thus was the PFY enlightened.
(And yes, I did buy him a beer in late 1992, when something I wrote down in mid-1992 jumped off my page and saved my ass.)
Not like my experiences.. (Score:5)
While I agree that I usually get someone at cisco who knows what they're talking about, it is very rare in my experience that it happens in only a minute, although it does occasionally happen. A much more common experience is to wait on hold for 15-20 minutes, but I have waited on hold as long as an hour with them.
All of that being said, I would have to agree that cisco's TAC is probably one of the best tech support groups I've ever worked with.
--
OSDN, Audit ALL of your systems NOW. (Score:3)
Point 1./ Why do you allow TELNET in to your routing/switching equipment from the outisde world? If a CISCO tech' with the password can do it then a hacker without the password likely can too.
Point 2./ If you are connected to the Internet in any way NEVER replace your firewall with a cross over cable. Basically at that stage you have your pants around your ankles, are bent over, with a big "Do Me Now!!!!!" sign on your butt!
More Writeups Needed (Score:5)
"Welcome to the HOWTO. My setup worked the first time. Why didn't yours?"
Granted, noone wants to see stuff on the 'net go down (and we're glad you're back,
Really, what Linux (and other geek subjects) need is to have a Great Book of Failure Stories -- writeups like these that detail horrible outages, downtimes, misconfigurations, security hacks, etc., so that we all can learn from other's mistakes.
Hear, hear! (Score:3)
X: "What happened to Anne?"
Y: "I don't know; all I know is that she anne-tomlinnsoned from work."
Note that this verb should have the subject of the remark used as the subject of the verb, and the organization left as the indirect object. This should be adhered to regardless if the subject quit, was fired, laid off, died, disappeared, never existed, or there was a mutual decision for the subject to leave. In fact, the verb should mainly be used when the method of departure is unknown or never officially stated (or, even officially acknowledged).
Also note that this verb should NOT refer to a person leaving another person, as in "Fred's now-ex-wife had tomlinsonned from him." The number of people (one or more) that are the subject should be less than the number of people who the object represents.
Continuing on, this verb should NEVER be applied in a self referential matter, IE: "I anne-tomlinsonned from them". This implies that the subject either A) knows the reasons, and is just being a prick about not stating them, or B) the subject does not know the reasons due to massive thick-headedness.
Lastly, this term should only be used to convey the sense of inpenetrable mystery surrounding the departure. It would be oxy-moronic to state: "Ted tomlinsonned because he was bored and wanted to leave." If the mystery surrounding the departure is penetrable, use another phrase.
anne-tomlinson, v,: to leave or be removed from a group under extremely odd, and mysterious, circumstances; especially when the actual method of departure or initiating party of departure is unknown. More especially, when the actual departure is apparently covered up or left un-acknowledged.
tenses: anne-tomlinson, anne-tomlinsons, anne-tomlinsonning, anne-tomlinsonned, had anne-tomlinsonned.
Re:Cisco Support (Score:4)
Our results were much the same. Very, very responsive people.
I have to agree with Taco, if they gave this kind of service down at the DMV, they'd be picking up passed out folks left and right.
*scoove*
It MUST have been a FREAK OF NATURE! (Score:3)
Wish the companies I deal with on a regular basis ever showed that level of skill when I need help. well... hmm... actually Speakeasy is generally pretty good about accepting that my problem is accurately diagnosed and figuring out what's wrong. And Viewsonic the other day was able to provide refresh-rate specs on a monitor I wanted to order within about 60 seconds of my placing the call (Though they dropped the ball by not having the specs I wanted available on their web page) What is this trend of good service? It's scaring me...
Re:You know you've been using windows too long whe (Score:3)
Just because someone screwed at your work doesn't make your mantra a universal rule.. Especially when dealing with something like a router or a switch.. These things are normally not meant to be user serviceable and will take a reboot just fine(no hot swappable drives there).. You could have hit a 1ppm problem and rebooting just brings everything back online until statistics kick in again. Little uptime is better than none.
Sure it won't fix anything per se, but getting things normalized enables you to start concentrating on the problems at a less hectic pace..
they're gonna have a field day (Score:4)
---
Re:You know you're a cranky old grognard when... (Score:3)
That sounds lame as hell. (Granted, though, configuring a Pipeline 50 goes right over my little bow head, much less a Cisco. So yes, I'll stipulate that I'm talking out of my ass here.)
The act of rebooting should be just another even that gets logged, NOT a synonym for "oh, and by the way, you can delete the old log file now."
IMHO log deletion should be done on a calendar basis; everything more than x days old gets purged automatically. What's Cisco's rationale for auto-deleting logs during the boot process?
Yeah, more details please! (Score:4)
C'mon, tell us the full story!
Eeep - scary moderators! (Score:5)
BTW, feel free to mod me down, prove my point and compound my paranoia; I've got karma to spare : )
Re:Cisco Support (Score:3)
Re:Like reading old issues of the RISKS digest (Score:3)
hang on, i can do this... (Score:5)
I. Call to Adventure
"By 7 a.m. it was obvious that this was not a typical, easily-fixed, reboot-the-database problem. The network operations people were paged, but did not respond."
II. Meeting the Mentor
CowboyNeal once said, "You can take everything I know about Cisco, put it in a thimble and throw it away."
Whoops, that's not it.
"So I called Cisco tech support."
There we go.
III. Obstacles
"Just to make things interesting we've added ports to the 6509 by cascading to a Foundry Fast Iron II and also a Cisco 3500. We've got piles of printouts and documetation of all sorts, drawings and spreadsheets, helping us keep track of every IP and machine in this cage, yet it doesn't seem to get any clearer unless you've either built it yourself (only one person who did still works here and wasn't available this weekend) or if you've had the joyful opportunity of spending a night trying to trace through it all under pressure of knowing that the minutes of downtime are piling up and the answer is not jumping out at you."
IV. Fulfilling The Quest
"He bounces the switch... copy startup-config running-config
V. Return of the Hero
"The next day, Monday, Kurt talked to Exodus network engineers and asked them why our uplink settings were so confusing to Cisco engineers."
"Tuesday was router reconfig day."
VI. Transformation of the Hero
"At least we've learned a lot from the experience -- like to call for help from specialists right away instead of trying to gut things out, and just how valuable good tech support can be."
"We certainly aren't going to make the same ones [ed: mistakes] again!"
Peace,
Amit
ICQ 77863057
Responsive techies (Score:4)
Re:Rebooting (Score:3)
Take a relational database for example; there is so much, that can go wrong with it. For starters, there are bugs in such complex products and fixing them (save for Postgresql) is beyond your control.
But it must not even be a bug in the database code. It can be something in your network component (we chased cases for month which turned out to be a DECnet issue, but where attributed to the database server), it could be the fact that the db vendor compiles his product on multiple platforms and it's virtually impossible to test every functionality of a new release on every supported platform. Yes, I know that in an ideal world this should be done, but it isn't.
Assume it would be possible to perform such tests. Save for propriatery (or semi propriatery) architectures like OpenVMS/AXP you can have so many different hardware- and network components, that it's just not possible to forsee all eventualities.
After ruling out such possibilities, we're not there yet: What are the query characteristics, how many concurrent users do when, what. What front ends do they use, how are they connected. The problem may even be caused by a component that has nothing to do with the database engine (Access front end, anyone ?)
Although the fundamental cause for the problem might never be detected a reboot of the data server might fix the problem and it will never occur again, since the same combination of factors occurs so rare that it's even impossible to reproduce the problem.
However, the [alt-ctrl-del] attitude of younger IT folks (specifically those that grew up in a PC environment) makes me barf and indicates just how clueless a lot of those folks are. You never reboot a productive IT component, unless there is no other choice or in the context of your normal maintencance cycle (memory leaks do occur in software)
Correct. (Score:3)
Either Anne is real or she isn't. If she's real, this is an internal matter that we really don't need to interfere in. If, as the "Anne" poster suggested, she quit because Taco and Hemos are hard to work with, she was within her rights and should get at least some support from a community which often says "Quit! Now!" to Ask Slashdots about PHBs.
If she's not, this is all a big waste of everyone's time, and possibly the best troll we have ever seen on slashdot. (An account by that name [slashdot.org] has a brand new uid (462836) and zero comments.) Think of the trolls you've posted - how many led to 100s of posts on other threads, conspiracy theories galore, and posts by #1 and #2? Whoever did this (if not Anne) should get mad props from the troll fans, but should not take any more of our time.
My bet is that she's not real. But in either case we should drop it and get on to more important things.
So, what was the problem...? (Score:3)
-S
Kenny (Score:5)
-S
Re:Beware of departure from original statement (Score:5)
Everything failed at about 7AM Sat. Dave was at Exodus between 8:30 - 10:30AM Sat (didn't look at the log book when I got there). Kurt arrived shortly after that I belive (again I didn't look at the log book). I arrived there around 11:40AM. Sat.
And yes my battery was lose on my Nextel. Just takes a little pressure upwards to lossen the batter on the i1000plus I have. The batter doesn't fall out, just loss enought so it lose contact and turns the phone off.
I have now taped the battery in place!
Yazz Atlas
Re:Mad props to CISCO! (Score:3)
ehr... 4 T1's = 4 x 1.544 Mbit. (=6.176Mbit... der)
1 OC-3 = 155.52 Mbit.
Not really similar, eh
Grr... (Score:5)
After having been modded down next to the goatse links, somebody please explain to me how the hell we're supposed to discuss the decidedly strange disappearance (and subsequent reappearance) of this story on the site without getting modded as "offtopic"?
Just where, exactly, are we to discuss this little point? For example, why did this story disappear? Was it technical? Was it editorial?
For a group that is so damned keen on openness and truth, it strikes me as somewhat ironic that several dozen mod points have been used to effectively supress this part of the thread.
I want to know what happened. Others do to. If you can't give us a decent place on Slashdot to discuss this issue, then don't mod us down as offtopic!
Re:Grr... (Score:5)
You'll understand my consternation, though, upon seeing my (admittedly offtopic) post on the Shared Source article regarding the disappearance of this article modded down three points to -1 in the course of roughly one minute, and the seemingly similar fate of a good many other posts like it. Also worth noting is the fact that this article touches on what many would consider a rather sensitive issue with the OSDN and /. crew right now. I don't like conspiracy theories much, but I'll be damned if the situation didn't seem, well, rather odd.
Might I suggest, though, that once stories are actually posted to the front page, they remain as is, even if the order of presentation is not the most desirable? Consistency is key, and having articles disappearing from the front page is not terribly consistent.
How many times do we have to say it... (Score:5)
No matter how FUBAR'd your router/switch/firewall configuration is, it's still no serious obstacle to crackers, Robin.
Re:More Writeups Needed (Score:5)
I discovered the RISKS digest when reading the about software engineering, and it has certainly helped me think about failures and recovery when designing and building systems.
There is also the underlying thesis in the article about how complexity, whether in a bungled redundant network connection or just a large poorly documented, poorly tested, and poor configured system is your enemy in building reliable systems. A lot of systems were built like Slashdot during the dot.com IPO craze, I wonder how many of us rely on such poorly built systems?
Building complex, reliable networks is hard, and expensive. About 3 times what your estimate is, which is about 2 times what you boss expects to spend.
New slashdot sql ... (Score:3)
Is there someone else outthere hosting a site where we can have a non-biased discussion ?
BROWSE AT -1 Checkout how many posts have gone straight in at -1 (and this one too will, I betcha...)
Re:Eeep - scary moderators! (Score:4)
Hey, try browsing at -1 nested - seems like everyone who's questioned the story about the woman who "quit" (Anne Tomlinson?) has suddenly been modded down. I'm not usually one for conspiracy theories, but is this surprising anyone else? I think that the question couldn't really be more on topic, and they're hardly flamebaits or trolls - what's up?
Yes, I'd like to know too. I didn't see the original "Slashdot Back Online" announcement, but I did see Maswan's listing of the 3 different versions of it [slashdot.org], with all reference to the "she wasn't actually qualified" girl removed. And of course, the message from Anne Tomlinson [slashdot.org] decrying her treatment at the hands of Jeff and Rob. And an astonishingly rare post from CmdrTaco [slashdot.org] (his only post in the last several weeks) dismissing her as a troll (and being of course sycophantically modded up to 5).
So I, and no doubt many other loyal Slashdot readers, would like to know - what really happened? Who is Anne Tomlinson? Why did she quit? Why has all reference to her been purged from the site? Why is everyone who asks about her being modded down to -1 so quickly that it is obviously editors doing it?
We have a right to know.
Beware of departure from original statement (Score:5)
Where does this mysterious woman fit into the story above?
hrmm... (Score:3)
-
sean
Ouch! (Or When a Redundant System Isn't) (Score:4)
I remember when I started out in computer networking (and it didn't seem like it was that long ago), I was told this by one of the other technical members of our team, something that I haven't forgotten: redundancy in a system is necessary not only in the hardware and software in that system, but also in the resources that are used to keep that system running (that includes of, course human resources, as well as power HVAC, and so on).
Too often, the human part of the redundancy equation isn't totally factored in. When you don't put all of the human factors into the redundancy equation, you have a redundant system isn't really redundant.
Of course, it helps if you have a vendor that will work with you (and those of you who remember working with Novell servers in "the old days" know what I'm talking about, too).