
Uptime Realities in the Internet World

schnurble writes: "My former boss has written an interesting article on the realities of uptime in the Internet World. It poses the idea that four and five nines of reliability are too expensive to be realistic, especially in the post dot-bomb economy. It's an interesting read, especially if you answer to an 800lb gorilla for outages and uptime issues."
This discussion has been archived. No new comments can be posted.


  • Uptime (Score:5, Funny)

    by dattaway ( 3088 ) on Tuesday July 09, 2002 @04:15PM (#3852002) Homepage Journal
    Wouldn't you know it, an article about uptime...and slashdotted. Looks like he needs a mirror.
    • Re:Uptime (Score:3, Funny)

      by spencerogden ( 49254 )
      Of course the article is about how uptime is too expensive, I guess this proves the point...
      • Re:Uptime (Score:4, Informative)

        by ranulf ( 182665 ) on Tuesday July 09, 2002 @10:00PM (#3853861)
        the article is about how uptime is too expensive

        I'd also say impractical. 5 nines is 99.999% availability, i.e. you can be down for 1 second in every 100,000 seconds (about 27.77 hours). That works out to approximately 6 seconds of downtime per week.

        Even if all that week's downtime came at once, six seconds is little enough that most users would just hit refresh and never even notice. Besides which, most web servers are taken down for maintenance tasks, upgrading software or disks, etc... Chances are even restarting the web server would take up more time than your maximum weekly downtime.

        Given that over the course of a month (which is the billing period on most ISP lines), you only have 24 seconds of possible downtime, it's very unlikely that the ISP will be able to meet that target. Pretty much *any* fault would take longer than that to fix, so any company offering a refund if the SLA isn't met is just asking for trouble.

        • Server vs Service (Score:3, Informative)

          by AftanGustur ( 7715 )
          Even if all that week's downtime came at once, six seconds is little enough that most users would just hit refresh and never even notice. Besides which, most web servers are taken down for maintenance tasks, upgrading software or disks, etc... Chances are even restarting the web server would take up more time than your maximum weekly downtime.

          You are not making the distinction between "server uptime" and "service uptime". When people talk about 99.something% uptime, they are usually referring to "service uptime". With proper hardware (redundancy, etc.) you can reboot servers, change disks, memory, and even routers, and it won't cost you even 1 second of "service downtime".
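The back-of-envelope arithmetic in the comments above is easy to check mechanically. A minimal sketch (the function and the period choices are mine, not from the thread; the thread's "24 seconds per month" figure assumes a four-week month):

```python
# Downtime budget implied by "N nines" of service availability.

def downtime_seconds(nines: int, period_seconds: int) -> float:
    """Allowed downtime, in seconds, for a given number of nines."""
    unavailability = 10.0 ** -nines        # e.g. 5 nines -> 1e-5
    return period_seconds * unavailability

WEEK = 7 * 24 * 3600                       # 604,800 s
YEAR = 365 * 24 * 3600                     # 31,536,000 s

print(downtime_seconds(5, WEEK))           # ~6 seconds per week
print(downtime_seconds(5, YEAR))           # ~315 seconds per year
```

The same function reproduces the tables further down the thread by varying `nines` from 1 to 9.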

    • Re:Uptime (Score:3, Funny)

      by suss ( 158993 )
      Wouldn't you know it, an article about uptime...and slashdotted. Looks like he needs a mirror.

      You could always try the Google Mirror [alltooflat.com]
      • Re:Uptime (Score:2, Funny)

        by n9hmg ( 548792 )
        1. Cool site. Nice idea.
        2. Unfortunately, neither elgooG nor archive.org had a chance to cache it before we killed it. The "former" in "former boss" was probably an update, made after he slashdotted the poor guy.
        • The article has been posted in this thread by an anonymous donor. The parent of its four pages has not been moderated yet.
  • by tshak ( 173364 )

    Uptime Realities in the Slashdot-linked World
  • by derekb ( 262726 ) on Tuesday July 09, 2002 @04:17PM (#3852027) Journal

    How many engineers out there have heard the marketing / sales 'it has to be always available' line and priced out an infrastructure accordingly?

    Even recently I'm working with a customer who wants a compromise between price and availability - but it still needs five nines.

    Availability is infrastructure plus process. You need to have the supporting process to go along with the hardware - maintenance schedules, change management (well FCAPS in general), etc. It's not just a big box.
    • by Subcarrier ( 262294 ) on Tuesday July 09, 2002 @04:23PM (#3852083)
      Even recently I'm working with a customer who wants a compromise between price and availability - but it still needs five nines.

      $999.99

      Problem solved. ;-)
    • My company (a large-ish, surviving Internet retailer) has internally announced a Six Sigma initiative. I'm wondering if we'll need to maintain 5 9s uptime...
      • Good luck applying Six Sigma to processes that aren't directly related to manufacturing something ... ;)

        I really mean that - Good Luck. :)
        • My company does lots of things, but almost no manufacturing (our local office provides engineering services to the government and military). We also got hit with the Six Sigma marketing buzz, and our stupid (now departed) CEO decided that they needed to initiate the garbage company wide. I've managed to avoid it so far, but I've passed by the conference room occasionally while sessions have been going on, and I would have to say that it would score real close to 10 on the Wank-o-meter. [cynicalbastards.com] All of the engineers who have been subjected to it have said it's nothing more than good engineering practice that they should have learned in school. But maybe it's good for the administrative/marketing types.
      • Actually, you'll need to add two sixes to it.

        Six Sigma is a maximum of 3.4 defects per million. So converting to uptime would be:

        Uptime percent = 100*(1 - 3.4*10^-6) = 99.99966

        After we take off the literal filter, I'd have to say that was a pretty funny comment. Just hoping to add a little connection to the Six Sigma to Five Nines relationship.
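The conversion above checks out; a quick sketch (3.4 DPMO is the standard Six Sigma quote; treating each defect opportunity as a unit of service time is the parent's assumption):

```python
# Six Sigma's quoted ceiling: 3.4 defects per million opportunities.
# Treating defects as downtime opportunities gives the parent's figure.
DPMO = 3.4

uptime_percent = 100 * (1 - DPMO / 1_000_000)
print(f"{uptime_percent:.5f}")             # 99.99966 -- between 4 and 5 nines
```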

    • Our current ISP (Group Telecom) guarantees 5 9's of reliability and it's pretty much a joke. We've already burned through several years' worth of downtime (granted, only a couple of hours a month) and who knows what will happen to our "guaranteed service" if/when they finish their slide into bankruptcy.

    • by rob_from_ca ( 118788 ) on Tuesday July 09, 2002 @04:32PM (#3852167) Homepage
      This is the most intelligent thing I've ever heard on slashdot. If you don't understand this comment, read it again and again until you do. :-)

      If you're a business, your money is far better spent improving the user experience rather than buying redundant-everything, building the support infrastructure, and incurring the extra overhead of the tedious and careful processes needed to obtain 5 nines (and 4, and even to a degree 3 nines).

      If your site sucks and no one visits, it doesn't really matter if it's down...work on building something reasonably reliable that is very compelling to your users; that's money much better spent...
      • by alexhmit01 ( 104757 ) on Tuesday July 09, 2002 @04:43PM (#3852259)
        Let me give you a hypothetical case. One of our clients does about $50k/month on their web site. When the site was built, they were only expecting $10000-$15000/month. At the time, NN4 compatibility wasn't important, because the extra cost ($10k) wasn't going to be worth it. With NN4 sitting between 5% and 10% each month, they have decided that NN4 compatibility is important in the next version.

        When we launched, 3 days of downtime a month was considered okay. It was considered a better choice than spending an extra $5k on hardware for redundancy. Well, when the site broke $40k/month, we immediately decided that that was no good and invested in the redundancy.

        The site has had a few 15 minute outages over the past 6 months, and a 1 day outage over a holiday weekend (not a big deal). However, if the site doubles in revenue again, downtime is becoming less acceptable, and we'll drop $10k to avoid it.

        If your site sucks and no one visits, downtime doesn't matter. If you are making lots of money, downtime does matter. $10k on hardware is worth it if the downtime would cost you $25k.

        Alex
    • If you want to learn about uptime, don't bother going to codesta.com. Their servers have already melted from a brutal slashdotting. According to Netcraft, codesta.com runs Linux and has 74 days of uptime [netcraft.com]... until today!
    • by ipsuid ( 568665 ) <ipsuid@yahoo.com> on Tuesday July 09, 2002 @04:48PM (#3852304) Journal

      One word to clients... "Outsource"

      Maintaining backend infrastructure with a 5 9's service level agreement really is prohibitively expensive for all but the largest businesses. Especially if they are not a tech company.

      The level of engineering that goes into providing true 5 9's service is extraordinary. Also, some military contracts actually require 6 9's!! (Let alone completely separate networks for classified data).

      I'm actually in the design phase of a data center which requires 5 9's (so we can take on those who decide to outsource). Redundant generators, redundant UPS, redundant routers, redundant HVAC, two separate cable runs from different sides of the building, two connections to the power grid, etc., etc....

      And that's just the physical infrastructure! Now you need to develop or integrate the software to completely cover every aspect of your operations: anything from cable tagging, to ticketing systems, to emergency procedures. After you build all the infrastructure, take that price and double it... that's how much you will be spending to develop all of those operating procedures. At which point, go get ISO certified - since you've already gone above all the requirements.

      If I had to take a guess at a physical cost, $250-300 a square foot seems pretty close (around here anyway). And that only gets cheaper if you are looking at a facility greater than about 10000 sq. ft.

      Unless of course, only marketing has those 5 9's!

  • my boss.... (Score:5, Funny)

    by Patrick13 ( 223909 ) on Tuesday July 09, 2002 @04:18PM (#3852032) Homepage Journal
    said if i can get this mentioned on slashdot, i'll get the raise after all...

    • said if i can get this mentioned on slashdot, i'll get the raise after all...

      But now that his email is posted on the front page of slashdot, maybe they'll just split the difference between being fired and getting a raise.

      -schussat

  • No Grudge? (Score:3, Funny)

    by fiftyLou ( 472705 ) on Tuesday July 09, 2002 @04:18PM (#3852041)
    "My former boss"

    Nice, and you go after your ex-boss by getting his article slashdotted! ;-)
  • by ALecs ( 118703 ) on Tuesday July 09, 2002 @04:19PM (#3852051) Homepage
    After a major firewall downtime last year, I wanted to have some T-shirts printed up advertising

    Tovaris Systems Support:

    Proudly providing nine-fives reliability.

    The boss didn't go for it, though. :(

  • 9 9s (Score:5, Funny)

    by digitalsushi ( 137809 ) <slashdot@digitalsushi.com> on Tuesday July 09, 2002 @04:20PM (#3852054) Journal
    Like the Telco... voice grade telco. Better than the power company.

    Our web server does about 4 9's, which is a downtime of about 8 hours a year, I think. I really suck at math though. I mean it... I'm so bad at math I have no idea if that's right. I said "well, there's 8544 hours in a year, so 8 divided by that is 0.0009, so that's about 4 9s." I think. 8 hours of downtime isn't that bad. I think the next step up from 8 hours of downtime is essentially those megacorps that have redundant systems, and sirens go off and people die when their server goes down for under a second. In fact, I think if their server actually went down for more than a second, some sort of structural damage to the building hosting it is the only likely scenario. 'Course, that's closer to 7 9s. I can't figure out how long any of the other 9s are because I only knew what our average downtime is, and could do the math that way only. Wow, it's really hot in here.

    Could someone with an 8th grade math education please post the amounts of downtime 1 through 9 9s are, please?!
    • Re:9 9s (Score:5, Informative)

      by Anonymous Coward on Tuesday July 09, 2002 @04:29PM (#3852142)
      1 nine: 90% availability, or 37 days of downtime per year (Qwest!)
      2 nines: 99% availability, or 88 hours of downtime per year
      3 nines: 99.9% availability, or 9 hours of downtime per year
      4 nines: 99.99% availability, or 53 minutes of downtime per year
      5 nines: 99.999% availability, or 315 seconds of downtime per year
      6 nines: 99.9999% availability, or 32 seconds of downtime per year
      7 nines: 99.99999% availability, or 3 seconds of downtime per year

      Beyond that, it doesn't much matter.
      • Re:9 9s (Score:3, Interesting)

        by fishbowl ( 7759 )
        >Beyond that, it doesn't much matter.

        Well, beyond "7 nines" you would start talking about 100% reliability. So you start with contingency plans for a terrorist attack on one data center at the same moment as a quake under another data center. Now you're in the realm of needing your own redundant power plants, and probably network infrastructure that does not really exist yet.

        So in reality, your guarantee of "9 nines" - or, effectively, ZERO downtime for the life of the product - really would be specified in terms of compensation and not technology. In other words, you'd be stating what the client will receive when (not if) the uptime guarantee is not met.

    • Re:9 9s (Score:5, Informative)

      by Wrexen ( 151642 ) on Tuesday July 09, 2002 @04:29PM (#3852145) Homepage
      TI-89 > all education
      9's ---- time
      1 876 hours
      2 87 hours
      3 8 hours
      4 52 minutes
      5 5 minutes
      6 31 seconds
      7 3 seconds
      8 .3 seconds
      9 you get the idea
    • Re:9 9s (Score:4, Funny)

      by Asprin ( 545477 ) <gsarnold&yahoo,com> on Tuesday July 09, 2002 @04:40PM (#3852230) Homepage Journal
      Could someone with an 8th grade math education please post the amounts of downtime 1 through 9 9s are, please?!

      365 days * 24 hours/day * 60 minutes/hour = 525600 minutes/year.

      %uptime %downtime Fuzzy description of downtime
      .9 .1 52560 minutes down/year ~= 36 days down/yr
      .99 .01 5256 minutes down/year ~= 3.5 days down/yr
      .999 .001 525.6 minutes down/year ~= 9 hours down/yr
      .9999 .0001 52.56 minutes down/year ~= 1 hour down/yr
      .99999 .00001 5.256 minutes down/year ~= 5 minutes down/yr
      .999999 .000001 .5256 minutes down/year ~= 32 seconds down/yr
      .9999999 .0000001 .05256 minutes down/year ~= 3.2 seconds down/yr
      .99999999 .00000001 .005256 minutes down/year ~= (A THIRD OF A SECOND/YEAR!!!!)
      .999999999 .000000001 .0005256 minutes down/year ~= How long it takes for one of these locally hosted sites to get /.'ed


    • Re:9 9s (Score:2, Interesting)

      by michael_cain ( 66650 )
      Ah, local telco reliability.

      IIRC, and it's been a number of years, the overall goal was about 50 minutes of outage per line per year (a bit better than four nines). Different failure modes were allocated different parts of that total. Components like the wires, that only took a single line out of service, were allocated the lion's share. Switch components were allocated smaller amounts, depending on how many lines would be out of service. Total system failure on a switch was allocated about 4.5 minutes per year (five nines).

      No switching system ever actually made that grade. Probably the ones that came closest were the old electromechanical "steppers". Many small steppers in small towns ran completely unattended, and maintenance consisted of someone driving out once a month to make sure the building was still there and to polish some relay contacts.

      All of the computer-controlled switches had dual synchronized processors (i.e., each one executing the same op codes at the same time) and duplex memory, with a bunch of extra hardware to detect faults. The single most common cause of total system failure was when a fault had occurred, and the system was running "simplex", and a tech pulled a card from the active rather than the failed processor.

    • Re:9 9s (Score:4, Funny)

      by 4of12 ( 97621 ) on Tuesday July 09, 2002 @05:23PM (#3852629) Homepage Journal

      Hmmm...

      Enough nines of reliability and you can probably easily claim that network latency is responsible for the slow response a client is experiencing. :)

      The server can go down and be rebooted before the client thinks something is really wrong!

  • In other words... (Score:2, Insightful)

    by reaper20 ( 23396 )
    We should just give up on decent service and professionalism. I don't think so.

    My ISP (Ameritech) seems to think so, considering my DSL connection and their promptness to "Get ahold of me within 24 hours..."

    Bleh... it's not unrealistic... don't expect people to live with downtime just because a good portion of those systems need to be rebooted on a regular basis (Win machines), and the general carelessness of sysadmins around the world lets things like Nimda and Code Red get out of hand. This is an excuse to let companies too cheap to have decent customer support off the hook. Maybe if they were educating their tech staff instead of finding more ways to rip us off, they'd have decent service.

    Everyone with competent sysadmins on rock solid *nix systems raise your hands...

    • My last DB server, which was the back end for a moderately high traffic site (~.5 million hits a day, ~1 million db hits a day), running at about 80% capacity for the last year straight, was up for 11 months before we replaced it last week.

      Win2k my friend.

      And who supported it that whole time?
      Me, the web application developer.

      Sure, we _could_ have paid for a 'rock solid *nix system' and a couple of admins to go with, but my raises over the past couple of years sure would have looked dismal.

      It's called TCO. Sometimes, in some cases, nix isn't necessarily better, or at least there's nothing wrong with Win IF you rtfm.
      Guess you never did! You should try it sometime before slamming WinServer users.

      Oh, never got nailed by Nimda or the red or any others either.
      • Re:In other words... (Score:2, Interesting)

        by JWSmythe ( 446288 )
        The database server handling the message areas for Voyeurweb, RedClouds, and feedback areas for the same has answered 28,442,099 questions in the last 13 days. That's when we finalized changes to it.. Before that, it had been running for 2 years.

        I wish we only had 5mil hits/day.. One web server takes 18mil req/day.. We have bunches of 'em out there. :)
        http://voy37.voyeurweb.com/1.stats.html [voyeurweb.com].

        Did I mention we're a Linux shop?
  • Cost of reliability (Score:3, Interesting)

    by nuggz ( 69912 ) on Tuesday July 09, 2002 @04:21PM (#3852066) Homepage
    Pretty broad statement.

    To get higher reliability you have to design for it, if you only require the lower reliability, it would be considered overdesigned.

    I don't think high reliability is "too expensive". I think sometimes people ask for more than they need.

    Phone system reliability: 911 should be "highly reliable"; long distance across the world can get by being significantly less reliable.

    The main hospital server system should have high reliability, because it is important and worth it - say > 99.99% of the working day.
    The fundraising server, or something like it, could be a bit less reliable - high 90's.

    Demanding high reliability for unimportant applications isn't worth it, and is just a lazy design.
  • by Lord_Slepnir ( 585350 ) on Tuesday July 09, 2002 @04:24PM (#3852087) Journal
    I think we just knocked his server down to two nines by slashdotting it.
  • by palmech13 ( 59124 ) on Tuesday July 09, 2002 @04:24PM (#3852089) Homepage
    What else would motivate someone to post an ex-boss' e-mail address on the front page of slashdot?
  • 99.999% perfection (Score:4, Insightful)

    by Gorm the DBA ( 581373 ) on Tuesday July 09, 2002 @04:26PM (#3852111) Journal
    Let's see...five nines would be just over five minutes of downtime in a year (315 seconds). For business and other non-life-threatening situations, that would be way better than necessary. Lots of folks are probably going to harp on the "If 1 out of 10,000 airplanes crashed, there'd be X crashes" line of argument. There's a problem with that...one mistake doesn't crash an airplane. Every system on an airliner is redundant, and virtually any "pilot error" has time to be fixed before there's a problem. Listen in on the Air Traffic Control to cockpit transmissions sometime...just about every flight encounters some minor error at some point, whether it is a pilot needing to re-ask for a clearance or someone needing to climb or descend a bit to clear a potential collision. Errors are unavoidable. The key is to ensure recovery from those errors is possible. So sure, your computer may be down for 5 minutes a year. Make sure you have a backup system that is able to take up the slack instantly, and your downtime is down to 3/10 of a second a year. Redundancy is the key.
  • It all depends on what is on the server. If it's stuff your own people use constantly on their job, through your own network, you need five nines, otherwise you will take the blame for critical jobs getting done late.

    But when people are going to the server through the internet, they get used to interruptions - there are so many links between, some of which periodically become overwhelmed with traffic, that no one could tell the difference between two nines and five nines on your server itself. So sales & product information sites don't need more reliability than you can readily afford. They do need high capacity.

    And if it's your blogs concerning your navel lint - no one's looking at your uptime but you...
  • Simple (Score:5, Funny)

    by American AC in Paris ( 230456 ) on Tuesday July 09, 2002 @04:27PM (#3852123) Homepage

    Five nines uptime is cheap and easy. It all boils down to where you put the decimal point.
    • Heh, my old Commodore gets about five nines. If I turn it on for just 52 minutes a year, it is getting .0099999% uptime!
  • by isa-kuruption ( 317695 ) <kuruption@@@kuruption...net> on Tuesday July 09, 2002 @04:35PM (#3852195) Homepage
    The "five-nines" of reliability has nothing to do with an individual server being available, but with an individual application. This means you can have 2-3 servers running the same load-balanced application. This way, you can take 1 down every hour if you want; as long as the other one or two are still working, the application is still working. If you're REALLLLLLLLY lucky, you will meet the "five-nines" and if you're EXTREEEEEMELY lucky, you'll get 100% on that application.

    THAT is the goal. It's called redundancy. You will *not* meet any reliability milestones on a single server or network link. It's an obtainable goal, but it does cost money depending on your architecture.
    • Actually, even this is silly. True five nines availability on a widely distributed network would mean that an application was available at all times on all segments of the network. Which would mean that your uptime depends not only on your redundancy on one side of a pipe, but on your overall redundancy as well, so that when a pipe goes down you're still accessible. Since when a pipe goes down in your host you probably lose other resources as well (such as power or alternate pipelines), this means multiple datahouses owned by multiple vendors. Each of these has to have a perfect backup of all data and be running the same versions of all software. Really, the only true redundancy would be so heavily distributed that each local network would basically have to have its own server. This isn't so crazy -- technically, DNS and email do this. However, we all know that for an end user even DNS and email can have perceived outages.

      And this is why 5 9s is foolish. Sure, you're redundant behind the pipe, but if you lose the pipe you can't blame your datacenter when you charged a customer for uninterrupted service. Technically, if their modem disconnects them for a few hours you've broken contract.

      Besides, who needs it? If Yahoo is unreachable from my desk, I wait and reconnect. It doesn't matter if the downtime was my fault or theirs...the effect on my user experience was the same. Any services I might have used, or products purchased, I will use or purchase at a later time. After all, I don't refrain from buying shoes just because the mall is closed!
      • Having the 5 9's of reliability is NOT foolish. It is a reality of life. My particular organization services 40 million web customers, so we can not afford to be down at any time of the day because of the type of service we provide. In fact, last year we made our goal of having the 5-9's, and we did it without needing our disaster recovery (DR) site.

        Having a DR plan and being reliable go hand in hand for the most part, however under normal day-to-day business conditions, servers need to be upgraded and things unplugged. You don't switch your entire infrastructure over to a DR site to upgrade your apache web server!! It is for this reason you have redundancy on the network and server level leading out to the Internet (or wherever your customer base resides).

        Disasters, on the other hand, do not happen everyday. They happen once a year, maybe.... sometimes once every 2 years. If you live in an area more prone to disasters (like southern California), you may need an alternate site located on the east coast.... but, that is the cost of doing business.

        Also, having 5-9's of uptime does NOT mean being accessible to everyone in the world at any time no matter what. Having 5-9's of uptime means that your organization has successfully kept its applications and services available to the Internet. How is it my company's fault if you don't plug your modem into the wall? It's not, so to say that our "reliability" decreases because of an end user being a moron is a stupid statement.
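The service-vs-server argument running through this subthread has a standard textbook form; a sketch, assuming independent failures and instant failover (which shared networks, power, and software bugs rarely deliver, so read it as an upper bound):

```python
# Availability of N replicas behind a load balancer: the *service* is
# down only when every replica is down at once. Independence and
# instant failover are assumed -- an upper bound, not a promise.

def service_availability(per_server: float, replicas: int) -> float:
    return 1 - (1 - per_server) ** replicas

# Two 99.9% boxes already look like six nines on paper:
print(service_availability(0.999, 2))      # ~0.999999
# A single box, however good, is stuck at its own number:
print(service_availability(0.999, 1))      # ~0.999
```

This is why the thread keeps insisting that no single server or link will ever hit a five-nines milestone on its own.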
  • but their server is down.
  • With M$, it is theoretically impossible to achieve even their advertised uptime (I think back when they ran some ad - still running? - about how Windows can achieve three or four 9s of uptime).

    Total bullshit... let's see - a Windows machine *requires* a reboot every time you apply a patch; a reboot on a large machine is... I dunno, 10 minutes if you've got a lot of crap. A security update turns up about twice a week or so... that puts you at ~99.8% MAXIMUM.

    Even if you don't buy my numbers, three 9s uptime means every week you only get ~10 minutes of downtime.

    Yeah... sure... not if you want to patch up that Internet Explorer / IIS so your system doesn't die from DoS, hackers, or worms!
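The ~99.8% ceiling in the comment above is easy to reproduce; a sketch using the comment's own assumptions (10-minute reboot, two patches a week), which are guesses rather than measurements:

```python
# Scheduled reboots alone cap achievable uptime, before any crashes.
REBOOT_MINUTES = 10        # assumed reboot length from the comment
PATCHES_PER_WEEK = 2       # assumed patch cadence from the comment
WEEK_MINUTES = 7 * 24 * 60 # 10,080 minutes in a week

scheduled_down = REBOOT_MINUTES * PATCHES_PER_WEEK   # 20 min/week
ceiling = 100 * (1 - scheduled_down / WEEK_MINUTES)
print(f"{ceiling:.2f}%")                             # 99.80%
```

Note that even three nines (about 10 minutes of downtime a week) is already blown by this schedule, which is the comment's real point.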
  • Hate standing in the meat locker (server room)? Hate rushing to work past midnight to cycle a server?
    The problem I used to have is I'm not a morning person, so being available as an admin before 7am is tough, but now I can admin my network while trapped in rush hour traffic. =] Reboot servers, telnet into devices, stop/start services, add users, manage DNS... the list goes on and on.
    Uptime can be maintained without even having to leave the comfort of your easy chair. If you're an admin you should check this product out.
    SonicAdmin by sonicmobility
    (http://www.sonicmobility)
  • 3 9s = 99.9% uptime = 8.75 hrs/Yr = 525 min/Yr.
    4 9s = 99.99% uptime = .875 hrs/Yr = 52.5 min/Yr.
    5 9s = 99.999% uptime = .0875 hrs/yr = 5.25 min/Yr.
    9 9s = 99.9999999% uptime = .03 seconds per year downtime.

    I call bullsh*t on anything that claims to have 9 9s reliability. 3 seconds every HUNDRED years.
  • Looks like codesta.com just used up all its downtime by getting its servers slashdotted.
  • by cOdEgUru ( 181536 ) on Tuesday July 09, 2002 @04:48PM (#3852306) Homepage Journal
    I believe there's more to this than meets the eye.

    What better way to get back at your former boss than slashdotting him or his company server back to the medieval ages...

    Follow that up with multiple queries on google about boss's info, credit cards, ssn etc..

    To cut things short, by the end of the week :

    Boss's boss realizes the server crashes were due to Boss, fires his ass on the spot.

    Wife realizes that the new unexplained charges on Credit card from "Suzy's Parlor" were not exactly the next door cafe. Gives him the boot as well.

    You evil man..you!
  • ...that this article is hosted on a server which is now being brutally Slashdotted?
    • I can't read the paper but, for his sake, I hope that he really meant that reliability isn't that important to him.

      His server is toast!

  • by SkyLeach ( 188871 ) on Tuesday July 09, 2002 @04:50PM (#3852326) Homepage
    We did it on a really low budget:

    Heartbeat/Mon/Fake/Coda/Linux/IPVS for the High Availability, failover from DS1->DS2, each on different backbone nodes.
    Mirrored systems in different geographic locations:
    Firewall
    IPVS Gateway
    Apache->Weblogic bridge (Apache vhosts with ssl)
    Apache->Zope bridge (Apache vhosts with ssl)
    Zope->Zeo setup for content management.
    SAN drive array for Oracle, running on two E4500s

    This system isn't really that expensive, just the costs of hardware and my salary for setting them up.
  • I work for a small ISP in central NY. A couple years ago, I can't remember which provider it was anymore, but they unplugged us because their paperwork was all screwed up and they didn't think anybody was on the circuit. Then they plugged somebody else into it. It not only took us several hours to find out what the problem was, it took 3 whole days for them to resolve the problem. They wouldn't simply undo what they did, they had to assign us a new circuit and basically refused to escalate the work order. We eventually came back up but lost quite a few customers, understandably.
  • by nomadic ( 141991 )
    It's an interesting read, especially if you answer to an 800lb gorilla for outages and uptime issues.

    You really want to see someone go berserk over downtime, try running a MUD...
    • Actually, the specific 800lb gorilla I was thinking of was a client when Steve and I were working together.

      eBay.

  • That site deserves to be slashdotted. They have this little paper divided into about ten little sections, which multiplies their load by 10x or so. Then, it's a .jsp page (why?), which means more server-side interpreter overhead. If they hadn't crudded up the basic job of serving a readable document, they'd have one or two orders of magnitude more capacity.
  • "It's an interesting read, especially if you answer to an 800lb gorilla for outages and uptime issues."

    Please. Let's not talk so badly about eBay. Do you know how many people have been crushed under their CIO's foot? ;)
  • by Todd Knarr ( 15451 ) on Tuesday July 09, 2002 @05:09PM (#3852492) Homepage

    Remember that downtime is related not only to reliability of each piece of equipment but the number of pieces of equipment. 99.99% uptime sounds good, less than an hour of downtime a year, right? Scale that to a 500-server farm and it's an hour and ten minutes or so of downtime a day, every single day of the year including weekends and holidays (OK, we'll give you one day off in leap years). This concept has boggled a few salescritters who don't grasp the concept of scale.
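The scaling arithmetic above can be made concrete; a sketch with the comment's numbers (500 servers at four nines each; the variable names are mine):

```python
# Per-server downtime is tiny; summed over a farm it is a daily event.
SERVERS = 500
PER_SERVER_AVAILABILITY = 0.9999           # four nines per server
MINUTES_PER_YEAR = 365 * 24 * 60           # 525,600 minutes

per_server_down = MINUTES_PER_YEAR * (1 - PER_SERVER_AVAILABILITY)
farm_down_per_day = per_server_down * SERVERS / 365

print(f"{per_server_down:.1f} min/yr per server")      # ~52.6
print(f"{farm_down_per_day:.0f} min/day across farm")  # ~72
```

That ~72 minutes of aggregate server-downtime per day is the "hour and ten minutes or so" the comment describes.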

  • New Uptime Server (Score:2, Informative)

    by Aknaton ( 528294 )
    Those who remember the awesome but now-defunct uptimes.net will be pleased to know that a new server is now up and running. It uses the old uptimes protocols and clients.

    The URL is http://uptimes.wonko.com/

    A GNU/Linux box was number one the last time I looked, with a NetBSD box coming in second.
  • So much for "two nines". Nothing, I repeat nothing, can withstand the /. hordes ...
  • by Carmody ( 128723 ) <slashdot@@@dougshaw...com> on Tuesday July 09, 2002 @05:40PM (#3852762) Homepage Journal
    "Are ve up?"
    "Nein."
    "Are ve up yet?"
    "Nein."
    "How about NOW?"
    "Nein."
    "Vill ve be comink up soon?"
    "Nein."
    "Vill ve be up next veek?"
    "Nein."
  • Full Text - Page 1 (Score:5, Informative)

    by Kallahar ( 227430 ) <kallahar@quickwired.com> on Tuesday July 09, 2002 @07:32PM (#3853311) Homepage
    The Scenario

    Pagers going off. Phones ringing. People shouting fragments of conversations over the tops of cubicles. Groups of people huddled around monitors. Others dashing up and down the hallways, sticking their heads into office doors for just a moment, then scampering along to the next doorway. You are frantically talking on your cell phone, silencing your pager, and yelling into the speakerphone on your desk while typing on two different keyboards attached to three different monitors.

    Sound familiar? It's a classic case of the dreaded 'downtime' disease, a terrible ailment where none of your systems work and for reasons you can't always understand. Of course, it typically strikes at the most inopportune moments - the launch of a major product upgrade, or right after announcing your partnerships with 5 of the Fortune 100.

    Nobody wants downtime. It's a terrible thing that always involves blood, sweat, tears, and, inevitably, a loss of money. This is why, when you talk to the upper management of any company with a strategic online initiative, you'll be told that the IT group has the highest goals, and that downtime is considered anathema, to be stamped out vigorously.

    Unfortunately, when you talk to the company's IT manager you commonly hear a different story: the resources to back up the company's lofty online goals are hard to come by. In fact, with the downswing of the last couple of years, combined with the fact that IT isn't (at least directly) a revenue-generating entity, IT budgets are being reduced while uptime performance levels are expected to stay the same. This leads to a death march of extremely overworked IT personnel, and longer, more numerous occurrences of system downtime. These goals need to be re-evaluated.

    Genesis of the 'Five Nines'

    We've all heard the mantra of 'five nines', or 99.999% reliability. Somewhere in the depths of the Internet's 'big bang', when systems were slow and cranky, reliability became a major selling point of why one company's system was 'better' than the competition.

    First, people talked about being 'two nines' or 99% reliable. Then someone else would top that, and make their product seem better, claiming 'three nines' (99.9%). Not long after that came 'four nines' (99.99%) and then, near the peak of the dot com era, came 'five nines'.
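    To put concrete numbers on that ladder of marketing claims, here is a short sketch (mine, not the article's) of how much downtime each level of "nines" actually permits per year:

    ```python
    # Allowed annual downtime for each "nines" level of availability.
    MINUTES_PER_YEAR = 365.25 * 24 * 60

    for nines in range(2, 6):
        uptime = 1 - 10 ** -nines          # e.g. 3 nines -> 0.999
        allowed = (1 - uptime) * MINUTES_PER_YEAR
        print(f"{nines} nines ({uptime:.5%}): "
              f"{allowed:8.1f} minutes of downtime per year")
    ```

    Two nines allows roughly 3.7 days of downtime a year; five nines allows barely five minutes, which is why each extra nine is so much more expensive than the last.
    
    
    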

    The herd mentality left no room in which to pitch for investment without the 'five nines' claim. "After all," it was thought, "if everyone else is saying they can provide 'five nines', I'd have to pretend I didn't know what I was doing if I didn't say I could match everyone else's claim."

    'Five nines' isn't impossible. It's merely impractical and unnecessary in the world of the Internet. A shocking statement, perhaps, but a truism nonetheless.

    We're not talking about launching people into space (which, by the way, is unfortunately done under 'three nines'), or working with nuclear power plants. We're working within the context of online systems providing services to users both on and off the Internet; nobody dies from a system failure.

    The Greasy Steel Bar

    Think of uptime as a chin-up bar coated in grease. The higher the reliability desired, the thicker the coating of grease. It's clearly tougher to hang on to a higher standard of reliability.

    What's not so obvious, but very important, is that the higher the uptime target, the worse one does if unprepared. An IT department capable of three nines, faced with a bar that's five nines slippery, won't even manage the three nines it is capable of.
