
Why eCommerce Sites collapse

Rahul Mehra writes "ZDNet has an interesting article about how eBay and other e-commerce sites collapse under heavy loads. It talks about how massive growth, incomplete planning, rising expectations (24x7 uptime), and immature technology all contribute." This train of thought, for me at least, leads to a neo-Luddite question - what do you folks think?
This discussion has been archived. No new comments can be posted.

  • whenever you are talking about serious work. You are back to SP clusters and s/390s and S/70s and E10ks and so on.

    The Schwab issue is clearer than they are making it out to be -- I have some knowledge of this going back two years, and Schwab has some complete idiots running things, still, even after a series of disasters (some of which didn't make the news). I really can't explain it any way other than that people who have gotten MBAs seem to trust only other people with MBAs, no matter how poorly they perform. I set up my account there as soon as I could, but after hearing really unpleasant stories for a few years, I finally went to Fidelity.

    S/390s aren't unreliable, and Parallel Sysplex stuff works well dynamically, but if a) you are basing your maintenance window and procedures on a salesman's promises to an MBA, and b) you aren't keeping the better mainframers because of pay and poor treatment, you will have problems.

    Similarly, Cisco routers aren't unreliable. Encryption makes a 30% performance hit. If you are about to be swamped by transactions (if the market is tanking, for instance), then turning the encryption off is a command decision that you "get paid the big bucks for." Not doing so and having systems choke is not a problem with Cisco, any more than undersizing the systems is a technical problem.

    I am relatively confident in Schwab -- I would be confident enough to keep my money there if they would take my money as seriously as I do and spend less on MBAs in technical positions and more on technical people in technical positions.

    And no, to the best of my knowledge (I have several funds and they have positions in everyone out there), I own no Schwab or Fidelity stock.
  • by Anonymous Coward on Monday June 21, 1999 @10:24AM (#1840527)
    At least, that's the way a lot of companies seem to treat sysadmins. A good sysadmin, who keeps systems running smoothly, *appears* to be doing nothing. Why should such a person be paid very much just to run backups and turn a few screws, they'll ask. The sysadmin is viewed as a high-tech janitor, and is given about as much respect. This often results in companies only hiring one sysadmin, or worse, foisting the sysadmin's duties onto other people on a standby basis. So when the fires come, the company is suddenly understaffed. A bad sysadmin, who's always recovering from crashes, restoring backups, and rerouting network traffic, looks like a busy employee. "If not for him, our machines would all be down, right?" they will say. So what we have here is a scenario that favors either bad sysadmins, overworked sysadmins, or standby sysadmins who are actually full-time employees with other stuff to do and worry about. Welcome to hell.
  • Does anyone else think that everyone out there *expecting* eBay to be available every second of every day is a bit extreme? I mean, look at it this way: you can, most of the time, go on eBay, probably find something close to what you are looking for, for about half the price of retail, and even order the damn thing straight to your door within a few days. And you bitch when the service burps?

    It's really sickening to hear that people can't get a grip on how far technology has come, and expect it to be way farther than it is.
  • One thing the article didn't mention directly, but which plays a very important part in the stability/quality of your infrastructure, is the quality of your sysadmins. It doesn't matter if your boxes are triply redundant, hot-swappable, never-go-down systems if the sysadmins inadvertently blow away key files periodically. Lots and lots of IT shops seem to hire semi-trained monkeys as sysadmins and then wonder why their site is always going down. Look at the chart of outages on the second or third page of that article, and notice how often "Failed software upgrade" appears. The problem is that the hardware vendor is usually blamed for those kinds of problems, which draws attention away from the true problem of unqualified sysadmins. Of course, most of the Slashdot crowd doesn't fall into that category.
  • >> turning the encryption off is a command
    >> decision that you "get paid the
    >> big bucks for."

    Gee, if I were a hacker, I'd *never ever* wait until a big event (e.g. the market goes to hell) to start dumping to disk if I had managed to hack into a decent-sized ISP (or worked there and was a pissy sort of person). The prestige for showing an online broker to be vulnerable has to be pretty significant, especially if you moonlight as a "security consultant" or whatever.

    Maybe I'm just a wuss, but it seems like
    s/get paid/get fined/g;
    is a distinct possibility if the ruse is uncovered. (It's also a tacky thing to do)

    I suppose that with those sorts of loads, you could make a case for it being statistically infeasible to pull any real information out without a huge amount of disk space to dump the packets onto and a lot of time to pore through them... but people don't change their passwords very often, and you could probably assemble useful information in a reasonable amount of time. And a decent lawyer should have no trouble spooking a jury into overreacting if a trial came to pass.

    Either way I submit that the magnitude of the negative publicity that would ensue would make such a decision very hard to justify.

    Why not colocate at, say, Above.Net, and rely on their monster pipes for the big loads? It's not like it would cost that much more, and you get to rely on an extremely high caliber of technical staff to keep things running.

    >> Encryption makes a 30% performance hit

    In my experience you're off by almost an order of magnitude, in terms of CPU load. If you're only talking about packet throughput, then yeah, the handshake, key exchange, and renegotiation every few minutes adds about 30%. It seems like CPU power is usually the bottleneck in doing SSL transactions on big fat pipes, though.

    >> people who have gotten MBAs seem to only trust
    >> other people with MBAs

    I had the misfortune of working at one of the top business schools in the country for about a year, and this is what I perceived: MBAs without a physical science or engineering background are categorically inept at technical decisions, no matter how much they think they have learned by reading InfoWorld. Negotiation is an MBA's strong point; following through is Someone Else's Job, as best as I could make out. So why don't they recognize that they are likely to make more money (enough to offset the cost) if they hire the best (and most expensive) technical staff? Beats me...

    'Cause otherwise you're presenting an opening for someone else to gain publicity as Those Guys That Suck Less (tm) and steal your mindshare and profits. That can't possibly be lost on MBAs. (can it?)

  • Doing it right the first time of course isn't that easy, but once it works, don't break it. It must be possible to run a 24x7 site for an entire year while stuck on Gilligan's Island, without any way to contact the rest of the world, including the site.

    NASA has computers in buildings where deadly chemicals are around (a small leak, and everyone in the building is dead within seconds!). Do you think their IS staff wants to touch those computers? Not unless they first send everyone else home and empty those tanks. If it weren't so heavy, they would probably insist on space suits too.

    You decide a year or more in advance how much bandwidth you will get. Then you decide how many customers that will support, and you don't allow marketing to sell to any more customers. That's right, you refuse to allow more onto the system. Marketing can deal with this if you make them, and long-term satisfaction will go up.

    Once you know how much bandwidth you will have, you make sure you have computers that can deal with it. Mainframes have been doing 24x7 for years, and Unix is very close to matching that (with Sun's redundant hot-swappable systems perhaps better, not that Sun is the only choice). I have seen triple-redundant systems with a polling mechanism where, if one computer gives a different result, it is shut off. Guess what: none of this is cheap. That's right, doing business on the Internet in volume isn't cheap. Spend the money on systems that will stay up, and enough capacity that you don't run out, and you will run 24x7. There are plenty of companies that make equipment that is meant for this use.

    Last, and foremost: hire system administrators who have proven they can keep systems running 24x7, and pay them to do so. These people are older, in their 50s or so. Hire the best of the experienced, and then give them a deal: you pay them to keep the systems up, busy or not. They should soon find a paycheck arriving every two weeks, with only a few hours of work a month.

    Remember, design the system so you can run it from Gilligan's Island (no access by you) without your boss realizing, and you will do fine.

    Of course, the reality is that you do have to replace crashed hard drives, but with RAID-6 (RAID-5 plus more redundancy; RAID-6 isn't officially defined) that can happen at any time. You do need to buy more backup tapes once in a while, but automated backups are the norm in 24x7 environments.
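The capacity planning this poster describes is basically arithmetic. A minimal sketch -- all figures here (link size, per-session bandwidth, headroom fraction) are hypothetical illustrations, not numbers from the article:

```python
# Back-of-the-envelope capacity planning: given a fixed bandwidth
# commitment, decide how many concurrent customers to admit.
# All figures below are hypothetical.

def max_customers(link_mbps, kbps_per_session, headroom=0.5):
    """Concurrent sessions a link supports while keeping a
    `headroom` fraction of capacity in reserve for spikes."""
    usable_kbps = link_mbps * 1000 * (1 - headroom)
    return int(usable_kbps / kbps_per_session)

# A 45 Mbps (T3-class) link, 20 kbps per active session,
# holding half the pipe in reserve for spikes:
print(max_customers(45, 20))   # 1125
```

Marketing then gets a hard cap of that many concurrent users until more bandwidth is provisioned.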

  • This ZDNet article is probably the most fact-filled piece I've read from them.

    I tend to agree with the theory that many of these companies still think like start-ups; they act like they don't have any money to spend! Perhaps they're just not aware of where their money is best spent. I can't say I know the start-up web content business mentality to its very ends, but when money is tight you start betting against catastrophe, and hope your odds are good. Duplicate server hardware is expensive for a small shop, but when you have billions of dollars in revenue, and your _entire_ business relies on your information infrastructure, the least you should do is build a duplicate server farm right down to the cables on the power supplies.

    Yeah, you'll blow a million dollars on it, and you might not need it, but the maintenance costs are lower than the cost of losing your auction site, on-line trading service, bank, or retail market for five days.

    You co-locate services at multiple network access points. You use reliable software--the kind you have source code to, so you're not on the phone at midnight with a "knowledge engineer" across the country who is trained in taking bug reports. You need to fix the problem so you hire people who can.

    You spread the load at all points (you have multiple web servers, multiple database servers, multiple administration access points, redundant networking hardware), and you always have ample staff around for that 4:00 AM breakage.
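Spreading the load across multiple servers, as described above, can be as simple as rotating through a pool and skipping dead hosts. A toy sketch of the idea (hostnames are made up):

```python
from itertools import cycle

class Pool:
    """Round-robin server selection with a failover skip."""
    def __init__(self, servers):
        self.rotation = cycle(servers)
        self.down = set()          # hosts currently marked unhealthy

    def pick(self):
        # Walk the rotation until a healthy host turns up;
        # assumes at least one server is up.
        for server in self.rotation:
            if server not in self.down:
                return server

pool = Pool(["web1", "web2", "web3"])
print(pool.pick())       # web1
pool.down.add("web2")
print(pool.pick())       # web3 (web2 skipped)
```

Real deployments push this into DNS round-robin or dedicated load-balancing hardware, but the principle is the same.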

  • Using age as a disqualifier?

    Someone who's 50 has a chance to have 30+ years of experience in the field. Let's see you, hot-shot 25-year-old, have 30 years of experience.

  • That first link should have read 'eBay problems probably preventable'.
  • Make sure to read the link about why eBay probably preventable [] - it seems eBay didn't suffer from load problems; rather, eBay hadn't installed an easily available patch - apparently it had been available for about a year. This LA Times story [] says about the same thing, but is much shorter.

    The first link basically says that the eBay guys weren't paranoid enough about making sure the setup was reliable. This is always a problem. (Hey, I'm working on a commercial web site that only got a proper sysadmin two years after it started...) Little side note - one guy says Sun's clustering stuff is not that great... I know Sun has been a bit late getting started on clustering, but I've also heard that what they have done is pretty good, *shrug*. Actually, they just announced version 3 last week, which also allows clustering of 16 Starfires, for 1024 processors. (They're also making the source code for this available...)

  • I dunno... 'eBay probably preventable' sounds like much more fun to me... :P

    Posted by the Proteus

  • I remember seeing a blurb recently on Microsoft's site slamming Sun for causing the problems at eBay. According to MS, the Sun server failed, causing the outage, while the NT front-end servers were golden. Lots of factors were cited, including the E10K's sensitivity to config changes, reliance on a smaller domain server, and other factors.

    Now we learn that the problem was caused by eBay, and eBay alone, by not keeping up on their vendor patches, and that Sun had fixed this particular bug quite some time earlier.

    It would seem that MS needs to print a retraction. Any bets on when we'll see it? :-)
  • Does anyone else think that everyone out there *expecting* eBay to be available every second of every day is a bit extreme?
    No... for example, some time ago I used to go to Fatbrain when I wanted to buy a book. But I frequently found that Fatbrain was down. On the other hand, I cannot remember having problems with, ever.
    So when I want to buy a book, and my cursor is in the URL field of my Netscape, which URL will I type? Fatbrain, even though I know that they might be down again, and I will have to wait patiently for 30 seconds until I get the timeout? Or will I go to and buy the book immediately?
    Well, I use the shop that uses Unix, and not the NT shop...
  • Agreed; it's always more interesting when you can hear the voice of an author or an editor, rather than the bland predigested output of a committee. The Cluetrain Manifesto [] is a good argument that companies shouldn't be homogeneous and faceless on the Web.
  • Looks like the definition of load balancing to me...
  • Look at AOL. They introduced their flat rate scheme, had constant service problems, infuriated all their customers -- and now they're by far the dominant ISP. The goal here is market share. It doesn't matter how much your customers bitch, how much other people hate you, how many "Why XXX Sucks" pages there are about you. The only thing that matters is how many bodies you can claim.

    Providing quality service probably means that you're doing something wrong, just like making a profit does.

  • Thinking that over for another minute, that's much more true for cases like AOL where the system was inadequate for the new load than here where they had plenty of hardware but didn't maintain it properly...

  • It's funny how people's expectations get so out of whack when dealing with technology. Knowing how to use a paintbrush does not make one an artist. Yet when it comes to computers everyone knows what's best. E-commerce sites are just one example, but they are a VERY GOOD example.

    We let pilots fly the airplanes; we let chefs cook the dinner; but we cannot let technical experts exert technical expertise. Sometimes it's scary.
  • Oracle uses files or logical volumes, which are basically glorified disk partitions. My experience is on HP-UX, but what generally happens is that root creates logical volumes for Oracle which are accessed via /dev in the root filesystem. Once the LVs are created and opened, nothing should be able to read/write blocks in them except Oracle, under Oracle's own user ID. It's a basic device-locking process.

    Apparently Solaris screwed up this arrangement and wrote some blocks in Oracle's space. It's odd that Oracle was then able to crash the OS - the only reason I can think of is that Solaris put something really critical in those blocks, and Oracle overwrote them for some reason while it was aborting.

  • No, it's a basic layered design which puts block I/O on an open layer underneath the file system. A kludge is when you need a facility like this and have to work around the OS to get it, e.g. PartitionMagic, Norton Utilities, etc.

    The logical volume layer is a great thing to work with in normal situations. Mirroring, striping, RAID, backups, and failover all work at this level. To give an example, if you want to do a hot backup of a mirrored filesystem, you can split off one mirror, mount the copy and fsck it, dd it to tape, and then merge the storage back into the mirror, without disturbing the primary FS. That works for oracle instances as well (just substitute some oracle commands for fsck above).
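The split-mirror hot-backup sequence described above can be modeled in a few lines. This is a toy in-memory sketch of the concept only; a real setup would use the volume manager's own commands:

```python
# Toy model of a split-mirror hot backup: detach one mirror copy,
# back up the frozen copy, then resynchronize it with the primary.
# Purely illustrative; not real volume-manager behavior.

class MirroredVolume:
    def __init__(self, blocks):
        self.primary = list(blocks)
        self.mirror = list(blocks)
        self.split = False

    def write(self, idx, value):
        self.primary[idx] = value
        if not self.split:              # mirror tracks primary only while attached
            self.mirror[idx] = value

    def split_mirror(self):
        self.split = True               # freeze the mirror copy for backup

    def backup(self):
        return list(self.mirror)        # e.g. dd the frozen copy to tape

    def resync(self):
        self.mirror = list(self.primary)  # merge the storage back into the mirror
        self.split = False

vol = MirroredVolume(["a", "b", "c"])
vol.split_mirror()
vol.write(0, "A")            # primary keeps serving writes during backup
snapshot = vol.backup()      # consistent point-in-time copy
vol.resync()
print(snapshot)              # ['a', 'b', 'c']
print(vol.mirror)            # ['A', 'b', 'c']
```

The key property is that the primary filesystem is never disturbed while the backup runs.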

  • You're joking, right? Have you actually gone to the CEO of any company and told him/her that it's time to shut down marketing? "They can just take the next 2 months off while we re-engineer the back end systems."

    You'll be lucky if you're not fired outright. And if they listen to you, they are even crazier than you :-)

    Maybe that would be the punishment for marketing incorrectly projecting demand. The only problem is that marketing tends to have the CEO's ear more than IT. Therefore, IT gets blamed for only meeting marketing's projections. But that is showing my own prejudices. :-)

    I think one of the real problems is that a lot of companies are not doing the research that will give them good capacity projections. I believe it was Schwab in the article that said they went from quarterly analysis of capacity vs. demand to every couple of weeks, plus they keep plenty of excess capacity, which is required since demand can spike in days if not hours, while adding capacity probably has lead times measured in weeks.

  • Technicians traced the main cause of the outage to a problem with the Sun Solaris operating system, which overwrote files and corrupted the Oracle database. The database, Version, recognized a data block in an incorrect format, and that caused the main hardware--a Sun E10000 server--to crash. The problem is fixed now.

    I am hardly an Oracle (or Sun, for that matter) expert, but I thought Oracle used its own filesystem?

    Also, note that Microsoft's view on the matter [] is nowhere near the actual cause of the problem. It's as if Microsoft was keeping tabs on this Oracle/Sun combo and decided to come forward with their "competitive analysis" when the time was right. Looks like they had some "Halloween" documents on Oracle/Sun too... ;-)

  • Every job I've ever gone into was a mess. When I first took the job I have now, I was pulling my hair out, this place was so bad. I mean, everything here was wrong: the users' habits, the way they did things, the software they used; everything was just fucked. Now, a year later, things have started to calm down and I find myself with a LOT of free time to implement back-burner projects.

    A sysadmin should never look idle. There are always things to do, things to improve.

    The reason I stayed was that I have so much control over things, and people will listen to me.
    And as clueless as my users are, I still like most of them.

    Unfortunately the slashdot/linux today conspiracy - lately - is really hurting my productivity. Urgh ... back to the books ...
  • I totally agree! I've been preaching this for a couple of years and am implementing a very cool, full-featured middleware solution now.

    I read a description of what eBay had a few months ago and was shocked at the predictable crash they were heading toward.

    The thing is you can't easily patch a monolithic system to run on loose clusters with replication and redundancy. It will appear much more attractive to continue down the monolithic road and add hot-spares.

    Few people seem to get what it takes to build truly scalable and reliable systems.

  • The entire Dell "Shopping Cart" idea *IS* dynamic. Change an option, click a button, new price on same .asp.

    I've done some ASP work and it's pretty memory intensive. Kudos to Dell for making it work -- it's slow as molasses sometimes, but it's never been down in my experience.
  • But seriously... planning, planning, planning. Don't run everything on one box, keep backups, have backup plans, etc., etc. If you don't do these things, your site is bound to have problems.
  • I think they're developing something that runs on top of a Solaris core but uses its own file system. That's probably where the confusion is.

    Under normal operation, though, Oracle allocates a specific amount of disk space for its operations, and manages the space itself. So you could say that it uses its own file system layered on top of the host OS (Solaris in this case). This is distinct from, say, mySQL, which uses files in the regular Unix file system as tables.

    I'm not an Oracle expert either, so you can take this with a grain of salt, but I believe this is how it works.


  • There isn't enough prevention in the world to stop a hardware failure or natural disaster.

    Probably the best approach to mirroring is loose consistency. Have a daemon running in the background that pops up once in a while and checks whether usage is below a certain point; if it is, it starts updating the secondary system. This is better than strict consistency, which requires that every update happen on both systems before the transaction can continue. In the event of a failure, strict consistency gives you a bit more data reliability, but it greatly reduces availability, because both systems must be working in order to get any real work done.
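A minimal sketch of the loose-consistency scheme just described: updates complete against the primary immediately and are queued for the secondary, and a background pass drains the queue only when measured load is low. Class names and the threshold are made-up assumptions:

```python
# Loose-consistency mirroring sketch: transactions commit to the
# primary right away; a background "daemon" replicates queued
# updates to the secondary only when the system is quiet.

from collections import deque

class LooseMirror:
    def __init__(self, load_threshold=0.5):
        self.primary = {}
        self.secondary = {}
        self.pending = deque()
        self.load_threshold = load_threshold

    def update(self, key, value):
        self.primary[key] = value         # transaction completes immediately
        self.pending.append((key, value))

    def background_pass(self, current_load):
        # Only replicate when usage is below the threshold.
        if current_load >= self.load_threshold:
            return
        while self.pending:
            key, value = self.pending.popleft()
            self.secondary[key] = value

m = LooseMirror()
m.update("bid", 100)
m.background_pass(current_load=0.9)   # busy: secondary lags behind
print(m.secondary)                    # {}
m.background_pass(current_load=0.1)   # quiet: secondary catches up
print(m.secondary)                    # {'bid': 100}
```

The trade-off is visible in the sketch: during busy periods the secondary is stale, which is exactly the data-reliability cost loose consistency accepts in exchange for availability.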
  • The difference is much bigger than just sheer numbers. Let's give Dell the benefit of the doubt and say that they have twice as many browsers pointed at them. They serve up mostly static pages. The parts that are dynamic are not very dynamic. Now look at eBay. They have a site that is nearly 100% dynamic, what with the very nature of their business. Add in the personalized things that users can set up and you have a system that is working several orders of magnitude harder than a site with heavier traffic but static content. Now, I don't think that Dell has anywhere near the number of hits. eBay junkies (my parents are antique dealers) check the site almost constantly throughout the evening. Once you've seen a Dell laptop, you know what it comes with. eBay customers expect a constant change in the site and thus check every few minutes as auctions that they are interested in draw near a close. eBay is easily a more taxed site.

  • A friend of mine (perhaps you, AC?) made this exact point to a school board somewhere in the midwest (location obscured to protect my friend) which was grossly underpaying their sysadmin for this very reason, even though the guy was sharp as a tack on the tech side. Apparently their perception was precisely what you're describing -- he even used the same verbiage as you to try and explain to the board what it was they were doing wrong, and how fucked the entire school district would be if this guy ever discovered his true market value and found a job that would pay him accordingly.
  • Well, there's a big difference when the customer base is in the hundreds of thousands. If the cumulative effect of one day's aggravation is a million customer-days of aggravation -- you've got a problem.

    Also, you're thinking that the only people on eBay are like some people casually wandering into an antiques store with grandpa's wardrobe cabinet. Nope, often the people on eBay are the antiques dealers, and you'd be surprised how many people already make their living off of selling stuff on eBay.

    It's like my current ISP's authentication problems. Yeah, I can log on 70% of the time on the first try; but the cumulative effect of a 30% first-try failure rate is that they get fired. Net inconvenience to me, measured in perhaps a couple of hours after several weeks of this -- but not tolerable.
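The cumulative effect is easy to quantify. Assuming login attempts are independent (an assumption, not something stated above), a 30% per-attempt failure rate means the chance of still being locked out after n tries is 0.3^n:

```python
# Probability a login has *still* failed after n independent
# attempts, given a 30% per-attempt failure rate.
fail = 0.30
for n in range(1, 5):
    print(n, round(fail ** n, 4))
# 1 0.3
# 2 0.09
# 3 0.027
# 4 0.0081
```

So almost everyone gets in within a few tries; the intolerable part is the aggregate of retries and timeouts across every session, week after week.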
  • Anonymous Coward wrote:

    > woah, hemos you have to explain your thinking
    > on this one. i cant even come close to finding
    > anything with a neo-luddite feel to it in this
    > article.

    I think what he's talking about is the angle the author is taking: "Could this be the death of ..."

    Answer: No, it won't. Next?

  • You're joking, right? Have you actually gone to the CEO of any company and told him/her that it's time to shut down marketing? "They can just take the next 2 months off while we re-engineer the back end systems."

    And you seem to forget one thing: it is not a business decision what the capacity of the equipment is. If the equipment can support X users at once, the business types have precisely three choices:

    1. Limit the usage to no more than X users.
    2. Buy more/larger equipment to handle a larger number of users
    3. Have the number of users reduced to 0 when the equipment dies under an overload.

    And option 2 takes time, time to get the equipment in, time to configure and test it, and time to roll it into production cleanly. If the CEO doesn't like it, I'm sorry but that doesn't change reality.

  • Cut costs - and you get what you pay for. I had a little bit of a think about this and was looking at some new software that came across my desk this morning. This is not Microsoft bloatware, but IBM bloatware... Just happened to start taking a look at DB/2 for NT.

    Code has become bloated... I remember when I was in development, we had to fit our software on a low density floppy or two, since most of our users would not have HD floppies (Europe was a major factor in this decision) and more than two floppies would raise the Cost of Goods.

    Appears to me that a lot of programmers, webmasters and networking people have forgotten how to optimise their crap.

    I remember a LARGE bank in Malaysia running their servers on DOS(!), doing transactions at the rate of a couple of thousand a day. Where have we lost our ability to optimise code, data and our thoughts?

  • This is all very interesting to read about. One would think that the Internet is generating "unheard-of amounts" of load on various systems for the first time. Mainframes (IBM, Unisys, Amdahl) have taken much more than this in terms of load or transactions per second. The problem that I see is that people tend to set aside architectures that have worked in the past for new cool things that vendors shove down their throats. CICS, or for that matter virtually any transaction-intensive database on a proper mainframe (some of my customers are doing 20-30K transactions per second, and they are in no way "big" users), could handle that load.

    At times, the whole Internet revolution reminds me of the "client server" phase that the industry went through. Ziff Davis was one of the proponents of that phase (well, they had to sell them damn magazines, didn't they?), often claiming that a Novell file server would be damaging to companies like IBM. Well, perhaps it is time to step back and examine how some of the legacy systems have worked (heck... imagine your bank telling you that their systems got overloaded on pay day!?!). Then let's see how we can adapt them to the Internet. IBM is doing an awesome job on this, and so is HP. I strongly believe that the systems we're seeing today are "prototypes" doing proofs of concept, waiting on the big iron boxes to become Internet-enabled.

    One more point. Most of the classic "brick and mortar" businesses, people who know their technology, customers, and systems, are NOT Internet or e-business enabled. Let's drop a few names of the Dow Jones components: Ford, GM, GE, DOW, Coke, etc. They do more business than the e-business startups and probably process more transactions per day on their mainframes. I'd be more concerned about what happens when they start up their Internet "storefronts"... OK, just a few random thoughts before I head into work...
  • All the touchy-feely bullshit on the web. Countless self-absorbed homepages, insipid rantings and more. Electronic Navel Gazing, I'd say. Denis Leary was hilarious, especially that one advert with the kid crying about keeping the net free, and Leary pops up to ask how his mom and dad paid for the computer he was using...
  • I also think it is sickening, but it is a reality of software and Internet engineering. Whenever someone comes up with a nice innovative idea and is able to produce it, someone else out there is going to see how it ought to be better.

    Actually one of the central focuses (foci? :) ) of slashdot is a great example: Windows vs. (!Windows). Windows has brought more computing to the masses but does not fully serve the needs of multiple users. It's progressing, but *nix and others are coming from exactly the opposite direction. So either way you look at it, we have one OS that has some great technology but is lacking major components. Many (most?) on /. assert that it is our right and duty to complain.
    You decide a year or more in advance how much bandwidth you will get. Then you decide how many customers that will support, and you don't allow marketing to sell to any more customers. That's right, you refuse to allow more onto the system. Marketing can deal with this if you make them, and long-term satisfaction will go up.

    You're joking, right? Have you actually gone to the CEO of any company and told him/her that it's time to shut down marketing? "They can just take the next 2 months off while we re-engineer the back end systems."

    You'll be lucky if you're not fired outright. And if they listen to you, they are even crazier than you :-)

  • I think that may be part of the problem with some sites in regard to 24/7/365 uptime. Mostly what I have seen is that Microsoft products tend to work well for the tasks that MICROSOFT specifically thinks you will use them for. Although it is technically feasible to run such a system for heavy-duty services, I would rather choose IBM, for its reliability as a vendor of database tools and support, if I had to choose a proprietary solution. However, Linux or BSD (which has some pretty optimised code for fast net access, from all indications) would be a better idea if someone is there to get it up and running.
  • Hear, hear. Most people don't think about how much traffic there will be and what will actually happen to the computer that is running it. If you have a site that is going to be running financial transactions and authenticating people, as well as taking repeated hits from the same IPs from people bidding, you probably should have something that serves up dynamic content pretty damn fast, as well as making things stable enough. You probably should have some sort of load balancing and failover to alternate systems.
  • I have a friend -- a Senior Systems Administrator -- who, a couple of years ago, described his job as "data janitor". So I fully agree with this post. He also made the point that being good enough that you didn't have to do that much wasn't a Good Thing. He was once fired from a job for just that reason -- "wasn't bringing in billable hours" was the *official reason*. (A week later they had to rush out and hire a new sysadmin, because -- guess what -- they found they needed one! They were *really* stupid! ;-)
  • eBay's outages have a bigger impact than they should because their auction format favors real-time last minute "sniping" over their proxy bidding system. If they would tweak things to encourage less sniping then unavoidable outages (and there will always be some, regardless of how much redundancy they build into their infrastructure) won't have such a devastating impact on their business.

    As it stands now, eBay's auctions are so time-critical that they're in the same league as online brokerages. And speaking of brokerages...

    Fidelity is running TV ads (plastered all over Pirates of Silicon Valley last night) touting the speed of their systems and how seconds count, with a quick disclaimer at the end of the ad that response time depends on network conditions. This is a pet peeve of mine: ads with disclaimers which make the rest of the ad meaningless. Example: "99c Big Macs! That's right, 99 cents! Only 99 cents! Prices may vary." But the point is that they're promoting the idea that the internet is suitable for real-time transactions, even though they recognize that it isn't quite there.
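For reference, the proxy bidding that the comment above contrasts with sniping works roughly like this: each bidder registers a private maximum, and the system bids on their behalf up to that ceiling, so last-second timing matters less. This is a simplified sketch under stated assumptions, not eBay's actual rules (increments, ties, and reserves are all glossed over):

```python
# Simplified proxy auction: the highest maximum wins, paying one
# bid increment over the second-highest maximum (capped at the
# winner's own maximum). Illustrative only.

def proxy_auction(max_bids, increment=1):
    """max_bids: {bidder: maximum_bid}. Returns (winner, price)."""
    ranked = sorted(max_bids.items(), key=lambda kv: kv[1], reverse=True)
    winner, top = ranked[0]
    second = ranked[1][1] if len(ranked) > 1 else 0
    price = min(top, second + increment)
    return winner, price

print(proxy_auction({"alice": 50, "bob": 42}))  # ('alice', 43)
```

Under this scheme a sniper who bids at the last second still loses to anyone whose registered maximum is higher, which is why encouraging proxy bids would soften the impact of an outage at closing time.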

  • You should invest in MBA Technologies; they're a good company and you obviously love MBAs...
  • ...and you don't allow marketing to sell to any more customers. That's right, you refuse to allow more onto the system. Marketing can deal with this if you make them...

    This reminds me of the IBM TV commercial where Bob is at an AA-like meeting... "No one here is stupid"... Then Bob tells them that he forgot to tell his staff to ramp up the website for more hits because of their new PR. Then they all turn on Bob... "That WAS stupid, Bob."
  • That was funny. You got me thinking about other Fox computer-related specials:

    America's Funniest Core Dumps
    When Spammers Attack
    I Married A SysAdmin
    Real Life Reboots
    Totally Shocking Backups -- Caught On Tape

    /* Alright -- quit yer groanin' */
  • Oracle can use either a file or a partition for its datastore. In recent versions of Solaris, Oracle will put the file in a mode which can bypass the VFS layer of the OS to get near raw partition speeds.

    The inside scoop was that, because eBay did not install the latest kernel patch to Solaris 2.5.1, they ran into a bug where a kernel core dump of more than 2GB will piss all over your disks. I suspect they did not have root or swap under Veritas control.

    So when the machine panic'd, it overwrote most of root (the core dump starts from the end of swap back to the beginning, many users have root just before swap on their disk).

    So they not only had to restore Solaris, they had to restore their configuration. Not something that can be done quickly, esp. when the CEO of the company is breathing down your neck. It is also my understanding that the eBay database itself was okay and didn't have any data corruption.
  • Well, not exactly. Let me see if I can remember.

    Oracle does a lot of management of its persistent objects (tuples, clusters, tables, indices, etc.) in storage areas called tablespaces. These can be kept in conventional files or in "raw" disk partitions (raw in the sense that they do not have a file system that the OS can mount). Oracle manages these in a manner similar to an extent-based file system such as ext2. Oracle not only manages mapping program requests to objects in the tablespace, it also has its own very sophisticated caching and journaling capabilities. When you keep your tablespaces in operating system files, you incur the overhead of the operating system's filesystem for little or no benefit. In fact, there are many tuning parameters for table storage (controlling things like the size of initial extents and how additional extents are added) that are undermined by putting the tablespace in an OS file.

    Therefore, if it is available on your platform, you'll want to let Oracle manage space on the raw device. The main reason not to is if you want to manage the data files using operating system facilities, for example moving them from one disk to another, or using operating system backup utilities if yours don't do a good job of backing up large binary streams. If you are seriously interested in high performance, you'll have to go this route.

    If you keep key data in raw partitions, make some good tuning decisions, and do a judicious job of clustering data, you can get astonishing performance out of Oracle. This requires a DBA who can think numerically, understands the applications being run against the db and their users' expectations, understands the most important of the dozens (hundreds?) of tuning parameters, and generally has intellectual qualifications beyond having a body temp of 98.6 F. Not only will you need this person, you'll need to give him time to experiment and ponder results.

    IMO, you can't blame Oracle for this debacle; it did the right thing by recognizing the damage, assuming that one of its components had lost its mind, and shutting them all down. This is even more the case because you can go back to a prior backup and replay all your transaction logs as far forward as you have them, effectively recovering up to the last moment of operation.
  • One would think that the internet is generating "unheard amounts" of loads on various systems for the first time.

    One of the most sensible comments I have seen for a long time. High performance, high transaction rate, high availability systems have been around for a long time. Anyone remember the original Tandem machines? IBM Series/1s? By the mid-80's in Australia, a number of financial institutions were using IBM System/38's for forex and similar stuff - dual systems, mirroring database transactions.

    Sure, all this stuff doesn't come cheap. I helped install a $1,000,000 fallback hot site for a major bus company here - but as their CEO said - if the system is off the air for more than 8 hours, all he could do would be to turn out the lights, and go home - his business would be dead.

    You have to design the system - the total, end to end system - to handle the expected workload, and to provide the reliability your customers expect (and pay for). That also means having the people with the required expertise to implement and manage the system.

    Cut costs - and you get what you pay for.


  • 'Prior Proper Planning Prevents Piss-Poor Performance'.

    If we can teach this to grunts, why do those who are allegedly more intelligent fail -repeatedly- to learn it?

    Ah, well. I recall when Comdisco failed in the attempt they made to show Schwab what was about to happen in a simple email system, too.


  • We plan, backup, build redundant systems, isolate production from testing and implementation, and still every now and then something happens that makes you realize how young all this technology really is, and that bottlenecks still exist.

    I am just coming off a twenty-hour day repairing problems in a production system. Both members of a cluster were affected (by the clustering software itself, of course). In the end we ended up hacking out the best fix available on the fly.

    Dependability is expensive, and that expense is often hard to justify to economy-minded business people. Add to that the fact that even the most secure, stable, and isolated system will eventually break, and it is a recipe for some very long days for those of us who answer the pages when it all falls down.

    Good thing I enjoy this kind of work. Now it's off to a nap, then back to the office to listen to a vendor tell me his next release will address the trouble, and to explain to a few business folks that simply stating a system will be up 24/7 doesn't make it so.

  • I think that the major reason that internet-enabled systems have problems that the legacy systems don't is the point that bluGill made above about not overselling your capacity.

    If you have a mainframe with 3270 terminals, you know exactly how many users you will handle. Adding more users involves a definite decision, and you can decide what upgrades you need in order to handle those extra users.

    With an internet application, there isn't such a direct link between load and the number of users you can support. HTTP is a very variable protocol: perhaps one page requires 3 HTTP accesses, all of which can be served in under 1/100 of a second, while another page requires 50 accesses, which can take up to 3 seconds. Together with the growth of any internet application, this makes capacity planning very difficult.
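    One way to see why this variability wrecks capacity planning is a back-of-envelope estimate. The page profiles and server numbers below are invented for illustration, not taken from any real site:

    ```python
    # Back-of-envelope capacity estimate with invented numbers: pages differ
    # wildly in how many HTTP requests they trigger and how long each takes.
    PAGE_PROFILES = {
        # page type: (HTTP requests per page view, server seconds per request)
        "light": (3, 0.01),
        "heavy": (50, 0.06),
    }
    SERVER_SECONDS_PER_SEC = 8.0  # e.g. 8 worker processes running in parallel

    def views_per_second(mix):
        """mix maps page type -> fraction of traffic; returns sustainable views/sec."""
        cost = sum(frac * reqs * secs
                   for ptype, frac in mix.items()
                   for reqs, secs in [PAGE_PROFILES[ptype]])
        return SERVER_SECONDS_PER_SEC / cost

    # The same hardware supports very different user counts as the mix shifts:
    mostly_light = views_per_second({"light": 0.9, "heavy": 0.1})
    mostly_heavy = views_per_second({"light": 0.1, "heavy": 0.9})
    ```

    With these made-up numbers the same box handles roughly eight times more page views when traffic skews light than when it skews heavy, which is exactly why a 3270-style "count the terminals" plan doesn't carry over.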

    I'm sure that capacity planning will become better in the future, but for now, some growing pains are inevitable.

  • Both.
    On Unix machines you have the _option_ to have Oracle use its own filesystem.
  • I would put this more forcefully. If you are given specific performance targets by marketing, you meet them, and then marketing turns around and says that it's your problem that they dropped the ball, then the company has exactly one of two options:
    • Punish marketing, or not, for making poor predictions, or
    • Punish IT... and immediately tell HR to work on replacing the entire IT staff since no competent IT person will work under these conditions (at least, not in the current job market).

    Some people might suggest that IT should routinely overbuild, but that has its own problems. If I'm the CEO of a company that decided to purchase X capacity for $Y, then I learn that IT bought 2X capacity for $Y, my first and only question will be why the outgoing head of IT didn't buy X capacity and return $Y/2 for reallocation. Maybe I would have used it to buy the excess capacity... or maybe I would have used it to shore up a specific project enough to land that lucrative contract that just barely got away.
  • On the note of financial-based website servers... the Apple Store is actually a cluster of 6 G3/400s running Mac OS X Server 1.0 and WebObjects. It has only had one problem, and that had to do with the credit card system; the servers kept running 24/7.

    Oh for those of you looking for un-breakable solutions, NeXT made something called

    It keeps track of the server load on each machine (it runs on every machine) and routes incoming traffic to the least-used machine.

    it works like this

    Machine 1 has a load of 50%, machine 2 is at 75%, machine 3 at 90%. What it does is tell Machine 1 to reply to requests first, until the loads balance out; until then, all the other machines just work on their current traffic/transactions.

    Really cool -eh?
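    The dispatch rule the parent describes can be sketched in a few lines. The machine names, load figures, and per-request cost below are invented for illustration; a real balancer would poll live load metrics rather than a dict:

    ```python
    # Minimal sketch of least-loaded dispatch: every request goes to
    # whichever machine currently reports the lowest load.
    loads = {"machine1": 0.50, "machine2": 0.75, "machine3": 0.90}

    def dispatch(loads, request_cost=0.05):
        """Send the next request to the least-loaded machine and charge it."""
        target = min(loads, key=loads.get)
        loads[target] += request_cost
        return target

    # machine1 absorbs the new traffic until the loads even out:
    assigned = [dispatch(loads) for _ in range(6)]
    ```

    After six requests machine1 has climbed from 50% to 80% while the others stayed put, which matches the behavior described above: the idle machine soaks up traffic until the cluster levels off.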

  • One other thing I stumbled upon

    I believe the Apple Store servers (the 6 G3s) don't hold the database on them at all. It is probably held on two to three other redundant systems (WebObjects supports remote databases and database mirroring on two different machines). Apple probably has it set up so that all but 2 computers (one webserver and one database server) can crash and burn, but the webstore keeps running. This also allows individual machines to be swapped in and out for repairs.

    From what I have seen, having redundant processors/power supplies/HDs is still a problem if you have to do a software patch, but having redundant machines gives more bang for your buck, especially since you can run any one of them independently.

  • Laird! A long time ago when I was so innocent, I had a site on Geocitie$ and the server crashes were horrible. I'm surprised they didn't lose more people. The server ate my files, re-wrote permissions, and never mind that it was slow then... it still IS slow... I think everything they do on the site now is to get people to buy faster computers by getting them to think that theirs is slow...

    OTOH, I'm not fazed by ebay's problem. I'll never hand over my CC to Amazon, and half the stuff I look for is uniquely weird - the only kind of things you can get on ebay. The crabbers are probably newbies - geez, live with it, it happens, you know? If they ever used Lynx they'd realize how far browsing on the net has come!

  • I know what you mean! I loathe most net advertising, pop ups included, and everyone who insists on being an "associate" (read: cheap labour for big companies).

    I remember seeing ads for Lotus that went like "the net is screaming for capitalists" or something... I suppose what hurt is that one ad had the line "short stories that nobody reads". That seemed awfully crass. The net thrives on its humanity. Take out the humanity and what have you got? The soulless machine that writers have been warning us about for ages (I just reread Fahrenheit 451 - so hard to believe it was written in the 50s!)

    It hurts me that everything is done for eyeballs and money. Newbies will never realize how great it was to surf. They aren't wary of schemes that focus on them as a psychographic and target markets. They think, "Oh how nice, they want to give me free webspace". They don't think, "Gee, I don't like having my page cluttered with ads that I don't agree with."

    Ebay is pretty cool. I snagged a lot of good books dirt cheap there. But it is getting harder. Before I could make a bid a few days before and still win. Now I have to wait until the last few seconds to swoop down on everyone else :-)

  • Even if eBay had known that patches were available, it often doesn't pay to install them without some rigorous testing. I work in a fairly large database shop, and we are very reluctant to install patches unless there is a specific reason. On multiple occasions we've installed patches only to find out some undocumented quirk that shows up only in bizarre circumstances (be it a kernel patch that isn't supported in an E10K domain or a database patch that breaks an application). To solve these types of issues, you can't rely on JUST a failover system. You need very extensive testing and development systems. Even then, you may not find a bug until after you've gone to production. This isn't to say that eBay's outage wasn't preventable. I'm simply stating that just because a patch exists doesn't mean that you should always install it.
  • You want to keep in mind that Dell is using UNIX servers on the backend. It's my understanding that they have several Sun E10K's and IBM S70s. (All in Dell black and grey, of course)
  • You forgot Honey, I Shrunk The Budget.

    Fourth law of programming: Anything that can go wrong will go wrong.

  • Wrote anonymous on Monday:
    > Seriously though 24/7 can be done with
    > present day technology. The phone system
    > comes to mind.

    On Thursday/Friday two Swedish magazines carried a story about the startup "Bluetail", a spinoff from Ericsson. These people have a telecom background, and their "Mail Robustifier" product is just out in release 1.0. Written in the Erlang programming language used by Ericsson in telephone exchanges, it does load sharing between "mail servers" (I'm not sure whether this means SMTP or POP3/IMAP) and promises 99.999% uptime. The targeted market is large or medium scale ISPs.

    With more problems like eBay's we should see more telecom people moving over to doing web-related products. Either telecom companies will change their business or there will be spinoffs like Bluetail.
  • Having worked in a number of environments (small startups, educational, and corporate), there are some things I have noticed that people do not seem to realize:

    1) The technology is a lot more fragile than the marketroids would have you believe and the engineers want to believe.

    2) Complexity seems to increase as 2 to the nth power, where n is the number of components. E.g., 2 servers have 4 ways they can interact, while 3 would have 8 ways to interact (not counting the fact that there may be many software interdependencies on and between machines).

    3) Planning is key, and the central tenet should always be KISS (keep it simple and stupid -or- keep it simple, stupid!).

    4) The larger the environment, the more it has a life of its own.

    5) The larger the environment the more crucial communication becomes.

    6) #5 above can lead to information overload.

    7) There is no substitute for an intelligent, well-trained, well-led staff. All the certification programs and fancy admin tools cannot substitute for that.....

    My $.02
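    Point 2 above can be made concrete with a toy calculation. The functions below are illustrative, counting pairwise component links (which grow quadratically) alongside the possible up/down states of the whole system (which grow as 2**n), either of which supports the poster's point that every added component multiplies what can go wrong:

    ```python
    # Two ways to count how complexity grows with n components.
    def pairwise_links(n):
        """Number of distinct component-to-component links."""
        return n * (n - 1) // 2

    def joint_states(n):
        """Number of possible up/down combinations across n components."""
        return 2 ** n

    growth = [(n, pairwise_links(n), joint_states(n)) for n in (2, 3, 5, 10)]
    # By n=10 there are already 45 links and 1024 joint states to reason about.
    ```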

  • Which is larger:

    1. The number of people buying new $1000+ Dell computers.
    2. The number of people wanting to bid on collectibles at eBay.

    They're hardly in the same class.
  • What about the people who only look? I bet Dell gets a lot of window shoppers. More than eBay, maybe.
  • Ok, so eBay goes down and everyone gets all up in arms about it because this stuff is not 100% reliable.

    Hey, we knew that. Even the best systems out there are expected to be down a few minutes a year, and most of 'em (including those "super-reliable" Suns) are on the order of a couple of *days* a year. Throw a relational database into the equation and, well, reliability ain't so hot.

    There are ways to deal with that, and eBay didn't do ANY of them.

    At a minimum they should have had a hot backup available, PARTICULARLY for the single point of failure -- the database. With a hot backup they could have been back online in a matter of a couple of minutes. It was insane to bet their business on a single Sun/Oracle box! Whoever made that decision should be out on the street.

    But they can do a lot better than that with a little middleware infrastructure. There's no reason they can't replicate transactions to multiple databases -- or even split their databases up so they have lots of little ones handling part of the load rather than One Big Server.

    Of course that will take some technology that is a bit beyond the duct-tape-and-baling-wire stuff they're using. It's not rocket science, but it's gonna be a bitch to do with CGI.

    What it all comes down to is that they bet on an infrastructure design that had a single point of failure and were screwed when it failed. That could have been -- SHOULD have been -- foreseen and protected against.

    I could maybe see that being OK in a startup that didn't have the cash for duplicate hardware outlay, but eBay has the cash in spades and they STILL didn't do it. There's a certain level of stupidity at work here.
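    The two ideas in this post, a hot standby and "lots of little databases", can be sketched together. The shard count, item ids, and dict-based stand-in "databases" below are purely illustrative and not anyone's actual design:

    ```python
    # Hedged sketch: hash each item id to one of several shards, and write
    # every record to both a primary and a hot standby, so no single box
    # is a single point of failure.
    NUM_SHARDS = 4

    def shard_for(item_id):
        """The same item always maps to the same shard."""
        return item_id % NUM_SHARDS

    def write(stores, item_id, record):
        """stores[s] is a (primary, standby) pair of dicts standing in for DBs."""
        primary, standby = stores[shard_for(item_id)]
        primary[item_id] = record
        standby[item_id] = record  # the standby can take over in minutes, not days

    stores = [({}, {}) for _ in range(NUM_SHARDS)]
    write(stores, 1001, "antique clock, current bid $45")
    ```

    Losing any one shard's primary then takes down only a fraction of the auctions, and its standby already holds the same data, which is exactly the failure mode eBay's One Big Server design couldn't survive.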
  • I don't see why people are so surprised with this. The Internet still is not regarded as "acceptable" for real-time work, so why are people so affected by system faults?

    "Planning planning planning."
    Planning is the key. On personal systems, there are UPS devices, floppy drives, RAID configurations (well, maybe not so often on a PC), Zip drives, CD-Rs... all sorts of media to circumvent the loss of data. Just because a system is online or owned by a large group is no reason to assume it is secure.
  • by fete ( 61267 )
    Unfortunately, eBay's outage doesn't seem to have been preventable.
    I would settle for "For Auction" posts being banned from Usenet "For Sale" newsgroups, though.
    I am NOT interested in participating in an auction when I browse my favorite 'for sale' groups for deals on tech hardware (i.e.
