Why eCommerce Sites collapse
Rahul Mehra writes "ZDNet has an interesting article about how eBay and other e-commerce sites collapse under heavy loads. It talks about how massive growth, incomplete planning, rising expectations (24x7 uptimes) and immature technology all contribute. " This train of thought, for me at least, leads to a neo-Luddite question - what do you folks think?
Interesting how NT seems to be less of an issue .. (Score:1)
The Schwab issue is clearer than they are making it out to be -- I have some knowledge of this back two years and Schwab has some complete idiots running things, still, even after a series of disasters (some of which didn't make the news). I really can't explain it in any way other than people who have gotten MBAs seem to only trust other people with MBAs, no matter how poorly they perform. I set up my account there as soon as I could, but after hearing really unpleasant stories for a few years, I finally went to Fidelity.
S/390s aren't unreliable, and Parallel Sysplex stuff works well dynamically, but if a) you are basing your maintenance window and procedures on a salesman's promises to an MBA and b) you aren't keeping the better mainframers because of pay and poor treatment, you will have problems.
Similarly, Cisco routers aren't unreliable. Encryption makes a 30% performance hit. If you are about to be swamped by transactions (if the market is tanking, for instance), then turning the encryption off is a command decision that you "get paid the big bucks for." Not doing so and having systems choke is not a problem with Cisco, anymore than undersizing the systems is a technical problem.
I am relatively confident in Schwab -- I would be confident enough to keep my money there if they would take my money as seriously as I do and spend less on MBAs in technical positions and more on technical people in technical positions.
And no, to the best of my knowledge (I have several funds and they have positions in everyone out there), I own no Schwab or Fidelity stock.
Sysadmins are janitors... (Score:4)
People should *deal* (Score:1)
It's really sickening to hear that people can't get a grip on how far technology has come, and expect it to be way farther than it is.
Don't forget the sysadmins (Score:2)
Re:Interesting how NT seems to be less of an issue (Score:1)
>> decision that you "get paid the
>> big bucks for."
Gee, if I were a hacker, I'd *never ever* wait until a big event (eg. market goes to hell) to start dumping to disk if I had managed to hack into a decent-sized ISP (or worked there and was a pissy sort of person). The prestige for showing an online broker to be vulnerable has to be pretty significant, especially if you moonlight as a "security consultant" or whatever.
Maybe I'm just a wuss, but it seems like
s/get paid/get fined/g;
is a distinct possibility if the ruse is uncovered. (It's also a tacky thing to do)
I suppose that with those sorts of loads, you could make a case for it being statistically infeasible to pull any real information out without a huge amount of disk space to dump the packets onto and a lot of time to pore through them... but people don't change their passwords very often, and you could probably assemble useful information in a reasonable amount of time. And a decent lawyer should have no trouble spooking a jury into overreacting if a trial came to pass.
Either way I submit that the magnitude of the negative publicity that would ensue would make such a decision very hard to justify.
Why not colocate at, say, Above.Net, and rely on their monster pipes for the big loads? It's not like it would cost that much more, and you rely on an extremely high caliber of technical staff to keep things running.
>> Encryption makes a 30% performance hit
In my experience you're off by almost an order of magnitude, in terms of CPU load. If you're only talking about packet throughput, then yeah, the handshake, key exchange, and renegotiation every few minutes adds about 30%. It seems like CPU power is usually the bottleneck in doing SSL transactions on big fat pipes, though.
>> people who have gotten MBAs seem to only trust
>> other people with MBAs
I had the misfortune of working at one of the top business schools in the country for about a year, and this is what I perceived: MBAs without a physical science or engineering background are categorically inept at technical decisions, no matter how much they think they have learned by reading InfoWorld. Negotiation is an MBA's strong point; following through is Someone Else's Job, as best as I could make out. So why don't they recognize that they are likely to make more money (enough to offset the cost) if they hire the best (and most expensive) technical staff? Beats me...
'Cause otherwise you're presenting an opening for someone else to gain publicity as Those Guys That Suck Less (tm) and steal your mindshare and profits. That can't possibly be lost on MBAs. (can it?)
24x7 is easy, just do it right the first time (Score:2)
Doing it right the first time of course isn't that easy, but once it works, don't break it. It must be possible to run a 24x7 site for an entire year while stuck on Gilligan's Island without any way to contact the rest of the world, including the site.
NASA has computers in buildings where deadly chemicals are around (a few seconds from a small leak and everyone in the building is dead!). Do you think that their IS wants to touch the computers? Not unless they first send everyone else home and empty those tanks. If they weren't so heavy they would probably insist on space suits too.
You decide a year or more in advance how much bandwidth you will get. Then decide how many customers that will support, and you don't allow marketing to sell to any more customers. That's right, you refuse to allow more onto the system. Marketing can deal with this if you make them, and long-term satisfaction will go up.
Once you know how much bandwidth you will have, you make sure you have computers that can deal with it. Mainframes have been doing 24x7 for years. Unix is very close to matching that (with Sun's redundant hot-swappable systems perhaps better, not that Sun is the only choice). I have seen triple-redundant systems with a polling mechanism where if one computer gives a different result it is shut off. Guess what: none of this is cheap. That's right, doing business on the internet in volume isn't cheap. Spend the money on systems that will stay up, and enough power that you don't run out, and you will run 24x7. There are plenty of companies that make equipment that is meant for this use.
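The "polling mechanism" above is essentially majority voting. Here is a minimal sketch of the idea, not any vendor's actual implementation; the function name vote and the list-of-results interface are made up for illustration:

```python
from collections import Counter

def vote(results):
    """Return (majority_value, dissenting_indexes) for a list of replica results.

    Hypothetical helper: each entry in `results` is the answer one of the
    redundant computers produced for the same request.
    """
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        # No strict majority, so we cannot tell which machine is wrong.
        raise ValueError("no majority among replicas")
    dissenters = [i for i, r in enumerate(results) if r != value]
    return value, dissenters

# Three machines answer; the third disagrees and would be shut off.
answer, to_shut_off = vote([42, 42, 41])
```

With three machines, one bad result is outvoted and identified; with only two you can detect a disagreement but not decide who is right, which is presumably why such systems are built in threes.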
Last, and foremost: hire system administrators that have proven they can keep the systems running 24x7, and pay them to do so. These people are older, in their 50s or so. Hire the best of the experienced, and then give them a deal: you pay them to keep the systems up, whether they are there or not. They should soon find a paycheck arriving every two weeks, with only a few hours a month of work.
Remember, design the system so you can run it from Gilligan's Island (no access by you) without your boss realizing, and you will do fine.
Of course reality is that you do have to replace crashed hard drives, but with RAID-6 (RAID-5 plus more redundancy; RAID-6 isn't officially defined) that can be done at any time. You do need to buy more backup tapes once in a while, but automated backups are the norm in 24x7 environments.
Thinking like Start-ups (Score:1)
I tend to agree with the theory that many of these companies still think like start-ups; they act like they don't have any money to spend! Perhaps they're just not aware of where their money is best spent. I can't say I know the start-up web content business mentality to its very ends, but when money is tight you start betting against catastrophe, and hope your odds are good. Duplicate server hardware is expensive for a small shop, but when you have billions of dollars in revenue, and your _entire_ business relies on your information infrastructure, the least you should do is build a duplicate server farm right down to the cables on the power supplies.
Yeah, you'll blow a million dollars on it, and you might not need it, but the maintenance costs are lower than the cost of losing your auction site, on-line trading service, bank, or retail market for five days.
You co-locate services at multiple network access points. You use reliable software--the kind you have source code to, so you're not on the phone at midnight with a "knowledge engineer" across the country who is trained in taking bug reports. You need to fix the problem so you hire people who can.
You spread the load at all points (you have multiple web servers, multiple database servers, multiple administration access points, redundant networking hardware), and you always have ample staff around for that 4:00 AM breakage.
Is that anything like... (Score:1)
Using age as a disqualifier?
Someone who's 50 has a chance to have 30+ years experience in the field. Let's see you, hot shot 25 year old, have 30 years experience.
opps (Score:1)
Other bits to read (Score:2)
The first link basically says that the eBay guys weren't paranoid enough about making sure the setup was reliable. This is always a problem. (hey, I'm working on a commercial web site that only got a proper sys-admin 2 years after it started...). Little side-note - one guy says Sun's clustering stuff is not that great... I know Sun have been a bit late in starting doing clustering stuff, but I've also heard that what they have done is pretty good, *shrug*. Actually, they just announced version 3 last week, which also allows clustering of 16 Starfires, for 1024 processors. (they're also making the source code for this available...)
Re:opps (Score:1)
Posted by the Proteus
Microsoft FUD Refutation (Score:1)
Now we learn that the problem was caused by Ebay, and Ebay alone, by not keeping up on their vendor patches, and that Sun had fixed this particular bug quite some time earlier.
It would seem that MS needs to print a retraction. Any bets on when we'll see it?
Re:People should *deal* (Score:1)
No... for example, some time ago I used to go to Fatbrain when I wanted to buy a book. But I frequently found that Fatbrain was down. On the other hand, I cannot remember having problems with amazon.com, ever.
So when I want to buy a book, and my cursor is in the URL field of my Netscape, which URL will I type? Fatbrain, even though I know that they might be down again, and I will have to wait 30 seconds patiently until I get the timeout? Or will I go to amazon.com and buy the book immediately?
Well, I use the shop that uses Unix, and not the NT shop...
Re:No problem with that :-) (Score:1)
Re:How would you like to crash today? (Score:1)
Failure is good! (Score:1)
Providing quality service probably means that you're doing something wrong, just like making a profit does.
Second thoughts... (Score:1)
Re:From an ISP perspective (Score:1)
We let pilots fly the airplanes; we let chefs cook the dinner; but we cannot let technical experts exert technical expertise. Sometimes it's scary.
Re:Didn't Oracle use its own FS? (Score:1)
Apparently Solaris screwed up this arrangement and wrote some blocks in Oracle's space. It's odd that Oracle was then able to crash the OS - the only reason I can think of is that Solaris put something really critical in those blocks, and Oracle overwrote them for some reason while it was aborting.
Re:Scary design philosophy (Score:1)
The logical volume layer is a great thing to work with in normal situations. Mirroring, striping, RAID, backups, and failover all work at this level. To give an example, if you want to do a hot backup of a mirrored filesystem, you can split off one mirror, mount the copy and fsck it, dd it to tape, and then merge the storage back into the mirror, without disturbing the primary FS. That works for oracle instances as well (just substitute some oracle commands for fsck above).
Re:24x7 is easy, just do it right the first time (Score:1)
You'll be lucky if you're not fired outright. And if they listen to you, they are even crazier than you
Maybe that would be the punishment for marketing incorrectly projecting demand. The only problem is that marketing tends to have the CEO's ear more than IT. Therefore, IT gets blamed for only meeting marketing's projections. But that is showing my own prejudices.
I think one of the real problems is a lot of companies are not doing the research that will give them good capacity projections. I believe it was Schwab in the article that said they went from quarterly analysis of capacity vs demand to every couple of weeks, plus they have plenty of excess capacity, which is required since demand can spike in days if not hours, while adding capacity probably has lead times measured in weeks.
Dastardly
Didn't Oracle use its own FS? (Score:1)
I am hardly an Oracle (or Sun, for that matter) expert, but I thought Oracle used its own filesystem?
Also, note that Microsoft's view on the matter [microsoft.com] is nowhere near the actual cause of the problem. It's as if Microsoft was keeping tabs on this Oracle/Sun combo and decided to come forward with their "competitive analysis" when the time was right. Looks like they had some "Halloween" documents on Oracle/Sun too... ;-)
Re:alot of us inherit the horrible work of others (Score:1)
A sysadmin should never look idle. There are always things to do, things to improve.
The reason I stayed was that I have so much control over things, and people will listen to me.
And as clueless as my users are, I still like most of them.
Unfortunately the slashdot/linux today conspiracy - lately - is really hurting my productivity. Urgh
Re:Why did eBay collapse? Seriously bad design (Score:1)
I read a description of what eBay had a few months ago and was shocked at the predictable crash they were heading toward.
The thing is you can't easily patch a monolithic system to run on loose clusters with replication and redundancy. It will appear much more attractive to continue down the monolithic road and add hot-spares.
Few people seem to get what it takes to build truly scalable and reliable systems.
sdw
Re: Dell (Score:1)
I've done some ASP work and it's pretty memory intensive. Kudos to Dell for making it work -- it's slow as molasses sometimes, but it's never been down in my experience.
How would you like to crash today? (Score:1)
Re:Didn't Oracle use its own FS? (Score:1)
Under normal operation, though, Oracle allocates a specific amount of disk space for its operations, and manages the space itself. So you could say that it uses its own file system layered on top of the host OS (Solaris in this case). This is distinct from, say, mySQL, which uses files in the regular Unix file system as tables.
I'm not an Oracle expert either, so you can take this with a grain of salt, but I believe this is how it works.
D
----
Re:Thinking like Start-ups (Score:1)
Probably the best approach to mirroring is loose consistency. Have a daemon running in the background that will pop up once in a while and check to see if usage is below a certain point. If it is, then start updating the secondary system. This method is better than strict consistency, which requires that all updates happen to both systems at once before the transaction can continue. In the event of a failure, strict consistency gives you a bit more data reliability than loose consistency, but it greatly reduces availability because both systems must be working in order to get any real work done.
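To make the trade-off concrete, here is a rough in-memory sketch of the two approaches. Everything in it (Replica, MirroredStore, the idle_threshold parameter, the load value passed to sync_if_idle) is invented for illustration; a real daemon would measure load and ship updates over the network:

```python
class Replica:
    """One of the two mirrored systems (illustrative stand-in for a server)."""
    def __init__(self):
        self.data = {}
        self.up = True

    def apply(self, key, value):
        if not self.up:
            raise ConnectionError("replica is down")
        self.data[key] = value


class MirroredStore:
    def __init__(self, idle_threshold=0.5):
        self.primary = Replica()
        self.secondary = Replica()
        self.pending = []  # updates not yet copied to the secondary
        self.idle_threshold = idle_threshold

    def write_strict(self, key, value):
        # Strict consistency: refuse the transaction unless both systems are up.
        if not (self.primary.up and self.secondary.up):
            raise ConnectionError("strict write needs both systems up")
        self.primary.apply(key, value)
        self.secondary.apply(key, value)

    def write_loose(self, key, value):
        # Loose consistency: commit on the primary now, mirror later.
        self.primary.apply(key, value)
        self.pending.append((key, value))

    def sync_if_idle(self, current_load):
        # What the background daemon would do each time it wakes up.
        if current_load >= self.idle_threshold or not self.secondary.up:
            return 0
        copied = len(self.pending)
        for key, value in self.pending:
            self.secondary.apply(key, value)
        self.pending.clear()
        return copied
```

With the secondary down, write_loose still succeeds while write_strict raises; the price is that anything still sitting in pending is lost if the primary dies before the next sync.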
Re: Dell (Score:1)
neutrino
Re:Sysadmins are janitors... (Score:1)
Re:People should *deal* (Score:1)
Also, you're thinking that the only people on eBay are like some people casually wandering into an antiques store with grandpa's wardrobe cabinet. Nope, often the people on eBay are the antiques dealers, and you'd be surprised how many people already make their living off of selling stuff on eBay.
It's like my current ISP's authentication problems. Yeah, I can log on 70% of the time on the first try; but the cumulative effect of a 30% first-try failure rate is that they get fired. Net inconvenience to me, measured in perhaps a couple of hours after several weeks of this -- but not tolerable.
Re:neo-Luddite? (Score:1)
> woah, hemos you have to explain your thinking
> on this one. i cant even come close to finding
> anything with a neo-luddite feel to it in this
> article.
I think what he's talking about is the angle the
author is taking: "Could this be the death of
e-commerce?"
Answer: No, it won't. Next?
Re:24x7 is easy, just do it right the first time (Score:1)
You're joking, right? Have you actually gone to the CEO of any company and told him/her that it's time to shut down marketing? "They can just take the next 2 months off while we re-engineer the back end systems."
And you seem to forget one thing: it is not a business decision what the capacity of the equipment is. If the equipment can support X users at once, the business types have precisely three choices:
And option 2 takes time, time to get the equipment in, time to configure and test it, and time to roll it into production cleanly. If the CEO doesn't like it, I'm sorry but that doesn't change reality.
Re:Transaction capabilities are not new! (Score:1)
Code has become bloated... I remember when I was in development, we had to fit our software on a low density floppy or two, since most of our users would not have HD floppies (Europe was a major factor in this decision) and more than two floppies would raise the Cost of Goods.
Appears to me that a lot of programmers, webmasters and networking people have forgotten how to optimise their crap.
I remember a LARGE bank in Malaysia running their servers on DOS(!) doing transactions at the rate of a couple of thousand a day. Where have we lost our ability to optimise code, data and our thoughts?
Transaction capabilities are not new! (Score:2)
I liked those Lotus Ads... (Score:1)
Re:People should *deal* (Score:1)
Actually one of the central focuses (foci?
Re:24x7 is easy, just do it right the first time (Score:1)
You're joking, right? Have you actually gone to the CEO of any company and told him/her that it's time to shut down marketing? "They can just take the next 2 months off while we re-engineer the back end systems."
You'll be lucky if you're not fired outright. And if they listen to you, they are even crazier than you :-)
Re:Dell (Score:1)
Re:How would you like to crash today? (Score:1)
Re:Sysadmins are janitors... (Score:1)
A couple of points... (Score:2)
As it stands now, eBay's auctions are so time-critical that they're in the same league as online brokerages. And speaking of brokerages...
Fidelity is running TV ads (plastered all over Pirates of Silicon Valley last night) touting the speed of their systems and how seconds count, with a quick disclaimer at the end of the ad that response time depends on network conditions. This is a pet peeve of mine: ads with disclaimers which make the rest of the ad meaningless. Example: "99c Big Macs! That's right, 99 cents! Only 99 cents! Prices may vary." But the point is that they're promoting the idea that the internet is suitable for real-time transactions, even though they recognize that it isn't quite there.
investing (Score:1)
"That WAS stupid, Bob" (Score:1)
This reminds me of the IBM TV commercial where Bob is at an AA-like meeting..."No one here is stupid"... Then Bob tells them that he forgot to tell his staff to ramp up the website for more hits because of their new PR. Then they all turn on Bob... "That WAS stupid, Bob."
Other Program Ideas :) (Score:2)
America's Funniest Core Dumps
When Spammers Attack
I Married A SysAdmin
Real Life Reboots
Totally Shocking Backups -- Caught On Tape
/* Alright -- quit yer groanin' */
Re:Didn't Oracle use its own FS? (Score:2)
The inside scoop was that, because eBay did not install the latest kernel patch to Solaris 2.5.1, they ran into a bug where if you have a kernel core dump of more than 2GB, it will piss all over your disks. I suspect they do not have root or swap under Veritas control.
So when the machine panic'd, it overwrote most of root (the core dump starts from the end of swap back to the beginning, many users have root just before swap on their disk).
So they not only had to restore Solaris, they had to restore their configuration. Not something that can occur quickly, esp. when the CEO of a company is breathing down your back. It is also my understanding that the eBay database itself was okay and didn't have any data corruption.
Re:Didn't Oracle use its own FS? (Score:1)
Oracle does a lot of management of its persistent objects (tuples, clusters, tables, indices etc.) in storage areas called tablespaces. These can be kept in conventional files or in "raw" disk partitions (raw in the sense that they do not have a file system that the OS can mount). Oracle manages these in a manner similar to an extent-based file system. Oracle not only manages mapping program requests to objects in the tablespace, it also has its own very sophisticated caching and journaling capabilities. When you keep your tablespaces in operating system files, you incur the overhead of the operating system's filesystem for very little or no benefit. In fact, there are many tuning parameters for table storage (controlling things like the size of initial extents and how additional extents are added) that are undermined by putting the tablespace in an OS file.
Therefore, if it is available on your platform, you'll want to let Oracle manage space on the raw device. The main reason not to is if you want to manage the data files using operating system facilities, for example moving them from one disk to another, or using operating system backup utilities if yours don't do a good job of backing up large binary streams. If you are seriously interested in high performance, you'll have to go the raw-device route.
If you keep key data in raw partitions, make some good tuning decisions, and do a judicious job of clustering data you can get astonishing performance out of Oracle. This requires a DBA who can think numerically, understands the applications being run against the DB and their users' expectations, understands the most important of the dozens (hundreds?) of tuning parameters and generally has intellectual qualifications beyond having a body temp of 98.6 F. Not only will you need this person, you'll need to give him time to experiment and ponder results.
IMO, you can't blame Oracle for this debacle; it did the right thing by recognizing the damage, assuming that one of its components had lost its mind and shutting them all down. This is even more the case because you can go back to a prior backup and replay all your transaction logs back as far as you have them, effectively recovering up to the last moment of operation.
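The backup-plus-log-replay idea in the last paragraph can be sketched in a few lines. This is a toy model and not how Oracle's recovery actually works; the recover function, the (seq, op, key, value) log format and backup_seq are all assumptions made for the example:

```python
def recover(backup, log, backup_seq):
    """Rebuild state from a backup plus the log entries written after it.

    backup      dict snapshot taken at sequence number `backup_seq`
    log         iterable of (seq, op, key, value) tuples, op is "set" or "delete"
    All names and the log format are hypothetical.
    """
    state = dict(backup)  # start from the restored backup
    for seq, op, key, value in sorted(log, key=lambda entry: entry[0]):
        if seq <= backup_seq:
            continue  # already reflected in the backup
        if op == "set":
            state[key] = value
        elif op == "delete":
            state.pop(key, None)
    return state

backup = {"a": 1, "b": 2}  # snapshot taken after log entry 2
log = [
    (1, "set", "a", 1),
    (2, "set", "b", 2),
    (3, "set", "a", 5),
    (4, "delete", "b", None),
    (5, "set", "c", 9),
]
state = recover(backup, log, backup_seq=2)
```

Here state ends up as {"a": 5, "c": 9}, i.e. the backup brought forward to the last logged transaction. The catch is that it only works for as far back as the logs have been kept intact.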
Re:Transaction capabilities are not new! (Score:1)
One of the most sensible comments I have seen for a long time. High performance, high transaction rate, high availability systems have been around for a long time. Anyone remember the original Tandem machines? IBM Series 1s? By the mid 80's in Australia, a number of financial institutions were using IBM System/38's for forex and similar stuff - dual systems, mirroring data base transactions.
Sure, all this stuff doesn't come cheap. I helped install a $1,000,000 fallback hot site for a major bus company here - but as their CEO said - if the system is off the air for more than 8 hours, all he could do would be to turn out the lights, and go home - his business would be dead.
You have to design the system - the total, end to end system - to handle the expected workload, and to provide the reliability your customers expect (and pay for). That also means having the people with the required expertise to implement and manage the system.
Cut costs - and you get what you pay for.
Ken
Proving the '7P'-principle -again-.... (Score:1)
'Prior Proper Planning Prevents Piss-Poor Performance'.
If we can teach this to grunts, -why- do those who are allegedly more intelligent fail -repeatedly- to learn it?
Ah, well. I recall when Comdisco failed in the attempt they made to show Shwlob what was about to happen in a simple email system, too.
Cheers,
Drieux
Isn't it Fun (Score:1)
I am just coming off a twenty hour day repairing problems in a production system. Both members of a cluster affected (by the clustering software itself of course). In the end we end up hacking out the best fix available on the fly.
Dependability is expensive, and that expense is often hard to justify to economy minded business people. Add to that the fact that even the most secure, stable, and isolated system will eventually break and it is a recipe for some very long days for those of us who answer the pages when it all falls down.
Good thing I enjoy this kind of work. Now it's off to a nap, then back to the office to listen to a vendor tell me his next release will address the trouble and explain to a few business folks that simply stating a system will be up 24/7 doesn't make it so.
Re:Transaction capabilities are not new! (Score:1)
If you have a mainframe with 3270 terminals, you know exactly how many users you will handle. Adding more users involves a definite decision, and you can decide what upgrades you need in order to handle those extra users.
With an internet application, there isn't such a direct link between the capacity you have and the number of users you can support. HTTP is a very variable protocol: perhaps one page requires 3 HTTP accesses, all of which can be served in under 1/100 second, while another page requires 50 accesses, which can take up to 3 seconds. Together with the growth of any Internet application, this makes capacity planning very difficult.
I'm sure that capacity planning will become better in the future, but for now, some growing pains are inevitable.
Re:Didn't Oracle use its own FS? (Score:1)
On Unix machines you have the _option_ to have Oracle use its own filesystem.
Re:24x7 is easy, just do it right the first time (Score:1)
Some people might suggest that IT should routinely overbuild, but that has its own problems. If I'm the CEO of a company that decided to purchase X capacity for $Y, then I learn that IT bought 2X capacity for $Y, my first and only question will be why the outgoing head of IT didn't buy X capacity and return $Y/2 for reallocation. Maybe I would have used it to buy the excess capacity... or maybe I would have used it to shore up a specific project enough to land that lucrative contract that just barely got away.
Re:How would you like to crash today? (Score:1)
store.apple.com is actually a cluster of 6 G3/400s running Mac OS X Server 1.0 and WebObjects. It has only had one problem, and that had to do with the credit card system; the servers kept running 24/7.
Oh, for those of you looking for unbreakable solutions, NeXT made something called Monitor.app.
It keeps track of the server load on each machine (it runs on every machine) and directs incoming traffic to the least used machine.
It works like this:
Machine 1 has a load of 50%, machine 2 75%, machine 3 90%. What Monitor.app does is tell Machine 1 to reply to a request first, until the loads balance out, and until then, all the other machines just work on their current traffic/transactions.
Really cool -eh?
-Pfhor
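For anyone curious, the "send it to the least used machine" rule described above fits in a few lines. This is only a sketch of the rule, not the NeXT tool itself; pick_machine, dispatch and the fixed 5-point cost per request are all made up for the example:

```python
def pick_machine(loads):
    """Return the name of the machine with the lowest reported load."""
    return min(loads, key=loads.get)

def dispatch(loads, requests, cost=5):
    """Route each request to the least-loaded machine.

    loads     dict of machine name -> load in percent (integers here)
    cost      assumed load added by one request, purely illustrative
    Returns how many requests each machine received.
    """
    loads = dict(loads)  # work on a copy
    assigned = {name: 0 for name in loads}
    for _ in range(requests):
        target = pick_machine(loads)
        loads[target] += cost
        assigned[target] += 1
    return assigned

cluster = {"machine1": 50, "machine2": 75, "machine3": 90}
counts = dispatch(cluster, requests=5)
```

With the numbers from the comment, all five requests go to machine1, which brings it up to 75% and level with machine2. After that the two would start sharing new traffic while machine3 keeps working off its backlog.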
Re:How would you like to crash today? (Score:1)
I believe the Apple Store servers (the 6 G3s) don't hold the database on them at all. It is probably held on two or three other redundant systems (WebObjects supports remote databases and database mirroring on two different machines). Apple probably has it set up so all but 2 computers (one web server and one database server) can crash and burn, but the web store keeps running. It also allows for individual machines to be swapped in and out for repairs.
From what I have seen, having redundant processors/power supplies/HDs is still a problem if you have to do a software patch, but having redundant machines gives more bang for your buck, especially since you can run any one of them independently.
-Pfhor
How true (Score:1)
OTOH, I'm not fazed by ebay's problem. I'll never hand over my CC to Amazon, and half the stuff I look for is uniquely weird - the only kind of things you can get on ebay. The crabbers are probably newbies - geez, live with it, it happens, you know? If they ever used Lynx they'd realize how far browsing on the net has come!
No problem with that :-) (Score:1)
I remember seeing ads for Lotus that went like "the net is screaming for capitalists" or something...I suppose what hurt is that one ad had the line "short stories that nobody reads". That seemed awfully crass. The net thrives on its humanity. Take out the humanity and what have you got? The soulless machine that writers have been warning us about for ages (just reread Fahrenheit 451 - so hard to believe it was written in the 50s!)
It hurts me that everything is done for eyeballs and money. Newbies will never realize how great it was to surf. They aren't wary of schemes that focus on them as a psychographic and target market. They think, "Oh how nice they want to give me free webspace". They don't think, "Gee, I don't like having my page cluttered with ads that I don't agree with."
Ebay is pretty cool. I snagged a lot of good books dirt cheap there. But it is getting harder. Before I could make a bid a few days before and still win. Now I have to wait until the last few seconds to swoop down on everyone else
Re:Other bits to read (Score:1)
Re:Dell (Score:2)
Re:Other Program Ideas :) (Score:1)
You forgot Honey, I Shrunk The Budget.
--
Fourth law of programming: Anything that can go wrong wi
Re:Fox Special-"When Sites Collapse". (Score:1)
> Seriously though 24/7 can be done with
> present day technology. The phone system
> comes to mind.
On Thursday/Friday two Swedish magazines carried
a story about upstart "Bluetail", a spinoff from
Ericsson. These people have a telecom background
and their "Mail Robustifier" product is just out
in release 1.0. Written in the Erlang programming
language used by Ericsson in telephone exchanges,
it does load sharing between "mail servers" (I'm
not sure whether this means SMTP or POP3/IMAP)
and promises 99.999 % uptime. The targeted
market is large or medium scale ISPs.
With more problems like eBay's we should see more
telecom people moving over to doing web-related
products. Either telecom companies will change
their business or there will be spinoffs like
Bluetail, http://www.bluetail.com/
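For scale, here is what a figure like 99.999 % works out to over a year. This is plain arithmetic, not anything from the product; the function name is made up:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years

def downtime_minutes_per_year(uptime_percent):
    """Minutes of allowed downtime per year at the given uptime percentage."""
    return MINUTES_PER_YEAR * (100 - uptime_percent) / 100

five_nines = downtime_minutes_per_year(99.999)   # roughly 5.3 minutes
three_nines = downtime_minutes_per_year(99.9)    # roughly 525.6 minutes, under 9 hours
```

So "five nines" leaves a bit over five minutes a year for every crash, patch and upgrade combined, which is telephone-exchange territory rather than typical web-site territory.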
Complexity (Score:1)
1) the technology is a lot more fragile than the marketroids will have you believe and the engineers want to believe.
2) Complexity seems to increase as 2 to the nth power, where n is the number of components. E.g., 2 servers have 4 ways they can interact, while 3 would have 8 ways to interact (not counting the fact that there may be many software interdependencies on and between machines).
3) Planning is key, and the central tenet should always be KISS (keep it simple and stupid -or- keep it simple, stupid!).
4) The larger the environment, the more it has a life of its own.
5) The larger the environment the more crucial communication becomes.
6) #5 above can lead to information overload.
7) There is no substitute for an intelligent, well-trained, well-led staff. All the certification programs and fancy admin tools cannot substitute for that.....
My $.02
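One way to read the 2-to-the-nth figure in point 2 is as the number of possible subsets of components that might be involved in any given situation (including "none of them"). That reading is an assumption, but it does reproduce the 4 and 8 quoted above:

```python
from itertools import combinations

def interaction_sets(components):
    """List every subset of the components, from the empty set up to all of them."""
    return [
        subset
        for size in range(len(components) + 1)
        for subset in combinations(components, size)
    ]

two = interaction_sets(["web", "db"])
three = interaction_sets(["web", "db", "mail"])
ten = len(interaction_sets(list(range(10))))
```

Two components give 4 subsets, three give 8, and ten already give 1024, which is the point: every box you add doubles the number of combinations somebody has to think about.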
Re:Dell (Score:1)
1. The number of people buying new $1000+ Dell computers.
2. The number of people wanting to bid on collectibles at eBay.
They're hardly in the same class.
Re:Dell (Score:1)
Why did eBay collapse? Seriously bad design (Score:1)
Hey, we knew that. Even the best systems out there are expected to be down a few minutes a year, and most of 'em (including those "super-reliable" Suns) are on the order of a couple of *days* a year. Throw a relational database into the equation and, well, reliability ain't so hot.
There are ways to deal with that, and eBay didn't do ANY of them.
At a minimum they should have had a hot backup available, PARTICULARLY for the single point of failure -- the database. With a hot backup they could have been back online in a matter of a couple of minutes. It was insane to bet their business on a single Sun/Oracle box! Whoever made that decision should be out on the street.
But they can do a lot better than that with a little middleware infrastructure. There's no reason they can't replicate transactions to multiple databases -- or even split their databases up so they have lots of little ones handling part of the load rather than One Big Server.
Of course that will take some technology that is a bit beyond the duct-tape-and-baling-wire stuff they're using. It's not rocket science but it's gonna be a bitch to do with CGI.
What it all comes down to is that they bet on an infrastructure design that had a single point of failure and were screwed when it failed. That could have been -- SHOULD have been -- foreseen and protected against.
I could maybe see that being OK in a startup that didn't have the cash for duplicate hardware outlay, but eBay has the cash in spades and they STILL didn't do it. There's a certain level of stupidity at work here.
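As a rough illustration of the "lots of little databases" idea above, here is one common way to split the load: hash each key to pick a shard. This is a generic sketch with in-memory dicts standing in for database servers, not a description of anything eBay runs; ShardedStore, shard_for and the item keys are invented:

```python
import zlib

def shard_for(key, shard_count):
    """Map a key to a shard number using a hash that is stable across runs."""
    return zlib.crc32(key.encode("utf-8")) % shard_count

class ShardedStore:
    """Toy stand-in for several small databases sharing one workload."""
    def __init__(self, shard_count):
        self.shards = [dict() for _ in range(shard_count)]

    def put(self, key, value):
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        return self.shards[shard_for(key, len(self.shards))].get(key)

store = ShardedStore(shard_count=4)
for item_id in range(100):
    store.put(f"item-{item_id}", {"high_bid": item_id})
```

Losing one shard then takes out only the auctions that live on it instead of the whole site. It does nothing for redundancy by itself, so each shard would still want the hot backup mentioned above.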
I agree (Score:1)
"Planning planning planning."
Planning is the key. On personal systems, there are UPS devices, floppy drives, RAID configurations (well maybe not so often on a PC), Zip drives, CDRs... all sorts of mediums to circumvent the loss of data. Just because a system is online or owned by a large group is no reason to assume it is secure.
-Clump
Re:opps (Score:1)
I would settle for "For Auction" posts being banned from Usenet "For Sale" newsgroups, though.
I am NOT interested in participating in an auction when I browse my favorite 'for sale' groups for deals on tech hardware (i.e. sci.electronics.equipment).