The root of all eBay's troubles
UncleRoger writes "A friend pointed me to this article which would appear to explain why eBay has had such troubles with downtime, including the outage since Wednesday evening." It would appear that MS is tired of having the finger pointed at them; as they point out, it's an Oracle database running on Solaris that's causing the troubles.
Re:Classic obfuscation (Score:1)
You may suspect that it is false, but it seems to be eBay's story and they are sticking to it.
From the open letter from the founders now on eBay's home page:
To help ensure this, we are working diligently on a hot backup system that should automatically limit the length of potential outages to less than an hour or so. We have been working on this system for many months, and it is almost ready. Sadly, if we had this system in place a few days ago, we might have avoided this outage.
Re:Buy NT because Sun's unreliable? (Score:1)
I wonder if this is why MS uses LINUX!! (Score:1)
(Server: Apache/1.3.6 (Unix) mod_ssl/2.2.8 SSLeay/0.9.0b)
I guess even when M$ considers "The Importance of Reliability in an e-Commerce World" they choose _NOT_ to use NT, like everyone else.
How do they expect other people to use their products when they don't?
I don't get it.
Re:design flaws / operator error (Score:1)
---------
Oops, I mean UNIX... heh heh (Score:1)
Re:Some fun statistics from ms... (Score:2)
My solution (Score:1)
If your website is not high capacity, you should be okay. By high capacity I mean over 10 million hits a day. Hence Slashdot is not high capacity. If Slashdot got 10 million hits a day you would watch Rob's little world melt down into a puddle of pee - MySQL and mod_perl would melt down big time.
If you are exceeding 10 million hits a day, avoid hitting a database. You're going to slam the machine in no time. Go for static pages or Apache SSI unless you truly need complex pages built out of a DB. Even then you can usually fudge it with static pages.
Your DBA who wants everything in Oracle should ask himself how much he wants Larry Ellison to control his destiny. Most old timers I know would like to minimize their debt to Oracle.
Re:HINT: DO NOT CONNECT ORACLE TO LIVE WEB PAGES (Score:1)
Sorry, those sites are small potatoes. You could run them with an abacus and smoke signals.
Time out for fresh air (Score:1)
So lay the fuck off Mike and his engineering operation until you understand the actual details of what was happening there. People seem to believe that "state of the art" comes about through random acts of kindness or something. No. It comes from learning from mistakes and accidents, and the bigger the flaw, the bigger the step forward that those on the leading edge make to put those things behind.
I agree that Microsoft's leap to take advantage of this in marketing and PR was uncalled for. On the other hand, McNealy's crew is famous for that kind of stuff too.
But the spinmeisters at Sun have nothing on Redmond, and so I still have to say
Why does MS try to justify its problems? (Score:1)
Re:where's the hot backup server? (Score:1)
if (nt == unstable) { switchTo.linux() }
becomes
switchTo.linux();
Re:Uh oh... not Microsoft's fault? (Score:1)
Ultra 5 (Score:1)
Of course the thing can crash; it's a computer. Live with it, Microsoft.
--Mark
Re:the real ebay expense... (Score:1)
By my calculations (Score:1)
Re:Ultra 5 (Score:1)
For all the money they charge you, and the fact that they market it as a desktop you'd think they'd put in an internal speaker which wasn't crappier than that of a PC.
Is it really the hardware/software? (Score:2)
Re:where's the hot backup server? (Score:1)
becomes
switchTo.linux();
that's after a -O. after a -O2 you get FreeBSD using SMP
Hotmail Guts (Score:1)
But I understand that MS is moving WebTV into the same building, so they may move to Magnavox and Curtis Mathes any day now...
Strange, new eBay = Unavailable eBay (Score:1)
I think they're just overloading their servers... Again.
eBay's problem, not Sun/Oracle (Score:2)
If eBay uses a single Enterprise 10000 server for the back-end database, they should have had a standby server that they could have switched to in seconds. eBay could also have distributed database operations further.
One thing is clear: NT has no advantage in this area. Sun gives you the option of lots of little servers or one big server with a backup, and depending on the application, one sometimes has to make the latter choice. With NT, however, you are forced to go the lots-of-little-server approach.
On a side note, on the day on which Microsoft's poor security architecture in MS Office has been responsible for shutting down lots of corporate sites (including their own) and caused thousands of users to lose their data, their whining seems very ironic. eBay's problems are eBay's fault; the virus problems are Microsoft's fault.
InternetWorld article E10K and Oracle. (Score:2)
InternetWorld article here. [internetworld.com]
There's an article from a couple of months ago over at InternetWorld that profiled the eBay server setup and its *two* Enterprise 10000s (Starfires).
Read it and you'll understand just how complex a setup eBay has. One of them performs the searching for the site: "We had search vendors come in and tell us they had a great product, and we'd point a little of our load at it and it would melt into a puddle of metal on the floor."
Buy NT because Sun's unreliable? (Score:1)
ebay's troubles... (Score:1)
right after they changed the layout of their pages. Maybe they should take a look at how Slashdot organizes itself (or re-organizes itself)....
Anyhow I am mildly annoyed; I'm in the middle of an auction. Rather reminds me of the UO beta...
Re:Buy NT because Sun's unreliable? (Score:2)
And of course the "wills" sprinkled around, prefixed with Windows 2000.. Oh, man, this is just boooooring.
Penguins do fly.. (Score:1)
What do you mean Penguins don't fly, they fly perfectly well. Under water that is.
Re:Ultra 5 (Score:1)
i don't like sun's ultra 5's either. but i do like the knock offs... can't remember the name of them now, but the place i do some work for has 5 333MHz boxes in nice rack mount cases w/ scsi running the website (all behind a Cisco LocalDirector).
these babies have a very nice price/performance level (running 2.6 not 7)
crap, i can't remember the name of the reseller who builds them... the case comes off easy, easy to get at stuff.
henri
Re:My solution (Score:1)
And the pay is really _crap_. It reminds me of one of the South Park episodes where Cartman gets abducted... "Sorry about your ass, dude."
Re:Well.. (Score:1)
Re:MS forced ...wtf (Score:2)
That's because Microsoft's definition of "force" is generally, "We'll give it to you free. We think you should use it." It's how MS has gotten a serious stranglehold on education (where my PHBs and colleagues scoff at me/retreat fearfully from me whenever I mention Linux). While I can certainly understand the discomfort some people have when considering alternatives to MS, both FUD and the lack of cost make it very easy for managers to disregard any advantages to other platforms.
NeXT's WebObjects e-commerce offering to Dell was apparently designed and implemented in less than a week by one or two (granted quite gifted) developers. Microsoft's replacement, however, required a large team working for several months.
"Just fine" may suit your needs now, but you also had a system which likely suited your needs "just fine" which didn't require significant time and energy expenditure for replacement.
When was the last time you had to reboot the SQL Server for problems related to the operating system? When was the last time you had to restart the SQL Server Service? When was the last time you had to restart the server simply because you installed new software on the machine?
Re:HINT: -- Would you be so kind as to elaborate? (Score:1)
My experience has shown Oracle to be really fast, really solid and _really_ expensive. As long as you are willing to put a lot of time into the set-up and initial tuning, run it on a really expensive machine, and put a little time into maintenance it is a rock-solid solution. But it is rather ugly if you need to make any changes to the schema or data types. If we want to make any changes, we have to go through the DBA, which is a real pain in the ass. He's a nice guy and all, but extremely overworked.
Actually, we just started using Oracle 8, Personal Edition on our desktops. That might be a good solution, too. If anyone is interested in how well it works, let me know and I'll keep you updated.
Jeff
nrrd@earthlink.net
Re:your experiences (Score:1)
Exactly. See #2.
If a bad CD will bring down NT server. . . (Score:1)
Just a thought. . .
Re:The real problem (Score:1)
The only thing that Linux would prevent is corrupting the equivalent of win.ini to launch the worm on other machines, or for other users of the same machine.
Well.. (Score:1)
Re: (Score:1)
EBay Failures (Score:1)
Re:Ebay uses not one E10K, but... (Score:1)
Re:Uh oh... not Microsoft's fault? (Score:1)
40% reporting downtime?! (Score:1)
The SPARC Solaris machine I use at work has been running for 2 1/2 months straight (and that reboot was because of a power outage). On the other hand, if our NT server doesn't BSOD at least twice a day, it's a red-letter day.
Barnes&Noble Online reliability with MS SQL? (Score:1)
The TechNet article is written in a dull, prejudiced way that's more than a bit obvious in its selectiveness. The other points have been pretty well attacked, so I won't touch 'em.
This is absurd. (Score:3)
"Applications running in Domains are only as reliable as the instance of the Solaris operating system. For applications to gain enhanced reliability from Domains, users must explicitly set up clustering, just as in standalone systems. Sun does not recommend clustering between Domains, suggesting instead that fail-over occur to either separate, standalone systems or Domains in other Enterprise 10000 systems."
Uhh, duh, isn't that the whole idea? Am I missing something here?
"Daemons that control domain operations and perform monitoring functions run on an unreliable device (Ultra 5 workstation), hardly a desirable situation in the context of a data center."
Excuse me? The Ultra 5 an "unreliable device"?? We have a farm of Ultra 5s that have been running for a year now. Total number of system failures or crashes of any kind: 0. Period. How is the Ultra 5 any less reliable than any other workstation-class system?
"When security is compromised on the System Service Processor, which runs on the Ultra 5 workstation controlling domain operations and performance monitoring, all running domains on the E 10000 can be brought down with a short command sequence."
No kidding. When (or rather, *if*) security is compromised, you could do a whole lot more than bringing down all running domains. Just the same as any other platform. How is this a weakness specific to Solaris or the Starfire?
And besides, these are supposed to be *secured* (meaning, physically) control consoles. Meaning, locked in a cabinet in the datacenter.
"System boards that are hosting non-pageable kernel data structures cannot be removed from a domain without interrupting service. The Solaris operating system has to undertake a special "quiesce," or suspend, operation while the critical pages are migrated to another board."
Ummm, yeah. So? How is this any different from any other operating system? Again, I fail to see what the problem is. And besides, how often do you change system boards? Please.
Sure, go ahead... try and remove a CPU card from any NT-based system without first warning the OS. Not only will it hang horribly (i.e., you can't do it!), you'll probably fry hardware as well!
The fact that the Starfire can even do this is pretty amazing.
"System boards that are hosting Token Ring adapters, ATM adapters, or non-Sun disk controllers cannot be present in a domain if board-remove operations involving kernel quiescence are to be performed on that domain."
Uh-huh. Sure. I know lots of people with Starfires running Token Ring off of non-Sun hardware that are removing boards with non-pageable data. Happens every day.
I'm not saying it doesn't happen per se, I just think that these arguments are rather ridiculous.
"If you remove a system board from a running domain without enough swap space, Solaris will hang. The administrative tools do not warn you if you do not have enough swap space available."
What kind of idiot doesn't leave enough swap space? What kind of admin would go ripping out system boards without really thinking it through first? What kind of person spends the incredible amount of money the E10000s cost without being informed as to the basics of running a Solaris-based system? Come on.
It's like saying "If you remove a CPU card from an NT-based system while running domains are active, the system will be brought down and all domains brought offline." Ummm, duh. If you remove your legs, you can't walk either. Apparently, M$ thinks that true Unix sysadmins are as stupid and lacking in common sense as the server admins that they're used to dealing with.
"Reliable hardware is getting even more reliable. For example, customers can take advantage of 99.9% system-level uptime guarantees for Windows NT-based servers from major systems vendors, such as Compaq, Hewlett-Packard, IBM, and Data General."
These are guarantees on the hardware, not software. I'm sure this looks great for the PR, but hello? I'd love to know what the "major system vendors" think about Windows-based servers being equated with their hardware guarantees.
"Microsoft Windows® 2000 Server builds on these gains. For example, Windows 2000 Server supports COM+ load balancing, which eases customer development of highly available and scalable applications in a multi-tiered environment. On the back-end, Windows 2000 Advanced Server supports two-node fail-over clustering, whereas Windows 2000 Datacenter Server will support four-node clustering. IBM and other vendors will provide support for up to eight nodes."
WOW! I am truly impressed. Two or four-node fail-over. Please.
Finally, at the end:
"Which brings us back to eBay. For those keeping score, eBay relies on Windows NT-based servers running Internet Information Server to provide front-end web services, and a single Enterprise 10000 from Sun Microsystems to host an Oracle database on the back-end. According to published reports, the outages at eBay, which began in February, are due to problems at the back-end."
This is curious. Maybe I'm missing something, but a telnet to port 80 shows that www.ebay.com is using Apache 1.3.6 on Solaris. It doesn't get any more front-end than that, does it?
I did notice that pages.ebay.com and listings.ebay.com are running IIS 3.0, and cgi.ebay.com is running IIS 4.0.
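For anyone who wants to repeat the "telnet to port 80" check without typing the request by hand, here is a rough Python sketch. It is only an illustration: the function names `parse_server_header` and `fetch_server_header` are made up for this example, and the `Server` header is simply whatever the remote end chooses to report, so it may be missing or misleading.

```python
import socket
from typing import Optional


def parse_server_header(raw_response: bytes) -> Optional[str]:
    """Return the value of the Server header from a raw HTTP response, if any."""
    # Everything before the first blank line is the status line plus headers.
    head = raw_response.split(b"\r\n\r\n", 1)[0]
    # Skip the status line, then look for a header named "Server".
    for line in head.split(b"\r\n")[1:]:
        name, sep, value = line.partition(b":")
        if sep and name.strip().lower() == b"server":
            return value.strip().decode("latin-1")
    return None


def fetch_server_header(host: str, port: int = 80, timeout: float = 5.0) -> Optional[str]:
    """Send a bare HEAD request, as one would type into telnet, and read the reply."""
    request = f"HEAD / HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closed the connection
                break
            chunks.append(data)
    return parse_server_header(b"".join(chunks))
```

Calling `fetch_server_header("www.ebay.com")` would make the live request; what it returns today will of course differ from what was reported in this thread.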
Also notice that their web site is still up and running. Not that that means a whole lot, but hey.
I find a lot of what this article had to say utterly hilarious. The implication that the Starfire is an unreliable and dangerous system is the greatest work of FUD that I've seen in my life.
OK, enough said.
Easy (Score:1)
So let's see: 24 hours * 60 minutes * 4 days
= 5760 minutes
At 99.9% uptime, this gives you over 5 minutes to reboot the system - should be OK if you have a fast trained monkey.
Let's face it, Linux cannot reach that level of reliability, unless you also hire a trained monkey to pull out the power cord at regular intervals.
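The arithmetic in this comment can be checked with a few lines of Python. This is just a sketch of the calculation as posted (a 4-day window and a 99.9% uptime figure); the function name `downtime_budget_minutes` is invented for the example.

```python
def downtime_budget_minutes(days: float, uptime_percent: float) -> float:
    """Minutes of downtime allowed over `days` at the given uptime percentage."""
    total_minutes = days * 24 * 60  # 4 days -> 5760 minutes
    return total_minutes * (100.0 - uptime_percent) / 100.0


budget = downtime_budget_minutes(4, 99.9)
print(f"Allowed downtime over 4 days at 99.9%: {budget:.2f} minutes")
```

That works out to a little under six minutes, which matches the "over 5 minutes" in the post.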
Re: At least NT is bearable (Score:1)
We all know... (Score:1)
All said and done, this is yet another good reason to not have ANY microsoft products in YOUR company's final solution.
Microsoft was giving ebay a firm kick in the teeth while they were down. Sun got splattered with blood, spittle and ebay's missing teeth. You know, if I were in eBay's position, I would really resent being used as market leverage. Yeah, fscking microsoft... that's the kind of people I want to do business with. It was about as brilliant a maneuver as Cabletron's bashing of cisco several years ago... cisco yanked their licensing agreement with Cabletron and someone in Cabletron's marketing department got fired. I suppose if you're stuck under some power hogging motherfscker in your company's marketing department, you have to use shock, stormtrooper tactics to get recognized, but jesus folks... there are much more creative ways to be fired or quit.
Still, we all know that nothing ventured is nothing gained. Sometimes it is better to not "venture" for fear of what you might "gain". If microsoft gains stupid customers from this venture, it will only make the final outcome darwinian.
Re: (Score:1)
Re: (Score:1)
Re:Ultra 5 (Score:1)
Thats why you'll need an Ultra 10, Creator 3D graphics..
I just want to bid on an open reel Lesley Gore tape (Score:1)
Chuck
Re:Classic obfuscation (Score:2)
Yeah! I can see MS's marketing department gearing up right now....
90% of Yugo mechanics recommend MSActiveYugoYoke as the best Yugo yoking solution on the market*.
Studies have shown that the front passenger in the central Yugo in an MSActiveYugoYoke system sustains fewer neck injuries** than in competitors' systems.
Mechanics can take the MSActiveYugoSolutions Certification Test to demonstrate their compliance with a number of stringent guidelines set by Microsoft Customer Relations.
With an MSActiveYugoYoke system, you can go where you want to go today!
With MSActiveYugoYoke Enterprise Customer Satisfaction Enhancement Warranty, you can assure MSActiveYugoYoke functionality decades into the future!
Microsoft is firmly committed to enhancing MSActiveYugoYoke ease of use, particularly in high-speed interfaces with LargeSunTrucks.
Microsoft prides itself on its high degree of MSActiveYugoYoke-VehicleJava compatibility***.
MSActiveYugoYoke-the product line awarded the Gold Consumer Choice Vehicular Safety Award****.
Quotes from satisfied MSActiveYugoYoke users:
"Well, it's a big improvement over MSActiveYugoYoke 95, I must say that." -- Dan Smith, Yugo mechanic
"Microsoft's Yugo yoking system is my system of choice for yoking Yugos." -- Steve Jobs, interim Apple CEO
"MSActiveYugoYoke is the supreme Yugo yoking solution." -- Jim Bob, a guy at Microsoft
"I guess if you like Yugos and use other Microsoft products, you might as well go with MSActiveYugoYoke to ensure product interoperability." -- an anonymous guy at C|net.
-----------------------------------------------
* Survey conducted by Mindcraft. Error margin +- 20 percentage points. [Note: only Yugo yoking system on market at the time was MSActiveYugoYoke]
** While traveling between 8.5 and 9.2 MPH on toll bridges during hurricane conditions. Survey conducted by Mindcraft.
*** Certain procedures for unyoking in emergency conditions bear vague relation to VehicleJava emergency unyoking procedures in other vehicle yoking systems.
**** Award won was January 1996 award, won with MSActiveYugoYoke 95. Current product is the similar MSActiveYugoYoke 99.
Re:Ultra 5 as SSP doesn't run solaris 7 (Score:1)
last year, they had upgraded the SSP software to a buggy level. As a result, the E10K hung, and had to be physically powered down. All during my test.
----------------------------
Dammit Jim...It's "U-N-I-X",
Re:it's strange that there's so much media coverag (Score:1)
Re:Microsoft's new asshole (Score:1)
I'm not advocating anyone writing a worm like this, but Linux is going to be quite susceptible to this sort of problem. Only if sending mail requires authentication beyond just being logged in can you prevent this, but that's not very realistic.
In fact, all this could easily have been done on the mid-80s UNIX systems that I used to run.
I agree that macro viruses are based on the absence of any security controls for Office macros, but this sort of worm is not dependent on that - in fact you could write it to attack Netscape on Windows or Linux.
Re:I'll take the Sun anyday (Score:1)
Re:What's wrong with you? (Score:1)
It just ain't so. Microsoft is lying again. Nothing new, eh?
--
Get your fresh, hot kernels right here [kernel.org]!
I spoke with an Ebay Employee and the truth is... (Score:1)
What total assholes (Score:1)
That leaves them with one reasonably uncompelling point about dependence on a service processor when they start plugging their own high-availability "story" - and what a piece of fiction that is. I've worked professionally in the HA field, and Wolfpack is the laughingstock of the industry. It's the most unbearably pathetic HA package I've ever seen, only surviving because of its parentage, and even then it amazes me somewhat that anyone uses it.
I know marketing material is not intended to be objective, but this piece is the most blatantly offensive piece of misinformation I've seen in a long time. While no individual claim in it is untrue, the overall result is incredibly misleading. Whoever wrote it is a master of their craft, but some forms of mastery don't deserve to be acknowledged.
Re:By my calculations (Score:1)
Sorry... 24 times 7 times 365.25 is not the number of hours in a year.
eBay confirms it's Sun (Score:1)
CNet implies [news.com] that it's Sun, but doesn't come out and say it.
By the way, their IIS is version 3, which very few people are still running. Hard to see why they stick with it.
unbelievable crap (Score:1)
Not a fair comparison. (Score:1)
Whoever is in charge of maintaining eBay has a herculean task, and I respect them.
However:
I don't like the way eBay handled the Ebayla fiasco, I didn't like the way they dealt with (er, didn't deal with) the earlier allegations of security problems, and I'm just plain concerned about eBay's growth outstripping their ability to keep up with it.
Actually, I saw an E3K combust yesterday (Score:1)
___________________________________________
Re:HUH? Isn't eBay on NT? (Score:2)
If what I heard was correct, the part that croaked on Hotmail was Exchange, not IIS.
(Which shouldn't be a surprise - Exchange only started supporting > 16 GB in its database last year or so. For those not used to dealing with corporate mail environments, 16 GB is not very much.)
--
Re:Microsoft's new asshole (Score:1)
Even if this PARTICULAR virus isn't an Office Macro it sure does depend on hooks into the email system to spread. Not to mention that a quick look at any virus def file these days shows that 90% of recent virii are based on attacking Office security holes. Microsoft's Office is a nice agar plate Just Waiting to be inoculated.
Only Microsoft is bozo enough to develop a system enabling the rapid spread of portable, CROSS PLATFORM virii. If you are a Mac user, something like 98% of your virus def file is made up of Office macro virii.
Office macro virii are a huge cost for IT organizations. If they are not careful somebody is going to realize this and start publishing a cost analysis....
Re:System board? Wow! (Score:1)
I think it's the chassis, power supply, and (most importantly) the backplanes for the system cards.
Re:Ultra 5 (Score:1)
Re:eBay confirms it's Sun (Score:1)
Steve Westly, vice president of marketing and business development, said Friday afternoon that the database damage was traced to a failure in software developed by Sun Microsystems Inc. (Nasdaq:SUNW)
"We know it is a problem caused by the Sun software," Westly said. "We have their full support, going to the top of Sun, they are committed to solving this problem."
Uh oh... not Microsoft's fault? (Score:1)
I guess it is possible that another OS can, under certain conditions, fail. Of course, given my experiences with NT Server, I think it would be insane to put an NT server under those types of load conditions.
Any of y'all out there running Linux with the types of loads eBay has been experiencing?
And two other observations:
1) One wishes that Microsoft was as perceptive about their OS' flaws as they are about Solaris.
2) I'm just a wannabe anyway, what the hell do I know?
Some fun statistics from ms... (Score:4)
More here [microsoft.com]
Yeah, yeah, it's workstation and not server, totally different operating systems. Not.
Linux vs NT, which is unstable? (Score:1)
I use NT at work.
From those two facts, guess which one I have found more reliable, more useful, and enjoy the most?
HUH? Isn't eBay on NT? (Score:1)
Thad
Re:The FUD is so thick I can hardly see... (Score:1)
Do you have any links for MS cooperating w/ Tandem on this stuff?
Even with Tandem's help, I don't see it doing them much good. The reason you were able to hot-swap the CRU's with the Motherboards and stuff in them, was because the hardware switched things over to one of the many backup boards. Unless MS gets someone to develop this kind of hardware for x86, I can't see it doing them much good.
You're right that eBay should have switched over to a Tandem. Even though this is a software and not a hardware or OS problem, Non-stop UX does a hell of a lot to help you when something goes wrong. They probably would have been able to detect the problems with their software by now.
Well, before I blame eBay too much, I'm off to go see if Oracle has recent ports to Tandem hardware.
The FUD is so thick I can hardly see... (Score:5)
--------quote on----------
3. When security is compromised on the System Service Processor, which runs on the Ultra 5 workstation controlling domain operations and performance monitoring, all running domains on the E 10000 can be brought down with a short command sequence.
--------quote off----------
So, let me get this straight. The workstation which is responsible for "controlling" operations can be used to stop operations? What was Sun thinking, including a command that would turn things off! Of course, we all know that one of the main features of NT Server over NT Workstation is that the "Shutdown" command has been removed from the "Start" menu.
--------quote on------------
4. System boards that are hosting non-pageable kernel data structures cannot be removed from a domain without interrupting service. The Solaris operating system has to undertake a special "quiesce," or suspend, operation while the critical pages are migrated to another board.
--------quote off------------
This is supposed to be a problem? Now, I think it's pretty neat that you can migrate kernel memory off of a certain piece of hardware and swap it out at all. We're supposed to believe that under NT you can do this at all? Much less without telling the kernel first? The only conceivable way to allow this to happen without telling the kernel to clear the board out first would be to make sure that all the kernel memory has had a copy paged out to disk. Or perhaps keep multiple copies of all kernel data structures (and hope they don't get out of sync.) Maybe NT does do the last one. That would certainly explain why it's a memory hog.
Pretty amazing if you ask me. This web page is clearly meant for the PHBs of the world, as anybody with any knowledge at all of how computers work is simply going to laugh at this.
Re:By my calculations (Score:2)
24*365.25*10000*.001=87660 lost due to downtime.
Let's look at a Sun system, at 99.97% uptime:
24*365.25*10000*.0003=26298 lost due to downtime
So, that's roughly a $61,000 savings in one year.
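For anyone who wants to plug in their own numbers, here is a minimal sketch of the same arithmetic. It assumes the .001 figure corresponds to 99.9% uptime and that 10000 is a downtime cost in dollars per hour; the function name is made up for illustration.

```python
# Sketch of the downtime-cost arithmetic in the comment above.
# Assumptions: availability is a fraction (0.999 = 99.9% uptime) and
# cost_per_hour is in dollars. The function name is hypothetical.
HOURS_PER_YEAR = 24 * 365.25  # 8766 hours, same as in the comment


def annual_downtime_cost(availability, cost_per_hour):
    """Expected yearly cost of downtime for a given availability fraction."""
    downtime_fraction = 1.0 - availability
    return HOURS_PER_YEAR * downtime_fraction * cost_per_hour


nt_cost = annual_downtime_cost(0.999, 10_000)    # the .001 case
sun_cost = annual_downtime_cost(0.9997, 10_000)  # the .0003 case
print(round(nt_cost), round(sun_cost), round(nt_cost - sun_cost))
```

The difference scales linearly with the hourly cost, so a site that loses far more per hour sees a proportionally bigger gap.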
Why doesn't M$ worry about their own problems (Score:1)
Re:FUD SHARKS to the left! FUD SHARKS to the right (Score:1)
This is pretty low. Yeah, it can happen - what else is an OS supposed to do when it has more processes than now remains as memory?
Come on now, NT's "You are running low on virtual memory" error message is one of the most beautiful parts of the OS. It is perhaps the single most profound statement bestowed upon us. If Solaris (or Linux, what the hey) cannot provide the most highly trained administrators (I'm talkin top notch MCSErs here) with this sort of insight, well you get what you deserve then.
Re:MS forced ...wtf (Score:2)
My understanding was that Microsoft paid Dell for the conversion costs. (And that the WebObjects setup was breaking under the load, but who knows if that was hardware or software.)
--
Ebay uses not one E10K, but... (Score:3)
Re:System board? (Score:2)
Using software tools, it is possible to segment all of the system boards so that they behave as if they were individual physically discrete systems.
And yeah, I'd like to see Microsoft pull off the same thing.
eBay's custom software is buggy (Score:4)
It's just more FUD from the Empire.
The _real_ scoop on the E10k (Score:2)
If the SSP (Ultra 5) dies... well, wait. It really doesn't happen. Something like a hard drive crash might do the trick. When you are without an SSP, the domains (virtual hardware systems) on the E10k continue to operate. But you're not going to catch things like record stop dumps (hardware errors and warnings... such as persistent ECC memory errors). However, most sites that have purchased E10ks have also purchased two SSPs. They're so cheap in comparison, it makes sense. We have YET to fail over onto the secondary SSP on any of our 10 E10Ks. Since when is an Ultra 5 an "unreliable device"?
Sun complaining that the OS needs to be temporarily quiesced in order to move the kernel from one bank of ram to another? Heck, it's a miracle that it can even happen at all. I'd like to see Microsoft write the code to move the kernel on the fly. Not a project I would want to be on.
Poo-poo on the adaptors that don't do DR? Hardly even an issue. Look at them... token ring, ATM, third party. I wouldn't even run a third party SBUS card on my E10k. The translation is that "a minority of SBUS cards are not a good choice for the E10k." Big deal, Bill.
About the swap space issue... they might actually have an issue there. I'm sure Sun is working on a warning now, if it is a problem. BTW... at that point you haven't actually REMOVED the system board. You are doing an operation called a "DR Drain" which moves all the pages of memory from the RAM in that system board to another. Once successful, you are able to remove the system board from the configuration, or abort the change.
Classic obfuscation (Score:3)
1. Sun Enterprise 10000 systems have single points of failure. You can't hot-swap CPU boards arbitrarily, and the Ultra-5 front-end is a critical component.
2. Sun recommends that for high availability you cluster between multiple 10000 systems. This is bad.
3. Microsoft's commodity hardware platforms do not offer any of the scalability or reliability features of the Enterprise 10000, so clustering is the only option. This is good.
4. Microsoft's current clustering offering is primitive. In a survey, a majority of people said it was adequate.
5. Microsoft promises that Windows 2000 will have better clustering than NT.
6. eBay is not following Sun's recommendation that high availability requires multiple systems. They have experienced outages.
BTW, it is shocking to me that eBay could have only a single server. This is at best incredibly naive; at worst blatant incompetence. Therefore I suspect it is false.
Always question competence first... (Score:3)
I run a Sun 10000 with two SSPs. 10000s are connected to their SSPs via private ethernets. I have three private networks; two to allow redundant interconnects between the SSPs and the two 10K control boards and a third for general use, NFS mounting CDROMs and the like. Most people will have no reason to put the SSPs on a public network at all -- I certainly don't. In order to hack the SSP, one must first hack the 10000. Once they've done that, the ability to reach the SSPs by network is irrelevant. The point about the "problem" of the SSP having control is as silly as claiming that EMC Symmetrix disk arrays (heavily used in IBM mainframe shops) can be crashed by the single laptop each array contains.
I would love to know the details of their failures -- I suspect the article is hinting at issues that have nothing to do with their real problems. Further, I'd bet that the main vulnerability that people cluster Suns against is hardware failure -- and I'd also bet that the main reason people cluster NT boxes is software unreliability!
Can't wait for the reply (Score:2)
It's funny that MS holds up Dell as an example of a reliable, scalable NT-based site. At least their WebBoard support area is frequently inaccessible, and always incredibly slow.
MS also touts 99.9% uptime guarantees from Compaq, etc., but fails to mention that Sun claims 99.95% for the Enterprise 10000.
Nonetheless, my intuition (totally unsupported by any concrete info, other than their poor response to the eBayla exploit) is that eBay is a mickey mouse operation that got really lucky and rich, but does not have the technical expertise commensurate with a multi-billion dollar company. I wouldn't blame any of their vendors, MS or otherwise, for their troubles.
Microsoft web site unreliable. (Score:2)
I get one window of text (along with the usual decorations) which is empty if I scroll down, and has vanished if I scroll back up. Fascinating. "View source" shows more JavaScript than actual document text...
No doubt it works just peachy in Internet Exploiter. But MS misses the first point of communication, which is to convey the message.
No wonder MS is losing.
Re:Ultra 5 (Score:2)
Hardware: I think that Sun really made a mistake here. I'm not too unhappy that they threw out SBus and went to PCI, that really does strike me as a good idea, but dropping onboard SCSI in favour of onboard IDE, well, that was just plain stupid. As it is, every time we buy an Ultra 5 we have to burn a slot to get SCSI into the thing. I notice performance problems with my IDE disks on my workstation, I'd hate to imagine them in any kind of server. Likewise, they seem to have redesigned the case with inconvenience in mind. You have to eviscerate the damn thing every time you want to change anything (memory being the worst) and all the little bits and pieces seem to be fairly low quality.
In the end, the only reason that I upgraded from an Ultra 1 was the framebuffer. 24 (or 32) bit graphics are nice, especially when compared to the measly 8 bits I had before. I don't really have any one application that takes advantage of the extra colors, but color map conflicts (and thus epileptic flickering as maps are switched) are a thing of the past.
And, of course, there's the Mystery Bay. On the front of Ultra 5s is a little flip door that looks just about the right size to admit a 4mm tape. Of course, it isn't the right size, and no tape drive would fit inside anyway. When asked Sun said (after _much_ internal research and many days of not calling me back) that it was for a PCMCIA card reader. Great! I said, and where can I find this reader? "Well," they said, "we don't know. I actually don't think there is one. But when there is, you'll have a bay for it." -- wonderful.
OS: Solaris 7 is the standard these days, ships preinstalled on the Ultra 5. I ran Solaris 7 for two whole weeks on my Ultra 5 before purging it from the disks in a fit of retribution. To say that it's slow is an understatement. My TI-85 can serve web pages faster! I don't know if Solaris 7 is just broken (note the short time between Solaris 2.6 and Solaris 7 releases) or if it's only broken when it runs on Sun's new hardware. In any case, I dropped back to Solaris 2.6 and am much happier.
On an Ultra 1, Solaris 2.6 shows significant speed increases over 2.5.1, but all of these speed increases seem to have been effectively countered by the hardware in the Ultra 5. The end result: my Ultra 5 running Solaris 2.6 is now just about as fast as my old Ultra 1 running 2.5.1.
Marketing: Given all of these experiences I decided to go check out Sun's site [sun.com] to see what they had to say about Ultra 5s and new Solaris versions. I was somewhat amazed to find that they seemed to be marketing the thing as a desktop machine, trying (or so it seems) to compete with PC manufacturers. Now I'll admit that I like having a Sparc on my desktop, but a PC it ain't! The complete lack of emphasis on marketing the machines as servers was simply amazing. And this pattern seemed to be repeated for the other new Ultra machines.
It's really not clear to me what the heck Sun is up to, but I think that they have some serious thinking to do about their direction in the market. Presumably there's a reason to be (seemingly) ignoring their strengths, but I sure don't know what it is.
--
They are using Microsoft-IIS/3.0!!! (Score:2)
------------------------------------------
www.ebay.com is running Microsoft-IIS/3.0
Microsoft-IIS is also being used by Walt Disney, Compaq, Nasdaq, and The National Football League.
www.ebay.com is hosted by ebay.com.
------------------------------------------
Mario.
Kinda funny... (Score:3)
A single Starfire is rated as being able to deliver 99.95% availability with one - ie no clusters, and without all those caveats above - though it does need to be setup with reliability in mind for this - there's plenty of options.... Starfires aren't simple either - up to 64 CPUs, many more PCI and similar slots, memory slots, etc, etc. So, plenty of things to go wrong. Similar sized computers (from everyone) are really hard to transport without something going wrong. The only people more nutso on reliability on 'big iron' computers are IBM (from the companies I know a fair bit about anyway). Not only do they have backup CPUs, in their CPUs they do the same operation twice (in parallel, with checking at the end) to trap the ultra-freak chance of cosmic radiation or something causing a flipped bit, or worse. (yes, they do seriously actually worry about such things... I remember an IBM proposal about how to design memory that can handle a once-a-month chance, for when you have a huge amount of RAM, for some particular kind of radiation....)
The only complaint I've ever heard about Starfires in general is that if a PCI card (though not SBus card) breaks down it can hang the entire system until an operator manually flicks a switch to say that that particular card is defunct. Though this is really because of how PCI works - Sun's 99.999% reliable Netra 1800 has some highly specialised custom hardware to get around this problem with PCI cards, as well as backup CPUs and plenty of other stuff... ridiculously expensive too, they are... though apparently more cost effective than anything in the same class. The Netra 1800s are a few months old, while the Starfire design is over 2 years old, btw.
I dunno about all of MS's claims, but I'm pretty damn sure that you can have hardware redundancy for just about everything, if you want, including the Ultra-5 controller. Most of the other claims seem to be related to the fact that you can hot-swap PCI cards, memory, CPUs and even mother-boards in a Starfire...
EBay do seem to have had more than their fair share of problems though... quite a few hardware problems it seems - I vaguely remember a problem earlier in the year was due to some controller card or something. As far as I know, nobody has had anything close to the problems EBay are having with their Starfire(s)...
Another little point... MS's idea of expensive downtime is $10,000/hour. I remember reading something on Sun's site a while back about high end availability systems. Sun's idea of expensive downtime is $10,000,000/hour - ie stockbrokers. They also had a list of most common causes for 'unplanned' downtime on their HA systems - first was 'operator error' (or lack of training, etc). I can't remember what second was, but third was 'fire'! (I'm pretty sure Sun's computers don't have a reputation for spontaneous combustion!)
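To put the availability percentages quoted in this thread (99.9%, 99.95%, 99.999%) on a common footing, here is a rough sketch that converts a percentage into minutes of downtime per year. It assumes a 365.25-day year and that the rating covers all downtime; the function name is hypothetical and the vendors may well define their ratings differently.

```python
# Rough conversion from an availability percentage to downtime per year.
# Assumption: a 365.25-day year; the function name is made up for illustration.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def downtime_minutes_per_year(availability_percent):
    """Minutes per year a system is down at the given availability rating."""
    return MINUTES_PER_YEAR * (100.0 - availability_percent) / 100.0


for pct in (99.9, 99.95, 99.999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% available -> {minutes:.1f} minutes down per year")
```

On these assumptions, each extra "nine" cuts the allowed downtime by a factor of ten, which is why the gap between a 99.9% and a 99.999% rating is so much larger than it looks on paper.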
Re:where's the hot backup server? (Score:2)
Microsoft's new asshole (Score:2)
It has nothing to do with a Microsoft system crashing, but rather trying to turn the eBay problem into a FUD event that has people here upset.
This web page is the most outrageous piece of crap I have ever seen. Advising customers to rely on some piece of untested software still in beta to handle a massive mission-critical load. If I was eBay's CTO I would be looking to go upscale into some really heavy iron like the Himalayas or mainframes that operations that need REAL reliability use. The idea of going to a MS operating system for this sort of application is purely ludicrous.
The same day this is going on we have another round of Word macro viruses terrorizing MS users everywhere. Why don't you see Corel and Lotus touting the fact that Word macro viruses don't trash their systems? Because they aren't low-life like Microsoft. Do you see slashdot trashing MS over this? No, even though they richly deserve it.
Microsoft deserves to be roundly excoriated on this one.
Re:Classic obfuscation (Score:4)
1:The first point is wrong, AFAIK -- E10ks are fully triple redundant. The second point is that no one but a maniac would hot-swap components without correctly varying them off. That would be like sticking a fork into a running toaster to change elements and being surprised at a nasty result. The claim that the Ultra5 front end is critical is not true. You can manage them from any machine. You can attach a terminal or one of those awful JavaStations. So, two lies and a really bizarre attempt at deception (Buy my car -- that brand sucks because you can't change the brake pads while it is speeding down the Interstate!).
2:If you don't understand this, you don't understand business computing, clustering, or applied EE/CS. This requires a lot of remedial work in security basics. I would suggest Computer Security Handbook by Arthur E. Hutt (Editor), Seymour Bosworth (Editor), Douglas B. Hoyt (Editor), paperback, 3rd edition (September 1995), John Wiley & Sons; ISBN: 047111854. So, another really bizarre attempt at deception (Their car sucks because it needs tires.)
3:see above (Their car has those big steel bumpers and huge brakes, leading to costly repairs over the life of the vehicle. Our car has neither brakes nor bumpers, so you should get a few of them and not worry about costly repair jobs.)(Why get a bus -- you can yoke 14 Yugos together -- see the user friendly and brightly-painted YugoYoke(TM)!)
4:Yeah, and 3/4 of the American population strongly felt that they made up 75% of the population. Duh. I do not ask a plumber for stock tips. I ask a stockbroker. I do not ask my stockbroker to perform dentistry. I ask my dentist. Etc.
5:And at some point in the future, you may win the lottery. Are you doing business in the future or right frigging now? Hmmm? I can't hear you
6:Well, a) I think that they have two E10ks (please correct me if I am wrong) so this is actually not true (again) and b) thank you Captain Obvious. And if you try to drill through your skull with a drill press contrary to all logic and common sense, you might miss some work. Yes, you should do things that matter with some care.
Of course, this is just one woman's opinion
NT with 50 clients vs. Sun with 350 clients (Score:2)
The one, let's call it Site A, uses a $20,000 Dell NT Server 4.0 SP3 (dual PII-300) with 50 win95 clients; also runs MS Proxy Server 2.0.
Site B uses a Sun Ultra 2 Model 2300, dual SPARC 300MHz. It supports tin, pine, lynx, gcc, filesharing, and ftp with 300 concurrent users across a 2 mile radius WAN with .2 cpu usage.
Is it fair to note that the last "restart" of the Sun was 67 days ago--the last "restart" of the NT...well, with all seriousness, it's been at an average 2 crashes per day (that's an 8 hour day).
eBay may have DB problems; let's not forget Oracle has all of its products available for Linux and Oracle products are sold OEM thru Dell and their Online Store. I'm afraid Microsoft is trying to bolster their image. Don't believe NT needs PR help? See http://www.ntsecurity.net
Adios.
Oracle..... (Score:2)
www.amazon.com
www.1800flowers.com
www.cdnow.com
www.charlesschwab.com
www.cisco.com
www.dell.com
www.etrade.com
www.onsale.com
www.rei.com
I fear that it is the ebay ppl who are at fault....
Re:HUH? Isn't eBay on NT? (Score:2)
9 times out of 10 eBay is hosed due to the ASP/ODBC/IIS front end. Today it's the database backend (Sun + Oracle).
--
FUD SHARKS to the left! FUD SHARKS to the right! (Score:5)
RED HERRING: Daemons that control domain operations and perform monitoring functions run on an unreliable device (Ultra 5 workstation), hardly a desirable situation in the context of a data center.
So what? The E10000 will continue to truck on as before without it. This is a complete red herring. The SSP is really just a console station, nothing more. If it dies, you reboot it, or in the worst case, replace it with another one from the closet, which with Sun's AutoClient technology, can take on the entire identity of the failed box in a couple of minutes. (AutoClient allows Wall St. traders to replace their workstations and be working again with NO IMPACT in 5 minutes. Let's see NT do that.)
FUD SHARK: When security is compromised on the System Service Processor, which runs on the Ultra 5 workstation controlling domain operations and performance monitoring, all running domains on the E 10000 can be brought down with a short command sequence.
No one in their right mind would put the SSP on a network that extends beyond the glass house!! It's a *console*, designed to be locked up securely, like all other mission-critical control consoles. MS still doesn't get the data center, do they?
DUH WHALE: System boards that are hosting non-pageable kernel data structures cannot be removed from a domain without interrupting service. The Solaris operating system has to undertake a special "quiesce," or suspend, operation while the critical pages are migrated to another board.
This is incredible. They're knocking the E10K because you can't walk up to it and pull a CPU card at random without telling the machine first that you plan to do this. These cards contain memory, too, folks, which is why it's pretty reasonable to let the system move things to a safe place before the card goes bye-bye. Pretty much only Tandems can accept this sort of thing (because they've got at least two of everything all the time, and they cost like they have even more), and if you're after real fault tolerance, you won't be running NT on them, even though you could...
STING RAID: If you remove a system board from a running domain without enough swap space, Solaris will hang. The administrative tools do not warn you if you do not have enough swap space available.
This is pretty low. Yeah, it can happen - what else is an OS supposed to do when it has more processes than now remains as memory? Although a warning would be nice, E10K admins aren't stupid (we hope), and they understand that there are easy workarounds to this - the E10K makes it very easy to move enough resources into the OS domain in question on a temporary basis. If you don't have enough hardware to do that, you misconfigured the machine in the first place. This is hardly a weakness.
On the whole, the incredible thing about this is that MS is throwing rocks at a really good system with availability features far in excess of that for any practical NT box. You've gotta admire their guts, though - some people will read this and think the E10K is a really expensive, dangerous computer. Funny how they neglect to mention that there's not an NT box on the planet that can provide the performance of an E10K, regardless of how much you spend. This may change eventually, but it's pretty cheeky now.
If you need real fault-tolerance, get a Tandem/Compaq - but after you've paid all that money, I bet the Compaq folks would be the first to advise against using NT on it if you really want fault tolerance.
Re:Buy NT because Sun's unreliable? (Score:2)
Well, eBay is constantly slow or being interrupted due to IIS or MS-ODBC flakiness. When I heard eBay was down for 19 hours (on the radio), I assumed it was the Microsoft side. If I was them, I'd have a press release washing their hands too.
By the way - has anyone tried to buy anything at buy.com? I have on a couple occasions, and the damn thing is so flaky and defective that it won't let me. It also appears to be all MS Tech.
--
E-10000's are error prone, in practise (Score:4)
I don't know about EBay, but I know that E-10000's are extremely tricky to configure correctly.
Sun markets them as ultra-reliable and hardware-level redundant, but the truth is that configuring them is so complicated that even a team of experienced sysadmins is bound to screw something up sooner or later. If you bet the store on a single E-10000, then sooner or later the machine will crash hard and your store is hosed.
Given their size, expense and complexity, they are not appropriate for use as the main server in an internet commerce company. Sun should not sell these machines to companies like EBay.
In defense of Sun, I should point out that their "smaller" systems, namely the E-4000, E-6000, E-3000 etc., are rock solid and just as easy to configure as a small server. But no one would dream of running a whole store on a single one of them -- for reliability, you need to run several of them redundantly.
And Windows NT is far less reliable than any Sun machine. NT is the opposite of reliability. Production Solaris machines routinely stay up and running for months or years at a time. Show me an NT server which can do that.
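As a back-of-the-envelope illustration of why running several servers redundantly helps, here is a sketch under two loud assumptions: failures are independent, and any one surviving server can carry the whole load. Neither is guaranteed in practice (a shared software bug takes out every copy at once), and the 99.9% input is just an example figure, not a measurement of any real machine.

```python
# Idealized availability of n redundant servers.
# ASSUMPTIONS (not from any vendor spec): failures are independent and a
# single surviving server can handle the full load. Function name is hypothetical.
def combined_availability(single, n):
    """Probability that at least one of n identical servers is up."""
    return 1.0 - (1.0 - single) ** n


# Example input only: two servers that are each up 99.9% of the time.
print(combined_availability(0.999, 2))
```

The independence assumption is the weak spot: redundancy of this kind protects against hardware failure, but does little against a bug in software that every node runs.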
It's time to rise up and be Constructive! (Score:2)
I'm declaring tomorrow Constructive M$ Bashing Day!
(Why do I have the power to do this? Because Barney says everyone's special in his or her own special way, and I'm invoking my privilege as a Special Person. That'll teach you to ask why. Feh!)
The next time you feel like correcting something that M$ claims and that's blatantly false, do so. THEN, email it directly to M$! If we had one day where EVERYONE from
Then, everything would start changing. The wheels would be in motion. M$ would realize the error of their ways and become Tibetan monks to pray for forgiveness!
We can make a difference, dammit! Can't we??
Re: At least NT is bearable (Score:2)
What really sucks is having to reboot whenever you install a new app with one of Micro~1's crappy shared DLLs. The latest offender was Foxpro 6, which I only need for some light file management, and which had me reboot three times (twice for the mandatory install of evilbad IE which I never use). Rebooting for network changes is bad enough, but for mere mortal applications??
This reminds me yet again why