High-Performance Web Server How-To 281
ssassen writes "Aspiring to build a high-performance web server? Hardware Analysis has an article posted that details how to build a high-performance web server from the ground up. They tackle the tough design choices and what hardware to pick and end up with a web server designed to serve daily changing content with lots of images, movies, active forums and millions of page views every month."
10'000 RPM (Score:3, Insightful)
But any web server is high-performance (Score:5, Insightful)
With dynamically generated web content it's different of course. But there you will normally be fetching from a database to generate the web pages. In which case you should consult articles on speeding up database access.
In other words: an article on 'building a fast database server' or 'building a machine to run disk-intensive search scripts' I can understand. But there is really nothing special about web servers.
"Three times the power?" (Score:5, Insightful)
If we were to use, for example, Microsoft Windows 2000 Pro, our server would need to be at least three times more powerful to be able to offer the same level of performance.
"three times?" Can somebody point me to some evidence for this sort of rather bald assertion?
A little disappointing really (Score:5, Insightful)
Anyone who's ever worked on a big server in this cash-strapped world will know that squeezing every last ounce of capacity out of apache and your web applications needs to be done.
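For what it's worth, "squeezing capacity out of apache" mostly means turning a handful of httpd.conf knobs. A few of the usual suspects for Apache 1.3, with purely illustrative values (the right numbers depend on your RAM and traffic, not on this sketch):

```apache
# Illustrative Apache 1.3 tuning knobs -- values are examples, not advice
Timeout              30    # drop dead connections sooner than the 300s default
KeepAlive            On
KeepAliveTimeout     5     # don't let idle keepalives pin a whole process
MaxKeepAliveRequests 100
MinSpareServers      10
MaxSpareServers      30
MaxClients           150   # bounded by RAM: roughly RAM / per-process footprint
MaxRequestsPerChild  1000  # recycle children to contain slow memory leaks
```

MaxClients is the one that bites: set it higher than your RAM can hold and the box swaps itself to death under load.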
Strange choice of processors (Score:5, Insightful)
SMP isn't a good thing in itself, as the article seemed to imply: it's what you use when there isn't a single processor available that's fast enough. One processor at full speed is almost always better than two at half the speed.
Re:But any web server is high-performance (Score:5, Insightful)
I'm just a programmer, but don't big sites put caching in front of the database? I always try to cache database results if I can. Honestly, I think relational databases are overused, they become bottlenecks too often.
-Kevin
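The caching the parent describes is usually plain cache-aside with a timeout. A minimal sketch in Python; `fetch` stands in for whatever actually runs the query (the names and TTL here are made up for illustration):

```python
import time

CACHE_TTL = 60  # seconds; tune per query

_cache = {}  # key -> (expires_at, value)

def cached_query(key, fetch, ttl=CACHE_TTL):
    """Return a cached result if still fresh, else call fetch() and cache it."""
    now = time.monotonic()
    hit = _cache.get(key)
    if hit and hit[0] > now:
        return hit[1]          # cache hit: the database never sees this request
    value = fetch()            # cache miss: hit the database once
    _cache[key] = (now + ttl, value)
    return value

# usage sketch:
# stories = cached_query("front_page", lambda: run_sql("SELECT ..."))
```

The trade-off is staleness: readers may see results up to TTL seconds old, which is fine for a front page and not fine for a shopping cart.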
Almost (Score:2, Insightful)
You can safely drop that 'almost'.
Re:That "howto" sucks (Score:5, Insightful)
-Kevin
Re:Not-so high performance (Score:4, Insightful)
I don't see your point. "ping" has never been designed to benchmark web servers AFAIK.
My servers don't answer "ping". Does that mean the web server is down? Nope... it's up and running...
"ping" is not an all-in-one magic tool. By using "ping" you can test a "ping" server. Nothing else.
Quick howto (Score:1, Insightful)
Get the fastest AthlonXP out there.
Get a motherboard with onboard SCSI.
Get 15,000RPM SCSI 160MB/s drives
Get a NIC
Install linux
Install apache
Install mysql, php, perl, etc.
And there you have it. Is it really necessary to write a long article when all you're basically saying is "get the fastest hardware out there and slap it into one machine"? Come on folks.
Re:But any web server is high-performance (Score:5, Insightful)
Re:But any web server is high-performance (Score:4, Insightful)
Hardware only comes into play in a web app when you're doing very heavy database work. Serving flat pages takes virtually no computing effort. It's all bandwidth. Hell, even scripting languages like ASP, CF, and PHP are light enough that just about any machine will work great. The database though... that's another story.
Re:A little disappointing really (Score:2, Insightful)
The article seemed way too focused on hardware.
Well, the name of the website is "Hardware Analysis"... ;-)
Re:But any web server is high-performance (Score:5, Insightful)
Beware of false economy when looking at hardware. While it's true that smaller boxes are cheaper, they still require about the same manpower per box to keep them running. You rapidly get to the point where manpower costs dwarf equipment cost. People are expensive!
Capacity is an issue. We try to plan for enough excess at peak that the loss of a single server won't hurt, and hope we never suffer a multiple loss. Unfortunately, customers most often underequip even for ordinary peak loads, to say nothing of what you see when your URL sees a real high load.[1] They just don't like to spend the money. I can see their point; the machines we're talking about are not cheap. It's a matter of deciding what's more important to you: uptime and performance, or cost savings. Frankly, most customers go with cost savings initially and build up their clusters over time, especially as they learn what their peak loads are and gain experience with the reliability characteristics of their servers.
[1] People here talk about the slashdot effect, but trust me when I tell you that that's nothing like the effect you get when your URL appears on TV during "Friends".
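The sizing rule the parent is describing is usually called N+1: buy enough boxes for peak load, then add spares so a failure at peak doesn't take you down. The arithmetic is trivial but worth writing down; all numbers below are made up for illustration:

```python
import math

def servers_needed(peak_rps, per_server_rps, spares=1):
    """N+1 sizing: enough servers to carry peak load even with
    `spares` of them dead or pulled for maintenance."""
    return math.ceil(peak_rps / per_server_rps) + spares

# e.g. a hypothetical 900 req/s peak on boxes good for 250 req/s each:
# ceil(900 / 250) = 4 to carry the load, +1 so one failure doesn't hurt
```

The customers the parent complains about are effectively running N+0, or worse, sizing for average load instead of peak.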
Re:OK so where do I start? (Score:1, Insightful)
The lack of system setup detail isn't good. Too many variables there. Apache2 may have been a better choice for this too...
BTW, you're possibly disk I/O (requests, not bandwidth) limited by your IDE RAID. Make sure atime is turned off - no point recording it for no good reason. Do whatever you can to minimise disk I/O, because your IDE RAID is done in software (and if you use Promise drivers, stiff bikkies when you need to upgrade your kernel...)
A high "load" isn't much good info-wise either... what does "sar" have to say? Where is the "load" being generated???
Re:Apache 1.3x? (Score:4, Insightful)
The Promise FastTrak, Highpoint, and a few others are not actually hardware RAID controllers. They are regular controllers with enough firmware to allow BIOS calls to do drive access via software RAID (located in the firmware of the controller), plus OS drivers that implement the company's own software RAID at the driver level, thereby doing things like making only one device appear to the OS. Some of the chips have performance improvements over purely software RAID solutions, such as the ability to do data comparisons between the two drives in a mirror during reads, but that's about it. If you ever boot them into a new install of Windows without preloading their "drivers", guess what? Your "RAID" of 4 drives is just 4 drives. The hardware recovery options they have are also pretty damned worthless compared with real RAID controllers - be they IDE or SCSI.
Good solutions to the IDE RAID debacle are the controllers by 3Ware (very fine) or the Adaptec AAA series (also pretty fine). These are real hardware controllers with onboard cache, hardware XOR acceleration for RAID 5, and the whole bit.
Anyway, I'm not really all that surprised that this webserver is floundering a bit yet seems really responsive when the page request "gets through," so to speak. If it's not running low on physical RAM, it's probably got a lot of processes stuck in D state due to the shit Promise controller. A nice RAID controller would probably have everything the disks are thrashing on in a RAM cache at this point.
~GoRK
Re:But any web server is high-performance (Score:5, Insightful)
Like the State field in an online form.
Every single hit requires a trip to the database. Why?
Because, heck if we ever get another state, it'll be easy to update! Ummm, that's a LOT of cycles used for something that hasn't happened in, what, 50 years or so. (Hawaii, 1959)
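The fix for the parent's example is to treat an effectively-static list as code, not data. A sketch (the helper name and the HTML shape are my own invention):

```python
# Ship the list with the code; redeploy on the day state #51 arrives,
# instead of paying a database query on every single page view.
US_STATES = (
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD",
    "MA", "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ",
    "NM", "NY", "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC",
    "SD", "TN", "TX", "UT", "VT", "VA", "WA", "WV", "WI", "WY",
)

def state_options():
    """Render the <select> options without touching the database."""
    return "\n".join('<option value="%s">%s</option>' % (s, s)
                     for s in US_STATES)
```

Same idea applies to country lists, category trees, anything that changes on the timescale of deployments rather than requests.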
Re:Server running at near 100% load (Score:3, Insightful)
Personally I think the new trend on Slashdot of "hey, I saw this article about ____, it's really insightful and just great!" being submitted by the author of that article is sort of shitty. If anybody knows about building a high traffic webserver, it would be Slashdot, so you'd think they'd be a little pickier about what they post regarding high performance servers.
Not to flame, but the article is bad for newbies (Score:2, Insightful)
1) For a high-performance web server one *needs* SCSI. SCSI can handle multiple requests at a time and does some disk-related processing on the controller, whereas IDE can only handle requests for data single file and uses the CPU for disk-related processing a lot more than SCSI does.
SCSI disks also have higher mean times to failure than IDE. The folks writing this article may have gotten benchmark results showing their RAID 0+1 array matched the SCSI setup *they* used for comparison, but most of the reasons for choosing SCSI are what I mention above -- not the comparative benchmark results.
2) For a high performance webserver, FreeBSD would be a *much* better choice than Redhat Linux. If they wanted to use Linux, Slackware or Debian would have been a better choice than Redhat Linux for a webserver. Ask folks in the trenches, and lots will concur with what I've written on this point due to maintenance, upgrading, and security concerns over time on a production webserver.
3) Since their audience is US based, it would make sense to co-lo their server in the USA -- both from the standpoint of how many hops packets take from their server to their audience, and from the logistical issues of hardware support, from replacing drives to calling the data center when there are problems. Choosing a USA data center over one in Amsterdam *should* be a no-brainer. Guess that's what happens when anybody can publish to the web. Newbies beware!!
Re:OK so where do I start? (Score:2, Insightful)
Probably 90% of all non-profit websites could be run off a single 500 MHz computer and most could be run from a sub 100 MHz CPU -- especially if you didn't go crazy with dynamic content.
A big bottleneck can be your connection to the Internet. The company I work for once was "slashdotted" (not by slashdot) for *days*. What happened was our Frame Relay connection ran at 100%, while our web server -- a 300 MHz machine (running Mac OS 8.1 at the time) had plenty of capacity left over.
Re:But any web server is high-performance (Score:1, Insightful)
"millions of page views every month" not High-Perf (Score:3, Insightful)
Re:10'000 RPM (Score:5, Insightful)
I've designed and implemented sites that actually handle millions of dynamic pageviews per day, and they look rather different from what these guys are proposing.
A typical configuration includes some or all of:
- Firewalls (at least two redundant)
- Load balancers (again, at least two redundant)
- Front-end caches (usually several) -- these cache entire pages or parts of pages (such as images) which are re-used within some period of time (the cache timeout period, which can vary by object)
- Webservers (again, several) - these generate the dynamic pages using whatever page generation you're using -- JSP, PHP, etc.
- Back-end caches (two or more)-- these are used to cache the results of database queries so you don't have to hit the database for every request.
- Read-only database servers (two or more) -- this depends on the application, and would be used in lieu of the back end caches in certain applications. If you're serving lots of dynamic pages which mainly re-use the same content, having multiple, cheap read-only database servers which are updated periodically from a master can give much higher efficiency at lower cost.
- One clustered back-end database server with RAID storage. Typically this would be a big Sun box running clustering/failover software -- all the database updates (as opposed to reads) go through this box.
And then:
- The entire setup duplicated in several geographic locations.
If you build -one- server and expect it to do everything, it's not going to be high-performance.
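The read-only-replica tier above boils down to routing: writes go to the one clustered master, reads round-robin across the cheap replicas. A toy sketch of that routing decision (class and method names are invented; a real version would dispatch to actual database connections):

```python
import itertools

class ReplicatedDB:
    """Read/write split: writes to the single clustered master,
    reads round-robined across cheap read-only replicas."""

    def __init__(self, master, replicas):
        self.master = master
        self._replicas = itertools.cycle(replicas)

    def execute(self, sql):
        target = self.master if self._is_write(sql) else next(self._replicas)
        return target, sql  # real code would run sql on target's connection

    @staticmethod
    def _is_write(sql):
        verb = sql.lstrip().split(None, 1)[0].upper()
        return verb in ("INSERT", "UPDATE", "DELETE")
```

The catch, as the parent notes, is replication lag: the replicas are "updated periodically from a master", so reads can be slightly stale, which is exactly why this layout fits content sites better than transactional ones.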