Should Servers be Mono-Process or Multithreaded?
An anonymous reader wonders: "How would you design the fastest possible Linux-based server application today? A few years ago, the thinking was that multi-threading was not the way to go — instead, high-performance servers used an event-driven, mono-process model (consider lighttpd and haproxy). However, things have changed. Today CPUs have dual cores, and over the next few years this is only likely to increase. Also, the 2.6 Linux kernel has made multi-threading much more efficient. So I'm wondering, does Slashdot think that modern high-performance server software should be designed to be multi-threaded, or does it still make more sense to use an event-driven, mono-process architecture, despite the advances in Linux 2.6 threading and the arrival of multi-core CPUs?"
Re:Error (Score:2, Funny)
Re:Error (Score:2)
Re:Error (Score:1)
info (Score:5, Informative)
outdated info (Score:2, Informative)
Also... the opening statement and its bias toward Unix "for obvious reasons" doesn't lend toward its credibility. I've tuned high volume systems on both platforms
Re:outdated info (Score:2)
Does Windows have the kind of kernel control necessary to implement things quickly like on this page? Serious question; I remember discovering that Windows' support for asynchronous process communication left me severely wanting.
Re:outdated info (Score:2)
Re:outdated info (Score:2, Informative)
Windows' support for IPC is actually very robust; shared memory and semaphores are about as fast as you can get, and they exist on both Windows and *nix, just under different names.
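For reference, here's roughly what that pattern looks like on the *nix side of the naming divide - a minimal POSIX shared-memory-plus-semaphore sketch. The names "/demo_shm" and "/demo_sem" are illustrative and error checks are omitted:

    /* gcc ipc.c -o ipc -lrt -lpthread */
    #include <fcntl.h>
    #include <semaphore.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        /* A 4 KB region any cooperating process can map by name. */
        int fd = shm_open("/demo_shm", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, 4096);
        char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);

        /* A named semaphore serializes access to the region. */
        sem_t *sem = sem_open("/demo_sem", O_CREAT, 0600, 1);

        sem_wait(sem);                    /* enter critical section */
        strcpy(buf, "hello via shared memory");
        sem_post(sem);                    /* leave critical section */

        printf("wrote: %s\n", buf);
        munmap(buf, 4096);
        close(fd);
        sem_close(sem);
        /* shm_unlink()/sem_unlink() when the last user is done. */
        return 0;
    }

The Windows equivalents are CreateFileMapping/MapViewOfFile and CreateSemaphore - same idea, different names, as the parent says.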
A lot of people either don't know or forget that the NT->XP->2k3 kernel owes a lot to the VAX/VMS group from Digital, who were snatched away to build the NT kernel. There seems to be this illusion that XP is the bastard stepchild of Windows 1.x, and that is not true.
Re:outdated info (Score:1)
Re:outdated info (Score:2)
Re:outdated info (Score:2)
Re:outdated info (Score:2)
To be fair, Windows handles threads much better.
Re:outdated info (Score:3, Interesting)
Eh?
Here is what I see:
Perhaps the bias, and lack of credibility, is elsewhere...
combination (Score:3, Informative)
Re:combination (Score:4, Informative)
it's easy for threads to eat monstrous amounts of resource
Don't you mean forked processes? The advantage of threads is that they're lightweight and use shared memory (deadlocks, hoorah!); forked processes are heavy because they need their own memory, etc.
Re:combination (Score:1)
Re:combination (Score:2)
On the other hand, with threads you'll still use a
Re:combination (Score:2)
In my real world very little of the process memory is written to.
Specifically, I'm talking about a w
Re:combination (Score:2)
Re:combination (Score:3, Insightful)
I disagree. While using a single thread per connection is definitely a retarded way to go, making I/O operations asynchronous and handling callbacks via a thread pool is definitely the fastest and most efficient way to build a scalable daemon today. And once you get used to the idea, it is even much easier
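One concrete way to get "asynchronous I/O with callbacks on a thread pool" on POSIX systems is AIO with SIGEV_THREAD notification; glibc happens to run these callbacks on a pool of its own threads. A hedged sketch - the file name is illustrative and error handling is trimmed:

    /* gcc aio_demo.c -o aio_demo -lrt */
    #include <aio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    static char buf[4096];

    /* Runs on a thread from the AIO implementation's pool. */
    static void on_complete(union sigval sv)
    {
        struct aiocb *cb = sv.sival_ptr;
        if (aio_error(cb) == 0)
            printf("read %zd bytes asynchronously\n", aio_return(cb));
        exit(0);
    }

    int main(void)
    {
        struct aiocb cb = {0};
        cb.aio_fildes = open("/etc/hostname", O_RDONLY);
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof buf;
        cb.aio_sigevent.sigev_notify          = SIGEV_THREAD;
        cb.aio_sigevent.sigev_notify_function = on_complete;
        cb.aio_sigevent.sigev_value.sival_ptr = &cb;

        aio_read(&cb);   /* returns immediately */
        pause();         /* main thread is free to do other work */
    }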
Re:combination (Score:3, Informative)
That's because it's only needed (as a performance hack) with Windows. Forking is cheap with *nix.
"Those who don't understand UNIX are doomed to reinvent it, poorly."
--Henry Spencer
Get ESR's "The Art of UNIX Programming" - or if you're cheap, read it online. Chapter 7 [catb.org] might be a good place for you to start.
Re:combination (Score:2)
Re:combination (Score:3, Insightful)
Re:combination (Score:2)
I think the long and short of this whole discussion is that there's no silver bullet here. If there was, then the opposing system would by and large fall into disuse. Depending on what sort of thing you're making, how it behaves, and what it interacts with, different scenarios will be easier / harder to program, and more / less efficient with system resources.
Re:combination (Score:2)
If you have to fork to serve a new request, regardless of how efficient Unix makes it, it will still require a fair amount of resources and incur some context switching. In addition you then have to set up the network connection and so on. It's a lot more expensive than a single process running and sharing work within itself. (You can see this in action by comparing CGI-based webservers with modular ones. The CGI ones are a tenth of the perfo
Simple rule. (Score:2)
Either use asynchronous and non-blocking calls throughout, OR use multiple threads.
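The non-blocking half of that rule starts with O_NONBLOCK. A minimal sketch, assuming fd is an already-open socket:

    #include <fcntl.h>

    int set_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        if (flags == -1)
            return -1;
        return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    }

With that set, read() and write() return -1 with errno == EAGAIN instead of stalling, and the event loop retries when select/poll/epoll says the descriptor is ready.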
The answer is (Score:1)
the answer is yes... (Score:2)
Absent context, servers should be both and neither and something we haven't invented yet.
What is the server for? Is it serving a database to 27 hojillion simultaneous users? Serving a static intranet to 5? Filtering mail for AOL.com?
One suspects that each problem will have an optimum solution... so, as the first post noted, nothing to see here.
err!
jak.
Re:the answer is yes... (Score:1)
I'd say any time there's over 27 brazilian simultaneous users.
Re:the answer is yes... (Score:1, Interesting)
Bingo.
I recently replaced a server for a client. His previous one was just too slow, and his response was to keep replacing it with a new one with a faster CPU and more RAM. His result? Only marginal increases in speed.
I profiled the machine as he was using it, noticed that the bottleneck was disk access, and replaced the IDE drives he was using with a caching
OT, yes, but... (Score:1)
I had a very similar experience with a client in 1997 or so. They had an old file server: dual Pentium 133's running Windows NT 3.51 with 6 hot-swap SCSI drives in RAID-5. It would take approximately 45 minutes for a designer to open a single file (about 1GB in size). Their previous consultants told them that their server was hopelessly out-of-date (which it was) and that they needed a new file server: $15,000, please.
I came and did some profiling
fastest way (Score:5, Informative)
Have 1 process per node. I mean "node" in a NUMA sense. 64-bit AMD systems have one node per chip package. Other PCs (except exotic stuff) have one node for the whole system. Lock your processes to separate nodes, so that they all get local memory. If you don't do this, at least remember to use the new system call for moving pages from one node to the other. (eh, "move_pages" if I remember right -- see unistd.h in the kernel source)
You'll need extra threads to do disk IO. Not counting those: on each node, have at least 1 thread per bottom-level (usually L2 or L3) cache, but not more than 1 thread per virtual core (the hyperthreading thing). If you go with 1 thread per physical core but have virtual cores (hyperthreading) enabled, lock your threads to virtual cores that don't share physical cores.
A lot of this should be configurable. Hopefully you'll make an easy way to automatically determine the best configuration, writing out the appropriate config file so that manual config hacking is not required to get the best performance.
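On Linux with NPTL, the per-core locking described above can be done with pthread_setaffinity_np (a GNU extension). A minimal sketch; the CPU number would come from the config file just mentioned:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    /* Restrict the calling thread to a single (virtual) core. */
    static int pin_to_cpu(int cpu)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof set, &set);
    }

    int main(void)
    {
        int err = pin_to_cpu(0);          /* 0 is a placeholder */
        if (err != 0)
            fprintf(stderr, "pin failed: %d\n", err);
        /* ... worker loop now enjoys cache/memory locality ... */
        return 0;
    }

Whole processes can be pinned the same way with sched_setaffinity().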
forgot to mention IRQs (Score:2)
Debugging (Score:3, Interesting)
Debugging a monolithic C process like Squid is still much easier than debugging multithreaded software like Microsoft Windows.
This isn't likely to change regardless of how many processors you throw in the machine.
The rules change when you move to something like Java, where cross-thread contamination of the data structures is relatively difficult. But if you're talking about Java, why are you seriously considering hardware performance issues?
Re:Debugging (Score:3, Insightful)
Is this intended as flamebait? Performance in Java is as important as performance anywhere else. Most of the performance in any complex system depends more on the algorithms and how easy it is to debug a complex, fast algorithm, than on any inherent speed advantage of a particular language. I'd say the question of what design of network I/O and threading maximizes performance is at least as important a question in Java as any other language.
Re:Debugging (Score:2)
The original poster distinctly said hardware performance issues. Algorithm efficiency is something that I'd probably characterize as software performance. I think the OP was trying to point out that, if your hardware is fast enough to run a Java VM comfortably, it'll be fast enough for you to not worry about the performance impact of "low level" design decisions.
Re:Debugging (Score:2)
Re:Debugging (Score:2)
Re:Debugging (Score:2)
Re:Debugging (Score:2)
Re:Debugging (Score:1, Interesting)
Re:Debugging (Score:2)
Why would you want to debug Microsoft Windows?
Re:Debugging (Score:4, Funny)
You work for Microsoft, don't you?
Multiple processes (Score:4, Informative)
On systems with crippled schedulers and VM, threads are very efficient compared to everything else, because your application becomes its own scheduler, and it reduces the total physical footprint and the number of cache invalidations. With a better scheduler and VM, it makes more sense to rely on the OS insulating processes and scheduling their I/O, so the solution with multiple processes becomes more efficient in the absence of an application's need to share data between multiple requests.
Re:Multiple processes (Score:3, Insightful)
Because there IS interaction between threads -- mutex handling takes resources, too, and checking I/O status also requires time and syscalls (the OS just supplies pending I/O information to its scheduler by itself). And because data a
Re:Multiple processes (Score:2)
One of the worst offenders is memory deallocation.
Yes, you heard that right: deallocation.
A decent memory allocator these days has multiple memory arenas. When you alloca
Neither. Mutiprocess usually Prevails (Score:5, Insightful)
But again, "it depends on the application". The above pipelining method only performs well if you're processing items in an assembly-line fashion. If you're an HTTP proxy server you wouldn't want that model. You would probably want a single-process libevent type of thing. I have some code that doesn't use either of those models. It's a multiprocess model but event driven with *everything* in shared memory. It's very close to a multithreaded model, but I needed security context switching. Also, contrary to popular belief, threaded servers are slower than equivalent multiprocess servers. So in general, the benefit of a multithreaded server is pretty much just about convenience for the programmer. Since you can achieve the same effect by just creating an allocator from a big chunk of shared memory mapped before anything is forked, there's very little reason to use threads at all.
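The pre-fork shared-memory trick, in miniature: map a MAP_SHARED region before forking, and every worker sees the same memory without any threads. A sketch - a bump counter stands in for a real allocator, and error checks are omitted:

    #define _DEFAULT_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* Anonymous shared mapping: inherited across fork(),
           writes visible to every process. */
        long *shared = mmap(NULL, sizeof(long), PROT_READ | PROT_WRITE,
                            MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        *shared = 0;

        for (int i = 0; i < 4; i++)
            if (fork() == 0) {                    /* worker process */
                __sync_fetch_and_add(shared, 1);  /* atomic, cross-process */
                _exit(0);
            }

        while (wait(NULL) > 0)
            ;
        printf("workers that ran: %ld\n", *shared);   /* prints 4 */
        return 0;
    }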
Re:Contrary to popular belief (Score:2)
http://lists.samba.org/archive/samba-technical/20
Consider Fault Tolerance & Thread Safety Too (Score:5, Informative)
- that you're building a custom-purpose, client-server or message processing application,
- it needs to be highly parallel to be efficient
- the language is C, C#, or C++, and not Java (process-based servers in Java?)
I have done this before, using both processes and threads, for the same application. Consider the impact of application faults on your design, and then consider how hard it will be to create thread-safe code.
o A highly multithreaded server, where threads are performing complex and/or memory-demanding tasks, will be susceptible to complete failure of all running jobs on all threads if just a single thread SEGFAULTs. And despite your best testing efforts, complex code (1M+ lines) will at some point, somewhere, fail.
o Threaded code must be thread-safe. Static variables, shared data structures, and factories all previously accessed through a singleton must now be protected with guard functions and semaphores. Race conditions need to be considered. Design for this up front; it will be much harder to add later.
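To illustrate the singleton point: the usual pthreads pattern is pthread_once for construction plus a mutex for mutation. A sketch, with malloc standing in for real setup:

    #include <pthread.h>
    #include <stdlib.h>

    static pthread_once_t  once = PTHREAD_ONCE_INIT;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static void *instance;                 /* the shared singleton */

    static void init_instance(void)        /* runs exactly once    */
    {
        instance = malloc(1024);           /* stand-in for setup   */
    }

    void *get_instance(void)
    {
        pthread_once(&once, init_instance);  /* race-free lazy init */
        return instance;
    }

    void mutate_instance(void)
    {
        pthread_mutex_lock(&lock);         /* guard shared state   */
        /* ... modify the singleton's internals ... */
        pthread_mutex_unlock(&lock);
    }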
The Project
I worked on a team which added an asynchronous processing engine to a web application. The engine was responsible for performing memory- and time-intensive financial analysis and reporting for 16,000 accountants, so that they could close a large company's financial books. Unlike a webserver triggered by on-line end users, this engine is triggered by events in the company's financial database: once the database raises the "ready" flag, the engine begins running as many reports as it can, as fast as it can, on behalf of the 16K users. The analysis and report code was 2 million lines of C++, running on AIX.
Process implementation
The initial implementation used processes. A dispatcher job monitored the database for the ready flag, and then forked children of itself to analyze slices of the data, and generate the reports. One child job was used for each analysis and report pair, and the manager controlled how many jobs ran in parallel, maintaining a scoreboard of which jobs succeeded, and which failed.
Due to the complexity of the system, failures (core dumps) occasionally occurred. The monitor would record this, retry the failed analysis up to 3 times, and keep a uniquely named core file of the event. Other analysis reports would continue to be generated, otherwise unharmed by the crashing process. Approximately once every 90 days, the development team would collect the few cores generated, use the gdb/xldb debuggers to determine the cause of failure, and correct the fault.
The downsides of this? The solution was slowed because it couldn't re-use resources like database connections (they were destroyed with each process), and more memory was used than needed. DB2 caching helped somewhat, but potential performance improvements remained.
Threaded implementation
In a large company, there are IT standards, and one of the standards at my company is that applications shall never, ever, ever fork(), even if running on a large dedicated machine. After losing the fight against this, my team re-architected the report engine. Largely the same as the previous one, the new engine waits for the "ready" signal, and then spawns pthreads (POSIX threads) as workers to analyze the data and generate the reports. In theory, it was robust.
The alpha version of this solution immediately failed (cored) during testing. We had neglected to identify the less obvious non-thread-safe code in the application, and had missed several race conditions. Unlike previous failures, these faults were total: a SEGFAULT in code on one of 20 threads would halt the entire application. And the corefile generated was now huge - it contained a snapshot of memory for all 20 running jobs, instead of just the one of interest.
Extensive root-cause analysis, redesign, and restart management solved this, and the current version is as robust as, and a good bit faster than, the previous one. At a significant price.
Re:Consider Fault Tolerance & Thread Safety To (Score:2)
Not necessarily. It is entirely possible to design code that is threadsafe without using locks for data access. Instead, in many cases you can ensure that due to program logic variables you are accessing can only be used by a single thread (e.g. by only allowing a single thread to work on a connection's state at onc
Re:Consider Fault Tolerance & Thread Safety To (Score:1)
May I ask - Gods, man! Why oh why?
Re:Consider Fault Tolerance & Thread Safety To (Score:1, Informative)
In Windoze and OS/2 (am I showing my age here
Re:Consider Fault Tolerance & Thread Safety To (Score:2)
However, if the dying thread has run amok over the heap already, that's not going to help much. Instead, you are likely to get other
Re:Consider Fault Tolerance & Thread Safety To (Score:1)
Standard API package java.nio, serving you since 2002.
Introduced in J2SE 1.4 (Merlin) after an almost unanimous vote in the Java democracy: http://jcp.org/en/jsr/detail?id=51 [jcp.org]
Synchronous or Asynchronous IO (Score:1)
which blocks, whereas multithreading is asynchronous and non-blocking. You can get far more throughput with the same amount of memory with threads, and be able to leverage multi-core, multi-processor and distributed processing architectures.
mb
Real life examples (Score:5, Informative)
I cannot speculate, but I can look at what people are doing today. One thing I have noticed is the widespread research into, and compelling arguments for, massively multithreaded programming techniques. See Erlang, for example. It is designed right from the beginning for this sort of problem - high throughput, high reliability, high uptime telephony networks.
As a rough benchmark, someone's got this [www.sics.se].
That's an order of magnitude increase in "performance" (depending on what you mean by "performance"). I thought I'd do a casual informal test of my own, with a decent static file size (instead of the 1 byte used in that benchmark):
Server Software: Yaws/1.56
Document Length: 402 bytes
Concurrency Level: 500
Time taken for tests: 8.480740 seconds
Complete requests: 5000
Requests per second: 589.57 [#/sec] (mean)
Time per request: 848.074 [ms] (mean)
Server Software: Apache/2.0.54
Document Length: 402 bytes
Concurrency Level: 500
Time taken for tests: 29.787216 seconds
Complete requests: 5000
Requests per second: 167.86 [#/sec] (mean)
Time per request: 2978.722 [ms] (mean)
Output edited to get past lameness filter.
Err crap, I could have sworn the first time I tried this, when Yaws was first installed, its performance was worse! Oh well, perhaps it's something I've inadvertently done since then. Could have been due to my computer reboot (this is a desktop PC). It seems I've proven my point, although I was trying to disprove it. Standard caveats regarding benchmarks apply. Both servers are default Ubuntu installs with no configuration changes - I didn't compile anything manually.
Additionally, it has been noted that:
Well, that's where it could be headed anyway - a multiprocessor system with green threads (i.e. simulated threads, like Java's) implementing massive concurrency and redundancy. Some prototypes for systems like this are already available, and being used. Cheers.
Re:Real life examples (Score:2)
Re:Real life examples (Score:2)
I don't understand this. These results are HORRIBLE. 5000 requests in 8 seconds? 1 request in .8 seconds? I think your decimal point is off.
Re:Real life examples (Score:1)
This is not a beefed-up server - it's a very old computer, and is used only for development. Remember that there are 500 simultaneous connections, with the test itself also running on the same machine, and I'm also running a shitload of other stuff (X, mozilla-firefox with heaps of tabs open, Lisp (this one shows hundreds of megs of memory usage, but I've read that for this kind of application, top is a deceptive measure), emacs, etc.).
The benchmarks I've shown are from running "ab". I was going to show all p
Threading --- hype, more hype and extra hyped hype (Score:5, Insightful)
Threads are useful, that's granted - but it would seem a lot of people are trying to convert wholesale to the threading model just for the hell of it, running on the apparent reasoning that threading is "lighter" than processes. Maybe threads are lighter/cheaper on Windows systems - but a Unix system, with its copy-on-write forking, is _DESIGNED_ to handle processes. Right now, a lot of the time, threads are a hack. Unix and processes work nicely together.
As for "maximising" available resources, well don't forget there's typically another couple of dozen processes running on any give Unix setup, more so on a multi-user multi-purpose machine (let's say WWW, email and DNS setup - throw in SpamAssassin for lots of fun) there's no shortage of available processes to use up a CPU. On a monolithic system where it's running only one process, sure, threads become useful there to spread the load.
My gripe basically boils down to a lot of people choosing threads over forking because they think threads are "cool" or (supposedly) "lighter" - not because they've done any real-world testing or checking. Remember, Unix was built around the idea of many small processes/programs working together, so that naturally allows usage of multiple CPUs without any exotic hacks.
Re:Threading --- hype, more hype and extra hyped h (Score:3, Informative)
Re:Threading --- hype, more hype and extra hyped h (Score:2)
Which reminds me. Why the hell does the new Yahoo! Messenger use 25 threads? There's no good reason for that.
non-blocking (Score:5, Informative)
however, it's easier conceptually to write a threaded server, it's more natural to write, and you just launch a single thread per connection. unfortunately, currently, this doesn't scale (see Why Events Are A Bad Idea (for High-concurrency Servers) http://www.usenix.org/events/hotos03/tech/vonbehren.html [usenix.org] for an argument that thread implementations, and not their design, are the issue).
the former method can handle thousands of simultaneous connections with high throughput, even on a decent workstation; the latter cannot. threads simply have an inherent overhead that cannot be eliminated.
i've actually been working on writing a non-portable insanely fast httpd in my spare time (svn co svn://parseerror.dyndns.org/web/) over the past few weeks as a way to explore non-blocking I/O + epoll() and it performs very well (~600% faster conns/sec than a traditional fork()ing server (which i wrote first)).
for further discussion see The C10K Problem http://www.kegel.com/c10k.html [kegel.com] which goes in-depth on these very subjects
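For the curious, the core of such an epoll server fits in a few dozen lines. A sketch, assuming listen_fd is a non-blocking listening socket; error paths and per-connection state are trimmed:

    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    void event_loop(int listen_fd)
    {
        int ep = epoll_create(64);    /* size is only a hint */
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
        epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

        for (;;) {
            struct epoll_event ready[64];
            int n = epoll_wait(ep, ready, 64, -1);  /* block for activity */
            for (int i = 0; i < n; i++) {
                int fd = ready[i].data.fd;
                if (fd == listen_fd) {              /* new connection */
                    int c = accept(listen_fd, NULL, NULL);
                    struct epoll_event cev =
                        { .events = EPOLLIN, .data.fd = c };
                    epoll_ctl(ep, EPOLL_CTL_ADD, c, &cev);
                } else {                            /* readable client */
                    char buf[4096];
                    ssize_t r = read(fd, buf, sizeof buf);
                    if (r <= 0) { close(fd); continue; }
                    /* ... parse and reply without ever blocking ... */
                }
            }
        }
    }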
Re:non-blocking (Score:2)
You'd have to write such a system carefully, but I think if it was done right you'd get b
SEDA: Mixing events and threads (Score:1)
For an interesting hybrid approach between threads and events, check out SEDA - Architecture for Highly-Concurrent Server Applications [harvard.edu]. Basically, you write a server as a collection of stages connected by event queues. A stage receives an event on an incoming queue, does some processing on it, and then places it on a queue to some other stage. This mirrors the way an event-driven system is designed. Each stage has its own thread pool to handle events. All IO is asynchronous and is treated like any oth
The state of the art is hybrid of the two (Score:2, Insightful)
So, why not combine the multi-threaded and event-driven models? Some very interesting research has been done in this area. Check out Staged Event-Driven Architecture, or SEDA [harvard.edu]. Like threads, it has high fairness in sch
This hits home. What I did. What should I do? (Score:4, Interesting)
This debate hits home with me. I wrote a server daemon to handle the SMTP and POP protocols, and when I first started out I had to make a choice. The choice I made back then was to use a threaded model. The way it works is I spawn X threads which collectively use blocking calls to the accept() function. Each thread only returns from accept() once it has been assigned a new connection by the kernel. For performance I spawn the threads ahead of time. This architecture was a mistake. The issue is that I have to spawn a separate pool of threads to listen on port 25 (SMTP), port 110 (POP), port 465 (SMTP over SSL) and port 995 (POP over SSL). With this model I could end up with extra threads listening on port 25 when I need more threads listening and processing connections on port 465. This problem leads me to overcompensate by spawning _extra_ threads just in case. Of course this strategy wastes resources, as the extra threads eat memory without benefit.
To address the SEGFAULT issue, i.e. one rogue thread taking the whole system down, I also fork multiple processes. In my case I fork 12 processes with 128 threads each. If one process gets killed by a SEGFAULT, the remaining processes continue to work. When I first launched the system, and it faced a torrent of email - 100K+ messages a day - I would have about one process die every 24 hours. With careful debugging work, I've gotten the code stable enough that I haven't lost a process in about 9 months.
My theory when I first wrote this code was to leave scheduling to the kernel. I figured that if a thread was blocked waiting for IO data, the kernel wouldn't schedule time slices for it. This meant those extra threads sat in the background waiting, but not using CPU time. I am starting to wonder whether this is a good theory. I am considering switching to a different model (more on that in a second), but am not sure which one is best. By the way, the reason each process has so many threads is DB connection pooling. Each process gets 8 DB connections which are shared between the 128 threads. Each process also has its own copy of the antivirus database. I know it's possible, but trying to share DB connections and data between processes is much more difficult.
I plan to refactor this code soon, and have been struggling with what to do. I am curious to hear the thoughts of others.
The current plan is to move to a model where I spawn a single thread for each port. When these listening threads get a new connection, they dump the socket handle and the protocol into a buffer. I would then also spawn a pool of worker threads which read the incoming connections out of the buffer. Using semaphores and reflection, these worker threads would pick up incoming connections and feed them to the right function depending on the protocol. I think this model would work much better than what I have now, but is this the best option?
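For what it's worth, that buffer is a classic bounded producer/consumer queue. A sketch using condition variables rather than raw semaphores (the swap is deliberate; the queue depth and protocol tag are illustrative):

    #include <pthread.h>

    #define QSIZE 256

    struct job { int fd; int protocol; };   /* e.g. 25, 110, 465, 995 */

    static struct job queue[QSIZE];
    static int head, tail, count;
    static pthread_mutex_t qlock     = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;
    static pthread_cond_t  not_full  = PTHREAD_COND_INITIALIZER;

    void push_job(struct job j)              /* listener threads */
    {
        pthread_mutex_lock(&qlock);
        while (count == QSIZE)               /* full: wait */
            pthread_cond_wait(&not_full, &qlock);
        queue[tail] = j;
        tail = (tail + 1) % QSIZE;
        count++;
        pthread_cond_signal(&not_empty);     /* wake one worker */
        pthread_mutex_unlock(&qlock);
    }

    struct job pop_job(void)                 /* worker threads */
    {
        pthread_mutex_lock(&qlock);
        while (count == 0)                   /* empty: wait */
            pthread_cond_wait(&not_empty, &qlock);
        struct job j = queue[head];
        head = (head + 1) % QSIZE;
        count--;
        pthread_cond_signal(&not_full);      /* wake a blocked pusher */
        pthread_mutex_unlock(&qlock);
        return j;
    }

Workers loop on pop_job() and dispatch on j.protocol, so idle capacity is shared across all four ports instead of being stranded on one of them.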
The other option is to create a system where I spawn only 8 worker threads (or some similar number). This pool of 8 threads then uses epoll() to find out which sockets need attention and addresses them accordingly. The problem with this model is that if an incomplete message is received, the thread can't process it all the way into the output stage. Instead the data needs to be stored until the message is complete. Let me give an example: the thread might get "RCPT TO: " the first time it checks a socket. The thread stores this incomplete message. Then the second time around, another thread picks up "example@example.com". That thread assembles the message into "RCPT TO: example@example.com" and then processes the entire command accordingly.
Does this model work better? Keep in mind that when DB calls need to be made, the MySQL library won't work the same way. A slow database server could hang all 8 worker threads, effectively killing the model. There are also the SPF, spam and virus libraries. Any one of them could tie up a thread for an extended period, likewise killing this model. What does everyone else think? Am I not thinking about the event processing model correctly? Or is this type of daemon better off served by the one-thread-per-connection model?
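The "store the incomplete message" step in the second option is just a per-connection accumulator. A sketch; the buffer size is illustrative:

    #include <string.h>

    struct conn {
        int  fd;
        char inbuf[1024];   /* bytes received so far */
        int  inlen;         /* how many are valid    */
    };

    /* Append newly read bytes. Returns 1 when a full CRLF-terminated
       command (e.g. "RCPT TO: example@example.com\r\n") is sitting in
       inbuf, 0 if we must wait for more data, -1 on overflow. */
    int feed(struct conn *c, const char *data, int len)
    {
        if (c->inlen + len >= (int)sizeof c->inbuf)
            return -1;                       /* oversized command */
        memcpy(c->inbuf + c->inlen, data, len);
        c->inlen += len;
        c->inbuf[c->inlen] = '\0';
        return strstr(c->inbuf, "\r\n") != NULL;
    }

Any worker that finds a socket readable calls feed(); only when it returns 1 does the command get parsed, so it doesn't matter which thread saw which fragment.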
Re:This hits home. What I did. What should I do? (Score:1)
This is important because it means I use the 2.6 version of the Linux kernel, and the CentOS/RHEL version of the kernel uses the Native POSIX Thread Library (NPTL) by default.
Re:This hits home. What I did. What should I do? (Score:4, Informative)
Each condition variable has a mutex associated with it. The first thing you do, is lock the mutex. If someone else has the mutex locked, then this blocks. You then (atomically) release the mutex and wait on the condition variable. The other thread then acquires the mutex, signals the condition variable, and releases the mutex. The first thread then wakes up with the mutex.
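In code, that handshake looks like this - a minimal sketch, with a "ready" flag as the predicate that also guards against spurious wakeups:

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
    static int ready;

    static void *signaller(void *arg)
    {
        pthread_mutex_lock(&m);         /* acquire the mutex      */
        ready = 1;
        pthread_cond_signal(&cv);       /* signal the condition   */
        pthread_mutex_unlock(&m);       /* release the mutex      */
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        pthread_mutex_lock(&m);         /* 1. lock the mutex      */
        pthread_create(&t, NULL, signaller, NULL);
        while (!ready)                  /* re-check the predicate */
            pthread_cond_wait(&cv, &m); /* 2. atomically unlock and wait */
        /* 3. awake again, holding the mutex */
        pthread_mutex_unlock(&m);
        pthread_join(t, NULL);
        puts("woken with the mutex held, as described");
        return 0;
    }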
Really though, you should be using an asynchronous model if you want to be able to reason about your code. Difficulty in debugging scales linearly with the number of asynchronous threads/processes and exponentially with a synchronous approach.
To be honest, if you need concurrency, you would be better off using a language like Erlang which is designed for concurrency, rather than trying to shoehorn it into a hacked version of PDP-11 assembler.
Re:This hits home. What I did. What should I do? (Score:2)
It is also good to avoid having a thread per request (practically if y
Event devices (Score:2, Insightful)
Seriously though, I'm not sure what methods the WIN32 API uses, under the hood, for its event waiting calls,
Re:Event devices (Score:1)
Re:Event devices (Score:2)
It depends on software and hardware setup. (Score:3, Informative)
If your software is destined to serve only a few clients, your hardware has more than one CPU core, and your services are not I/O bound, then performance would increase if your O/S can dispatch threads to different cores. If your services are I/O bound, then increased performance from threading depends on O/S and hardware architecture (in other words, how fast I/O can be multiplexed).
If your software is destined to serve thousands of clients, you need clustering: a few thousand machines that can process requests plus a few others to dispatch requests to the cluster. I actually have no idea how this is done though, so take this advice lightly. In this case, your software is going to be multiprocess/multithreaded anyway.
Whatever fits the reqs (Score:3, Interesting)
My only complaint with POSIX threads is that they do not have a "generic" join function that grabs *any* thread that has exited.
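One common workaround is to have each thread register itself on a small "reaper" list as the last thing it does, so the main thread can join whichever finishes first. A sketch; the fixed-size table is illustrative:

    #include <pthread.h>

    static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
    static pthread_t done_ids[64];
    static int ndone;

    void announce_exit(void)           /* last thing a worker does */
    {
        pthread_mutex_lock(&m);
        done_ids[ndone++] = pthread_self();
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&m);
    }

    pthread_t join_any(void)           /* blocks until SOME thread exits */
    {
        pthread_mutex_lock(&m);
        while (ndone == 0)
            pthread_cond_wait(&cv, &m);
        pthread_t id = done_ids[--ndone];
        pthread_mutex_unlock(&m);
        pthread_join(id, NULL);        /* reap it for real */
        return id;
    }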
Having just written a multithreaded server... (Score:5, Informative)
We worked in C, because we needed guaranteed low latencies.
In 2004 we decided to rebuild these frameworks to handle OS multithreading. The reason was that on a single CPU we could not get the performance we needed, and the choice was either to use clusters, or multithreading.
We continued to work in C. C, and C++ are really nasty for multithreading because the languages have zero support for concurrency. You need to handle everything yourself, and most threading errors are extremely hard to detect.
It cost us about 10 times more to write our software as multithreaded code than using virtualised threads and we had to build whole reference management frameworks to ensure that threads could share data safely.
We did keep virtual threading, in fact, but virtual threads get handled by a pool of OS threads. Using 1 OS thread per connection is not scalable beyond a few hundred threads. Modern Linux kernels handle lots of threads, but we also target Solaris and Windows with the same code. So we use two virtual threads per connection, for full-duplex traffic, and we design most of the major server components as threaded objects, which are asynchronous event-driven objects.
Doing multithreading in C is a *huge* amount of work. C++ has frameworks like ACE that help a lot.
But there is a performance gain. Our software is a messaging server (implementing the AMQP draft standard). We maxed out at around 55,000 messages per second using a pure virtual-threaded model. Very efficient code. On a single CPU the multithreaded code hits 35,000 messages per second. With two CPUs we're back at 55k, and with 4 dual-core Opterons we're at 120k-150k and higher. (Our software runs a massive trading application that processes 1.5bn messages per day). We still need to improve some of the low-level locking functions to use lock-free mechanisms, and we max out a gigabit network. It is difficult to find machines powerful enough to really stress test the software.
Without very robust frameworks, I'd never attempt such a project. As it was, we paid a lot for the extra performance. Our frameworks will eventually be released as free software, along with the middleware server.
Interestingly, a very similar application written in Java 1.5 and using the BEA runtime gets similar performance to ours. Java's threading is so good that I'd be hesitant to choose C on the basis of performance again. I'm not sure whether ACE can reach the levels of performance we need; 100k messages per second is extreme.
Other questions that are very important to ask:
- The number of clients you expect to connect at once. If it's less than 500 you can probably use one or two OS threads per connection. If it's more you need to virtualise connections or share your OS threads.
- The footprint. If you don't care, then I'd advise using Java. If you want a native Linux service, consider C++ and ACE. If you really want to write multithreaded C code, and don't have a full toolkit, consider seeing a doctor.
When it comes to the future, clearly multiple cores are the way we're heading. This was clear two years ago, and was the main reason we bit the bullet and chose to write our software multithreaded rather than using a clustering model. It seemed clear to me that within a decade, systems would have 32, 64, 128 cores, and software that could take advantage of this would survive for longer. Clustering is not as powerful an abstraction as multithreading.
threads are not nice to the CPU (Score:2)
Programmers (Score:3, Insightful)
As far as I'm concerned threading should only be considered when you have either:
1) a high tolerance for extremely difficult debugging
2) a customer who likes it (pays you more) when your program crashes
3) a team of programmers experienced with threading (at least 75%; if not 100%, you *must* review code changes)
In modern Unix systems most people won't be able to really tell the speed difference between create_thread and fork. If you screw up pushing data through a pipe it just doesn't come out the other end, but if you screw up using shared memory you won't know until something unexpected (and seemingly unrelated) happens.
If you can build it that way... (Score:3, Insightful)
If you then want to take advantage of multiple CPUs efficiently, you need to look at the event handler code in the above implementation, and see if you have race conditions when two branches are run in parallel. If not, you can just fork a couple of processes and/or threads to tag-team the same event stream, and the code just cruises and uses more CPUs.
If there are race conditions between branches, you take those branches that need to share a resource, make a single separate thread/process for those handlers, and forward the required events to that thread/process so they get handled sequentially.
This style of code is often harder to design than a more obvious multiple-threads type implementation, but it is faster (and often easier to maintain) when properly done. In either case, the source of obscure bugs is race conditions that are overlooked.
Don't pick any - pick them all with Flux (Score:2)