The problem is that hyperthreaded CPUs and IA-64 EPIC are designed around floating-point performance. The idea is that if you're FPU-bound, you want to hide RAM latency by flipping between threads while you have FPU-load stalls. You add speculative execution, predicate registers, deep execution pipelines to minimize branch misses, etc. But it's all about FPU ops with 200-clock execution times (e.g. divides and transcendental ops, as with FFT).
But I'm sorry, no matter how fast you make their FPUs, they're not going to beat an FPGA or ASIC or raw-silicon GPU. Those bastards optimize memory paths and reduce critical-path latencies. The only advantage CPUs have over GPUs is that you can context-switch unrelated tasks better than with GPUs.
The vast majority of apps in the world are NOT FPU-based. They are pure integer. And moreover, these days, they are RAM-constrained. If you're writing a NoSQL DB procedure to perform zlib compression, merge-sorting, or state-machine syntax parsing, FPU-oriented architectures are of ZERO benefit. This is all RAM-plus-branch-prediction work. That is: read data, make a decision, jump to new code (which triggers new RAM loads), run two or three instructions, then repeat. While SOME of the app's state tables and code paths can get cached efficiently, the input stream is generally far larger than your L3 cache (on the order of gigabytes).
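To make that concrete, here's a toy sketch (all names are mine, purely illustrative) of the kind of loop I mean: a byte-at-a-time state-machine scanner where every iteration is a RAM load, a table lookup, and a data-dependent branch, with zero FPU work.

    /* Illustrative only: a byte-at-a-time state-machine scanner.
     * Each iteration: load from RAM, table lookup, data-dependent
     * branch.  With a multi-gigabyte input the buffer never fits in
     * L3, so throughput is gated by memory latency and branch
     * prediction, not FPU speed. */
    #include <stddef.h>
    #include <stdint.h>

    enum { NSTATES = 16 };
    static uint8_t next_state[NSTATES][256];   /* transition table */

    size_t count_accepts(const uint8_t *buf, size_t len)
    {
        uint8_t state = 0;
        size_t hits = 0;
        for (size_t i = 0; i < len; i++) {
            state = next_state[state][buf[i]]; /* RAM load + lookup */
            if (state == NSTATES - 1)          /* data-dependent branch */
                hits++;
        }
        return hits;
    }

No amount of FPU muscle helps that loop; memory prefetch and branch prediction are the only levers.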
So SOME of the memory pre-loading, branch prediction, and on-stall thread-context-switching could be leveraged. But MT apps suffer at barriers around critical regions. Namely, if you take a memory stall while holding a lock, you cripple the parallel performance.
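Here's what that anti-pattern looks like in practice (a hedged sketch; the struct and function names are made up):

    #include <pthread.h>
    #include <string.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    struct node { struct node *next; char payload[256]; };
    static struct node *shared_head;

    /* BAD: the memcpy chases cold memory while every other thread
     * piles up behind the mutex; one cache miss stalls them all. */
    void publish_bad(struct node *n, const char *src)
    {
        pthread_mutex_lock(&lock);
        memcpy(n->payload, src, sizeof n->payload); /* RAM stall under lock */
        n->next = shared_head;
        shared_head = n;
        pthread_mutex_unlock(&lock);
    }

    /* BETTER: take the stall outside; hold the lock only for the
     * two-pointer list splice. */
    void publish_good(struct node *n, const char *src)
    {
        memcpy(n->payload, src, sizeof n->payload); /* stall, no lock held */
        pthread_mutex_lock(&lock);
        n->next = shared_head;
        shared_head = n;
        pthread_mutex_unlock(&lock);
    }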
Co-processes are very efficient (e.g. apache pre-fork, postgres worker processes with specific shared-mem segments, erlang, ruby-unicorn, etc.) in that they organize very small messages to pass between processes and keep all remaining cache lines isolated to their single thread, and thus to semi-dedicated CPUs. This can leverage multiple CPUs very nicely without necessarily saturating RAM (though if the apps themselves are RAM-bound you still have problems). And if you have NUMA, the OS can segment memory spaces better with co-processes than with MT.

That being said, the SUN lightweight threads are (I believe) designed around shared memory spaces with minimal context-switch time vs. POSIX threads or normal co-processes, so they can't really take advantage of the co-process model as well as MT. So SUN light-threads are forced to endure potentially bad programming by DB, file-IO, OS, and signal-processing applications. Namely, if you can't create isolated memory regions (malloc/free locks, IO/pipe locks, 'critical-region' locks on concurrent data structures, etc.), you'll find yourself dirtying shared cache lines so often that you actually run slower than if you were just single-threaded.
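For the co-process pattern itself, a minimal sketch (assuming nothing beyond fork() and pipes; the message struct is invented for illustration):

    /* Two processes exchange tiny fixed-size messages over pipes.
     * Each side keeps its working set in its own address space, so
     * no cache lines are shared and, on NUMA, each heap stays local. */
    #include <stdio.h>
    #include <unistd.h>

    struct msg { int op; int arg; };   /* keep messages tiny */

    int main(void)
    {
        int to_child[2], to_parent[2];
        pipe(to_child);
        pipe(to_parent);

        if (fork() == 0) {             /* the worker co-process */
            close(to_child[1]);
            close(to_parent[0]);
            struct msg m;
            while (read(to_child[0], &m, sizeof m) == sizeof m) {
                m.arg *= 2;            /* all real state stays private */
                write(to_parent[1], &m, sizeof m);
            }
            _exit(0);
        }

        close(to_child[0]);
        close(to_parent[1]);
        struct msg m = { 1, 21 };
        write(to_child[1], &m, sizeof m);
        read(to_parent[0], &m, sizeof m);
        printf("result: %d\n", m.arg); /* prints 42 */
        return 0;
    }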
I know, for example, that a simple merge-sort can run significantly slower (3x) when run in parallel vs. single-threaded, predominantly because of Intel's MESI implementation. Well, not necessarily 3x slower in wall-clock time, but in consumed CPU time, with little or no visible improvement in response time.
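The MESI ping-pong is easy to reproduce (sketch below; the 64-byte line size and iteration counts are my assumptions):

    /* Two threads increment counters that sit on the SAME cache line,
     * so the line bounces between cores on every write.  Move the
     * counters onto separate lines (the padded struct) and the
     * ping-pong disappears. */
    #include <pthread.h>
    #include <stdio.h>

    static struct { long a; long b; } hot;                /* one shared line */
    static struct { long a; char pad[64]; long b; } cold; /* a line apiece */

    static void *bump_a(void *arg) {
        for (long i = 0; i < 100000000; i++) hot.a++;     /* invalidates peer */
        return NULL;
    }
    static void *bump_b(void *arg) {
        for (long i = 0; i < 100000000; i++) hot.b++;
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, bump_a, NULL);
        pthread_create(&t2, NULL, bump_b, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%ld %ld\n", hot.a, hot.b);
        /* swap 'hot' for 'cold' in the bump loops to get the
         * single-threaded-speed version */
        return 0;
    }

Note there's no logical race here; each thread writes its own variable. The slowdown is purely physical: both variables live on one cache line.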
As another example, MySQL InnoDB had an inverse performance curve for the longest time. Meaning, the more physical CPUs you added, the SLOWER its total throughput would get, predominantly due to excessive critical-region locks. Many of those locks have since been replaced with less-accurate atomic spin-locks (as with sequence counters). Namely, you can now 'lose' a primary key's sequential value under the right circumstances, but with the benefit of removing a major classic stall point. InnoDB is still full of complex algorithms that require critical regions, though. Lock-free code is really hard and very limiting. But that isn't to say people haven't figured out how to architect good designs. 'redis' NoSQL and erlang-based apps (like RabbitMQ) are good examples. Namely, copy-on-write on small data structures.
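The sequence-counter trade-off, roughly (a sketch of the idea, not InnoDB's actual code):

    /* A single lock-free fetch-and-add replaces a mutexed critical
     * region.  A transaction that grabs an id and then rolls back
     * never returns it, so the sequence can have holes ("lost"
     * values), but no thread ever parks holding a lock. */
    #include <stdatomic.h>

    static atomic_long next_id = 1;

    long alloc_id(void)
    {
        return atomic_fetch_add(&next_id, 1);  /* one atomic RMW, no lock */
    }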
But there are two types of apps that have lots of parallel threads. Those with MASSIVE memory requirements and those that