This may be true for Java. It isn't true for C/C++.
With C/C++ and NPTL, the many-thread blocking IO style yields slightly lower latency at low IO rates, but offers significant latency variability and sharply decreased thruput at higher IO rates. It seems that the linux scheduler is much to blame for this-- the number of times that a thread is scheduled on a different CPU increases dramatically with more threads, and this trashes the caches. I've seen order-of-magnitude decreases in performance and order-of-magnitude increases in latency as a result of what appears to be the cache trashing.
I've seen similar arguments against using hyper-threading with Java. Curious, have you tried limiting the number of threads to prevent the thrashing or a pbind equivalent to keep a thread closer to a cache pipeline?
Duct tape is like the force. It has a light side, and a dark side, and it holds the universe together ... -- Carl Zwanzig