I don't understand why people still don't understand the difference between latency and bandwidth, and the fact that a huge amount of the desktop IO load is still less than 4k with a queue depth of basically 1.
If you look at many of the benchmarks, you'll notice that 0.5-4K IO performance is pretty similar across all of these devices, and that's with deep queues. Why? Because queue depth and the latency to complete a single command dictate the bandwidth: throughput is roughly queue depth times block size divided by per-command latency. So you either need deeper queues or lower latency to go faster at those block sizes.
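To put rough numbers on that, here's a back-of-the-envelope sketch in Python. The 90 microsecond completion latency is just an illustrative assumption, not a measurement of any particular drive:

```python
# Minimal sketch: why latency and queue depth bound small-block throughput.
# Little's law: IOPS = outstanding commands / per-command latency.

def throughput_mb_s(block_bytes, latency_us, queue_depth):
    iops = queue_depth / (latency_us / 1_000_000)
    return iops * block_bytes / 1_000_000

# Assume 4 KiB blocks and ~90 us completion latency (illustrative only).
for qd in (1, 32, 256):
    print(f"QD{qd:>3}: {throughput_mb_s(4096, 90, qd):9.1f} MB/s")
```

At QD1 that works out to roughly 45 MB/s no matter how impressive the drive's sequential numbers are; only piling up outstanding commands or cutting the latency moves it.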
So the latency over PCIe is not that much better, but the queue depth can be far deeper than what's possible with a normal AHCI controller (AHCI tops out at 32 outstanding commands per port, while NVMe allows thousands). This helps a lot with benchmarks, but not so much for a single user.
Anyway, boot times and general single-user performance are bottlenecked mostly by latency, especially once the throughput of larger transfers exceeds a few hundred MB/sec. So the transfers large enough to take advantage of the higher bandwidth are a smaller (and shrinking) portion of the pie.
Next time you start your favorite game, look at the CPU and disk IO. It's likely the game never gets anywhere close to the max IO performance of your disk, and if it does, it's only for a short burst.
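If you want to actually watch this, here's a rough sketch using the third-party psutil package. It's crude next to real tracing tools, and the one-second sampling window is my own assumption:

```python
# Rough sketch: sample whole-system disk reads once a second while a game
# loads, using psutil (pip install psutil). Coarse, but enough to see how
# far below the drive's peak the real demand sits.
import time
import psutil

prev = psutil.disk_io_counters()
while True:
    time.sleep(1)
    cur = psutil.disk_io_counters()
    mb = (cur.read_bytes - prev.read_bytes) / 1_000_000
    reqs = cur.read_count - prev.read_count
    avg_kb = mb * 1000 / reqs if reqs else 0.0
    print(f"reads: {mb:7.1f} MB/s  {reqs:6d} req/s  avg {avg_kb:6.1f} KB/req")
    prev = cur
```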
Anyway, it's like multicore: beyond a fairly low core count, most desktop-type operations are better off with faster CPUs rather than more of them.
And just as with CPU benchmarks, the people running storage benchmarks seem loath to heavily weight single-threaded operations or queue-depth-1 1K IO loads in the overall performance picture, even though those make up a large portion of actual system performance in everyday tasks.