Comment Re:How can that be? (Score 1) 122
Well, I suppose either my application (face recognition for hundreds of users) is under the threshold for your definition of HPC, or it's a notable exception. Our algorithm consists primarily of repeated BLAS level 1 and 2 operations on chunks of data that fit in CPU cache but not in GPU cache. Essentially, it's low-arithmetic-intensity operations performed repeatedly (hundreds of times) on gallery image sets that take up a couple of megabytes at a time (and there are a couple hundred of those that can be computed in parallel). Under these conditions, we find that a dual-socket quad-core Xeon machine is roughly comparable to a high-end Fermi. There is locality in our memory access pattern that the CPU has enough cache to exploit, but the GPU does not. I'm not a chip designer, so I can't really compare the opportunity cost (in $/flop) of adding more cache vs. widening a GPU's memory bus, but I suspect the cache is cheaper, especially given that it need not be shared (again, for our application).
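
To put rough numbers on "low arithmetic intensity", here is a back-of-the-envelope sketch in Python. It is not our actual code. The function names, the single-precision element size, the example matrix shape, and the cache sizes are all illustrative assumptions, and the traffic counts assume every operand is streamed from memory exactly once.

```python
# Rough flops-per-byte estimates for BLAS level 1 and 2 kernels.
# All sizes below are illustrative assumptions, not measurements.

def axpy_intensity(n, bytes_per_elem=4):
    """Flops per byte for y = a*x + y (BLAS level 1) on vectors of length n."""
    flops = 2 * n                      # one multiply and one add per element
    traffic = 3 * n * bytes_per_elem   # read x, read y, write y
    return flops / traffic

def gemv_intensity(m, n, bytes_per_elem=4):
    """Flops per byte for y = A @ x (BLAS level 2), with A of shape m x n."""
    flops = 2 * m * n                            # multiply-add per matrix entry
    traffic = (m * n + n + m) * bytes_per_elem   # read A and x, write y
    return flops / traffic

def fits_in_cache(working_set_bytes, cache_bytes):
    """True if the whole working set can stay resident in the cache."""
    return working_set_bytes <= cache_bytes

WORKING_SET = 2 * 1024 * 1024   # "a couple megs" per gallery set (assumed 2 MiB)
CPU_CACHE = 8 * 1024 * 1024     # assumed last-level cache per CPU socket
GPU_CACHE = 768 * 1024          # assumed GPU L2 cache size

print("axpy flops/byte:", axpy_intensity(1_000_000))
print("gemv flops/byte:", gemv_intensity(512, 1024))
print("fits in CPU cache:", fits_in_cache(WORKING_SET, CPU_CACHE))
print("fits in GPU cache:", fits_in_cache(WORKING_SET, GPU_CACHE))
```

Both kernels come out well under one flop per byte, so they are limited by how fast data arrives rather than by raw compute. When the same couple of megabytes are reused hundreds of times, whichever chip can keep that working set in cache avoids going back to main memory on every pass.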