For data processing workloads, a frequent problem with GPU acceleration is that the working dataset size is too large to fit into the available GPU memory and the whole thing slows to a crawl on data ingest (physical disk seeks, random much of the time) or disk writes for persisting the results.
For folks serious about getting good ROI on their GPU hardware in real world scenarios, I strongly recommend you take a look at the fusion IO PCIe flash cards, which now support writing to and reading from them directly from CUDA via DMA, with little to no CPU handling required. (See: http://developer.download.nvidia.com/GTC/PDF/GTC2012/PresentationPDF/S0619-GTC2012-Flash-Memory-Throttle.pdf).
I can't talk about what we do with it, but lets just say the following hardware combination has lead to interesting results;
i) 16x PCIe slot chassis: http://www.onestopsystems.com/expansion_platforms_3U.php
ii) 8x Nvidia Kepler K20x's
iii) 8x Fusion IO 2.4TB IoDrive 2 Duo's
We have been able sustain over 4 million data operations a second, each one processing ~16 K of data in a recoverable, transactionally consistent manner, totaling up to around 50 Gigabytes of data processed per second. All in a 5U deployment drawing less than 4 kilowatts.