Really modern GPUs (Fermi) can do OK on branches if the branches occur at the warp level, stay limited to a small number (fewer than 16), or stop, partly because of support for concurrent kernels. (Check out http://psilambda.com/ if you think that concurrent kernels cannot be used.) It is incorrect, or at least imprecise, to say that there is a problem if "it splits in half": if a single warp splits in half, performance drops, but if half of the warps go one way at a branch and the other half go the other way (i.e. the split is at the warp level), there is no performance loss. There is no "penalty" in these cases.
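To make the warp-level distinction concrete, here is a rough cost model in Python (not GPU code). It only counts how many serialized passes each 32-thread warp needs for a single if/else, assuming a warp whose threads disagree runs both sides one after the other. The function names and the four-warp launch size are made up for illustration, and real hardware has more going on than this sketch captures.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA hardware, including Fermi


def warp_passes(takes_branch):
    """Return how many serialized passes one warp needs for an if/else.

    takes_branch holds WARP_SIZE booleans, one per thread. A warp whose
    threads all agree runs one path; a warp whose threads disagree runs
    both paths one after the other (simplified model).
    """
    assert len(takes_branch) == WARP_SIZE
    return len(set(takes_branch))


def total_passes(predicate, num_threads):
    """Sum the passes over all warps for a per-thread branch predicate."""
    assert num_threads % WARP_SIZE == 0
    total = 0
    for start in range(0, num_threads, WARP_SIZE):
        warp = [predicate(tid) for tid in range(start, start + WARP_SIZE)]
        total += warp_passes(warp)
    return total


NUM_THREADS = 4 * WARP_SIZE  # hypothetical launch of four warps

# Every warp splits in half: lanes 0-15 go one way, lanes 16-31 the other.
split_inside_warps = total_passes(lambda tid: tid % WARP_SIZE < 16, NUM_THREADS)

# Half of the warps go one way, half the other: no single warp diverges.
split_between_warps = total_passes(
    lambda tid: (tid // WARP_SIZE) % 2 == 0, NUM_THREADS
)

print(split_inside_warps, split_between_warps)  # prints "8 4"
```

Under this model the split inside each warp costs twice as many passes as the split between warps, even though in both cases exactly half of the threads take each side.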
It is wrong to say: "a relational database, they fall over flat. A normal CPU creams them performance wise.". It depends on what you are talking about. As one example, for SQL window functions over partitions that involve any calculations (not just lag and lead, thank you), the GPU can cream the CPU. Check out http://psilambda.com/2010/07/kappa-quick-start-guide-for-windows/ for some throughput numbers. There, the bottleneck is more the relational database than the GPU. Other relational database tasks that could benefit from the parallel processing of the GPU include calculating the optimal query plan for large numbers of joins, sorting, merging keys (joins), etc., provided the relational database is written to process blocks of tuples instead of one tuple at a time. A relational database written to process blocks of tuples would also be more efficient on the CPU, because of lower call overhead and better cache use.
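For anyone unsure what "window functions over partitions" processed a block at a time looks like, here is a small CPU-only sketch in Python. It emulates something like SUM(value) OVER (PARTITION BY key ORDER BY seq), handling each partition as one block of tuples. The column names and sample rows are hypothetical, and this says nothing about how Kappa or any particular database does it; the point is only that the blocks are independent, which is the kind of work that can be handed out in parallel.

```python
from itertools import accumulate, groupby
from operator import itemgetter


def running_sum_over_partitions(rows):
    """Emulate SUM(value) OVER (PARTITION BY key ORDER BY seq).

    rows is a list of (key, seq, value) tuples. Each partition is handled
    as one block, so the blocks are independent units of work.
    """
    ordered = sorted(rows, key=itemgetter(0, 1))  # by key, then seq
    result = []
    for _key, block in groupby(ordered, key=itemgetter(0)):
        block = list(block)  # materialise the whole partition as one block
        sums = accumulate(value for _, _, value in block)
        result.extend(
            (key, seq, value, total)
            for (key, seq, value), total in zip(block, sums)
        )
    return result


# Hypothetical sample rows: (key, seq, value)
rows = [("a", 2, 10), ("b", 1, 5), ("a", 1, 1), ("b", 2, 7), ("a", 3, 100)]
windowed = running_sum_over_partitions(rows)
for row in windowed:
    print(row)
```

Working on a whole partition per call, rather than calling a function once per tuple, is also where the lower call overhead on the CPU would come from.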
If you want your computations to be parallel at a level higher than individual algorithm steps (i.e. you can build libraries upon libraries that stay efficiently parallel through every layer), then neither the CUDA driver API nor the CUDA runtime API (nor OpenCL or DirectCompute) is very good. One example for CUDA: using the CUDA APIs alone, it is generally not possible to apply the Fermi concurrent kernel execution feature across all (or even very many) of the CUDA kernels in a program.
MPI (Message Passing Interface) gives you parallel computation at the cluster level, and the Kappa Library gives you this at the library component level. If somebody knows of something other than MPI or Kappa that does this and is available for general use, I would be interested to hear about it.
Indeed. With CUDA, DirectCompute, and OpenCL, nearly 100% of your code is boilerplate interfacing to the API. There needs to be a language where this stuff is a first-class citizen and not just something provided by an API.
If you use CUDA, OpenCL, or DirectCompute directly, it is. Try the Kappa library: it has its own scheduling language that makes this much easier. The next version, which is about to come out, goes much further still.