The entire reason CUDA works and is powerful is precisely that it is limited. Nvidia knows there is no silver bullet, and they're not claiming CUDA is one (David Kirk has said so himself at conferences). CUDA is a fairly elegant way of mapping embarrassingly data-parallel programs onto a large array of single-precision FP units. If your problem fits the model, the performance you get via CUDA will smoke just about anything else (except maybe an FPGA in some scenarios).
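To make "embarrassingly data-parallel" concrete, here is the canonical SAXPY kernel, a minimal sketch rather than anything from the original post: each thread owns exactly one element, so there is no communication between threads at all. The kernel name and launch parameters are just illustrative.

```
// SAXPY (y = a*x + y): the textbook embarrassingly data-parallel case.
// Each thread computes one output element independently.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard the ragged last block
        y[i] = a * x[i] + y[i];
}

// Host-side launch: one thread per element, 256 threads per block.
// saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
```

Because no thread ever reads another thread's result, the hardware is free to schedule the blocks in any order across however many FP units the GPU has, which is exactly why this class of problem maps so well onto CUDA.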
Your observation that a particular model makes some parts of parallel programming easy while leaving others hard is exactly what people need to learn to accept about parallel programming. If you're expecting a single model to make everything easy for you, trust me, stop programming right now.
You need to pick the programming model that matches the parallelism in your application; there will never be one solution. When sitting down to write code, you have to ask yourself: what is the right model for this algorithm? Is it:
Data parallel (SIMD, Vector)
Streaming (pipe and filter)
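As an illustration of the second model, a pipe-and-filter structure can even be expressed inside CUDA itself using streams, with input chunks flowing through copy-in, compute, and copy-out stages that overlap. This is a hypothetical sketch: the `filter` kernel, buffer names (`d_in`, `d_out`, `h_in`, `h_out`), and chunking constants are all assumed, not from the original post.

```
// Pipe-and-filter sketch with CUDA streams: each chunk moves through
// three stages (H2D copy, kernel, D2H copy); two streams let stage k of
// one chunk overlap stage k+1 of the previous chunk.
cudaStream_t s[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&s[i]);

for (int c = 0; c < nChunks; ++c) {
    int j = c % 2;                  // alternate between the two streams
    cudaMemcpyAsync(d_in[j], h_in + c * CHUNK, CHUNK * sizeof(float),
                    cudaMemcpyHostToDevice, s[j]);
    filter<<<blocks, threads, 0, s[j]>>>(d_in[j], d_out[j], CHUNK);
    cudaMemcpyAsync(h_out + c * CHUNK, d_out[j], CHUNK * sizeof(float),
                    cudaMemcpyDeviceToHost, s[j]);
}
cudaDeviceSynchronize();            // drain the pipeline
```

The point is that "data parallel" and "streaming" aren't mutually exclusive: the kernel is data parallel internally, while the stream structure around it is a pipeline.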
There are many models out there, and many languages and hardware substrates for these models that will give you orders-of-magnitude speedups for parallel programs. The key is just to sit down, think about the problem, and pick the right one (or the right combination).
The real research focus in parallel programming should be building a taxonomy of these models and a unified infrastructure that supports intelligent model selection, mixing and matching, and compilation.