I am one such programmer. Yet I also coded for an Nvidia Tesla C1060 board and found it much more straightforward to handle several thousand threads at once.
Not all types of threads are created equal. I usually explain CUDA to people as the "Zerg Rush" model of computing - instead of a couple of well-behaved, intelligent threads that try to be polite to each other and clean up their own messes, you throw a horde of a thousand vicious, stupid little threads at the problem all at once and rely on some overlord to keep them in line.
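In code, the rush looks something like this - a minimal sketch of the standard one-thread-per-element pattern (the kernel name and the 256-thread block size are my own choices, nothing specific to the C1060):

    // One vicious, stupid little thread per array element: each computes
    // a single y[i] = a*x[i] + y[i] and dies. The hardware scheduler is
    // the overlord.
    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // which element am I?
        if (i < n)                                      // threads past the end just retire
            y[i] = a * x[i] + y[i];
    }

    // Launch the horde: one thread per element, 256 to a block.
    // saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);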
Most of the guides explained it as, "Flops are free, bandwidth is expensive." This board had a 512-bit-wide memory bus with very high latency, and the reason you throw that many threads at it is to let the hardware cover up the latency - it can coalesce a huge number of memory reads/writes into a single transaction, and as soon as one thread stalls waiting on memory I/O it swaps another thread onto that same SP and lets it compute. If memory serves, the board was divided into multiprocessors of 8 scalar processors each (every multiprocessor had some scratchpad memory that could be accessed almost as fast as a register), and you wrote warps of 32 threads which ran in lock-step on that multiprocessor (no recursion was allowed, and if one thread branched, the others would just wait around until it reached the same point), issued across the 8 SPs over four clock cycles.
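To make the scratchpad concrete, here's a sketch (mine, not from any guide - it assumes a 256-thread block and an array length that is an exact multiple of 256): each block stages its chunk of the array through shared memory, and both the global read and the global write are contiguous across the block, so the hardware can merge each into wide transactions.

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    // Reverse each 256-element chunk of the array in place, staging it
    // through the per-block scratchpad ("shared" memory). The global load
    // and the global store are both contiguous, so they coalesce.
    __global__ void blockReverse(float *d)
    {
        __shared__ float tile[256];                    // the scratchpad
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        tile[threadIdx.x] = d[i];                      // coalesced read
        __syncthreads();                               // whole block catches up
        d[i] = tile[blockDim.x - 1 - threadIdx.x];     // coalesced write
    }

    int main(void)
    {
        const int n = 1 << 20;                         // an exact multiple of 256
        size_t bytes = n * sizeof(float);
        float *h = (float *)malloc(bytes), *d;
        for (int i = 0; i < n; ++i) h[i] = (float)i;

        cudaMalloc(&d, bytes);
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
        blockReverse<<<n / 256, 256>>>(d);             // 4096 blocks of 256 threads
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);

        printf("h[0] = %.0f (expect 255)\n", h[0]);
        cudaFree(d);
        free(h);
        return 0;
    }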
Sure, that's a bit complex to optimize for, but it beats the hell out of juggling conventional threads while simultaneously hand-optimizing for x86 SIMD. And if you manage to write it so it runs well on CUDA, it will generally scale effortlessly to whatever card you throw it at.
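The usual trick behind that portability (again my sketch, not the only way to do it) is a grid-stride loop: size the launch to the card rather than to the data, and let each thread walk the array in strides, so the exact same kernel keeps a small card or a big Tesla busy without retuning.

    // Grid-stride version of a SAXPY kernel: correct for any grid size,
    // so the launch can be sized to the card instead of to the data.
    __global__ void saxpyStrided(int n, float a, const float *x, float *y)
    {
        int stride = gridDim.x * blockDim.x;           // total threads in flight
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
            y[i] = a * x[i] + y[i];                    // each thread covers ~n/stride elements
    }

    // e.g. saxpyStrided<<<128, 256>>>(n, 2.0f, d_x, d_y);  // 32768 threads, any n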
It's looking like OpenCL won't be much different, but I have yet to try it. I'm kind of eager to, since AMD/ATI's current cards apparently pack a bit more raw power for the money than Nvidia's.