So there seem to be several questions as to why people would want to use CUDA when an open standard (OpenCL) exists for the same thing.
Well, honestly, I wrote this because OpenCL did not exist when I started.
I have heard the following reasons why some people prefer CUDA over OpenCL:
Additionally, I would like to see a programming model like CUDA or OpenCL replace the most widespread models in industry (threads, OpenMP, MPI, etc.). CUDA and OpenCL are both examples of Bulk Synchronous Parallel models, which are explicitly designed around the assumption that communication latency and core count will increase over time. Although I think it is a long shot, I would like to see more applications written in these languages so that developers have a migration path: rather than writing specialized applications for GPUs, they could write a single application for a CPU that can also take advantage of future CPUs with many cores, or of GPUs with a large degree of fine-grained parallelism.
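To make that concrete, here is a minimal sketch of the kind of kernel I mean, written in the CUDA/BSP style: the kernel describes one element's worth of work, and the launch configuration decides how much of it runs in parallel. The names and sizes are purely illustrative, not taken from any real application.

    // Illustrative SAXPY kernel in the data-parallel (BSP-like) style.
    #include <cuda_runtime.h>

    __global__ void saxpy(float a, const float* x, float* y, int n)
    {
        // Each thread handles one element; the runtime decides how many
        // threads actually execute concurrently.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    int main()
    {
        const int n = 1 << 20;
        float *x, *y;
        cudaMalloc(&x, n * sizeof(float));
        cudaMalloc(&y, n * sizeof(float));
        // Launch enough 256-thread blocks to cover all n elements.
        saxpy<<<(n + 255) / 256, 256>>>(2.0f, x, y, n);
        cudaDeviceSynchronize();
        cudaFree(x);
        cudaFree(y);
        return 0;
    }

The same kernel can run on a GPU today or, through a translator like Ocelot, be mapped onto a multicore CPU without being rewritten.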
Most of the codebase for Ocelot could be reused for OpenCL. The intermediate representations of the two languages are very similar, with the main differences being in the runtime.
Please try to tear down these arguments; it really does help.
The greatest challenges lie in accommodating arbitrary control flow among threads within a cooperative thread array. NVIDIA GPUs are SIMD multiprocessors, but they include a thread activity stack that enables serialization of threads when they reach divergent branches. Without that hardware support, this kind of serialization becomes difficult on SIMD processors, which is why Ocelot doesn't include support for SSE yet. It is also one of the obstacles to supporting AMD/ATI IL at the moment, though solutions are in the works.
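To illustrate the problem, here is a toy kernel (not taken from Ocelot) with a data-dependent branch. Threads in the same warp can take different sides of the branch, so a SIMD machine has to mask and serialize the two paths; NVIDIA's activity stack does this in hardware, whereas an SSE or AMD IL backend would have to emulate it in software.

    // Toy example of branch divergence within a warp.
    __global__ void divergent(int* data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n)
            return;

        if (data[i] % 2 == 0)
            data[i] = data[i] * 2;   // path A: run with the other threads masked off
        else
            data[i] = data[i] + 1;   // path B: run with the path-A threads masked off
    }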
Translation from PTX to LLVM to multicore x86 does not necessarily throw away information about the PTX thread hierarchy at the outset. The first step is to express a PTX kernel using LLVM instructions and intrinsic function calls. This phase is [theoretically] invertible, and no information concerning correctness or parallelism is lost.
To get to multicore from here, a second phase of transformations inserts loops around blocks of code within the kernel to implement fine-grained multithreading. This part isn't necessarily invertible or easy to translate back to GPU architectures, and it is what the note you are citing refers to.
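As a rough illustration of what that second phase effectively does to the SAXPY kernel above, consider the hand-written equivalent below. This is a simplification, not Ocelot's actual generated code: the real transformation operates on LLVM IR, the special registers become intrinsic calls in the first phase, and the kernel also has to be split at barriers, which is part of what makes the transformation hard to invert.

    // Simplified illustration of the "thread loop" transformation: the
    // per-thread kernel body is wrapped in loops over the block and thread
    // indices so that one CPU thread executes an entire CTA serially.
    struct Dim3 { int x, y, z; };

    void saxpy_multicore(float a, const float* x, float* y, int n,
                         Dim3 grid, Dim3 block)
    {
        for (int block_x = 0; block_x < grid.x; ++block_x)
        {
            // Each iteration of the inner loop plays the role of one PTX thread.
            for (int thread_x = 0; thread_x < block.x; ++thread_x)
            {
                int i = block_x * block.x + thread_x;
                if (i < n)
                    y[i] = a * x[i] + y[i];
            }
        }
    }

Once the kernel body looks like this, it is ordinary scalar code, and the outer block loop can be divided among CPU worker threads.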
Disclosure: I'm one of the core contributors to the Ocelot project.
Nothing big today - just an RFID Terminator Gun. It basically fries any RFID chip in range. Not sure what good it is, unless you want to play a trick on your friends and family by frying their passports. Big fun.
"my terminal is a lethal teaspoon." -- Patricia O Tuama