The greatest challenges lie in accommodating arbitrary control flow among threads within a cooperative thread array. NVIDIA GPUs are SIMD multiprocessors, but they include a thread activity stack that enables serialization of threads when they reach diverging branches. Without hardware support, this kind of thing becomes difficult on SIMD processors which is why Ocelot doesn't include support for SSE yet. It is also one of the obstacles for supporting AMD/ATI IL at the moment, though solutions are in order.
Translation from PTX to LLVM to multicore x86 does not necessarily throw away information concerning the PTX thread hierarchy initially. The first step is to express a PTX kernel using LLVM instructions and intrinsic function calls. This phase is [theoretically] invertible and no information concerning correctness or parallelism is lost.
To get to multicore from here, a second phase of transformations insert loops around blocks of code within the kernel to implement fine-grain multithreading. This is the part that isn't necessarily invertible or easy to translate back to GPU architectures and is what is referenced in the note you are citing.
Disclosure: I'm one of the core contributors to the Ocelot project.