This is incorrect. CUDA cores are at a higher level than ALUs or FPUs; they're like small, simple CPU cores. They can do integer and floating-point arithmetic, and the streaming multiprocessor they belong to has hardware support for thread context switching, which it can generally do in a single clock cycle. There can be varying numbers of CUDA cores in a streaming multiprocessor, but the threads within a CUDA thread block are grouped into sets of 32 ("warps") which share a scheduling unit and which execute the same instruction in lockstep on different memory addresses. When threads running on adjacent CUDA cores read and write adjacent memory addresses, memory access is very efficient ("coalescing").
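To make the coalescing point concrete, here's a minimal sketch (the kernel names and the stride parameter are illustrative, not from any real codebase). In the first kernel, consecutive threads of a warp touch consecutive `float`s, so the hardware can service the warp's 32 loads with a few wide memory transactions; in the second, a stride spreads the warp's accesses apart and each load needs its own transaction:

```cuda
#include <cuda_runtime.h>

// Coalesced: thread i reads in[i], so a warp's 32 accesses are contiguous
// and the memory controller combines them into a handful of transactions.
__global__ void scale_coalesced(float *out, const float *in, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * s;
}

// Strided (hypothetical contrast case): adjacent threads read addresses
// `stride` floats apart, defeating coalescing and wasting bandwidth.
__global__ void scale_strided(float *out, const float *in, float s,
                              int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i] * s;
}
```

Both kernels compute the same thing per element; only the access pattern differs, and on real hardware that difference alone can change effective bandwidth by an order of magnitude.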
CUDA cores aren't as capable or powerful as CPU cores; they don't have things like branch prediction or speculative, out-of-order execution, but they are cores nonetheless. They achieve high performance through sheer numbers - thousands of cores on top-end GPUs - and they're very good at streaming workloads, which apply the same operation in parallel to many array elements when each operation is independent of all the others.
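A sketch of that streaming style, using the classic SAXPY operation (y = a*x + y) as the example - every element is independent, so thousands of threads can each handle their own slice with no coordination. The grid-stride loop is a common idiom that lets a fixed launch size cover any array length:

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Streaming kernel: each y[i] update is independent of every other,
// so threads never need to communicate or synchronize.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    // Grid-stride loop: thread i handles elements i, i+gridSize, i+2*gridSize, ...
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));  // unified memory for brevity
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    saxpy<<<256, 256>>>(n, 2.0f, x, y);  // 65,536 threads over 2^20 elements
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // 2.0*1.0 + 2.0 = 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Note there's no branching that differs between threads in a warp: every lane runs the same instruction on its own data, which is exactly the regime these simple cores are built for.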