When it comes to being better than a GPU for applications, you have to remember GPUs have abysmal memory bandwidth (due to being limited by PCIe's 16GB/s to the CPU) after you run out of data in the relatively small memory on the GPU. Due to this, GPUs are really only good at level 3 BLAS applications (matrix-matrix based... basically things you would imagine GPUs were designed for, which are related to image/video processing).
This is only true if your problem does not fit in the VRAM (which is getting over 10GB nowadays). If it does, you'll be 8x to 12x faster than any brand new CPU for any element-wise operation. Also, it is much more common to find an easy way to cut the problem nicely than not.
That being said, do you know with how much embedded RAM will you be proposing your architecture (even a rough projection)?