I honestly thought we'd got away from this 500x nonsense a few years ago. AMD is one source for the claim that 2-3x is more reasonable; so are Qualcomm, Khronos, any of the members of the OpenCL committee you talk to, and even NVIDIA insiders if you catch them at a conference. I gave multiple public talks countering any factors over about 10 when I worked at AMD, and those talks were approved by management.
Just think the raw numbers through. The GPU has, say, 32 cores. The CPU also has multiple cores; if you don't count them, you're cheating. So let's say we have 8 CPU cores. Each CPU core has two SSE units or one AVX unit, to be conservative, so each core does 8 ALU ops per cycle, giving you 64 ops per cycle across the chip. The CPU clock rate is about 3x the GPU's, so call it 192 ops per GPU cycle. The GPU has 32 cores, and each GPU core can do 64 ops/cycle (a fair number for GCN), so you have 2048 ops/cycle on the GPU. 2048/192 is roughly 10. That's your peak - now add in divergence costs on the wide GPU SIMD units (which statistically will hit you much earlier than on the CPU's narrow SIMD units), count the tiny GPU caches leading to more cache misses than on the CPU, and you can see why that factor of 10 invariably drops to 2 or 3x.
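Here's that back-of-the-envelope arithmetic as a quick Python sketch. The core counts, SIMD widths and clock ratio are the assumed round numbers from above, not measurements of any particular chip:

    # Peak-throughput comparison using the assumed round numbers above.
    cpu_cores = 8
    cpu_ops_per_cycle_per_core = 8    # two SSE units or one AVX unit
    cpu_clock_ratio = 3.0             # CPU clock roughly 3x the GPU clock

    gpu_cores = 32
    gpu_ops_per_cycle_per_core = 64   # fair number for a GCN compute unit

    cpu_ops = cpu_cores * cpu_ops_per_cycle_per_core * cpu_clock_ratio  # 192 per GPU cycle
    gpu_ops = gpu_cores * gpu_ops_per_cycle_per_core                    # 2048 per GPU cycle

    print(gpu_ops / cpu_ops)  # ~10.7: the peak ratio, before divergence and cache effects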
More honestly you're looking at a factor of 10 or so for ALU throughput, and 10 or so for memory throughput - and those are not multiplicative, because any given kernel is limited by whichever of the two is its bottleneck (see the sketch below). In real use cases 2-3x is about right when comparing against well-optimised CPU code.
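A roofline-style sketch shows why the two factors don't stack: runtime is set by whichever resource is the bottleneck, so a ~10x ALU advantage on top of a ~10x bandwidth advantage still caps out at about 10x. All the numbers below are illustrative, not measured:

    # Roofline-style model: time = max(compute-bound time, bandwidth-bound time).
    def runtime(flops, bytes_moved, peak_flops, peak_bandwidth):
        return max(flops / peak_flops, bytes_moved / peak_bandwidth)

    flops, bytes_moved = 1e9, 1e8  # some arbitrary kernel

    cpu_time = runtime(flops, bytes_moved, peak_flops=2e11, peak_bandwidth=5e10)
    gpu_time = runtime(flops, bytes_moved, peak_flops=2e12, peak_bandwidth=5e11)  # 10x both

    print(cpu_time / gpu_time)  # 10.0, not 100: the advantages don't multiply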
If a 500x speedup is appearing with LibreOffice here - and the likelihood is that it's somewhat cherrypicked anyway - then what we are seeing is the difference between someone optimizing code and someone else not doing so. There is every reason to think the original code was only lightly optimized: not parallel, not vectorized, or some combination of the above.