I think part of the problem is that the axes aren't linear. If you know the problem you're trying to tackle a priori, you can tackle it with multiple orders of magnitude greater efficiency. For a fully specified, unchanging problem, I'd expect 3 orders of magnitude or better in most spaces, because you'd build exactly the hardware you need and strip away all the hardware that supports unneeded programmability: you build a hardwired ASIC. Even in the programmable space, spending a bit of effort matching your problem to your processor can bring huge gains in efficiency, at least 5x. Also, consider that efficiency isn't just run time; it's a function of power, performance, and cost.
The algorithms that run on a hearing aid would drain the hearing aid's battery before they were even fully loaded if you tried to run them on a typical desktop processor. But, they're baked down to a hyperefficient DSP or ASIC that's tuned specifically for the problem.
You cite a SPEC benchmark that runs faster on an A7 than on an A15. Is that in clocks or in wall-clock time? I suspect it's dominated by pointer dereferences, such as a linked-list traversal. Load-to-use latency (which is a function of pipeline depth rather than cache organization) becomes a dominant term for those workloads.
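To make the dependent-load pattern concrete, here's a minimal sketch of the kind of traversal I mean. The linked list is modeled as an index array with a random successor for each node; on real hardware, each load in the traversal can't issue until the previous one completes, so the loop serializes on the pipeline's load-to-use latency, while a sequential array scan pipelines freely. (The function names are mine, purely for illustration.)

```python
import random

def build_chain(n, rng):
    """Build a randomly permuted 'linked list' as an index array:
    next_idx[i] is the successor of node i, forming one big cycle."""
    order = list(range(n))
    rng.shuffle(order)
    next_idx = [0] * n
    for a, b in zip(order, order[1:]):
        next_idx[a] = b
    next_idx[order[-1]] = order[0]  # close the cycle
    return next_idx, order[0]

def traverse(next_idx, head, steps):
    """Each iteration needs the previous load's result before it can
    issue the next load -- this is what makes load-to-use latency,
    not raw cache bandwidth, the dominant term."""
    i = head
    for _ in range(steps):
        i = next_idx[i]
    return i

chain, head = build_chain(1000, random.Random(42))
# After exactly len(chain) steps we're back where we started.
assert traverse(chain, head, len(chain)) == head
```

A short, low-latency pipeline (like the A7's) pays less per dependent load here than a deep one, clock-for-clock, which is consistent with the benchmark result above.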
Backing up a bit: my problem with your thesis is that you assume there's a "best GPP" and then set out to prove that no one processor could possibly be it, on the basis that the winner varies across random applications. At the limit, your argument seems to be: "if you don't tell me your application ahead of time, I can't pick a best processor; therefore there are no general-purpose processors."
It's the other way around. There's a cluster of processors that are OK at a range of random tasks. They're distinguished from special-purpose processors by the fact that the special-purpose processor performs at least 5x better than the average for the cluster (and likely orders of magnitude better in some cases). That's true even if some of the processors in that cluster are 2x more efficient than the others. A processor is a GPP if there are few or no problems for which it's orders of magnitude more efficient than its cohort. 2x is nothing to sneeze at, but a specialized processor should reach much higher: 5x at a minimum.
And please note I'm talking about efficiency. It's not raw cycles or even wall-clock time. Maybe a better measure is "energy per function," or "energy per function per dollar." (Although the latter is a bit dubious: you buy the hardware once, but you use it many, many times. If you're doing significant compute, lifetime costs are best approximated by energy costs over the lifetime of the device.)
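The lifetime-cost point can be sketched with back-of-the-envelope arithmetic. All numbers here are made up for illustration; the point is only that energy paid every hour swamps a price paid once.

```python
def lifetime_cost(purchase_usd, power_watts, hours, usd_per_kwh=0.15):
    """Purchase price is paid once; energy is paid for every hour of
    operation, so over a long service life energy dominates."""
    energy_kwh = power_watts * hours / 1000.0
    return purchase_usd + energy_kwh * usd_per_kwh

# Hypothetical: a 200 W part run flat out for 5 years (~43,800 h)
# versus a 20 W part that's 2x slower (so it needs 2x the hours for
# the same work). Even at double the runtime, the low-power part
# wins on lifetime cost, which is mostly energy.
fast = lifetime_cost(purchase_usd=500, power_watts=200, hours=43_800)
slow = lifetime_cost(purchase_usd=300, power_watts=20, hours=2 * 43_800)
assert slow < fast
```

This is why "energy per function" is the more honest axis than cycles or even wall-clock time.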
You mention GPUs. Sure, GPUs provide cheap FLOPs, and they can even start to run arbitrary C programs. But, what percentage of those FLOPs gets utilized when running random programs? You might get a 4x speedup offloading some algorithm to your video card, but is that a win when your video card's raw compute power is 100x your host CPU's? Would you buy a Windows machine powered only by a GPU, running everything from your statistical regression to your web browser?
(I may exaggerate, but only slightly.)
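The 4x-versus-100x point reduces to one division, using the hypothetical numbers from the paragraph above:

```python
def effective_utilization(speedup, raw_advantage):
    """If offloading yields `speedup` but the accelerator's raw
    throughput is `raw_advantage` times the host's, the fraction of
    that raw throughput actually exploited is their ratio."""
    return speedup / raw_advantage

# A 4x speedup on hardware with 100x the raw FLOPs means only 4%
# of the GPU's compute is doing useful work on that program.
print(f"{effective_utilization(4, 100):.0%}")  # -> 4%
```

By that measure, the random program is wasting 96% of the specialized hardware, which is exactly the sense in which the GPU isn't general purpose.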
To me, "general purpose" means: "I run the compiler, and for the most part, I get what I get. If there are some hotspots, maybe I can tune for this specific architecture. Most of the time, I don't worry." Specialized means: "by selecting this processor for this task, I know up front that I need to spend time optimizing the implementation of the task to this processor."
Perhaps the qualm is that this is really more a function of the application than of the processor. OK, I can buy that. But, when you look across the space of processors that get deployed in that way, you'll see that most processors end up fairly consistently on one side or the other of that line, and few are on the fence. You find very few DSPs and GPUs asked to run Linux or Windows kernels and applications (the core code, not the stuff they compile to be offloaded, say, in a shader language). You find some number of x86s asked to run signal-processing applications, but only where they can afford the cooling.