Re:Hardware architecture not software
I used to have a printed sheet of paper on my wall at work; allegedly it originally came from Google, but who knows? Given its age, it was probably referring to processors from a decade or more ago, but I don't think it's entirely irrelevant even now.
Anyway, it listed memory types by their "distance" from the CPU in clock cycles: L1 cache took about 3 cycles to access, L2 cache about 10, L3 cache about 30, and main memory about 100. This backs up your contention that there's a bottleneck between the CPU and memory that we'd benefit from breaking.
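You can see those tiers for yourself with a pointer-chasing microbenchmark. This is just a rough sketch I'm typing from memory (assume Linux/glibc, compile with something like cc -O2); it reports nanoseconds per dependent load for working sets sized to land roughly in each cache level, and the exact numbers will obviously depend on your machine, not on whatever that old sheet said.

    #define _POSIX_C_SOURCE 200809L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static volatile size_t g_sink;   /* keeps the compiler from deleting the chase loop */

    /* Nanoseconds per dependent load when chasing a random pointer chain
     * through a working set of n_elems machine words. */
    static double chase(size_t n_elems, size_t iters)
    {
        size_t *next = malloc(n_elems * sizeof *next);
        if (!next) return -1.0;

        /* Sattolo's algorithm: a random single-cycle permutation, so every
         * element gets visited and the prefetcher can't guess the next address. */
        for (size_t i = 0; i < n_elems; i++) next[i] = i;
        for (size_t i = n_elems - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
        }

        size_t p = 0;
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (size_t i = 0; i < iters; i++) p = next[p];   /* each load depends on the last */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        g_sink = p;
        free(next);

        double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
                  + (double)(t1.tv_nsec - t0.tv_nsec);
        return ns / (double)iters;
    }

    int main(void)
    {
        /* Working sets sized to land roughly in L1, L2, L3, and main memory. */
        const size_t kib[] = {16, 256, 8192, 262144};
        for (int i = 0; i < 4; i++) {
            size_t n = kib[i] * 1024 / sizeof(size_t);
            printf("%8zu KiB: %6.1f ns per dependent load\n",
                   kib[i], chase(n, 20u * 1000 * 1000));
        }
        return 0;
    }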
On the other hand, I've worked on custom single-chip MCUs. Normally "main memory" in such a system is a block of fast on-die flash memory. For this MCU, however, "main memory" was an off-die serial SPI flash. It would seem that fetching instructions over a slow serial bus (one... bit... every... 50 MHz... bus... clock) would drastically slow things down, but adding a small cache memory gave us 80% of the performance we'd get running from zero-wait-state internal memory.

I guess that's a long-winded way of saying that for most computing problems, the extensive cache architectures in modern high-performance CPUs address the CPU-memory bottleneck you're concerned about very nicely. Sure, there are some problems this doesn't solve, and that could benefit from alternative architectures, but that isn't our mainstream issue.
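For what it's worth, here's the back-of-envelope math on why that little cache got us so close. The cycle counts below are made-up placeholders, not the real numbers from that project; the point is just the shape of the average-memory-access-time curve: when a miss costs hundreds of times a hit, the high hit rates you naturally get on instruction fetch (loops!) put you most of the way back to internal-flash speed.

    #include <stdio.h>

    int main(void)
    {
        const double hit_cycles  = 1.0;     /* assumed: zero-wait-state internal fetch   */
        const double miss_cycles = 400.0;   /* assumed: refill a line over the SPI bus   */
        const double hit_rates[] = {0.90, 0.95, 0.99, 0.999};

        for (int i = 0; i < 4; i++) {
            double h = hit_rates[i];
            /* Average memory access time = hit_rate*hit_cost + miss_rate*miss_cost */
            double amat = h * hit_cycles + (1.0 - h) * miss_cycles;
            printf("hit rate %5.1f%% -> %6.1f cycles per fetch (%.0fx internal)\n",
                   h * 100.0, amat, amat / hit_cycles);
        }
        return 0;
    }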