Software isn't the bottleneck. Caches are *tiny* compared to the size of even single functions in modern programs, which means they get flooded repeatedly, which in turn means that you're pulling from main memory a lot more than you'd like.
Wrong.
The code size of an average function is much smaller than the instruction cache of any modern processor.
And then there are L2 and L3 caches.
Instruction fetches that need to go all the way to main memory are quite rare.
And as for data.. that depends entirely on what the program does.
Multi-core CPUs aren't (as a rule) fully independent - they share caches and share I/O lines, which in turn means that the effective capacity is slashed as a function of the number of active cores. Cheaper ones even share(d) the FPU, which was stupid.
None of the CPUs that share an FPU between multiple HW threads are cheap.
Sun's Niagara I had a slow shared FPU, but the chip was not cheap.
AMD Bulldozer, which usually has sucky performance, sucks less on code that uses the shared FPU.
FPU operations simply have long latencies, and there are always lots of data dependencies, so in practice you cannot utilize the FPU well from one thread; you need to feed it instructions from multiple threads.
Intel uses HyperThreading for this; on AMD Bulldozer it's CMT, with one FPU shared per module.
GPUs are barrel processors for the same reason.
The bottleneck problem is typically solved by increasing the size of the on-chip caches OR by adding an external cache between main memory and the CPU.
Much more often the bottleneck is between the levels of the chip's caches.
The big outer-level caches are slow, and processors quite often spend a little time waiting for data to arrive from them. And if you increase the size of the last-level cache, you make it even slower.
One of the reasons for Bulldozer's sucky performance is that it has small L1 caches (so it needs to fetch data from the L2 cache often), but a big and slow L2 cache. So that relatively long L2 latency gets paid quite often.
External caches.. have not been used by Intel or AMD for about 10 years. They're either slow or expensive, and usually both. Now that even internal caches can easily exceed 10 megabytes, an external cache has to be very expensive to compete with the internal ones, and even then it only makes sense for some server workloads.
After that, it depends on whether the bottleneck is caused by bus contention or by slow RAM. Bus contention would require memory to be banked with each bank on an independent local bus. Slow RAM would require either faster RAM or smarter (PIM) RAM. (Smart RAM is RAM that is capable of performing very common operations internally without requiring the CPU. It's unpopular with manufacturers because they like cheap interchangeable parts and smart RAM is neither cheap nor interchangeable.)
Smart RAM is a dream, and a research topic in universities. It's uncommon because it does not (yet) exist.
And most problems/algorithms cannot be solved by "simple" smart RAM that can only operate on data located near each other. If you try to make it any smarter than that, you end up making it costlier and slower; it just becomes a chip with a multicore processor and memory on the same die.
There are some computational tasks where smart RAM would improve performance by a great margin, but for the >90% of other problems it is of quite little use.