What I think many people are missing is... well, Seymour Cray.
I make this point because Cray started his supercomputing journey by building highly advanced machines connected to a single, centralized memory system.
As things stand right now, probably the biggest problem we're facing in AI is cache coherence, and the bigger the machines we build, the worse it gets. I'm currently troubleshooting a fairly small HPC cluster of about a thousand cores, and in its current state, the more cores I add to a single node, the slower the machine gets. This is because the cost of keeping memory coherent across the cores is just too high. HBM2 and HBM3 don't help at all, because it's an operating-design issue, not a bandwidth issue. AI workloads thrash memory, and thrashing shared memory means the number of CPU spinlocks goes up. Historically, the cheapest form of shared memory has been atomic variables: they live in a single page and are always cache coherent. Right now, though, every access to such a variable takes a very long time, because the kernel ends up spinning while it waits for coherence. As a result, processors with 128 cores or more are generally a lot slower than much smaller processors working from duplicated, read-only copies of memory.
We need to see progress in high-performance multi-ported memory systems. I think specialized data lines could help: for example, entirely separate LVDS pairs for reading and writing synchronized (coherent) memory regions. Any write to memory in a designated region would be multicast across a full mesh, and spinlocks on reads could stay local to the core. As part of the multicast LVDS mesh, there could be a "dirty" status line: a centralized broker would identify writes to a region (as any MMU would) and, with minimal propagation delay, raise a dirty flag at the speed of electricity to every subscriber to that region.
Honestly, Cray would probably come up with something much more useful. But with optimizations like these, performance could improve drastically enough that substantially fewer cores could achieve the same tasks.
From what I've been looking at, GPUs are trash for AI. I have racks full of NVidia's and AMD's best systems, and in a few cases I have access to several of the computers ranked in the top 10 of the Top500 list. The obscenely wasted cycles, and wasted transistors in general, for AI processing are unforgivable. A chip specifically designed to run transformers should hold at least 100 times the capacity of a single GPU. And by optimizing the data path for AI in combination with smarter cache coherency, we could fit maybe thousands of times more capacity into a single chip.
Strangely, right now I think the two most interesting players in the market are Graphcore and Huawei. Both have substantially smarter solutions to this problem than either NVidia or AMD.