> On an 8-core machine, a processor will be placed into a wait queue roughly 7 out of 8 times that it needs access.
You just snuck into your analysis the assumption that every core is memory-saturated. And in many designs the memory path doesn't aggregate until the L3 cache: L1 is almost never shared, and L2 only sometimes is. The real bottleneck ends up being the cache coherency protocol and cache snoop traffic, which is handled on a separate bus that may even have concurrent request channels.
I think Intel's Xeon E5 line-up has single-ring and bridged double-ring SKUs for forwarding dirty cache lines from one cache to another (and perhaps for all memory requests). That interconnect can also be saturated by many workloads.
In many systems, you have all these cores running tasks that are fairly well isolated (not much cache conflict), except they all want to allocate as much memory as they need from a giant memory space (e.g. a terabyte of DRAM), so they fundamentally have to fall through to a shared memory-allocation framework.
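This is why modern allocators grow per-thread caches and arenas. A hypothetical minimal sketch (not jemalloc's actual design): instead of taking the shared pool's lock on every allocation, each thread carves small allocations out of a chunk it grabbed once, so the lock is touched only once per chunk:

```c
#include <pthread.h>
#include <stdlib.h>

#define CHUNK (1 << 20)  /* refill granularity: 1 MiB from the shared pool */

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

/* The contended path: one lock in front of the "giant memory space".
 * malloc() here is a stand-in for carving from one big mapping. */
static void *pool_refill(size_t n) {
    pthread_mutex_lock(&pool_lock);
    void *p = malloc(n);
    pthread_mutex_unlock(&pool_lock);
    return p;
}

/* Per-thread arena: bump-pointer allocation, lock-free on the fast path.
 * Sketch only: requests larger than CHUNK are out of scope, and chunk
 * tails are simply abandoned, as in any arena allocator. */
static _Thread_local char *cur, *end;

void *arena_alloc(size_t n) {
    n = (n + 15) & ~(size_t)15;   /* round up to 16-byte alignment */
    if (cur + n > end) {          /* slow path: refill once per chunk */
        cur = pool_refill(CHUNK);
        end = cur + CHUNK;
    }
    void *p = cur;
    cur += n;
    return p;
}
```

The fast path is a pointer bump with no atomics at all; contention on `pool_lock` drops by roughly the ratio of chunk size to average allocation size. The hard parts that this sketch dodges (freeing, cross-thread frees, fragmentation) are exactly where the real engineering lives.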
You can learn a lot about the challenges involved by following the winding path of something like jemalloc, as each increase in concurrency exposes yet another degeneracy.
The real problem with this field is that there isn't a single, simple story like the one you tried to tell. There are usually dozens of ways to skin the cat, each with a completely different scaling story, and with different sets of engineers who are good at tweaking or debugging those stories.
At this point, what you have is a fragile coordination problem between your solution space, your architecture, and the engineers you employ, forcing ambitious ventures to crack out the golden recipe: pour in seven cement mixers full of head hunters, one 55-gallon oil drum of exclamation marks, a metric butter tonne of job perks, and agitate appropriately.