Won't the DDR take "50 to 150 cycles" to service each request? Or is there some sort of pipelining going on, where the DDR can take a request every 10 cycles but have a whole bunch of queued requests in flight?
Actually, that's pretty much exactly how it works. If you have a bunch of independent requests to DDR—and by independent, I mean that the processor(s) do not stall waiting for the information from one request in order to make the next—then you can get multiple requests in flight and they can pipeline. Streaming works this way, for example. The STREAM benchmark is a textbook example of a benchmark dominated by throughput, where all the accesses are independent. For example, a[i] = b[i] + c[i] does not depend on a[i - K] = b[i - K] + c[i - K] or a[i + K] = b[i + K] + c[i + K] for any value of K in STREAM's "Add" loop. All four loops of the benchmark have that character. So as long as the processor can get enough work in-flight, it can get multiple cache misses outstanding to DDR. And if one processor and its caches have limited ability to 'execute ahead' like this, multiple processors (or multiple independent threads on the same processor) acting independently can fill in those gaps.
Linked list traversal results in a series of requests that are all dependent on each other. If all the requests miss the caches and must go out to DDR, then the CPU's performance is bounded by the round trip latency to DDR, not the DDR's throughput. Take a look at the linked list benchmarks in Ulrich Drpper's paper, "What Every Programmer Should Know About Memory." (Specifically, go down to section 3.3.2 on page 20.) Pay particular attention to Figure 3.15, Sequential vs. Random Read (for a single thread), and also compare to Figure 3.21 which shows multi-threaded random accesses for 1, 2, and 4 threads.
The paper might be a little old (it uses a Pentium 4 for its benchmarks, after all), but the principles remain true. I should know... part of my day job is as a memory system architect.