First, summary != article.
Hey, let's play a little game called "scroll up in the thread": "That said, a lot of this article summary is nonsensical hype"
Literally my very first post in the thread.
That said, everything in the summary is from the article, including that quote, so it doesn't matter which one is referred to.
You're doing it again.
Confusing compute with memory bandwidth.
I'm not "confusing" anything. As was laid out in detail above, compute is maxed out in actual real-world usage, which is exactly why this hardware is built with such extreme compute capabilities.
You brought up an irrelevant data point, and I pointed out the stupidity of it.
It is precisely the topic of the thread that the M3 has the computational performance of a potato. In properly run, real-world scenarios, compute capacity absolutely is critical - which is why servers designed for AI tasks have such immense compute capacity to begin with.
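If you want the back-of-the-envelope version (all numbers below are made-up, purely illustrative - not measurements of any real chip or model): at batch 1, decode reads every weight once per token, so you sit against the bandwidth wall; batch the work up and the FLOPs scale with batch size while the weight traffic doesn't, so compute becomes the limit.

```python
# Rough roofline check. Every number here is an illustrative assumption.

def arithmetic_intensity(params: float, batch: int, bytes_per_weight: float = 2.0) -> float:
    """FLOPs per byte of weight traffic for one decode step over `batch` sequences.
    ~2 * params FLOPs per token; the weights are read once per step regardless
    of batch size (KV-cache traffic ignored to keep it simple)."""
    flops = 2.0 * params * batch
    bytes_read = params * bytes_per_weight
    return flops / bytes_read

# Hypothetical accelerator: 1000 TFLOP/s of compute, 3.3 TB/s of bandwidth,
# so it needs ~300 FLOPs per byte moved before compute becomes the bottleneck.
ridge = 1000e12 / 3.3e12

for batch in (1, 8, 64, 512):
    ai = arithmetic_intensity(params=70e9, batch=batch)
    print(f"batch={batch:4d}  {ai:6.1f} FLOP/byte  ->",
          "compute-bound" if ai > ridge else "memory-bound")
```

That ridge point is the whole reason the big inference boxes are built compute-heavy: once the work is batched properly, bandwidth stops being the limiter.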
You can run R1, period, in 200W.
You "can" run R1 on 20W. That doesn't make it either fast or efficient. This is a thread about performance and efficiency ,as a result of a summary about performance and efficiency, as a result of an article about performance and efficiency.
It's not a naive parallelization approach - it's a simple fact. A network must be evaluated sequentially. The layers must be split between the 2 cards, and you cannot evaluate layer 2 until you have evaluated layer 1.
I *literally described to you different forms of parallelization and their optimizations* beyond naive layer splitting, and you keep posting as if that never happened. Pipeline parallelization by layers is NOT the only way to distribute a model across multiple servers. And MoEs CAN distribute whole experts to individual machines so that only the hidden states before and after the FFN need to be synced (sketched below).
There are numerous libraries (seemingly growing by the day) for managing parallelization well. It is NOT, I repeat, NOT, just "let's put these layers on machine 1, and these other layers on machine 2".
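Concretely, here's a toy expert-parallel sketch (NumPy, made-up shapes, no real RPC or collective layer - real frameworks do this transfer with an all-to-all, this is not anyone's actual code): whole experts sit on separate hosts, and the only thing that crosses the network is the hidden states before and after the FFN.

```python
# Illustrative-only sketch of expert-parallel MoE dispatch.
import numpy as np

D_MODEL, D_FF, N_EXPERTS, N_MACHINES = 64, 256, 8, 2

class ExpertHost:
    """Pretend remote machine that owns a subset of the experts."""
    def __init__(self, expert_ids, rng):
        # Each expert is a tiny 2-layer FFN; its weights never leave this host.
        self.ffn = {e: (rng.standard_normal((D_MODEL, D_FF)) * 0.02,
                        rng.standard_normal((D_FF, D_MODEL)) * 0.02)
                    for e in expert_ids}

    def forward(self, expert_id, hidden):          # receives hidden states only
        w_in, w_out = self.ffn[expert_id]
        return np.maximum(hidden @ w_in, 0.0) @ w_out

rng = np.random.default_rng(0)
hosts = [ExpertHost(range(i, N_EXPERTS, N_MACHINES), rng) for i in range(N_MACHINES)]
owner = {e: hosts[e % N_MACHINES] for e in range(N_EXPERTS)}

tokens = rng.standard_normal((16, D_MODEL))         # hidden states before the FFN
router = rng.standard_normal((D_MODEL, N_EXPERTS))  # toy top-1 router
assignments = (tokens @ router).argmax(axis=1)

out = np.empty_like(tokens)
for e in range(N_EXPERTS):                           # the "all-to-all": ship only
    mask = assignments == e                          # the tokens routed to expert e
    if mask.any():
        out[mask] = owner[e].forward(e, tokens[mask])
# `out` holds the hidden states after the FFN; that's all that needs syncing back.
```

No layer is waiting on another machine's layer here; the machines work on different tokens at the same time.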
You are also of course correct about batching - which is where the multi-GPU paradigm actually shines - in servicing multiple inferences at once, even if any particular inference is still limited by the performance of a single card. You, as a person, with your 2 B100s, or 7 RTX3090s, are not going to be helped by that expansion.
If you're not a moron and you run speculative decoding, YES, you WILL benefit from that performance. Even in the deeply-abnormal "single-user-issuing-queries-consecutively" scenario. Speculative decoding in effect creates batching from a single prompt.
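In case the mechanism isn't obvious, here's a toy sketch (stand-in models and a greedy accept rule - not any library's actual API): the cheap draft model guesses K tokens, the big model scores all K positions in one pass, and you keep the agreeing prefix. One prompt, batched work.

```python
# Toy speculative decoding. The "models" are made-up stand-ins for illustration.
import random

def draft_model(ctx):                  # cheap, sometimes-wrong guesser
    return (sum(ctx) + 1) % 10 if random.random() < 0.8 else random.randrange(10)

def target_model(ctx):                 # the "real" next-token rule
    return (sum(ctx) + 1) % 10

def speculative_step(ctx, k=4):
    # 1) Draft k tokens sequentially with the cheap model.
    guesses, tmp = [], list(ctx)
    for _ in range(k):
        t = draft_model(tmp)
        guesses.append(t)
        tmp.append(t)
    # 2) Verify all k positions with the target model. Here it's a Python loop;
    #    on real hardware this is ONE batched forward pass over k positions.
    verified = [target_model(ctx + guesses[:i]) for i in range(k)]
    # 3) Keep the target's token at every position until the first disagreement
    #    (so even a total miss still yields one correct token per step).
    accepted = []
    for g, v in zip(guesses, verified):
        accepted.append(v)
        if g != v:
            break
    return ctx + accepted

ctx = [3, 1, 4]
for _ in range(5):
    ctx = speculative_step(ctx)
print(ctx)   # multiple tokens per big-model step whenever the draft is right
```

The verification pass is exactly the kind of batched, compute-heavy work a single user generates on their own box.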
I'll repeat: you keep comparing naive inference approaches as if the year were 2019 and no modern research on fast inference had been done. It's frankly embarrassing.