Comment David Hume would like a word (Score 1) 101

The United States has won far more Nobel Prizes in physics, chemistry, and medicine than any other nation

That the US has won more prizes in the past says nothing about how many it will win in the future.

If one were to make a prediction, one would hazard that the current Trump policies on education will lead to poorer-quality university education and research, with top-quality researchers moving elsewhere (France, for one, is inviting US scientists). Certainly, I think that many considering moving to the US for post-doc or higher positions will be re-examining the advisability of that move.

Would you want to work in the US if you worked in atmospheric physics, vaccine technology or "woke" science more generally?

I came across a couple of accusations against Einstein recently: one accused him of doing "Jewish", non-Aryan science, while the other accused him of "physical idealism" and failing to follow dialectical materialist principles.

Comment Re:Beware of Pooh's Bearing gifts (Score 1) 90

Look, if you want an olive branch here: If you're looking for a local machine for inference of large models for under $10k instead of tens to hundreds of thousands of dollars... yeah, the M3 ultra IS a good option. I do not object to this - at all.

What I object to is the nonsensical claim that it is "fast" or "efficient" compared to modern NVidia servers. It is not. At all. Unless you're making lazy, contrived scenarios, that is.

Comment Re:Beware of Pooh's Bearing gifts (Score 1) 90

First, summary != article.

Hey, let's play a little game called "scroll up in the thread": "That said, a lot of this article summary is nonsensical hype"

Literally my very first post in the thread.

That said, everything in the summary is from the article, including that quote, so it doesn't matter which one is referred to.

You're doing it again.
Confusing compute with memory bandwidth.

I'm not "confusing" anything. As was laid out in detail above, compute is maxed in actual real-world usage. Which is the reason why this hardware is made with such extreme compute capabilities.

You brought up an irrelevant data point, and I pointed out the stupidity of it.

It is precisely the topic of the thread that the M3 has the computational performance of a potato. In properly run, real-world scenarios, compute capacity absolutely is critical - which is why servers designed for AI tasks have such immense compute capacity to begin with.

You can run R1, period, in 200W.

You "can" run R1 on 20W. That doesn't make it either fast or efficient. This is a thread about performance and efficiency ,as a result of a summary about performance and efficiency, as a result of an article about performance and efficiency.

It's not a naive parallelization approach- it's a simple fact. A network must be evaluated sequentially. The layers must be split between the 2 cards, and you cannot evaluate layer 2 until you have evaluated layer 1.

I *literally described to you different forms of parallelization and their optimizations*, and you keep posting as if that never happened. Pipeline parallelization by layers is NOT the only way to distribute a model across multiple servers. And MoEs CAN distribute whole experts to individual machines so that only the hidden states before and after the FFN need to be synced.

There are numerous libraries (seemingly growing by the day) for managing parallelization. It is NOT, I repeat, NOT, just "let's put these layers on machine 1, and these other layers on machine 2".
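
To make the expert-parallel point concrete, here's a minimal single-process toy in Python (all sizes, names, and the routing scheme are made up for illustration, not taken from any real framework): each "machine" owns a disjoint subset of the experts, and the only thing that would ever cross a machine boundary is the hidden states going into and coming out of the FFN block.

    # Toy sketch of MoE expert parallelism (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)

    HIDDEN = 64        # hidden state width (made up)
    N_EXPERTS = 8      # total experts in the MoE layer
    N_MACHINES = 2     # experts are sharded across these "machines"
    TOP_K = 2          # experts consulted per token

    # Each expert is a tiny FFN: hidden -> 4*hidden -> hidden
    experts = [
        (rng.standard_normal((HIDDEN, 4 * HIDDEN)) * 0.02,
         rng.standard_normal((4 * HIDDEN, HIDDEN)) * 0.02)
        for _ in range(N_EXPERTS)
    ]
    per_machine = N_EXPERTS // N_MACHINES

    def owner(expert_id):
        # Machine m owns experts [m * per_machine, (m + 1) * per_machine)
        return expert_id // per_machine

    router = rng.standard_normal((HIDDEN, N_EXPERTS)) * 0.02

    def machine_forward(machine_id, tokens, expert_ids):
        # Run only the locally-owned experts; inputs and outputs are just
        # hidden states, which is all that would need to be exchanged.
        out = np.zeros_like(tokens)
        for i, (tok, eid) in enumerate(zip(tokens, expert_ids)):
            if owner(eid) != machine_id:
                continue  # some other machine handles this (token, expert) pair
            w1, w2 = experts[eid]
            out[i] = np.maximum(tok @ w1, 0.0) @ w2  # ReLU FFN
        return out

    def moe_layer(tokens):
        # Route each token to its top-k experts, then "send" hidden states to
        # the owning machines and sum the returned hidden states.
        logits = tokens @ router
        topk = np.argsort(-logits, axis=1)[:, :TOP_K]
        combined = np.zeros_like(tokens)
        for k in range(TOP_K):
            eids = topk[:, k]
            for m in range(N_MACHINES):      # in reality: concurrent machines
                combined += machine_forward(m, tokens, eids)
        return tokens + combined             # residual connection

    x = rng.standard_normal((16, HIDDEN))    # 16 tokens
    print(moe_layer(x).shape)                # (16, 64)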

You are also of course correct about batching- which is where the multi-GPU paradigm actually shines- in service multiple inferences at once, even if any particular inference is still limited by the performance of a single card. You, as a person, with your 2 B100s, or 7 RTX3090s, are not going to be helped by that expansion

If you're not a moron and you run speculative decoding, YES, you WILL benefit from that performance. Even in the deeply-abnormal "single-user-issuing-queries-consecutively" scenario. Speculative decoding in effect creates batching from a single prompt.
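
For anyone following along, here's a toy sketch of the verification step (purely illustrative: the "models" are stand-in functions, not real networks): a cheap draft model proposes a few tokens, and the expensive target model scores all of them in one forward pass over the extended sequence, which is the self-batching effect described above.

    # Toy sketch of speculative decoding's draft-then-verify loop.
    import numpy as np

    VOCAB = 50

    def draft_next(token):
        # Cheap draft model: deterministic toy next-token guess.
        return (token * 7 + 3) % VOCAB

    def target_logits(context):
        # Expensive target model, evaluated on ALL positions of `context`
        # at once; this one batched pass replaces k sequential passes.
        h = np.outer(np.asarray(context) + 1, np.linspace(-1, 1, VOCAB))
        return np.sin(h)  # stand-in for a real transformer forward

    def speculative_step(prefix, k=4):
        # 1) Draft k tokens autoregressively with the cheap model.
        drafted = []
        t = prefix[-1]
        for _ in range(k):
            t = draft_next(t)
            drafted.append(t)
        # 2) Verify all k drafts with a single target forward pass.
        logits = target_logits(prefix + drafted)
        accepted = []
        for i, tok in enumerate(drafted):
            pos = len(prefix) - 1 + i            # row predicting drafted[i]
            if int(np.argmax(logits[pos])) == tok:
                accepted.append(tok)             # greedy-accept matching drafts
            else:
                accepted.append(int(np.argmax(logits[pos])))  # target's token
                break
        return prefix + accepted

    print(speculative_step([1, 2, 3]))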

I'll repeat: you keep comparing naive inference approaches as if the year were 2019 and no modern research on fast inference had been done. It's frankly embarrassing.

Comment Re:As my humble zero-analysis Dunning-Kruger take. (Score 1) 109

To me, your "better to have two independent mechanisms that just-so exist with just-so parameters, than one mechanism driven by a more complex underlying process", comes across as the same as saying:

"Hmm, when I drop this rock it goes down but when I release this balloon it goes up... rather than trying to unify the two, which would have to deal with things like density and interactions with the surrounding atmosphere, I'll just say there are two separate forces, one which pulls rocks down and one which lifts balloons up".

Or:

"Hmm, when I roast these nuts, they turn sweet, but when I roast this sugar, it turns bitter. Rather than trying to understand the chemistry behind the Maillard reaction, I'll just define an equation that describes the sweetening of nuts and a different one that describes the bittering of sugar."

Your approach is not just deeply conceptually unsatisfying; it runs counter to the goals of physics research throughout its entire history. We live in a world full of complex processes whose net effects emerge from the interactions of their components.

Comment Re:As my humble zero-analysis Dunning-Kruger take. (Score 1) 109

That's an impressive straw man there. The entire theory of quintessential inflation is that it didn't turn off - that its intensity just declined by many orders of magnitude with density/time, and continues to decay with density/time. You're the one inserting "turned off" into the picture.

(And yes, just to head you off, there are numerous papers on reheating with respect to quintessential inflation, and there's a surprisingly large number of viable mechanisms)

And yes, electroweak unification should not have been accepted if it was just an arbitrary mashing together of two things without any evidence.

Again, impressive straw man work there. Nothing - not disjoint inflation vs. dark energy, nor unified inflation with dark energy - should be "accepted" "without evidence". But nobody serious rejected the search for a mechanism linking the two just because "it involves a more complex interaction than just having a few disjoint parameters", which is the argument you've been pushing to conceptually reject quintessential inflation.

Comment Re:Beware of Pooh's Bearing gifts (Score 1) 90

I fully get that you want to ignore the compute capability of the M3 entirely and avoid discussing it at all costs, because it's embarrassingly slow by the standards of AI tasks - and yes, this VERY much matters in the real world. If I were trying to argue your side, I'd likewise be avoiding any discussion of how few FP4 FLOPS the M3 has.

Comment Re:Beware of Pooh's Bearing gifts (Score 1) 90

The super-fast came from your imagination.

Huh, I must have imagined that the summary said "The new DeepSeek-V3-0324 in 4-bit runs at > 20 tokens/second on a 512GB M3 Ultra with mlx-lm!" as if this were some sort of extreme performance figure. They even included an exclamation point for good measure. Or did I imagine that too?

As for efficiency? I don't think that can be reasonably argued. It is vastly more efficient.

Utter nonsense. It has a literal order of magnitude worse fp4 TFLOPS per watt.

The mention of 3090 was only to have a flops comparison point as to how poor the performance of the M3 studio is.

It's not poor at all, particularly in the context that it can do things you need 7 3090s to do.

YOU are the only person here suggesting the absurd notion of using 7 3090s. The 3090 was brought up only to give a grounding of the level of compute power - NOT as a VRAM comparison, NOT as a "suggested alternative implementation". The fact that this has been pointed out to you multiple times, and yet you persist, has moved far into straw man territory. You've decided what scenario you actually want to argue about - a scenario that was never suggested - and keep arguing about it rather than defending the simply false claim that the M3 is higher performance than modern NVidia servers.

Was the article inarticulate in what the critical difference really is? Of course.

"Inarticulate" is a kind way to spell "wrong".

You do NOT get a mere 20 tokens per second at fp4 precision on Deepseek on Nvidia servers that consume kilowatts of power.

You don't.

Then how can you avoid reaching the conclusion that the author's comparison of the power consumption of the two is absurd?

For our similarly outfit B200 system, we'll need a minimum of 2 B200s (192GB a piece).
Layer offloading is sequential, so we won't be able to leverage the performance of both of them, so for all that power and bandwidth, we're still limited to the token crunching performance of a single B200- roughly 10x that of an M3 Ultra.

Beyond the fact that this is a naive parallelization approach, it presumes zero batching. In the real world, many batches are processed concurrently. Matrix ops scale with batch size, which is why servers are designed with such an emphasis on TFLOPS rather than memory size. You very much *do* actually utilize the 2 orders of magnitude more flops. And if you weren't doing that, then you wouldn't be drawing the full power consumption either.
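
A back-of-envelope sketch of why (sizes are illustrative, not tied to any particular model): the weight matrix is read once per matmul regardless of batch size, while the arithmetic grows linearly with the batch, so larger batches push you from memory-bound toward compute-bound - exactly what big-TFLOPS servers are built to exploit.

    # Arithmetic intensity of x @ W as a function of batch size (toy numbers).
    import numpy as np

    D_IN, D_OUT = 4096, 4096
    W = np.random.default_rng(0).standard_normal((D_IN, D_OUT)).astype(np.float32)

    def cost(batch):
        flops = 2 * batch * D_IN * D_OUT     # multiply-adds for x @ W
        weight_bytes = W.nbytes              # weights read once, regardless of batch
        return flops, flops / weight_bytes   # flops per byte of weights read

    for b in (1, 8, 64, 512):
        flops, intensity = cost(b)
        print(f"batch={b:4d}  GFLOPs={flops / 1e9:8.2f}  flops/byte={intensity:6.1f}")
    # batch=1 is heavily memory-bound (few flops per byte of weights);
    # large batches keep the ALUs busy instead of the memory bus.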

Even if we take batching (aka, the real world) out of the equation, that again is a naive parallelization approach - you're acting like layer parallelism is the only approach, when it's the least efficient way to go about it. Basic tensor parallelism, well implemented, is generally much faster. Beyond that, with MoEs, you can do expert parallelism, localizing specific experts to individual servers and needing only to sync the output hidden states. There are also various just-in-time asynchronous data transfer methods (like those used in DualPipe). And then there's speculative decoding, which in effect creates self-batching, so even if you're only serving individual consecutive requests (not a mainstream serving task), you still benefit from the utilization efficiency of batching.
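
As a minimal illustration of the tensor-parallel case (toy sizes, a single process standing in for multiple GPUs): split a weight matrix by columns, let each "device" compute its own slice of the output, and gather the slices. Every layer gets parallelized, unlike a pipeline split where one device idles while the other works.

    # Column-split tensor parallelism for a single matmul (illustrative only).
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 1024))            # 4 tokens, hidden size 1024
    W = rng.standard_normal((1024, 4096))         # one FFN weight matrix

    shards = np.split(W, 2, axis=1)               # "device 0" and "device 1"
    partials = [x @ w for w in shards]            # computed concurrently in reality
    y_parallel = np.concatenate(partials, axis=1) # all-gather of output slices

    assert np.allclose(y_parallel, x @ W)         # same result as the full matmul
    print(y_parallel.shape)                       # (4, 4096)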

You're arguing for a nonsensical scenario.

Comment Re:But ... (Score 1) 41

Citizens United passed the Supreme Court 5-4 under Obama.

And Obama condemned the decision almost immediately in his State of the Union address. The President doesn't inherently control the Supreme Court (and in principle shouldn't, because of the separation of powers, although with strong support in the Senate they can pack it).

Comment Re:Beware of Pooh's Bearing gifts (Score 1) 90

Are you having fun straw manning? This was about the article's claim that Deepseek + Mac M3 studio was some sort of super fast, efficient way to run a model. The mention of 3090 was only to have a flops comparison point as to how poor the performance of the M3 studio is. Nobody is saying that you can fit DeepSeek onto a 3090; that's a complete derailing of the topic. The topic is about how M3 studio's performance compares to NVidia servers that use "several kilowatts of power". Aka, not a gaming card like the 3090.

You do NOT get a mere 20 tokens per second at fp4 precision on Deepseek on Nvidia servers that consume kilowatts of power. They have literally orders of magnitude higher performance. The statement in the summary was absolutely wrong. Stop defending things that are wrong via straw-man topic changes.

Comment Re:Beware of Pooh's Bearing gifts (Score 3, Insightful) 90

That physically can't happen when run locally. That said, it does contain the Great Firewall of China, rather crudely shoehorned into the finetune (to the point where it sometimes suddenly switches from first person to using the phrase "We" when speaking from the perspective of the CCP).

That said, a lot of this article summary is nonsensical hype. For example:

"While traditional AI infrastructure typically relies on multiple Nvidia GPUs consuming several kilowatts of power, the Mac Studio draws less than 200 watts during inference."

If you're running on an NVidia server that takes kilowatts of power, you're going to get WAY better performance than 20 tokens per second. A B200 is 20 petaflops (20000 teraflops) at fp4 precision. The M3 Ultra is 115 teraflops at fp16 (AFAIK it doesn't run lower precisions any faster than fp16) - about 25% slower than an outdated Nvidia RTX 3090 gaming card. These things are not the same.
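
For scale, a back-of-envelope ratio using just the peak numbers quoted above (nominal spec-sheet figures, not measured throughput):

    b200_fp4_tflops = 20_000          # 20 petaflops at fp4, per the figure above
    m3_ultra_fp16_tflops = 115        # M3 Ultra peak at fp16, per the figure above
    print(b200_fp4_tflops / m3_ultra_fp16_tflops)   # ~174x nominal peak advantage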
