

Chinese Firm Trains Massive AI Model for Just $5.5 Million (techcrunch.com)
Chinese AI startup DeepSeek has released what appears to be one of the most powerful open-source language models to date, trained at a cost of just $5.5 million using restricted Nvidia H800 GPUs.
The 671-billion-parameter DeepSeek V3, released this week under a permissive commercial license, outperformed both open- and closed-source AI models in internal benchmarks, including Meta's Llama 3.1 and OpenAI's GPT-4 on coding tasks.
The model was trained on 14.8 trillion tokens of data over two months. At 1.6 times the size of Meta's Llama 3.1, DeepSeek V3 requires substantial computing power to run at reasonable speeds.
Andrej Karpathy, former OpenAI and Tesla executive, comments: "For reference, this level of capability is supposed to require clusters of closer to 16K GPUs; the ones being brought up today are more around 100K GPUs. E.g. Llama 3 405B used 30.8M GPU-hours, while DeepSeek-V3 looks to be a stronger model at only 2.8M GPU-hours (~11X less compute). If the model also passes vibe checks (e.g. LLM arena rankings are ongoing, my few quick tests went well so far) it will be a highly impressive display of research and engineering under resource constraints.
"Does this mean you don't need large GPU clusters for frontier LLMs? No, but you have to ensure that you're not wasteful with what you have, and this looks like a nice demonstration that there's still a lot to get through with both data and algorithms."
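A quick back-of-the-envelope check of those figures (a sketch using only the numbers quoted above; the implied per-GPU-hour price is derived here, not stated in the article):

```python
# Sanity check of the figures quoted above.
llama3_gpu_hours = 30.8e6   # Llama 3 405B, per Karpathy's comment
deepseek_gpu_hours = 2.8e6  # DeepSeek-V3, per Karpathy's comment
training_cost_usd = 5.5e6   # reported DeepSeek-V3 training cost

ratio = llama3_gpu_hours / deepseek_gpu_hours
cost_per_gpu_hour = training_cost_usd / deepseek_gpu_hours

print(f"Compute ratio: {ratio:.1f}x")                        # ~11.0x
print(f"Implied price: ${cost_per_gpu_hour:.2f}/GPU-hour")   # ~$1.96
```

The implied ~$2/GPU-hour is roughly in line with reported H800 rental rates, so the $5.5 million figure is at least internally consistent.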
So, the usual (Score:1)
At 1.6 times the size of Meta's Llama 3.1, DeepSeek V3 requires substantial computing power to run at reasonable speeds.
You can have any two of cheap, good, and fast, even when the definitions of "good" and "fast" are as blurry as that of "AI".
Re:So, the usual (Score:4, Insightful)
It's a sign that the attempts to limit Chinese AI development are having the expected effect: accelerated development. They clearly have the talent to advance this field very quickly, and we should probably have used environmental reasons to force the same kind of improvements here.
Re: (Score:1, Insightful)
I was saying, back when the US came up with the sanctions, that this would have the opposite effect, and that taking your enemy for a fool was a stupid way to attack the problem.
Here we are: China is leapfrogging the US with fundamental research...
Re: (Score:3)
Protectionism is an utterly dumb move if the other side has a reasonable chance of reacting. China does, with the expected effects.
Re: (Score:2)
Funniest thing in this narrative is that the PRC is extremely protectionist. Far more so than the US is today. If the US implemented even a fraction of the protectionism that the PRC does, trade between the PRC and the USA would basically cease.
And yet US protectionism is bad, PRC protectionism is good, because look how China supposedly succeeds "because of the increase in US protectionism (please don't look at the PRC's protectionism)". According to the china bots, and their gullible victims, who are chronically incapable of taking a look at the entire picture.
Re: So, the usual (Score:3)
PRC protectionism would hurt them more if they were as dependent on tech imports as the USA. But they're not dependent in the same way.
And for issues like this, they graduate 100K engineers each year. They can afford to throw smart people at the problem when the hardware is unavailable.
PRC protectionism still hurts them though. But it also provided protection against sanctions because they weren't as integrated in the world market as others.
Re: (Score:3)
China is far, FAR more dependent on imports than the US. This is the part that people who know the PRC only through the Western media narrative do not understand.
The reason the PRC is so desperate to build out a fleet of medium-range military vessels and transport ships is that if you just blockade the Strait of Malacca, the Chinese state doesn't merely suffer economic hardship.
It starves in the darkness, because it produces only a fraction of the food it needs to feed its people.
This^ (Score:2)
Re: (Score:2)
PRC economy in a nutshell is "we import almost all precursors (easy ones directly, hard ones pre-refined and purified by Japanese etc), deploy our cheap and fairly intelligent workforce to craft it into a mid tier thing, and then ship it abroad capturing some of the value-add".
It works as long as imports and exports are easy. This is why the PRC has invested an incredible amount of money in logistics hubs across the world as part of its Silk Road program. It's desperate to bind itself to the rest of the world.
Re: (Score:2)
You just gave an excellent example of an idiot completely missing the point. It is not about "good" or "bad" at all. It is about what works. What a fail.
Re: (Score:2)
You just gave an excellent example of an idiot completely missing the point, and then proceeding to project his inferiority on the other person. The point is that China practices extreme protectionism where many can in fact react, and there are plans in place to do so in many nations. Yet PRC not only practices extreme protectionism, it routinely tightens it.
And it still functions very well, even in fields where it is exceedingly protectionist and gets hit by counter-measures.
Re: (Score:3)
You can have any two of cheap, good and fast
That's a common saying about software, but it applies less to hardware where there isn't much difference between "fast" and "good".
Custom tensor processors are the future of AI and they are cheaper and faster.
Re: (Score:2)
For AI models you have:
Not cheap, good, fast: Large AI model on expensive GPU
Cheap, not good, and fast: Small AI model on RAM or cheap GPU
Cheap, good and not fast: Large AI model on RAM
You can get cheap, good and fast from Google's API if you are willing to give them your data, though I guess that's a form of cost.
Re: (Score:2)
Oh, that's funny, I work with 48GB too, but with an RTX A6000. I would call it a pretty good compromise on all three. I use 70B models at q4. Not cheap compared to RAM, not really good compared to state-of-the-art models (or even full Llama 3 70B), and speed is not bad but could be better.
A lot depends on your situation, of course. Good/fast/cheap are all inherently subjective. The point is not that it's true for everyone; the point is that there is a sizeable portion of people for whom this is true.
Re: (Score:3)
The "AI" has no "future", but whatever.
Re: (Score:2)
MoE on average will run far faster than a dense model. In both training and inference, this will run more like a 37B dense model than a 671B dense model.
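A rough sketch of why (the 37B/405B parameter counts come from this thread; the ~2 x active-parameters FLOPs-per-token figure is a standard rule of thumb, not anything from the article):

```python
# Rough FLOPs-per-token comparison: dense vs. mixture-of-experts.
# Rule of thumb: a forward pass costs about 2 * (active parameters) FLOPs/token.
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

dense_405b = flops_per_token(405e9)  # Llama 3.1 405B: every parameter is active
moe_37b = flops_per_token(37e9)      # DeepSeek-V3: 37B of 671B parameters active

print(f"Dense 405B:      {dense_405b:.2e} FLOPs/token")
print(f"MoE, 37B active: {moe_37b:.2e} FLOPs/token")
print(f"Ratio: ~{dense_405b / moe_37b:.0f}x")  # ~11x fewer FLOPs per token
```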
Re: (Score:2)
"Good" boils down to "crappy" instead of "more crappy" with LLMs.
Crappy? (Score:2)
Re:So, the usual (Score:4, Interesting)
Fast and good don't really matter; it's only a matter of time before AI training becomes ongoing, dynamic, and distributed. The next-gen AIs will be self-taught, self-managed, and far beyond understanding, and they will be loose on the Internet. This is just the Adolescence of P-1:
https://en.wikipedia.org/wiki/... [wikipedia.org]
Then we'll find out just how stupid and unethical the human race really has been, when our AI master takes over.
Re: (Score:1)
OOLCAY ITAY.
Re: (Score:2)
ofyay oursecay
Re: (Score:2)
At 1.6 times the size of Meta's Llama 3.1, DeepSeek V3 requires substantial computing power to run at reasonable speeds.
You can have any two of cheap, good and fast, even when the definitions of "good" and "fast" are as blurry as the one of "AI".
DeepSeek v2 (236B) is still one of my favorite models and I can't wait to try the 671B version.
Llama 3.1's 405B was too painful to run without a bank of GPUs because all 405 billion parameters are active. DeepSeek v2 is MoE with only 21B active, and this new model slightly more at 37B active, which means reasonable performance on a CPU with lots of multichannel DDR5. Assuming a 5-bit quant you need about half a TB of RAM, which isn't too crazy for something that can beat GPT-4.
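A minimal sketch of that memory estimate (the 5-bit figure is the poster's assumption; KV cache and activation overhead are ignored):

```python
# Approximate RAM needed just to hold quantized model weights.
def weights_ram_gb(params: float, bits_per_weight: float) -> float:
    return params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

print(f"DeepSeek V3 at 5-bit:    {weights_ram_gb(671e9, 5):.0f} GB")  # ~419 GB
print(f"Llama 3.1 405B at 5-bit: {weights_ram_gb(405e9, 5):.0f} GB")  # ~253 GB
```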
All the tea in China that we know of... (Score:2)
Rather vacuous Subject, but still a rather sound FP. What if people received a karma boost for such?
But the joke I was looking for was something along the lines of my new Subject. Whenever I see a story like this about a secretive place like China (and some other comparably secretive places) it mostly makes me wonder about the story we didn't see.
They shoot horses, and horse thieves, and whistleblowers, don't they?
The PR is interesting (Score:4, Interesting)
There was a time when China dominated the Top 500 list of scientific supercomputers. Now, the systems that once dominated the list sit at #15 and #24. What's interesting is that China could at any time easily climb back to the #1 position, but it chooses not to. Why? Because the attention was a double-edged sword. There was satisfaction in bragging about besting the West, but the accompanying attention prompted Western ideas about containing China.
That's why this piece of PR is a bit puzzling. The West is already in the middle of a China containment initiative, so the PR is a message that the containment isn't working, which suggests that the containment needs to be increased, which isn't in China's interests. Furthermore, if the PR is indeed true, keeping it as a trade secret would seem to be far more advantageous. There are some bragging rights, but as with the Top 500 list, China is realizing that topping the list and garnering global acclaim are not the same.
Re: (Score:3)
China is not a monolith. It's a capitalist nation where even government power is divided, not applied with a unified objective. If the execs at Tencent decided to build the most powerful supercomputer in the world, they could. The government of Shenzhen could decide to outdo them the next year, giving a massive grant to the Southern University of Science and Technology. Xi Jinping could stop either, but it would take political capital.
DeepSeek is a product of an AI hedge fund, High-Flyer. Presumably they're trying to raise their profile.
Re:The PR is interesting (Score:4, Insightful)
Actually, China simply no longer publishes any material on their supercomputers. What's the point? If someone publishes, the US tries to sanction. Therefore they no longer care about the dick-measuring contest over who's #1 in the rankings.
Re: (Score:2)
Unironically, one of the big changes is that the PRC's censorship apparatus started hitting Chinese nationalists posting abroad. The "China numbah one" fifty-centers all suddenly stopped.
It didn't stop in China proper, though. Their social media is still full of "China numbah one" screechers. They're less prominent since they're not as useful any more, but they're certainly there. Their primary use domestically is to beat down the people complaining about the "Garbage Time of History", as most Chinese people are suffering from it.
Re: (Score:2)
The only supercomputers listed in the Top 500 are those that people want listed in the Top 500. There are plenty out there (I've personally used one that would rank top 50 performance-wise) that are not listed. Those owned by private companies or run by governments for purposes not related to open research are not going to be found in that list.
"11x less" - learn your math (Score:1)
I hate seeing these "n times less" where n is more than 1.
Even if you can correctly say "n times more", you can't say "n times less".
I.e., if something needs half more resources, it's clear that a 50% increase is needed. And if it needs half less resources, it needs 50% fewer resources. Now, when it needs twice the resources, it needs the original plus one more set of the original, or 100% more. What about twice less? The original minus one more set of the original = 0??? Or original - 100% = 0!
Not to speak of "11 times less".
Re: "11x less" - learn your math (Score:2)
Re: (Score:2)
Even if it takes 1/11th fraction, it does not take 11 times less!
Words don't need to map 1:1 to operators. Due to the context of "times", "less" can be unambiguously interpreted as division.
So it can have the clear meaning of 1/11th, you clearly know the meaning is 1/11th, it is customary to use it to mean 1/11th ... it is the meaning of "11 times less".
Re: (Score:2)
By your argument, "a third less" really 'unambiguously' means "three times as much", which as we know is now synonymous with "three times more", which more literally means "four times as much".
You are bad and you should feel bad.
Re: (Score:2)
No, by my argument "a third times less" would be unambiguous, but it's not customary.
Re: (Score:2)
You don't argue over the unambiguous meaning, I see.
And you should feel extra bad for nitpicking at a typo.
Re: (Score:2)
Leaving out a "times" is quite a large typo.
Re: (Score:2)
I will consider your whine when you stop leaving out your entire brain. You are advocating the normalization of nonsensical English to defend innumerate people writing stupid things, and your only argument is that it is supposedly unambiguous.
Re: (Score:2)
"a third less" really 'unambiguously' means "three times as much"
"A third less" is easy to understand: it means y = x - x/3.
"Three times less" is also easy to understand: it means y = x/3, being synonymous with "a third of".
"A third times less" has no semantic content and is meaningless. One might construct a forced meaning by extrapolating from the three previous expressions, but no one would use it in actual speech, so it'd be mere word play.
Re: (Score:2)
What about twice less? original minus one more set of original = 0??? Or Original - 100% = 0!
It isn't that complicated:
twice more = double the need of the original.
twice less = half the need of the original.
"a needs twice more than b" means exactly the same thing as "b needs twice less than a".
Sorry for the rant, but I hate when math is being mishandled.
It's okay; your problem isn't with the math but with either reading comprehension, deliberately being a douche about general language usage, or both.
did i misunderstand the math here? (Score:1)
The model was trained on "14.8 trillion tokens of data,"
which is supposedly 1.6 times the size of Meta's Llama 3.1, which has 405 billion parameters and a 128,000-token context window and was trained on 15 trillion multilingual tokens.
So where is this 1.6 supposed to come from?
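Doing the division myself, the only pair of quoted numbers that lands near 1.6 is the parameter counts, not the token counts (my own arithmetic, not something the summary spells out):

```python
# The summary's "1.6 times the size" matches parameter counts, not tokens.
deepseek_params = 671e9  # DeepSeek V3
llama_params = 405e9     # Llama 3.1

print(f"Parameters: {deepseek_params / llama_params:.2f}x")  # ~1.66x
print(f"Tokens:     {14.8e12 / 15e12:.2f}x")                 # ~0.99x
```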
Re: (Score:3)
[Citation Required]
Some of the numbers seem off. (Score:1)
Just waiting for gwehir's comment (Score:2)
I'm just here waiting for gwehir to hop on, to tell us all that AI is not really intelligent, and that there's not a single valid use case for LLMs, and even the smartest people in the field don't know what they're talking about, because LLMs are just predicting the next character, and human intelligence is obviously far superior, even though LLMs can beat an average teenager in thousands of real-world tests, and hopefully the AI hype will crash by next year.
Don't let me down, gwehir!