

Chinese Firm Trains Massive AI Model for Just $5.5 Million (techcrunch.com)
Chinese AI startup DeepSeek has released what appears to be one of the most powerful open-source language models to date, trained at a cost of just $5.5 million using restricted Nvidia H800 GPUs.
The 671-billion-parameter DeepSeek V3, released this week under a permissive commercial license, outperformed both open- and closed-source AI models in internal benchmarks, including Meta's Llama 3.1 and OpenAI's GPT-4 on coding tasks.
The model was trained on 14.8 trillion tokens of data over two months. At 1.6 times the size of Meta's Llama 3.1, DeepSeek V3 requires substantial computing power to run at reasonable speeds.
Andrej Karpathy, former OpenAI and Tesla executive, comments: "For reference, this level of capability is supposed to require clusters of closer to 16K GPUs; the ones being brought up today are more around 100K GPUs. E.g. Llama 3 405B used 30.8M GPU-hours, while DeepSeek-V3 looks to be a stronger model at only 2.8M GPU-hours (~11X less compute). If the model also passes vibe checks (e.g. LLM arena rankings are ongoing, my few quick tests went well so far) it will be a highly impressive display of research and engineering under resource constraints.
"Does this mean you don't need large GPU clusters for frontier LLMs? No, but you have to ensure that you're not wasteful with what you have, and this looks like a nice demonstration that there's still a lot to get through with both data and algorithms."
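A quick back-of-the-envelope check of those figures (a sketch using only the numbers quoted above; the implied per-GPU-hour price is derived here, not stated in the article):

```python
# Sanity check of the figures quoted above.
llama3_gpu_hours = 30.8e6   # Llama 3 405B, per Karpathy's comment
deepseek_gpu_hours = 2.8e6  # DeepSeek-V3, per Karpathy's comment
training_cost_usd = 5.5e6   # reported DeepSeek-V3 training cost

ratio = llama3_gpu_hours / deepseek_gpu_hours
cost_per_gpu_hour = training_cost_usd / deepseek_gpu_hours

print(f"Compute ratio: {ratio:.1f}x")                        # ~11.0x
print(f"Implied price: ${cost_per_gpu_hour:.2f}/GPU-hour")   # ~$1.96
```

The implied ~$2/GPU-hour is roughly in line with reported H800 rental rates, so the $5.5 million figure is at least internally consistent.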
So, the usual (Score:1)
At 1.6 times the size of Meta's Llama 3.1, DeepSeek V3 requires substantial computing power to run at reasonable speeds.
You can have any two of cheap, good, and fast, even when the definitions of "good" and "fast" are as blurry as that of "AI".
Re:So, the usual (Score:4, Insightful)
It's a sign that the attempts to limit Chinese AI development are having the expected effect: accelerated development. They clearly have the talent to advance this field very quickly, and we should probably have used environmental reasons to force the same kind of improvements here.
Re: (Score:1, Insightful)
I was saying, back when the US came up with the sanctions, that this would have the opposite effect, and that taking your enemy for a fool was a stupid way to attack the problem.
Here we are: China is leapfrogging the US with fundamental research...
Re: (Score:3)
Protectionism is an utterly dumb move if the other side has a reasonable chance of reacting. China does, with the expected effects.
Re: (Score:2)
Funniest thing in this narrative is that the PRC is extremely protectionist. Far more so than the US is today. If the US implemented even a fraction of the protectionism that the PRC does, trade between the PRC and the USA would basically cease.
And yet US protectionism is bad, PRC protectionism is good, because look how China supposedly succeeds "because of the increase in US protectionism (please don't look at the PRC's protectionism)". According to the china bots, and their gullible victims, who are chronically incapable of taking a look at the entire picture.
Re: So, the usual (Score:3)
PRC protectionism would hurt them more if they were as dependent on tech imports as the USA. But they're not dependent in the same way.
And for issues like this, they graduate 100K engineers each year. They can afford to throw smart people at the problem when the hardware is unavailable.
PRC protectionism still hurts them though. But it also provided protection against sanctions because they weren't as integrated in the world market as others.
Re: (Score:3)
China is far, FAR more dependent on imports than the US. This is the part that people who know the PRC only through the Western media narrative do not understand.
The reason the PRC is so desperate to build out a fleet of medium-range military vessels and transport ships is that if you just blockade the Strait of Malacca, the Chinese state doesn't merely suffer economic hardship.
It starves in the darkness, because it produces only a fraction of the food it needs to feed its people.
This^ (Score:2)
Re: (Score:2)
PRC economy in a nutshell is "we import almost all precursors (easy ones directly, hard ones pre-refined and purified by Japanese etc), deploy our cheap and fairly intelligent workforce to craft it into a mid tier thing, and then ship it abroad capturing some of the value-add".
It works as long as imports and exports are easy. This is why the PRC has invested an incredible amount of money in logistics hubs across the world as part of its Silk Road program. It's desperate to bind itself to the rest of the world.
Re: (Score:2)
You just gave an excellent example of an idiot completely missing the point. It is not about "good" or "bad" at all. It is about what works. What a fail.
Re: (Score:2)
You just gave an excellent example of an idiot completely missing the point, and then proceeding to project his inferiority on the other person. The point is that China practices extreme protectionism where many can in fact react, and there are plans in place to do so in many nations. Yet PRC not only practices extreme protectionism, it routinely tightens it.
And it still functions very well, even in fields where it is exceedingly protectionist and gets hit by counter-measures.
Re: (Score:3)
You can have any two of cheap, good and fast
That's a common saying about software, but it applies less to hardware where there isn't much difference between "fast" and "good".
Custom tensor processors are the future of AI and they are cheaper and faster.
Re: (Score:2)
For AI models you have:
Not cheap, good, fast: Large AI model on expensive GPU
Cheap, not good, and fast: Small AI model on RAM or cheap GPU
Cheap, good and not fast: Large AI model on RAM
You can get cheap, good and fast from Google's API if you are willing to give them your data, though I guess that's a form of cost.
Re: (Score:2)
Oh, that's funny, I work with 48GB too, but with an RTX A6000. I would call it a pretty good compromise on all three. I use 70B models at q4. Not cheap compared to RAM, not really good compared to state-of-the-art models (or even full Llama 3 70B), and speed is not bad but could be better.
A lot depends on your situation, of course. Good/fast/cheap are all inherently subjective. The point is not that it's true for everyone; the point is that there is a sizeable portion of people for whom this is true.
Re: (Score:3)
The "AI" has no "future", but whatever.
Re: (Score:2)
MoE on average will run far faster than a dense model. In both training and inference, this will run more like a 37B dense model than a 671B dense model.
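A rough sketch of why (the 37B/405B parameter counts come from this thread; the ~2 x active-parameters FLOPs-per-token figure is a standard rule of thumb, not anything from the article):

```python
# Rough FLOPs-per-token comparison: dense vs. mixture-of-experts.
# Rule of thumb: a forward pass costs about 2 * (active parameters) FLOPs/token.
def flops_per_token(active_params: float) -> float:
    return 2 * active_params

dense_405b = flops_per_token(405e9)  # Llama 3.1 405B: every parameter is active
moe_37b = flops_per_token(37e9)      # DeepSeek-V3: 37B of 671B parameters active

print(f"Dense 405B:      {dense_405b:.2e} FLOPs/token")
print(f"MoE, 37B active: {moe_37b:.2e} FLOPs/token")
print(f"Ratio: ~{dense_405b / moe_37b:.0f}x")  # ~11x fewer FLOPs per token
```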
Re: (Score:2)
"Good" boils down to "crappy" instead of "more crappy" with LLMs.
Crappy? (Score:2)
Re:So, the usual (Score:4, Interesting)
Fast and good don't really matter; it's only a matter of time before AI training becomes ongoing, dynamic, and distributed. The next-gen AIs will be self-taught, self-managed, and far beyond understanding, and they will be loose on the Internet. This is just the Adolescence of P-1:
https://en.wikipedia.org/wiki/... [wikipedia.org]
Then we'll find out just how stupid and unethical the human race really has been, when our AI master takes over.
Re: (Score:1)
OOLCAY ITAY.
Re: (Score:2)
ofyay oursecay
Re: (Score:2)
At 1.6 times the size of Meta's Llama 3.1, DeepSeek V3 requires substantial computing power to run at reasonable speeds.
You can have any two of cheap, good and fast, even when the definitions of "good" and "fast" are as blurry as the one of "AI".
DeepSeek v2 (236B) is still one of my favorite models and I can't wait to try the 671B version.
Llama 3.1's 405B was too painful to run without a bank of GPUs because all 405 billion parameters are active. DeepSeek v2 is MoE with only 21B active, and this new model slightly more at 37B active, which means reasonable performance on a CPU with lots of multichannel DDR5. Assuming a 5-bit quant you need about half a TB of RAM, which isn't too crazy for something that can beat GPT-4.
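A minimal sketch of that memory estimate (the 5-bit figure is the poster's assumption; KV cache and activation overhead are ignored):

```python
# Approximate RAM needed just to hold quantized model weights.
def weights_ram_gb(params: float, bits_per_weight: float) -> float:
    return params * bits_per_weight / 8 / 1e9  # bits -> bytes -> GB

print(f"DeepSeek V3 at 5-bit:    {weights_ram_gb(671e9, 5):.0f} GB")  # ~419 GB
print(f"Llama 3.1 405B at 5-bit: {weights_ram_gb(405e9, 5):.0f} GB")  # ~253 GB
```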
All the tea in China that we know of... (Score:2)
Rather vacuous Subject, but still a rather sound FP. What if people received a karma boost for such?
But the joke I was looking for was something along the lines of my new Subject. Whenever I see a story like this about a secretive place like China (and some other comparably secretive places) it mostly makes me wonder about the story we didn't see.
They shoot horses, and horse thieves, and whistleblowers, don't they?
The PR is interesting (Score:4, Interesting)
There was a time when China dominated the Top 500 list of scientific supercomputers. Now, the systems that once dominated the list sit at #15 and #24. What's interesting is that China could at any time easily climb back to the #1 position, but it chooses not to. Why? Because the attention was a double-edged sword. There was satisfaction in bragging about besting the West, but the accompanying attention prompted Western ideas about containing China.
That's why this piece of PR is a bit puzzling. The West is already in the middle of a China containment initiative, so the PR is a message that the containment isn't working, which suggests that the containment needs to be increased, which isn't in China's interests. Furthermore, if the PR is indeed true, keeping it as a trade secret would seem to be far more advantageous. There are some bragging rights, but as with the Top 500 list, China is realizing that topping the list and garnering global acclaim are not the same.
Re: (Score:3)
China is not a monolith. It's a capitalist nation where even government power is divided, not applied with a unified objective. If the execs at Tencent decided to build the most powerful supercomputer in the world, they could. The government of Shenzhen could decide to outdo them the next year, giving a massive grant to the Southern University of Science and Technology. Xi Jinping could stop either, but it would take political capital.
DeepSeek is a product of an AI hedge fund, High-Flyer. Presumably they're trying to raise their profile.
Re:The PR is interesting (Score:4, Insightful)
Actually, China simply no longer publishes any material on their supercomputers. What's the point? If someone publishes, the US tries to sanction. Therefore they no longer care about the dick-measuring contest over who's #1 in the rankings.
Re: (Score:2)
Unironically, one of the big changes is that the PRC's censorship apparatus started hitting Chinese nationalists posting abroad. The "China numbah one" fifty-centers all suddenly stopped.
It didn't stop in China proper, though. Their social media is still full of "China numbah one" screechers. They're less prominent since they're not as useful any more, but they're certainly there. Their primary use domestically is to beat down the people complaining about the "Garbage Time of History", as most Chinese people are suffering from it.
Re: (Score:2)
The only supercomputers listed in the Top 500 are those that people want listed in the Top 500. There are plenty out there (I've personally used one that would rank top 50 performance-wise) that are not listed. Those owned by private companies or run by governments for purposes not related to open research are not going to be found in that list.
"11x less" - learn your math (Score:1)
I hate seeing these "n times less" where n is more than 1.
Even if you can correctly say "n times more", you can't say "n times less".
I.e., if something needs half more resources, it's clear that a 50% increase is needed. And if it needs half less resources, it needs 50% fewer resources. Now, when it needs twice the resources, it needs the original plus one more set of the original, or 100% more. What about twice less? The original minus one more set of the original = 0??? Or original - 100% = 0!
Not to speak of "11 times less".
Re: "11x less" - learn your math (Score:2)
Re: (Score:2)
Even if it takes 1/11th fraction, it does not take 11 times less!
Words don't need to map 1:1 to operators. Due to the context of "times", "less" can be unambiguously interpreted as division.
So it can have the clear meaning of 1/11th, you clearly know the meaning is 1/11th, it is customary to use it to mean 1/11th ... it is the meaning of "11 times less".
Re: (Score:2)
By your argument, "a third less" really 'unambiguously' means "three times as much", which as we know is now synonymous with "three times more", which more literally means "four times as much".
You are bad and you should feel bad.
Re: (Score:2)
No, by my argument "a third times less" would be unambiguous, but it's not customary.
Re: (Score:2)
You don't argue over the unambiguous meaning, I see.
And you should feel extra bad for nitpicking at a typo.
Re: (Score:2)
Leaving out a "times" is quite a large typo.
Re: (Score:2)
I will consider your whine when you stop leaving out your entire brain. You are advocating the normalization of nonsensical English to defend innumerate people writing stupid things, and your only argument is that it is supposedly unambiguous.
Re: (Score:2)
"a third less" really 'unambiguously' means "three times as much"
"A third less" is easy to understand: it means y = x - x/3.
"Three times less" is also easy to understand: it means y = x/3, being synonymous with "a third of".
"A third times less" has no semantic content and is meaningless. One might construct a forced meaning by extrapolating from the three previous expressions, but no one would use it in actual speech, so it'd be mere word play.
Re: (Score:2)
What about twice less? original minus one more set of original = 0??? Or Original - 100% = 0!
It isn't that complicated:
twice more = double the need of the original.
twice less = half the need of the original.
"a needs twice more than b" means exactly the same thing as "b needs twice less than a".
Sorry for the rant, but I hate when math is being mishandled.
It's okay; your problem isn't with the math but with either reading comprehension, deliberately being a douche about general language usage, or both.
did i misunderstand the math here? (Score:1)
The model was trained on "14.8 trillion tokens of data,"
which is supposedly 1.6 times the size of Meta's Llama 3.1, which has 405 billion parameters and a 128,000-token context window and was trained on 15 trillion multilingual tokens.
So where is this 1.6 supposed to come from?
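Doing the division myself, the only pair of quoted numbers that lands near 1.6 is the parameter counts, not the token counts (my own arithmetic, not something the summary spells out):

```python
# The summary's "1.6 times the size" matches parameter counts, not tokens.
deepseek_params = 671e9  # DeepSeek V3
llama_params = 405e9     # Llama 3.1

print(f"Parameters: {deepseek_params / llama_params:.2f}x")  # ~1.66x
print(f"Tokens:     {14.8e12 / 15e12:.2f}x")                 # ~0.99x
```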
Re: (Score:3)
[Citation Required]
Some of the numbers seem off. (Score:1)
Just waiting for gwehir's comment (Score:2)
I'm just here waiting for gwehir to hop on, to tell us all that AI is not really intelligent, and that there's not a single valid use case for LLMs, and even the smartest people in the field don't know what they're talking about, because LLMs are just predicting the next character, and human intelligence is obviously far superior, even though LLMs can beat an average teenager in thousands of real-world tests, and hopefully the AI hype will crash by next year.
Don't let me down, gwehir!