
Comment Re:Flash is costly? (Score 5, Informative) 37

Creating the training dataset is the *last* step. I have dozens of TB of raw data which I use to create training datasets that are only a few GB in size, and I'll have a large number of those sitting around at any point in time.

Take a translation task. I start with several hundred gigs of raw data. This inflates to a couple of terabytes after I preprocess it into indexed matching-pair datasets (for example, an article published in N different languages becomes N * (N-1) language pairs, so multilingual document sets from the UN, World Bank, EU, etc. inflate greatly). I may have a couple of different versions of this preprocessed data sitting around at any point in time. But once I have my indexed matching-pair datasets, I'll weighted-sample only a relatively small subset of them - stressing higher-quality data over lower-quality data and trying to ensure a desired mix of languages.
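To put numbers on the pair inflation and the weighted sampling described above, here is a minimal Python sketch. The function names and the example weights are hypothetical and only for illustration; this is not the actual pipeline.

```python
import random

def ordered_pair_count(n_languages: int) -> int:
    """Directed (source, target) language pairs for one document in N languages."""
    return n_languages * (n_languages - 1)

def weighted_sample(pairs, weights, k, seed=0):
    """Draw k items with replacement, favoring entries with higher weight.

    The weights stand in for a quality / language-mix score (an assumption).
    """
    rng = random.Random(seed)
    return rng.choices(pairs, weights=weights, k=k)

# A document published in 6 languages yields 6 * 5 = 30 directed pairs.
print(ordered_pair_count(6))
```

A document available in 24 languages would already give 24 * 23 = 552 pairs, which is why large multilingual corpora balloon during preprocessing.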

But what I do is nothing compared to what these companies do. They're working with Common Crawl. It grows at a rate of 200-300 TB per month. But the vast majority of that isn't going to go into their dataset. It's going to be markup. Inapplicable file types. Duplicates. Junk. On and on. You have to whittle it down to the things that are actually relevant. And in your various processing stages you'll have significant duplication. Indeed, even the raw training files... I don't know about them, but I'm used to working with JSON files, and that adds overhead on its own. Then during training there are various duplicates created for the various processing stages - tokenization, patching with flash attention, and whatnot.

You also use a lot of disk space for your models. It's not just every version of the foundation model you train (and your backups thereof) - and remember that enterprise models are hundreds of billions to trillions of FP16 parameters in their raw states - but especially the finetunes. You can make a finetune in like a day or so; these can really add up.
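As a rough sketch of why checkpoints add up: an FP16 parameter takes 2 bytes, so the raw weights alone scale as below. The function name is hypothetical, and the estimate deliberately ignores optimizer state, file-format overhead and compression.

```python
BYTES_PER_FP16_PARAM = 2  # FP16 is a 16-bit format, i.e. 2 bytes per parameter

def raw_weights_tb(n_params: float) -> float:
    """Approximate size of the raw FP16 weights in decimal terabytes."""
    return n_params * BYTES_PER_FP16_PARAM / 1e12

# Weights only, for a 200-billion and a 1-trillion parameter model.
for n in (200e9, 1e12):
    print(f"{n:.0e} params -> {raw_weights_tb(n):.2f} TB")
```

Multiply that by every saved version, backup and finetune to get a feel for the totals.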

Certainly disk space isn't as big of a cost as your GPUs and power. But it is a meaningful cost. As a hobbyist I use a RAID of six 20TB drives and one of two 4TB SSDs. But that's peanuts compared to what people working with Common Crawl and having hundreds of employees each working on their own training projects will be eating up in an enterprise environment.

Comment Putting numbers into perspective (Score 3, Interesting) 129

This is all to produce a peak of 240k EVs per year. Production "starts" in 2028. It takes years for a factory to hit full production. Let's be generous and say 2030.

Honda sold 1.3 million vehicles in the US alone last year - let alone all of North America, including both Canada and Mexico. If all those EVs were just for the US it'd be 18% of their sales, but for all of North America, significantly less.

In short, Honda thinks that in 2030 only maybe 1/7th to 1/8th of its North American sales will be EVs. This is a very pessimistic game plan.
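The US-only share quoted above can be checked with a couple of lines of Python. Both inputs are the figures from this comment, and the variable names are made up for illustration.

```python
PEAK_EV_OUTPUT = 240_000      # planned peak EVs per year, per the comment
US_SALES_LAST_YEAR = 1_300_000  # Honda US sales last year, per the comment

# Share of US sales if every one of those EVs were sold in the US.
us_share = PEAK_EV_OUTPUT / US_SALES_LAST_YEAR
print(f"{us_share:.1%}")
```

Spreading the same 240k over all of North America uses a larger denominator, so the share can only be lower than this.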

Comment Re:Wonder if he can make it funny again. (Score 2) 30

Onion went politically correct about five years ago and all but died as a result. "Safe edgy" competes with mainstream, and that just doesn't work for Onion's niche.

If you still want anglo edgy counterculture predicting future insanity, Babylon Bee is probably the closest thing you'll get to Onion from over a decade ago. But it has all the weird hang ups of US Christians, since it's a Christian site. So not quite the same thing.

Comment Re:You know what this means, right? (Score 1) 94

It's much worse than that. Google's main strategy has somewhat recently become to look into your search (including all the data they have on you), see what ad they can pitch to you based on that search and deliver that instead of a search result.

And for a while you used to be able to just go to page 3 or so and still find that search result you wanted. Nowadays, they don't even let you see the "billions of results" they quote. They just give you a few pages of mostly pitching things to you, and that's that.

Comment Re:"Hate Speech" you say. (Score 0, Troll) 111

I find it hilarious how accurate my description of your behaviour was. Yup, accusations of looking into primary sources over spin doctors being a "conspiracy theory" immediately follow.

For a moment, you made me entertain trying to get you to also state that this is a threat to our democracy. That's usually the next mantra in your cult after the conspiracy theory one. But honestly, I just can't be bothered. If looking into primary sources is "suspect" and "a conspiracy theory", so be it. It's not like I'm going to convince a man of faith over the internet that his God is not real.

Comment Re:"Hate Speech" you say. (Score 1) 111

I have no specialist skills to reverse engineer, from a sound clip, which specific software he used. That would require an extreme level of knowledge in the field of generative audio models and comprehensive data analysis.

I'm also not sure how it's relevant which specific generative model he used. There are quite a few that are open to all.

Comment Re:"Hate Speech" you say. (Score 0, Troll) 111

This is my point. You don't care about what happened. If you did, you'd go to readily available primary sources.

Instead you decided to attack me for referencing primary sources, because spin doctors you're comfortable with didn't mention them. Essentially, your attack on me is an act of maintenance of your ideological bubble against the reality that is more than willing to stare you in the face. It just takes one search on the subject and going on from there.

But you're telling me, straight up with full honesty that you won't do that, because that is something that is associated with primordial horror and death for you. It's a reaction of a cultist sensing something that might damage his faith.

You can take the horse to the water. You can't make it drink. One look at the man's social media accounts and then listening to the AI-generated speech will tell you exactly how BLM features into it. It's literally one search away. From there you'll probably quickly run into the local far-left militant cells that the perpetrator successfully mobilized to attack the white principal by appealing to black lives mattering.

But you won't do it. Because "ghosts", the specter of primordial death and destruction, await should you do that.

Comment Re:Right (Score 1) 37

There was a massive increase in demand for cloud services as well, because of all of the "masses working from home" novelty. So there would be a lot of demand for enterprise as well.

I suspect this is more a case of it being a long-lead product. Or the story is just generally inaccurate on everything, since it also seems to imply that the entire ChatGPT training data set could fit on two hard drives, and yet that is somehow causing a shortage of hard drives.
