Become a fan of Slashdot on Facebook

 



Forgot your password?
typodupeerror
×

Comment Nonsense (Score 1) 25

The author has no clue what they're talking about:

Meta said the 15 trillion tokens on which its trained came from "publicly available sources." Which sources? Meta told The Verge that it didn't include Meta user data, but didn't give much more in the way of specifics. It did mention that it includes AI-generated data, or synthetic data: "we used Llama 2 to generate the training data for the text-quality classifiers that are powering Llama 3." There are plenty of known issues with synthetic or AI-created data, foremost of which is that it can exacerbate existing issues with AI, because it's liable to spit out a more concentrated version of any garbage it is ingesting.

1) *Quality classifiers* are not themselves training data. Think of it as a second program that you run on your training data before training your model, to look over the data and decide how useful it looks and thus how much to emphasize it in the training, or whether or not to just omit it.

2): Synthetic training data *very much* can be helpful, in a number of different ways.

A) It can diversify existing data. E.g., instead of just a sentence "I was on vacation in Morocco and I got some hummus", maybe you generate different versions of the same sentence ("I was traveling in Rome and ordered some pasta" ,"I went on a trip to Germany and had some sausage", etc), to deemphasize the specifics (Morocco, hummus, etc) and focus on the generalization. One example can turn into millions, thus rendering rote memorization during training impossible.

B) It allows for programmatic filtration stages. Let's say that you're training a model to extract quotes from text. You task a LLM with creating training examples for your quote-extracting LLM (synthetic data). But you don't just blindly trust the outputs - first you do a text match to see if what it quoted is actually in the text and whether it's word-for-word right. Maybe you do a fuzzy match, and if it just got a word or two off, you correct it to the exact match, or whatnot. But the key is: you can postprocess the outputs to do sanity checks on it, and since those programmatic steps are deterministic, you can guarantee that the training data meets certain characteristics..

C) It allows for the discovery of further interrelationships. Indeed, this is a key thing that we as humans do - learning from things we've already learned by thinking about them iteratively. If a model learned "The blue whale is a mammal" and it learned "All mammals feed their young with milk", a synthetic generation might include "Blue whales are mammals, and like all mammals, feed their young with milk" . The new model now directly learns that blue whales feed their young with milk, and might chain new deductions off *that*.

D) It's not only synthetic data that can contain errors, but non-synthetic data as well. The internet is awash in wrong things; a random thing on the internet is competing with a model that's been trained on reems of data and has high quality / authoritative data boosted and garbage filtered out. "Things being wrong in the training data" in the training data is normal, expected, and fine, so long as the overall picture is accurate. If there's 1000 training samples that say that Mars is the fourth planet from the sun, and one that says says that the fourth planet from the sun is Joseph Stalin, it's not going to decide that the fourth planet is Stalin - it's going to answer "Mars".

Indeed, the most common examples I see of "AI being wrong" that people share virally on the internet are actually RAG (Retrieval Augmented Generation), where it's tasked with basically googling things and then summing up the results - and the "wrong content" is actually things that humans wrote on the internet.

That's not that you should rely only generated data when building a generalist model (it's fine for a specialist). There may be specific details that the generating model never learned, or got wrong, or new information that's been discovered since then; you always want an influx of fresh data.

3): You don't just randomly guess whether a given training methodology (such as synthetic data, which I'll reiterate, Meta did not say that they used - although they might have) is having a negative impact. Models are assessed with a whole slew of evaluation metrics to assess how good and accurately they respond to different queries. And LLaMA 3 scores superbly, relative to model size.

I'm not super-excited about LLaMA 3 simply because I hate the license - but there's zero disputing that it's an impressive series of models.

Comment Re: Cue all the people acting shocked about this.. (Score 1) 41

Under your (directly contradicting their words) theory, then creative endeavour on the front end SHOULD count If the person writes a veritable short-story as the prompt, then that SHOULD count. It does not. Because according to the copyright office, while user controls the general theme, they do not control the specific details.

"Instead, these prompts function more like instructions to a commissioned artist—they identify what the prompter wishes to have depicted, but the machine determines how those instructions are implemented in its output."

if a user instructs a text-generating technology to “write a poem about copyright law in the style of William Shakespeare,” she can expect the system to generate text that is recognizable as a poem, mentions copyright, and resembles Shakespeare's style.[29] But the technology will decide the rhyming pattern, the words in each line, and the structure of the text

It is the fact that the user does not control the specific details, only the overall concept, that (according to them) that makes it uncopyrightable.

Comment Re: Cue all the people acting shocked about this.. (Score 1) 41

Based on the Office's understanding of the generative AI technologies currently available, users do not exercise ultimate creative control over how such systems interpret prompts and generate material. Instead, these prompts function more like instructions to a commissioned artist—they identify what the prompter wishes to have depicted, but the machine determines how those instructions are implemented in its output.[28] For example, if a user instructs a text-generating technology to “write a poem about copyright law in the style of William Shakespeare,” she can expect the system to generate text that is recognizable as a poem, mentions copyright, and resembles Shakespeare's style.[29] But the technology will decide the rhyming pattern, the words in each line, and the structure of the text.[30]

Compare with my summary:

" their argument was that because the person doesn't control the exact details of the composition of the work"

I'll repeat: I accurately summed up their argument. You did not.

Comment Re:AI Incest (Score 2, Interesting) 41

Yes, "you've been told" that by people who have no clue what they're talking about. Meanwhile, models just keep getting better and better. AI images have been out for years now. There's tons on the net.

First off, old datasets don't just disappear. So the *very worst case* is that you just keep developing your new models on pre-AI datasets.

Secondly, there is human selection on things that get posted. If humans don't like the look of something, they don't post it. In many regards, an AI image is replacing what would have been a much crapper alternative choice.

Third, dataset gatherers don't just blindly use a dump of the internet. If there's a place that tends to be a source of crappy images, they'll just exclude or downrate it.

Fourth, images are scored with aesthetic gradients before they're used. That is, humans train models to assess how much they like images, and then those models look at all the images in the dataset and rate them. Once again, crappy images are excluded / downrated.

Fifth, trainers do comparative training and look at image loss rates, and an automatically exclude problematic ones. For example, if you have a thousand images labeled "watermelon" but one is actually a zebra, the zebra will have an anomalous loss spike that warrants more attention (either from humans or in an automated manner). Loss rates can also be compared between data +sources+ - whole websites or even whole datasets - and whatever is working best gets used.

Sixth, trainers also do direct blind human comparisons for evaluation.

This notion that AIs are just going to get worse and worse because of training on AI images is just ignorant. And demonstrably false.

Comment Re:Cue all the people acting shocked about this... (Score 4, Interesting) 41

As for why I think the ruling was bad: their argument was that because the person doesn't control the exact details of the composition of the work, than the basic work (before postprocessing or selection) can't be copyrighted. But that exact same thing applies to photography, outside of studio conditions. Ansel Adams wasn't out there going, "Okay, put a 20 meter oak over there, a 50 meter spruce over there, shape that mountain ridge a bit steeper, put a cliff on that side, cover the whole thing with snow... now add a rainbow to the sky... okay, cue the geese!" He was searching the search space for something to match a general vision - or just taking advantage of happenstance findings. And sure, a photographer has many options at their hands in terms of their camera and its settings, but if you think that's a lot, try messing around with AUTOMATIC1111 with all of its plugins some time.

The winner of Nature Photographer of the year in 2022 was Dmitry Kokh, with "House of Bears". He was stranded on a remote Russian archipelago and discovered that polar bears had moved into an abandoned weather station, and took photos of them. He didn't even plan to be there then. He certainly didn't plan on having polar bears in an abandoned weather station, and he CERTAINLY wasn't telling the bears where to stand and how to pose. Yet his work is a classic example of what the copyright office thinks should be a copyrightable work.

And the very notion that people don't control the layout with AI art is itself flawed. It was an obsolete notion even when they made their ruling - we already had img2img, instructpix2pix and controlnet. The author CAN control the layout, down to whatever level of intricate detail they choose. Unlike, say, a nature photographer. And modern models give increasing levels of control even with the prompt itself - with SD3 (unlike SD1/2 or SC) - you can do things like "A red sphere on a blue cube to the left of a green cone" . We're heading to - if not there already - where you could write a veritable short story's worth of detail to describe a scene.

I find it just plain silly that Person A could grab their cell phone and spend 2 seconds snapping a photo of whatever happens to be out their window, and that's copyrightable, but a person who spends hours searching through the latent space - let alone with ControlNet guidance (controlnet inputs can be veritable works of art in their own right) - isn't given the same credit for the amount of creative effort put into the work.

I think, rather, it's very simple: the human creative effort should be judged not on the output of the work (the work is just a transformation of the inputs), but the amount of creative effort they put into said inputs. Not just on the backend side - selection, postprocessing, etc - but on the frontend side as well. If a person just writes "a fluffy dog" and takes the first pic that comes up, obviously, that's not sufficient creative endeavour. But if a person spends hours on the frontend in order to get the sort of image they want, why shouldn't that frontend work count? Seems dumb to me.

Comment Cue all the people acting shocked about this... (Score 4, Informative) 41

... when the original ruling itself plainly said that though the generated content itself isn't copyrightable, human creative action such as postprocessing or selection can render it copyrightable.

I still think the basic ruling was bad for a number of reasons, and it'll increasingly come under stress in the coming years. But there is zero shock to this copyright here. The copyright office basically invited people to do this.

Comment Re:Sigh... (Score 1) 49

Here we go again with this.

NVidia shipped 100k AI GPUs last year, which - if run nonstop - would consume 7,4 TWh. Crypto consumes over 100 TWh per year, and the world as a whole consumes just under 25000 TWh per year.

AI consumption of power is a pittiance. To get these huge numbers, they have to assume long-term extreme exponential scaling. But you can make anything give insane numbers with an assumption like that.

I simply don't buy the assumption. Not even assuming an AI bust - even assuming that AI keeps hugely growing, and that nobody rests on their laurels but rather keeps training newer and better foundations - the simple fact is that there's far too much progress being made towards vastly more efficient architectures at every level - model structure, neuron structure, training methodologies, and hardware. . Not like "50% better", but like "orders of magnitude better". I just don't buy these notions of infinite exponential growth.

Comment Re: 20% survival is pretty good (Score 1) 57

I won't return in coin by calling you an idiot, because I don't think you are one. What I think you are is too *ignorant* to realize you're talking about evolution. "Survival of the fittest" is a phrase coined by Herbert Spencer in 1864 to refer to natural selection, a concept that's in the actual *title* of Darwin's book.

Comment Re:Well... (Score 3, Insightful) 125

It's not just a question of whether it's justifiable. It's just simply nonsense to think that they can enforce this. Anyone can run Stable Diffusion on their computer. There's a virtually limitless number of models finetuned to make all kinds of porn. It's IMHO extremely annoying all the porn flooding the model sites; I think like 3/4ths of the people using these tools are guys making wank material. Even models that aren't tuned specifically for porn, rarely does anyone (except the foundation model developers, like StabilityAI) specifically try to *prevent* it.

The TL/DR is: if you think stopping pirated music was hard, well, *good luck* stopping people from generating porn on their computers. You might as well pass a law declaring it illegal to draw porn.

Comment Re:really - the whole world's ? (Score 1) 57

Well, no *one* of us in a position to save the coral reefs. Not even world leaders can do it. But we *all* are in a position to do a little bit, and collectively all those little bits add up to matter.

Sure if you're the only person trying to reduce is carbon footprint you will make no difference. But if enough people do it, then that captures the attention of industry and politicians and shifts the Overton window. Clearly we can't save everything, but there's still a lot on the table and marginal improvements matter. All-or-nothing thinking is a big part of denialist thinking; if you can't fix everything then there's no point in fixing anything and therefore people say there's a problem are alarmists predicting a catastrophe we couldn't do anything about even if it weren't happening.

As to the loss of coral reefs not being the worst outcome of climate change, that's probably true, but we really can't anticiapte the impact. About a quarter of all marine life depends on coral reefs for some part of their life cycle. Losing all of it would likely be catastrophic in ways we can't imagine yet, but the flip side is that saving *some* of it is likely to be quite a worthwhile goal.

Slashdot Top Deals

Scientists will study your brain to learn more about your distant cousin, Man.

Working...