
OpenAI's New Tool Attempts To Explain Language Models' Behaviors (techcrunch.com) 20
An anonymous reader quotes a report from TechCrunch: In an effort to peel back the layers of LLMs, OpenAI is developing a tool to automatically identify which parts of an LLM are responsible for which of its behaviors. The engineers behind it stress that it's in the early stages, but the code to run it is available in open source on GitHub as of this morning. "We're trying to [develop ways to] anticipate what the problems with an AI system will be," William Saunders, the interpretability team manager at OpenAI, told TechCrunch in a phone interview. "We want to really be able to know that we can trust what the model is doing and the answer that it produces."
To that end, OpenAI's tool uses a language model (ironically) to figure out the functions of the components of other, architecturally simpler LLMs -- specifically OpenAI's own GPT-2. How? First, a quick explainer on LLMs for background. Like the brain, they're made up of "neurons," which observe some specific pattern in text to influence what the overall model "says" next. For example, given a prompt about superheroes (e.g. "Which superheroes have the most useful superpowers?"), a "Marvel superhero neuron" might boost the probability the model names specific superheroes from Marvel movies. OpenAI's tool exploits this setup to break models down into their individual pieces. First, the tool runs text sequences through the model being evaluated and waits for cases where a particular neuron "activates" frequently. Next, it "shows" GPT-4, OpenAI's latest text-generating AI model, these highly active neurons and has GPT-4 generate an explanation. To determine how accurate the explanation is, the tool provides GPT-4 with text sequences and has it predict, or simulate, how the neuron would behave. It then compares the behavior of the simulated neuron with the behavior of the actual neuron.
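In code, the explain-and-simulate loop the summary describes might look roughly like the sketch below. This is purely illustrative: the helper functions (get_activations, ask_explainer, ask_simulator) are hypothetical stand-ins for the subject-model and GPT-4 calls, not the API of OpenAI's released tool.

def interpret_neuron(neuron, text_sequences, get_activations, ask_explainer, ask_simulator):
    # 1. Run text sequences through the model under study and record how
    #    strongly this neuron fires on each token.
    records = []
    for text in text_sequences:
        activations = get_activations(neuron, text)  # one value per token
        records.append((text, activations))

    # 2. Keep the sequences where the neuron is most active and ask the
    #    explainer model (GPT-4 in OpenAI's setup) for a natural-language
    #    explanation of what the neuron responds to.
    top_examples = sorted(records, key=lambda r: max(r[1]), reverse=True)[:10]
    explanation = ask_explainer(top_examples)

    # 3. Have the explainer *simulate* the neuron: given only the explanation
    #    and fresh text, predict an activation for every token.
    simulated = [ask_simulator(explanation, text) for text, _ in records]
    actual = [acts for _, acts in records]

    # 4. Comparing simulated vs. actual activations gives the explanation its
    #    score (see the scoring sketch below).
    return explanation, simulated, actual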
"Using this methodology, we can basically, for every single neuron, come up with some kind of preliminary natural language explanation for what it's doing and also have a score for how how well that explanation matches the actual behavior," Jeff Wu, who leads the scalable alignment team at OpenAI, said. "We're using GPT-4 as part of the process to produce explanations of what a neuron is looking for and then score how well those explanations match the reality of what it's doing." The researchers were able to generate explanations for all 307,200 neurons in GPT-2, which they compiled in a dataset that's been released alongside the tool code. "Most of the explanations score quite poorly or don't explain that much of the behavior of the actual neuron," Wu said. "A lot of the neurons, for example, are active in a way where it's very hard to tell what's going on -- like they activate on five or six different things, but there's no discernible pattern. Sometimes there is a discernible pattern, but GPT-4 is unable to find it."
"We hope that this will open up a promising avenue to address interpretability in an automated way that others can build on and contribute to," Wu said. "The hope is that we really actually have good explanations of not just what neurons are responding to but overall, the behavior of these models -- what kinds of circuits they're computing and how certain neurons affect other neurons."
Wonder If That Scales Consistently . . . (Score:3)
To that end, OpenAI's tool uses a language model (ironically) to figure out the functions of the components of other, architecturally simpler LLMs -- specifically OpenAI's own GPT-2.
"We're using GPT-4 as part of the process to produce explanations of what a neuron is looking for and then score how well those explanations match the reality of what it's doing." The researchers were able to generate explanations for all 307,200 neurons in GPT-2, which they compiled in a dataset that's been released alongside the tool code.
Will it require GPT-6 to explain GPT-4's behavior?
Re: (Score:2)
It would seem so. However, the benefit here is more just getting a conceptual and practical idea of how these damn things *actually* work beyond "It's a giant inscrutable matrix of vectors with x attention heads and y parameters".
Re: (Score:1)
> Look up repertory grids, as used in clinical psychology
Similar to Factor Tables, see my sig.
As is tradition (Score:2)
The first integrated circuits were laid out on paper, similar to (or perhaps even the same as) the large formats draftsmen used to make building plans.
Once the chips were good enough, they ditched that and started laying out chips with a GUI. The state of the art advanced at a quickening pace.
Using AI to make AI better is not the least bit unexpected. It's how good tech is done, all the way back to using the first wheel to get better lumber from further away, to make better wheels.
Let's settle this (Score:2)
Re: (Score:2)
While inherently "true", it's not meaningful.
Sure, I can say a chess playing system looked at all possibilities X moves out and followed a scoring system I provided. But when you layer on enough complexity that the X moves out becomes dynamic and the scoring system becomes variable, it can behave in unexpected ways.
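To make the chess example concrete (purely illustrative, not anyone's actual engine): a fixed-depth search whose behavior is entirely determined by a scoring function you supply. Every move it picks traces back to score and depth; the opacity described above shows up once those pieces stop being fixed, hand-written inputs.

def minimax(state, depth, maximizing, moves, apply_move, score):
    # moves(state) lists legal moves, apply_move(state, m) returns the next
    # state, and score(state) is the scoring system you provided.
    legal = moves(state)
    if depth == 0 or not legal:
        return score(state), None
    best_value = float("-inf") if maximizing else float("inf")
    best_move = None
    for m in legal:
        value, _ = minimax(apply_move(state, m), depth - 1, not maximizing,
                           moves, apply_move, score)
        if (maximizing and value > best_value) or (not maximizing and value < best_value):
            best_value, best_move = value, m
    return best_value, best_move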
The factor of obscurity is what results in the "we can't explain why" from the standpoint of "we didn't program it to do the thing it's doing".
So it's a bit disingenuous and it's still import
Re: (Score:2)
The programmers are responsible for its "behavior" because there is no "behavior" that was not programmed.
I would rather say that if anyone is "responsible," it would be the trainers of the model (i.e. those who controlled what material was fed to it during training). And this highlights the actual problem -- as this article shows, there was no way to predict how the neural network would form based on whatever input, so even the trainers could not have predicted the model's behavior in detail or what kind of emergent features it would have.
In the end, I consider GPT a tool, so the end user (with a
Re: (Score:2)
Sure. The problem is that with "training" as the programming method used, the programmers have no clue what they really programmed.
Re: (Score:2)
It's not fair to say we've got "no clue". Some kinds of models are completely transparent and we know exactly what is being encoded. Other kinds of models are less clear, though even if we don't know precisely what is encoded, we do know that it is nothing more than a small sample of the information contained in the training data. No matter how opaque, there are a bunch of ways to tease that out anyway.
Take a look at SHAP [github.com], which is a general purpose way to do exactly that.
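For anyone who hasn't tried it, a minimal SHAP sketch; the scikit-learn model and dataset here are arbitrary placeholders, not anything from the article:

import shap
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(data.data, data.target)

# TreeExplainer attributes each prediction to the input features, giving
# per-feature contributions (Shapley values) you can inspect or plot.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(data.data[:50])
shap.summary_plot(shap_values, data.data[:50], feature_names=data.feature_names)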
One of my favorite examples [googleblog.com] come
Re: (Score:3)
Actually, it is completely fair. The problem is that the models are far too large because tons of training data got shoveled in. At that point, any attempt to understand what the model does dies due to complexity and size. Whether you can find that out for one specific question hardly matters, because you no longer know what questions to ask or where to look.
I do agree that for a specific question it may indeed be possible to find out how the answer was created. But that does not solve the problem of safety and reliab
Re: (Score:2)
That's a different argument, isn't it? "There's just too much for anyone to understand" is not the same thing as "It's a deep mystery". You're right in that the incredible size of these things makes it essentially impossible to guarantee that the model won't produce undesirable output, but that doesn't mean "we have no clue". We know a great deal about the kind of information that's being encoded from the training data and we know how the model produces output.
I think you'd agree that people have some v
Re: (Score:2)
"No clue" is not well defined. A "clue" in this context usually means a general understanding as to how things work and in the context of the original story it relates to safety, i.e. rare behaviors matter a lot. With that, I feel confident that "no clue" adequately describes the situation. This is a judgment call though.
Of course we do generally know how these things work, and given a small enough training data set we can still fully understand the possible behaviors. As "god of the gaps" is a nonsense ide
Brain mapping (Score:3)
In conjunction with some future fMRI tech, might this someday tell us why we think what we do?
Feedback howl (Score:3)
So, GPT-4 tries to come up with its best guess as to what is going on in the tiny little mindlet of GPT-2. Sometimes it produces useful explanations, but mostly it vomits out voluminous "this neuron sorta responds to everything, maybe the phases of the Moon too, who knows", and that isn't useful for giving researchers insight. So ... they'll have to train up a whole new model with all of THAT data, maybe call it GPT-GGG for Gonkulator Generation Gapper. Feed those results back through recursively and, um, lessee, adjust the gain and phase until the shebang oscillates in a chaotic limit cycle.
Somewhere a light is going to go on. Or more likely heaps of lights will go dark from all the power being drawn by the Gonkulator.
Re: (Score:2)
Hehe, yes, sounds very much like it. Does not sound like this is ever going to scale though.
Baby step towards models of consciousness? (Score:2)
Could generating an explanation/rationale for the outputs from parts of the brain, the bits and pieces of information that are meant to be dispatched to the action centers, be what constitutes consciousness?
It could also simulate possible responses using the same generative mechanism and feed it back to the language processing centers to check if it conforms to the objective. If you now ask a generative model why it said what it said, it will not only have an explanation but also a rationalization.