Comment Re:Mathematician commentary included (Score 0) 62
My understanding is that LLMs are built on a foundation of ANNs, and that indeed the backpropagation used to train ANNs is a statistical process;
Two responses. One, that's describing processes at the scale of individual neurons rather than collective processes. Two, this was a discussion about inference, not training. Human neurons also learn by simple local update rules (Hebbian learning, for example). But that does not describe the macroscopic processes that result from said learning.
* During training, neurons develop into classifiers that detect superpositions of concepts that collectively follow the same activation process. Individual neurons weight their input space and subdivide it by a fuzzy hyperplane to achieve a classification result.
* In subsequent layers, said input space is formed from a weighted combination of the previous layer's classifications; thus, the superpositions of questions being formed are more complex, as are the classification results.
* In an LLM, this iterates for dozens of layers, gaining complexity at each layer, to form each FFN.
* The initial input space to an FFN is a latent (a conceptual representation), as is the output; as a result, the FFNs function as classifier-generators: they detect combinations of concepts in the input space, and output the causally-resultant concepts into the output space.
* FFNs alternate with attention layers dozens to hundreds of times in order to process the information, each layer building on the results of the previous one.
The word to describe that is not "statistics". It's "logic".
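To make the hyperplane bullets concrete, here's a minimal sketch in plain Python. Everything in it is an assumption for illustration: the weights are hand-picked rather than trained, the sigmoid stands in for whatever activation a real model uses, and `neuron` and `xor_net` are made-up names. The point is only that a second layer classifying the first layer's classifications can express something (XOR) that no single hyperplane can.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def neuron(weights, bias, inputs):
    # The weighted sum defines a hyperplane; the sigmoid makes the boundary fuzzy.
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def xor_net(a, b):
    # Layer 1: two classifiers over the raw inputs (hand-picked weights, not trained).
    h_or = neuron([10.0, 10.0], -5.0, [a, b])    # fires if a OR b
    h_and = neuron([10.0, 10.0], -15.0, [a, b])  # fires if a AND b
    # Layer 2: a classifier over layer 1's outputs: "OR but not AND".
    return neuron([10.0, -10.0], -5.0, [h_or, h_and])

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, round(xor_net(a, b)))
```

Each unit on its own is just a soft linear cut; the "logic" only shows up in how the cuts are composed.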
In an LLM, the first few layers focus on disambiguation. If there's a token for "bank", is this about a riverbank, a financial bank, banking a plane, etc.? As the layers progress, it starts building up first simple circuits, and then progressively more complex circuits - you might get a circuit that detects "talking like MAGA", or "off-by-one programming errors", or whatnot. In the late layers, you have the general conclusions reached - for example, if it were "The capital of the state that contains America's fourth-largest metro area is...", you've already had FFNs detect the concept of fourth-largest metro area and encode Dallas-Fort Worth, then later take that and encode "Texas", and then finally encode "Austin". And then in the final couple of layers you converge back toward linguistic space.
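As a cartoon of the disambiguation step, here's a small sketch. The three feature axes (water, money, aviation), the vectors, and the function name `disambiguate_bank` are all invented for illustration; real models learn their representations and use far more dimensions. It just shows how surrounding context can select one sense of "bank" via dot products.

```python
# Hypothetical hand-made feature axes: (water, money, aviation). Not real embeddings.
SENSES = {
    "riverbank": (1.0, 0.0, 0.0),
    "financial bank": (0.0, 1.0, 0.0),
    "banking a plane": (0.0, 0.0, 1.0),
}
CONTEXT_WORDS = {
    "river": (1.0, 0.0, 0.0),
    "fishing": (0.8, 0.0, 0.0),
    "loan": (0.0, 1.0, 0.0),
    "deposit": (0.0, 0.9, 0.0),
    "pilot": (0.0, 0.0, 1.0),
}

def disambiguate_bank(context):
    # Sum the feature vectors of known context words; unknown words add nothing.
    total = [0.0, 0.0, 0.0]
    for word in context.lower().split():
        for i, v in enumerate(CONTEXT_WORDS.get(word, (0.0, 0.0, 0.0))):
            total[i] += v
    # Pick the sense whose vector has the largest dot product with the context.
    return max(SENSES, key=lambda s: sum(x * y for x, y in zip(SENSES[s], total)))

print(disambiguate_bank("she took out a loan at the bank"))
```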
Anthropic has done some great work on this with attribution graph probes and the like; you can detect what circuits are firing, and on what things those circuits fire, and ramp them up or down to see how it modifies the output. They very much work through long chains of logical inferences.
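Here's a cartoon of what that kind of intervention looks like, using the capital example above. Plain dictionary lookups stand in for the hops, and `patch_state` is a made-up knob for overwriting the intermediate result. Real attribution work operates on learned features in activation space, not on tables, so treat this purely as an illustration of the idea, not of anyone's actual tooling.

```python
# Hypothetical lookup tables; each one stands in for a "hop" in the chain.
METRO_BY_RANK = {4: "Dallas-Fort Worth"}
STATE_BY_METRO = {"Dallas-Fort Worth": "Texas"}
CAPITAL_BY_STATE = {"Texas": "Austin", "California": "Sacramento"}

def capital_of_state_with_metro(rank, patch_state=None):
    metro = METRO_BY_RANK[rank]    # hop 1: rank -> metro area
    state = STATE_BY_METRO[metro]  # hop 2: metro area -> state
    if patch_state is not None:
        state = patch_state        # intervention: overwrite the intermediate result
    return CAPITAL_BY_STATE[state] # hop 3: state -> capital

print(capital_of_state_with_metro(4))
print(capital_of_state_with_metro(4, patch_state="California"))
```

If the answer were a memorized surface pattern, patching the middle step would not move it. Because the final hop reads from the intermediate result, changing that result changes the answer.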