Good to see we're abandoning the premise that the logic behind LLMs is "simple".
LLMs, these immensely complex models, function basically as the most insane flow chart you could imagine. Billions of nodes and interconnections between them. Nodes receiving not just yes-or-no inputs but any degree of nuance, and producing outputs that are likewise graded rather than binary. Many questions superimposed atop each node simultaneously, with the differences between them teased out at later nodes. All of it self-assembled into a model of how the universe and the things within it interact.
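To make the flow-chart picture concrete, here's a minimal sketch of one transformer FFN block in numpy. It's a toy with made-up dimensions, not any real model's code: each row of the first weight matrix plays the role of a "node" that takes a graded weighted combination of every input, and the GELU nonlinearity gives it a graded rather than yes-or-no output.

```python
import numpy as np

def ffn_block(x, W_in, b_in, W_out, b_out):
    """One transformer FFN block: expand, apply nonlinearity, project back."""
    h = x @ W_in + b_in  # (d_model,) -> (d_ff,): each "node" weighs every input
    # GELU (tanh approximation): a smooth, graded activation, not a hard yes/no
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ W_out + b_out  # (d_ff,) -> (d_model,): back to the residual stream

# Toy sizes; real models use d_model in the thousands and d_ff ~4x that,
# stacked across dozens of layers - hence billions of parameters.
d_model, d_ff = 8, 32
rng = np.random.default_rng(0)
x = rng.standard_normal(d_model)
out = ffn_block(x,
                0.1 * rng.standard_normal((d_model, d_ff)), np.zeros(d_ff),
                0.1 * rng.standard_normal((d_ff, d_model)), np.zeros(d_model))
print(out.shape)  # (8,)
```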
At least, that's the FFNs - the attention blocks add yet another level of complexity, letting the model query a latent-space memory whose readout each FFN block then transforms for the next layer. And that latent-space memory covers... every concept that exists, plus any that could theoretically exist in between any number of existing ones. These live in an N-dimensional space, where N is in the hundreds to thousands, and the degree of relationship between concepts can be measured by their cosine similarity. So for *each token* at *each layer*, a representation of somewhere in that space of everything-that-does-or-could-exist is taken and, informed by all the other things-that-do-or-could-exist and their relations to one another, transformed by the insane-flow-chart FFN above into the next layer's state for that position.
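As a rough illustration of both ideas - cosine similarity as the relatedness measure, and attention as a similarity-weighted lookup into that memory - here's a toy single-head scaled dot-product attention in numpy. The dimensions are made up and far smaller than any real model's:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two concept vectors: ~1 means closely
    related directions in the space, ~0 unrelated, ~-1 opposed."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def attention(Q, K, V):
    """Each token's query is scored against every key (a similarity
    lookup), and the softmaxed scores blend the value vectors."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # (tokens, tokens)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over keys
    return weights @ V                                 # blended readout per token

rng = np.random.default_rng(0)
tokens, d = 5, 16            # toy sizes; real N is hundreds to thousands
Q = rng.standard_normal((tokens, d))
K = rng.standard_normal((tokens, d))
V = rng.standard_normal((tokens, d))
print(attention(Q, K, V).shape)        # (5, 16): one updated vector per token
print(cosine_similarity(Q[0], K[0]))   # pairwise relatedness of two vectors
```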
Words don't exist in a vacuum. Words are a reflection of the universe that led to their creation. To get good at predicting words, you have to have a good model of the underlying world and all the complexity of the interactions within it. It took the Transformer architecture - the combination of FFNs and an attention mechanism - along with mind-bogglingly huge scale (the combinatorial interactions of billions of parameters) to do this: to develop a compressed representation of "how everything in the known universe interacts".
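Stacking the two sketches above gives one (pre-norm, single-head, toy-sized) transformer layer: attention reads across token positions, the FFN transforms each position, and residual connections carry the state forward to the next layer. Again just a sketch under those assumptions, not a production implementation:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def transformer_block(x, Wq, Wk, Wv, Wo, W1, W2):
    """One pre-norm layer: attention mixes information across tokens,
    then a ReLU FFN transforms each position; residuals carry the state."""
    h = layer_norm(x)
    Q, K, V = h @ Wq, h @ Wk, h @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    x = x + (w @ V) @ Wo                    # attention sublayer + residual
    h = layer_norm(x)
    x = x + np.maximum(h @ W1, 0.0) @ W2    # FFN sublayer + residual
    return x

rng = np.random.default_rng(0)
tokens, d, d_ff = 4, 8, 32
mk = lambda *shape: 0.1 * rng.standard_normal(shape)
x = rng.standard_normal((tokens, d))
out = transformer_block(x, mk(d, d), mk(d, d), mk(d, d), mk(d, d),
                        mk(d, d_ff), mk(d_ff, d))
print(out.shape)  # (4, 8): same shape, ready for the next layer
```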