Great summary, Derec01; it does put some perspective on how it works. However, the fact that it is somewhat simple does not diminish its coolness. I would guess our own minds also work in some similarly simple way. How do we, ourselves, approach answering the question? We get the question, "What is life?", then turn it around, "Life is...", and then "just auto-complete" the rest of it. Think about it: when you answer this question, you really do not do any calculations, or logical proofs, or even literature searches; the "auto-completion" comes up on its own while you are writing, based on your training. I suspect that something very similar is happening in a ChatGPT session. At this point I am not saying that this is the only "operational mode" our minds work in, but it is definitely one of the modes. What do you think?
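To make the "auto-complete" idea concrete, here is a toy sketch in Python. It is only my illustration of the loop, not how ChatGPT actually works: real models use neural networks over subword tokens rather than word counts, but the shape of "complete from the prompt, one piece at a time, based on training" is the same.

```python
from collections import Counter, defaultdict

# Toy "training": count which word follows each word in a tiny corpus.
corpus = "life is what happens while you are busy making other plans . life is short ."
words = corpus.split()
follows = defaultdict(Counter)
for prev, nxt in zip(words, words[1:]):
    follows[prev][nxt] += 1

# "Auto-complete": turn the question around into a prompt ("life is"),
# then repeatedly emit the most frequent next word seen in training.
completion = ["life", "is"]
for _ in range(6):
    candidates = follows[completion[-1]]
    if not candidates:  # nothing ever followed this word in training
        break
    completion.append(candidates.most_common(1)[0][0])

print(" ".join(completion))  # -> "life is what happens while you are busy"
```

No calculation or proof happens anywhere in there; the continuation simply falls out of the training counts.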
True, it's certainly cool what it can do, and I don't mean to downplay that. There are some genuinely interesting things that Transformers are likely doing internally. As a couple of examples, processing "in-context" training examples (e.g. the text it is autocompleting) may actually, in effect, recapitulate a training operation like gradient descent (https://arxiv.org/abs/2212.07677). Also, interestingly, we can learn how it effectively operates by projecting out the underlying computations from inside the black box of the neural network (https://arxiv.org/abs/2211.01288). This may teach us new ways to encode information.
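The core identity behind that first link can actually be checked in a few lines for the simplest possible case. The sketch below is my own toy version under strong assumptions (plain linear regression, weights starting at zero, attention with no softmax); the paper works out the correspondence properly for trained Transformers:

```python
import numpy as np

rng = np.random.default_rng(0)

# In-context examples (x_i, y_i) of a linear task, plus a query x_q.
N, d = 8, 4
X = rng.normal(size=(N, d))   # the x_i "demonstrations" in the prompt
y = X @ rng.normal(size=d)    # their labels y_i
x_q = rng.normal(size=d)      # the query to be answered
lr = 0.1                      # gradient-descent step size

# (a) One explicit gradient-descent step on w, starting from w = 0,
#     for the squared loss (1/2N) * sum_i (w . x_i - y_i)^2.
w = (lr / N) * (y @ X)        # minus the gradient at w = 0
pred_gd = w @ x_q

# (b) One pass of softmax-free linear attention with keys = x_i,
#     values = y_i, query = x_q, and the same scaling.
pred_attn = (lr / N) * np.sum((X @ x_q) * y)

print(pred_gd, pred_attn)     # identical up to float rounding
```

So at least in this stripped-down setting, "attend to the in-context examples" and "take a gradient step on them" are literally the same computation.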
However, I don't think this approach will be anything like complete in the near term. I would consider it the difference between the following two things:
1. a concise computer program that takes an input and produces the right output, respecting the constraints of the problem
2. the set of sequential machine instructions that are executed while the program in (1) computes an output from an input.
Imagine that you only get to see the operations in (2) rather than the full program. The trace is much more raw data than the original program, but it usually carries less information. The best you can do is extrapolate from what those executions did, and where you will do worst is exactly the inputs for which the original program provided no output.
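Here is a contrived way to see the point (my own example, nothing from the papers above). The "program" is a one-line function whose constraint holds by construction; the "trace" is just what it did on the inputs we happened to observe:

```python
import numpy as np
from numpy.polynomial import Polynomial

# The concise program from (1): the constraint (period 7) is built in.
def program(x):
    return x % 7

# The "data exhaust" from (2): we only see its behavior on inputs 0..20.
xs = np.arange(21)
ys = program(xs)

# Best-effort extrapolation from the executions alone: fit a curve to them.
fit = Polynomial.fit(xs, ys, deg=8)

print(program(100))     # 2 -- the program's constraint still holds at 100
print(round(fit(100)))  # a huge, meaningless number -- the fit's does not
```

Inside 0..20 the fitted curve at least roughly follows the data; at 100, where the trace offered nothing, it fails in exactly the region the original program would still have handled correctly.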
Similarly, I think most of the GPT-like models are trained on the "data exhaust" of the more complex processes going on inside a human mind. I believe GPT-3 will often perform pretty well when asked to do tasks that are compositions of operations it has seen. However, it has far fewer training points marking things as clearly untrue, and so less opportunity to abstract in a way that respects the true/false distinction. The set of untruths is much larger than the set of truths, and no one takes the time, for instance, to specifically mention that "Calvin Coolidge did not ride an elephant to the War of the Roses" or that "24 is never equal to 5". I am skeptical that the current approaches will successfully produce the kind of reusable abstractions that are really necessary to induce a higher-level program.