I'm going to do something I usually don't: reply, even though I had given up on this thread. I realize you have a point in that I didn't give your questions a good answer. This reply probably won't sway you, but it's more for others who might stumble upon this thread some time in the future.
Ok. So memory and state are connected. When I said that the model has no memory of previous runs, you countered by mentioning KV caching, which led me to believe that you either deliberately misunderstood what I meant, or just used Google/ChatGPT for a reply. In case it really was a misunderstanding, I'll try to explain why.
Memory, when discussing intelligence (artificial or otherwise), means that previous processing or information impacts the outcome of the current processing. KV caching exists only to improve performance; it doesn't alter the outcome. If you run the transformer with the same input, it will produce the same output regardless of whether you use a KV cache. You just save yourself some compute and time.
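To make that concrete, here is a minimal sketch (NumPy, a single attention head, no batching, random toy weights) showing that a KV cache is purely a compute-saving trick: the attention output for the newest token is identical whether you recompute all keys/values from scratch or reuse cached ones.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                         # head dimension (toy size)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, K, V):
    # Scaled dot-product attention for query rows q over keys K, values V.
    return softmax(q @ K.T / np.sqrt(d)) @ V

tokens = rng.standard_normal((5, d))          # embeddings for 5 tokens

# 1) No cache: project K and V for the whole sequence every step.
K_full, V_full = tokens @ Wk, tokens @ Wv
out_full = attend(tokens[-1:] @ Wq, K_full, V_full)

# 2) KV cache: K and V for the first 4 tokens were stored on an earlier
#    step; only the newest token's projections are computed now.
K_cache, V_cache = tokens[:4] @ Wk, tokens[:4] @ Wv
K = np.vstack([K_cache, tokens[4:] @ Wk])
V = np.vstack([V_cache, tokens[4:] @ Wv])
out_cached = attend(tokens[4:] @ Wq, K, V)

assert np.allclose(out_full, out_cached)      # same result, less compute
```

Same numbers either way; the cache changes nothing about what the model "remembers", only how much work is redone.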
Different types of ML models can be trained to extrapolate, and you could of course argue that an LLM "extrapolates the next token in a series of tokens".
But if we are talking about genuinely intelligent extrapolation, an LLM can't do it outside the distribution of its training data. If it manages to, that's blind luck.
I also realize that you did not, in fact, state that LLMs can think and reason, apologies for that.
The important fact that most people miss is that the Transformer, the core of everything going on in this AI hype, only processes input tokens to produce a list of probabilities for the next token. It doesn't actually select that token, so it has no clue whether the sentence is about to be about vehicles, amoebas or shades of green. And once the next token has been selected (based on the temperature setting) and the Transformer processes the previous tokens plus that newly selected one, it has no clue whether any of those tokens were generated by itself or are completely new input. That is what I meant by not having state (or memory).
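The loop I'm describing looks roughly like this. It's a toy sketch, not a real language model: the "model" here is a stand-in function that deterministically maps a context to fake logits. The point is the shape of the loop: the model only returns scores, the host picks a token (here via temperature sampling), appends it, and calls the model again with no marker of which tokens came from where.

```python
import numpy as np

rng = np.random.default_rng(42)
VOCAB = 10

def model(context):
    # Stand-in for a transformer forward pass: deterministic logits from
    # the context, one score per vocabulary entry. Nothing survives the
    # call; the function has no way to know who produced the tokens.
    h = np.array(context, dtype=float).sum()
    return np.cos(h + np.arange(VOCAB))      # fake logits

def sample(logits, temperature=0.8):
    # Temperature and token selection happen here, on the host side,
    # outside the model.
    p = np.exp(logits / temperature)
    p /= p.sum()
    return int(rng.choice(VOCAB, p=p))

context = [3, 1, 4]                          # "prompt" tokens
for _ in range(5):
    next_tok = sample(model(context))
    context.append(next_tok)                 # fed back as plain input
print(context)
```

After the loop, `context` is just a flat list of token ids; from the model's point of view the prompt and its own output are indistinguishable.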
This has a huge impact on what an LLM can and can't do. Don't get me wrong, it's amazing how well it performs with enough training data and clever implementations on the host side, but the fact remains: it can only produce a list of probabilities for the next token, one token at a time.
Anyway, I realize I can't provide enough detail in a post like this, but I wanted to give a bit more context to my point. You will probably respond as you have previously, but now I've given it a go, and I'll let this one go.