It's important to note that the skill of these AI math solvers mostly doesn't come from the stochastic transformer networks. It comes from logic engines whose output steps are entirely predictable given their inputs; the transformers only assist by trying to (appropriately enough) transform the wording and notation into the right inputs for the logic engines. So the math that comes out of these can be entirely correct, assuming the inputs are correct.
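To make that division of labor concrete, here's a toy sketch of my own (assuming a Lean-style proof assistant plays the role of the "logic engine"; the same split applies to Isabelle, Coq, etc.). The statement on the `theorem` line is the part a transformer would have to produce from the informal wording; checking the proof term under it is the kernel's job, and that check is completely deterministic, no sampling involved.

```lean
-- Hypothetical toy example of the "logic engine" side.
-- The informal claim "adding two naturals in either order gives the same
-- result" has to be turned into the formal statement below; once it is,
-- the kernel either accepts the proof or rejects it, with no randomness.
theorem add_flip (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```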
However, that part, transforming the wording and notation into correct inputs, is going to be quite tricky, I assume. What the notation in any given paragraph actually represents is not going to be perfectly clear and self-evident from paragraph to paragraph (my own papers held up as the shining example), and as stated above, even for a small proof errors start creeping in. That, I assume, is because a careful understanding of what the words and notation mean exactly is a bare minimum requirement for generating a correct proof from them. And right now transformer LLMs are the literal "Chinese Room" thought experiment: doing nothing but memorizing the bare minimum rules for translating one thing into another, rather than doing any useful work (to translate from thermodynamics into complexity theory) on that translation. So the instant one comes across something it has memorized slightly imperfectly, and the problem space for this memorizing is so enormous it would be laughable, it's probably going to fuck up.
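As a concrete illustration of how much hangs on that translation (my own toy example, not anything from the systems being discussed): the informal sentence "take a, subtract b, then add b back, and you get a" is true if "number" is read as an integer, but false under Lean's natural-number subtraction, which truncates at zero. The logic engine will happily check whichever formalization the transformer hands it; it has no idea which reading the paragraph actually meant. (Sketch assumes a recent Lean 4 toolchain where the `omega` tactic is available.)

```lean
-- Two formalizations of "take a, subtract b, add b back: you get a".
-- Which one is even true depends on a domain choice the informal wording
-- never spells out.

-- Read over the integers, the claim holds.
theorem over_int (a b : Int) : a - b + b = a := by omega

-- Read over the naturals, where subtraction truncates at zero, it fails:
-- a = 0, b = 1 gives 0 - 1 + 1 = 1, not 0.
example : ¬ ∀ a b : Nat, a - b + b = a := by
  intro h
  have := h 0 1
  omega
```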
Logic engines are cool. Connecting a logic engine to an LLM is a cool tech demo, but my impression is that there's a long way to go from the cool tech demo to something useful. Which is so often the case with tech demos; one more on the pile should be unsurprising.