Did you read the paper?
The paper is about whether we reinforce the model to always attempt a "correct" answer (whatever that means) vs. letting it simply say "I don't know".
Most LLMs are rewarded only when they give a correct answer: say, 1 point for correct and 0 for incorrect.
So they are incentivized to always attempt an answer, even when they aren't certain, because a wrong guess costs nothing more than abstaining. The paper suggests also rewarding "I don't know". For example, you could give 0.2 points for saying "I don't know", 0 for incorrect, and 1 for correct. Under that scheme the expected reward for answering is just the model's probability of being correct, so answering only pays off when that probability exceeds 0.2; below that threshold, abstaining is the better move.
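Here's a minimal sketch of that incentive math, using the hypothetical reward values from above (0.2 / 0 / 1); the function and variable names are mine, not from the paper:

```python
# Rewards under the hypothetical scheme: 1 for correct, 0 for incorrect,
# 0.2 for abstaining with "I don't know".
R_CORRECT = 1.0
R_INCORRECT = 0.0
R_IDK = 0.2

def best_action(p_correct: float) -> str:
    """Return the reward-maximizing action, given the model's own
    estimated probability that its answer would be correct."""
    expected_answer = p_correct * R_CORRECT + (1 - p_correct) * R_INCORRECT
    return "answer" if expected_answer > R_IDK else "I don't know"

# Answering pays off only when p_correct * 1 > 0.2, i.e. confidence above 20%.
for p in (0.05, 0.2, 0.9):
    print(f"confidence {p:.2f} -> {best_action(p)}")
```

With the old 1/0 scheme, R_IDK is effectively 0 and answering is always at least as good as abstaining, which is exactly why models guess.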
All these models are generative: what they generate is the most probable text. What counts as the most probable text depends on several factors (especially the training data). But during the reinforcement learning phase they can be taught to say "I don't know", so they play nicer with humans who expect answers that are more than just grammatically correct.