But to be clear to the GP, that doesn't mean "it's a 10% better model". For most queries you run against any two models, both generations / fixes will be "good", so which one you prefer is basically a coin flip ("I like this one's documentation more", "This one's fix was more concise", "This model was more polite", etc). 10% is actually a pretty big difference, and it reflects the cases where one model was unambiguously better than the other.
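To put rough numbers on that, here is a toy sketch. It assumes "10%" means a 10-point gap in head-to-head preference (55% vs 45%), and that every query is either an exact coin flip or a clear win for model A; neither assumption comes from the actual benchmark. The function name `win_rates` and the parameter `decisive_fraction` are made up for illustration.

```python
def win_rates(decisive_fraction: float) -> tuple[float, float]:
    """Toy model: a fraction of queries is clearly won by model A,
    and the rest are split 50/50 between A and B (assumption)."""
    coin_flip = 1.0 - decisive_fraction
    a = decisive_fraction + coin_flip / 2  # clear wins plus half the coin flips
    b = coin_flip / 2                      # only half the coin flips
    return a, b


a, b = win_rates(0.10)
print(f"A: {a:.2f}, B: {b:.2f}, gap: {a - b:.2f}")
# prints "A: 0.55, B: 0.45, gap: 0.10"
```

Under those assumptions the gap equals the decisive fraction, so a 10-point gap would mean roughly one query in ten had a clear winner. If B also wins some queries outright, the real share of decisive queries is higher than the gap suggests.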
I have a creeping suspicion that people are judging the best model on how much they approve of human responses saved in a database.
"Call immediately. Time is running out. We both need to do something monstrous before we die." -- Message from Ralph Steadman to Hunter Thompson