But to be clear to the GP, that doesn't mean "it's a 10% better model". For most queries you run against any two models, both generations / fixes will be "good", so which one you pick is basically a coin flip ("I like this one's documentation more", "This one's fix was more concise", "This model was more polite", etc). 10% is actually a pretty big difference, and it reflects the cases where one model was unambiguously better than the other.
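To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes "10%" means the gap between the two models' head-to-head win rates, and that every comparison is either decisive (one model is clearly better) or a pure coin flip. The function name and parameters are made up for illustration.

```python
def win_rate(decisive_a: float, decisive_b: float = 0.0) -> float:
    """Expected share of pairwise votes won by model A.

    decisive_a: fraction of queries where A is unambiguously better (assumed)
    decisive_b: fraction of queries where B is unambiguously better (assumed)
    Everything else is treated as a 50/50 coin flip.
    """
    coin_flip = 1.0 - decisive_a - decisive_b
    if coin_flip < 0:
        raise ValueError("decisive fractions must sum to at most 1")
    return decisive_a + 0.5 * coin_flip


# A is clearly better on 10% of queries, the other 90% are toss-ups.
a = win_rate(0.10)
b = 1.0 - a
print(f"A wins {a:.0%}, B wins {b:.0%}, gap {a - b:.0%}")
```

Under those assumptions the gap in win rate equals the net fraction of queries where one model is clearly better, so a 10-point gap can coexist with the two models being indistinguishable on the other 90% of queries.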
I have a creeping suspicion that people are judging the best model on how much they approve of human responses saved in a database.
To invent, you need a good imagination and a pile of junk. -- Thomas Edison