"Apparently the researchers didn't analyze OS fingerprints at all."
Did you look into their paper? This is apparently not true. They focused on the ICMP data set but also looked into others, in particular the service probes that you mentioned. One of their validation sets uses that data set.
Okay, point taken about the service fingerprints, but I still see no mention of the OS fingerprints. If they had looked at the data format there, they could have gotten much more out of the set. (They'd also have found more mess, by the way, as there was some weird bug that destroyed quite a few samples there.)