These results aren't as significant as you might think. In fact, the small gain over Elo lies inside the margin of error for the small sample of games being used. No matter what the sample size, there will be some algorithm, specifically tailored to that sample, that achieves better results on it than any other algorithm, especially when the other algorithm is the one that is most statistically valid on average across all possible samples.
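To put a rough number on "margin of error", here is a minimal sketch that treats prediction accuracy as a simple binomial proportion. That model, the 95% z-value, and the sample figures are all my own assumptions for illustration, not numbers from the results being discussed.

```python
import math

def margin_of_error(accuracy: float, n_games: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an accuracy measured on n_games.

    Assumes each game is an independent hit/miss trial (normal approximation
    to the binomial), which is a simplification.
    """
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_games)

# Hypothetical figures, chosen only to show the scale of the effect.
n_games = 1000
baseline_accuracy = 0.50

moe = margin_of_error(baseline_accuracy, n_games)
print(f"margin of error on {n_games} games: +/- {moe:.3f}")
```

Under these assumptions, an improvement of a point or two on a thousand games would sit inside the noise.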
Elo is a pure logistic mapping from rating difference to expected result. Mathematically and statistically, it's not possible to improve on Elo as a predictor of success in general, only in specific sub-samples. Given enough games in a large enough pool of players, Elo should be a perfect predictor, just as the normalized sum of a large number of independent, uniformly distributed random variables converges to a normal distribution.
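For reference, the logistic mapping from rating difference to expected result can be sketched as below. The 400-point scale is the commonly used convention, which I am assuming here; the function name is just illustrative.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected score of player A against player B under the logistic Elo curve.

    The result is between 0 and 1 and counts a draw as half a point.
    `scale` = 400 is the conventional value (an assumption here).
    """
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))

# Equal ratings give an even expectation; a 200-point edge gives roughly 0.76.
print(elo_expected_score(1500, 1500))
print(round(elo_expected_score(1700, 1500), 3))
```

Note that the two players' expected scores always sum to 1, so only the rating difference matters.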
It's like having the best overall car. There will be other cars that beat you in the short race, some that beat you in the long race, even some that beat you in the medium race, and some that beat you on mph, mpg, etc.
Or picture a chaotic curve with lots of randomness that has an overall trend from (0,0) to (x,y). A straight line such as y=x is the best model of the whole curve you can get. If you zoom in on some subsection of the curve and make all the randomness disappear, you can tweak your straight line to fit the data slightly better. But if you extrapolated the tweaked line over the whole spectrum, you'd get a pretty poor predictor that wouldn't even stay within the domain or range.
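The curve analogy can be tried out with a small simulation. This is only a sketch under my own assumptions: the data is y = x plus Gaussian noise, the seed, noise level, and window are arbitrary, and an ordinary least-squares fit over the whole range stands in for the overall trend line.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mse(xs, ys, slope, intercept):
    """Mean squared error of the line on the given points."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)                     # arbitrary seed, for repeatability
xs = [i / 10 for i in range(101)]          # 0.0 .. 10.0
ys = [x + rng.gauss(0, 1.0) for x in xs]   # overall trend y = x, plus noise

window = slice(40, 50)                     # the "zoomed-in" subsection
global_fit = fit_line(xs, ys)
local_fit = fit_line(xs[window], ys[window])

local_on_window = mse(xs[window], ys[window], *local_fit)
global_on_window = mse(xs[window], ys[window], *global_fit)
local_on_all = mse(xs, ys, *local_fit)
global_on_all = mse(xs, ys, *global_fit)

print(f"window: tweaked {local_on_window:.3f} vs overall {global_on_window:.3f}")
print(f"whole:  tweaked {local_on_all:.3f} vs overall {global_on_all:.3f}")
```

By construction of least squares, the tweaked line can never do worse on its own window, and the overall line can never do worse on the whole range. How badly the tweaked line extrapolates depends on the noise it happened to fit.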