- First, there's the whole issue of averaging 1-10 ratings. Those numbers will not be uniformly distributed; rather, they'll cluster at 1-2, 5, and 8-10. They also aren't ratio quantities, so you can't just average them: one 1/10 plus two 10/10s sums to 21/30, the same average as three 7/10s, even though the reactions behind them are completely different (see the first sketch after this list). A Reddit-style voting system would address this, but requires a larger sample size.
- Ignoring the first issue, your first round has fairly low confidence of selecting the best stories for review. Let's be generous and assume that of your initial 20 reviewers, half actually review it. Let's further assume none of them lie by, e.g., putting down a 7/10 without reading it. You still have a sample size of only 10. Terribly misusing the CLT (the sample is too small), we'd treat the sample mean as normally distributed about the "true" mean with a standard error of approximately 1.68. That means if a story averages 8/10 after 10 reviews, there's about a 16% chance its true score is 6.3 or worse. Similarly, a 6/10 average has about a 16% chance of really being 7.68 or better (see the second sketch after this list). Not very encouraging.
- The above makes some very, very charitable assumptions (e.g., that nobody just says "Screw it; I'm putting down a 7") and misapplies the CLT. In reality, you have no idea what your confidence interval is, other than that it's not tight.
- You can increase the sample size for the first round, but that loses the benefits of your scheme, and as people are asked to review more, they'll participate more rarely unless you reward them well.
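A minimal sketch of the averaging problem from the first point; the rating lists are hypothetical:

```python
from statistics import mean

polarized = [1, 10, 10]  # one reader hated it, two loved it
lukewarm = [7, 7, 7]     # three readers found it solidly good

print(mean(polarized))   # 7.0
print(mean(lukewarm))    # 7.0 -- identical average, very different signal
```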
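And a sketch of the second point's tail probabilities, taking the assumed 1.68 standard error at face value (the `normal_cdf` helper is just for illustration):

```python
from math import erf, sqrt

def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """P(X <= x) for X ~ Normal(mu, sigma)."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

SE = 1.68  # assumed standard error of the mean at n = 10 reviews

# Chance a story averaging 8/10 truly scores one standard error lower or worse
print(normal_cdf(8 - SE, mu=8, sigma=SE))      # ~0.159, i.e. about 16%

# Chance a story averaging 6/10 truly scores one standard error higher or better
print(1 - normal_cdf(6 + SE, mu=6, sigma=SE))  # ~0.159 as well
```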
In short, it's a pretty meaningless system based on a flawed average with unknown, but low, confidence in the scores.