Why did the original incorrect post get score 5, but the corrections only get 2?
As was stated, the CLT says that the average of many observations will be approximately normally distributed. Imagine someone producing a positive result in a paper. Then imagine we have 99 other people replicate exactly the same experiment. Each experiment will give us an average result, for a total of 100 averages from 100 experiments. It is this set of averages that is approximately normally distributed.
This makes no assumption about the shape of the distribution the data are drawn from. The underlying data do not need to be normally distributed. It only requires that the observations be independent and identically distributed (with finite variance).
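To make the "100 averages from 100 experiments" picture concrete, here is a rough simulation sketch. The exponential distribution, the sample size of 50, and the function names are my own illustrative assumptions, not anything from the paper. The raw data are heavily skewed, yet the averages cluster symmetrically around the true mean of 1.

```python
import random
import statistics


def simulate_experiment_means(n_experiments=100, sample_size=50, seed=0):
    """Run many hypothetical replications and return each one's average.

    Assumption for illustration: every observation is drawn from an
    exponential distribution with mean 1, which is far from normal.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_experiments):
        sample = [rng.expovariate(1.0) for _ in range(sample_size)]
        means.append(statistics.fmean(sample))
    return means


def skewness(values):
    """Population skewness: about 0 for symmetric data, 2 for an exponential."""
    values = list(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])


if __name__ == "__main__":
    means = simulate_experiment_means()
    print("mean of the 100 averages:", statistics.fmean(means))
    print("spread of the 100 averages:", statistics.stdev(means))
    print("skewness of the 100 averages:", skewness(means))
```

The spread of the averages should come out near 1/sqrt(50), which is what the CLT predicts for this setup.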
Now, in reality, we only perform one experiment and draw our conclusions from that. In frequentist statistics (the type that you most likely learned), we use our single experiment to infer what the other 99 experiments would look like. Here, it is important that we pick the correct statistical test, since different tests make different assumptions. The basic t-test does assume normality, homoskedasticity (equal variances), and independence. The normality assumption usually holds approximately for the group averages when samples are large (this is the CLT again, rather than the Law of Large Numbers). But these assumptions can be tested for and, if they are not met, the scientist/statistician will choose a different statistical test instead.
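As one example of "choosing a different test": when the equal-variance (homoskedasticity) assumption looks doubtful, a common substitute for the basic t-test is Welch's t-test. The sketch below only computes the two test statistics and their degrees of freedom by hand, with function names I made up for illustration. Turning them into p-values needs a t-distribution table or a statistics library.

```python
import statistics


def pooled_t(a, b):
    """Basic (Student's) two-sample t statistic, assuming equal variances."""
    n1, n2 = len(a), len(b)
    v1, v2 = statistics.variance(a), statistics.variance(b)
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    std_err = (pooled_var * (1 / n1 + 1 / n2)) ** 0.5
    t = (statistics.fmean(a) - statistics.fmean(b)) / std_err
    return t, n1 + n2 - 2


def welch_t(a, b):
    """Welch's t statistic, which does not assume equal variances."""
    n1, n2 = len(a), len(b)
    q1 = statistics.variance(a) / n1
    q2 = statistics.variance(b) / n2
    t = (statistics.fmean(a) - statistics.fmean(b)) / (q1 + q2) ** 0.5
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (q1 + q2) ** 2 / (q1 ** 2 / (n1 - 1) + q2 ** 2 / (n2 - 1))
    return t, df
```

When the two groups have the same size and the same variance, both functions agree. The more the variances differ, the more Welch's degrees of freedom shrink, which makes the test more cautious.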
Just to finish the review: Now that we have inferred the entire universe of possible results, we assume the Null Hypothesis: our treatment in the experiment did nothing (treatment group vs control), or the two groups are identical (blacks/whites or rich/poor). Due to sampling, the average of each group will vary every time we perform the experiment. The statistical test measures how often we would see our particular result in relation to the entire universe of possible results, again assuming that there is no treatment effect. If (assuming no effect) we would rarely see our result or results more extreme/larger/further apart, then we have evidence that the treatment was the cause of the difference, and not random chance.
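The "how often would we see a result this extreme if the treatment did nothing" idea can be sketched directly with a permutation test. Note that this is a different technique from the t-test: instead of relying on a formula, it builds the universe of possible results by reshuffling the group labels. The function name and defaults are my own assumptions.

```python
import random
import statistics


def permutation_p_value(treatment, control, n_permutations=10000, seed=0):
    """Estimate a two-sided p-value for the difference in group averages.

    Under the Null Hypothesis the group labels are meaningless, so we
    shuffle them many times and count how often the shuffled difference
    is at least as large as the one we actually observed.
    """
    rng = random.Random(seed)
    observed = abs(statistics.fmean(treatment) - statistics.fmean(control))
    pooled = list(treatment) + list(control)
    n_treat = len(treatment)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(
            statistics.fmean(pooled[:n_treat]) - statistics.fmean(pooled[n_treat:])
        )
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_extreme += 1
    # The +1 counts the observed labelling itself, so p is never exactly 0.
    return (at_least_as_extreme + 1) / (n_permutations + 1)
```

A small return value means that relabelled data rarely reproduce a gap as large as the observed one, which is the situation described above as evidence against "no effect".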
To explain the paper: The authors used a different type of statistics called Bayesian Statistics to derive their results. This branch of statistics is philosophically different, though it has developed analogs of all the popular frequentist statistical tests. New results in Bayesian Statistics allow for direct comparison of the two branches of statistics, and the authors have concluded that p=0.05 in frequentist statistics corresponds to a Bayes Factor of 3.47 in Bayesian Statistics. The Bayes Factor is the ratio of the probability of seeing the data in this experiment assuming that the treatment DID have an effect vs. the probability of seeing the data assuming the treatment DID NOT have an effect. (Note: here we are not looking at the entire universe of possible results, only at our single result.) In other words, given the chances of seeing two heads in a row, we are saying that there is a treatment effect vs no treatment effect.
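For a feel of what a Bayes Factor is, here is a toy coin-flip sketch. It is not the authors' calculation: I am assuming two simple point hypotheses (a coin biased to a fixed `p_effect` vs a fair coin), whereas real Bayes Factors usually average over a range of possible effect sizes. The last function converts a Bayes Factor into a probability, assuming both hypotheses were equally likely beforehand unless other prior odds are given.

```python
from math import comb


def binomial_likelihood(heads, flips, p_heads):
    """Probability of seeing exactly `heads` heads in `flips` flips."""
    return comb(flips, heads) * p_heads ** heads * (1 - p_heads) ** (flips - heads)


def bayes_factor(heads, flips, p_effect, p_null=0.5):
    """P(data | effect) / P(data | no effect) for two point hypotheses."""
    return binomial_likelihood(heads, flips, p_effect) / binomial_likelihood(
        heads, flips, p_null
    )


def posterior_probability(bf, prior_odds=1.0):
    """Probability of the 'effect' hypothesis after seeing the data."""
    posterior_odds = bf * prior_odds
    return posterior_odds / (1 + posterior_odds)


if __name__ == "__main__":
    # Two heads in two flips, hypothetical biased coin with p = 0.75.
    print(bayes_factor(2, 2, p_effect=0.75))
    # What a Bayes Factor of 3.47 means under even prior odds.
    print(posterior_probability(3.47))
```

With even prior odds, a Bayes Factor of 3.47 works out to roughly a 78% probability that there is an effect, which is much weaker than "95% sure" sounds.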