This won't fix anything (Score 5, Insightful)
There's a trade-off between sensitivity and specificity. If you raise the bar for "significance" (i.e., require a smaller p-value), you reduce the power to detect an effect when it truly does exist.
And a major part of the problem with scientific studies is that they are already underpowered. Conventional wisdom says scientists should strive for a power of about 80% (i.e., an 80% chance of detecting an effect if it truly exists), but very few studies actually achieve that level. In many fields typical power is below 50%, and sometimes much lower.
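To put rough numbers on that trade-off, here's a back-of-the-envelope sketch (my own illustration, not anything from the article): a normal approximation for a two-sample comparison, with a made-up "medium" effect size and a sample size chosen to hit 80% power at the usual threshold, then re-run with a stricter one (0.005, picked purely as an example of a tighter cutoff).

```python
from scipy.stats import norm
import numpy as np

def approx_power(d, n_per_group, alpha):
    """Approximate power of a two-sided, two-sample z-test."""
    ncp = d * np.sqrt(n_per_group / 2)       # noncentrality parameter
    z_crit = norm.ppf(1 - alpha / 2)         # two-sided critical value
    # Probability the test statistic lands beyond the critical value
    return 1 - norm.cdf(z_crit - ncp) + norm.cdf(-z_crit - ncp)

for alpha in (0.05, 0.005):
    print(f"alpha={alpha}: power ~ {approx_power(0.5, 64, alpha):.2f}")
# alpha=0.05  -> power ~ 0.81
# alpha=0.005 -> power ~ 0.51
```

Same study, same data; simply tightening the threshold without collecting more subjects drops you from the conventional 80% power to roughly a coin flip.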
Underpowered studies result in two major problems:
1) Most obviously, an underpowered study produces more FALSE NEGATIVES: you fail to find a true effect. Either you publish your incorrect null result (and why should we consider published false positives any worse than published false negatives?), or you don't publish at all because you couldn't reach significance, which exacerbates the "file-drawer effect" and wastes research dollars on results that never see print.
2) Somewhat counterintuitively, underpowered studies are often also more likely to produce FALSE POSITIVES. When your power to detect a true effect is low and you test a large number of hypotheses, most of which are truly null, a large fraction of the results you declare "significant" will actually be false positives: the "false discovery rate" tends to be very high when power is low. (See the sketch right after this list.)
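Here's the arithmetic behind point 2, as a toy calculation with numbers I'm making up purely for illustration (10% of tested hypotheses are real effects, alpha = 0.05):

```python
# Toy false-discovery-rate calculation (illustrative numbers only).
alpha = 0.05        # significance threshold
pi_true = 0.10      # fraction of tested hypotheses that are truly non-null

def false_discovery_rate(power):
    false_pos = alpha * (1 - pi_true)   # nulls that slip through anyway
    true_pos = power * pi_true          # real effects actually detected
    return false_pos / (false_pos + true_pos)

for power in (0.8, 0.2):
    print(f"power={power}: FDR ~ {false_discovery_rate(power):.0%}")
# power=0.8 -> FDR ~ 36%
# power=0.2 -> FDR ~ 69%
```

Same threshold, same underlying reality; the only thing that changed is the power, and the fraction of "discoveries" that are bogus roughly doubles.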
Reducing the level of significance will do little to address these problems, and in some cases may even make them worse.
The key is *to move away from the binary concept of "significance" altogether*. It's obviously artificial to have an arbitrary numerical cutoff for "matters" vs. "doesn't matter", and this is not what Ronald Fisher intended when he popularized the p-value or developed the concept of "significance".
What we should be doing is measuring and reporting effect sizes along with their credible intervals, using priors that reflect our real state of knowledge. In other words, we should be doing Bayesian statistics.
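For what that might look like in practice, here's a minimal sketch with a made-up study result and a made-up skeptical prior (the numbers are mine, chosen only to show the mechanics of a conjugate normal-normal update):

```python
from scipy.stats import norm

# Prior on the effect size, reflecting what we believe going in:
# most effects in this hypothetical field are small.
prior_mean, prior_sd = 0.0, 0.3

# Hypothetical study result: observed effect 0.4 with standard error 0.15.
obs_effect, obs_se = 0.4, 0.15

# Conjugate normal-normal update: posterior precision is the sum of precisions,
# posterior mean is the precision-weighted average.
post_prec = 1 / prior_sd**2 + 1 / obs_se**2
post_sd = post_prec ** -0.5
post_mean = (prior_mean / prior_sd**2 + obs_effect / obs_se**2) / post_prec

lo, hi = norm.ppf([0.025, 0.975], loc=post_mean, scale=post_sd)
print(f"posterior effect ~ {post_mean:.2f}, 95% credible interval ({lo:.2f}, {hi:.2f})")
# posterior effect ~ 0.32, 95% credible interval (0.06, 0.58)
```

Instead of a thumbs-up/thumbs-down on "significance", you report where the effect probably lies given the data and what we already knew, and readers can judge for themselves whether it's big enough to matter.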