"Well, if I generate (by simulation) a set of 200 variables — completely random and totally unrelated to each other — with about 1,000 data points for each, then it would be near impossible not to find in it a certain number of “significant” correlations of sorts. But these correlations would be entirely spurious. And while there are techniques to control the cherry-picking (such as the Bonferroni adjustment), they don’t catch the culprits — much as regulation didn’t stop insiders from gaming the system. You can’t really police researchers, particularly when they are free agents toying with the large data available on the web.
I am not saying here that there is no information in big data. There is plenty of information. The problem — the central issue — is that the needle comes in an increasingly larger haystack."
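The simulation described in the quote is easy to sketch. The following is a minimal illustration, not the author's own procedure: it assumes independent standard-normal variables, a two-sided test at the 5% level, and the Fisher z-transform as an approximate significance test for each pairwise correlation. The function name `count_spurious` and its parameters are hypothetical, chosen for this example.

```python
import numpy as np
from statistics import NormalDist


def count_spurious(n_vars=200, n_obs=1000, alpha=0.05, seed=0):
    """Count 'significant' pairwise correlations among unrelated random variables.

    Returns (number of pairs tested, hits at the naive alpha level,
    hits after a Bonferroni adjustment).
    """
    rng = np.random.default_rng(seed)
    # Independent noise: each column is one variable, each row one data point.
    data = rng.standard_normal((n_obs, n_vars))

    # Pairwise Pearson correlations; keep each pair once (upper triangle).
    corr = np.corrcoef(data, rowvar=False)
    r = corr[np.triu_indices(n_vars, k=1)]
    n_pairs = r.size

    # Fisher z-transform: atanh(r) * sqrt(n - 3) is approximately N(0, 1)
    # when the true correlation is zero (an approximation, assumed here).
    z = np.abs(np.arctanh(r) * np.sqrt(n_obs - 3))

    # Two-sided critical values: unadjusted, and Bonferroni (alpha / n_pairs).
    z_naive = NormalDist().inv_cdf(1 - alpha / 2)
    z_bonf = NormalDist().inv_cdf(1 - alpha / (2 * n_pairs))

    naive_hits = int(np.sum(z > z_naive))
    bonf_hits = int(np.sum(z > z_bonf))
    return n_pairs, naive_hits, bonf_hits


if __name__ == "__main__":
    n_pairs, naive_hits, bonf_hits = count_spurious()
    print(f"pairs tested: {n_pairs}")
    print(f"'significant' at 5% without adjustment: {naive_hits}")
    print(f"'significant' after Bonferroni: {bonf_hits}")
```

With 200 variables there are 200 × 199 / 2 = 19,900 pairs, so at the 5% level one should expect on the order of a thousand "significant" correlations from pure noise. The Bonferroni threshold removes nearly all of them, but only if the full number of comparisons is known and reported, which is the policing problem the quote points to.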