Daniel Dvorkin's Journal: Correlation, causation, and all that.

Journal by Daniel Dvorkin

So this cartoon has been going around my Facebook friends list ... I'm going to try to explain what's wrong with it, and I'll try to be succinct, but I don't know how good a job I'll do, so bear with me. The short and snarky version is found in my Slashdot sig line, "The correlation between ignorance of statistics and using 'correlation is not causation' as an argument is close to 1," but that's kind of unfair and certainly isn't all the discussion this subject deserves.

First of all, yes, "correlation is not causation" is strictly true. That is, they are not the same thing. If events A and B tend to occur together, this does not mean that A causes B, or that B causes A. There may be a third, unobserved event C that causes both, or the observed correlation may simply be a coincidence. Bear this in mind.

But if you observe the correlation frequently enough to establish significance, you can be reasonably sure (arbitrarily sure, depending on how many times you make the observation) that it's not coincidence. So now you're back to one of three explanations: A causes B, B causes A, or there exists some C that causes both A and B. (Two caveats: whatever the causal relationships are, they may be very indirect, proceeding through events D, E, F, and G; and the word "significance" has a very precise meaning in this context, so check with your local statistician before using it.) An easy way to check for A-causes-B vs. B-causes-A is by looking at temporal relationships. If you are already wearing your seatbelt when you get in a car crash, you are far more likely to survive than if you aren't, but you have to have made the decision to put the seatbelt on before the crash occurs--it's the fact of you wearing your seatbelt that causes you to get through the crash okay, not the fact that you get through the crash okay that causes you to have been wearing your seatbelt. Unfortunately, the temporal relationships aren't always clear, and even if you can rule out B-causes-A on this basis, it still leaves you to choose between A-causes-B and C-causes-(A,B).
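For the curious, here is a minimal sketch of what "observing the correlation frequently enough" looks like in practice. It uses a permutation test, which is just one of several ways to assess significance, and everything in it is an assumption made for illustration: the function names, the simulated data, and the built-in effect size of 0.5 are all hypothetical. The idea is to shuffle one variable many times and count how often pure chance produces a correlation at least as strong as the observed one.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def permutation_p_value(xs, ys, n_shuffles=1000, seed=0):
    """Estimate how often shuffled (unrelated) data matches or beats
    the observed |r|. Small values mean coincidence is unlikely."""
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # breaks any real pairing between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # The +1 terms count the observed arrangement itself, so p is never 0.
    return (hits + 1) / (n_shuffles + 1)

def simulate(n, seed=1):
    """Hypothetical data: B depends weakly on A, plus independent noise."""
    rng = random.Random(seed)
    xs = [rng.gauss(0, 1) for _ in range(n)]
    ys = [0.5 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

if __name__ == "__main__":
    for n in (10, 50, 500):
        xs, ys = simulate(n)
        print(n, round(pearson_r(xs, ys), 3), permutation_p_value(xs, ys))
```

A small p-value only rules out coincidence. It says nothing about which of the three remaining explanations is the right one.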

An awful lot of what science does is figuring out what C is, or even if it exists at all. This is where mechanistic knowledge of the universe comes into play. Suppose that emergency departments in a particular city start seeing a whole bunch of patients with acute-onset fever and diarrhea. Shortly thereafter, EDs in nearby cities start seeing the same thing, and then the same in cities connected by air travel routes. Patient histories reveal that the diarrhea tends to start about six hours after the onset of fever. Does this mean the fever is causing the diarrhea? Probably not, because these days we know enough about the mechanisms of infectious disease to know that there are lots of pathogens that cause fever, then diarrhea. The epidemiologists' and physicians' job is then to figure out what the pathogen is, how it spreads, and hopefully how best to treat it; while they're doing that, the "correlation is not causation" fanatics will be sticking their fingers in their ears and chanting "la la la I can't hear you," and hoping desperately they don't end their days as dehydrated husks lying on a feces-soaked hospital bed.
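To make the C-causes-(A,B) case concrete, here is a rough simulation of the outbreak. All the probabilities are invented for illustration (30% infected, and so on), the function names are hypothetical, and the six-hour lag between symptoms is not modeled. In this toy world fever has no effect on diarrhea at all; the pathogen causes both. Fever and diarrhea are still strongly correlated across all patients, and the correlation disappears once you compare only patients with the same infection status.

```python
import random

def simulate_patients(n=20000, seed=0):
    """Return (infected, fever, diarrhea) tuples. Symptoms depend only
    on infection status, never on each other (assumed probabilities)."""
    rng = random.Random(seed)
    patients = []
    for _ in range(n):
        infected = rng.random() < 0.30               # C: the hidden pathogen
        p_fever = 0.80 if infected else 0.05         # A depends only on C
        p_diarrhea = 0.70 if infected else 0.05      # B depends only on C
        fever = rng.random() < p_fever
        diarrhea = rng.random() < p_diarrhea
        patients.append((infected, fever, diarrhea))
    return patients

def diarrhea_rate(patients, fever, infected=None):
    """Fraction with diarrhea among patients with the given fever status,
    optionally restricted to one infection status."""
    group = [d for (i, f, d) in patients
             if f == fever and (infected is None or i == infected)]
    return sum(group) / len(group)

def risk_difference(patients, infected=None):
    """P(diarrhea | fever) - P(diarrhea | no fever)."""
    return (diarrhea_rate(patients, True, infected)
            - diarrhea_rate(patients, False, infected))

if __name__ == "__main__":
    patients = simulate_patients()
    print("all patients:   ", round(risk_difference(patients), 3))
    print("infected only:  ", round(risk_difference(patients, True), 3))
    print("uninfected only:", round(risk_difference(patients, False), 3))
```

In real life nobody hands you the `infected` column, of course. Finding out that it exists, and how to measure it, is the hard part.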

The point here is that in most cases, correlation is all we can observe. (Some philosophers of science, a la David Hume, would argue that we never observe causation, but I'm willing to accept "cause of death: gunshot wound to head" and similar extreme cases as direct observation of causal relationships.) Not every patient exposed to the pathogen will get infected. Of those who do, not all will show symptoms. Some symptomatic patients will just get the fever, some will just get the diarrhea. Some will get them at the same time, or the diarrhea first. Medical ethics boards tend to frown on doing controlled experiments with infectious diseases on human subjects, so you have to make what inferences you can with the data you have.

Even with all these limitations, correlation--in this case between exposure and symptoms--is still a powerful tool for uncovering the causal relationships. Most of what we know about human health comes from exactly this kind of analysis, and the same is true for the observational sciences generally. Astronomy, geology, paleontology, large chunks of physics and biology ... they're all built on observations of correlation, and smart inference from those observations. So if you want to know how the universe works, don't rely on any one-liners, no matter how satisfying, to guide your understanding.

  • I previously wrote about the misuse of "correlation does not imply causation" in my own journal [slashdot.org].

    it's the fact of you wearing your seatbelt that causes you to get through the crash okay, not the fact that you get through the crash okay that causes you to have been wearing your seatbelt.

    That or C. Defensive driving habits cause you to both wear your seat belt and drive in such a way as to minimize crash damage.

    • Nice journal post! Good, lively discussion, too.

      That or C. Defensive driving habits cause you to both wear your seat belt and drive in such a way as to minimize crash damage.

      True enough. I guess I was kind of assuming the kind of crash where someone blindsides you and it doesn't matter how well (or poorly) you're driving. To be sure, the best way to survive a crash is not to be in one in the first place, and if you can't do that, react in a way that causes the actual collision to take place at the lowest possible speed.

      • Nice journal post!

        Thank you.

        Good, lively discussion, too.

        For the record, the signature with hyperbole was this:

        --
        Correlation implies 25% likelihood of causation. Either A causes B, B causes A, C causes A and B, or chance.

        And after I toned it down:

        --
        Found correlation? Consider all four possibilities: A causes B, B causes A, C causes A and B, or chance.

        • by mcgrew (92797) *

          Correlation implies 25% likelihood of causation.

          50% chance before discovering temporal data. 25% chance a causes b, 25% chance b causes a. But if you have correlation, causation is unknown without further data.

          Most of my bosses have been statisticians holding PhDs in the field (I manage the databases, NOMAD is my favorite language but the bastards make me use Access now, I'm glad I'm retiring) and I learned a lot from them.

          BTW, both journals were excellent.

          • Without some reasonable sample of the universe of correlative and causative relationships, I don't think we can even go so far as to say that half the time A causes B and half the time B causes A--maybe we're more likely to observe relationships that go one way than the other? In my disease example, we started observing symptoms long before we discovered pathogens, for example, and some symptoms were identified earlier than others (I'm willing to bet we knew what diarrhea was long before we knew about feve

            • Most of my bosses have been statisticians holding PhDs in the field [who] make me use Access

              There is simply no good use case for Access, IMO.

              It's a journal, so I hope we're allowed a bit more leeway to drift the topic a bit more than in a story: For me, Access was a bridge toward learning SQL. I applied the knowledge to completely replace my employer's Access+VBA+MS SQL Server Express-powered application with an enhanced workalike in web+PHP+MySQL. Currently, this workalike powers Phil's Hobby Shop [philshobbyshop.com]. But in practice, what better tool is there for rapid prototyping of CRUD forms and point-and-click creation of the SQL joins that feed into a report

              • Hey, drift all you want. ;)

                And fair enough. I learned SQL the old-fashioned way [grumble mutter snore] and I'm much more verbal than visual (or, as mathematicians say, much more an algebraist than a geometer) so when I had to use Access briefly on the job, it made me want to dig my eyeballs out with a spoon. Fortunately I was able to convince my boss to go with a F/OSS stack instead--one of the many virtues of working for a small company.

        • Yeah, I think the toned-down version is better; you can't make inferences about the relative probabilities of each possible relationship without data specific to the situation. But it's certainly a point worth making, either way.

  • An intriguing counterpoint: Do Short Skirts Really Mean Better Times? [go.com]

    Humans have a remarkable ability to find patterns in random noise, to correlate where there may not be causation. Sometimes that's a good thing. Sometimes not. We have methods (reliable prediction mostly) for attempting to reduce the amount of irrelevant correlation - but it doesn't always work because randomness is, well, random.

    It makes life, and science, interesting.

    • Humans have a remarkable ability to find patterns in random noise, to correlate where there may not be causation.

      Sounds to me like in this case, there really isn't even that much correlation: "Hemlines were starting to come down in '27 and that was two years before the market crash", etc. I suspect that a lot of the examples people grab to illustrate "correlation is not causation" are really examples of confirmation bias, and if you look at the actual numbers there isn't a significant correlation to begin with.

      There's also the problem that if you look at the universe of all possible data sets, there's an effectively i

      • by symbolset (646467) *

        Your theory of a totally dispassionate observer being required to correlate properly is interesting. Do you have some proof such a being has ever existed? Knowing people as I do, I would want strong proof.

        If you look at the universe of all possible data sets [wikipedia.org], all things not only are possible but, in some interpretations, do exist. A creative person would say that the process of actualization is to guide your subjective self through the dimensions of the possible - no matter how improbable - to achieve you

        • Oh, I'm not claiming any observer can be dispassionate. We should do the best we can, is all, and hopefully avoid the really obvious dumb mistakes.

          Point taken, about the many-worlds interpretation. Perhaps I should have said "our universe of all possible data sets ..." If you assume a new universe is born every time a quantum event takes place, of course, the number of possible correlations is even more absurdly large.