This post was published on 04/12/10 in Science.

  1. Comment by Lee Rudolph, 05/12/10, 04:59:

    EP writes of “rummaging through books written 300 years ago by the fathers of data mining and scraping, John Graunt and William Petty”

    I hope you’re making sure that the Wellcome Library gets the texts of all those books scanned, OCR’d, and up on the web. “Old manuscript rooms” are all very, well, “glorious”, but I find it damned convenient to sit here at home and rummage through millions of books and journals (not all, alas, free access yet; luckily I have courtesy access to a number of paysites, and some skill at seeing a bit more of certain Google Books than Google or its partners want easily seen). I can almost make my writings look scholarly now!

    An interesting sidelight, by the way, is how easy it is in this brave new world to detect plagiarism (or arguably, in a few cases, independent origination of a good phrase…), including self-plagiarism (incorporating the same couple of paragraphs in a dozen purportedly distinct publications), as well as memetic evolution (if, as I have just had occasion to do for purely *scholarly* purposes, you track down the rise of the phrase “butterfly effect”, you’ll find that over time both the location of the original wing-flap and the location and nature of the eventual weather event drift away from Lorenz’s original Brazilian butterfly and Texan tornado) and the propagation of citation errors. Fun!

  2. Comment by Donald Pollock, 06/12/10, 06:25:

    This comment came to Elizabeth by e-mail, but seemed like a point well worth posting:

    I enjoyed your Prospect essay on data mining, etc., so I hope you don’t find it ungracious of me to point out that Karl Popper actually rejected the notion of verifiability in scientific hypotheses. Popper’s crusade was for falsifiability: hypotheses are scientific when they are expressed in a form that allows them to be falsified, not verified. To use the common example: to say that all swans are white says something interesting, and is scientific because it would take only a single black (or orange) swan to falsify the statement. To say that one swan is white is not especially interesting, and is not a valuable scientific hypothesis, even though it is easy to verify. For Popper, observations can be verified, but it is hypotheses that are the mark of science, and only if they can be falsified.

    You may be misled by Conjectures and Refutations. There, Popper realized that the history of science conflicts with his philosophy of science: scientists work to verify hypotheses, not to falsify them. So Popper introduced the notion of corroboration, a kind of strange, almost statistical measure of how close any hypothesis is to being verified (perfect verification being impossible, it always being the case that the next swan could be black). Corroboration simply doesn’t work, as Popper outlines it, but ironically it may be nearer to what scientists do.

  3. Comment by Nathan V, 21/12/10, 06:20:

    ‘“Fishing” is only a problem if the datasets are too small or the sampling design too weak to support the results.’

    I feel like this is an unfair description of the risks of data dredging, and hence of the risks of hypothesis-free science. It’d be true if “statistically significant” meant “true”; it’d be true if every question ever asked were published and well-known. Neither of those is the case.

    Every time you ask a question where there is no real effect, there’s a 1/20 chance (at the conventional p < 0.05 threshold) that you’re going to get a spurious result. So if you ask lots of questions, you’re going to get some positives. It doesn’t matter if your sample is the whole world.
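
    A minimal simulation of that multiplicity point (illustrative Python; the test choice, seed, and sample sizes are my assumptions, not anything from the post or this comment):

        # Ask many questions of pure noise at p < 0.05: roughly 1 in 20
        # comes back "significant", no matter how big each sample is.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(0)
        alpha, n_questions, n_per_group = 0.05, 200, 10_000

        false_positives = 0
        for _ in range(n_questions):
            # Both groups are drawn from the SAME distribution, so any
            # detected "effect" is spurious by construction.
            a = rng.normal(size=n_per_group)
            b = rng.normal(size=n_per_group)
            if stats.ttest_ind(a, b).pvalue < alpha:
                false_positives += 1

        print(false_positives, "of", n_questions, "null questions were 'significant'")
        # Expect about alpha * n_questions = 10 false positives.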

    Now, if you design a study with a hypothesis, or several hypotheses, in mind, you end up with some armor against this. After all, you can ask 20 questions but correct for the number of questions you ask by demanding a better p. And the hypothesis itself presumably came out of prior data dredging, which gives you a little more certainty.
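
    A sketch of what “demanding a better p” can mean, assuming the simplest such correction (Bonferroni); the p-values below are made up for illustration:

        # Bonferroni: with m planned questions, require p < alpha / m, so the
        # chance of ANY false positive across the family stays near alpha.
        alpha, m = 0.05, 20
        threshold = alpha / m                      # "a better p": 0.0025

        p_values = [0.030, 0.0021, 0.40, 0.0004]   # hypothetical results
        survivors = [p for p in p_values if p < threshold]
        print("per-test threshold:", threshold, "survivors:", survivors)
        # 0.030 would have passed at 0.05 but does not survive the correction.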

    But when you poll “Big Data,” there’s no need to say exactly how many questions you’re asking. In fact, it’s almost impossible to know exactly how many questions you’re asking. That’s why it’s searching for hypotheses, and not experimentation.

    Let’s say AstraZeneca has perfect health data for everybody in the world, and they want to show that Seroquel is a good drug. If they can find twenty populations that have used the drug (schizophrenic, bipolar, etc.), they can expect to find about one spurious association. Obviously, the nulls will never be published; in fact, it’d be impossible for any of us to tell exactly how many populations AZ had to look at to find a statistically significant benefit.

    That’s (one of the reasons) why we have trial registries.

    This isn’t limited to pharmaceutical companies. One big story of ’08 (I think?) was of a doctor who looked, retrospectively, at diet info for pregnant moms and discovered that eating breakfast cereal made one more likely to conceive a boy. The problem, of course, was that over 200 different foodstuffs were looked at, virtually guaranteeing a statistically significant correlation with at least one of them.
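
    The arithmetic behind “virtually guaranteeing”, assuming 200 independent true-null tests at p < 0.05 (the independence is my simplifying assumption, not the study’s design):

        # Probability of at least one spurious "hit" across 200 tests:
        alpha, n_foods = 0.05, 200
        p_any = 1 - (1 - alpha) ** n_foods
        print(f"P(at least one false positive) = {p_any:.5f}")   # ~0.99997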

    I feel as if you’re describing a naive view of hypothesis generation in your article: that of a scientist, sitting at a desk, trying really hard to come up with an idea to test. Obviously, that is not how things work. Hypotheses arise in response to data. Those wonderful associations you find in databases are not trash even without “hypothesis-free science”; they are the source of hypotheses.

Comments are closed at this time.