04/12/10

The myth of hypothesis-driven science

At a conference in Mexico recently, I ran into Wired editor Chris Anderson. His essay on the petabyte age, published a couple of years ago, sounded the death knell for the scientific method. I was seduced by the argument at the time, as well as by the beautiful graphics that accompanied the piece. Visualising Big Data can be a pleasure, as this graphic of edits to Wikipedia pages shows.

But when I started to dig around, I found that there’s nothing new about Big Data. People have been complaining about the data deluge since the 1600s.

“One of the diseases of this age is the multiplicity of books; they doth so overcharge the world that it is not able to digest the abundance of idle matter that is every day hatched and brought forth into the world,” thundered Barnaby Rich in 1613. He himself contributed 26 books to the multiplicity and eventually gave his name to the Barnaby Rich effect: “a high output of scientific writings accompanied by complaints on the excessive productivity of other authors.”

What about the fact that new technologies are allowing us just to throw gobs of data at the wall, see what sticks, and turn that into a new theory, rather than starting with a hypothesis and laboriously collecting the data to confirm or refute it? In an essay just out in Prospect I’m forced to conclude that hypothesis-driven science has always been a bit of a myth, shaped more by the way science is funded than by the need to create or maintain rigour.

I had fun writing the essay because it gave me an excuse to sit in the rare manuscripts room of the glorious Wellcome Library, rummaging through books written 300 years ago by the fathers of data mining and scraping, John Graunt and William Petty. As I note in the essay:

In one of his “Essays on Political Arithmetick,” Petty took death rates collected for another purpose, stirred them with a couple of wild assumptions on population, and seasoned them with a dash of prejudice to conclude that British hospitals were much less likely to kill their patients than French ones, where “Half the said numbers did not die by natural necessity but by the evil administration of the hospital.” In a precursor to the World Bank’s habit of pricing productivity lost by ill-health, Petty goes on to calculate the cost of the unnecessary deaths, valuing the French at £60 each, “being about the value of Ariger Slaves (which is less than the intrinsik value of People at Paris).”

English commentator trashes French health system. Indeed, there’s nothing new about the way we use data…


This post was published on 04/12/10 in Science.


3 comments



  1. Comment by Lee Rudolph, 05/12/10, 04:59:

    EP writes of “rummaging through books written 300 years ago by the fathers of data mining and scraping, John Graunt and William Petty”

    I hope you’re making sure that the Wellcome Library gets the texts of all those books scanned, OCRd, and up on the web. “Old manuscript rooms” are all very, well, “glorious”, but I find it damned convenient to sit here at home and rummage through millions of books and journals (not all, alas, free access–yet; luckily I have courtesy access to a number of paysites, and some skill at seeing a bit more of certain Google Books than Google or its partners want to be easily seen). I can almost make my writings look scholarly now!

    An interesting sidelight, by the way, is how easy it is in this brave new world to detect plagiarism (or arguably, in a few cases, independent origination of a good phrase…) including self-plagiarism (incorporating the same couple of paragraphs in a dozen purportedly distinct publications), as well as memetic evolution (if, as I have just had occasion to do for purely *scholarly* purposes, you track down the rise of the phrase “butterfly effect”, you’ll find that over time both the location of the original wing-flap, and the location and nature of the eventual weather event, drift away from Lorenz’s original Brazil and Texan tornado) and the propagation of citation errors. Fun!

  2. Comment by Donald Pollock, 06/12/10, 06:25:

    This comment came to Elizabeth by e-mail, but seemed like a point well worth posting:

    I enjoyed your Prospect essay on data mining, etc, so I hope you don’t find it ungracious of me to point out that Karl Popper actually rejected the notion of verifiability in scientific hypotheses. Popper’s crusade was for falsifiability: hypotheses are scientific when they are expressed in a form that allows them to be falsified, not verified. To use the common example: to say that all swans are white says something interesting, and is scientific because it would take only a single black (or orange) swan to falsify the statement. To say that one swan is white is not especially interesting, and is not a valuable scientific hypothesis, even though it is easy to verify. For Popper, observations can be verified, but it is hypotheses that are the mark of science, and only if they can be falsified.

    You may be misled by Conjectures and Refutations. There, Popper realized that the history of science conflicts with his philosophy of science: scientists work to verify hypotheses, not to falsify them. So Popper introduced the notion of Corroboration, a kind of strange, almost statistical measure of how close any hypothesis is to being verified – perfect verification being impossible, it always being the case that the next swan could be black. Corroboration simply doesn’t work, as Popper outlines it, but ironically it may be nearer to what scientists do.

  3. Comment by Nathan V, 21/12/10, 06:20:

    ‘“Fishing” is only a problem if the datasets are too small or the sampling design too weak to support the results.’

    I feel like this is an unfair description of the risks of data dredging, and hence of the risks of hypothesis-free science. It’d be true if “statistically significant” meant “true”; it’d be true if every question ever asked were published and well-known. Neither of those is the case.

    Every time you ask a question, there’s a 1-in-20 chance you’ll get a spurious result at the conventional p < 0.05 threshold. So if you ask lots of questions, you’re going to get some positives. It doesn’t matter if your sample is the whole world.

    Now, if you design a study with a hypothesis, or several hypotheses, in mind, you end up with some armor against this. After all, you can ask 20 questions but correct for the number of questions you ask by demanding a smaller p-value for each (the sketch at the end of this comment makes the arithmetic concrete). And there’s the assumption of prior data dredging to come up with the hypothesis in the first place, which gives you a little more certainty.

    But when you poll “Big Data,” there’s no need to say exactly how many questions you’re asking. In fact, it’s almost impossible to know exactly how many questions you’re asking. That’s why it’s searching for hypotheses, and not experimentation.

    Let’s say AstraZeneca has perfect health data for everybody in the world, and they want to show that Seroquel is a good drug. If they can find twenty populations that have used the drug (schizophrenic, bipolar, etc.), they can find one spurious association. Obviously, the nulls will never be published; in fact, it’d be impossible for any of us to tell exactly how many populations AZ had to look at to find a statistically significant benefit.

    That’s (one of the reasons) why we have trial registries.

    This isn’t limited to pharmaceutical companies. One big story of ’08 (I think?) was of a doctor who looked, retrospectively, at diet info for pregnant moms and discovered that eating breakfast cereal made one more likely to conceive a boy. The problem, of course, was that over 200 different foodstuffs were looked at, virtually guaranteeing a statistically significant correlation with at least one of the foods.

    I feel as if you’re describing a naive view of hypothesis generation in your article: that of a scientist, sitting at a desk, trying really hard to come up with an idea to test. Obviously, that is not how things work. Hypotheses arise in response to data. Those wonderful associations you find in databases are not trash without hypothesis-free science; they are the source of hypotheses.
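
    A few lines of Python make the arithmetic concrete. This is only an illustrative sketch with invented numbers (twenty questions asked of data in which there is no real effect at all), not anything drawn from an actual trial; the sample sizes, the permutation test, and the Bonferroni correction are stand-ins for whatever a real analysis would use.

        # A rough simulation of "ask 20 questions, get a spurious hit".
        # All numbers are invented for illustration; no real drug or trial data.
        import random
        import statistics

        random.seed(1)

        def null_p_value(n=100, shuffles=1000):
            """Permutation p-value for the difference in means between two
            groups drawn from the SAME distribution, i.e. the true effect is zero."""
            a = [random.gauss(0, 1) for _ in range(n)]
            b = [random.gauss(0, 1) for _ in range(n)]
            observed = abs(statistics.mean(a) - statistics.mean(b))
            pooled = a + b
            extreme = 0
            for _ in range(shuffles):
                random.shuffle(pooled)
                diff = abs(statistics.mean(pooled[:n]) - statistics.mean(pooled[n:]))
                if diff >= observed:
                    extreme += 1
            return extreme / shuffles

        questions = 20  # e.g. twenty patient populations, or twenty foodstuffs
        p_values = [null_p_value() for _ in range(questions)]

        naive = sum(p < 0.05 for p in p_values)                   # no correction
        corrected = sum(p < 0.05 / questions for p in p_values)   # Bonferroni

        print(f"'Significant' findings at p < 0.05: {naive} of {questions}")
        print(f"After correcting for {questions} questions: {corrected} of {questions}")

    Run it a few times with different seeds and the uncorrected count hovers around one in twenty, which is exactly the point: the “finding” turns up whether or not there is anything to find, and the correction for the number of questions asked is what makes it go away.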
