Beware the Big Data Gospel

Weekly Article
Feb. 26, 2015

Earlier this month, I published an article on CNN.com examining the limits of big data as an instrument of progress. I won’t rehash the arguments of that article here (I do hope you’ll read it), but I want to respond to two critiques of the piece from Marco Lübbecke, a professor of operations research, on CNN.com and Thomas Davenport, author of Big Data at Work, on a Wall Street Journal blog.

Lübbecke’s counter-claim that “big data saves lives” is emotionally manipulative and unsupported by any evidence he puts forth. The antithesis Davenport sets up between “data and analytics” on the one hand and “unaided human intuition” on the other is a dangerously misleading simplification that wrongly conflates intuitive knowledge with arbitrary subjectivity. Davenport also fails to account for the fact that data compilation and analysis are performed by human beings and are therefore neither automatic nor objective. As philosopher Michael Polanyi thoughtfully characterized tacit knowledge, “we can know more than we can tell.” Polanyi’s point remains true no matter how many zettabytes of storage and petaflops of processing power one has at one’s disposal.

Quantitative analysis of data has been central to what we’ve come to call science since well before the word “science” existed. Babylonian astronomers gathered data about solar eclipses 2,600 years ago, and used that data to predict future eclipses. As science and technology have evolved over the millennia, so too have tools for gathering and analyzing data. Many of those tools are invaluable to the scientific endeavor. But adherents to the church of big data, like Lübbecke and Davenport, fail to see what legal scholar Julie Cohen points out: “Big Data is the ultimate expression of a mode of rationality that equates information with truth and more information with more truth, and that denies the possibility that information processing designed simply to identify ‘patterns’ might be systematically infused with a particular ideology.” My aim is to interrogate that ideology, not to deny that the creation and analysis of quantitative data is a necessary part of science.

A close reading of the examples Lübbecke puts forth to demonstrate the life-saving potential of Big Data reveals the hollowness of Davenport’s claim that “at the core of analytical decision-making is not soft fad, but hard science.” Lübbecke cites polio vaccination campaigns as an “outstanding example of the way that Big Data saves lives.” His evidence is a paper co-authored by scientists at the Centers for Disease Control (CDC), but his reading conflates the very real ability of vaccines to save lives with the supposed life-saving power of convoluted analytical techniques for estimating how effective vaccines are. The relevant question is the virtue of analytical techniques for deciding how vaccines ought to be optimally deployed, not the virtue of the vaccines themselves.

The CDC paper concludes that “sustained intense immunization efforts” are better than “wavering commitments” to immunization. I don’t doubt this is true enough. Common sense would dictate that sustained efforts are better than wavering commitments. But what value does data-driven analysis add to this proposition? Analytic techniques, Lübbecke points out, yield the claim that the Global Polio Eradication Initiative (GPEI) has saved and will save between $40 and $50 billion from 1988 to 2035, and that vitamin A delivered along with polio vaccines accounts for a further savings of between $17 billion and $90 billion.

Does the range of $17–90 billion tell us anything that the words “lots of money” do not? The CDC journal article goes on to quote the director of Rotary International’s anti-polio campaign: “We regularly use the $40–50 billion estimate of net benefits of the GPEI as we raise funds to finish polio eradication.” This goes to the point I was trying to make in the original piece. It’s not that anyone should really have confidence that the polio eradication campaign saved $45 billion, give or take $5 billion. It’s that saying so is an effective fundraising technique. Pretending that a range of $17–90 billion conveys more information than “a lot of money” does is where an uncritical acceptance of the virtue of data goes off the rails.

Lübbecke is correct in his generic call for careful analysis; but he doesn’t follow through on his own prescription.

It’s simply a category mistake to attempt to come up with a specific number for the economic impact of polio eradication. It is not as if there is some accurate figure, say $47,253,238,334, which more sophisticated methodology will allow us to pin down. No such number exists, and all the economists in all the business schools can’t reliably find it. A world in which fewer people die of polio is a different world and, I would argue, a better one. The true case for vaccination is a moral one that rests on lives saved and people spared the ravages of polio, not on a dollar figure of benefit to the economy.

However, the polio example isn’t as laughable as Lübbecke’s other example purporting to demonstrate the life-saving benefits of big data: a blog post by Edward Kaplan of Yale that discusses a business-school study of the number of “counterterror agents” the US needs. Lübbecke endorses the model that Kaplan uses, in which the “number of counterterror agents drives the rate with which [terrorist] plots are detected.” But Kaplan’s model is ludicrously oversimplified. He doesn’t clearly define who “counterterror agents” are. Do police officers, DEA agents, and bureaucrats with the Department of Homeland Security count? Do customs agents? Do US Marshals?

Is the probability that a terrorist plot is uncovered really simply a function of the number of agents, as in Kaplan’s model, and not of factors like the agents’ intelligence, legal constraints and technological tools? Kaplan and Lübbecke highlight that though the “model suggests an optimal staffing level of only 2,080 agents,” in 2004 the FBI had 2,398 agents “dedicated to counterterrorism.” (Lübbecke incorrectly states that 2004 is the most recent year for which FBI staffing figures are publicly available, though a quick search finds this Department of Justice report, which gives a figure of 3,445 FBI agents “addressing counterterrorism matters” in 2009.) In any case, the juxtaposition of the number Kaplan’s model spits out with the number of FBI counterterrorism agents in 2004 is hardly, as Kaplan characterizes it, “interesting,” let alone of tangible life-saving benefit, as Lübbecke claims.
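To see how little machinery is needed to produce a number like “2,080,” consider the following caricature. Kaplan’s actual model is more elaborate, but any single-variable staffing model of this general shape works the same way: assume detection depends on headcount alone, attach a cost to agents and to undetected plots, and minimize. Every number below is invented for illustration; none comes from Kaplan, Lübbecke, or the FBI.

```python
import math

# A deliberately crude, hypothetical staffing model. The parameters
# (salary, plots, damage, k) are arbitrary made-up values.
def expected_cost(agents, salary=0.1, plots=10.0, damage=50.0, k=0.001):
    """Payroll plus expected damage from undetected plots.

    Detection probability depends on nothing but headcount --
    exactly the oversimplification criticized above.
    """
    p_detect = 1 - math.exp(-k * agents)
    return salary * agents + plots * damage * (1 - p_detect)

# Grid-search the "optimal" staffing level.
best = min(range(1, 10001), key=expected_cost)
```

Run it and out pops a spuriously precise integer (here, around 1,609 agents). Change any of the invented parameters and the “optimal” number shifts accordingly: the precision of the output reflects the arbitrariness of the inputs, not anything about the world.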

The paradox at the heart of the argument Lübbecke and other cheerleaders for big data make is that they claim to place great value on evidence as opposed to intuition. But rather than present analytical evidence for the value of “evidence,” they merely assert that it is tremendously useful and expect us to believe them.

Lübbecke’s examples point to the silliness of big data’s claim to epistemic superiority, but they don’t adequately illustrate the damage that can be done by big data evangelists like Davenport. To understand that damage, one must parse the political economy of data creation and analysis, something I began to do in the earlier piece. In short, using data along the lines Davenport advocates imposes costs on society unequally. As my colleague Seeta Gangadharan has written, “There’s a real threat that the negative effects of algorithmic decision-making will disproportionately burden the poorest and most marginalized among us.”

These are not new fights. Steven Shapin, an historian of science, was writing about the 17th century when he remarked, “it is just when the authority of long-established institutions erodes that the solutions to such questions about knowledge come to have special point and urgency…Method, broadly construed, is the preferred remedy for problems of intellectual disorder.” Blind faith in the superiority of “big data” or “well-designed analytics” does not resolve underlying intellectual discord about how society ought to guard itself against terrorism or structure its economy.

Lübbecke and Davenport seek objective certainty where it is not attainable. They do not seriously wrestle with the limitations of data-driven analysis but merely make a fetish of it.