Wednesday, February 20, 2013

Data can't do everything. So what?

*Sigh*  This article again.  The one that says, "Data can't do everything."  This time, David Brooks happens to be the one writing it, but it could have been anybody, really.  Brooks gives a list of things that he feels data does poorly ("context", "big problems", "the social"), and then concludes with this gem:
"This is not to argue that big data isn’t a great tool. It’s just that, like any tool, it’s good at some things and not at others."
Well, duh!

I'm tired of reading the many incarnations of this article, for two reasons.

  1. It's obvious. Good data analysts (and anybody with half a brain) are already aware of these kinds of limitations.
  2. It doesn't move the debate forward. In fact, it clouds the issue.

The debate about data is a debate about scope: "What can and can't be accomplished with data?" This isn't a question that can be resolved using vague generalities.  For example, the following logic (based on one of Brooks' rules of thumb) doesn't work: "Well, building a platform where millions of people can share ideas in real time (e.g. Twitter) is a 'big problem,' so I guess it can't be solved with data.  But convincing my toddler to stop throwing milk at dinner is a 'small problem,' so bring on the statistics!"

If you want to know whether data can help answer a question, you have to look at the structure of the data: What variables are available? What are the units of analysis? How are the data structured across time? Are there any plausible sources of exogenous variation (e.g. instrumental variables or "natural experiments")? These are the right questions to ask. Hazy adjectives like "big" or "social" simply aren't useful.

It's as if Brooks is claiming he can fix your car without opening the hood. "You can fix red SUVs by flushing out the engine." "You can answer big, social questions by relying on values." A real mechanic would get inside the machine and actually see how it works. "Hmm... for this particular big, social question, you have lots of data on X and Y, and a little bit on Z, and this portion was captured as part of an experimental design.  That means we can infer A, but we can't infer B..."

"I still say we're both entitled to our own methods of fixing the car."

Data can't do everything.  Not even close.  But we live in a world swimming in data of increasingly useful types.  It seems reasonable to think that we'll be able to do more with that data once we figure out what it's good for.  And we can't do that by burning the strawman of omnipotent data, or by trading in mushy platitudes.  We need to get specific about real questions and data structure.


  1. It sounds like you and Brooks actually agree. He likes data but recognizes its constraints, as do you. But you sound bitter in this post--bitter that people write articles about the constraints of data and analysis.

  2. @Eric: Yes and no. I agree that data can't do everything, but I disagree with the boundaries for usefulness that Brooks defines.

    "Bitter" isn't really the right word for how I feel (although on second reading, I can see how you'd think that). "Exasperated" is a better descriptor. Since my professional identity is tied up with making good use of data, I find this kind of Monday-morning quarterbacking silly. On good days, I find it amusing. On bad days, it's just frustrating.