Wednesday, February 20, 2013

Data can't do everything. So what?


*Sigh*  This article again.  The one that says, "Data can't do everything."  This time, David Brooks happens to be the one writing it, but it could have been anybody, really.  Brooks gives a list of things that he feels data does poorly ("context", "big problems", "the social"), and then concludes with this gem:
"This is not to argue that big data isn’t a great tool. It’s just that, like any tool, it’s good at some things and not at others."
Well, duh!

I'm tired of reading the many incarnations of this article, for two reasons.

  1. It's obvious. Good data analysts (and anybody with half a brain) are already aware of these kinds of limitations.
  2. It doesn't move the debate forward. In fact, it clouds the issue.
The debate about data is a debate about scope: "What can and can't be accomplished with data?" This isn't a question that can be resolved using vague generalities.  For example, the following logic (based on one of Brooks' rules of thumb) doesn't work: "Well, building a platform where millions of people can share ideas in real time (e.g. Twitter) is a 'big problem,' so I guess it can't be solved with data.  But convincing my toddler to stop throwing milk at dinner is a 'small problem,' so bring on the statistics!"

If you want to know whether data can help answer a question, you have to look at the structure of the data: What variables are available? What are the units of analysis? How are the data structured across time? Are there any plausible sources of exogenous variation (e.g. instrumental variables or "natural experiments")? These are the right questions to ask. Hazy adjectives like "big" or "social" simply aren't useful.
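To make those questions a bit more concrete, here is a minimal Python sketch of what "looking at the structure" might mean for a small table of records. Everything in it is hypothetical and made up for illustration: the `describe_structure` helper, the `user`/`week`/`steps` columns, and the toy rows. It only covers the first three questions (variables, units of analysis, structure across time). Whether there is any exogenous variation depends on how the data was generated, and no summary of the table itself can tell you that.

```python
from collections import defaultdict


def describe_structure(rows, unit_key, time_key):
    """Summarize variables, units of analysis, and time structure.

    `rows` is a list of dicts; `unit_key` names the field that identifies
    the unit of analysis, `time_key` the field that identifies the period.
    """
    # What variables are available?
    variables = sorted({name for row in rows for name in row})

    # What are the units of analysis, and when was each one observed?
    periods_by_unit = defaultdict(set)
    for row in rows:
        periods_by_unit[row[unit_key]].add(row[time_key])

    # How is the data structured across time?
    all_periods = set().union(*periods_by_unit.values()) if periods_by_unit else set()
    return {
        "variables": variables,
        "n_units": len(periods_by_unit),
        "n_periods": len(all_periods),
        # True if at least one unit is observed in more than one period.
        "repeated_measures": any(len(p) > 1 for p in periods_by_unit.values()),
        # True if every unit is observed in every period.
        "balanced": all(p == all_periods for p in periods_by_unit.values()),
    }


# Hypothetical toy data: weekly step counts for two users.
rows = [
    {"user": "a", "week": 1, "steps": 4000},
    {"user": "a", "week": 2, "steps": 5200},
    {"user": "b", "week": 1, "steps": 7000},
]

summary = describe_structure(rows, unit_key="user", time_key="week")
print(summary)
```

On the toy rows, the summary reports two units, two periods, and repeated measures for only one of the users. That already tells you more about what can be inferred than any label like "big" or "small" would.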

It's as if Brooks is claiming he can fix your car without opening the hood. "You can fix red SUVs by flushing out the engine." "You can answer big, social questions by relying on values." A real mechanic would get inside the machine and actually see how it works. "Hmm... for this particular big, social question, you have lots of data on X and Y, and a little bit on Z, and this portion was captured as part of an experimental design.  That means we can infer A, but we can't infer B..."

"I still say we're both entitled to our own methods of fixing the car."

Data can't do everything.  Not even close.  But we live in a world swimming in data of increasingly useful types.  It seems reasonable to think that we'll be able to do more with that data once we figure out what it's good for.  And we can't do that by burning the strawman of omnipotent data, or by trading in mushy platitudes.  We need to get specific about real questions and data structure.

Tuesday, February 5, 2013

...and we're back!


I'm back! Life took a sudden turn this summer: I went to the Bay Area for a wedding, lined up some informational interviews around data science, and found myself hired by a fantastic little startup, which then got acquired by a fantastic big startup.  I spent the summer working feverishly on my dissertation, then moved out to San Francisco at the beginning of December. It's been an awesome and surprising ride: like crashing a bicycle into a pool full of bacon.



Up until now, I couldn't tell the story, because the acquisition of Massive Health (the fantastic little startup) by Jawbone (the fantastic big startup) wasn't finalized or public until yesterday. Also, I was super busy.

With these issues resolved, I mean to reopen this blog. The focus is still data, especially opportunities for better living through data, and the day-to-day work of professional data science.

Really, there are two questions I expect to come back to over and over:
  • What is data science, how is it practiced, and how should it be practiced?
  • How can personal data make life better for people?  Like me, for example.
One thing I won't write much about is products and data systems at Jawbone. The company is doing awesome, forward-looking R&D in several areas, but we are supposed to keep a lid on it until the proper moment. Because of that, I'll focus more on process, possibilities, introspection, and nifty stuff popping up around data science writ large.

If you have thoughts, questions, or ideas on these topics, please engage! The world is full of data, and we're still learning how to make it useful. I'm looking forward to the conversation.  Cheers!