Monday, March 11, 2013

Definitions for data science

Since I'm rebooting this blog, this seems like a good moment to lay out a framework for data science.  I'll tackle definitions now, and process next time.

Pinning down scope and definitions is important for data science, because the field is growing rapidly, with a sense that the sky is the limit.  Without priorities and a grasp of what data science isn't, we run the risk of overreaching, wasting our time, and leaving everyone disappointed.  I won't claim that my definition is the only definition, or even the best definition.  But it works for me, and it has some virtues worth discussing.

Essentially, I think of data science as "answering questions with data," or more precisely, "providing empirical answers to well-posed questions." By empirical, I mean "based on information that all participants can observe in common."  By well-posed, I mean "admitting a definitive answer": once we see the right answer, we can all agree that it's the right one. In the language of formal logic, a well-posed question is one that admits a deductively valid conclusion.  So, data = empirical, science = questions.

The main difference between my definition and most of the others floating around (e.g. here) is that I focus on the goal of data science (answering questions), not the tools or methods for getting there (e.g. data munging, predictive analytics, writing mapReduce queries).

I find that defining data science by goals instead of tools adds clarity, for two reasons.  First, goals usually provide a more defined boundary than tools.  Almost none of the tools of data science are unique to data science. Software engineers do lots of "hacking"; forecasters do lots of statistical modeling; DB admins use plenty of NoSQL.  None of these things on its own provides a bright line for determining who is a data scientist or not, so we have to take a fuzzy average over lots of categories, and end up with a large gray area of jobs that are "kind of" data science. In contrast, it's usually pretty clear if your goal is answering questions (a.k.a. "providing insight," "running analytics," "informing decisions") or not.

Second, focusing on goals lets us differentiate approaches by effectiveness. Without a clear understanding of the job of data science, it's impossible to tell the difference between professionals who choose the right tools to get the job done, and bandwaggoners who are just playing with every shiny new toy.  Since the bandwagoning has started already, I think we'll be well served to differentiate between effective data scientists and the tools they use.

Analogy: I'm in the hospital for an appendectomy as I write this, going under the knife in a few hours.  I find it much more comforting to think of the doctors in terms of goals ("People who help you regain their health") than tools and methods ("People who cut holes in you with scalpels and wires").  Similarly, I'd be much happier hiring a data scientist who is good at answering questions, than one who is good with mongoDB, or Bayesian models.  Having the tools is necessary but not sufficient to accomplish the goals.

With those ideas on the table, here are some comparisons I'd like to explore in the future:

  • How is data science different from "big data"?
  • How is data science different from statistics?
  • How is data science different from data analysis?
  • How is data science different from science in general?
  • How is data science different from software engineering?

What do you think?  Discuss.

No comments:

Post a Comment