Thursday, December 29, 2011

Software design for analytics: A manifesto in alpha

If you take the lean startup ideas of quick iteration, learning, and hypothesis checking seriously, then it makes sense to build your software in a way that lends itself to doing analytics.  Lately, I've been doing a lot of both (software development and analytics), so I've been thinking about how to help them play nice together.

Seems to me that MVP/MVC (or even MVVM, if you're into that kind of thing) are good at the following:
  • User experience
  • Database performance
  • Debugging
However, when doing analytics, my needs are different. I need to extract data from the system in a way that lets me answer useful questions. Instead of thinking about performance, scalability, etc., I'm thinking about independent and dependent variables, categories and units of analysis, and causal inference.  The "system requirements" for that kind of work look nothing like the usual ones.

Having worked with (and built) several different systems at this point, I've realized that some designs make analytics easier, and some make them much, much harder. And nothing in general-purpose guidelines for good software design guarantees good design for analytics.

Since so much of what I do is analytics, I'd like to ferret out some best practices for that kind of development.  I don't have any settled ideas yet, but I thought I'd put some observations on paper.

Some general ideas:
  1. Merges are a pain point. When doing analytics, I spend a large fraction of my time merging and converting data. Seems like there ought to be some good practices/tools to take away some of the pain (see the first sketch after this list).
  2. Visualization is also a pain point, but I'm less optimistic about fixing it.  There's a lot of art to good visualization.
  3. Units of analysis might be a good place to focus design thinking.  They tend to change less often than variables, and many design issues for archiving, merging, reporting, and hypothesis testing center on units of analysis.  (The merge sketch after this list is organized around this idea.)
  4. The most important unit of analysis is probably the user, because most leap-of-faith assumptions center on users and markets, and because people are just plain complicated.  In some situations (e.g. B2B), the unit of analysis might be a group or organization, but even then, users are going to play an important role.
  5. Make it easy to keep track of where the data come from!  Any time you change the code that generates or transforms data, record it, so that any result can be traced back to the code version and inputs that produced it (see the provenance sketch after this list).
  6. From a statistical perspective, we probably want to assume independence all over the place for simplicity -- but be aware that that's what we're doing!  For instance, it might make sense to treat sessions as independent, even though sessions from the same user are actually linked.  (The clustered-errors sketch after this list shows what that assumption costs.)
  7. User segmentation seems like an underexploited area.  That is, most UI optimization is done using A-B testing, which optimizes for the "average user."  But in many cases, it could be very useful to try to segment the population into sub-populations, and figure out how their needs differ.  This won't work when we only have short interactions with anonymous users.  But if we have some history or background data (e.g. FB graph info), it could be a very powerful tool (a toy segmentation sketch follows the list).
  8. Corollary: grab user data whenever it's cheap.
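
To make item 1 (and item 3) concrete, here's a minimal sketch of the kind of merge I end up writing constantly, in Python with pandas.  The tables and column names are invented; the point is the shape of the work: aggregate event-level data up to the unit of analysis, then merge.

```python
import pandas as pd

# Hypothetical event-level table: one row per session.
sessions = pd.DataFrame({
    "user_id":   [1, 1, 2, 3],
    "duration":  [34.0, 12.5, 80.2, 5.1],   # seconds
    "converted": [0, 1, 0, 0],
})

# Hypothetical user-level table: one row per user (the unit of analysis).
users = pd.DataFrame({
    "user_id": [1, 2, 3],
    "segment": ["trial", "paid", "trial"],
})

# Aggregate sessions up to the unit of analysis *before* merging,
# so the merged table stays one row per user.
per_user = sessions.groupby("user_id").agg(
    n_sessions=("duration", "size"),
    total_duration=("duration", "sum"),
    ever_converted=("converted", "max"),
).reset_index()

# validate= makes the merge fail loudly if the keys aren't what we assumed.
merged = users.merge(per_user, on="user_id", how="left",
                     validate="one_to_one")
print(merged)
```

The validate argument is the kind of cheap guard I'd like more of: it catches silent many-to-many blowups, which in my experience is where most of the merge pain hides.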
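For item 5, one low-tech option -- a sketch under made-up field names, not a prescription -- is to stamp every exported dataset with enough metadata to trace it back to the code and inputs that produced it:

```python
import datetime
import hashlib
import json

def with_provenance(records, source, code_version):
    """Wrap an exported dataset with metadata saying where it came from."""
    payload = json.dumps(records, sort_keys=True).encode("utf-8")
    return {
        "data": records,
        "provenance": {
            "source": source,                  # e.g. which table or log
            "code_version": code_version,      # e.g. a git commit hash
            "extracted_at": datetime.datetime.utcnow().isoformat(),
            "sha256": hashlib.sha256(payload).hexdigest(),  # detect silent edits
        },
    }

export = with_provenance(
    [{"user_id": 1, "ever_converted": True}],
    source="sessions_2011_12",     # hypothetical source name
    code_version="abc1234",        # hypothetical commit hash
)
print(json.dumps(export, indent=2))
```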
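Item 6's caveat has a practical consequence: if you pretend sessions are independent when several come from the same user, your standard errors come out too small.  A sketch with synthetic data, using statsmodels' cluster-robust covariance (the effect size and data-generating process here are made up):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Synthetic data: 200 users, a few sessions each, with a per-user
# random effect so sessions from the same user are correlated.
n_users = 200
sessions_per_user = rng.integers(1, 6, size=n_users)
user_ids = np.repeat(np.arange(n_users), sessions_per_user)
user_effect = rng.normal(0, 1.0, size=n_users)[user_ids]

x = rng.normal(size=user_ids.size)        # e.g. some per-session exposure
y = 0.3 * x + user_effect + rng.normal(size=user_ids.size)

X = sm.add_constant(x)
naive = sm.OLS(y, X).fit()                # treats sessions as independent
clustered = sm.OLS(y, X).fit(cov_type="cluster",
                             cov_kwds={"groups": user_ids})

print("naive SE:    ", naive.bse[1])
print("clustered SE:", clustered.bse[1])  # typically larger
```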
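And for item 7, a toy sketch of what segmentation might look like beyond A-B averages: cluster users on a couple of behavioral features (scikit-learn's KMeans here; the features, their distributions, and the cluster count are all placeholders):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Hypothetical per-user features: visits/week and mean session minutes.
features = np.vstack([
    rng.normal([2, 30], [0.5, 5], size=(50, 2)),   # casual, long sessions
    rng.normal([10, 5], [2.0, 1], size=(50, 2)),   # frequent skimmers
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)
for label in range(2):
    segment = features[kmeans.labels_ == label]
    print(f"segment {label}: n={len(segment)}, "
          f"mean visits/week={segment[:, 0].mean():.1f}, "
          f"mean session min={segment[:, 1].mean():.1f}")
```

The point isn't the clustering algorithm; it's that once users are segmented, you can ask whether the "average user" your A-B test optimized for actually exists.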

I'll close with questions.  What else belongs in this list?  Are there other people who are thinking about similar issues?  What process and technical solutions could help?  NoSQL and functional programming come to mind, but I haven't thought through the details.
