compSocSci: Design patterns for data-centric software

Monday, June 4, 2012

Design patterns for data-centric software

I wrote a few days ago about software design patterns, including the thought that we're going to discover new patterns for data-centric software. Let me unpack that concept.

First, by data-centric software, I don't mean software intended for data analysis (e.g. R, Excel, or Google Charts). I mean any software that collects and/or responds to data in the course of doing whatever else it does.

Web analytics are a great example of this. The primary purpose of a web page is to serve content. But at the same time, it's easy to track pageviews and traffic. Compared to an untracked web site, a site instrumented with Google Analytics is more data-centric, because it's generating data in the background.
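As a rough illustration of "generating data in the background," here's a minimal sketch in Python. It is not how any real analytics product works; the `tracked` decorator, the in-memory `pageviews` counter, and the `serve_page` handler are all hypothetical names made up for this example.

```python
from collections import Counter
from functools import wraps

# Hypothetical in-memory store; a real site would send this to a log or database.
pageviews = Counter()


def tracked(handler):
    """Wrap a page handler so every call is counted as a pageview."""
    @wraps(handler)
    def wrapper(path):
        pageviews[path] += 1   # the data-centric side effect
        return handler(path)   # the primary purpose: serve content
    return wrapper


@tracked
def serve_page(path):
    """Stand-in for whatever actually renders the page."""
    return f"<html><body>Content for {path}</body></html>"
```

Calling `serve_page("/about")` still returns the page as before, but `pageviews` now accumulates a count per path that `pageviews.most_common()` can rank.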

As I read it, the original design patterns are intended mainly to minimize long-term development costs. The key question is "How should code be structured to make it easy to read, debug, maintain, extend, etc.?" It's all about saving developers' time in the long run.*

Five years after the original set of design patterns was popularized, another book was published, focusing on design patterns for distributed software. This time, the key questions expanded to include bandwidth and concurrency: "How should we structure code to make the best use of distributed computing resources?"

I think we're due for another expansion, because data-centric code introduces another optimization target: useful information.** Just as the list of patterns expanded to deal with networking and multiprocessing, it will expand again as data processing and analytics become integral to software design.***

Off the top of my head, here's a quick list of data-centric patterns.

A/B testing
Funnel analysis
Recommender systems (very broad category!)
Top hits (most visited, emailed, etc.)
Automatic bug reports
Likes, +1s, Retweets

This list isn't complete, and it's clear that best practice is still evolving. For example, A/B testing has been industry-standard for a long time, but I recently read a good argument that a multi-armed bandit algorithm is better than A/B testing, because it gathers all the same information and also integrates that feedback directly into the site design. It's a very natural extension of, and improvement over, an older data-centric design pattern. I'm sure that many other such improvements are possible.
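To make the bandit idea concrete, here's a minimal sketch of epsilon-greedy, one of the simplest bandit strategies (not necessarily the one from the argument I read). The class name, the variant labels, the 10% exploration rate, and the simulated conversion rates are all assumptions chosen for illustration.

```python
import random


class EpsilonGreedyBandit:
    """Hypothetical epsilon-greedy chooser over page variants."""

    def __init__(self, variants, epsilon=0.1, seed=None):
        self.variants = list(variants)
        self.epsilon = epsilon            # fraction of traffic used to explore
        self.rng = random.Random(seed)
        self.shows = {v: 0 for v in self.variants}
        self.conversions = {v: 0 for v in self.variants}

    def rate(self, variant):
        """Observed conversion rate so far (0.0 if never shown)."""
        shows = self.shows[variant]
        return self.conversions[variant] / shows if shows else 0.0

    def choose(self):
        """Explore a random variant with probability epsilon, else exploit the best."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.variants)
        return max(self.variants, key=self.rate)

    def record(self, variant, converted):
        """Feed the outcome back in, which shifts future choices."""
        self.shows[variant] += 1
        if converted:
            self.conversions[variant] += 1


def simulate(true_rates, visitors=5000, seed=0):
    """Run simulated visitors against made-up true conversion rates."""
    bandit = EpsilonGreedyBandit(true_rates, epsilon=0.1, seed=seed)
    world = random.Random(seed + 1)       # separate RNG for visitor behavior
    for _ in range(visitors):
        variant = bandit.choose()
        bandit.record(variant, world.random() < true_rates[variant])
    return bandit
```

Running something like `simulate({"A": 0.05, "B": 0.15})` and inspecting `shows` and `conversions` gives the same per-variant counts an A/B test would report. The difference is that `choose()` consults those counts on every visit, so the better-performing variant tends to receive more traffic while the experiment is still running.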

Anyway, I think it's still too early to try to write a comprehensive list. But I'd still like to expand this list to cover as many cases as possible. What else belongs here?

*A few of the patterns address things like limited memory and processing power, but they're the exceptions.

**Defining "useful" opens up a whole new can of worms, which I won't get into here.

***This relates back to the concept that I've written about before: software design for analytics.
