compSocSci: June 2012

Friday, June 29, 2012

Word cloud of Knight News Challenge Data proposals

Last night, I scraped the ~800 "data" proposals from the Knight News Challenge and turned them into word soup*. As Mike says: Sorry, science! Still, you get a sense of the themes shared across the proposals.

I'm excited about the contest, and the (realistically, slim) chance that our civil-o-meter proposal will get funded. This is a really nifty time to be working in this area.

If you like the idea of holding politicians and newsmakers to a fair and accurate standard for civility, please like us on tumblr, or tweet about us using the #newschallenge hashtag.

*I used Python for the scraping and R for the very lightweight NLP. The layout is by Wordle.

Thursday, June 21, 2012

A shameless plug for a worthy cause

Please "like" my proposal for a political civil-o-meter here
If you don't have a tumblr account already,
you'll need to take two minutes and create one.

Details
I've just put in an application for funding through the Knight Foundation's civic media news challenge. They want to "accelerate media innovation by funding breakthrough ideas in news and information." This round in the grant competition focuses on the role of data in civic engagement -- right up my alley.

To meet that challenge, I'm proposing a political civil-o-meter -- a crowdsourced site to generate fair and accurate civility ratings for political speech (think campaign ads, newspaper op-eds, and blog posts). Most of the tools to build such a site will already be developed as part of my dissertation; this grant would help me make them available to the public. This site would provide a really cool way to explore civility in public discourse, and hold public officials and media personalities accountable for the civility (or lack thereof) of what they say.

I'd appreciate it if you'd head on over to the Knight Foundation's tumblr blog and "like" the civil-o-meter proposal. (If you don't have a tumblr account already, you'll need to create one -- a quick, painless, and spam-free process.) Even if you don't like the proposal or just don't get it, you can ask clarifying questions in the comments section, and I'll do my best to explain things better. Awards aren't made strictly on the basis of voting, but I figure a little extra attention in this category can't hurt.

Thanks!

Monday, June 4, 2012

Design patterns for data-centric software

I wrote a few days ago about software design patterns, including the thought that we're going to discover new patterns for data-centric software. Let me unpack that concept.

First, by data-centric software, I don't mean software intended for data analysis (e.g., R, Excel, or Google Charts). I mean any software that collects and/or responds to data in the course of doing whatever else it does.

Web analytics are a great example of this. The primary purpose of a web page is to serve content. But at the same time, it's easy to track pageviews and traffic. Compared to an untracked web site, a site instrumented with Google Analytics is more data-centric, because it's generating data in the background.
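To make the distinction concrete, here is a minimal sketch of an "instrumented" page handler. It is not how Google Analytics works; the names (`PAGES`, `pageviews`, `serve`) and the two example pages are made up for illustration. The point is only that the function still does its main job (serving content) while quietly generating data on the side.

```python
from collections import Counter

# Hypothetical site content and an in-memory pageview log (illustration only).
PAGES = {"/": "Welcome!", "/about": "About this blog"}
pageviews = Counter()

def serve(path):
    """Serve a page, recording a pageview as a side effect."""
    pageviews[path] += 1  # the data-centric part: one extra line
    return PAGES.get(path, "404: not found")

# Simulate a little traffic.
serve("/")
serve("/about")
serve("/")

# The same log can later answer questions like "which page is most visited?"
top_page, hits = pageviews.most_common(1)[0]
```

Note that this sketch also counts requests for missing pages, which may or may not be what you want in a real log.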

As I read it, the original design patterns are intended mainly to minimize long-term development costs. The key question is "How should code be structured to make it easy to read, debug, maintain, extend, etc.?" It's all about saving developers' time in the long run.*

Five years after the original set of design patterns was popularized, another book was published, focusing on design patterns for distributed software. This time, the key questions expanded to include bandwidth and concurrency: "How should we structure code to make the best use of distributed computing resources?"

I think we're due for another expansion, because data-centric code introduces another optimization target: useful information.** Just as the list of patterns expanded to deal with networking and multiprocessing, it will expand again as data processing and analytics become integral to software design.***

Off the top of my head, here's a quick list of data-centric patterns.

A/B testing
Funnel analysis
Recommender systems (very broad category!)
Top hits (most visited, emailed, etc.)
Automatic bug reports
Likes, +1s, Retweets

This list isn't complete, and it's clear that best practice is still evolving. For example, A/B testing has been industry-standard for a long time, but I recently read a good argument that a multi-armed bandit algorithm is better than A/B testing, because it gathers all the same information and also feeds that information directly back into the site design. It's a very natural extension and improvement over an older data-centric design pattern. I'm sure that many other such improvements are possible.
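Here is a rough sketch of the bandit idea, assuming an epsilon-greedy strategy. That is only one of several bandit algorithms, and not necessarily the one in the argument I read. The class name, the arm labels, and the 10% exploration rate are all assumptions chosen for illustration.

```python
import random

class EpsilonGreedyBandit:
    """Pick among site variants, favoring whichever has converted best so far."""

    def __init__(self, arms, epsilon=0.1, rng=None):
        self.arms = list(arms)
        self.epsilon = epsilon            # fraction of traffic spent exploring
        self.rng = rng or random.Random()
        self.pulls = {arm: 0 for arm in self.arms}
        self.successes = {arm: 0 for arm in self.arms}

    def rate(self, arm):
        """Observed conversion rate for an arm (0.0 if never shown)."""
        pulls = self.pulls[arm]
        return self.successes[arm] / pulls if pulls else 0.0

    def choose(self):
        """Return the variant to show the next visitor."""
        # Show every variant at least once before trusting the rates.
        untried = [arm for arm in self.arms if self.pulls[arm] == 0]
        if untried:
            return untried[0]
        # Explore: occasionally show a random variant.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        # Exploit: otherwise show the best-performing variant so far.
        return max(self.arms, key=self.rate)

    def record(self, arm, converted):
        """Log what happened after showing a variant."""
        self.pulls[arm] += 1
        self.successes[arm] += int(converted)

# Usage: ask which variant to show, then report back the outcome.
bandit = EpsilonGreedyBandit(["A", "B"], epsilon=0.1)
variant = bandit.choose()
bandit.record(variant, converted=True)
```

The difference from a classic A/B test is in `choose`: instead of splitting traffic evenly until the experiment ends, the site shifts most visitors toward the current winner while still collecting data on the alternatives.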

Anyway, I think it's still too early to try to write a comprehensive list. But I'd still like to expand this list to cover as many cases as possible. What else belongs here?

*A few of the patterns address things like limited memory and processing power, but they're the exceptions.

** Defining useful opens up a whole new can of worms, which I won't get into here.

***This relates back to the concept that I've written about before: software design for analytics.

Friday, June 1, 2012

Software design patterns

Following a tip from an experienced software developer, I've been reading up on software design patterns: flyweights, factories, facades, etc. These are general patterns for object-oriented programming that show up again and again. The original canon included 23 patterns; that list has since expanded to include patterns for networking and multiprocessing.

These design patterns remind me of Go proverbs -- high-level heuristics for better strategy, sometimes contradictory. Knowing them can be extremely helpful, but it's no guarantee that you can deploy them correctly. (Here's a good list of common Go proverbs.)

Anyway, reading the original Design Patterns book, I've had three main reactions:

1. Data-centric software development is going to discover its own list of software design patterns.

2. There are patterns for research design, just like there are patterns for software design.

3. I already know most of the software patterns -- yay!*

Since I just can't sleep tonight, I figured I'd queue up a few blog posts talking about the first two. Look for those in a couple days.

*Given my very ad hoc background in software design, I've been pleasantly surprised to find that most of the software design patterns are already familiar. For example, Python is already very good with iterators and decorators. Working with web frameworks has taught me a lot about factories. And many of the others are much less important in Python because objects are dynamically typed. Anyway, it's nice to discover that I've picked a lot of this up by osmosis. (Pat self on the back.)
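As a small illustration of what "very good with iterators and decorators" means, here is a sketch using only the standard library. The function names (`count_calls`, `countdown`, `square`) are made up for the example. Note that Python's `@decorator` syntax wraps functions, which is related to, but not quite the same as, the object-wrapping Decorator pattern in the book.

```python
import functools

def count_calls(func):
    """Decorator: wrap a function so it keeps track of how often it runs."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

def countdown(start):
    """Iterator (written as a generator): yields start, start - 1, ..., 1."""
    while start > 0:
        yield start
        start -= 1

@count_calls
def square(x):
    return x * x

# The for-clause drives the iterator; the decorator counts the calls.
squares = [square(n) for n in countdown(3)]
```

Both patterns come nearly for free here: `yield` turns an ordinary function into an iterator, and `@count_calls` adds behavior to `square` without touching its body.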