Friday, December 30, 2011

Authorship of Ron Paul newsletters -- statistical text analysis and Bayesianism

In the midst of all the hullabaloo about Ron Paul's decades-old racist newsletters, and his claim that he "never read that stuff," I ran across an interesting attempt to add some data to the discussion.  Filed under "pointless data exercises" and "politics," blogger Peter Larson has used text analysis to compare his blog, Ron Paul's recent speeches, and the original newsletters.  He calls his results a smoking gun, with a question mark tacked on, and argues that Ron Paul wrote most of the original newsletters.

Before I go on, let me say that this is a really cool application.  Instead of the he-said-she-said debate that's running in the media, this piece brings some actual data to bear on the conversation.  Bravo!

That said, now I'm going to harp on statistics and inference.  The problem with Larson's analysis is that he never addresses the question, "If Ron Paul didn't write the newsletters, who did?"  Without answering that question, and putting some probabilities behind it, it's going to be very hard for text analysis to prove the issue one way or another. (Larson admits this on his blog.) Right now, his analysis proves that between himself and Ron Paul, Paul is much more likely to have written the letters.  Not exactly a smoking gun.

That said, it's interesting data.  For the record, I'm mostly convinced.  In my mind, the statistics are flawed, but they still lend some weight against Paul.  Oddly enough, Larson's finding that several of the letters were probably *not* written by Ron Paul was particularly persuasive.  It feels human and messy, the way I would expect this kind of thing to be.

Final thoughts, mainly intended for my statistically minded friends: yes, this is an unabashedly Bayesian perspective.  I'm demanding priors, which probably can't be specified to anyone's satisfaction.

IMHO, the frequentist approach has even deeper problems.  From a frequentist perspective (which is where Larson's original, and deeply confusing, p-values come from), we use Paul's recent speeches and writings to estimate some parameters of his current text-generating process.  We then compare the newsletters to that process and estimate the probability that the older text was generated by the same process.

Problem: we *know* the old text was not generated by the same process.  It was written (allegedly) by a younger Ron Paul, on different topics, speaking into a different political climate.  Without a broader framework, it's impossible to determine whether the differences are important.  The Bayesian approach provides a direct way of assessing that framework.  The frequentist approach doesn't -- at least not that I can see, without jumping through a lot of hoops -- and in the meantime, it obscures the test that's actually being conducted.
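
To make the "who else could have written it?" point concrete, here's a toy sketch of Bayes' rule over a closed set of candidate authors.  Everything in it is an assumption for illustration: the `posterior` function, the hypothetical "ghostwriter" candidate, the priors, and the log-likelihoods are all made up, and none of it comes from Larson's analysis.

```python
import math

def posterior(priors, log_likelihoods):
    """Bayes' rule over a closed set of candidate authors.

    priors:          dict of author -> prior probability (should sum to 1)
    log_likelihoods: dict of author -> log P(text | author)
    Only authors listed in `priors` are considered.
    """
    # Work in log space and subtract the max to avoid underflow.
    log_joint = dict((a, math.log(priors[a]) + log_likelihoods[a]) for a in priors)
    biggest = max(log_joint.values())
    unnormalized = dict((a, math.exp(v - biggest)) for a, v in log_joint.items())
    total = sum(unnormalized.values())
    return dict((a, u / total) for a, u in unnormalized.items())

# Hypothetical log-likelihoods of the newsletter text under each author's style.
log_lik = {"paul": -1000.0, "larson": -1040.0, "ghostwriter": -1002.0}

# Comparison 1: Paul vs. Larson only (roughly the test actually being run).
two_way = posterior({"paul": 0.5, "larson": 0.5}, log_lik)

# Comparison 2: Paul vs. a plausible ghostwriter, with a prior favoring the ghostwriter.
three_way = posterior({"paul": 0.2, "ghostwriter": 0.8}, log_lik)

print(two_way)
print(three_way)
```

With these invented numbers, the first comparison puts essentially all the probability on Paul, while the second leaves him at roughly 0.65.  The text evidence hasn't changed; only the set of alternatives and the priors have.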

Thursday, December 29, 2011

Job posting on Polmeth: Research Officer in Quantitative Text Analysis

This came across the polmeth mailing list a couple hours ago.  Looks interesting, but the pay isn't exactly competitive for candidates with "a postgraduate degree in computer science, computational linguistics, or possibly a cognate social science discipline."

Job opening:
Research Officer in Quantitative Text Analysis

Duration: 24 months
Start Date: 1 March 2012 or as soon as possible thereafter
Salary: £31,998 – £38,737 p.a. incl.

Applications are invited for the post of Research Officer, to assist with a principal research officer for the European Research Council funded grant Quantitative Text Analysis for the Social Sciences (QUANTESS), working with Professor Kenneth Benoit (PI).

The research officer’s general duties and responsibilities will be to work with text and the computer organization, storage, processing, and analysis of text. These tasks will involve a combination of programming, database work, and statistical computation. The texts will be drawn from social, political, legal, and commercial examples, and will have a primarily social science focus. The research officer will be expected to work with existing tools for text analysis, actively participate in the development of new tools, and participate in the application of these tools to the social scientific analysis of textual data.

The successful applicant will be expected to possess advanced skills and experience with computer programming, especially the ability to use a language used in text processing such as Python; familiarity with SQL; and experience with the R statistical package or the ability to learn R.

The successful candidate should have a postgraduate degree in computer science, computational linguistics, or possibly a cognate social science discipline and have an interest in text analysis and quantitative linguistics, have a knowledge of social science statistics and have worked in a research environment previously.

To apply for this post please go to and select “Visit the ONLINE RECRUITMENT SYSTEM web page”. If you have any queries about applying on the online system, please call 020 7955 6656 or email quoting reference 1223243.

Closing date for receipt of applications is: 31 January 2012 by 23.59 (UK time).



Software design for analytics: A manifesto in alpha

If you take the lean startup ideas of quick iteration, learning, and hypothesis checking seriously, then it makes sense to build your software in a way that lends itself to doing analytics.  Lately, I've been doing a lot of both (software development and analytics), so I've been thinking about how to help them play nice together.

Seems to me that MVP/MVC (or even MVVM, if you're into that kind of thing) are good at the following:
  • User experience
  • Database performance
  • Debugging
However, when doing analytics, my needs are different. I need to extract data from the system in a way that lets me answer useful questions. Instead of thinking about performance, scalability, etc., I'm thinking about independent and dependent variables, categories and units of analysis, and causal inference.  The "system requirements" for this kind of thing are very different.

Having worked with (and built) several different systems at this point, I've realized that some designs make analytics easier, and some make them much, much harder. And nothing in general-purpose guidelines for good software design guarantees good design for analytics.

Since so much of what I do is analytics, I'd like to ferret out some best practices for that kind of development.  I don't have any settled ideas yet, but I thought I'd put some observations on paper.

Some general ideas:
  1. Merges are a pain point. When doing analytics, I spend a large fraction of my time merging and converting data. Seems like there ought to be some good practices/tools to take away some of the pain.
  2. Visualization is also a pain point, but I'm less optimistic about fixing it.  There's a lot of art to good visualization.
  3. Units of analysis might be a good place to begin/focus thinking.  They tend to change less often than variables, and many design issues for archiving, merging, reporting, and hypothesis testing focus on units of analysis.
  4. The most important unit of analysis is probably the user, because most leap-of-faith assumptions center on users and markets, and because people are just plain complicated.  In some situations (e.g. B2B), the unit of analysis might be a group or organization, but even then, users are going to play an important role.
  5. Make it easy to keep track of where the data come from!  Any time you change the
  6. From a statistical perspective, we probably want to assume independence all over the place for simplicity -- but be aware that that's what we're doing!  For instance, it might make sense to assume that sessions are independent, even though they're actually linked across users.
  7. User segmentation seems like an underexploited area.  That is, most UI optimization is done using A/B testing, which optimizes for the "average user."  But in many cases, it could be very useful to try to segment the population into sub-populations, and figure out how their needs are different.  This won't work when we only have short interactions with anonymous users.  But if we have some history or background data (e.g. FB graph info), it could be a very powerful tool.
  8. Corollary: grab user data whenever it's cheap.
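
To make a few of these ideas concrete (merging, units of analysis, segmentation), here's a rough sketch in plain Python.  All the names are hypothetical: `rollup_sessions`, `conversion_by_segment`, the field names, and the sample rows are invented for illustration and don't come from any real system.  It rolls session-level rows up to one row per user, merges in profile data, and reports conversion rates per (segment, variant) cell instead of a single average.

```python
from collections import defaultdict

def rollup_sessions(sessions):
    """Collapse session-level rows to one row per user (the unit of analysis)."""
    users = {}
    for s in sessions:
        row = users.setdefault(
            s["user_id"],
            {"variant": s["variant"], "sessions": 0, "converted": False},
        )
        row["sessions"] += 1
        # A user counts as converted if any of their sessions converted.
        row["converted"] = row["converted"] or s["converted"]
    return users

def conversion_by_segment(users, profiles):
    """Merge user rows with profile data; return rate per (segment, variant)."""
    counts = defaultdict(lambda: [0, 0])  # (segment, variant) -> [converted, total]
    for user_id, row in users.items():
        # Users with no profile data land in an explicit "unknown" segment.
        segment = profiles.get(user_id, {}).get("segment", "unknown")
        cell = counts[(segment, row["variant"])]
        cell[0] += int(row["converted"])
        cell[1] += 1
    return dict((key, c / float(n)) for key, (c, n) in counts.items())

# Made-up sample data: five sessions from four users, profiles for three of them.
sessions = [
    {"user_id": 1, "variant": "A", "converted": False},
    {"user_id": 1, "variant": "A", "converted": True},
    {"user_id": 2, "variant": "B", "converted": False},
    {"user_id": 3, "variant": "A", "converted": False},
    {"user_id": 4, "variant": "B", "converted": True},
]
profiles = {1: {"segment": "student"}, 2: {"segment": "student"}, 3: {"segment": "faculty"}}

users = rollup_sessions(sessions)
rates = conversion_by_segment(users, profiles)
for key in sorted(rates):
    print(key, rates[key])
```

Rolling up to users before computing rates sidesteps the "sessions aren't really independent" problem for this particular question, and the "unknown" segment keeps unmatched users visible rather than silently dropping them in the merge.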

I'll close with questions.  What else belongs in this list?  Are there other people who are thinking about similar issues?  What process and technical solutions could help?  NoSQL and functional programming come to mind, but I haven't thought through the details.

Friday, December 23, 2011

How to download every episode of This American Life

I recently discovered this lovely script by one Sean Furukawa.  (I don't know him.)  The script downloads every episode of This American Life in mp3 format.  This American Life is far and away my favorite radio show -- Ira Glass is consistently the best storyteller on air.

Anyway, I'm intimidated by perl (too hard to read, too many nonalphanumeric characters) so I rewrote the script in python. The first run will take a long-ish time, since it's downloading all 450+ existing episodes of the program. Subsequent runs of the script will be faster, since they only have to download new episodes. Enjoy!

By the way, AFAIK, this type of webcrawling is completely legal.  The content is already streamable from the TAL website; you're just downloading it, er, a little faster than usual.

That said, if you use this script, I'd recommend making a tax-deductible contribution to This American Life -- it's a great program, worthy of support.  The "donate" button is in the upper-right corner of the This American Life webpage.


# Adapted from:
# Translated from perl to python by Abe Gong
# Dec. 2011

import urllib, glob, datetime

def now():
  """Get the current date and time as a string."""
  return datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

def log( S ):
  """Write a line to the log file, and print it for good measure."""
  logfile.write(S + '\n')
  print S

#Start up a log file
logfile = open( 'tal_log.txt', 'a' )

#Load all the episodes that have already been downloaded; keep the filenames in a list
episodes = [ f.split('/')[-1] for f in glob.glob('episodes/*.mp3') ]
#print episodes

#As of today (12/11/2011) there are 452 episodes, so a count up to 500 should last a long while.
for i in range(1,500):

  #Choose the appropriate filename
  filename = str(i)+'.mp3'
  #Add the URL prefix
  url = ''+filename
  #Check to see if the file has already been downloaded
  if filename not in episodes:
    #Log the attempt
    log( now() + '\ttrying\t' + url )
    #Try to download it
    code = urllib.urlopen( url ).getcode()
    if code == 200:
      urllib.urlretrieve( url, filename='episodes/'+filename )
      #Log the result -- success!
      log( now() + '\tsaved\t' + filename )
    else:
      #Log the result -- no such episode (yet)
      log( now() + '\tfile not found' )

Friday, December 16, 2011

More about the world-wide scavenger hunt

I posted a couple days ago on a 90-minute world-wide scavenger hunt that I helped run.  Here's a link to the actual site.  (EDIT: I've taken down the original site.)  Also, I've open-sourced the code on github.

If you're interested in trying Stanley Darpa or a close variant, let me know. The challenge was a lot of fun, and I'd love to run it again sometime.

Tuesday, December 13, 2011

A worldwide scavenger hunt in 90 minutes!

I helped run a really fun, informal social experiment today. When students in Scott Page's undergraduate "intro to complexity" course showed up to class, they were given a challenge: find one person from each of 60 cities around the world, and take a picture of them before the end of class.  Six teams of 15 students had 90 minutes -- and no prior warning -- to compete.

Me, proving that I live in Ann Arbor for Team Bass.
(All the teams were named after mathematicians).

We've called it Stanley Darpa, in homage to Stanley Milgram's small world experiment, and the more recent DARPA red balloon challenge (more here). The winning teams used a combination of social networking ("My sister just moved there!"), crowdsourcing ("I'll ask all my facebook friends if they know anybody in Shanghai"), and desperate dashing around campus ("IS ANYBODY IN THIS CAFETERIA FROM FORT WAYNE?") to get as many pictures as possible.

I helped come up with the concept, but my main role was to develop the software. I built a simple django site where students could quickly upload pictures as they found them. To ramp up the excitement, I also built tools to let teams track scores and compare finds in real time.  These added a strategic dimension to the game: e.g. "Team Markov is leading -- we need to find people from their cities so they won't get so many bonus points!"

The Stanley-Darpa countdown page

In true Mission Control style, we put everything up on a projector screen and watched results roll in over the course of the class.  It was slow at first, but as teams split up and started canvassing, the competition really heated up. In the end, the six teams found people from 47 cities.  The winning team alone found 30. Surprisingly, many of the hardest cities to find were nearby: Youngstown, OH; Gull Lake, MI; Cadillac, MI.

Tomorrow, I'll try to post a link to the actual site, but first I need to scrub the pictures, to make sure that nobody gets their ID stolen.

This was really fun -- I'd love to do it again sometime.

Tuesday, December 6, 2011

Fixing rabbitmq after switching servers

I ran into some backend headaches with django, celery, and rabbitmq over the weekend. The EC2 instance I was using crashed, so I had to switch servers. When I did, rabbitmq-server -- which is the broker behind celery, which is the task queue behind django -- broke.  The broker broke.  And considering that there are still parts of django I'm wrapping my head around, chasing down a system error three levels deep was quite the pain.

Happily, I got some very good help from the rabbitmq-discuss list:

The database RabbitMQ uses is bound to the machine's hostname, so if you copied the database dir to another machine, it won't work.  If this is the case, you have to set up a machine with the same hostname as before and transfer any outstanding messages to the new machine.  If there's nothing important in rabbit, you could just clear everything by removing the RabbitMQ files in /var/lib/rabbitmq.

I deleted everything in /var/lib/rabbitmq/mnesia/rabbit/ and it started up without trouble.  Hooray!

I also got a warning about using outdated software:

1.7.2 is a *very* old version of the broker.  You should really upgrade, as a lot of new features and a lot of bugfixes have gone in since then.

Now, as best I can recall, I originally installed rabbitmq with apt-get, running in Ubuntu 10.04. To fix this, I got the newest .deb from the rabbitMQ site and installed it.  Here are the steps:

sudo apt-get install erlang-nox
sudo apt-get -f install
sudo apt-get install erlang-nox
sudo dpkg -i rabbitmq-server_2.7.0-1_all.deb

Very happy to be out of the woods on this one. I'm not a sysadmin and I don't aspire to be.

Saturday, December 3, 2011

Getting started in git

With some coaxing from my brother, I finally entered the world of git today. Working notes.

Nice how-to for webfaction

I had this problem...

Getting started on github

The inner workings of the distributed system are still a bit of a mystery, but I gather I'm not the only one in that boat.  For now, it works: I can take changes in development and push/pull them to my production server.  Progress!

dev$ git commit -a
dev$ git push origin
prd$ git pull origin master

Friday, December 2, 2011

A totally new model for education.

Scott Page, one of my dissertation advisors and a genuine polymath, is teaching a class on Model Thinking next semester.  The class sounds very nifty, but it's the way that it's presented that blows me away.  I'm convinced we're at a watershed moment.

Scott's class is one of a couple dozen or so follow-up classes to the Stanford AI class that attracted 140,000 students.  They're being delivered using a totally new model of education.
  • They're free.  Yes, free.
  • They're scalable -- any number of students can sign up.
  • They're taught by rock star scientists and professors -- people at the cutting edge of their fields.
  • They're graded. Students submit work (multiple-choice, mostly) and get grades and a certificate of completion at the end of the course*.
This combination is completely new.  There are already plenty of lectures and how-to videos on YouTube (free and scalable), the Teaching Company has been publishing lecture series for quite a while (many of them taught by leading scholars), and the Khan Academy has been experimenting with new ways of deploying content and structuring classes.  But no one has done all these things together.

This is just the latest in a series of innovations that are going to turn education -- public, private, higher, you name it -- upside down.  Why settle for a half-prepped lecture from a busy assistant professor when you can get the same content--better--online for free?  If you're the teacher, why bother to prep the lecture when someone else has already given it?

* Yes, yes. The grading is pretty rudimentary, but it can't be that long until smart people figure out how to do better. It's a problem I'd be interested in working on.

Thursday, December 1, 2011

Production mode in django: lxml, celery, the works.

Firing up production mode in django.  These refs were helpful.