Friday, December 30, 2011
Before I go on, let me say that this is a really cool application. Instead of the he-said-she-said debate that's running in the media, this piece brings some actual data to bear on the conversation. Bravo!
That said, now I'm going to harp on statistics and inference. The problem with Larson's analysis is that he never addresses the question, "If Ron Paul didn't write the newsletters, who did?" Without answering that question, and putting some probabilities behind it, it's going to be very hard for text analysis to settle the issue one way or the other. (Larson admits this on his blog.) Right now, his analysis shows that between himself and Ron Paul, Paul is much more likely to have written the letters. Not exactly a smoking gun.
That said, it's interesting data. For the record, I'm mostly convinced. In my mind, the statistics are flawed, but they still lend some weight against Paul. Oddly enough, Larson's finding that several of the letters were probably *not* written by Ron Paul was particularly persuasive. It feels human and messy, the way I would expect this kind of thing to be.
Final thoughts, mainly intended for my statistically minded friends: yes, this is an unabashedly Bayesian perspective. I'm demanding priors, which probably can't be specified to anyone's satisfaction.
IMHO, the frequentist approach has even deeper problems. From a frequentist perspective (which is where Larson's original, and deeply confusing, p-values come from), we use Paul's recent speeches and writings to estimate some parameters of his current text-generating process. We then compare the newsletters to that process and estimate the probability that the older text was generated by the same process.
Problem: we *know* the old text was not generated by the same process. It was written (allegedly) by a younger Ron Paul, on different topics, speaking into a different political climate. Without a broader framework, it's impossible to determine whether the differences are important. The Bayesian approach provides a direct way of assessing that framework. The frequentist approach doesn't -- at least not that I can see, without jumping through a lot of hoops -- and in the meantime, it obscures the test that's actually being conducted.
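To make the Bayesian point concrete, here's a minimal, hypothetical sketch of authorship comparison with explicit priors. It uses a toy unigram model with add-one smoothing -- nothing like the feature set Larson actually used -- and the corpora are made up. The point is simply that the answer moves with the prior, which is exactly the thing that has to be specified:

```python
import math
from collections import Counter

def word_counts(text):
    """Crude tokenizer: lowercase, split on whitespace."""
    return Counter(text.lower().split())

def log_likelihood(text, corpus, alpha=1.0):
    """Log P(text | author), from a smoothed unigram model of the author's known corpus."""
    counts = word_counts(corpus)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 bucket for unseen words
    return sum(n * math.log((counts[w] + alpha) / (total + alpha * vocab))
               for w, n in word_counts(text).items())

def posterior(text, corpora, priors):
    """P(author | text) for each candidate author, given explicit priors."""
    log_post = {a: math.log(priors[a]) + log_likelihood(text, corpora[a])
                for a in corpora}
    m = max(log_post.values())
    unnorm = {a: math.exp(lp - m) for a, lp in log_post.items()}
    z = sum(unnorm.values())
    return {a: p / z for a, p in unnorm.items()}

# Made-up candidate corpora -- purely illustrative.
corpora = {'A': 'the market is free and the state is not',
           'B': 'we stand together against injustice in the state'}
text = 'the market is free'
print(posterior(text, corpora, {'A': 0.5, 'B': 0.5}))
print(posterior(text, corpora, {'A': 0.1, 'B': 0.9}))
```

The likelihood term is the part Larson computed; the prior -- "if not Paul, then who, and how likely?" -- is the part nobody has pinned down.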
Thursday, December 29, 2011
Research Officer in Quantitative Text Analysis
Duration: 24 months
Start Date: 1 March 2012 or as soon as possible thereafter
Salary: £31,998 – £38,737 p.a. incl.
Applications are invited for the post of Research Officer on the European Research Council funded grant Quantitative Text Analysis for the Social Sciences (QUANTESS), working with Professor Kenneth Benoit (PI).
The research officer’s general duties and responsibilities will be to work with text: the computational organization, storage, processing, and analysis of textual data. These tasks will involve a combination of programming, database work, and statistical computation. The texts will be drawn from social, political, legal, and commercial examples, and will have a primarily social science focus. The research officer will be expected to work with existing tools for text analysis, actively participate in the development of new tools, and participate in the application of these tools to the social scientific analysis of textual data.
The successful applicant will be expected to possess advanced skills and experience with computer programming, especially the ability to use a language suited to text processing such as Python; familiarity with SQL; and experience with the R statistical package or the ability to learn R.
The successful candidate should have a postgraduate degree in computer science, computational linguistics, or possibly a cognate social science discipline; have an interest in text analysis and quantitative linguistics; have a knowledge of social science statistics; and have worked in a research environment previously.
To apply for this post please go to http://www.lse.ac.uk/JobsatLSE and select “Visit the ONLINE RECRUITMENT SYSTEM web page”. If you have any queries about applying on the online system, please call 020 7955 6656 or email firstname.lastname@example.org quoting reference 1223243.
Closing date for receipt of applications is: 31 January 2012 by 23.59 (UK time).
Seems to me that MVP/MVC (or even MVVM, if you're into that kind of thing) are good at the following:
- User experience
- Database performance
Having worked with (and built) several different systems at this point, I've realized that some designs make analytics easier, and some make them much, much harder. And nothing in general-purpose guidelines for good software design guarantees good design for analytics.
Since so much of what I do is analytics, I'd like to ferret out some best practices for that kind of development. I don't have any settled ideas yet, but I thought I'd put some observations on paper.
Some general ideas:
- Merges are a pain point. When doing analytics, I spend a large fraction of my time merging and converting data. Seems like there ought to be some good practices/tools to take away some of the pain.
- Visualization is also a pain point, but I'm less optimistic about fixing it. There's a lot of art to good visualization.
- Units of analysis might be a good place to begin/focus thinking. They tend to change less often than variables, and many design issues for archiving, merging, reporting, and hypothesis testing focus on units of analysis.
- The most important unit of analysis is probably the user, because most leap-of-faith assumptions center on users and markets, and because people are just plain complicated. In some situations (e.g. B2B), the unit of analysis might be a group or organization, but even then, users are going to play an important role.
- Make it easy to keep track of where the data come from! Any time you change the
- From a statistical perspective, we probably want to assume independence all over the place for simplicity -- but be aware that that's what we're doing! For instance, it might make sense to assume that sessions are independent, even though they're actually linked across users.
- User segmentation seems like an underexploited area. That is, most UI optimization is done using A-B testing, which optimizes for the "average user." But in many cases, it could be very useful to try to segment the population into sub-populations, and figure out how their needs are different. This won't work when we only have short interactions with anonymous users. But if we have some history or background data (e.g. FB graph info), it could be a very powerful tool.
- Corollary: grab user data whenever it's cheap.
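As one small stab at the merge pain point above, a merge helper could tag every output row with its provenance, so it's always clear where each field came from. This is a toy sketch -- the function, table names, and fields are all hypothetical:

```python
def merge_with_provenance(left, right, key, left_name='left', right_name='right'):
    """Inner-join two lists of dicts on `key`, recording which source supplied each field."""
    right_index = {}
    for row in right:
        right_index.setdefault(row[key], []).append(row)
    merged = []
    for lrow in left:
        for rrow in right_index.get(lrow[key], []):
            out = dict(lrow)
            out.update(rrow)
            # Provenance: map each field to the source it came from
            out['_provenance'] = {f: left_name for f in lrow}
            out['_provenance'].update({f: right_name for f in rrow})
            merged.append(out)
    return merged

# Hypothetical data: one user record, two session records
users = [{'user_id': 1, 'segment': 'early adopter'}]
sessions = [{'user_id': 1, 'duration': 320}, {'user_id': 1, 'duration': 45}]
rows = merge_with_provenance(users, sessions, 'user_id', 'users_table', 'session_log')
print(len(rows))  # -> 2, one row per matching session
```

Even something this crude makes "where did this number come from?" answerable after three or four merges.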
I'll close with questions. What else belongs in this list? Are there other people who are thinking about similar issues? What process and technical solutions could help? NoSQL and functional programming come to mind, but I haven't thought through the details.
Friday, December 23, 2011
Anyway, I'm intimidated by perl (too hard to read, too many nonalphanumeric characters) so I rewrote the script in python. The first run will take a long-ish time, since it's downloading all 450+ existing episodes of the program. Subsequent runs of the script will be faster, since it only has to download new episodes. Enjoy!
By the way, AFAIK, this type of webcrawling is completely legal. The content is already streamable from the TAL website; you're just downloading it, er, a little faster than usual.
That said, if you use this script, I'd recommend making a tax-deductible contribution to This American Life -- it's a great program, worthy of support. The "donate" button is in the upper-right corner of the This American Life webpage.
#!/usr/bin/python
# Adapted from: http://www.seanfurukawa.com/?p=246
# Translated from perl to python by Abe Gong
# Dec. 2011

import urllib, glob, datetime

def now():
    """Get the current date and time as a string."""
    return datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

def log(S):
    """Write a line to the log file, and print it for good measure."""
    logfile.write(S + '\n')
    print S

# Start up a log file
logfile = file('tal_log.txt', 'a')

# Load all the episodes that have already been downloaded; keep the filenames in a list
episodes = [f.split('/')[-1] for f in glob.glob('episodes/*.mp3')]
#print episodes

# As of today (12/11/2011) there are 452 episodes, so a count up to 500 should last a long while.
for i in range(1, 500):
    # Choose the appropriate filename
    filename = str(i) + '.mp3'
    # Add the URL prefix
    url = 'http://audio.thisamericanlife.org/jomamashouse/ismymamashouse/' + filename
    # Check to see if the file has already been downloaded
    if filename not in episodes:
        # Log the attempt
        log(now() + '\ttrying\t' + url)
        # Try to download it
        code = urllib.urlopen(url).getcode()
        if code == 200:
            urllib.urlretrieve(url, filename='episodes/' + filename)
            # Log the result -- success!
            log(now() + '\tsaved\t' + filename)
        else:
            log(now() + '\tfile not found')
Friday, December 16, 2011
If you're interested in trying Stanley Darpa or a close variant, let me know. The challenge was a lot of fun, and I'd love to run it again sometime.
Tuesday, December 13, 2011
|Me, proving that I live in Ann Arbor for Team Bass.|
(All the teams were named after mathematicians).
We've called it Stanley Darpa, in homage to Stanley Milgram's small world experiment, and the more recent DARPA red balloon challenge (more here). The winning teams used a combination of social networking ("My sister just moved there!"), crowdsourcing ("I'll ask all my facebook friends if they know anybody in Shanghai"), and desperate dashing around campus ("IS ANYBODY IN THIS CAFETERIA FROM FORT WAYNE?") to get as many pictures as possible.
I helped come up with the concept, but my main role was to develop the software. I built a simple django site where students could quickly upload pictures as they found them. To ramp up the excitement, I also built tools to let teams track scores and compare finds in real time. These added a strategic dimension to the game: e.g. "Team Markov is leading -- we need to find people from their cities so they won't get so many bonus points!"
|The Stanley-Darpa countdown page|
In true Mission Control style, we put everything up on a projector screen and watched results roll in over the course of the class. It was slow at first, but as teams split up and started canvassing, the competition really heated up. In the end, the six teams found people from 47 cities. The winning team alone found 30. Surprisingly, many of the hardest cities to find were nearby: Youngstown, OH; Gull Lake, MI; Cadillac, MI.
Tomorrow, I'll try to post a link to the actual site, but first I need to scrub the pictures, to make sure that nobody gets their ID stolen.
This was really fun -- I'd love to do it again sometime.
Tuesday, December 6, 2011
Happily, I got some very good help from the rabbitmq-discuss list:
The database RabbitMQ uses is bound to the machine's hostname, so if you copied the database dir to another machine, it won't work. If this is the case, you have to set up a machine with the same hostname as before and transfer any outstanding messages to the new machine. If there's nothing important in rabbit, you could just clear everything by removing the RabbitMQ files in /var/lib/rabbitmq.
I deleted everything in /var/lib/rabbitmq/mnesia/rabbit/ and it started up without trouble. Hooray!
I also got a warning about using outdated software:
1.7.2 is a *very* old version of the broker. You should really upgrade, as a lot of new features and a lot of bugfixes have gone in since then.
Now, as best I can recall, I originally installed rabbitmq with apt-get, running in Ubuntu 10.04. To fix this, I got the newest .deb from the rabbitMQ site and installed it. Here are the steps:
sudo apt-get install erlang-nox
sudo apt-get -f install
sudo apt-get install erlang-nox
sudo dpkg -i rabbitmq-server_2.7.0-1_all.deb
Very happy to be out of the woods on this one. I'm not a sysadmin and I don't aspire to be.
Saturday, December 3, 2011
Nice how-to for webfaction
I had this problem...
Getting started on github
The inner workings of the distributed system are still a bit of a mystery, but I gather I'm not the only one in that boat. For now, it works: I can take changes in development and push/pull them to my production server. Progress!
dev$ git commit -a
dev$ git push origin
prd$ git pull origin master
Friday, December 2, 2011
Scott's class is one of a couple dozen or so follow-up classes to the Stanford AI class that attracted 140,000 students. They're being delivered using a totally new model of education.
- They're free. Yes, free.
- They're scalable -- any number of students can sign up.
- They're taught by rock star scientists and professors -- people at the cutting edge of their fields.
- They're graded. Students submit work (multiple-choice, mostly) and get grades and a certificate of completion at the end of the course*.
This is just the latest in a series of innovations that are going to turn education -- public, private, higher, you name it -- upside down. Why settle for a half-prepped lecture from a busy assistant professor when you can get the same content--better--online for free? If you're the teacher, why bother to prep the lecture when someone else has already given it?
* Yes, yes. The grading is pretty rudimentary, but it can't be that long until smart people figure out how to do better. It's a problem I'd be interested in working on.
Thursday, December 1, 2011
Tuesday, November 29, 2011
#Getting the latest version of R on an older version of ubuntu:
#Using PIP and VirtualEnv with django
sudo apt-get install python-setuptools python-dev build-essential
pip install numpy
sudo pip install numpy
sudo pip install scipy
sudo pip install rpy2
I never quite cracked this one. It seems that by default, pip doesn't work with the bitnami djangostack. I just used easy_install, with good results.
Neither of these worked perfectly for me, but they got me to the point where I could figure it out.
#Serving static files with WSGI
Trust the django docs to have a good built-in explanation.
Monday, November 28, 2011
CSAAW presentation: "How to plan a heist: Challenges, models, and tactics for researching information flow"
By definition, all social systems involve information flow: complex stimuli that connect individuals and coordinate action. However, like executing a heist, measuring and making inferences about these flows is a difficult, thorny problem. This presentation will describe these challenges (hidden networks, subtle signals), introduce mathematical tools for analyzing them (Pearl's graphical causal models, Shannon's mutual information), and discuss tactics for moving forward (experiments, causal and behavioral aggregation, and action space mining). These principles may also be useful for breaking into banks, pilfering paintings, and conning mob bosses.
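Of the mathematical tools mentioned in the abstract, Shannon's mutual information is the easiest to illustrate. Here's a quick sketch (toy data, not from the talk) estimating I(X;Y) in bits from a list of paired observations:

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Estimate I(X;Y) in bits from a list of (x, y) observations."""
    n = float(len(pairs))
    pxy = Counter(pairs)                 # joint distribution
    px = Counter(x for x, _ in pairs)    # marginal of X
    py = Counter(y for _, y in pairs)    # marginal of Y
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)), 2)
               for (x, y), c in pxy.items())

# Perfectly dependent binary signals carry 1 bit; independent ones carry 0.
print(mutual_information([(0, 0), (1, 1)] * 50))                   # -> 1.0
print(mutual_information([(0, 0), (0, 1), (1, 0), (1, 1)] * 25))   # -> 0.0
```

In the heist framing: if the getaway driver's behavior carries bits about the lookout's signals, there's a flow to detect -- even when the channel itself is hidden.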
Friday, November 25, 2011
Perspectives on Psychological Science has a short piece on using Amazon’s Mechanical Turk as a subject pool: “Amazon’s Mechanical Turk: A New Source for Inexpensive, Yet High-Quality, Data?”
As Google Scholar shows, Mechanical Turk is being used in lots of clever ways.
Mechanical Turk has been called a digital sweatshop. Here are two perspectives – an Economic Letters piece: “The condition of the Turking class: Are online employers fair and honest?” And, a piece calling for intervention: “Working the crowd: Employment and labor law in the crowdsourcing industry.”
I've done a lot of work with mturk. It's great for large-scale repetitive tasks, but expect to spend a fair amount of effort setting up and doing quality control. One of the really interesting things about the Google Scholar links is the number of different journals and fields using mturk for research.
Monday, November 21, 2011
About FoldIt, the protein-folding game:
“Although much attention has recently been given to the potential of crowdsourcing and game playing, this is the first instance we are aware of in which online gamers solved a longstanding scientific problem.”
About perverse academic incentives:
The larger challenge is that most scientific data is proprietary. A scientist works long and hard to generate original data. [...] She is not going to want to share this data with others, particularly strangers.
“It is essential that scientists be rewarded when they share.”
About new challenges:
It takes a certain brilliance, and a lot of work, to recognize problems that can be shared with a crowd, and set up the systems needed for strangers to work together productively.
About untapped potential:
“We have used just a tiny fraction of the human attention span that goes into an episode of Jerry Springer.”
Friday, November 11, 2011
See Ruby on rails in EC2 in 20 minutes for the first part of the setup -- I use the same bitnami AMI to get started.
For the most part, I followed the steps here, but I had to do a little extra work to get the pg gem to install properly.
Here's the exact sequence of command line configuration steps:
1 sudo apt-get install postgresql
2 sudo apt-get update
3 rails new icouncil -d postgresql
4 sudo apt-get install libpq-dev
5 bundle install
7 cd icouncil/
9 bundle install
10 sudo su postgres
12 sudo apt-get install postgresql
13 sudo su postgres
14 apt-get install gedit
15 sudo apt-get install gedit
16 sudo vi /etc/postgresql/8.4/main/pg_hba.conf
17 sudo /etc/init.d/postgresql-8.4 restart
18 psql postgres -U icouncil
19 cd config
20 vi database.yml
21 cd ..
22 rake db:create:all
23 which psql
24 sudo gem install pg -- --with-pg-dir=/usr/bin
25 sudo gem install pg -- --with-pg-dir=/usr
26 rake db:create:all
27 rake db:create
28 vi Gemfile
29 bundle install
30 rake db:create
31 rails server
If you want to do it even cleaner than me, I suspect this sequence would work--haven't tried it yet, though.
sudo apt-get update
sudo apt-get install gedit
sudo apt-get install libpq-dev
sudo apt-get install postgresql
sudo su postgres
sudo vi /etc/postgresql/8.4/main/pg_hba.conf
sudo /etc/init.d/postgresql-8.4 restart
psql postgres -U icouncil
rails new icouncil -d postgresql
sudo gem install pg -- --with-pg-dir=/usr
Thursday, November 10, 2011
The most important change is that I've created zipped and tar-gzipped versions of the repository for download. Now you can get snowcrawl even if you don't speak subversion. See the downloads page for one-click downloads.
Wednesday, November 9, 2011
Turns out that understanding complexity classes can add a lot to the way we think about our own minds. Examples include:
- Do I "know" what 1,847,234 times 345 is?
- Is it possible to cheat on the Turing test?
- Can waterfalls think?
- When can you prove you know something without revealing *how* you know it?
- If my grandkids learn to time travel, do I have to worry they might kill me?
(BTW, the paper has been out there for a few months. I'm posting now because it's related--a little--to the review of computational social science I've been working on.)
Monday, November 7, 2011
Also, as far as I can tell, linguists, grammarians, and English majors call sentence diagrams "Reed-Kellogg" diagrams. NLP and computer science types call the diagrams "parse trees" or "concrete syntax trees," and produce them (usually) using probabilistic context free grammars (PCFGs).
If you want code that just works, the Stanford Parser looks like the place to start. It's in java, but I'm sure that won't be too much of a problem, because you can call it from the command line.
Python's NLTK might also work, but there are lots of different parsers, and it's not clear whether and how much training they require.
There are various other software packages out there -- some of them online -- but I doubt they'd support much volume, and batching would be a pain.
Other Software & Code
Funny blog posts about politics and grammar, or the lack thereof
Saturday, November 5, 2011
Now I'm running some text analysis on the tweets. I'll be posting code and writing up results here over the next few days. Questions are welcome!
For starters, here are words people use in support/opposition to the #OWS movement.
Here's the same data, rendered as word clouds, so it looks artsy. This really is the same data: sizes in the wordcloud are determined by the weights of the classifier -- regression betas, for you mathy people out there. Color and coordinates are arbitrary. So these wordclouds are exactly the same info as the tables above, just presented in a more visually appealing format.
As I peer at these tea leaves, I see a solidarity-oriented "stand together against brutal capitalist injustice" theme in the support words, and a libertarian "quit your whining and get to work" theme in the oppose words. What do you make of it?
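For the curious, word weights like the ones behind these tables can be approximated with nothing fancier than counting. Here's an illustrative sketch using smoothed log-odds ratios -- my actual analysis used a regression classifier, and the tweets below are invented, so treat this as a toy:

```python
import math
from collections import Counter

def word_weights(support_tweets, oppose_tweets, alpha=1.0):
    """Smoothed log-odds of each word appearing in support vs. oppose tweets.
    Positive weights lean support; negative weights lean oppose."""
    sup = Counter(w for t in support_tweets for w in t.lower().split())
    opp = Counter(w for t in oppose_tweets for w in t.lower().split())
    vocab = set(sup) | set(opp)
    n_sup, n_opp = sum(sup.values()), sum(opp.values())
    return {w: math.log((sup[w] + alpha) / (n_sup + alpha * len(vocab)))
             - math.log((opp[w] + alpha) / (n_opp + alpha * len(vocab)))
            for w in vocab}

# Invented examples, echoing the themes above
support = ['we stand together against injustice', 'solidarity with the 99ers']
oppose = ['quit whining and get a job', 'get back to work']
weights = word_weights(support, oppose)
ranked = sorted(weights, key=weights.get, reverse=True)
print(ranked[:3])   # words leaning "support"
print(ranked[-3:])  # words leaning "oppose"
```

The regression approach does the same basic thing, but shares strength across correlated words instead of treating each one independently.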
Caveats and details of the method
This analysis is based on 1,000 tweets drawn from Monday, Tuesday and Wednesday of this week, so some of the themes might be specific to the events of those days. Also, there was quite a bit of noise in the sentiment coding. That will probably wash out in a large enough sample, but I don't know if 1,000 is large enough. Finally, support on twitter was running about 85% in favor of the protests, so the assessments of opposing words are probably less robust.
Tuesday, November 1, 2011
Here's a quick 14-line python script to do the job. It takes all the html files in the ./docs directory and writes them out as clean text to the ./output directory.
Note the lxml dependency!
from lxml.html import clean
import glob, re
cleaner = clean.Cleaner( style=True, scripts=True, comments=True, safe_attrs_only=True )
filenames = glob.glob('docs/*')
for f in filenames:
    text = file(f, 'r').read()
    text = cleaner.clean_html(text)   #Remove scripts, styles, etc.
    text = re.sub('<.*?>', '', text)  #Remove html tags
    text = re.sub('\s+', ' ', text)   #Collapse whitespace
    file('output/' + f.split('/')[-1].split('.')[0] + '.txt', 'w').write(text)
Monday, October 31, 2011
Having thought about the question in the time since, I'm ready to give a better definition:
Computational social science is research that answers questions in social science using specialized knowledge from computer science.
This definition -- especially the "specialized knowledge from computer science" bit -- leads in some interesting directions. This post talks through a few of them.
Here's what I mean by "specialized knowledge." Practically speaking, computer science treats memory, bandwidth, storage, and especially computation as limited resources. Researchers in the field attempt to allocate those resources efficiently through algorithm design and efficient system architecture. Computer scientists also think a lot about best practices for deploying hardware and developing software.
As a result, this definition partly ties computational social science to the current state of software. As easy-to-use software becomes available, some areas will cease to require specialized knowledge. As that happens, "regular" social science will encroach on areas that were originally computational.
One scenario for future work in computational social science is that a small community of computer-literate developers will supply the rest of the field with software to perform specialized tasks. A similar process took place a generation ago when software packages like SPSS and STATA became available for statistical work.
Another scenario is that resources and skills for compSocSci will be concentrated in the private sector. Many academic researchers see this as a bad thing. The selfish logic of patents and trade secrets could easily lead to hoarding of proprietary data and code, and hold up the progress of scientific discovery. On the other hand, hoarding happens in the academy too, and some tech firms are big proponents of open source, so I think it's an open question which institutional structures will work best.
There are at least two types of research that are not computational under my definition. This is a good thing. We want compSocSci to be a big tent, but we also want the term to mean something. Meaning something implies that there are things that aren't computational.
First, research is not computational just because it uses computers. Thus, most researchers using Lexis-Nexis, STATA, and even Amazon's Mechanical Turk are not doing computational social science, because those applications can be used without specialized knowledge from computer science. I would classify most work in R as computational, because it requires knowledge of programming.
Second, research on technology and politics is not necessarily computational. For instance, Karpf's ethnographic work on bloggers and other Internet activists is excellent, but not really computational. (His blogosphere authority index is an exception, since it required knowledge of web development.) Other examples include content analysis of websites, using web surveys to collect data, rhetorical analysis of Youtube videos, and so on. In this work, information technology appears in the research question, but not the methodology.
There are at least five computational areas that can make huge contributions to social science:
- Information retrieval: techniques for acquiring and storing Big Data
- Machine learning, NLP, and complex network analysis: statistical techniques for drawing inferences from new types of data
- Simulation: using computers to explore the behavior of systems.
- Web design and human-computer interaction
- Computability and complexity theory: two branches of mathematics investigating the nature of information and computation
I'm planning to write about all five of these over the next month or so.
But first, does this definition work? Are there borderline cases I've missed? There's enough interest in compSocSci that I think it's worth thinking through these issues. I'd love to get feedback on these ideas.
Friday, October 28, 2011
Answer: very not hard.
I haven't done much with rails in the past (One week of code with my brother looking over my shoulder.)
I grabbed the
Once it initialized, I ssh'ed in and ran:
rails new blog
I followed the directions here to edit the gemfile, then ran:
rake db:create
Tada! Instant rails server.
PS - Yes, yes I know about heroku. In this case, building the app as an EC2 AMI is an essential part of the project. I want non-programmers to be able to clone the instance, and sharing a public AMI makes things pretty easy.
Thursday, October 27, 2011
Turns out I'm not the first person to see this need. Here are two nice, in-browser json editors. Nifty!
Wednesday, October 26, 2011
Welcome to RegExr 0.3b, an intuitive tool for learning, writing, and testing Regular Expressions. Key features include:
- real time results: shows results as you type
- code hinting: roll over your expression to see info on specific elements
- detailed results: roll over a match to see details & view group info below
- built in regex guide: double click entries to insert them into your expression
- online & desktop: regexr.com or download the desktop version for Mac, Windows, or Linux
- save your expressions: My Saved expressions are saved locally
- search Community expressions and add your own
- create Share Links to send your expressions to co-workers or link to them on Twitter or your blog [ex. http://RegExr.com?2rjl6]
Built by gskinner.com with Flex 3 [adobe.com/go/flex] and Spelling Plus Library for text highlighting [gskinner.com/products/spl].
Tuesday, October 25, 2011
Monday, October 24, 2011
My goal is to document ideas about computational social science as I encounter them in the course of my work. I'll post links, papers, scripts, software, and other bits and pieces. Over time, I hope this will grow into a useful repository for people interested in using computers to study social dynamics. If the blog gathers enough like-minded readership, it might also turn into a good place for discussion.
I'll be setting up formatting, etc. in my spare time over the next couple weeks. Please let me know if you run into rough edges.
Friday, August 5, 2011
Now the waiting and nail-biting begins.