Tuesday, November 29, 2011

Configuring django with wsgi

In the last couple of weeks, I've fallen in love with django. Here are some notes (just for reference) on getting django to talk with R, numpy, and WSGI.

#Getting the latest version of R on an older version of ubuntu:

#Using PIP and VirtualEnv with django

sudo apt-get install python-setuptools python-dev build-essential
pip install numpy
sudo pip install numpy
sudo pip install scipy
sudo pip install rpy2

I never quite cracked this one. It seems that by default, pip doesn't work with the bitnami djangostack. I just used easy_install, with good results.

#Configuring WSGI

Neither of these worked perfectly for me, but they got me to the point where I could figure it out.
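
For reference, what a WSGI server ultimately needs is a single Python callable, conventionally named application. Here's a minimal sketch of that contract using only the standard library. It is not the django configuration itself (django supplies its own application object), and the call_app helper is a hypothetical convenience for exercising an app without a web server.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    """Smallest useful WSGI app: the server calls this once per request."""
    body = b"Hello from WSGI\n"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]  # an iterable of byte strings


def call_app(app, path="/"):
    """Hypothetical helper: invoke a WSGI app directly and return (status, body)."""
    environ = {}
    setup_testing_defaults(environ)  # fill in a plausible fake request
    environ["PATH_INFO"] = path
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


if __name__ == "__main__":
    print(call_app(application))
```

If this much works under your server, most remaining problems tend to be about paths and settings rather than WSGI itself.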

#Serving static files with WSGI

Trust the django docs to have a good built-in explanation.

Monday, November 28, 2011

CSAAW presentation: "How to plan a heist: Challenges, models, and tactics for researching information flow"

Here are slides from my CSAAW presentation before Thanksgiving. Information flow is a topic near and dear to my heart, and I liked the "heist" twist on the presentation. Someday, I'd like to come back to this presentation and add some polish.

How to plan a heist: Challenges, models, and tactics for researching information flow

By definition, all social systems involve information flow: complex stimuli that connect individuals and coordinate action. However, like executing a heist, measuring and making inferences about these flows is a difficult, thorny problem. This presentation will describe these challenges (hidden networks, subtle signals), introduce mathematical tools for analyzing them (Pearl's graphical causal models, Shannon's mutual information), and discuss tactics for moving forward (experiments, causal and behavioral aggregation, and action space mining). These principles may also be useful for breaking into banks, pilfering paintings, and conning mob bosses.

Friday, November 25, 2011

using mechanical turk for research

From orgtheory on using mechanical turk for research:

Perspectives on Psychological Science has a short piece on using Amazon’s Mechanical Turk as a subject pool: “Amazon’s Mechanical Turk: A New Source for Inexpensive, Yet High-Quality, Data?”

As Google Scholar shows, Mechanical Turk is being used in lots of clever ways.

Mechanical Turk has been called a digital sweatshop. Here are two perspectives – an Economic Letters piece: “The condition of the Turking class: Are online employers fair and honest?” And, a piece calling for intervention: “Working the crowd: Employment and labor law in the crowdsourcing industry.”

Here’s the Mechanical Turk page. Here are some research-related tasks that you can get paid for.

I've done a lot of work with mturk. It's great for large-scale repetitive tasks, but expect to spend a fair amount of effort setting up and doing quality control. One of the really interesting things about the Google Scholar links is the number of different journals and fields using mturk for research.

Monday, November 21, 2011

Crowdsourcing science

Great article in the Boston Globe about crowdsourcing science.

About FoldIt, the protein-folding game:
“Although much attention has recently been given to the potential of crowdsourcing and game playing, this is the first instance we are aware of in which online gamers solved a longstanding scientific problem.”

About perverse academic incentives:
The larger challenge is that most scientific data is proprietary. A scientist works long and hard to generate original data. [...] She is not going to want to share this data with others, particularly strangers.
“It is essential that scientists be rewarded when they share.”

About new challenges:
It takes a certain brilliance, and a lot of work, to recognize problems that can be shared with a crowd, and set up the systems needed for strangers to work together productively.

About untapped potential:
“We have used just a tiny fraction of the human attention span that goes into an episode of Jerry Springer.”

Friday, November 11, 2011

Running rails with postgres

On my brother's advice, I configured rails to run with postgresql instead of just SQLite. It took a couple of hours, but I figure it will be worth it, since postgres is a production-quality database server, not just a single file on disk.

See Ruby on rails in EC2 in 20 minutes for the first part of the setup -- I use the same bitnami AMI to get started.

For the most part, I followed the steps here, but I had to do a little extra work to get the pg gem to install properly.

Here's the exact sequence of command line configuration steps:

sudo apt-get install postgresql
sudo apt-get update
rails new icouncil -d postgresql
sudo apt-get install libpq-dev
bundle install
ls
cd icouncil/
ls
bundle install
sudo su postgres
psql
sudo apt-get install postgresql
sudo su postgres
apt-get install gedit
sudo apt-get install gedit
sudo vi /etc/postgresql/8.4/main/pg_hba.conf
sudo /etc/init.d/postgresql-8.4 restart
psql postgres -U icouncil
cd config
vi database.yml
cd ..
rake db:create:all
which psql
sudo gem install pg -- --with-pg-dir=/usr/bin
sudo gem install pg -- --with-pg-dir=/usr
rake db:create:all
rake db:create
vi Gemfile
bundle install
rake db:create
rails server
history

If you want to do it even cleaner than me, I suspect this sequence would work--haven't tried it yet, though.

sudo apt-get update
sudo apt-get install gedit
sudo apt-get install libpq-dev
sudo apt-get install postgresql

sudo su postgres

sudo vi /etc/postgresql/8.4/main/pg_hba.conf
sudo /etc/init.d/postgresql-8.4 restart
psql postgres -U icouncil

rails new icouncil -d postgresql
cd icouncil/
vi Gemfile
cd config
vi database.yml
sudo gem install pg -- --with-pg-dir=/usr

bundle install
rake db:create
rails server

Thursday, November 10, 2011

snowCrawl bug fixes and easier downloading

I've made a few small fixes to the snowCrawl python library. (See the updates page for details.)

The most important change is that I've created zipped and tar-gzipped versions of the repository for download. Now you can get snowCrawl even if you don't speak subversion. See the downloads page for one-click downloads.

Wednesday, November 9, 2011

Why philosophers should care about computational complexity

Read this paper a month or two back.  It's a truly fascinating discussion of the mathematics of computational complexity ("How hard is it to find an answer to a problem in class X using algorithm Y?") and its implications for some puzzles in philosophy.

Turns out that understanding complexity classes can add a lot to the way we think about our own minds. Examples include:
  • Do I "know" what 1,847,234 times 345 is?
  • Is it possible to cheat on the Turing test?
  • Can waterfalls think?
  • When can you prove you know something without revealing *how* you know it?
  • If my grandkids learn to time travel, do I have to worry they might kill me?
There's math, but Aaronson has gone to a lot of effort to keep it manageable and well-explained. Highly recommend it.

(BTW, the paper has been out there for a few months. I'm posting now because it's related--a little--to the review of computational social science I've been working on.)

Monday, November 7, 2011

Resources on NLP sentence diagramming

Here are some notes from a recent search for resources on automatic sentence diagramming. I was looking for code/software to diagram sentences automatically, ideally in python.

Also, as far as I can tell, linguists, grammarians, and English majors call sentence diagrams "Reed-Kellogg" diagrams. NLP and computer science types call the diagrams "parse trees" or "concrete syntax trees," and produce them (usually) using probabilistic context-free grammars (PCFGs).
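
To make the parse-tree idea concrete, here's a minimal sketch of how a PCFG can turn a sentence into a tree, using the textbook CKY algorithm in plain python. The grammar, its probabilities, and the function names are all made up for illustration. This is not how any particular parser package is implemented, and real parsers rely on far larger grammars estimated from annotated text.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form (invented for this example).
BINARY = {  # (left child, right child) -> [(parent, probability)]
    ("NP", "VP"): [("S", 1.0)],
    ("Det", "N"): [("NP", 1.0)],
    ("V", "NP"): [("VP", 1.0)],
}
LEXICAL = {  # word -> [(part of speech, probability)]
    "the": [("Det", 1.0)],
    "dog": [("N", 0.5)],
    "cat": [("N", 0.5)],
    "chased": [("V", 1.0)],
}


def cky_parse(words):
    """Return (probability, tree) for the most likely S over words, or None."""
    n = len(words)
    # best[(i, j)][label] = (probability, tree) for the span words[i:j]
    best = defaultdict(dict)
    for i, word in enumerate(words):
        for label, p in LEXICAL.get(word, []):
            best[(i, i + 1)][label] = (p, (label, word))
    for width in range(2, n + 1):           # build longer spans from shorter ones
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):       # every split point inside the span
                for left, (pl, tl) in best[(i, k)].items():
                    for right, (pr, tr) in best[(k, j)].items():
                        for parent, p in BINARY.get((left, right), []):
                            prob = p * pl * pr
                            if prob > best[(i, j)].get(parent, (0.0, None))[0]:
                                best[(i, j)][parent] = (prob, (parent, tl, tr))
    return best[(0, n)].get("S")


def bracket(tree):
    """Render a tree as a bracketed string, e.g. (NP (Det the) (N dog))."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return "(%s %s)" % (label, children[0])
    return "(%s %s)" % (label, " ".join(bracket(c) for c in children))


if __name__ == "__main__":
    prob, tree = cky_parse("the dog chased the cat".split())
    print(prob, bracket(tree))
```

The bracketed string is the same information as a drawn parse tree, just flattened into text.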
Search result
If you want code that just works, the Stanford Parser looks like the place to start. It's in java, but I'm sure that won't be too much of a problem, because you can call it from the command line.

Python's NLTK might also work, but there are lots of different parsers, and it's not clear whether and how much training they require.

There are various other software packages out there -- some of them online -- but I doubt they'd support much volume, and batching would be a pain.



Stanford Parser

Other Software & Code

Funny blog posts about politics and grammar, or the lack thereof

Saturday, November 5, 2011

What ideas do twitter users invoke to support/oppose the #OWS protests?

I've been getting used to the twitter API.  As a quick test, I downloaded a bunch of tweets about #OWS. Then I uploaded 1,000 of them to mturk and asked turkers to classify them as supportive or opposed to the #OWS movement.

Now I'm running some text analysis on the tweets. I'll be posting code and writing up results here over the next few days. Questions are welcome!

For starters, here are words people use in support/opposition to the #OWS movement.


Support word    Weight   Oppose word     Weight
love              0.48   homeless         -0.93
politics          0.46   focus             -0.9
opwallstreet      0.43   crowd            -0.75
congress          0.43   handouts         -0.74
stand              0.4   capitalist       -0.71
bank              0.39   irony            -0.59
brutal            0.36   called           -0.51
class             0.32   act              -0.49
strike            0.31   scanner          -0.46
poll              0.31   happened         -0.44
evict             0.31   received         -0.36
p21               0.31   getting          -0.33
stay              0.29   quotwe           -0.32
global            0.29   dont             -0.32
help              0.25   cont             -0.31
justice           0.23   john              -0.3
income            0.23   home             -0.29
senatorsanders    0.23   paul             -0.28
moveon            0.22   hear             -0.24
occupywallst      0.21   weoccupyamerica  -0.23
solidarity         0.2   protests         -0.22
call               0.2   free             -0.19
cop                0.2   tents            -0.19
allowed           0.17   protesting       -0.14
peaceful          0.16   occupylsx        -0.13

Here's the same data, rendered as word clouds, so it looks artsy.  This really is the same data: sizes in the wordcloud are determined by the weights of the classifier -- regression betas, for you mathy people out there. Color and coordinates are arbitrary. So these wordclouds are exactly the same info as the tables above, just presented in a more visually appealing format.

In support:


As I peer at these tea leaves, I see a solidarity-oriented "stand together against brutal capitalist injustice" theme in the support words, and a libertarian "quit your whining and get to work" theme in the oppose words. What do you make of it?

Caveats and details of the method
This analysis is based on 1,000 tweets drawn from Monday, Tuesday and Wednesday of this week, so some of the themes might be specific to the events of those days. Also, there was quite a bit of noise in the sentiment coding. That will probably wash out in a large enough sample, but I don't know if 1,000 is large enough. Finally, support on twitter was running about 85% in favor of the protests, so the assessments of opposing words are probably less robust.
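
For readers wondering where per-word weights like these can come from, here's a minimal sketch assuming a bag-of-words logistic regression trained by plain gradient steps. The post doesn't spell out the exact model or preprocessing, so treat this as one plausible reading of "regression betas" rather than the actual pipeline. The example tweets and the train_word_weights name are invented.

```python
import math
from collections import defaultdict


def train_word_weights(tweets, labels, epochs=200, lr=0.5):
    """Fit a bag-of-words logistic regression.

    labels are 1 (supports) or 0 (opposes). Returns {word: weight}, where
    positive weights lean toward support and negative toward opposition.
    """
    weights = defaultdict(float)
    bias = 0.0
    docs = [set(t.lower().split()) for t in tweets]  # word presence, not counts
    for _ in range(epochs):
        for words, y in zip(docs, labels):
            z = bias + sum(weights[w] for w in words)
            p = 1.0 / (1.0 + math.exp(-z))  # predicted probability of support
            err = y - p
            bias += lr * err
            for w in words:
                weights[w] += lr * err
    return dict(weights)


if __name__ == "__main__":
    # Invented toy data, only to show the shape of the inputs.
    tweets = [
        "solidarity with the protesters",
        "stand with ows",
        "quit asking for handouts",
        "handouts for the crowd",
    ]
    labels = [1, 1, 0, 0]
    weights = train_word_weights(tweets, labels)
    for word, beta in sorted(weights.items(), key=lambda kv: -kv[1]):
        print("%-12s %6.2f" % (word, beta))
```

With only a handful of tweets the weights mean very little; a real run would want regularization and far more data, which is the same concern as the sample-size caveat above.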

Tuesday, November 1, 2011

Python html cleaner

A friend asked for help cleaning html files today --- getting rid of scripts, styles, tags, and whatnot.

Here's a quick 14-line python script to do the job. It takes all the html files in the ./docs directory and writes them out as clean text to the ./output directory.

Note the lxml dependency!

from lxml.html import clean
import glob, os, re

cleaner = clean.Cleaner( style=True, scripts=True, comments=True, safe_attrs_only=True )

filenames = glob.glob('docs/*')
for f in filenames:
    print('='*80)
    text = open(f, 'r').read()
    text = cleaner.clean_html( text )    #Remove scripts, styles, etc.
    text = re.sub(r'<.*?>', '', text, flags=re.DOTALL )    #Remove html tags, including ones that span lines
    text = re.sub(r'\s+', ' ', text )    #Collapse runs of whitespace into single spaces
    print(text)
    open( 'output/'+os.path.splitext(os.path.basename(f))[0]+'.txt', 'w').write( text )