Tuesday, November 29, 2011
#Getting the latest version of R on an older version of ubuntu:
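The standard route is CRAN's ubuntu repository. A minimal sketch (the mirror URL and the "lucid" codename are placeholders -- substitute your own ubuntu release):
sudo sh -c 'echo "deb http://cran.r-project.org/bin/linux/ubuntu lucid/" >> /etc/apt/sources.list'
gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9
gpg -a --export E084DAB9 | sudo apt-key add -
sudo apt-get update
sudo apt-get install r-base r-base-dev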
With R in place, the python side (numpy, scipy, and the rpy2 bridge to R) goes in via pip:
sudo apt-get install python-setuptools python-dev build-essential
sudo pip install numpy
sudo pip install scipy
sudo pip install rpy2
Note the sudo: a plain "pip install numpy" failed for me until I ran it as root.
#Using PIP and VirtualEnv with django
I never quite cracked this one. It seems that by default, pip doesn't work with the bitnami djangostack. I just used easy_install, with good results.
Neither of these worked perfectly for me, but they got me to the point where I could figure it out.
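For what it's worth, the easy_install fallback is a drop-in sketch like this (the env name and package list are illustrative, not exactly what I ran):
sudo easy_install virtualenv
virtualenv django_env
source django_env/bin/activate
easy_install django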
#Serving static files with WSGI
Trust the django docs to have a good built-in explanation.
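For the impatient, here's the basic shape of the trick as a minimal sketch -- the paths and the /static/ prefix are assumptions, and the docs' version is more robust:
import mimetypes, os
from django.core.handlers.wsgi import WSGIHandler

django_app = WSGIHandler()
STATIC_ROOT = '/path/to/static'   #assumption: wherever your static files live

def application(environ, start_response):
    path = environ.get('PATH_INFO', '')
    if path.startswith('/static/'):
        filepath = os.path.join(STATIC_ROOT, path[len('/static/'):])   #no path sanitizing -- demo only
        if os.path.isfile(filepath):
            ctype = mimetypes.guess_type(filepath)[0] or 'application/octet-stream'
            start_response('200 OK', [('Content-Type', ctype)])
            return [open(filepath, 'rb').read()]
        start_response('404 Not Found', [('Content-Type', 'text/plain')])
        return ['Not found']
    return django_app(environ, start_response)   #everything else goes to django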
Monday, November 28, 2011
CSAAW presentation: "How to plan a heist: Challenges, models, and tactics for researching information flow"
By definition, all social systems involve information flow: complex stimuli that connect individuals and coordinate action. However, like executing a heist, measuring and making inferences about these flows is a difficult, thorny problem. This presentation will describe these challenges (hidden networks, subtle signals), introduce mathematical tools for analyzing them (Pearl's graphical causal models, Shannon's mutual information), and discuss tactics for moving forward (experiments, causal and behavioral aggregation, and action space mining). These principles may also be useful for breaking into banks, pilfering paintings, and conning mob bosses.
Friday, November 25, 2011
Perspectives on Psychological Science has a short piece on using Amazon’s Mechanical Turk as a subject pool: “Amazon’s Mechanical Turk: A New Source for Inexpensive, Yet High-Quality, Data?”
As Google Scholar shows, Mechanical Turk is being used in lots of clever ways.
Mechanical Turk has been called a digital sweatshop. Here are two perspectives – an Economics Letters piece: “The condition of the Turking class: Are online employers fair and honest?” And a piece calling for intervention: “Working the crowd: Employment and labor law in the crowdsourcing industry.”
I've done a lot of work with mturk. It's great for large-scale repetitive tasks, but expect to spend a fair amount of effort setting up and doing quality control. One of the really interesting things about the Google Scholar links is the number of different journals and fields using mturk for research.
Monday, November 21, 2011
About FoldIt, the protein-folding game:
“Although much attention has recently been given to the potential of crowdsourcing and game playing, this is the first instance we are aware of in which online gamers solved a longstanding scientific problem.”
About perverse academic incentives:
The larger challenge is that most scientific data is proprietary. A scientist works long and hard to generate original data. [...] She is not going to want to share this data with others, particularly strangers.
“It is essential that scientists be rewarded when they share.”
About new challenges:
It takes a certain brilliance, and a lot of work, to recognize problems that can be shared with a crowd, and set up the systems needed for strangers to work together productively.
About untapped potential:
“We have used just a tiny fraction of the human attention span that goes into an episode of Jerry Springer.”
Friday, November 11, 2011
See Ruby on Rails in EC2 in 20 minutes for the first part of the setup -- I use the same bitnami AMI to get started.
For the most part, I followed the steps here, but I had to do a little extra work to get the pg gem to install properly.
Here's the exact sequence of command line configuration steps:
sudo apt-get install postgresql
sudo apt-get update
rails new icouncil -d postgresql
sudo apt-get install libpq-dev                    # headers the pg gem needs to compile
bundle install
cd icouncil/
bundle install
sudo su postgres
sudo apt-get install postgresql
sudo su postgres
apt-get install gedit                             # failed; needed sudo
sudo apt-get install gedit
sudo vi /etc/postgresql/8.4/main/pg_hba.conf      # adjust local auth (see sketch below)
sudo /etc/init.d/postgresql-8.4 restart
psql postgres -U icouncil
cd config
vi database.yml                                   # point rails at postgres (see sketch below)
cd ..
rake db:create:all
which psql
sudo gem install pg -- --with-pg-dir=/usr/bin     # wrong dir; /usr is what works
sudo gem install pg -- --with-pg-dir=/usr
rake db:create:all
rake db:create
vi Gemfile
bundle install
rake db:create
rails server
If you want to do it even more cleanly than I did, I suspect this sequence would work -- haven't tried it yet, though.
sudo apt-get update
sudo apt-get install gedit
sudo apt-get install libpq-dev
sudo apt-get install postgresql
sudo su postgres
sudo vi /etc/postgresql/8.4/main/pg_hba.conf
sudo /etc/init.d/postgresql-8.4 restart
psql postgres -U icouncil
rails new icouncil -d postgresql
sudo gem install pg -- --with-pg-dir=/usr
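The two vi steps are where the real configuration happens. As a sketch (the auth mode, usernames, and passwords here are placeholders, not necessarily what I used): the usual pg_hba.conf change is to put local connections on password auth with a line like
local   all   all   md5
and config/database.yml needs to point rails at postgres, roughly:
development:
  adapter: postgresql
  encoding: unicode
  database: icouncil_development
  username: icouncil
  password: secret
  host: localhost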
Thursday, November 10, 2011
The most important change is that I've created zipped and tar-gzipped versions of the repository for download. Now you can get snowcrawl even if you don't speak subversion. See the downloads page for one-click downloads.
Wednesday, November 9, 2011
Turns out that understanding complexity classes can add a lot to the way we think about our own minds. Examples include:
- Do I "know" what 1,847,234 times 345 is?
- Is it possible to cheat on the Turing test?
- Can waterfalls think?
- When can you prove you know something without revealing *how* you know it?
- If my grandkids learn to time travel, do I have to worry they might kill me?
(BTW, the paper has been out there for a few months. I'm posting now because it's related--a little--to the review of computational social science I've been working on.)
Monday, November 7, 2011
Also, as far as I can tell, linguists, grammarians, and English majors call sentence diagrams "Reed-Kellogg" diagrams. NLP and computer science types call the diagrams "parse trees" or "concrete syntax trees," and produce them (usually) using probabilistic context-free grammars (PCFGs).
If you want code that just works, the Stanford Parser looks like the place to start. It's in Java, but I'm sure that won't be too much of a problem, because you can call it from the command line.
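For reference, the invocation looks something like this (the jar and model file names vary by release -- check the lexparser.sh script that ships with the download):
java -mx500m -cp stanford-parser.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat penn englishPCFG.ser.gz sentences.txt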
Python's NLTK might also work, but there are lots of different parsers, and it's not clear whether and how much training they require.
There are various other software packages out there -- some of them online -- but I doubt they'd support much volume, and batching would be a pain.
Other Software & Code
Funny blog posts about politics and grammar, or the lack thereof
Saturday, November 5, 2011
Now I'm running some text analysis on the tweets. I'll be posting code and writing up results here over the next few days. Questions are welcome!
For starters, here are words people use in support/opposition to the #OWS movement.
Here's the same data, rendered as word clouds so it looks artsy. It really is the same data: word sizes are determined by the weights of the classifier -- regression betas, for you mathy people out there -- while color and position are arbitrary. So the wordclouds show exactly the same info as the tables above, just in a more visually appealing format.
As I peer at these tea leaves, I see a solidarity-oriented "stand together against brutal capitalist injustice" theme in the support words, and a libertarian "quit your whining and get to work" theme in the oppose words. What do you make of it?
Caveats and details of the method
This analysis is based on 1,000 tweets drawn from Monday, Tuesday, and Wednesday of this week, so some of the themes might be specific to the events of those days. Also, there was quite a bit of noise in the sentiment coding. That will probably wash out in a large enough sample, but I don't know if 1,000 is large enough. Finally, support on twitter was running about 85% in favor of the protests, so the assessments of opposing words are probably less robust.
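For the curious, here's a minimal sketch of the kind of classifier behind those word weights. scikit-learn and logistic regression are stand-ins here, and the toy tweets and labels are obviously not the real data:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

tweets = [("we stand together #ows", 1), ("get a job #ows", 0)]   #toy data: 1 = support, 0 = oppose
texts, labels = zip(*tweets)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)              #bag-of-words counts
model = LogisticRegression().fit(X, labels)

#Rank words by regression weight: most negative = oppose-flavored,
#most positive = support-flavored.
words = vectorizer.get_feature_names_out()
ranked = sorted(zip(model.coef_[0], words))
print("oppose words:", [w for c, w in ranked[:10]])
print("support words:", [w for c, w in ranked[-10:]])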
Tuesday, November 1, 2011
Here's a quick python script to do the job. It takes all the html files in the ./docs directory and writes them out as clean text to the ./output directory.
Note the lxml dependency!
from lxml.html import clean
import glob, re

cleaner = clean.Cleaner( style=True, scripts=True, comments=True, safe_attrs_only=True )
filenames = glob.glob('docs/*')
for f in filenames:
    text = file(f, 'r').read()
    text = cleaner.clean_html( text )   #Remove scripts, styles, etc.
    text = re.sub('<.*?>', '', text )   #Remove remaining html tags
    text = re.sub('\s+', ' ', text )    #Collapse whitespace
    outname = 'output/' + f.split('/')[-1].split('.')[0] + '.txt'
    file( outname, 'w' ).write( text )  #The ./output directory must already exist