Friday, December 30, 2011
Before I go on, let me say that this is a really cool application. Instead of the he-said-she-said debate that's running in the media, this piece brings some actual data to bear on the conversation. Bravo!
That said, now I'm going to harp on statistics and inference. The problem with Larson's analysis is that he never addresses the question, "If Ron Paul didn't write the newsletters, who did?" Without answering that question, and putting some probabilities behind it, it's going to be very hard for text analysis to settle the issue one way or the other. (Larson admits this on his blog.) Right now, his analysis shows that between himself and Ron Paul, Paul is much more likely to have written the letters. Not exactly a smoking gun.
That said, it's interesting data. For the record, I'm mostly convinced. In my mind, the statistics are flawed, but they still lend some weight against Paul. Oddly enough, Larson's finding that several of the letters were probably *not* written by Ron Paul was particularly persuasive. It feels human and messy, the way I would expect this kind of thing to be.
Final thoughts, mainly intended for my statistically minded friends: yes, this is an unabashedly Bayesian perspective. I'm demanding priors, which probably can't be specified to anyone's satisfaction.
IMHO, the frequentist approach has even deeper problems. From a frequentist perspective (which is where Larson's original, and deeply confusing, p-values come from), we use Paul's recent speeches and writings to estimate some parameters of his current text-generating process. We then compare the newsletters to that process and estimate the probability that the older text was generated by the same process.
Problem: we *know* the old text was not generated by the same process. It was written (allegedly) by a younger Ron Paul, on different topics, speaking into a different political climate. Without a broader framework, it's impossible to determine whether the differences are important. The Bayesian approach provides a direct way of assessing that framework. The frequentist approach doesn't -- at least not that I can see, without jumping through a lot of hoops -- and in the meantime, it obscures the test that's actually being conducted.
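To make the Bayesian point concrete, here's a minimal, hypothetical sketch of authorship comparison with explicit priors. It uses a toy unigram model with add-one smoothing -- nothing like the feature set Larson actually used -- and the corpora are made up. The point is simply that the answer moves with the prior, which is exactly the thing that has to be specified:

```python
import math
from collections import Counter

def word_counts(text):
    """Crude tokenizer: lowercase, split on whitespace."""
    return Counter(text.lower().split())

def log_likelihood(text, corpus, alpha=1.0):
    """Log P(text | author), from a smoothed unigram model of the author's known corpus."""
    counts = word_counts(corpus)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 bucket for unseen words
    return sum(n * math.log((counts[w] + alpha) / (total + alpha * vocab))
               for w, n in word_counts(text).items())

def posterior(text, corpora, priors):
    """P(author | text) for each candidate author, given explicit priors."""
    log_post = {a: math.log(priors[a]) + log_likelihood(text, corpora[a])
                for a in corpora}
    m = max(log_post.values())
    unnorm = {a: math.exp(lp - m) for a, lp in log_post.items()}
    z = sum(unnorm.values())
    return {a: p / z for a, p in unnorm.items()}

# Made-up candidate corpora -- purely illustrative.
corpora = {'A': 'the market is free and the state is not',
           'B': 'we stand together against injustice in the state'}
text = 'the market is free'
print(posterior(text, corpora, {'A': 0.5, 'B': 0.5}))
print(posterior(text, corpora, {'A': 0.1, 'B': 0.9}))
```

The likelihood term is the part Larson computed; the prior -- "if not Paul, then who, and how likely?" -- is the part nobody has pinned down.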
Thursday, December 29, 2011
Research Officer in Quantitative Text Analysis
Duration: 24 months
Start Date: 1 March 2012 or as soon as possible thereafter
Salary: £31,998 – £38,737 p.a. incl.
Applications are invited for the post of Research Officer on the European Research Council funded grant Quantitative Text Analysis for the Social Sciences (QUANTESS), working with Professor Kenneth Benoit (PI).
The research officer’s general duties and responsibilities will be to work with text: the computational organization, storage, processing, and analysis of textual data. These tasks will involve a combination of programming, database work, and statistical computation. The texts will be drawn from social, political, legal, and commercial examples, and will have a primarily social science focus. The research officer will be expected to work with existing tools for text analysis, actively participate in the development of new tools, and participate in the application of these tools to the social scientific analysis of textual data.
The successful applicant will be expected to possess advanced skills and experience with computer programming, especially the ability to use a language suited to text processing such as Python; familiarity with SQL; and experience with the R statistical package or the ability to learn R.
The successful candidate should have a postgraduate degree in computer science, computational linguistics, or possibly a cognate social science discipline; have an interest in text analysis and quantitative linguistics; have a knowledge of social science statistics; and have worked in a research environment previously.
To apply for this post please go to http://www.lse.ac.uk/JobsatLSE and select “Visit the ONLINE RECRUITMENT SYSTEM web page”. If you have any queries about applying on the online system, please call 020 7955 6656 or email firstname.lastname@example.org quoting reference 1223243.
Closing date for receipt of applications is: 31 January 2012 by 23.59 (UK time).
Seems to me that MVP/MVC (or even MVVM, if you're into that kind of thing) are good at the following:
- User experience
- Database performance
Having worked with (and built) several different systems at this point, I've realized that some designs make analytics easier, and some make them much, much harder. And nothing in general-purpose guidelines for good software design guarantees good design for analytics.
Since so much of what I do is analytics, I'd like to ferret out some best practices for that kind of development. I don't have any settled ideas yet, but I thought I'd put some observations on paper.
Some general ideas:
- Merges are a pain point. When doing analytics, I spend a large fraction of my time merging and converting data. Seems like there ought to be some good practices/tools to take away some of the pain.
- Visualization is also a pain point, but I'm less optimistic about fixing it. There's a lot of art to good visualization.
- Units of analysis might be a good place to begin/focus thinking. They tend to change less often than variables, and many design issues for archiving, merging, reporting, and hypothesis testing focus on units of analysis.
- The most important unit of analysis is probably the user, because most leap-of-faith assumptions center on users and markets, and because people are just plain complicated. In some situations (e.g. B2B), the unit of analysis might be a group or organization, but even then, users are going to play an important role.
- Make it easy to keep track of where the data come from! Any time you change the
- From a statistical perspective, we probably want to assume independence all over the place for simplicity -- but be aware that that's what we're doing! For instance, it might make sense to assume that sessions are independent, even though they're actually linked across users.
- User segmentation seems like an underexploited area. That is, most UI optimization is done using A-B testing, which optimizes for the "average user." But in many cases, it could be very useful to try to segment the population into sub-populations, and figure out how their needs are different. This won't work when we only have short interactions with anonymous users. But if we have some history or background data (e.g. FB graph info), it could be a very powerful tool.
- Corollary: grab user data whenever it's cheap.
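As one small stab at the merge pain point above, a merge helper could tag every output row with its provenance, so it's always clear where each field came from. This is a toy sketch -- the function, table names, and fields are all hypothetical:

```python
def merge_with_provenance(left, right, key, left_name='left', right_name='right'):
    """Inner-join two lists of dicts on `key`, recording which source supplied each field."""
    right_index = {}
    for row in right:
        right_index.setdefault(row[key], []).append(row)
    merged = []
    for lrow in left:
        for rrow in right_index.get(lrow[key], []):
            out = dict(lrow)
            out.update(rrow)
            # Provenance: map each field to the source it came from
            out['_provenance'] = {f: left_name for f in lrow}
            out['_provenance'].update({f: right_name for f in rrow})
            merged.append(out)
    return merged

# Hypothetical data: one user record, two session records
users = [{'user_id': 1, 'segment': 'early adopter'}]
sessions = [{'user_id': 1, 'duration': 320}, {'user_id': 1, 'duration': 45}]
rows = merge_with_provenance(users, sessions, 'user_id', 'users_table', 'session_log')
print(len(rows))  # -> 2, one row per matching session
```

Even something this crude makes "where did this number come from?" answerable after three or four merges.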
I'll close with questions. What else belongs in this list? Are there other people who are thinking about similar issues? What process and technical solutions could help? NoSQL and functional programming come to mind, but I haven't thought through the details.
Friday, December 23, 2011
Anyway, I'm intimidated by perl (too hard to read, too many nonalphanumeric characters) so I rewrote the script in python. The first run will take a long-ish time, since it's downloading all 450+ existing episodes of the program. Subsequent runs of the script will be faster, since it only has to download new episodes. Enjoy!
By the way, AFAIK, this type of webcrawling is completely legal. The content is already streamable from the TAL website; you're just downloading it, er, a little faster than usual.
That said, if you use this script, I'd recommend making a tax-deductible contribution to This American Life -- it's a great program, worthy of support. The "donate" button is in the upper-right corner of the This American Life webpage.
#!/usr/bin/python
# Adapted from: http://www.seanfurukawa.com/?p=246
# Translated from perl to python by Abe Gong
# Dec. 2011

import urllib, glob, datetime

def now():
    """Get the current date and time as a string."""
    return datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")

def log(S):
    """Write a line to the log file, and print it for good measure."""
    logfile.write(S + '\n')
    print S

# Start up a log file
logfile = file('tal_log.txt', 'a')

# Load all the episodes that have already been downloaded; keep the filenames in a list
episodes = [f.split('/')[-1] for f in glob.glob('episodes/*.mp3')]
#print episodes

# As of today (12/11/2011) there are 452 episodes, so a count up to 500 should last a long while.
for i in range(1, 500):
    # Choose the appropriate filename
    filename = str(i) + '.mp3'
    # Add the URL prefix
    url = 'http://audio.thisamericanlife.org/jomamashouse/ismymamashouse/' + filename
    # Check to see if the file has already been downloaded
    if filename not in episodes:
        # Log the attempt
        log(now() + '\ttrying\t' + url)
        # Try to download it
        code = urllib.urlopen(url).getcode()
        if code == 200:
            urllib.urlretrieve(url, filename='episodes/' + filename)
            # Log the result -- success!
            log(now() + '\tsaved\t' + filename)
        else:
            log(now() + '\tfile not found')
Friday, December 16, 2011
If you're interested in trying Stanley Darpa or a close variant, let me know. The challenge was a lot of fun, and I'd love to run it again sometime.
Tuesday, December 13, 2011
|Me, proving that I live in Ann Arbor for Team Bass.|
(All the teams were named after mathematicians).
We've called it Stanley Darpa, in homage to Stanley Milgram's small world experiment, and the more recent DARPA red balloon challenge (more here). The winning teams used a combination of social networking ("My sister just moved there!"), crowdsourcing ("I'll ask all my facebook friends if they know anybody in Shanghai"), and desperate dashing around campus ("IS ANYBODY IN THIS CAFETERIA FROM FORT WAYNE?") to get as many pictures as possible.
I helped come up with the concept, but my main role was to develop the software. I built a simple django site where students could quickly upload pictures as they found them. To ramp up the excitement, I also built tools to let teams track scores and compare finds in real time. These added a strategic dimension to the game: e.g. "Team Markov is leading -- we need to find people from their cities so they won't get so many bonus points!"
|The Stanley-Darpa countdown page|
In true Mission Control style, we put everything up on a projector screen and watched results roll in over the course of the class. It was slow at first, but as teams split up and started canvassing, the competition really heated up. In the end, the six teams found people from 47 cities. The winning team alone found 30. Surprisingly, many of the hardest cities to find were nearby: Youngstown, OH; Gull Lake, MI; Cadillac, MI.
Tomorrow, I'll try to post a link to the actual site, but first I need to scrub the pictures, to make sure that nobody gets their ID stolen.
This was really fun -- I'd love to do it again sometime.
Tuesday, December 6, 2011
Happily, I got some very good help from the rabbitmq-discuss list:
The database RabbitMQ uses is bound to the machine's hostname, so if you copied the database dir to another machine, it won't work. If this is the case, you have to set up a machine with the same hostname as before and transfer any outstanding messages to the new machine. If there's nothing important in rabbit, you could just clear everything by removing the RabbitMQ files in /var/lib/rabbitmq.
I deleted everything in /var/lib/rabbitmq/mnesia/rabbit/ and it started up without trouble. Hooray!
I also got a warning about using outdated software:
1.7.2 is a *very* old version of the broker. You should really upgrade, as a lot of new features and a lot of bugfixes have gone in since then.
Now, as best I can recall, I originally installed rabbitmq with apt-get, running in Ubuntu 10.04. To fix this, I got the newest .deb from the rabbitMQ site and installed it. Here are the steps:
sudo apt-get install erlang-nox
sudo apt-get -f install
sudo apt-get install erlang-nox
sudo dpkg -i rabbitmq-server_2.7.0-1_all.deb
Very happy to be out of the woods on this one. I'm not a sysadmin and I don't aspire to be.
Saturday, December 3, 2011
Nice how-to for webfaction
I had this problem...
Getting started on github
The inner workings of the distributed system are still a bit of a mystery, but I gather I'm not the only one in that boat. For now, it works: I can take changes in development and push/pull them to my production server. Progress!
dev$ git commit -a
dev$ git push origin
prd$ git pull origin master
Friday, December 2, 2011
Scott's class is one of a couple dozen or so follow-up classes to the Stanford AI class that attracted 140,000 students. They're being delivered using a totally new model of education.
- They're free. Yes, free.
- They're scalable -- any number of students can sign up.
- They're taught by rock star scientists and professors -- people at the cutting edge of their fields.
- They're graded. Students submit work (multiple-choice, mostly) and get grades and a certificate of completion at the end of the course*.
This is just the latest in a series of innovations that are going to turn education -- public, private, higher, you name it -- upside down. Why settle for a half-prepped lecture from a busy assistant professor when you can get the same content--better--online for free? If you're the teacher, why bother to prep the lecture when someone else has already given it?
* Yes, yes. The grading is pretty rudimentary, but it can't be that long until smart people figure out how to do better. It's a problem I'd be interested in working on.
Thursday, December 1, 2011
Tuesday, November 29, 2011
#Getting the latest version of R on an older version of ubuntu:
#Using PIP and VirtualEnv with django
sudo apt-get install python-setuptools python-dev build-essential
pip install numpy
sudo pip install numpy
sudo pip install scipy
sudo pip install rpy2
I never quite cracked this one. It seems that by default, pip doesn't work with the bitnami djangostack. I just used easy_install, with good results.
Neither of these worked perfectly for me, but they got me to the point where I could figure it out.
#Serving static files with WSGI
Trust the django docs to have a good built-in explanation.
Monday, November 28, 2011
CSAAW presentation: "How to plan a heist: Challenges, models, and tactics for researching information flow"
By definition, all social systems involve information flow: complex stimuli that connect individuals and coordinate action. However, like executing a heist, measuring and making inferences about these flows is a difficult, thorny problem. This presentation will describe these challenges (hidden networks, subtle signals), introduce mathematical tools for analyzing them (Pearl's graphical causal models, Shannon's mutual information), and discuss tactics for moving forward (experiments, causal and behavioral aggregation, and action space mining). These principles may also be useful for breaking into banks, pilfering paintings, and conning mob bosses.
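Of the mathematical tools mentioned in the abstract, Shannon's mutual information is the easiest to illustrate. Here's a quick sketch (toy data, not from the talk) estimating I(X;Y) in bits from a list of paired observations:

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Estimate I(X;Y) in bits from a list of (x, y) observations."""
    n = float(len(pairs))
    pxy = Counter(pairs)                 # joint distribution
    px = Counter(x for x, _ in pairs)    # marginal of X
    py = Counter(y for _, y in pairs)    # marginal of Y
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)), 2)
               for (x, y), c in pxy.items())

# Perfectly dependent binary signals carry 1 bit; independent ones carry 0.
print(mutual_information([(0, 0), (1, 1)] * 50))                   # -> 1.0
print(mutual_information([(0, 0), (0, 1), (1, 0), (1, 1)] * 25))   # -> 0.0
```

In the heist framing: if the getaway driver's behavior carries bits about the lookout's signals, there's a flow to detect -- even when the channel itself is hidden.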
Friday, November 25, 2011
Perspectives on Psychological Science has a short piece on using Amazon’s Mechanical Turk as a subject pool: “Amazon’s Mechanical Turk: A New Source for Inexpensive, Yet High-Quality, Data?”
As Google Scholar shows, Mechanical Turk is being used in lots of clever ways.
Mechanical Turk has been called a digital sweatshop. Here are two perspectives – an Economic Letters piece: “The condition of the Turking class: Are online employers fair and honest?” And, a piece calling for intervention: “Working the crowd: Employment and labor law in the crowdsourcing industry.”
I've done a lot of work with mturk. It's great for large-scale repetitive tasks, but expect to spend a fair amount of effort setting up and doing quality control. One of the really interesting things about the Google Scholar links is the number of different journals and fields using mturk for research.
Monday, November 21, 2011
About FoldIt, the protein-folding game:
“Although much attention has recently been given to the potential of crowdsourcing and game playing, this is the first instance we are aware of in which online gamers solved a longstanding scientific problem.”
About perverse academic incentives:
The larger challenge is that most scientific data is proprietary. A scientist works long and hard to generate original data. [...] She is not going to want to share this data with others, particularly strangers.
“It is essential that scientists be rewarded when they share.”
About new challenges:
It takes a certain brilliance, and a lot of work, to recognize problems that can be shared with a crowd, and set up the systems needed for strangers to work together productively.
About untapped potential:
“We have used just a tiny fraction of the human attention span that goes into an episode of Jerry Springer.”
Friday, November 11, 2011
See Ruby on rails in EC2 in 20 minutes for the first part of the setup -- I use the same bitnami AMI to get started.
For the most part, I followed the steps here, but I had to do a little extra work to get the pg gem to install properly.
Here's the exact sequence of command line configuration steps:
1 sudo apt-get install postgresql
2 sudo apt-get update
3 rails new icouncil -d postgresql
4 sudo apt-get install libpq-dev
5 bundle install
7 cd icouncil/
9 bundle install
10 sudo su postgres
12 sudo apt-get install postgresql
13 sudo su postgres
14 apt-get install gedit
15 sudo apt-get install gedit
16 sudo vi /etc/postgresql/8.4/main/pg_hba.conf
17 sudo /etc/init.d/postgresql-8.4 restart
18 psql postgres -U icouncil
19 cd config
20 vi database.yml
21 cd ..
22 rake db:create:all
23 which psql
24 sudo gem install pg -- --with-pg-dir=/usr/bin
25 sudo gem install pg -- --with-pg-dir=/usr
26 rake db:create:all
27 rake db:create
28 vi Gemfile
29 bundle install
30 rake db:create
31 rails server
If you want to do it even cleaner than me, I suspect this sequence would work--haven't tried it yet, though.
sudo apt-get update
sudo apt-get install gedit
sudo apt-get install libpq-dev
sudo apt-get install postgresql
sudo su postgres
sudo vi /etc/postgresql/8.4/main/pg_hba.conf
sudo /etc/init.d/postgresql-8.4 restart
psql postgres -U icouncil
rails new icouncil -d postgresql
sudo gem install pg -- --with-pg-dir=/usr
Thursday, November 10, 2011
The most important change is that I've created zipped and tar-gzipped versions of the repository for download. Now you can get snowcrawl even if you don't speak subversion. See the downloads page for one-click downloads.
Wednesday, November 9, 2011
Turns out that understanding complexity classes can add a lot to the way we think about our own minds. Examples include:
- Do I "know" what 1,847,234 times 345 is?
- Is it possible to cheat on the Turing test?
- Can waterfalls think?
- When can you prove you know something without revealing *how* you know it?
- If my grandkids learn to time travel, do I have to worry they might kill me?
(BTW, the paper has been out there for a few months. I'm posting now because it's related--a little--to the review of computational social science I've been working on.)
Monday, November 7, 2011
Also, as far as I can tell, linguists, grammarians, and English majors call sentence diagrams "Reed-Kellogg" diagrams. NLP and computer science types call the diagrams "parse trees" or "concrete syntax trees," and produce them (usually) using probabilistic context free grammars (PCFGs).
If you want code that just works, the Stanford Parser looks like the place to start. It's in java, but I'm sure that won't be too much of a problem, because you can call it from the command line.
Python's NLTK might also work, but there are lots of different parsers, and it's not clear whether and how much training they require.
There are various other software packages out there -- some of them online -- but I doubt they'd support much volume, and batching would be a pain.
Other Software & Code
Funny blog posts about politics and grammar, or the lack thereof
Saturday, November 5, 2011
Now I'm running some text analysis on the tweets. I'll be posting code and writing up results here over the next few days. Questions are welcome!
For starters, here are words people use in support/opposition to the #OWS movement.
Here's the same data, rendered as word clouds, so it looks artsy. This really is the same data: sizes in the wordcloud are determined by the weights of the classifier -- regression betas, for you mathy people out there. Color and coordinates are arbitrary. So these wordclouds are exactly the same info as the tables above, just presented in a more visually appealing format.
As I peer at these tea leaves, I see a solidarity-oriented "stand together against brutal capitalist injustice" theme in the support words, and a libertarian "quit your whining and get to work" theme in the oppose words. What do you make of it?
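For the curious, word weights like the ones behind these tables can be approximated with nothing fancier than counting. Here's an illustrative sketch using smoothed log-odds ratios -- my actual analysis used a regression classifier, and the tweets below are invented, so treat this as a toy:

```python
import math
from collections import Counter

def word_weights(support_tweets, oppose_tweets, alpha=1.0):
    """Smoothed log-odds of each word appearing in support vs. oppose tweets.
    Positive weights lean support; negative weights lean oppose."""
    sup = Counter(w for t in support_tweets for w in t.lower().split())
    opp = Counter(w for t in oppose_tweets for w in t.lower().split())
    vocab = set(sup) | set(opp)
    n_sup, n_opp = sum(sup.values()), sum(opp.values())
    return {w: math.log((sup[w] + alpha) / (n_sup + alpha * len(vocab)))
             - math.log((opp[w] + alpha) / (n_opp + alpha * len(vocab)))
            for w in vocab}

# Invented examples, echoing the themes above
support = ['we stand together against injustice', 'solidarity with the 99ers']
oppose = ['quit whining and get a job', 'get back to work']
weights = word_weights(support, oppose)
ranked = sorted(weights, key=weights.get, reverse=True)
print(ranked[:3])   # words leaning "support"
print(ranked[-3:])  # words leaning "oppose"
```

The regression approach does the same basic thing, but shares strength across correlated words instead of treating each one independently.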
Caveats and details of the method
This analysis is based on 1,000 tweets drawn from Monday, Tuesday and Wednesday of this week, so some of the themes might be specific to the events of those days. Also, there was quite a bit of noise in the sentiment coding. That will probably wash out in a large enough sample, but I don't know if 1,000 is large enough. Finally, support on twitter was running about 85% in favor of the protests, so the assessments of opposing words are probably less robust.
Tuesday, November 1, 2011
Here's a quick 14-line python script to do the job. It takes all the html files in the ./docs directory and writes them out as clean text to the ./output directory.
Note the lxml dependency!
from lxml.html import clean
import glob, re
cleaner = clean.Cleaner( style=True, scripts=True, comments=True, safe_attrs_only=True )
filenames = glob.glob('docs/*')
for f in filenames:
    text = file(f, 'r').read()
    text = cleaner.clean_html(text)   #Remove scripts, styles, etc.
    text = re.sub('<.*?>', '', text)  #Remove html tags
    text = re.sub('\s+', ' ', text)   #Collapse whitespace
    file('output/' + f.split('/')[-1].split('.')[0] + '.txt', 'w').write(text)
Monday, October 31, 2011
Having thought about the question in the time since, I'm ready to give a better definition:
Computational social science is research that answers questions in social science using specialized knowledge from computer science.
This definition -- especially the "specialized knowledge from computer science" bit -- leads in some interesting directions. This post talks through a few of them.
Here's what I mean by "specialized knowledge." Practically speaking, computer science treats memory, bandwidth, storage, and especially computation as limited resources. Researchers in the field attempt to allocate those resources efficiently through algorithm design and efficient system architecture. Computer scientists also think a lot about best practices for deploying hardware and developing software.
As a result, this definition partly ties computational social science to the current state of software. As easy-to-use software becomes available, some areas will cease to require specialized knowledge. As that happens, "regular" social science will encroach on areas that were originally computational.
One scenario for future work in computational social science is that a small community of computer-literate developers will supply the rest of the field with software to perform specialized tasks. A similar process took place a generation ago when software packages like SPSS and STATA became available for statistical work.
Another scenario is that resources and skills for compSocSci will be concentrated in the private sector. Many academic researchers see this as a bad thing. The selfish logic of patents and trade secrets could easily lead to hoarding of proprietary data and code, and hold up the progress of scientific discovery. On the other hand, hoarding happens in the academy too, and some tech firms are big proponents of open source, so I think it's an open question which institutional structures will work best.
There are at least two types of research that are not computational under my definition. This is a good thing. We want compSocSci to be a big tent, but we also want the term to mean something. Meaning something implies that there are things that aren't computational.
First, research is not computational just because it uses computers. Thus, most researchers using Lexis-Nexis, STATA, and even Amazon's Mechanical Turk are not doing computational social science, because those applications can be used without specialized knowledge from computer science. I would classify most work in R as computational, because it requires knowledge of programming.
Second, research on technology and politics is not necessarily computational. For instance, Karpf's ethnographic work on bloggers and other Internet activists is excellent, but not really computational. (His blogosphere authority index is an exception, since it required knowledge of web development.) Other examples include content analysis of websites, using web surveys to collect data, rhetorical analysis of Youtube videos, and so on. In this work, information technology appears in the research question, but not the methodology.
There are at least five computational areas that can make huge contributions to social science:
- Information retrieval: techniques for acquiring and storing Big Data
- Machine learning, NLP, and complex network analysis: statistical techniques for drawing inferences from new types of data
- Simulation: using computers to explore the behavior of systems.
- Web design and human-computer interaction
- Computability and complexity theory: two branches of mathematics investigating the nature of information and computation
I'm planning to write about all five of these over the next month or so.
But first, does this definition work? Are there borderline cases I've missed? There's enough interest in compSocSci that I think it's worth thinking through these issues. I'd love to get feedback on these ideas.
Friday, October 28, 2011
Answer: very not hard.
I haven't done much with rails in the past (One week of code with my brother looking over my shoulder.)
I grabbed the
Once it initialized, I ssh'ed in and ran:
rails new blog
I followed the directions here to edit the gemfile, then ran:
rake db:create
Tada! Instant rails server.
PS - Yes, yes I know about heroku. In this case, building the app as an EC2 AMI is an essential part of the project. I want non-programmers to be able to clone the instance, and sharing a public AMI makes things pretty easy.
Thursday, October 27, 2011
Turns out I'm not the first person to see this need. Here are two nice, in-browser json editors. Nifty!
Wednesday, October 26, 2011
Welcome to RegExr 0.3b, an intuitive tool for learning, writing, and testing Regular Expressions. Key features include:
- real time results: shows results as you type
- code hinting: roll over your expression to see info on specific elements
- detailed results: roll over a match to see details & view group info below
- built in regex guide: double click entries to insert them into your expression
- online & desktop: regexr.com or download the desktop version for Mac, Windows, or Linux
- save your expressions: My Saved expressions are saved locally
- search Community expressions and add your own
- create Share Links to send your expressions to co-workers or link to them on Twitter or your blog [ex. http://RegExr.com?2rjl6]
Built by gskinner.com with Flex 3 [adobe.com/go/flex] and Spelling Plus Library for text highlighting [gskinner.com/products/spl].
Tuesday, October 25, 2011
Monday, October 24, 2011
My goal is to document ideas about computational social science as I encounter them in the course of my work. I'll post links, papers, scripts, software, and other bits and pieces. Over time, I hope this will grow into a useful repository for people interested in using computers to study social dynamics. If the blog gathers enough like-minded readership, it might also turn into a good place for discussion.
I'll be setting up formatting, etc. in my spare time over the next couple weeks. Please let me know if you run into rough edges.
Friday, August 5, 2011
Now the waiting and nail-biting begins.