compSocSci: 2012

Friday, July 6, 2012

Check out (and like, tweet and +1!) the civilometer prototype site

In the spirit of "show > tell", I spent a good chunk of last week building a prototype site for my civilometer proposal for the Knight Foundation's news challenge. Please check it out and support us on twitter, facebook and google! Tweets with the #newschallenge hashtag would be particularly appreciated.

Here's the link: www.civilometer.com

Here's a screenshot:

A screenshot, for your viewing enjoyment.

For those of you joining the story late, the proposed project is a public-facing site for political civility. The site is designed as a data playground to hold politicians and newsmakers accountable for what they say. We would take in real-time media feeds, and apply scientific civility-measuring techniques from my dissertation. A suite of data visualization tools would enable users to ask data-driven questions about civility, and create and share cool graphs of their findings.

There's *a lot* that you could do with all this data. My hope is that by building a public site (rather than hiding our findings in obscure academic journals) we can inject a bit more accountability into public discourse. I'm really excited about the chance to build something genuinely productive with the research I've been doing the last five years of my life.

To make all this happen, I've applied for a grant from the Knight Foundation. One of their judging criteria is public support. Judging is happening right now. (*bites nails in trepidation*). If you like this idea, please head over to the site (www.civilometer.com), and tweet, like, and share the idea with everyone you know.

Thanks!

Warning: the site looks best in recent versions of Firefox and Chrome. I haven't really tested it on IE or Safari. It looks decent on my kindle, though! If we get funded, I'll make sure it looks good for all you poor corporate Microsoft slaves as well.

Friday, June 29, 2012

Word cloud of Knight News Challenge Data proposals

Last night, I scraped the ~800 "data" proposals from the Knight News Challenge and turned them into word soup*. As Mike says: Sorry, science! Still, you get a sense of the themes shared across the proposals.

I'm excited about the contest, and the (realistically, slim) chance that our civil-o-meter proposal will get funded. This is a really nifty time to be working in this area.

If you like the idea of holding politicians and newsmakers to a fair and accurate standard for civility, please like us on tumblr, or tweet about us using the #newschallenge hashtag.

*I used python for the scraping and R for the very lightweight NLP. The layout is by wordle.

Thursday, June 21, 2012

A shameless plug for a worthy cause

Please "like" my proposal for a political civil-o-meter here
If you don't have a tumblr account already,
you'll need to take two minutes and create one.

Details
I've just put in an application for funding through the Knight Foundation's civic media news challenge. They want to "accelerate media innovation by funding breakthrough ideas in news and information." This round in the grant competition focuses on the role of data in civic engagement -- right up my alley.

To meet that challenge, I'm proposing a political civil-o-meter -- a crowdsourced site to generate fair and accurate civility ratings for political speech (think campaign ads, newspaper op-eds, and blog posts). Most of the tools to build such a site will already be developed as part of my dissertation; this grant would help me make them available to the public. This site would provide a really cool way to explore civility in public discourse, and hold public officials and media personalities accountable for the civility (or lack thereof) of what they say.

I'd appreciate it if you'd head on over to the Knight Foundation's tumblr blog and "like" the civil-o-meter proposal. (If you don't have a tumblr account already, you'll need to create one -- a quick, painless, and spam-free process.) Even if you don't like the proposal or just don't get it, you can ask clarifying questions in the comments section, and I'll do my best to explain things better. Awards aren't made strictly on the basis of voting, but I figure a little extra attention in this category can't hurt.

Thanks!

Monday, June 4, 2012

Design patterns for data-centric software

I wrote a few days ago about software design patterns, including the thought that we're going to discover new patterns for data-centric software. Let me unpack that concept.

First, by data-centric software, I don't mean software intended for data analysis (e.g. R, excel, or google charts). I mean any software that collects and/or responds to data in the course of doing whatever else it does.

Web analytics are a great example of this. The primary purpose of a web page is to serve content. But at the same time, it's easy to track pageviews and traffic. Compared to an untracked web site, a site instrumented with google analytics is more data-centric, because it's generating data in the background.

As I read it, the original design patterns are intended mainly to minimize long-term development costs. The key question is "How should code be structured to make it easy to read, debug, maintain, extend, etc?" It's all about saving developers' time in the long run.*

Five years after the original set of design patterns was popularized, another book was published, focusing on design patterns for distributed software. This time, the key questions expanded to include bandwidth and concurrency: "How should we structure code to make the best use of distributed computing resources?"

I think we're due for another expansion, because data-centric code introduces another optimization target: useful information.** Just as the list of patterns expanded to deal with networking and multiprocessing, it will expand again as data processing and analytics become integral to software design.***

Off the top of my head, here's a quick list of data-centric patterns.

A/B testing
Funnel analysis
Recommender systems (very broad category!)
Top hits (most visited, emailed, etc.)
Automatic bug reports
Likes, +1s, Retweets

This list isn't complete, and it's clear that best practice is still evolving. For example, A/B testing has been industry-standard for a long time, but I recently read a good argument that a multi-armed bandit algorithm is better than A/B testing, because it gathers all the same information while also integrating that feedback directly into the site design. It's a very natural extension of, and improvement over, an older data-centric design pattern. I'm sure that many other such improvements are possible.
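To make the bandit idea concrete, here's a minimal sketch of one simple bandit strategy, epsilon-greedy. This is my own illustration, not necessarily the algorithm from the argument I read: the class name, the variant labels, and the click rates in the demo are all made up.

```python
import random


class EpsilonGreedy(object):
    """Choose which page variant to show, learning from clicks as they arrive."""

    def __init__(self, variants, epsilon=0.1, rng=None):
        self.epsilon = epsilon                  # fraction of traffic spent exploring
        self.rng = rng or random.Random()
        self.shows = dict((v, 0) for v in variants)
        self.clicks = dict((v, 0) for v in variants)

    def rate(self, variant):
        """Observed click rate so far (0.0 if the variant was never shown)."""
        shows = self.shows[variant]
        return self.clicks[variant] / float(shows) if shows else 0.0

    def choose(self):
        variants = list(self.shows)
        if self.rng.random() < self.epsilon:
            return self.rng.choice(variants)    # explore: pick a random variant
        return max(variants, key=self.rate)     # exploit: pick the best so far

    def record(self, variant, clicked):
        self.shows[variant] += 1
        if clicked:
            self.clicks[variant] += 1


if __name__ == "__main__":
    true_rates = {"A": 0.05, "B": 0.10}         # hypothetical click rates
    rng = random.Random(42)
    bandit = EpsilonGreedy(true_rates, epsilon=0.1, rng=rng)
    for _ in range(10000):
        variant = bandit.choose()
        bandit.record(variant, rng.random() < true_rates[variant])
    print(bandit.shows)
```

Unlike a fixed A/B split, the traffic allocation shifts as data comes in, and the per-variant counts are still there at the end if you want to run the usual comparison.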

Anyway, I think it's still too early to try to write a comprehensive list. But I'd still like to expand this list to cover as many cases as possible. What else belongs here?

*A few of the patterns address things like limited memory and processing power, but they're the exceptions.

** Defining useful opens up a whole new can of worms, which I won't get into here.

***This relates back to the concept that I've written about before: software design for analytics.

Friday, June 1, 2012

Software design patterns

Following a tip from an experienced software developer, I've been reading up on software design patterns: flyweights, factories, facades, etc. These are general patterns for object-oriented programming that show up again and again. The original canon included 23 patterns; that list has since expanded to include patterns for networking and multiprocessing.

These design patterns remind me of Go proverbs -- high-level heuristics for better strategy, sometimes contradictory. Knowing them can be extremely helpful, but it's no guarantee that you can deploy them correctly. (Here's a good list of common go proverbs.)

Anyway, reading the original Design Patterns book, I've had three main reactions:

1. Data-centric software development is going to discover its own list of software design patterns.

2. There are patterns for research design, just like there are patterns for software design.

3. I already know most of the software patterns -- yay!*

Since I just can't sleep tonight, I figured I'd queue up a few blog posts talking about the first two. Look for those in a couple days.

*Given my very ad hoc background in software design, I've been pleasantly surprised to find that most of the software design patterns are already familiar. For example, python is already very good with iterators and decorators. And working with web frameworks has taught me a lot about factories. And many of the others are much less important in python because objects are dynamically typed. Anyway, it's nice to discover that I've picked a lot of this up by osmosis. (Pat self on the back.)
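For anyone who hasn't seen them, here's a minimal sketch of what I mean about iterators and decorators being built into python. The function names are just examples I made up.

```python
import functools


def countdown(n):
    """Generator: an iterator without a hand-written iterator class."""
    while n > 0:
        yield n
        n -= 1


def logged(func):
    """Decorator: wrap a function so each call's arguments are recorded."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls.append(args)
        return func(*args, **kwargs)
    wrapper.calls = []
    return wrapper


@logged
def square(x):
    return x * x


if __name__ == "__main__":
    print(list(countdown(3)))
    print(square(4))
    print(square.calls)
```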

Thursday, May 31, 2012

Bay Area data science people, events

A quick favor: I'm headed out to Palo Alto for a family event in a couple weeks. While I'm there, I'd love to meet people and find out more about the Bay Area data science scene.

Where should I go? Who should I meet?

I'm free mainly on Monday the 11th through Wednesday the 13th, with some time on Tuesday evening here.

This picture is the first result for gImages: "going to the big city." I like it.

Tuesday, May 29, 2012

Live streaming of Northeastern/Harvard/MIT workshop on computational social science @ IQSS, May 30-June 1

Tomorrow, IQSS is running a conference on computational social science. I can't attend this year, but the conference organizers have kindly offered to livestream the sessions. Here's the email from David Lazer.

Hi all,

Please note that we will be live streaming the workshop on computational social science (program below). The url:

http://video.isites.harvard.edu/liveVideo/liveView.do?name=Comp_Soc_Science

The Twitter hashtag is: #compsocsci12. We will monitor this hashtag during the workshops to enable remote Q&A.

If you would like to embed the stream in your website, use this code:

<iframe src="http://video.isites.harvard.edu/liveVideo/liveEmbed.do?name=Comp_Soc_Science&width=auto&height=auto" width="640" height="360" style='border: 0px;'></iframe>

Please feel free to forward this e-mail on to interested parties, and if this has been forwarded to you, and you would like to be added to the list, please contact m.lee@neu.edu.

best,

David

Friday, May 18, 2012

Will crunch numbers for food

I don't like self-promotion. Makes me feel greasy, if you know what I mean. But graduation is looming, it's a boom year for big data, and there's no hiring pipeline from political science to fun jobs in tech. So I figure it's time to hang out my shingle as a data scientist.

Earlier this week, I bought the domain name abegong.com and worked up a digital resume. Like I said, I'm not a big self-promotion guru, so I'd be grateful for feedback (or job leads).

Thursday, May 17, 2012

Nifty tools for playing with words

Here are a bunch of sites I use to play with words -- whether brainstorming or trying to accomplish something specific with text analysis.

A rhyming dictionary. Helpfully splits up the word list by syllables, so you can finish that sonnet you've been working on.

Here's a nifty little site for generating portmanteaus (word splices): http://www.werdmerge.com/

http://www.leandomainsearch.com: generates themed domain names, and checks to make sure they're unclaimed by URL squatters.

Online lorem generator. Here's the same thing in python.

Markov text generation: http://www.beetleinabox.com/markov.html.

Permute words and letters. This seems less useful to me... It gives all the combinations, not just the ones that make some kind of sense.

Lavarand used to do random haikus and corporate memos, but it looks like they've broken down.

Google ngrams on AWS public data sets. These are combinations of words that commonly co-occur in English.

Yes, yes. And then there's wordle. Too pretty for the rest of us.

What else belongs on this list?

Tuesday, May 15, 2012

Python mapreduce on EC2

Last week, I wrote about getting AWS public datasets onto an EC2 cluster, and then into HDFS for MapReduce. Now let's get to hello world (or rather, countWords) with python scripts.

#!/usr/bin/env python
# mapper2.py: emit one "word<TAB>1" pair for every word read from stdin

import sys

for line in sys.stdin:
    # lowercase so that "The" and "the" are counted together
    words = line.lower().split()

    # --- output tuples [word, 1] in tab-delimited format ---
    for word in words:
        print('%s\t%s' % (word, 1))

Here's the reducer script....

#!/usr/bin/env python
# reducer.py: sum the counts emitted by mapper2.py

import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper2.py;
    # skip lines that have no tab or a non-integer count
    try:
        word, count = line.split('\t', 1)
        # convert count (currently a string) to int
        count = int(count)
    except ValueError:
        continue

    word2count[word] = word2count.get(word, 0) + count

# write the tuples to stdout
# Note: they are unsorted
for word in word2count:
    print('%s\t%s' % (word, word2count[word]))
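Before sending a job to the cluster, it can save time to check the map, sort, and reduce logic locally. Here's a minimal sketch that mimics the two scripts in memory. The function names are mine, and sorted() is only a stand-in for the sorting hadoop does between the map and reduce steps.

```python
from collections import defaultdict


def map_words(lines):
    """Mimic mapper2.py: yield a (word, 1) pair for every word."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reduce_counts(pairs):
    """Mimic reducer.py: sum the counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines):
    # Hadoop sorts mapper output by key before the reduce step;
    # sorted() plays that role here.
    return reduce_counts(sorted(map_words(lines)))


if __name__ == "__main__":
    sample = ["Hello world", "hello Hadoop"]
    for word, count in sorted(word_count(sample).items()):
        print('%s\t%s' % (word, count))
```

The same check works on the real scripts from a shell, assuming both are executable and sample.txt is any small text file: cat sample.txt | ./mapper2.py | sort | ./reducer.py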

The command to execute all this in hadoop is a bit of a monster, mainly because of all the filepaths. Note the usage of the -file parameter, which tells hadoop to load files for use in the -mapper and -reducer arguments. Also, I used -jobconf to set mapred.output.compress to false, because I didn't have a handy LZO decompressor installed.

bin/hadoop jar contrib/streaming/hadoop-0.19.0-streaming.jar \
    -input wex-data \
    -output output/run9 \
    -file /usr/local/hadoop-0.19.0/my_scripts/mapper2.py \
    -file /usr/local/hadoop-0.19.0/my_scripts/reducer.py \
    -mapper mapper2.py \
    -reducer reducer.py \
    -jobconf mapred.output.compress=false

NB: As I dug into this task, I discovered several pretty good python/hadoop-streaming tutorials online. The scripts here were modified from: http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_3.2_--_Using_Your_Own_Streaming_WordCount_program

Other sources:

http://www.protocolostomy.com/2008/03/20/hadoop-ec2-s3-and-me/
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
http://www.larsgeorge.com/2010/10/hadoop-on-ec2-primer.html
http://www.princesspolymath.com/princess_polymath/?p=137
http://arunxjacob.blogspot.com/2009/04/configuring-hadoop-cluster-on-ec2.html

http://wiki.apache.org/hadoop/AmazonS3

http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html

http://www.cloudera.com/blog/2009/07/advice-on-qa-testing-your-mapreduce-jobs/
http://hadoop.apache.org/common/docs/r0.20.2/streaming.html

Thursday, May 3, 2012

Running mapreduce on Amazon's publicly available datasets with python

On Monday, I had a preliminary interview at a really interesting tech startup. In the course of the conversation, the interviewer mentioned that he'd used some of the technical notes from compSocSci in his own work. And I thought nobody was reading!

Anyway, I've been sitting on some old EC2/hadoop/python notes for a while. The talk gave me the motivation to clean up and post them, just in case they can help somebody else. The goal here is threefold:

Fire up a hadoop cluster on EC2
Import data from an EBS volume with one of AWS' public data sets
Use hadoop streaming and python for quick scripting

In other words, we want to set up a tidy, scalable data pipeline as fast as possible. My target project is to do word counts on wikipedia pages -- the classic "hello world" of mapReduce. This isn't super-hard, but I haven't seen a good soup-to-nuts guide that brings all of these things together.

Phase 1:
Follow the notes below to get to the digits-of-pi test. Except for a little trouble with AWS keys, this all went swimmingly, so I see no need to duplicate. If you run into trouble with this part, we can troubleshoot in the comments.

http://wiki.apache.org/hadoop/AmazonEC2#Running_a_job_on_a_cluster
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/

Phase 2:
Now let's attach an external dataset. Here's the dataset we'll use: Wikipedia Extraction (WEX). It's a processed dump of the English language Wikipedia, hosted publicly on Amazon Web Services under snapshot ID snap-1781757e.

This dataset contains a dump of 1,000 popular English wikipedia articles. It's about 70GB. At Amazon's $.12/GB rate, maintaining this volume costs about $8 for a whole month -- cheap! If you want to scale up to full-size wikipedia (~500GB), you can do that too. After all, we're in big data land.
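Here's the storage arithmetic as a quick sketch. The $.12 per GB-month rate is the one quoted above. The hourly proration and the 730-hour month are my assumptions, so check Amazon's current pricing before trusting the numbers.

```python
def ebs_storage_cost(size_gb, rate_per_gb_month=0.12, hours=None,
                     hours_per_month=730.0):
    """Estimated storage cost in dollars for an EBS volume.

    With hours=None, returns the cost of keeping the volume a full month.
    Otherwise prorates by the hour (an assumption about how billing works).
    """
    monthly = size_gb * rate_per_gb_month
    if hours is None:
        return monthly
    return monthly * hours / hours_per_month


if __name__ == "__main__":
    print('full month: $%.2f' % ebs_storage_cost(70))
    print('one hour:   $%.4f' % ebs_storage_cost(70, hours=1))
```

Under those assumptions, keeping the 70GB volume around for only an hour or two costs pennies.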

Here's the command sequence to create an EBS volume for this snapshot and attach it to an instance. You can look up the ids using ec2-describe-volumes and ec2-describe-instances, or get them from the AWS console at https://console.aws.amazon.com. (Hint: they're not vol-aaaaaaaa and i-bbbbbbbb.)

ec2-create-volume --snapshot snap-1781757e -z us-east-1a
ec2-attach-volume vol-aaaaaaaa -i i-bbbbbbbb -d /dev/sdf

It took a while for these commands to execute. Attaching the volume got stuck in "attaching" status for several minutes. I finally got tired of waiting and mounted the volume, and then the status switched right away. Can't say whether that was cause-and-effect or coincidence, but it worked.

Once you've attached the EBS volume, log in to the instance (instructions here) and mount the volume as follows. This should be pretty much instantaneous.

mkdir /mnt/wex_data
mount /dev/sdf /mnt/wex_data

Now import the data into the Hadoop file system:

cd /usr/local/hadoop/
hadoop fs -copyFromLocal /mnt/wex_data/rawd/freebase-wex-2009-01-12-articles.tsv wex-data

If you want, you can now detach and delete the EBS volume. The articles file is stored in the distributed filesystem across the EC2 instances in your hadoop cluster. The nice thing is that you can get to this point in less than an hour, meaning that you only have to pay a tiny fraction of the monthly storage cost.

ec2-detach-volume vol-aaaaaaaa -i i-bbbbbbbb -d /dev/sdf
ec2-delete-volume vol-aaaaaaaa

I had some trouble detaching volumes until I used the force flag: -f. Maybe I was just being impatient again.

That's enough for the moment. I'll tackle python in my next post.

Wednesday, April 4, 2012

Summer Course in Computational Models of Political and Social Events

From the Polmeth listserv:

Summer Course in Computational Models of Political and Social Events

Claudio Cioffi-Revilla <ccioffi@GMU.EDU>
April 03, 2009 @ 02:24:20

Message

Kindly circulate among colleagues and students interested in computational
social science and/or social simulation:

http://lipari.cs.unict.it/LipariSchool/SocialScience/

The Lipari International School on Computational Social Sciences will take
place this summer, July 18 - 25, 2009, on the Mediterranean island of Lipari
just north of the coast of Sicily, Italy. Researchers interested in the
emerging field of computational social science, especially but not
exclusively computational political science theory and simulation models,
are encouraged to consider this as an opportunity to learn more and become
acquainted with a variety of research frontiers.


--
Claudio Cioffi-Revilla, Ph.D.

Professor of Computational Social Science
NAS Jefferson Science Fellow
Director, Center for Social Complexity
Krasnow Institute for Advanced Study, George Mason University
Research-1 Bldg MS 6B2, 4400 University Drive, Fairfax, VA 22030 U.S.A.
tel (703) 993-1402, fax (703) 993-1399, ccioffi@gmu.edu
Research & Teaching http://socialcomplexity.gmu.edu
http://gazette.gmu.edu/articles/11458/
 MASON Project http://cs.gmu.edu/~eclab/projects/mason/

"All truths are easy to understand once they are discovered; the difficulty
is to discover them."--Galileo Galilei

**********************************************************
             Political Methodology E-Mail List
   Editors: Melanie Goodrich, <melaniegoodrich@nyu.edu>
                    Xun Pang, <xpang@wustl.edu>
**********************************************************
        Send messages to polmeth@artsci.wustl.edu
  To join the list, cancel your subscription, or modify
           your subscription settings visit:

          http://polmeth.wustl.edu/polmeth.php

**********************************************************

Monday, April 2, 2012

Help wanted: What, exactly, is "entertainment"??

What does it mean to be entertaining? I realized last weekend that I don't really know. (Insert joke about clueless quants here.) I mean, I know "entertaining" when I see it, but defining it in a way that's measurable is tough.

For example, I imagine that most people would agree that the Onion is usually more entertaining than C-SPAN. But why? What is it about the content that's different?

Here are a few more examples. I imagine most people will agree with me on the Daily Show, Onion, and all the things in the less entertaining column. To my mind, on a scale from "not entertaining at all" to "very entertaining," Gawker, Fox News and MSNBC seem less entertaining than Jon Stewart, but a lot more entertaining than the New York Times. That is, they're trying harder to grab and hold onto viewers' attention. They use a lot of the same gimmicks.

More entertaining	Less entertaining
The Daily Show	NYT
Gawker	WSJ
The Onion	WaPo
MSNBC (?)	CSPAN
Fox News (?)	Foreign policy

All this matters because I'm trying to measure the difference between entertaining and non-entertaining content in political blogs. For this kind of research, it's not enough to say "Blog X seems more entertaining to me than blog Y." I need to measure entertainment, and show that other people can replicate my measurement. (After all, repeated measurement is the starting point for all science.)

To get there, I've been trying to write a "codebook" (like a survey, but about text instead of opinions) to measure entertainment in blog posts and news coverage. Here's what I've got so far.

Can you think of things to add? I'd really appreciate your ideas and suggestions...

How well do these statements describe this article: very well, somewhat well, a little well, or not at all well?

	Not at all	A little	Somewhat	Very
This article is written to be entertaining.
This article is written in a serious tone.
This article includes jokes and/or other humor.
The tone of this article is sarcastic and/or ironic.
This article includes sexual references, imagery, or innuendo.
The writing in this article is engaging---it gets and holds the reader's attention.
The writing in this article is flat---it doesn't do much to hold the reader's attention.
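One way to check whether other people can replicate the measurement is to have two coders rate the same articles with the codebook and compare their answers. Here's a minimal sketch using percent agreement and Cohen's kappa. Kappa is my choice for illustration, not a settled part of the design, and the sample ratings are made up.

```python
from collections import Counter


def percent_agreement(coder_a, coder_b):
    """Share of items on which the two coders gave the same answer."""
    matches = sum(1 for a, b in zip(coder_a, coder_b) if a == b)
    return matches / float(len(coder_a))


def cohens_kappa(coder_a, coder_b):
    """Agreement between two coders, corrected for chance agreement."""
    n = float(len(coder_a))
    observed = percent_agreement(coder_a, coder_b)
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    # chance agreement: probability both coders pick the same category
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    if expected == 1.0:
        return 1.0  # both coders used a single, identical category
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # hypothetical answers to "This article is written to be entertaining."
    coder_a = ["very", "somewhat", "not at all", "very"]
    coder_b = ["very", "a little", "not at all", "very"]
    print(percent_agreement(coder_a, coder_b))
    print(cohens_kappa(coder_a, coder_b))
```

This treats the four answers as unordered categories, which is a simplification for an ordered scale like this one.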

Friday, March 30, 2012

A flowchart about disagreement. With haikus!

Here's a flowchart for resolving disagreements. I made this up after a disagreement that worked out badly -- I wanted to think through how we could have avoided the problem. Also, it was a chance to write haikus.

What do you think? How would you extend or revise this picture?

Monday, March 26, 2012

Why mturkers should have faces

Over the last week or so, I've written about several pain points for requesters using Amazon's mechanical turk. Most of them come down to issues of trust, learning, and communication -- very human problems.

I speculate that the problem is one of philosophy. The design and advertising of the system both suggest that the designers think of turkers as interchangeable computation nodes -- they certainly treat them that way. The market was not designed to take into account differences in skills among workers, strategizing by workers, or relationships between workers and requesters. It was built --- intentionally --- to be faceless*.

Essentially, Amazon intended mturk to be like the EC2 spot instance market for human piece work: cheap computation on demand. An API to human intelligence. It's a clever, groundbreaking idea, but the analogy only holds to a point. Unlike instances of cloud machines, turkers are human beings. They have different skills and interests. They learn as they go along. They sometimes cheat if you let them. Treating them as faceless automata only works for the simplest of tasks.

A better, more human approach to crowdsourcing would acknowledge that people are different. It would take seriously the problems of motivation, ability, and trust that arise in work relationships. Providing tools for training, evaluation, and communication---plus fixing a few spots where the market is broken---would be a good start.

Let me finish by waving the entrepreneurial banner. I'm convinced there's a huge opportunity here, and I'm not alone. Mturk is version 1.0 of online crowdsourcing, but it would be silly to argue that crowdsourcing will never get better**. What's next? Do you want to work on it together?

* There's a whole ecosystem of companies and tools around mturk (e.g. Crowdflower, Houdini). I haven't explored this space thoroughly, but my read is that they're pretty happy with the way mturk is run. They like the facelessness of it. Even Panos Ipeirotis, whose work I like, seems to have missed a lot of these points -- he focuses on things like scale, API simplicity, and accuracy. Maybe I'm missing out on the bright new stars, though. Do you know of teams that have done more of the humanizing that I'm arguing for?

** Circa 530,000,000 BC: "Behold, the lungfish! The last word in land-going animal life!"

Thursday, March 22, 2012

Pain points in mturk

I posted a couple days ago on skimming and cherry-picking on mturk. Today I want to add to my list of pain points. These are things that I've consistently faced as I've looked at ways to integrate mturk into my workflow. Amazon, if you want even more of my research budget, please do the following. Competitors, if you can do these things, you can probably give Amazon a run for its money.

Here's my list:

1. Provide tools for training turkers.
Right now, HITs can only cover very simplistic tasks, because there's no good way to train turkers to do anything more complicated. There should be a way for requesters to train (e.g. with a website or video) and evaluate turkers before presenting them with tasks. It's not really fair to impose the up-front cost of training on either the turkers or requester alone, so maybe Amazon could allow requesters to pay turkers for training time, but hold the money in escrow until turkers successfully complete X number of HITs.

2. Make it easy to communicate with turkers.
This suggestion goes hand-in-hand with the previous one. Right now it's very difficult to communicate with turkers. I understand that one of the attractions of the site is the low-maintenance relationship between requesters and turkers. But sometimes it would be nice to clear that barrier, in order to clarify a task, give constructive feedback, or maybe even -- call me crazy -- say "thank you" to the people who help you get your work done. It's possible now, but difficult. (Turkers consistently complain about this lack as well.)

3. Make it easy to accept results based on comparisons.
Monitoring HIT quality is a pain, but it's absolutely necessary, because a handful of turkers do cheat consistently. Some of them even have good acceptance ratings. I often get one or two HITs with very bad responses at the beginning of a batch. I suspect that these are cheaters testing to see if I'm going to accept their HITs without looking. In that case, they'd have a green light to pour lots and lots of junk responses into the task with little risk to their ratings.

As long as it's easy to get away with this approach, cheaters can continue to thrive. "Percentage of accepted tasks" is a useless metric when requesters don't actually screen tasks before accepting them. What you want is the percentage of tasks that were screened AND accepted. Some basic, built-in tools for assessing accuracy and reliability would make that possible, effectively purging the market of cheaters.
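Here's a minimal sketch of the difference between the two metrics. I'm reading "screened AND accepted" as the acceptance rate among only those HITs a requester actually checked. The record format is hypothetical, not something mturk provides.

```python
def acceptance_rates(hits):
    """Compare the raw acceptance rate with the rate among screened HITs.

    hits: list of dicts with boolean 'screened' and 'accepted' keys
    (a made-up record format for illustration).
    """
    accepted = sum(1 for h in hits if h["accepted"])
    screened = [h for h in hits if h["screened"]]
    screened_ok = sum(1 for h in screened if h["accepted"])
    return {
        "raw": accepted / float(len(hits)) if hits else None,
        "screened": screened_ok / float(len(screened)) if screened else None,
    }


if __name__ == "__main__":
    # 8 HITs rubber-stamped without review, 2 actually checked
    history = [{"screened": False, "accepted": True}] * 8
    history += [{"screened": True, "accepted": True},
                {"screened": True, "accepted": False}]
    print(acceptance_rates(history))
```

In this made-up history, the worker looks great on the raw rating but only passes half the time when someone actually looks.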

4. Provide a way for small batches to get more visibility.
One of my main reasons for going to mturk is quick turnaround. In my experience, getting results quickly depends on two things: price and visibility. Price is easy to control. I have no complaints there. But visibility depends largely on getting to the top of one of mturk's pages, especially "most HITs" or "most recent." If you have 5,000 HITs, your task ends up on the front page and it will attract a lot of workers. But attracting attention to smaller tasks is harder*. Mturk should make a way to queue small batches and ensure that they get their fair share of views**.

5. Prevent skimming and cherry picking.
I wrote about this in my last post. Suffice it to say that mturk's system currently rewards turkers for skimming through batches of HITs to cherry pick the easy ones. This is not fair to other workers, wastes time overall, wreaks havoc on most approaches for determining accuracy, and ruins the validity of some kinds of data. I can't blame turkers for being smart and strategic about the way they approach the site, but I can blame Amazon for making counterproductive behavior so easy. Add a "Turkers can only accept HITs in the order they're presented" flag to each batch and the problem would be solved!

Looking back over this list, I realize that it's become a kind of freakonomics*** for crowdsourcing. There are a lot of subtle ways that a crowdsourcing market can fail, and devious people have discovered many of them. In the case of mturk, it's a market in a bottle, so you'd think we could do some smart market design and make the whole system more useful and fair for everyone.

* Right now, one strategy is to dole out the HITs one at a time, so that each one will constantly be at the top of the "most recent" page. But this makes it hard for turkers to get in a groove. It also requires infrastructure -- a server programmed to submit HITs one by one. Most importantly, it essentially amounts to a spam strategy, with all requesters trying to attract attention by being loud and obnoxious. You can't build an effective market around that approach.

** Sites like CrowdFlower are trying to address this need. I haven't used them much -- relying more on homegrown solutions -- so maybe this is a concern that's already been addressed.

*** The original freakonomics, about evidence of cheating in various markets, before the authors turned it into a franchise and let popularity run ahead of their evidence.

Monday, March 19, 2012

Market failure in mechanical turk: Skimming and cherry-picking

This is the first in a series of three posts -- a trilogy! -- about pain points on Amazon's mechanical turk, from a requester's perspective.

I'm a frequent user of mturk. I like the service, and spend a large fraction of my research budget there. That means I also feel its limitations pretty acutely. Today I want to write about a problem that I've noticed on mturk: skimming and cherry picking. (A few weeks ago, I complained about ubuntu. Why is it that we only hurt the computing systems we love?)

Here's the problem: even within a batch, not all HITs are equally difficult. I've discovered that some workers (smart ones) will skim quickly through a batch and cherrypick the easy HITs. For instance, given a list of blog posts to read and evaluate, some turkers will skip the long ones and only code the short ones.

Individually, skimming makes perfect sense. If you skim, you can certainly make more dollars per hour. As a bonus, you might even get a higher acceptance rate on your HITs, because short HITs lend themselves to unambiguous evaluation. The system rewards strategic skimming.

But from a social perspective, skimming is counterproductive. It wastes time overall, because time spent skimming is time not spent completing tasks*. It's not really fair to other workers. It wreaks havoc on many approaches for determining accuracy. (As a requester, I've experienced this personally.) From a scientific standpoint, it can also ruin the validity of some kinds of data collection.

I first ran into clear evidence of skimming over a year ago. At first, I didn't want to say anything about it, because I didn't want to give anyone ideas. At this point, I see it all the time. One easy-to-observe bit of evidence: the hourly rate on most HITs will start high, and fall over time**. This is because skimmers grab the quick, easy tasks first, leaving slower tasks for later workers.

I can't really blame turkers for approaching their work in a clever way. Instead, I lay the blame on Amazon, for making counterproductive behavior so easy.

It's especially galling because it would be very easy to fix the problem. On the HIT design page, they should add a "Turkers can only accept HITs in the order they're presented" flag to each batch. For tasks with this flag checked, turkers would be shown one HIT at a time. They'd be unable to view or accept others in the batch until they'd completed the HIT in front of them***. This would effectively deny turkers control over which HITs they choose to do within a batch****. It would end the party for skimmers, but make the market more efficient overall. A simple tweak to the market -- problem solved.

How about it Amazon?

* You can think about the social deadweight loss from skimming like this:
Let T be the total time all workers spend completing HITs. Skimming doesn't change T -- the total amount of task work is constant. But skimming itself is time consuming. Let S be the deadweight loss due to skimming on a given batch. Like T, the total wage for a given batch is also constant. Call it W.

In aggregate, the effective hourly wage for the whole batch without skimming is W/T. With any amount of skimming it is always less: W/(T+S). So although skimming may improve the hourly wage of the most aggressive cherry pickers, it always lowers the average hourly wage across the mturk market as a whole.
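To put numbers on this, here is a tiny worked example. The figures are made up purely for illustration; they are not measurements from any real batch.

```python
# Hypothetical batch: every number here is an illustrative assumption.
W = 60.0   # total wages paid out for the batch, in dollars
T = 10.0   # total hours of actual task work
S = 2.0    # total hours workers spend skimming for easy HITs

wage_without_skimming = W / T        # aggregate dollars per hour, no skimming
wage_with_skimming = W / (T + S)     # aggregate dollars per hour, with skimming

print(wage_without_skimming)  # 6.0
print(wage_with_skimming)     # 5.0
```

Any S greater than zero pulls the aggregate wage below W/T, no matter how the gains are split among individual workers.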

** Yes, yes -- I know that this is not an acid test: there are other explanations for hourly rates that decline over the life of a task. Still, it's good corroborating evidence for an explanation that makes a lot of sense to begin with.

*** Only viewing one HIT at a time might make it harder for turkers to get a sense of what a given batch is like. There's a simple fix for this as well: allow turkers to see the next k tasks, where k is a small number chosen by the requester. This might make it harder to build a RESTful interface for turkers, though. I haven't thought it through in detail.

**** It's possible that requesters would abuse this power by doing a bait-and-switch: showing easy HITs first and then making them more difficult once workers have invested in learning the task. This seems like a minor concern---if the tasks get tough or boring, turkers can always vote with their feet. But if we're worried about it, there's an easy fix here as well: take control of the HIT sequence away from requesters, just like we took it away from workers. It would be very easy to randomize the order of tasks when the "no skimming" box is checked. Or allow requesters to click a separate "randomize tasks" box, with Amazon acting as credible intermediary for the transaction.
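Putting the footnotes together, here is a minimal sketch of what a "no skimming" batch might look like from a single worker's point of view. Everything in it is hypothetical: SequentialBatch, lookahead, and the method names are my own inventions for illustration, not anything in Amazon's API, and a real implementation would also have to deal with many workers pulling from the same batch.

```python
import random

class SequentialBatch:
    """Serve HITs one at a time, in a shuffled order (hypothetical sketch)."""

    def __init__(self, hit_ids, lookahead=0, seed=None):
        self._queue = list(hit_ids)
        # Shuffle once on the server side, so that neither the requester
        # nor the worker controls the order of tasks.
        random.Random(seed).shuffle(self._queue)
        self._lookahead = lookahead   # the "next k tasks" a worker may view
        self._position = 0

    def current(self):
        """Return the HIT the worker must complete next, or None when done."""
        if self._position >= len(self._queue):
            return None
        return self._queue[self._position]

    def preview(self):
        """Return the next k HITs after the current one, for viewing only."""
        start = self._position + 1
        return self._queue[start:start + self._lookahead]

    def complete(self, hit_id):
        """Mark the current HIT as done; refuse out-of-order work."""
        expected = self.current()
        if expected is None or hit_id != expected:
            raise ValueError("HITs must be completed in the order presented")
        self._position += 1


batch = SequentialBatch(["hit-%d" % i for i in range(5)], lookahead=2, seed=42)
print(batch.current())   # the only HIT this worker can accept right now
print(batch.preview())   # the next two, visible but not yet acceptable
```

The worker can look ahead a little to get a feel for the batch, but the only way to reach a later HIT is to finish the one in front of them.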

Friday, February 24, 2012

Pain points on Pulse newsreader

Pulse newsreader came preinstalled on the kindle fire I got for Christmas (thanks, Sam!). On the whole I like it, but a few glaring bugs and omissions really hold it back.

Bug: after pressing the star button on one post, it gets stuck for all the other posts on the blog. AFAICT, you can only star one post per blog per session. Clunky.

Bug: when I'm offline---I do a lot of my blog reading on the bus---Pulse doesn't seem to remember which blog posts I've read. I read it, swipe it, and then next time I'm on the grid it seems to pop right back up. I can't tell if it's doing this all the time or just a lot of the time, but it's a pain to hit the same articles two, three times, and up.

Feature request: I love being able to post to twitter with two clicks. (This is nice for Pulse too, because they get their name in the link.) Can you let me queue tweets in offline mode, then sync with twitter once I get back on the grid?

Feature request: alternatively, can you provide an API to starred items? If I could get at those (say as RSS), I could automate the posting to twitter myself.

Pulse, are you listening? Fix these and I will be your friend forever. Until then, I'm looking for suggestions on blog readers...

Wednesday, February 22, 2012

Getting started with django, heroku, and mongolab

Long post here -- I just spent a couple hours going from zero to mongo with django and heroku. Here are my working notes.

For this test, I started from a pure vanilla django project:

django-admin startproject testproj

I made a single edit to settings.py, adding "gunicorn" to the list of INSTALLED_APPS.

For setting up django on heroku, I basically followed the description given in these very good notes by Ken Cochrane. I've done this before and there were no surprises, so I'll skip the stuff about setting up heroku and git. I did nothing special except install gunicorn.

Here's my Procfile:

web: python testproj/manage.py collectstatic --noinput; python testproj/manage.py run_gunicorn -b "0.0.0.0:$PORT"

This is overkill at the outset--we don't have any static files, so collecting them doesn't do anything.

And my requirements.txt:

Django==1.3.1
gunicorn==0.13.4
psycopg2==2.4.4
simplejson==2.3.2

That got me to the famous django opening screen, live on heroku:

It worked!
Congratulations on your first Django-powered page.

Just to emphasize that I've done nothing special so far, here's the tree for the git repo I'm using. (I've suppressed .pyc and *~ files.)

    .
    ├── Procfile
    ├── requirements.txt
    └── testproj
        ├── __init__.py
        ├── manage.py
        ├── settings.py
        └── urls.py

(Come to think of it, there should probably be a .gitignore file in there as well.)

Let's pick up from that point.

I added the mongolab starter tier (free up to 240 MB) to the heroku app. I did this from the heroku console, because I wanted to see what options were there. (Oddly, none of the dedicated tiers show up there.) In the future, I'll probably just use the command line:

heroku addons:add mongolab:starter

Next, I followed the instructions on heroku's documentation, and grabbed the URI:

heroku config | grep MONGOLAB_URI

The MONGOLAB_URI is in the following format:

mongodb://username:password@host:port/database

At this point, heroku's documents stopped being much help, because they don't cover python. So I switched over here instead. I wanted to understand all the steps, so I refrained from copy-pasting.

I installed a few supporting libraries

pip install pymongo django-mongodb-engine djangotoolbox

And added the appropriate lines to requirements.txt

pymongo==2.1.1
django-mongodb-engine==0.4.0
djangotoolbox==0.9.2

I then configured the database in settings.py.

DATABASES = {
    'default': {
        'ENGINE': 'django_mongodb_engine',
        'NAME': 'heroku_app1234567',
        'USER': 'heroku_app1234567',
        'PASSWORD': 'abcdefghijklmnopqrstuvwxyz',
        'HOST': 'ds031117.mongolab.com',
        'PORT': '31117',
    }
}
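Hardcoding credentials like this is fine for a test, but since heroku exposes config vars as environment variables, the same dictionary could be built from MONGOLAB_URI at startup. Here's a minimal sketch, assuming the URI always follows the mongodb://username:password@host:port/database format shown earlier. The helper name mongo_settings_from_uri is my own, and the fallback URI just reuses the placeholder values from the example.

```python
import os

try:
    from urllib.parse import urlsplit   # Python 3
except ImportError:
    from urlparse import urlsplit       # Python 2

def mongo_settings_from_uri(uri):
    """Split mongodb://username:password@host:port/database into Django settings."""
    parts = urlsplit(uri)
    return {
        'ENGINE': 'django_mongodb_engine',
        'NAME': parts.path.lstrip('/'),
        'USER': parts.username,
        'PASSWORD': parts.password,
        'HOST': parts.hostname,
        'PORT': str(parts.port) if parts.port else '',
    }

# Placeholder fallback for local testing; on heroku the real value is in the environment.
MONGOLAB_URI = os.environ.get(
    'MONGOLAB_URI',
    'mongodb://heroku_app1234567:abcdefghijklmnopqrstuvwxyz'
    '@ds031117.mongolab.com:31117/heroku_app1234567',
)

DATABASES = {'default': mongo_settings_from_uri(MONGOLAB_URI)}
```

This keeps the password out of the git repo, which matters more once the project is something other than a throwaway test.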

Note: With versions of django-mongodb-engine over 0.2, ENGINE should be 'django_mongodb_engine', not 'django_mongodb_engine.mongodb'. Get this wrong and you'll see something like:

django.core.exceptions.ImproperlyConfigured:
'django_mongodb_engine.mongodb' isn't an available database backend.
Try using django.db.backends.XXX, where XXX is one of:
'dummy', 'mysql', 'oracle', 'postgresql', 'postgresql_psycopg2', 'sqlite3'
Error was: No module named mongodb.base

I never did anything about the "current bug" that Dennis mentions. Apparently it's been patched, whatever it was.

Quick check: so far, running locally (./manage.py runserver) and pushing to heroku both work, but the project doesn't have any models yet. Time to fix that:

django-admin startapp onebutton

Here's models.py:

from django.db import models

class ButtonClick( models.Model ):
    click_time = models.DateTimeField( auto_now=True )
    animal = models.CharField( max_length=200 )

I was going to build a stupid-simple app with a single button to "catch" animals from a random list and store them in the DB, but it's getting late, so let's just jump to the proof of concept.

Going straight to the django shell on my computer...

$ python manage.py shell
>>> from testproj.onebutton.models import ButtonClick as BC
>>> c1 = BC( animal="Leopard" )
>>> c1.save()
>>>

It works! Open up the mongolab console (through the add-ons tab in heroku) and the database shows a single record in the onebutton_buttonclick collection.

We still haven't written views and templates, and (more importantly) validated that the heroku app can talk to the DB over at mongolab, but I'm going to call this good enough for now. Mission accomplished.

Saturday, February 18, 2012

Q&A about web scraping with python

Here are highlights from a recent email exchange with a tech-minded friend in the polisci department. His questions were similar to others I've been fielding recently, and very to-the-point, so I thought the conversation might be useful for others looking at getting into web scraping for research (and fun and profit).

> ... I have an unrelated question: What python libraries do you use to scrape websites?

urllib2, lxml, and re do almost everything I need. Sometimes I use wget to mirror a site, then use glob and lxml to pull out the salient pieces. For the special case of directed crawls, I built snowcrawl.

> I need to click on some buttons, follow some links, parse (sometimes ugly) html, and convert html tables to csv files.

Ah. That's harder. I've done less of this, but I'm told mechanize is good. I don't know of a good table-to-csv converter, but that's definitely a pain point in some crawls -- if you find anything good, I'd love to hear about it!

It strikes me that you could do some nice table-scraping with cleverly deployed xpath to pull out rows and columns -- the design would look a little like functional programming, although you'd still have to use loops. Python is good for that, though.
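Here's a rough sketch of that idea. To keep it self-contained it uses the standard library's ElementTree, which supports only a small subset of xpath and only parses well-formed markup. For real (ugly) pages you'd parse with lxml.html instead, whose elements accept the same findall and itertext calls. The helper names and the sample table are made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

def table_to_rows(table):
    """Return one list of cell strings per <tr>, covering <th> and <td> cells."""
    rows = []
    for tr in table.findall('.//tr'):
        cells = [c for c in tr if c.tag in ('th', 'td')]
        # itertext() flattens nested tags like <b> inside a cell.
        rows.append([''.join(c.itertext()).strip() for c in cells])
    return rows

def rows_to_csv(rows):
    """Serialize a list of rows to a CSV string."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator='\n').writerows(rows)
    return buf.getvalue()

# Made-up sample table, well-formed so ElementTree can parse it.
html = """
<table>
  <tr><th>state</th><th>votes</th></tr>
  <tr><td>MI</td><td>16</td></tr>
  <tr><td>OH</td><td><b>18</b></td></tr>
</table>
"""

table = ET.fromstring(html.strip())
print(rows_to_csv(table_to_rows(table)))
```

The loops are still there, but each one is just a map over rows or cells, which is about as close to functional as this kind of scraping gets.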

> Is mechanize the way to go for browsing operations?

From hearsay, yes.

> What's your take on BeautifulSoup vs. lxml vs. html5lib for parsing?

I got into lxml early and it works wonderfully. The only pain point is (sometimes) installation, but I'm sure you can handle that. My impression is that lxml does everything BeautifulSoup does, faster and with slightly cleaner syntax, but that it's not so much better that everyone has switched. I don't know much about html5lib.

> Should I definitely learn xpath?

Definitely. The syntax is quite easy, very similar to jquery/css selectors. It also makes for faster development: get a browser plugin for xpath and you can test your searches directly on the pages you want to crawl. This will speed up your inner loop for development time tremendously -- much better than editing and re-running scripts, or running tests from a console.

HTH. Cheers!

Thursday, February 16, 2012

Working notes on cloud-based MongoDB with python

I've been thinking about getting into mongoDB for a good while. I'm looking for a platform that works, scales, and integrates with python with a minimum of hassle. Cheap would be nice too.

Tonight, Google and I sat down to do some nuts and bolts research. Here are my notes. Have anything to add?

PS - Based on what I found, I'm thinking Heroku + MongoLab + PyMongo + Django is probably the best way to get my feet wet, since I'm already comfortable with django and heroku.

I'll be trying this in the near future -- will let you know how it goes.

Cloud hosts for mongoDB:

MongoLab
MongoHQ
MongoMachine -- bought by MongoHQ

Reviews here say MongoLab > MongoHQ w.r.t customer service
http://www.quora.com/Heroku/How-would-I-use-the-mongolab-add-on-with-python

python ORMs for mongo

mongonaut
mongoengine
mongokit/django-mongokit
pymongo (simple wrapper, no ORM)
ming
django-mongodb
django-nonrel

Strong recc for mongoengine > mongokit, esp for django developers.
    http://www.quora.com/MongoDB/Whats-the-best-MongoDB-ORM-for-Python

Says mongoEngine is faster than mongoKit
    http://www.peterbe.com/plog/mongoengine-vs.-django-mongokit

Slides also argue for mongoEngine
    http://www.peterbe.com/plog/using-mongodb-in-your-django-app/django-mongodb-html5-slides/html5.html

Says that pyMongo > mongoEngine
    http://stackoverflow.com/questions/2740837/which-python-api-should-be-used-with-mongo-db-and-django

mongoNaut is clearly not mature -- off the island!
    http://readthedocs.org/docs/django-mongonaut/en/latest/index.html

Instructions for setting up django and mongo, if that's your thing
    http://dennisgurnick.com/2010/07/06/bootstrapping-a-django-with-mongo-project/

PyMongo documentation
    http://api.mongodb.org/python/1.7/faq.html

MongoLab's example of integration, plus a small amount of stackoverflow chatter about it.
    https://github.com/mongolab/mongodb-driver-examples/blob/master/python/pymongo_simple_example.py
    http://stackoverflow.com/questions/8859532/how-can-i-use-the-mongolab-add-on-to-heroku-from-python

Decision reached!
    Heroku + MongoLab + PyMongo

Tuesday, February 14, 2012

Don't use netlogo! (4)

Quick search on simplyhired:

Jobs with python in the description: 26,549
Jobs with netlogo in the description: 1

Yes, I'm being snarky about netlogo. But if you're a person with any talent or ambition, why waste it on a "skill" that has no practical value?

Saturday, February 11, 2012

Announcing Tengolo, a python alternative to Netlogo!

Yesterday I wrote about discussion/fallout from my argument against using NetLogo for agent-based modeling. Today, contra the spirit of Mark Twain's "everyone complains about the weather, but nobody does anything about it," I want to give would-be modelers a constructive alternative to NetLogo.

Announcing Tengolo, an open-source library for agent-based modeling in python and matplotlib! Tengolo is open source, and currently hosted at github. Preliminary documentation is here.

Tengolo is designed to allow users to

Quickly express their ideas in code,
Get immediate feedback from the python shell, matplotlib GUI, and logs, AND
Scale up the scope of their experiments with batches for repeated trials, parameter sweeps, etc.

Tengolo is designed to scratch the same itch as netlogo, without making users debase themselves with the ridiculously backwards Logo programming language. Instead, they can use python's clean syntax and enormous codebase to develop their models.

For the most part, the advantages of Tengolo are the advantages of python and matplotlib:

Clean, object-oriented code for

Quick learning
Rapid prototyping
Easy debugging
Great maintainability

An enormous codebase of snippets and outside libraries.
An active and supportive user community
Powerful, professional graphs and plots with a minimum of hassle
An intuitive GUI that lets you interact with your model in real time

Development so far
I've just begun development -- about 8 hours of work -- but the advantages are already starting to show. As a proof of concept, here's a screen shot and the script for a model I'm building in Tengolo. All told, the script is only 80 lines long, and does not contain a single turtle.

#!/usr/bin/python
"""
P-A model simulator for mixed motives paper
    Abe Gong - Feb 2012
"""

import numpy
import scipy.optimize

from tengolo.core import TengoloModel, TengoloView
from tengolo.widgets import contour, slider

class M4Model(TengoloModel):
    def __init__(self):
        self.beta  = .5
        self.x_bar = 2
        self.a     = 0
        self.b     = 1
        self.alpha = .5
        self.c     = .1

        delta = 0.025
        self.v = numpy.arange(0.025, 10, delta)
        self.w = numpy.arange(0.025, 10, delta)
        self.V, self.W = numpy.meshgrid(self.v,self.w)

        self.update()

    def update(self):
        self.U = self.calc_utility( self.V, self.W, self.linear_rate )
        (self.v_star, self.w_star) = self.calc_optimal_workload( self.linear_rate )
        (self.v_bar, self.w_bar) = self.calc_optimal_workload( self.flat_rate )
        print (self.v_star, self.w_star)

    def calc_utility(self, v, w, x_func):
        z = (v**(self.beta))*(w**(1-self.beta))
        x = x_func(z)

        u = self.alpha*numpy.log(v) + (1-self.alpha)*numpy.log(x) - self.c*(w+v)
        return u

    def calc_optimal_workload(self, x_func):
#        return scipy.optimize.fmin( lambda args : -1*self.calc_utility( args[0], args[1], x_func ), [2,2], disp=False )
        result = scipy.optimize.fmin_tnc( lambda args : -1*self.calc_utility( args[0], args[1], x_func ),
                [2,2],
                bounds = [(0,None),(0,None)],
                approx_grad=True,
                disp=False,
            )
        return result[0]

    def flat_rate(self, z):
        return self.x_bar

    def linear_rate(self, z):
        return self.x_bar + self.a + self.b*z



#Initialize model
my_model = M4Model()

#Initialize viewer
my_view = TengoloView(my_model)

#Attach controls and observers
my_view.add_observer( contour, [0.15, 0.30, 0.70, 0.60], args={
    "title":"Utility isoquants",
    "xlabel":"v (hrs)",
    "ylabel":"w (hrs)",
    "x":"V",
    "y":"W",
    "z":"U",
})
my_view.add_control( slider, "alpha", [0.15, 0.15, 0.70, 0.03], args={"range_min":0, "range_max":1} )
my_view.add_control( slider, "beta", [0.15, 0.10, 0.70, 0.03], args={"range_min":0, "range_max":1} )

#Render the view
my_view.render()

This particular model is game-theoretic, not agent-based, but the process of model design is essentially the same. I want to be able to build and edit the model, and get results as quickly as possible. The process should be creative, not bogged down with debugging. Along the way, I need to experiment with model parameters, and quickly see their impact on the behavior of the system as a whole.
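For readers who would rather see the math than parse numpy, here is what calc_utility computes, transcribed directly from the script above. Nothing here goes beyond what the code already says; v and w are the two variables on the plot axes, and the logs are natural logs.

```latex
\begin{aligned}
z &= v^{\beta}\, w^{1-\beta} \\
x &= \bar{x} + a + b z \quad \text{(linear rate; under the flat rate, } x = \bar{x}\text{)} \\
u &= \alpha \ln v + (1-\alpha) \ln x - c\,(w + v)
\end{aligned}
```

The two sliders in the view control alpha and beta, so dragging them reshapes the isoquants of u in real time.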

As I said earlier, I've only just started down this road, but there's no turning back. If you have a model you'd like to port to Tengolo, let me know. Cheers!

Friday, February 10, 2012

More about not using netlogo

I got a fair amount of pushback on my last post about NetLogo. The basic gist was that yes, logo is an antiquated programming language, but no, that's no reason to write off the Netlogo platform as a whole.

Here's a considered response from Rainer, who studies epidemiology and public health at U Michigan:

Regarding your Netlogo sympathies ... I totally understand where you are coming from ... Python is my favorite programming language, I also know Java well. I build ABMs in Netlogo and RePast, as well as in Python.

It took me a while to get into NetLogo and I have to agree it is a horrible "programming language". One of the major advantages of NetLogo however is that it takes care of a lot of overhead (graphical output, controls, parameter sweeps, R integration). NetLogo 5 now has some form of functional programming, list comprehension, dictionaries and other features that I haven't fully explored.

I never thought I would be a NetLogo advocate but it does have its place in the world of simulation.

These are fair, very practical points. They are basically the same reasons we continue to use the QWERTY keyboard, even though Dvorak is probably a little faster and a lot less painful.

The difference is that keyboards are at the end of the adoption cycle, and ABMs are at the beginning. With modeling software, there's still time to change and avoid decades of deadweight legacy loss. Since NetLogo is primarily used as a pedagogical tool, it seems a shame that we are forcing new students to invest in a dead-end language.

With all that in mind, I've decided to become a bit of a gadfly with respect to Netlogo. More on this subject tomorrow... In the meantime, don't use it!

Wednesday, February 8, 2012

Key skills for job-hunting data scientists

There's a lot of buzz around data science, but (as I've posted about previously) the term is still murky.

One way to get a look at the emerging definition of data science is to search for "data science" jobs, and see what they have in common: What are the key skills for data scientists?

This wordle sounds like Romney: "jobs jobs jobs..."

I took a few minutes today to run that search -- automated, of course. Nothing rigorous or scientific, but the results are still plenty interesting.

Methods: I searched "data scientist" on simplyhired.com, then scraped the ~250 resulting links. All of the non-dead links (there weren't many dead ones) returned a job posting, usually on a company page, occasionally on another aggregator. I grabbed the html of each of these pages, and cleaned the html to get rid of scripts, styles, etc. I didn't do any fancy chrome or ad scraping, so take the results with a grain of salt.
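For anyone who wants to try something similar, here is a minimal sketch of the cleaning step using only the standard library. It is not the script I ran for this post, just one simple way to drop script and style contents and keep the visible text. The class and function names are made up.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = ('script', 'style')

    def __init__(self):
        HTMLParser.__init__(self)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse runs of whitespace.
        return ' '.join(' '.join(self._chunks).split())

def clean_html(html):
    """Return the visible text of an html page as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()

page = ('<html><head><style>p{color:red}</style>'
        '<script>var x = "java";</script></head>'
        '<body><p>We use <b>Hadoop</b> daily.</p></body></html>')
print(clean_html(page))  # We use Hadoop daily.
```

Note that the "java" inside the script tag never reaches the output, which is the point: without this step, keyword counts pick up javascript and css noise.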

First, I generated the obligatory word cloud. Thank you, wordle.

Then I skimmed a dozen of the pages, looking for keywords that seemed to pop up a lot: java, hadoop, python. For the most part, I focused on specific skills that companies are explicitly hiring for. I also tossed in a few other terms, just to see what would happen.

Here are counts of jobs mentioning keywords (Not keyword counts -- the number of separate job postings that include at least one reference to a given keyword.):

198	data
131	statist
130	java
107	hadoop
85	mining
68	python
49	visuali
39	cloud
37	mapreduce
35	c\+\+
24	amazon
22	ruby
18	bayes
15	ec2
13	jquery
13	fun
3	estimat
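The counts above are document frequencies: each posting counts at most once per keyword, however many times the keyword appears. A minimal sketch of that calculation is below. The three sample postings are invented, and count_postings is a name I made up for illustration.

```python
import re

def count_postings(postings, patterns):
    """Count how many postings match each pattern at least once."""
    counts = {}
    for pattern in patterns:
        regex = re.compile(pattern, re.IGNORECASE)
        counts[pattern] = sum(1 for text in postings if regex.search(text))
    return counts

# Invented sample postings, standing in for the cleaned page text.
postings = [
    "Data scientist: Java, Hadoop, and statistics required.",
    "Seeking statistician with Python and C++ experience. Python a must.",
    "Join a fun team doing data mining on Hadoop.",
]
patterns = ["data", "statist", "java", "hadoop", "python", r"c\+\+", "fun"]

counts = count_postings(postings, patterns)
for pattern, n in sorted(counts.items(), key=lambda kv: -kv[1]):
    print("%d\t%s" % (n, pattern))
```

Substring patterns like these are blunt instruments: "fun" also matches "function" and "fundamental", and "java" also matches "javascript", so some of the counts may be overstated.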

Evidently, data science postings put equal value on "fun" and "jquery."

Also, at a glance, Java beats python, which beats C++ in terms of employability. It kind of makes me wish I'd been nicer to Java all these years.

One clear finding is that hadoop and MapReduce skills are in high demand. That's not news to anyone working in this area, but I was surprised at just how many jobs were looking for these skills. Almost half (107 of 238, 45%) of total job postings explicitly mention hadoop.

That percentage seems slightly out of whack to me, because there are plenty of valuable ways to mine data without using a MapReduce algorithm. Maybe Hadoop is a pinch point in the job market because there just aren't enough MapReduce-literate data miners out there? If that's the case, I would expect demand to come down (relative to supply) in the not-too-distant future -- MapReduce isn't that hard to learn.

Alternatively, there could be some bandwagoning going on:

"Data is the Next Big Thing. We need to hire a data person."

"What exactly is a data person?"

"I don't know, but I hear they know how to program in 'ha-doop,' so put that in the posting."

As a third explanation, it may be that the meaning of "data science" is narrowing. Instead of encompassing all the things that can be done with data, perhaps it's coming to mean "mapreduce." If that's the case, then "data science" jobs would naturally include hadoop/mapreduce skills. IMO, that would be sad, because it would be an opportunity missed to change the way data flows into decisions in a more systemic way.

I'd be interested in hearing other explanations for the dominance of hadoop. Also, if you have other queries to run against this data set, I'm happy to try them out. What I've put up so far is just back-of-envelope coffee-break stuff.

Monday, February 6, 2012

Ubuntu fail (and fix): After the latest "update," unity crashes on alt-tab

I love ubuntu linux -- as loyal a user as they come -- but I still have to share this horror story. It's a good case study in the ups and downs of working in a fully open-source environment.

Last week, I installed routine updates to ubuntu on my main work computer. Unfortunately, there's a huge bug in the latest update: switching windows using alt-tab crashes Unity, the main GUI for ubuntu 11.10.

This made it impossible to open new applications -- or even shut down without holding down the power button. It also disabled keyboard input to the terminal -- like severing the spine of the OS. Since alt-tab is a deeply ingrained reflex for me, the "update" made my computer almost entirely unusable.

For the record, this is by far the biggest bug I've run into so far with ubuntu.

The bug was reported early on, but I don't know how long it will take to fix. After four days, the status was "critical, unassigned," which I take to mean "we know it's a problem, but haven't got to it just yet."

In the meantime, I still had work to do. I posted to various forums (like here), but didn't get much in the way of specific help -- unusual for the ubuntu community. For the most part, I worked in a campus lab (which brought its own problems: missing software and no admin rights -- the reasons I'd come to rely on my laptop so much.) When I absolutely had to use my laptop, I sat on my left hand to avoid the temptation to switch between windows via the keyboard.

Today, I finally knuckled down to finding a workaround on my own. It took me about an hour to discover gnome-shell, the major competitor to Unity. From there, installation and configuration took less than 10 minutes. I don't like gnome's look as much as unity -- too much chrome -- but if it makes my system usable, I'll keep it.

Here's the site that gave straightforward installation instructions for gnome-shell:
http://www.ubuntugeek.com/how-to-install-gnome-shell-in-ubuntu-11-10-oneiric-ocelot.html

Here are some other links that were helpful, but probably not necessary for the final fix:

Also, there was a brief time where I disabled unity, but didn't have gnome running yet. This meant that my only way to launch applications was from the terminal. Here are some commands that will save your sanity in this situation:

Ctrl-Alt-T : Load terminal, from anywhere. If you turn off unity, this is the only way to launch new programs.
firefox & : Load firefox from terminal. Now you can get online for help.
gnome-session-quit : Now you can log in and out.
shutdown : Now you can shutdown without yanking the power cord.

Wednesday, February 1, 2012

Follow up on personal elephants

Three thoughts from the talk about elephants and motivation I posted yesterday...

First, I found a website that talks about Buddhist symbolism, including elephants. The metaphor is perfect:

"At the beginning of one's practice the uncontrolled mind is symbolised by a gray elephant who can run wild any moment and destroy everything on his way. After practising dharma and taming one's mind, the mind which is now brought under control is symbolised by a white elephant strong and powerful, who can be directed wherever one wishes and destroy all the obstacles on his way."

Second, the closing thought in the talk is from Kara S, about getting to know your personal elephant. Her comment reminded me of a conversation from Paulo Coelho's wonderful little book, The Alchemist:

    "My [elephant] is a traitor," the boy said to the alchemist, when they had paused to rest the horses. "It doesn't want me to go on."

    "That makes sense," the alchemist answered. "Naturally it's afraid that, in pursuing your dream, you might lose everything you've won."

    "Because you will never again be able to keep it quiet. Even if you pretend not to have heard what it tells you, it will always be there inside you, repeating to you what you're thinking about life and about the world."

    "You mean I should listen, even if it's treasonous?"

    "Treason is a blow that comes unexpectedly. If you know your [elephant] well, it will never be able to do that to you. Because you'll know its dreams and wishes, and will know how to deal with them."

    "You will never be able to escape from your [elephant]. So it's better to listen to what it has to say. That way, you'll never have to fear an unanticipated blow."

Coelho uses the word "heart," instead of "elephant," but I'm sure he won't mind the substitution.

Finally, I referenced a bunch of studies (in passing) in the talk, but didn't include any citations, because I ran out of time. If you happen to have links/refs to articles, books, videos, etc. in this area, can you paste them into the comments?

If you haven't seen the original slides yet, I'll leave these as teasers.

Monday, January 30, 2012

How to ride, eat, tame, etc. your personal elephant

This is a talk I gave at the annual "Hill Street TED" activity that my local congregation puts together. These are short talks in TED format, put together by members of the congregation to share aspects of their life and work that don't get talked about much at church.

Here, I've taken my slides from the talk, added a script, and revised the format to better fit the web. After going back and forth, I left in the Mormon references, even though I know some of them will be lost in translation. I also added a few slides based on comments and feedback from people at the talk. This let me put in more details and one-off ideas that just didn't fit into 10 minutes. Enjoy!

EDIT: Slideshare is giving me grief, so here are pdf versions of the talk with speaking notes, and without.


Saturday, January 28, 2012

Announcing QuoteWars2012!

Just in time for the Florida primary, my brother and I have released a site about the 2012 elections: QuoteWars2012.

The site lets you quiz yourself on quotes by Obama, Romney, Gingrich, and other presidential hopefuls. Think of it as a gamified survey: like a survey, it collects data about public opinion, but it's also designed to be fun and informative.

The site is in public beta -- fun and usable, with a few minor bugs. We're looking for feedback on how to expand and improve the site. Please play the game, forward far and wide, and let us know what you think!

Tuesday, January 17, 2012

Don't use netlogo

A follow-up to yesterday's post on picking programming languages: conditions under which you should program in NetLogo.

The short answer: None. It amazes me that people will put up with the pain of developing in a language where turtles are one of the primary object primitives.

A slightly longer answer: I can see why some people use NetLogo as a way to learn the basics of agent-based modeling, but I'll never be able to take it seriously as a research tool.

NetLogo is a legacy system with a lot of nifty-looking examples and modules, plus it can run in a browser. These are its strengths. But it's built on the Logo programming language, which is hopelessly outdated, and never intended for real number-crunching anyway. In other words, using NetLogo signals that you don't know how to do real programming.