Friday, February 24, 2012

Pain points on Pulse newsreader

Pulse newsreader came preinstalled on the kindle fire I got for Christmas (thanks, Sam!).  On the whole I like it, but a few glaring bugs and omissions really hold it back.
  • Bug: after pressing the star button on one post, it gets stuck for all the other posts on the blog.  AFAICT, you can only star one post per blog per session.  Clunky.
  • Bug: when I'm offline---I do a lot of my blog reading on the bus---Pulse doesn't seem to remember which blog posts I've read.  I read it, swipe it, and then next time I'm on the grid it seems to pop right back up.  I can't tell if it's doing this all the time or just a lot of the time, but it's a pain to hit the same articles two, three times, and up.
  • Feature request: I love being able to post to twitter with two clicks.  (This is nice for Pulse too, because they get their name in the link.)  Can you let me queue tweets in offline mode, then sync with twitter once I get back on the grid?
  • Feature request: alternatively, can you provide an API to starred items?  If I could get at those (say as RSS), I could automate the posting to twitter myself.
Pulse, are you listening?  Fix these and I will be your friend forever.  Until then, I'm looking for suggestions on blog readers...

Wednesday, February 22, 2012

Getting started with django, heroku, and mongolab

Long post here -- I just spent a couple hours going from zero to mongo with django and heroku.  Here are my working notes.

For this test, I started from a pure vanilla django project:
django-admin startproject testproj

I made a single edit to settings.py, adding "gunicorn" to the list of INSTALLED_APPS.

For setting up django on heroku, I basically followed the description given in these very good notes by Ken Cochrane.  I've done this before and there were no surprises, so I'll skip the stuff about setting up heroku and git.  I did nothing special except install gunicorn.

Here's my Procfile:
web: python testproj/manage.py collectstatic --noinput; python testproj/manage.py run_gunicorn -b "0.0.0.0:$PORT"

This is overkill at the outset--we don't have any static files, so collecting them doesn't do anything.

And my requirements.txt:
Django==1.3.1
gunicorn==0.13.4
psycopg2==2.4.4
simplejson==2.3.2

That got me to the famous django opening screen, live on heroku:
It worked!
Congratulations on your first Django-powered page.

Just to emphasize that I've done nothing special so far, here's the tree for the git repo I'm using.  (I've suppressed .pyc and *~ files.)

    .
    ├── Procfile
    ├── requirements.txt
    └── testproj
        ├── __init__.py
        ├── manage.py
        ├── settings.py
        └── urls.py

(Come to think of it, there should probably be a .gitignore file in there as well.)

Let's pick up from that point.

I added the mongolab starter tier (free up to 240 MB) to the heroku app.  I did this from the heroku console, because I wanted to see what options were there.  (Oddly, none of the dedicated tiers show up there.)  In the future, I'll probably just use the command line:
heroku addons:add mongolab:starter

Next, I followed the instructions on heroku's documentation, and grabbed the URI:
heroku config | grep MONGOLAB_URI

The MONGOLAB_URI is in the following format:
mongodb://username:password@host:port/database

At this point, heroku's documents stopped being much help, because they don't cover python.  So I switched over here instead.  I wanted to understand all the steps, so I refrained from copy-pasting.

I installed a few supporting libraries:
pip install pymongo django-mongodb-engine djangotoolbox

And added the appropriate lines to requirements.txt:
pymongo==2.1.1
django-mongodb-engine==0.4.0
djangotoolbox==0.9.2

I then configured the database in settings.py.
DATABASES = {
    'default': {
        'ENGINE': 'django_mongodb_engine',
        'NAME': 'heroku_app1234567',
        'USER': 'heroku_app1234567',
        'PASSWORD': 'abcdefghijklmnopqrstuvwxyz',
        'HOST': 'ds031117.mongolab.com',
        'PORT': '31117',
    }
}
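
Rather than copying each value by hand, the same settings can be derived from the MONGOLAB_URI string.  Here's a minimal sketch, assuming the URI always follows the mongodb://username:password@host:port/database format shown earlier.  The helper name mongolab_settings is made up for illustration, and the import is the Python 3 spelling (on Python 2 the same function lives in the urlparse module).

```python
from urllib.parse import urlparse


def mongolab_settings(uri):
    """Split a mongodb:// URI into the keys a Django DATABASES entry expects.

    Hypothetical helper: assumes username, password, host, port, and
    database name are all present in the URI.
    """
    parts = urlparse(uri)
    return {
        "ENGINE": "django_mongodb_engine",
        "NAME": parts.path.lstrip("/"),   # database name follows the slash
        "USER": parts.username,
        "PASSWORD": parts.password,
        "HOST": parts.hostname,
        "PORT": str(parts.port),          # kept as a string, as in settings.py
    }
```

Since heroku exposes config vars as environment variables, settings.py could then use DATABASES = {'default': mongolab_settings(os.environ['MONGOLAB_URI'])} and keep the password out of the git repo.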

Note: With versions of django-mongodb-engine over 0.2, ENGINE should be 'django_mongodb_engine', not 'django_mongodb_engine.mongodb'.  Get this wrong and you'll see something like:
django.core.exceptions.ImproperlyConfigured:
'django_mongodb_engine.mongodb' isn't an available database backend.
Try using django.db.backends.XXX, where XXX is one of:
    'dummy', 'mysql', 'oracle', 'postgresql', 'postgresql_psycopg2', 'sqlite3'
Error was: No module named mongodb.base

I never did anything about the "current bug" that Dennis mentions.  Apparently it's been patched, whatever it was.

Quick check: so far, running locally (./manage.py runserver) and pushing to heroku both work.  But the app doesn't have any models yet.  Time to fix that:
django-admin startapp onebutton
Here's models.py:
from django.db import models

class ButtonClick( models.Model ):
    click_time = models.DateTimeField( auto_now=True )
    animal = models.CharField( max_length=200 )
I was going to build a stupid-simple app with a single button to "catch" animals from a random list and store them in the DB, but it's getting late, so let's just jump to the proof of concept.

Going straight to the django shell on my computer...
$ python manage.py shell
>>> from testproj.onebutton.models import ButtonClick as BC
>>> c1 = BC( animal="Leopard" )
>>> c1.save()
>>>
It works!  Open up the mongolab console (through the add-ons tab in heroku) and the database shows a single record in the onebutton_buttonclick collection.

We still haven't written views and templates, and (more importantly) validated that the heroku app can talk to the DB over at mongolab, but I'm going to call this good enough for now.  Mission accomplished.

Saturday, February 18, 2012

Q&A about web scraping with python

Here are highlights from a recent email exchange with a tech-minded friend in the polisci department. His questions were similar to others I've been fielding recently, and very to-the-point, so I thought the conversation might be useful for others looking at getting into web scraping for research (and fun and profit).

> ... I have an unrelated question: What python libraries do you use to scrape websites?

urllib2, lxml, and re do almost everything I need. Sometimes I use wget to mirror a site, then use glob and lxml to pull out the salient pieces. For the special case of directed crawls, I built snowcrawl.


> I need to click on some buttons, follow some links, parse (sometimes ugly) html, and convert html tables to csv files.

Ah. That's harder. I've done less of this, but I'm told mechanize is good. I don't know of a good table-to-csv converter, but that's definitely a pain point in some crawls -- if you find anything good, I'd love to hear about it!

It strikes me that you could do some nice table-scraping with cleverly deployed xpath to pull out rows and columns -- the design would look a little like functional programming, although you'd still have to use loops. Python is good for that, though.
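
To make the rows-and-columns idea concrete, here's a rough sketch using only the standard library. ElementTree's limited XPath subset stands in for lxml's full xpath(), so this assumes the table markup is well-formed; ugly real-world html would need lxml.html or similar to parse it first. The function name table_to_csv is made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET


def table_to_csv(markup):
    """Convert one well-formed <table> fragment to CSV text (hypothetical helper)."""
    root = ET.fromstring(markup)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # ".//tr" selects every row below the root element.
    for row in root.findall(".//tr"):
        # Header and data cells are the direct children of each row.
        cells = [cell for cell in row if cell.tag in ("th", "td")]
        # itertext() flattens any inline markup inside a cell.
        writer.writerow(["".join(cell.itertext()).strip() for cell in cells])
    return out.getvalue()
```

One path expression picks the rows and a comprehension picks the columns, so the only explicit loop left is the one over rows.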


> Is mechanize the way to go for browsing operations?

From hearsay, yes.

> What's your take on BeautifulSoup vs. lxml vs. html5lib for parsing?

I got into lxml early and it works wonderfully. The only pain point is (sometimes) installation, but I'm sure you can handle that. My impression is that lxml does everything BeautifulSoup does, faster and with slightly cleaner syntax, but that it's not so much better that everyone has switched. I don't know much about html5lib.

> Should I definitely learn xpath?

Definitely. The syntax is quite easy, very similar to jquery/css selectors. It also makes for faster development: get a browser plugin for xpath and you can test your searches directly on the pages you want to crawl. This will speed up your inner loop for development time tremendously -- much better than editing and re-running scripts, or running tests from a console.

HTH.  Cheers!

Thursday, February 16, 2012

Working notes on cloud-based MongoDB with python

I've been thinking about getting into mongoDB for a good while.  I'm looking for a platform that works, scales, and integrates with python with a minimum of hassle.  Cheap would be nice too.

Tonight, Google and I sat down to do some nuts-and-bolts research.  Here are my notes.  Have anything to add?


PS - Based on what I found, I'm thinking Heroku + MongoLab + PyMongo + Django is probably the best way to get my feet wet, since I'm already comfortable with django and heroku. 

I'll be trying this in the near future -- will let you know how it goes.



Cloud hosts for mongoDB:
  •     MongoLab
  •     MongoHQ
  •     MongoMachine -- bought by MongoHQ

Reviews here say MongoLab > MongoHQ w.r.t customer service
    http://www.quora.com/Heroku/How-would-I-use-the-mongolab-add-on-with-python

python ORMs for mongo
  •     mongonaut
  •     mongoengine
  •     mongokit/django-mongokit
  •     pymongo (simple wrapper, no ORM)
  •     ming
  •     django-mongodb
  •     django-nonrel

Strong recc for mongoengine > mongokit, esp for django developers.
    http://www.quora.com/MongoDB/Whats-the-best-MongoDB-ORM-for-Python

Says mongoEngine is faster than mongoKit
    http://www.peterbe.com/plog/mongoengine-vs.-django-mongokit

Slides also argue for mongoEngine
    http://www.peterbe.com/plog/using-mongodb-in-your-django-app/django-mongodb-html5-slides/html5.html

Says that pyMongo > mongoEngine
    http://stackoverflow.com/questions/2740837/which-python-api-should-be-used-with-mongo-db-and-django

mongoNaut is clearly not mature -- off the island!
    http://readthedocs.org/docs/django-mongonaut/en/latest/index.html

Instructions for setting up django and mongo, if that's your thing
    http://dennisgurnick.com/2010/07/06/bootstrapping-a-django-with-mongo-project/

PyMongo documentation
    http://api.mongodb.org/python/1.7/faq.html

MongoLab's example of integration, plus a small amount of stackoverflow chatter about it.
    https://github.com/mongolab/mongodb-driver-examples/blob/master/python/pymongo_simple_example.py
    http://stackoverflow.com/questions/8859532/how-can-i-use-the-mongolab-add-on-to-heroku-from-python


Decision reached!
    Heroku + MongoLab + PyMongo

Tuesday, February 14, 2012

Don't use netlogo! (4)

Quick search on simplyhired:

Jobs with python in the description: 26,549
Jobs with netlogo in the description: 1

Yes, I'm being snarky about netlogo.  But if you're a person with any talent or ambition, why waste it on a "skill" that has no practical value?

Saturday, February 11, 2012

Announcing Tengolo, a python alternative to Netlogo!

Yesterday I wrote about discussion/fallout from my argument against using NetLogo for agent-based modeling.  Today, contra the spirit of Mark Twain's "everyone complains about the weather, but nobody does anything about it,"  I want to give would-be modelers a constructive alternative to NetLogo.

Announcing Tengolo, an open-source library for agent-based modeling in python and matplotlib!  The code is currently hosted at github.  Preliminary documentation is here.

Tengolo is designed to allow users to
  1. Quickly express their ideas in code,
  2. Get immediate feedback from the python shell, matplotlib GUI, and logs, AND
  3. Scale up the scope of their experiments with batches for repeated trials, parameter sweeps, etc.
Tengolo is designed to scratch the same itch as netlogo, without making users debase themselves with the ridiculously backwards Logo programming language.  Instead, they can use python's clean syntax and enormous codebase to develop their models.

For the most part, the advantages of Tengolo are the advantages of python and matplotlib:
  • Clean, object-oriented code for
    • Quick learning
    • Rapid prototyping
    • Easy debugging
    • Great maintainability
  • An enormous codebase of snippets and outside libraries. 
  • An active and supportive user community
  • Powerful, professional graphs and plots with a minimum of hassle
  • An intuitive GUI that lets you interact with your model in real time

Development so far
I've just begun development -- about 8 hours of work -- but the advantages are already starting to show.  As a proof of concept, here's a screen shot and the script for a model I'm building in Tengolo.  All told, the script is only 80 lines long, and does not contain a single turtle.


#!/usr/bin/python
"""
P-A model simulator for mixed motives paper
    Abe Gong - Feb 2012
"""

import numpy
import scipy.optimize

from tengolo.core import TengoloModel, TengoloView
from tengolo.widgets import contour, slider

class M4Model(TengoloModel):
    def __init__(self):
        self.beta  = .5
        self.x_bar = 2
        self.a     = 0
        self.b     = 1
        self.alpha = .5
        self.c     = .1

        delta = 0.025
        self.v = numpy.arange(0.025, 10, delta)
        self.w = numpy.arange(0.025, 10, delta)
        self.V, self.W = numpy.meshgrid(self.v,self.w)

        self.update()

    def update(self):
        self.U = self.calc_utility( self.V, self.W, self.linear_rate )
        (self.v_star, self.w_star) = self.calc_optimal_workload( self.linear_rate )
        (self.v_bar, self.w_bar) = self.calc_optimal_workload( self.flat_rate )
        print (self.v_star, self.w_star)

    def calc_utility(self, v, w, x_func):
        z = (v**(self.beta))*(w**(1-self.beta))
        x = x_func(z)

        u = self.alpha*numpy.log(v) + (1-self.alpha)*numpy.log(x) - self.c*(w+v)
        return u

    def calc_optimal_workload(self, x_func):
#        return scipy.optimize.fmin( lambda args : -1*self.calc_utility( args[0], args[1], x_func ), [2,2], disp=False )
        result = scipy.optimize.fmin_tnc( lambda args : -1*self.calc_utility( args[0], args[1], x_func ),
                [2,2],
                bounds = [(0,None),(0,None)],
                approx_grad=True,
                disp=False,
            )
        return result[0]

    def flat_rate(self, z):
        return self.x_bar

    def linear_rate(self, z):
        return self.x_bar + self.a + self.b*z



#Initialize model
my_model = M4Model()

#Initialize viewer
my_view = TengoloView(my_model)

#Attach controls and observers
my_view.add_observer( contour, [0.15, 0.30, 0.70, 0.60], args={
    "title":"Utility isoquants",
    "xlabel":"v (hrs)",
    "ylabel":"w (hrs)",
    "x":"V",
    "y":"W",
    "z":"U",
})
my_view.add_control( slider, "alpha", [0.15, 0.15, 0.70, 0.03], args={"range_min":0, "range_max":1} )
my_view.add_control( slider, "beta", [0.15, 0.10, 0.70, 0.03], args={"range_min":0, "range_max":1} )

#Render the view
my_view.render()

This particular model is game-theoretic, not agent-based, but the process of model design is essentially the same.  I want to be able to build and edit the model, and get results as quickly as possible. The process should be creative, not bogged down with debugging. Along the way, I need to experiment with model parameters, and quickly see their impact on the behavior of the system as a whole.

As I said earlier, I've only just started down this road, but there's no turning back.  If you have a model you'd like to port to Tengolo, let me know.  Cheers!

Friday, February 10, 2012

More about not using netlogo

I got a fair amount of pushback on my last post about NetLogo.  The basic gist was that yes, logo is an antiquated programming language, but no, that's no reason to write off the Netlogo platform as a whole.



Here's a considered response from Rainer, who studies epidemiology and public health at U Michigan:

Regarding your Netlogo sympathies ... I totally understand where you are coming from ... Python is my favorite programming language, I also know Java well. I build ABMs in Netlogo and RePast, as well as in Python.

It took me a while to get into NetLogo and I have to agree it is a horrible "programming language". One of the major advantages of NetLogo however is that it takes care of a lot of overhead (graphical output, controls, parameter sweeps, R integration). NetLogo 5 now has some form of functional programming, list comprehension, dictionaries and other features that I haven't fully explored.

I never thought I would be a NetLogo advocate but it does have its place in the world of simulation.

These are fair, very practical points.  They are basically the same reasons we continue to use the QWERTY keyboard, even though Dvorak is probably a little faster and a lot less painful.

The difference is that keyboards are at the end of the adoption cycle, and ABMs are at the beginning.  With modeling software, there's still time to change and avoid decades of deadweight legacy loss.  Since NetLogo is primarily used as a pedagogical tool, it seems a shame that we are forcing new students to invest in a dead-end language.

With all that in mind, I've decided to become a bit of a gadfly with respect to Netlogo.   More on this subject tomorrow...  In the meantime, don't use it!

Wednesday, February 8, 2012

Key skills for job-hunting data scientists

There's a lot of buzz around data science, but (as I've posted about previously) the term is still murky.

One way to get a look at the emerging definition of data science is to search for "data science" jobs, and see what they have in common: What are the key skills for data scientists?

This wordle sounds like Romney: "jobs jobs jobs..."


I took a few minutes today to run that search -- automated, of course.  Nothing rigorous or scientific, but the results are still plenty interesting.

Methods:  I searched "data scientist" on simplyhired.com, then scraped the ~250 resulting links.  All of the non-dead links (there weren't many dead ones) returned a job posting, usually on a company page, occasionally on another aggregator.  I grabbed the html of each of these pages, and cleaned the html to get rid of scripts, styles, etc.  I didn't do any fancy chrome or ad scraping, so take the results with a grain of salt.
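
For anyone who wants to repeat the cleaning step, here's a minimal sketch of stripping scripts and styles from a page.  It uses only the standard library's html.parser as a stand-in, so it is not necessarily the tool used for the original scrape, and the names TextExtractor and visible_text are made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text from a page, ignoring <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside any skipped element.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace so postings compare cleanly.
        return " ".join(" ".join(self._chunks).split())


def visible_text(html):
    """Return the page text with scripts and styles removed (hypothetical helper)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

This leaves navigation chrome and ads in place, which matches the caveat above: the output is rough, not a clean job description.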

First, I generated the obligatory word cloud.  Thank you, wordle.


Then I skimmed a dozen of the pages, looking for keywords that seemed to pop up a lot: java, hadoop, python.  For the most part, I focused on specific skills that companies are explicitly hiring for.  I also tossed in a few other terms, just to see what would happen.

Here are counts of jobs mentioning keywords (Not keyword counts -- the number of separate job postings that include at least one reference to a given keyword.):


198 data
131 statist
130 java
107 hadoop
85 mining
68 python
49 visuali
39 cloud
37 mapreduce
35 c\+\+
24 amazon
22 ruby
18 bayes
15 ec2
13 jquery
13 fun
3 estimat
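
The counting rule above (postings that mention a keyword at least once, not total mentions) can be sketched in a few lines.  This assumes each posting has already been reduced to a plain string, and that the keywords are treated as case-insensitive regular expressions, which is what entries like "c\+\+" and "statist" suggest.  The function name count_postings and the sample postings are made up for illustration.

```python
import re


def count_postings(postings, patterns):
    """Count how many postings match each pattern at least once."""
    counts = {}
    for pattern in patterns:
        regex = re.compile(pattern, re.IGNORECASE)
        # search() stops at the first hit, so repeats within a posting don't inflate the count.
        counts[pattern] = sum(1 for text in postings if regex.search(text))
    return counts


# Hypothetical sample data, not taken from the real scrape.
sample = [
    "Java and Hadoop. More Java!",
    "python, C++",
    "Statistics and statistical modeling in Python",
]
print(count_postings(sample, ["java", "python", r"c\+\+", "statist"]))
```

Truncated stems like "statist" and "visuali" catch several word forms at once (statistics, statistical, statistician) without any real stemming.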

Evidently, data science postings put equal value on "fun" and "jquery."

Also, at a glance, Java beats python, beats C++ in terms of employability.  It kind of makes me wish I'd been nicer to Java all these years.

One clear finding is that hadoop and MapReduce skills are in high demand.  That's not news to anyone working in this area, but I was surprised at just how many jobs were looking for these skills.  Almost half (107 of 238, 45%) of total job postings explicitly mention hadoop.

That percentage seems slightly out of whack to me, because there are plenty of valuable ways to mine data without using a MapReduce algorithm.  Maybe Hadoop is a pinch point in the job market because there just aren't enough MapReduce-literate data miners out there?  If that's the case, I would expect demand to come down (relative to supply) in the not-too-distant future -- MapReduce isn't that hard to learn.

Alternatively, there could be some bandwagoning going on:
"Data is the Next Big Thing.  We need to hire a data person."
"What exactly is a data person?"
"I don't know, but I hear they know how to program in 'ha-doop,' so put that in the posting."
As a third explanation, it may be that the meaning of "data science" is narrowing.  Instead of encompassing all the things that can be done with data, perhaps it's coming to mean "mapreduce."  If that's the case, then "data science" jobs would naturally include hadoop/mapreduce skills.  IMO, that would be sad, because it would be an opportunity missed to change the way data flows into decisions in a more systemic way.

I'd be interested in hearing other explanations for the dominance of hadoop.  Also, if you have other queries to run against this data set, I'm happy to try them out.  What I've put up so far is just back-of-envelope coffee-break stuff.

Monday, February 6, 2012

Ubuntu fail (and fix): After the latest "update," unity crashes on alt-tab

I love ubuntu linux -- as loyal a user as they come -- but I still have to share this horror story.  It's a good case study in the ups and downs of working in a fully open-source environment.

Last week, I installed routine updates to ubuntu on my main work computer. Unfortunately, there's a huge bug in the latest update: switching windows using alt-tab crashes Unity, the main GUI for ubuntu 11.10.

This makes it impossible to open new applications -- or even shut down without holding down the power button. It also disables keyboard input to the terminal -- like severing the spine of the OS.  Since alt-tab is a deeply ingrained reflex for me, the "update" made my computer almost entirely unusable.
For the record, this is by far the biggest bug I've run into so far with ubuntu.

The bug was reported early on, but I don't know how long it will take to fix. After four days, the status was "critical, unassigned," which I take to mean "we know it's a problem, but haven't got to it just yet."

In the meantime, I still had work to do.  I posted to various forums (like here), but didn't get much in the way of specific help -- unusual for the ubuntu community.  For the most part, I worked in a campus lab (which brought its own problems: missing software and no admin rights -- the reasons I'd come to rely on my laptop so much.) When I absolutely had to use my laptop, I sat on my left hand to avoid the temptation to switch between windows via the keyboard.

Today, I finally knuckled down to finding a workaround on my own.  It took me about an hour to discover gnome-shell, the major competitor to Unity.  From there, installation and configuration took less than 10 minutes.  I don't like gnome's look as much as unity -- too much chrome -- but if it makes my system usable, I'll keep it.

Here's the site that gave straightforward installation instructions for gnome-shell:
http://www.ubuntugeek.com/how-to-install-gnome-shell-in-ubuntu-11-10-oneiric-ocelot.html

Here are some other links that were helpful, but probably not necessary for the final fix:

Also, there was a brief time where I disabled unity, but didn't have gnome running yet.  This meant that my only way to launch applications was from the terminal.  Here are some commands that will save your sanity in this situation:
  • Ctrl-Alt-T : Load terminal, from anywhere.  If you turn off unity, this is the only way to launch new programs.
  • firefox &  : Load firefox from terminal.  Now you can get online for help.
  • gnome-session-quit : Now you can log in and out.
  • shutdown : Now you can shut down without yanking the power cord.

Wednesday, February 1, 2012

Follow up on personal elephants

Three thoughts from the talk about elephants and motivation I posted yesterday...

First, I found a website that talks about Buddhist symbolism, including elephants.  The metaphor is perfect:
"At the beginning of one's practice the uncontrolled mind is symbolised by a gray elephant who can run wild any moment and destroy everything on his way. After practising dharma and taming one's mind, the mind which is now brought under control is symbolised by a white elephant strong and powerful, who can be directed wherever one wishes and destroy all the obstacles on his way."

Second, the closing thought in the talk is from Kara S, about getting to know your personal elephant. Her comment reminded me of a conversation from Paulo Coelho's wonderful little book, The Alchemist:

    "My [elephant] is a traitor," the boy said to the alchemist, when they had paused to rest the horses. "It doesn't want me to go on."

    "That makes sense," the alchemist answered. "Naturally it's afraid that, in pursuing your dream, you might lose everything you've won."

    "Because you will never again be able to keep it quiet. Even if you pretend not to have heard what it tells you, it will always be there inside you, repeating to you what you're thinking about life and about the world."

    "You mean I should listen, even if it's treasonous?"

    "Treason is a blow that comes unexpectedly. If you know your [elephant] well, it will never be able to do that to you. Because you'll know its dreams and wishes, and will know how to deal with them."

    "You will never be able to escape from your [elephant]. So it's better to listen to what it has to say. That way, you'll never have to fear an unanticipated blow."

Coelho uses the word "heart," instead of "elephant," but I'm sure he won't mind the substitution.

Finally, I referenced a bunch of studies (in passing) in the talk, but didn't include any citations, because I ran out of time.  If you happen to have links/refs to articles, books, videos, etc. in this area, can you paste them into the comments?

If you haven't seen the original slides yet, I'll leave these as teasers.