Monday, May 13, 2013

Simple ways to make prettier graphs

A question on graphs from my cousin. She's good with statistics, but not a programmer.  I've fielded similar questions many times, so figured it'd be worth putting the answer in the public domain.


Any graph that includes the caption "lunch" is a good graph.  Also "nap."
Quick question-- I'm writing up a paper right now and need to stick some simple graphs in. Do you have any suggestions for ways to make graphs that are prettier than Excel to Word (low bar...ha ha! Accidental pun!)?
My response:

Love the pun. :)  [Miscellaneous personal stuff...]
On graphs: How many graphs are we talking?  If it's just a handful, or if they're all different kinds, I'd recommend Photoshop or Illustrator.  Import the graph from Excel, and then "trace over" it to give it the styling you'd like.  A lot of great data-centric presentations use this trick.
Another option is Tableau.  It's pricey, but gives you good tools for designing nice-looking graphs, as well as tools for automating them (i.e. generating 20 graphs with the same basic template.) You might be able to use a 30-day trial; and maybe their student licenses are cheaper than the corporate ones.
If you don't want to shell out for Tableau and you're doing *lots* of graphs in the same style, then it might be worth climbing the learning curve for matplotlib, ggplot2, or the Google Charts API.  I doubt this is worth your time, because there'd be such a long learning curve: each of these is a graphing library on top of a programming language, and you'd need facility with both to make them work.
Getting graphs right is fiddly work.  All those axes and labels and spaces to play with, and that's before you even begin to think about borders and color.

I've toyed with the idea of a declarative language for graphs: a syntax for describing the story you want told, without including all the execution details.  For example, "A is more Y than B" should give you a nice bar chart with a tall column labeled "A," and a shorter column labeled "B."  The y-axis should be labeled "Y."
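
To make the idea a bit more concrete, here is a minimal sketch in python of what the first step might look like: turning the one sentence pattern above into a chart description that some renderer could draw later.  Everything in it is hypothetical: the function name parse_claim, the layout of the returned dictionary, and the placeholder bar heights (1.0 and 0.6) are illustrative assumptions, not part of any existing library.

```python
import re


def parse_claim(claim):
    """Turn a claim like "A is more Y than B" into a bar chart description.

    Only this one sentence pattern is handled; anything else raises ValueError.
    """
    match = re.fullmatch(r"(\w+) is more (\w+) than (\w+)", claim.strip())
    if match is None:
        raise ValueError("unsupported claim: %r" % claim)
    taller, y_label, shorter = match.groups()
    return {
        "chart": "bar",
        "y_label": y_label,
        # The heights are arbitrary placeholders: the claim only says which
        # bar is taller, not by how much.
        "bars": [
            {"label": taller, "height": 1.0},
            {"label": shorter, "height": 0.6},
        ],
    }


if __name__ == "__main__":
    print(parse_claim("A is more Y than B"))
```

The hard parts (a richer grammar, sensible defaults for spacing and color, and the actual drawing) are exactly what this sketch leaves out.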

This strikes me as a difficult, but maybe not an impossible challenge...

Monday, May 6, 2013

Sadly, I'm not developing tengolo. Would you like to run with it?

A little over a year ago, I announced that I was working on tengolo, a python alternative to netlogo.  I'm not actively developing it -- didn't get much farther than a rough proof-of-concept, really -- but I still get questions about the package.

As a result, I find myself writing some variation on the email below at least a couple times a month:

Dear python/ABM enthusiast - 
Glad you're interested in python and ABMs.  I started on tengolo after a thorough search turned up no good ABM frameworks in python.  I worked on it for a short while, then moved on when my dissertation committee told me to focus on stuff that would actually help me graduate. :) 
I got far enough in to be confident that a python-based ABM framework like tengolo could work. All the code is in the github repository, and every month I get questions from people asking if it's being actively developed. There's clearly demand for the project, but I don't have time to support it at this point.  I'd love to see someone take this ball and run with it. 
Best,
Abe

Would you like to run with this project?  If you're good with python and want to run a potentially popular academic open-source project, tengolo would be a great fit.  Please get in touch, and I will happily direct potential users and collaborators in your direction.



Saturday, April 13, 2013

Python multiprocessing: 8x speed in 8 lines of code

At work last week I demo'ed some python parallel processing tricks.  Nothing fancy -- just standard usage of the multiprocessing library -- but these things can be a revelation if you haven't seen them before.

As with many things, python makes basic multiprocessing very easy: ~8 more lines of code can let you use all 8 cores of your laptop.  In practical terms, it's lovely to improve your workflow from "run algorithm preprocessing overnight" to "run algorithm preprocessing during lunch."


Here's how it's done.

from multiprocessing import Pool

def my_function( a_single_argument ):
    # e.g. "accepts a filename and returns results in a tuple"
    ...

my_data_list = [...]
my_pool = Pool()
my_results = my_pool.map( my_function, my_data_list )

That's all it takes.  A few tips:

First, the mapped function can only accept a single argument.  This is usually pretty easy to solve, by wrapping the arguments as a tuple:

def my_function( threeple_arg ):
    arg_1, arg_2, arg_3 = threeple_arg
    ...
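
As a runnable illustration of the tuple trick, here is a minimal sketch.  The function box_volume and the sample data are made up for the example; the point is only that each list item is one tuple, which the function unpacks itself.  The `if __name__ == "__main__":` guard is there because on platforms that start subprocesses by re-importing your script, unguarded Pool code can fail at startup.

```python
from multiprocessing import Pool


def box_volume(threeple_arg):
    # Unpack the single tuple argument into the three values we actually need.
    width, height, depth = threeple_arg
    return width * height * depth


if __name__ == "__main__":
    # Hypothetical inputs: each item is one (width, height, depth) tuple.
    boxes = [(1, 2, 3), (2, 2, 2), (4, 5, 6)]
    with Pool() as my_pool:
        # map() hands each tuple to box_volume as its one argument and
        # returns the results in the same order as the inputs.
        volumes = my_pool.map(box_volume, boxes)
    print(volumes)
```

On Python 3.3 and later, Pool.starmap does the unpacking for you, so the function can take three ordinary arguments instead.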

Second, debugging in multiprocessing is a pain.  I often invoke the function this way first for debugging, then switch to multiprocessing once I know everything works:

#my_results = [my_function(item) for item in my_data_list]

It's a little hacky, but I'll often leave the line as a comment throughout development, switching between serial and parallel processing as need demands.
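
If commenting lines in and out gets tedious, one possible alternative is a flag that picks the serial or parallel path.  This is just a sketch: the names USE_MULTIPROCESSING and run_all are invented for the example, and my_function here is a trivial stand-in for real work.

```python
from multiprocessing import Pool

# Hypothetical switch: set to False while debugging, True once things work.
USE_MULTIPROCESSING = True


def my_function(item):
    # Stand-in for the real work done on each item.
    return item * item


def run_all(my_data_list, parallel):
    if parallel:
        with Pool() as my_pool:
            return my_pool.map(my_function, my_data_list)
    # Serial path: same results, but exceptions and debuggers behave
    # normally because everything runs in the current process.
    return [my_function(item) for item in my_data_list]


if __name__ == "__main__":
    print(run_all([1, 2, 3, 4], USE_MULTIPROCESSING))
```

Both paths return a list in input order, so the rest of the script doesn't need to know which one ran.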

Last, a hint on Pool: you can pass an integer to the initialization routine to tell it how many subprocesses to use:

my_pool = Pool(5)

If you omit the argument, python assumes you want to run as many subprocesses (not threads!) as you have cores (e.g. 8 on a MacBook Pro).  If you want to save some processing power for other tasks, you might want to specify something lower, like 6.
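
Rather than hard-coding a number, you could compute the pool size from the machine you're on.  A small sketch, assuming you want to leave two cores free; the helper workers_to_use and the toy function double are hypothetical names for this example.  Note that os.cpu_count() reports the CPUs the operating system sees, which may count hyperthreads rather than physical cores.

```python
import os
from multiprocessing import Pool


def workers_to_use(cores_to_spare=2):
    # os.cpu_count() can return None if the count can't be determined.
    total = os.cpu_count() or 1
    # Always keep at least one worker, even on a small machine.
    return max(1, total - cores_to_spare)


def double(item):
    return item * 2


if __name__ == "__main__":
    with Pool(workers_to_use()) as my_pool:
        print(my_pool.map(double, [1, 2, 3]))
```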

If you're running tasks with high latency (e.g. web spidering, or lots of disk read/writes across a network) it sometimes makes sense to use more subprocesses than you have cores.  For example, I'll often throw 40 pool workers at a quick web-scraping script, just to speed things up.  However, if performance really matters, Pools with latency are very hard to tune and scale.  For anything more than a one-off data grab, you'll be better off with a queue-based tool, like scrapy, Amazon SQS, or celery.
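
Here is a rough sketch of the "more workers than cores" idea.  To keep it self-contained, the network call is faked with time.sleep; slow_fetch and the example.com URLs are placeholders, not a real scraper.

```python
import time
from multiprocessing import Pool


def slow_fetch(url):
    # Stand-in for a high-latency request: the process mostly waits,
    # so it barely uses the CPU while it is "fetching".
    time.sleep(0.1)
    return (url, len(url))


if __name__ == "__main__":
    urls = ["http://example.com/page/%d" % i for i in range(40)]
    # 40 workers is far more than most laptops have cores, but each one
    # spends nearly all of its time waiting rather than computing.
    with Pool(40) as my_pool:
        results = my_pool.map(slow_fetch, urls)
    print(len(results))
```

Because the workers are idle while they wait, the whole batch should finish in roughly the time of a few requests instead of forty in a row, though the exact speedup depends on the machine and on process startup cost.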

HTH