Like many things, Python makes basic multiprocessing very easy: ~8 more lines of code can let you use all 8 cores of your laptop. In practical terms, it's lovely to improve your workflow from "run algorithm preprocessing overnight" to "run algorithm preprocessing during lunch."
Here's how it's done.
from multiprocessing import Pool

def my_function( a_single_argument ):
    # e.g. "accepts a filename and returns results in a tuple"
    ...

my_data_list = [...]
my_pool = Pool()
my_results = my_pool.map( my_function, my_data_list )
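One wrinkle: on Windows (and on macOS from Python 3.8 on), multiprocessing starts its workers by re-importing your script (the "spawn" start method), and any module-level Pool creation then fails with a RuntimeError. The fix is to keep the Pool and the map call under an if __name__ == '__main__': guard, like so (same names as above):

if __name__ == '__main__':
    # Workers re-import this script under "spawn"; the guard keeps them
    # from re-running the Pool setup and crashing.
    my_pool = Pool()
    my_results = my_pool.map( my_function, my_data_list )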
That's all it takes. A few tips:
First, the mapped function can only accept a single argument. That's usually easy to work around: wrap the arguments up as a tuple and unpack them inside the function:
def my_function( threeple_arg ):
    arg_1, arg_2, arg_3 = threeple_arg
    ...
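To make that concrete, here's a small self-contained sketch (the filenames and numbers are made up). Note that on Python 3.3+ there's also Pool.starmap, which unpacks each tuple into separate arguments for you:

from multiprocessing import Pool

def my_function(threeple_arg):
    # Unpack the tuple, then do the real work; the sum is just a stand-in.
    arg_1, arg_2, arg_3 = threeple_arg
    return (arg_1, arg_2 + arg_3)

if __name__ == '__main__':
    # Pack one tuple per task, then map over the list of tuples.
    jobs = [('a.csv', 1, 2), ('b.csv', 3, 4), ('c.csv', 5, 6)]
    with Pool() as my_pool:
        my_results = my_pool.map(my_function, jobs)
    print(my_results)   # [('a.csv', 3), ('b.csv', 7), ('c.csv', 11)]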
Second, debugging multiprocessing code is a pain. I often invoke the function serially first for debugging, then switch to multiprocessing once I know everything works:
# my_results = [my_function(item) for item in my_data_list]
It's a little hacky, but I'll often leave the line as a comment throughout development, switching between serial and parallel processing as need demands.
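If commenting lines in and out feels too fragile, the same trick works as an explicit flag (PARALLEL is just a name I'm making up here, reusing my_function, my_data_list, and my_pool from the first snippet):

PARALLEL = True   # flip to False while debugging

if PARALLEL:
    my_results = my_pool.map(my_function, my_data_list)
else:
    # Serial fallback: same results, but plain tracebacks and pdb work normally.
    my_results = [my_function(item) for item in my_data_list]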
Last, a hint on Pool: you can pass an integer to the constructor to tell it how many worker processes to use (by default it uses one per CPU core):
my_pool = Pool(5)
If you're running tasks with high latency (e.g. web spidering, or lots of disk reads/writes across a network), it sometimes makes sense to use more subprocesses than you have cores, since the workers spend most of their time waiting rather than computing. For example, I'll often throw 40 pool workers at a quick web-scraping script just to speed things up. However, if performance really matters, a Pool full of high-latency tasks is hard to tune and scale. For anything more than a one-off data grab, you'll be better off with a queue-based tool, like scrapy, Amazon SQS, or celery.
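For what it's worth, the quick-and-dirty version looks something like this (the URL list is a placeholder, fetch is a made-up helper, and 40 is just the number I tend to reach for, not a tuned value):

from multiprocessing import Pool
from urllib.request import urlopen

def fetch(url):
    # Download one page and return (url, size); a real script would parse or save it.
    with urlopen(url, timeout=30) as response:
        return url, len(response.read())

if __name__ == '__main__':
    urls = ['https://example.com'] * 50   # stand-in for your real URL list
    # Far more workers than cores is fine here: they spend most of their time
    # waiting on the network, not using the CPU.
    with Pool(40) as pool:
        results = pool.map(fetch, urls)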
HTH