tag:blogger.com,1999:blog-3012188750785039952024-03-24T19:52:57.338-07:00compSocSciThoughts on computation, social science, and lifehacking <br>from an up-and-coming data scientist.Unknownnoreply@blogger.comBlogger78125tag:blogger.com,1999:blog-301218875078503995.post-47576482314510472422013-09-23T20:56:00.003-07:002013-09-23T20:56:30.003-07:00Change of address: blog.abegong.comI've moved! Since finishing grad school, I've decided to fold my personal web page and blog together. In the future, I'll be posting to <a href="http://blog.abegong.com/">http://blog.abegong.com</a>.<br />
<br />
Also, my posts will focus on "data science" instead of "computational social science." They're pretty much the same thing, but "data science" seems to be the phrase that's catching on.<br />
<br />
See you there!<br />
<div>
<br /></div>
Unknownnoreply@blogger.com6tag:blogger.com,1999:blog-301218875078503995.post-64773884910618213482013-07-10T09:41:00.003-07:002013-07-10T09:41:48.891-07:00Speed up hadoop development with progressive testingDebugging Hadoop jobs can be a huge pain. The cycle time is slow, and error messages are often uninformative --- especially if you're using Hadoop streaming, or working on EMR.
<br />
<br />
I once found myself trying to debug a job that took a full six hours to fail. It took more than a week -- a whole week! -- to find and fix the problem. Of course, I was doing other things at the same time, but the need to constantly check up on the status of the job was a huge drain on my energy and productivity. It was a Very Bad Week.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://upload.wikimedia.org/wikipedia/commons/f/fe/Crushed_by_elephant.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="http://upload.wikimedia.org/wikipedia/commons/f/fe/Crushed_by_elephant.png" width="184" /></a></div>
<br />
<br />
Painful experiences like this have taught me to follow a test-driven approach to hadoop development. Whenever I'm working on a new hadoop-based data pipe, my goal is to isolate six distinct kinds of problems that arise in hadoop development.<br />
<br />
<ol>
<li>Explore the data: The pipe must accept data from a given format, which might not be fully understood at the outset.</li>
<li>Test basic logic: The pipe must execute the intended data transformation for "normal" data. </li>
<li>Test edge cases: The pipe must deal gracefully with edge cases, missing or misformatted fields, rare divide-by-zeroes, etc. </li>
<li>Test deployment parameters: The pipe must be deployable on hadoop, with all the right filenames, code dependencies, and permissions.</li>
<li>Test cluster performance: For big enough jobs, the pipe must run efficiently. If not, we need to tune or scale up the cluster.</li>
<li>Test scheduling parameters: Once pipes are built, routine jobs must be scheduled and executed.</li>
</ol>
<div>
<br /></div>
Each of these steps requires different test data and different methods for trapping and diagnosing errors. Therefore, the goal is to make sure to (1) tackle problems one at a time, and (2) solve each kind of problem in the environment with the fastest cycle time.<br />
<br />
Steps 1 through 3 should be solved locally, using progressively larger data sets. Steps 4 and 5 must be run remotely, again using progressively larger data sets.<br />
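A minimal sketch of the local half of this workflow (steps 1 through 3). The mapper <code>parse_record</code>, the tab-separated input format, and the sample rows are all hypothetical, made up for illustration; the point is only to show one function being run against progressively larger samples, with malformed rows counted rather than crashing the run.<br />

```python
import itertools

def parse_record(line):
    """Hypothetical mapper: 'user<TAB>clicks<TAB>views' -> (user, click rate)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 3:
        return None                      # edge case: missing or extra fields
    user, clicks, views = fields
    try:
        clicks, views = int(clicks), int(views)
    except ValueError:
        return None                      # edge case: misformatted numbers
    if views == 0:
        return None                      # edge case: would divide by zero
    return user, clicks / views

def run_locally(lines, sample_size):
    """Run the mapper over the first sample_size lines; return (good rows, bad count)."""
    good, bad = [], 0
    for line in itertools.islice(lines, sample_size):
        result = parse_record(line)
        if result is None:
            bad += 1
        else:
            good.append(result)
    return good, bad

# Made-up sample data, including a few deliberately broken rows.
SAMPLE = ["alice\t3\t10\n", "bob\t0\t0\n", "carol\tx\t5\n", "dave\t1\t4\n", "broken row\n"]

if __name__ == "__main__":
    for size in (2, 4, len(SAMPLE)):     # progressively larger local samples
        good, bad = run_locally(SAMPLE, size)
        print(size, len(good), bad)
```

Counting the bad rows at each sample size is one cheap way to notice when a larger sample introduces a new kind of edge case.<br />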
<br />
Step 6 depends on your scheduling system and has a very slow cycle time (i.e. you must wait a day to test whether your daily jobs run on the proper schedule). However, it's independent of hadoop, so you can build, test, and deploy it separately. (There may be some crossover with #4, but you can test this with small data sets.)<br />
<br />
Going through six different rounds of testing may seem like overkill, but in my experience it's absolutely worth it. Very likely, you'll encounter at least one new bug/mistake/unanticipated case at each stage. Progressive testing ensures that each bug is dealt with as quickly as possible, and prevents them from ganging up on you.<br />
<br />
Other suggestions:<br />
<ul>
<li>Definitely use an abstraction layer that allows you to seamlessly deploy local code to your staging and production clusters. <a href="https://github.com/nathanmarz/cascalog">Cascalog</a> and <a href="https://github.com/Yelp/mrjob">mrJob</a> are good examples. Otherwise, you'll find yourself solving steps 2 and 3 all over again in deployment.</li>
</ul>
<ul>
<li>Config files and object-oriented code can reduce a lot of headaches in step 4. Most of your deployment hooks can be written once and saved in a config file. If you have strong naming conventions, then most of your filenames can be constructed (and tested) programmatically. It's amazing how many hours you can waste debugging a simple typo in hadoop. Good OOP will spare you many of these headaches.</li>
</ul>
<ul>
<li>Part of the beauty of <a href="http://hive.apache.org/">Hive</a> and <a href="http://hbase.apache.org/">HBase</a> is that they abstract away most of the potential pitfalls on the deployment side, especially in step 4. By the same token, tools like <a href="http://data.linkedin.com/opensource/azkaban">Azkaban</a> and <a href="http://oozie.apache.org/">Oozie</a> can take a lot of the pain out of step 6. (Be careful, though -- each of these scheduling tools has its limitations.)</li>
</ul>
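The naming-convention suggestion above might look something like the sketch below. The bucket name, job name, and directory layout are all hypothetical; the idea is just that paths are built by small functions you can test locally, instead of being typed by hand into each job.<br />

```python
from datetime import date

# Hypothetical deployment settings; in practice these might be loaded from a config file.
CONFIG = {
    "bucket": "s3://example-bucket",
    "job_name": "daily_rollup",
}

def input_path(config, day):
    """Build the input path for one day from the (assumed) naming convention."""
    return "{bucket}/raw/{day:%Y/%m/%d}/".format(bucket=config["bucket"], day=day)

def output_path(config, day):
    """Build the output path for one day's run of the job."""
    return "{bucket}/{job}/{day:%Y-%m-%d}/".format(
        bucket=config["bucket"], job=config["job_name"], day=day)

if __name__ == "__main__":
    print(input_path(CONFIG, date(2013, 7, 10)))
    print(output_path(CONFIG, date(2013, 7, 10)))
```

Because the paths come from functions, a typo shows up in a fast local test rather than hours into a cluster run.<br />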
<br />Unknownnoreply@blogger.com28tag:blogger.com,1999:blog-301218875078503995.post-9051568830715149592013-06-17T21:33:00.000-07:002013-06-17T21:33:00.690-07:00Scientists leaving the academy: Pushed, or pulled? Several of my friends have shared and commented on <a href="http://chronicle.com/article/On-Leaving-Academe/133717/?goback=.gde_1844342_member_240166328">this article in the Chronicle of Higher Education</a>: "On Leaving Academe." The author is Terran Lane, a former computer science professor at the University of New Mexico.<br />
<br />
The article starts with the (shocking!) revelation that he is leaving his position as a professor to work for Google. Lane then lists nine reasons for leaving:<br />
<br />
<br />
<ol>
<li>Making a difference</li>
<li>Work-life imbalance</li>
<li>Centralization of authority and decrease of autonomy</li>
<li>Budget climate</li>
<li>Hyperspecialization, insularity, and narrowness of vision</li>
<li>Poor incentives</li>
<li>Mass production of education</li>
<li>Salaries</li>
<li>Anti-intellectualism, anti-education, and attacks on science and academe</li>
</ol>
<br />
<br />
<br />
The tone of the article is very negative. Lane frames most of his complaints as forces that are <i>pushing</i> him out of the University. Honestly, it feels a little bit bitter.<br />
<br />
As I've discussed this with friends, I've decided that I disagree with the tone, if not the reasons. I've also made a similar decision to -- temporarily, at least -- leave the academy for the private sector. But I see the whole experience in a much more positive light.<br />
<br />
As I see it, there are growing incentives to find applications for science outside the academy. Since I got into the startup world, I've met lots of psychologists, economists, and even the occasional political scientist who are building consumer-facing tools based on well-founded theories of social science.<br />
<br />
To me, this feels like an emerging renaissance in applied social science. In other words, it's not just the case that smart, ambitious people are being <i>pushed</i> out of academia; they're being <i>pulled</i> out as well.<br />
<br />
In the past, most career paths allowed you to seek the truth OR change the world, but not both. I'm optimistic that the rising volume and value of data is going to give more scientifically-minded people the chance to have their cake and analyze it too. Eliminating artificial distinctions between "thinkers" and "doers" is good for society overall.<br />
<br />
<br />
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-301218875078503995.post-15186996856886267592013-06-10T10:02:00.000-07:002013-06-10T10:02:00.515-07:00Crazy walls and dissertation graph insanityHere's my dissertation to-do list from last week. It's basically my own <a href="http://crazywalls.tumblr.com/">crazy wall</a><span id="goog_765120430"></span><span id="goog_765120431"></span><a href="http://www.blogger.com/"></a> (<a href="http://now-here-this.timeout.com/2012/10/07/crazy-walls-of-clues-from-tv-film-reviewed-by-carrie-from-homeland/">more here</a>), on a clipboard.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5RtvLnnlyQYNDRo5NLJEYEg1CuG88ESJeFUpWBExja1MLAKVEEOW0DZp3famYOZ7K_r799EIFt_hLqp5ekkBxp7i35RMoQmX8CNjZUzOPLHK0ojPGOXuGbJ3CYB0i61hYZSCJV0mZpz0Z/s1600/IMAG0459.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj5RtvLnnlyQYNDRo5NLJEYEg1CuG88ESJeFUpWBExja1MLAKVEEOW0DZp3famYOZ7K_r799EIFt_hLqp5ekkBxp7i35RMoQmX8CNjZUzOPLHK0ojPGOXuGbJ3CYB0i61hYZSCJV0mZpz0Z/s400/IMAG0459.jpg" width="225" /></a></div>
<br />
<br />
More evidence of dissertation-induced insanity:<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYPgOpSCEpsoGt9yu_pjKK005UWT_7J2GMaL-M5V5GS3GhgfNq1xGyWxGBYaTQU9qlHoUS58mGlQhZo_ZpjDw56x_ifsTSsobu2d6wYk0SGp7uWYl8_7uo2kvO4eJrJlvIhgSGHA4fkQIU/s1600/kripp-curves.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="301" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjYPgOpSCEpsoGt9yu_pjKK005UWT_7J2GMaL-M5V5GS3GhgfNq1xGyWxGBYaTQU9qlHoUS58mGlQhZo_ZpjDw56x_ifsTSsobu2d6wYk0SGp7uWYl8_7uo2kvO4eJrJlvIhgSGHA4fkQIU/s400/kripp-curves.png" width="400" /></a></div>
I made this graph last week, late at night, to answer a real research question. ("What are the effective krippendorff's alpha scores of averaged ensembles of five mechanical turkers, given individual alphas of .2, .3, .4, and .5?")<br />
<br />
When I woke up in the morning and looked at it again (and tried to explain it to Erin), I realized that the graph made no sense, and the process I had used to create it was completely bonkers.<br />
<br />
I'm going to get through this thing. But after the defense, I think I'm going to need some serious mental detox.<br />
<br />
<br />
Moral of the story: friends don't let friends do PhDs.Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-301218875078503995.post-76529668638771201112013-06-03T08:55:00.000-07:002013-06-03T08:55:00.263-07:00A map of me: Life-tracking with funf<br />
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
I've been running <a href="http://funf.org/">funf</a> ever since I arrived in California -- almost six months now. If you haven't seen it before, funf is a great little android app for passive life logging. It can track anything your phone can track: GPS, accelerometer, battery, text messages, etc.</div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
The downside is that it's buggy. I ended up having to use a very hacky workaround solution to get my data off: exporting data via email, downloading the zips, then running the <a href="https://code.google.com/p/funf-open-sensing-framework/downloads/list">funf_analyze_mac script</a> to parse them all to sql. It's a clunky pipeline, but works for the moment.</div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
Here's the first payoff: a map of everywhere I went between November and February. It's pretty neat to be able to see the neighborhood where I work in San Francisco, the train line along the edge of the bay, the two cities where I've lived since moving here, and a few trips south and west to Los Gatos and San Jose.<br />
<br />
I haven't done much work to make this beautiful, but I find it very engaging anyway. (It's my data, after all.)<br />
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaGb42gnPCt2MEvvQqEOCnQSTZ4l0HBvU2qX4iJxwMnB30SdyoN4aJ1kfgsQWPcnquMvetr8S-9oCRKL9isgBuqj8parRsn23edL_SNFDana_OMGaINT60L6B7JzJHfa6T7vyaMIQIkHrM/s1600/alphatest.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="2" height="400" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhaGb42gnPCt2MEvvQqEOCnQSTZ4l0HBvU2qX4iJxwMnB30SdyoN4aJ1kfgsQWPcnquMvetr8S-9oCRKL9isgBuqj8parRsn23edL_SNFDana_OMGaINT60L6B7JzJHfa6T7vyaMIQIkHrM/s400/alphatest.jpg" width="400" /></a></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<br /></div>
<div style="background-color: white; color: #222222; font-family: arial, sans-serif; font-size: 13px;">
<div>
My next plan is to write a script to isolate places where I go often or spend a lot of time, and then mash those locations up with data from other sources based on timestamps.</div>
<div>
<br /></div>
</div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-301218875078503995.post-53062683701109251062013-05-27T08:29:00.000-07:002013-05-27T08:29:00.385-07:00Crocodile hunters versus zoo keepers: Aspiring data scientists should speak python<br />
<div style="background-color: white; color: #222222;">
We're <a href="https://jawbone.com/careers">hiring data scientists at Jawbone</a>, which means I've been spending a lot of time reviewing resumes and interviewing candidates. Making the "bring him/her in" or "pass" decision dozens of times an hour has been a great opportunity to focus on the features that make great data science candidates stand out from the crowd.</div>
<div style="background-color: white; color: #222222;">
<br /></div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFKfuHQ-Q_0NMrzF1zCtSbDwakiBNHahCxuN4G9xhIlXFSEbu3swcMBZWNRJA6UhV2VtBo9yHlNcpOBXu0KQ8wDnsx0W7xCPCE6jCGS_oRy-gLV6J7WUvsqbdL3Bnd_WgEN_CkRrDIlhLl/s1600/original.jpg" imageanchor="1" style="clear: left; float: left; margin-bottom: 1em; margin-right: 1em;"> <img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhFKfuHQ-Q_0NMrzF1zCtSbDwakiBNHahCxuN4G9xhIlXFSEbu3swcMBZWNRJA6UhV2VtBo9yHlNcpOBXu0KQ8wDnsx0W7xCPCE6jCGS_oRy-gLV6J7WUvsqbdL3Bnd_WgEN_CkRrDIlhLl/s200/original.jpg" width="135" /></a></div>
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKBZ4i-XsLO-APplDkGUFcvPE515r7HCwtnsiDNzTuQuraZcSRBZTQNKpTf9oOZ9JlBTmij_SeYcAXZz7xwxLMqTfdcZJswt1Nu6HrSqQgmNJOciTPUCBY-oNC0HPrOWjdgNxzdDt8JzkX/s1600/russdaff-12116-zoo-entrancex-267372.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" height="200" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiKBZ4i-XsLO-APplDkGUFcvPE515r7HCwtnsiDNzTuQuraZcSRBZTQNKpTf9oOZ9JlBTmij_SeYcAXZz7xwxLMqTfdcZJswt1Nu6HrSqQgmNJOciTPUCBY-oNC0HPrOWjdgNxzdDt8JzkX/s200/russdaff-12116-zoo-entrancex-267372.jpg" width="180" /></a><br />
<br />
<div style="background-color: white; color: #222222;">
<br /></div>
<div style="background-color: white; color: #222222;">
What I've learned: <b>Python is a great differentiator for job-hunting data scientists</b>. In the mix of resumes I review, it's the single biggest thing I look for. Putting python high on your resume says, "Not only do I grok statistics/machine learning/R/SQL/Matlab/Octave/d3/etc, I can build the pipes and plumbing that make data systems work."</div>
<div style="background-color: white; color: #222222;">
<br /></div>
<div style="background-color: white; color: #222222;">
When I see resumes with a strong background in practical programming, I think "Crocodile Hunter." When those things are missing, I think "zoo keeper." Zoo keepers aren't bad, but I find they need hand-holding every time you go into the jungle to wrangle wild data.</div>
<div style="background-color: white; color: #222222;">
<br /></div>
<div style="background-color: white; color: #222222;">
For whatever reason*, I find python sends a stronger signal of these skills than Java, C++, or any other mainstream programming language. Not having python on your resume is a big handicap in the review process -- not an absolute deal-breaker, but in practice we hardly ever end up interviewing candidates without at least basic python skills.</div>
<div style="background-color: white; color: #222222;">
<br /></div>
<div style="background-color: white; color: #222222;">
<span style="font-family: inherit;">With that said, "python" is an awfully big field to survey. What python libraries are the most useful, really? I've been asked this question several times recently. Here's what I said in a recent email:</span></div>
<blockquote class="tr_bq">
For python, I'd highly recommend installing <a href="http://ipython.org/" style="color: #1155cc;" target="_blank">ipython</a>*, and getting your feet wet with the <a href="http://pandas.pydata.org/" style="color: #1155cc;" target="_blank">pandas</a> library. The <a href="http://matplotlib.org/" style="color: #1155cc;" target="_blank">matplotlib</a>, <a href="http://docs.python.org/2/library/json.html" style="color: #1155cc;" target="_blank">json</a>, and <a href="http://docs.python-requests.org/en/latest/" style="color: #1155cc;" target="_blank">requests</a> libraries are also good places to know your way around. <a href="http://www.numpy.org/" style="color: #1155cc;" target="_blank">numpy</a> and <a href="http://www.scipy.org/" style="color: #1155cc;" target="_blank">scipy</a> have good stuff in them, but they're so huge that it's hard to really "know" them. <a href="https://github.com/boto/boto" style="color: #1155cc;" target="_blank">boto</a> is also worth knowing, but it's only useful if you have a subscription to Amazon Web Services (AWS). </blockquote>
<blockquote class="tr_bq">
For hadoop, I'd look specifically at <a href="https://github.com/Yelp/mrjob" style="color: #1155cc;" target="_blank">mrJob</a>. It's a python-based wrapper that makes hadoop easier to use. mrJob is too new to be covered in books yet, but the online documentation is pretty good. You can use it in test mode even without installing hadoop.</blockquote>
<span style="font-family: inherit;"><span style="background-color: white; color: #222222;">Disclaimer: I know that talking up one language over another is one of the cardinal sins of code. I don't want to start a religious war here -- I'm not saying that python is </span><i style="background-color: white; color: #222222;">better</i><span style="background-color: white; color: #222222;"> than C++, or that pandas is the only way to manipulate data in python (although I believe Wes McKinney deserves a platinum-plated "better than sliced bread" award) -- I'm just saying that in the rough-and-tumble of the hiring process, data scientists with real python skills have a big advantage. </span></span><br />
<span style="background-color: white; color: #222222; font-family: inherit;"><br /></span>
<span style="background-color: white; color: #222222; font-family: inherit;">Aspiring data scientists, you have been warned.</span><br />
<span style="font-family: inherit;"><span style="background-color: white; color: #222222;"><br /></span></span>
<span style="font-family: inherit;"><span style="background-color: white; color: #222222;">Practicing data scientists, what would you add to this list? I've plugged in my favorite modules, but I'm sure there are others also worth a mention.</span></span><br />
<span style="font-family: inherit;"><span style="background-color: white; color: #222222;"><br /></span></span>
<br />
<div style="background-color: white; color: #222222;">
<div>
<span style="font-family: inherit;">*I freely admit that the reason may just be in my head. However, this preference for python seems to be in a lot of other people's heads as well. Whether there's a real reason for it or it's just a cognitive bias, it's working as a gating factor for a lot of resumes.</span></div>
</div>
Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-301218875078503995.post-89976763169520452542013-05-13T08:47:00.000-07:002013-05-13T08:47:00.569-07:00Simple ways to make prettier graphs<span style="background-color: white; color: #222222;"><span style="font-family: inherit;">A question on graphs from my cousin. She's good with statistics, but not a programmer. I've fielded similar questions many times, so figured it'd be worth putting the answer in the public domain.</span></span><br />
<span style="background-color: white; color: #222222;"><span style="font-family: inherit;"><br /></span></span>
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://i.stack.imgur.com/prhAY.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="202" src="http://i.stack.imgur.com/prhAY.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">Any graph that includes the caption "lunch" is a good graph. Also "nap."</td></tr>
</tbody></table>
<blockquote class="tr_bq">
<span style="background-color: white; color: #222222;"><span style="font-family: inherit;">Quick question-- I'm writing up a paper right now and need to stick some simple graphs in. Do you have any suggestions ways to make graphs that are prettier than Excel to Word (low bar...ha ha! Accidental pun!)?</span></span></blockquote>
<span style="background-color: white; color: #222222;"><span style="font-family: inherit;">My response:</span></span><br />
<br />
<blockquote class="tr_bq">
<span style="font-family: inherit;">Love the pun. :) [Miscellaneous personal stuff...]</span></blockquote>
<blockquote class="tr_bq">
<span style="font-family: inherit;">On graphs: How many graphs are we talking? If it's just a handful, or if they're all different kinds, I'd recommend Photoshop or Illustrator. Import the graph from excel, and then "trace over" it to give it the styling you'd like. A lot of great data-centric presentations use this trick.</span> </blockquote>
<blockquote class="tr_bq">
<span style="font-family: inherit;">Another option is <a href="http://www.tableausoftware.com/" style="color: #1155cc;" target="_blank">tableau</a>. It's pricey, but gives you good tools for designing nice-looking graphs, as well as tools for automating them (i.e. generating 20 graphs with the same basic template). You might be able to use a 30-day trial; and maybe their student licenses are cheaper than the corporate ones.</span> </blockquote>
<blockquote class="tr_bq">
<span style="font-family: inherit;">If you don't want to shell out for tableau and you're doing *lots* of graphs in the same style, then it might be worth climbing the learning curve for <a href="http://matplotlib.org/" style="color: #1155cc;" target="_blank">matplotlib</a>, <a href="http://ggplot2.org/" style="color: #1155cc;" target="_blank">ggplot2</a>, or the <a href="https://developers.google.com/chart/" style="color: #1155cc;" target="_blank">google charts API</a>. I doubt this is worth your time, because there'd be such a long learning curve: each of these is a graphing library on top of a programming language, and you'd need facility with both to make them work.</span></blockquote>
<div style="background-color: white; color: #222222;">
<span style="font-family: inherit;">Getting graphs right is fiddly work. All those axes and labels and spaces to play with, and that's before you even begin to think about borders and color.</span></div>
<div style="background-color: white; color: #222222;">
<span style="font-family: inherit;"><br /></span></div>
<div style="background-color: white; color: #222222;">
<span style="font-family: inherit;">I've toyed with the idea of a <a href="http://en.wikipedia.org/wiki/Declarative_programming">declarative language</a> for graphs: a syntax for describing the story you want told, without including all the execution details. For example, "A is more Y than B" should give you a nice bar chart with a tall column labeled "A," and a shorter column labeled "B." The y-axis should be labeled "Y."</span></div>
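As a toy illustration only (not a real design), here is a sketch of how the single sentence form "A is more Y than B" could be turned into a chart description. The function name <code>parse_claim</code>, the output dictionary, and its keys are all made up for this example, and nothing is actually drawn; a renderer would still have to decide heights, colors, and spacing.<br />

```python
import re

# Matches only the one sentence form discussed above: "<A> is more <Y> than <B>".
PATTERN = re.compile(
    r"^(?P<taller>.+?) is more (?P<measure>.+?) than (?P<shorter>.+?)\.?$")

def parse_claim(sentence):
    """Turn a comparative claim into a (hypothetical) bar chart specification."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        raise ValueError("unsupported claim: %r" % sentence)
    return {
        "chart": "bar",
        "y_label": match.group("measure"),
        "bars": [
            {"label": match.group("taller"), "relative_height": "tall"},
            {"label": match.group("shorter"), "relative_height": "short"},
        ],
    }

if __name__ == "__main__":
    print(parse_claim("A is more Y than B"))
```

Even this tiny case hints at the hard part: every new kind of story needs its own grammar and its own sensible defaults.<br />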
<div style="background-color: white; color: #222222;">
<span style="font-family: inherit;"><br /></span></div>
<div style="background-color: white; color: #222222;">
<span style="font-family: inherit;">This strikes me as a difficult, but maybe not an impossible challenge...</span></div>
Unknownnoreply@blogger.com3tag:blogger.com,1999:blog-301218875078503995.post-9220080193939985772013-05-06T08:50:00.000-07:002013-05-06T08:50:00.273-07:00Sadly, I'm not developing tengolo. Would you like to run with it?<div class="tr_bq">
</div>
<div style="color: #222222; font-family: arial;">
A little over a year ago, I announced that I was working on <a href="http://compsocsci.blogspot.com/2012/02/announcing-tengolo-python-alternative.html">tengolo, a python alternative to netlogo</a>. I'm not actively developing it -- didn't get much farther than a rough proof-of-concept, really -- but I still get questions about the package.</div>
<div style="color: #222222; font-family: arial;">
<br /></div>
<div style="color: #222222; font-family: arial;">
As a result, I find myself writing some variation on the email below at least a couple times a month:</div>
<div style="color: #222222; font-family: arial; font-size: small;">
<br /></div>
<blockquote>
Dear python/ABM enthusiast - </blockquote>
<blockquote>
Glad you're interested in python and ABMs. I started on tengolo after a thorough search turned up no good ABM frameworks in python. I worked on it for a short while, then moved on when my dissertation committee told me to focus on stuff that would actually help me graduate. :) </blockquote>
<blockquote>
I got far enough in to be confident that a python-based ABM framework like tengolo could work. All the code is in the <a href="https://github.com/agong/tengolo" style="color: #1155cc;">github repository</a>, and every month I get questions from people asking if it's being actively developed. There's clearly demand for the project, but I don't have time to support it at this point. I'd love to see someone take this ball and run with it. </blockquote>
<blockquote>
Best,<br />
Abe</blockquote>
<br />
<span style="color: #222222; font-family: arial;">Would you like to run with this project? If you're good with python and want to run a potentially popular academic open-source project, tengolo would be a great fit. Please get in touch, and I will happily direct potential users and collaborators in your direction.</span><br />
<span style="color: #222222; font-family: arial; font-size: x-small;"><br /></span>
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFZLvsRBvdP4YlyeowKejcLuSspPGoeG54kguXDfvL93clW1v71GnCGIjRK15TE5efi2n4ziCrY6ocT4hu8CmqBTDA2Fb9n0k57-HDbDo3zUPczwx6cngAYDNAHG97eBso38alCpyJ8fLf/s320/no+turtle+touching.JPG" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiFZLvsRBvdP4YlyeowKejcLuSspPGoeG54kguXDfvL93clW1v71GnCGIjRK15TE5efi2n4ziCrY6ocT4hu8CmqBTDA2Fb9n0k57-HDbDo3zUPczwx6cngAYDNAHG97eBso38alCpyJ8fLf/s320/no+turtle+touching.JPG" /></a></div>
<span style="color: #222222; font-family: arial; font-size: x-small;"><br /></span>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-301218875078503995.post-45958140993577311632013-04-13T09:38:00.003-07:002013-04-13T09:38:40.382-07:00Python multiprocessing: 8x speed in 8 lines of codeAt work last week I demo'ed some python parallel processing tricks. Nothing fancy -- just standard usage of the <a href="http://docs.python.org/2/library/multiprocessing.html">multiprocessing library</a> -- but these things can be a revelation if you haven't seen them before.<br />
<br />
As with many things, python makes basic multiprocessing very easy: ~8 more lines of code can let you use all 8 cores of your laptop. In practical terms, it's lovely to improve your workflow from "run algorithm preprocessing overnight" to "run algorithm preprocessing during lunch."<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_c6j2uz_oZrZgxA5SWQHelswl1B6ZW_6jDVFOWH84dzYAzOnVMOyk2Etmrc3DVaVh3tDP_Mc4bQmu-wIXWAlJQZyMh8qd5y-4AEiksk78pascPJtTcFJauuwRo8fgLkCeu0wz2TOoi4sa/s1600/pypvm-small.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi_c6j2uz_oZrZgxA5SWQHelswl1B6ZW_6jDVFOWH84dzYAzOnVMOyk2Etmrc3DVaVh3tDP_Mc4bQmu-wIXWAlJQZyMh8qd5y-4AEiksk78pascPJtTcFJauuwRo8fgLkCeu0wz2TOoi4sa/s1600/pypvm-small.png" /></a></div>
<br />
Here's how it's done.<br />
<div class="p2">
<br /></div>
<div class="p3">
<span style="font-family: Courier New, Courier, monospace;">from multiprocessing import Pool</span></div>
<div class="p4">
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div class="p3">
<span style="font-family: Courier New, Courier, monospace;">def my_function( a_single_argument ):</span></div>
<div class="p3">
<span style="font-family: Courier New, Courier, monospace;"> # e.g. "accepts a filename and returns results in a tuple"</span></div>
<div class="p3">
<span style="font-family: Courier New, Courier, monospace;"> ...</span></div>
<div class="p4">
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div class="p3">
<span style="font-family: Courier New, Courier, monospace;">my_data_list = [...]</span></div>
<div class="p3">
<span style="font-family: Courier New, Courier, monospace;">my_pool = Pool()</span></div>
<div class="p3">
<span style="font-family: Courier New, Courier, monospace;">my_results = my_pool.map( my_function, my_data_list )</span></div>
<div class="p2">
<br /></div>
<div class="p1">
That's all it takes. A few tips:</div>
<div class="p2">
<br /></div>
<div class="p1">
First, the mapped function can only accept a single argument. This is usually pretty easy to solve, by wrapping the arguments as a tuple:</div>
<div class="p4">
<br /></div>
<div class="p3">
<span style="font-family: Courier New, Courier, monospace;">def my_function( threeple_arg ):</span></div>
<div class="p3">
<span style="font-family: Courier New, Courier, monospace;"> arg_1, arg_2, arg_3 = </span><span style="font-family: 'Courier New', Courier, monospace;">threeple_arg</span></div>
<div class="p3">
<span style="font-family: 'Courier New', Courier, monospace;"> ...</span></div>
<div class="p4">
<br /></div>
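A complete, runnable version of the tuple trick might look like this. The function <code>weighted_sum</code> and the sample data are made up for illustration; the <code>if __name__ == "__main__"</code> guard is there because some platforms re-import the script in each subprocess.<br />

```python
from multiprocessing import Pool

def weighted_sum(threeple_arg):
    """Unpack one tuple into the three values the function really needs."""
    arg_1, arg_2, arg_3 = threeple_arg
    return arg_1 + 2 * arg_2 + 3 * arg_3

# Each list item is one tuple, so Pool.map still passes a single argument per call.
my_data_list = [(1, 2, 3), (4, 5, 6), (7, 8, 9)]

if __name__ == "__main__":
    my_pool = Pool()
    my_results = my_pool.map(weighted_sum, my_data_list)
    my_pool.close()   # no more work will be submitted
    my_pool.join()    # wait for the subprocesses to exit
    print(my_results)
```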
<div class="p1">
Second, debugging in multiprocessing is a pain. I often invoke the function this way first for debugging, then switch to multiprocessing once I know everything works:</div>
<div class="p2">
<br /></div>
<div class="p3">
<span style="font-family: Courier New, Courier, monospace;">#my_results = [my_function(item) for item in my_data_list]</span></div>
<div class="p2">
<br /></div>
<div class="p1">
It's a little hacky, but I'll often leave the line as a comment throughout development, switching between serial and parallel processing as need demands.</div>
<div class="p1">
<br /></div>
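If commenting and uncommenting feels too hacky, one alternative is a flag that picks between the two code paths. This is only a sketch: <code>run_all</code> and its <code>parallel</code> parameter are names invented here, and <code>my_function</code> is a stand-in that just squares its input.<br />

```python
from multiprocessing import Pool

def my_function(item):
    """Stand-in for real work: square the input."""
    return item * item

def run_all(my_data_list, parallel=True):
    """Run my_function over the data, serially (easy to debug) or with a Pool."""
    if not parallel:
        # Serial path: ordinary tracebacks, and pdb or print statements work as usual.
        return [my_function(item) for item in my_data_list]
    my_pool = Pool()
    try:
        return my_pool.map(my_function, my_data_list)
    finally:
        my_pool.close()
        my_pool.join()

if __name__ == "__main__":
    print(run_all(range(5), parallel=False))
    print(run_all(range(5), parallel=True))
```

Both paths return results in the same order, so the rest of the script doesn't need to know which one ran.<br />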
<div class="p1">
Last, a hint on Pool: you can pass an integer to the initialization routine to tell it how many subprocesses to use:</div>
<div class="p2">
<br /></div>
<div class="p3">
<span style="font-family: Courier New, Courier, monospace;">my_pool = Pool(5)</span></div>
<div class="p2">
<br /></div>
If you omit the argument, Python assumes you want to run as many subprocesses (not threads!) as you have cores (e.g. 8 on a MacBook Pro). If you want to save some processing power for other tasks, you might want to specify something lower, like 6.<br />
<br />
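If you'd rather not hard-code the number, here's a rough sketch that leaves a couple of cores free. The helper name <code>pick_workers</code> and the default of 2 reserved cores are my own assumptions, so tune them to taste:<br />
<br />

```python
import multiprocessing


def pick_workers(reserve=2):
    """Number of Pool workers that leaves `reserve` cores free (at least 1)."""
    cores = multiprocessing.cpu_count()
    return max(1, cores - reserve)


if __name__ == "__main__":
    # Compare the machine's core count with the suggested pool size.
    print(multiprocessing.cpu_count(), pick_workers())
```

You would then create the pool with <code>Pool(pick_workers())</code>.<br />
<br />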
If you're running tasks with high latency (e.g. web spidering, or lots of disk read/writes across a network) it sometimes makes sense to use more subprocesses than you have cores. For example, I'll often throw 40 pool workers at a quick web-scraping script, just to speed things up. However, if performance really matters, Pools with latency are very hard to tune and scale. For anything more than a one-off data grab, you'll be better off with a queue-based tool, like <a href="http://scrapy.org/">scrapy</a>, <a href="http://aws.amazon.com/sqs/">Amazon SQS</a>, or <a href="http://www.celeryproject.org/">celery</a>.<br />
<br />
HTH<br />
<br />
Hacking Scrabble with Cascalog (posted 2013-03-15, updated 2013-04-04)<br />
We've barely met, but I am in love with Cascalog. So elegant. So powerful. So easy to ship from testing to staging to production. So perfect for the workflow abstraction where most data science happens. Let me count the ways...<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://technomancy.us/i/leiningen.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://technomancy.us/i/leiningen.jpg" /></a></div>
<br />
<br />
The only downside is the shortage of documentation and examples. There's an active google group, yes, but only a handful of <a href="http://stackoverflow.com/search?q=cascalog">cascalog questions on StackOverflow</a>. So I thought I'd pay things forward by tossing out a bunch of toy examples that I used in my early experiments with the language. These examples cover clojure's basic syntax and regular expressions, plus cascalog's filtering and basic aggregations. I'll save tests, joins, etc. for a later post.<br />
<br />
All these examples use the Scrabble word dictionary, which provides a nice bite-sized playground for lots of MapReduced fun. The <a href="http://www.isc.ro/en/commands/lists.html">Tournament Word List</a> is a list of all the legal words for Scrabble in U.S. tournament play. Words are in all caps and separated by line breaks. I downloaded and saved the file as /Users/agong/Data/scrabble-list-twl06.txt. To speed up testing, I also sampled rows at random to create a 10% sample (/Users/agong/Data/scrabble-list-twl06-10pct.txt) and a 1% sample (/Users/agong/Data/scrabble-list-twl06-1pct.txt).<br />
<br />
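For the curious, here is one way such a sample could be drawn. This is a plain-Python sketch, not necessarily how I built the files above: <code>sample_lines</code> is a hypothetical helper that keeps each line independently with the given probability, so the sample size is only approximately 10% or 1%.<br />
<br />

```python
import random


def sample_lines(lines, fraction, seed=None):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


if __name__ == "__main__":
    # A tiny stand-in for the word list; a fixed seed makes the draw repeatable.
    words = ["AA", "AAH", "AAHED", "ZYMOSIS", "ZYZZYVA"]
    print(sample_lines(words, 0.5, seed=0))
```

<br />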
<div class="separator" style="clear: both; text-align: center;">
<a href="http://img0.etsystatic.com/000/0/5479662/il_fullxfull.232986892.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em; text-align: center;"><img border="0" height="212" src="http://img0.etsystatic.com/000/0/5479662/il_fullxfull.232986892.jpg" width="320" /></a></div>
<br />
<br />
<br />
First, let's try a vanilla query: just print all the lines in the file. This will help make sure clojure and hadoop are set up properly.<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>; vanilla query: print all lines</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(?<- (stdout) [?word]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>((lfs-textline "/Users/agong/Data/scrabble-list-twl06-1pct.txt") ?word))</span><br />
<br />
If that works, we can do some basic filtering.<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>; basic filter: print lines starting with 'Z'</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(?<- (stdout) [?word]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>((lfs-textline "/Users/agong/Data/scrabble-list-twl06-1pct.txt") :> ?word)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(clojure.core/re-matches #"Z.*" ?word) ))</span><br />
<br />
So far so good. We have Hello world, and basic filters check out. Let's use the count aggregation to count the number of words in the sample:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>; count words</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(?<- (stdout) [?count]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>((lfs-textline "/Users/agong/Data/scrabble-list-twl06-1pct.txt") ?word)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(cascalog.ops/count ?count) ))</span><br />
<br />
Now let's build up to letter counts (a twist on the classic MapReduce word count example). To get there, we need to define a mapcat operator:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>; split words into letters</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>; https://github.com/sritchie/cascalog-class/blob/master/src/cascalog_class/core.clj</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(defmapcatop split</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span> "Accepts a word and emits a single 1-tuple for each letter."</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span> [word]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span> (clojure.core/re-seq #"." word))</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>; count letters</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(require `cascalog.ops)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(?<- (stdout) [?letter ?count]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>((lfs-textline "/Users/agong/Data/scrabble-list-twl06-1pct.txt") ?word)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(split ?word :> ?letter) (cascalog.ops/count ?count))</span><br />
<br />
With these very simple tools, we can do a surprising number of interesting things. Let's create a function to do n-gram counting, modified slightly from <a href="http://lilyx.net/2011/06/11/calculating-n-gram-statistics-in-a-mapreduce-way-using-clojure/">this example</a>.<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>; count ngrams</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(defmapcatop ngrams</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span> "Accepts a word and n-parameter and emits a single 1-tuple for each n-gram."</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span> [word n]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span> (map clojure.string/join (partition n 1 word)))</span><br />
<br />
Now we can count bigrams and trigrams:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>; Character bigrams</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(?<- (stdout) [?letter ?count]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>((lfs-textline "/Users/agong/Data/scrabble-list-twl06-1pct.txt") ?word)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(ngrams ?word 2 :> ?letter) (cascalog.ops/count ?count))</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>; Character trigrams - no sampling</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(?<- (stdout) [?letter ?count]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>((lfs-textline "/Users/agong/Data/scrabble-list-twl06.txt") ?word)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(ngrams ?word 3 :> ?letter) (cascalog.ops/count ?count))</span><br />
<br />
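If the sliding window in <code>(partition n 1 word)</code> isn't obvious, here's the same idea as a plain-Python sketch (not Cascalog; the function names are mine). It can also serve as a quick cross-check on a handful of words:<br />
<br />

```python
from collections import Counter


def ngrams(word, n):
    """All length-n substrings of word, sliding one character at a time."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def count_ngrams(words, n):
    """Tally n-grams across a list of words."""
    counts = Counter()
    for word in words:
        counts.update(ngrams(word, n))
    return counts


if __name__ == "__main__":
    print(ngrams("ZEBRA", 2))
    print(count_ngrams(["ABAB", "BABY"], 2))
```

Words shorter than n simply produce no n-grams, which matches how <code>partition</code> drops incomplete windows.<br />
<br />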
Let's get word lengths...<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>; get word lengths</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(defmapcatop get-len</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span> "Accepts a word and emits a single 1-tuple with its length."</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span> [word]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span> [(.length word)])</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>; distribution of word lengths</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(?<- (stdout) [?length ?count]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>((lfs-textline "/Users/agong/Data/scrabble-list-twl06-1pct.txt") ?word)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(get-len ?word :> ?length)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(cascalog.ops/count ?count))</span><br />
<br />
I guess it makes sense that the longest words in Scrabble are 15 letters long: the board is only 15 squares across.<br />
<br />
Now let's combine filters and aggregators. We need to create a filter operation for this...<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(deffilterop len-n? [word n]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span> "Keep only words of length n"</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span> (= (.length word) n))</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>; distribution of lengths for 7-letter words (a silly example to make sure it worked)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(?<- (stdout) [?length ?count]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>((lfs-textline "/Users/agong/Data/scrabble-list-twl06-1pct.txt") :> ?word)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(len-n? ?word 7)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(get-len ?word :> ?length)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(cascalog.ops/count ?count))</span><br />
<br />
Vowel dumps are an important part of scrabble tactics: words that let you get rid of extra vowels without wasting a turn to exchange your hand. First, we can do pure vowel dumps -- words that include no consonants at all. (We'll grant Y vowel status, even though it only qualifies sometimes.)<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>; vowel dumps</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(deffilterop pure-vowel-dump? [word]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span> "Keep only words containing only vowels (and sometimes y)"</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span> (every?</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span> <span class="Apple-tab-span" style="white-space: pre;"> </span>(into #{} (clojure.core/re-seq #"." "AEIOUY"))</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span> <span class="Apple-tab-span" style="white-space: pre;"> </span>(clojure.core/re-seq #"." word)))</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(?<- (stdout) [?word]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>((lfs-textline "/Users/agong/Data/scrabble-list-twl06.txt") :> ?word)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(pure-vowel-dump? ?word))</span><br />
<br />
Hm. There aren't very many of these pure vowel dumps. How about a more flexible function that calculates the proportion of vowels in the word?<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(defmapop vowel-ratio [word]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(/</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span> <span class="Apple-tab-span" style="white-space: pre;"> </span>(.length (clojure.string/replace word #"[^AEIOUY]" ""))</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span> <span class="Apple-tab-span" style="white-space: pre;"> </span>(.length word)))</span><br />
<br />
Now we can look up words with more than 70% vowels. For good measure, let's show the ratio of vowels in each.<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>;Ratios for vowel dumps</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(?<- (stdout) [?word ?ratio]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>((lfs-textline "/Users/agong/Data/scrabble-list-twl06-1pct.txt") :> ?word)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(vowel-ratio ?word :> ?ratio )</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(> ?ratio 7/10))</span><br />
<br />
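To make the arithmetic concrete, here is a plain-Python sketch of the vowel proportion (again, not Cascalog, and the names are mine). It uses exact fractions so the comparison against 7/10 behaves like Clojure's ratios:<br />
<br />

```python
from fractions import Fraction

VOWELS = set("AEIOUY")  # Y is granted vowel status, as above


def vowel_ratio(word):
    """Exact fraction of letters in word that are vowels."""
    vowel_count = sum(1 for letter in word if letter in VOWELS)
    return Fraction(vowel_count, len(word))


if __name__ == "__main__":
    for word in ["ADIEU", "AREA", "ZEBRA"]:
        print(word, vowel_ratio(word), vowel_ratio(word) > Fraction(7, 10))
```

ADIEU has 4 vowels out of 5 letters (4/5) and AREA has 3 out of 4 (3/4), so both clear the 7/10 cutoff. ZEBRA, at 2/5, does not.<br />
<br />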
Among these vowel dumps, what's the distribution of lengths? The distribution of letters?<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>;Length distribution for vowel dumps</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(?<- (stdout) [?length ?count]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>((lfs-textline "/Users/agong/Data/scrabble-list-twl06.txt") :> ?word)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(vowel-ratio ?word :> ?ratio )</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(> ?ratio 7/10)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(get-len ?word :> ?length)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(cascalog.ops/count :> ?count))</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>;Letter distribution for vowel dumps</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(?<- (stdout) [?letter ?count]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>((lfs-textline "/Users/agong/Data/scrabble-list-twl06.txt") :> ?word)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(vowel-ratio ?word :> ?ratio )</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(> ?ratio 7/10)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(split ?word :> ?letter)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(cascalog.ops/count :> ?count))</span><br />
<br />
Okay, a few more fun examples. Palindromes...<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>;Palindromes</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(deffilterop palindrome? [word]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span> "Return true for palindromes"</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span> (= word (clojure.string/reverse word)) )</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(?<- (stdout) [?word]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>((lfs-textline "/Users/agong/Data/scrabble-list-twl06.txt") :> ?word)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(palindrome? ?word))</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>;Distribution of letters in palindromes</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(?<- (stdout) [?letter ?count]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>((lfs-textline "/Users/agong/Data/scrabble-list-twl06.txt") :> ?word)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(palindrome? ?word)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(split ?word :> ?letter)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(cascalog.ops/count ?count))</span><br />
<br />
Not a whole lot of these either. What about vowel-consonant palindromes? That is, words where the pattern of vowels and consonants reads the same back-to-front as front-to-back?<br />
<br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>;Vowel-consonants palindromes</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(deffilterop vc-palindrome? [word]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span> "Return true for vowel-consonant palindromes"</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span> (let [vc-word (clojure.string/replace (clojure.string/replace word #"[AEIOUY]" "A") #"[^A]" "B")]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span> <span class="Apple-tab-span" style="white-space: pre;"> </span>(= vc-word (clojure.string/reverse vc-word)) ))</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(?<- (stdout) [?word]</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>((lfs-textline "/Users/agong/Data/scrabble-list-twl06-10pct.txt") :> ?word)</span><br />
<span style="font-family: Courier New, Courier, monospace;"><span class="Apple-tab-span" style="white-space: pre;"> </span>(vc-palindrome? ?word))</span><br />
<br />
There we go! From ANALYZE to <span style="font-family: Consolas, 'Courier New', Courier, monospace;">ZYMOSIS</span>. (Your words will probably be different, because of the sampling. But they will still be v/c palindromes. :) )<br />
<br />
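If you want to check a word by hand, here's the same test as a plain-Python sketch (the function names are my own, and this is not the Cascalog operator itself):<br />
<br />

```python
VOWELS = set("AEIOUY")  # Y counts as a vowel, as in the filters above


def vc_pattern(word):
    """Map each letter to 'V' (vowel) or 'C' (anything else)."""
    return "".join("V" if letter in VOWELS else "C" for letter in word)


def is_vc_palindrome(word):
    """True when the vowel/consonant pattern reads the same both ways."""
    pattern = vc_pattern(word)
    return pattern == pattern[::-1]


if __name__ == "__main__":
    for word in ["ANALYZE", "ZYMOSIS", "SCRABBLE"]:
        print(word, vc_pattern(word), is_vc_palindrome(word))
```

ANALYZE maps to VCVCVCV and ZYMOSIS to CVCVCVC, so both pass. SCRABBLE maps to CCCVCCCV and does not.<br />
<br />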
I had fun spending an afternoon putting together these examples. It was a great, well-bounded way to get my feet wet with clojure and cascalog.<br />
<br />
HTH<br />
<br />
Definitions for data science (posted 2013-03-11)<br />
Since I'm rebooting this blog, this seems like a good moment to lay out a framework for data science. I'll tackle definitions now, and process next time.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWI5V2Pr9HA9hiO_jAYWGaPGfObwM1Gix_o68MfpTRfw4SdqkETTIxUQuRvy4BBXHWdY2WBuBRn-qDB2JN6ImYdaP8JfoOLlyrTlNrDT3lUC9j80WMeOPVTMqunIOcWfvJswoH70QbPEk/s1600/old-thick-book.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="208" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjWI5V2Pr9HA9hiO_jAYWGaPGfObwM1Gix_o68MfpTRfw4SdqkETTIxUQuRvy4BBXHWdY2WBuBRn-qDB2JN6ImYdaP8JfoOLlyrTlNrDT3lUC9j80WMeOPVTMqunIOcWfvJswoH70QbPEk/s320/old-thick-book.jpg" width="320" /></a></div>
<br />
<br />
Pinning down scope and definitions is important for data science, because the field is growing rapidly, with a sense that the sky is the limit. Without priorities and a grasp of what data science <i>isn't</i>, we run the risk of overreaching, wasting our time, and leaving everyone disappointed. I won't claim that my definition is the only definition, or even the best definition. But it works for me, and it has some virtues worth discussing.<br />
<br />
Essentially, I think of data science as "answering questions with data," or more precisely, "providing empirical answers to well-posed questions." By empirical, I mean "based on information that all participants can observe in common." By well-posed, I mean "admitting a definitive answer": once we see the right answer, we can all agree that it's the right one. In the language of formal logic, a well-posed question is one that admits a deductively valid conclusion. So, data = empirical, science = questions.<br />
<br />
The main difference between my definition and most of the others floating around (e.g. <a href="http://www.quora.com/What-is-data-science">here</a>) is that I focus on the goal of data science (answering questions), not the tools or methods for getting there (e.g. data munging, predictive analytics, writing mapReduce queries).<br />
<br />
I find that defining data science by goals instead of tools adds clarity, for two reasons. First, goals usually provide a more defined boundary than tools. Almost none of the tools of data science are unique to data science. Software engineers do lots of "hacking"; forecasters do lots of statistical modeling; DB admins use plenty of NoSQL. None of these things on its own provides a bright line for determining who is a data scientist or not, so we have to take a fuzzy average over lots of categories, and end up with a large gray area of jobs that are "kind of" data science. In contrast, it's usually pretty clear if your goal is answering questions (a.k.a. "providing insight," "running analytics," "informing decisions") or not.<br />
<br />
Second, focusing on goals lets us differentiate approaches by effectiveness. Without a clear understanding of the job of data science, it's impossible to tell the difference between professionals who choose the right tools to get the job done, and bandwagoners who are just playing with every shiny new toy. Since the bandwagoning has started already, I think we'll be well served to differentiate between effective data scientists and the tools they use.<br />
<br />
Analogy: I'm in the hospital for an appendectomy as I write this, going under the knife in a few hours. I find it much more comforting to think of the doctors in terms of goals ("People who help you regain your health") than tools and methods ("People who cut holes in you with scalpels and wires"). Similarly, I'd be much happier hiring a data scientist who is good at answering questions than one who is good with MongoDB or Bayesian models. Having the tools is necessary but not sufficient to accomplish the goals.<br />
<br />
With those ideas on the table, here are some comparisons I'd like to explore in the future:<br />
<br />
<ul>
<li>How is data science different from "big data"?</li>
<li>How is data science different from statistics?</li>
<li>How is data science different from data analysis?</li>
<li>How is data science different from science in general?</li>
<li>How is data science different from software engineering?</li>
</ul>
<br />
What do you think? Discuss.<br />
<br />
Notes on Data-driven Design in the 2012 Obama campaign (posted 2013-03-06)<br />
Just got out of a presentation by Josh Higgins and Dan Ryan, two of the technical leads on the 2012 Obama campaign. Really great presentation on designing and optimizing campaign tools using the latest and greatest in web development techniques.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://immizen.com/wp-content/uploads/2012/11/Narwhal.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="http://immizen.com/wp-content/uploads/2012/11/Narwhal.jpg" width="320" /></a></div>
<br />
<br />
Takeaways:<br />
<br />
<ul>
<li>Campaigns have three goals: (1) raise money, (2) persuade voters, (3) get out the vote.</li>
<li>The tools built in 2012 made that campaign 28% more effective than the '08 campaign.</li>
<li>The campaign was won by volunteers' "boots on the ground," but really good tech acted as a "force multiplier" for limited volunteers.</li>
<li>The team would run 16 or more A/B tests in a day. More than a thousand tests over the course of the campaign.</li>
<li>Each test needed a clear question and hypothesis so that there would be permanent learning from the experiment.</li>
<li>The campaign raised $690,000,000 online, an average of a little over $100 for 4 million donors. This was the first campaign to raise more money online than off.</li>
<li>$125,000,000 of that total was due to testing -- improvements in interventions that earned more money.</li>
<li>Spending lots of time on beautiful design didn't always work. "Sometimes ugly sells."</li>
<li>Lots of predictions were wrong. "Don't think you know anything until the data tells you so."</li>
<li>The facebook tool for "social canvassing" was "creepy awesome": given user permissions, the tool would crawl your timeline and all your friends' profiles, to identify friends who are (1) socially close, and (2a) physically close or (2b) in a battleground state. Close to election day, the app sent reminder messages (lots of them!) urging volunteers to remind their friends to vote.</li>
<li>Based on the voter file, 5 million facebook volunteers mobilized 7 million facebook-only voters -- voters for whom facebook was the campaign's <i>only</i> method of contact. In the end, Obama won the popular vote by 5 million votes. "This was our way of knocking on doors." "We won the popular vote with facebook."</li>
<li>Targeted splash pages during the conventions <b>doubled</b> the take from fundraising with <b>half</b> the asks.</li>
<li>"Drunk emails before midnight": on drinking holidays (New Year's Eve, St. Patrick's Day, etc.) the campaign would send previous donors emails (subject line: "Hey!") with a one-click donation link. "We raised millions." Way to target the perfect moment!</li>
<li>Custom data tools that worked particularly well: narwhal, the campaign's unified data warehouse; quickPayment, a database for storing required FEC info to take the friction out of donations.</li>
</ul>
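<br />
For what it's worth, here's a minimal sketch of how one such A/B test might be evaluated: a two-proportion z-test comparing conversion rates for two variants. The function name and the counts are made up for illustration; nothing here comes from the campaign's actual tooling.<br />

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for H0: both variants convert at the same rate."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Made-up counts, purely for illustration
z, p = two_proportion_z_test(conv_a=200, n_a=10000, conv_b=260, n_b=10000)
print("z = %.2f, p = %.4f" % (z, p))
```

The "clear question and hypothesis" part happens before any code runs: the null hypothesis here is that variant B converts no differently than variant A.<br />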
<br />
<br />
Great example of the power of data and fast development cycles.<br />
<br />
PS: I want a picture of a triumphant Obama riding a narwhal, in the style of <a href="http://fc00.deviantart.net/fs70/i/2010/333/9/a/abe_lincoln_riding_a_grizzly_by_sharpwriter-d33u2nl.png">Abe Lincoln riding a bear</a>.Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-301218875078503995.post-4612410906625065302013-02-20T21:23:00.002-08:002013-02-20T21:23:21.919-08:00Data can't do everything. So what?<br />
*Sigh* <a href="http://www.nytimes.com/2013/02/19/opinion/brooks-what-data-cant-do.html">This article</a> again. The one that says, "Data can't do everything." This time, David Brooks happens to be the one writing it, but it could have been anybody, really. Brooks gives a list of things that he feels data does poorly ("context", "big problems", "the social"), and then concludes with this gem:<br />
<blockquote class="tr_bq">
"This is not to argue that big data isn’t a great tool. It’s just that, like any tool, it’s good at some things and not at others."</blockquote>
Well, duh!<br />
<br />
I'm tired of reading the many incarnations of this article, for two reasons.<br />
<br />
<ol>
<li>It's obvious. Good data analysts (and anybody with half a brain) are already aware of these kinds of limitations.</li>
<li>It doesn't move the debate forward. In fact, it clouds the issue.</li>
</ol>
<div>
The debate about data is a debate about scope: "<i>What can and can't be accomplished with data?</i>" This isn't a question that can be resolved using vague generalities. For example, the following logic (based on one of Brooks' rules of thumb) doesn't work: "Well, building a platform where millions of people can share ideas in real time (e.g. twitter) is a 'big problem,' so I guess it <i>can't</i> be solved with data. But convincing my toddler to stop throwing milk at dinner is a 'small problem,' so bring on the statistics!"</div>
<div>
<br /></div>
If you want to know whether data can help answer a question, you have to look at the structure of the data: What variables are available? What are the units of analysis? How are the data structured across time? Are there any plausible sources of exogenous variation (e.g. instrumental variables or "natural experiments")? These are the right questions to ask. Hazy adjectives like "big" or "social" simply aren't useful.<br />
<br />
It's as if Brooks is claiming he can fix your car without opening the hood. "You can fix red SUVs by flushing out the engine." "You can answer big, social questions by relying on values." A real mechanic would get inside the machine and actually see how it works. "Hmm... for this particular big, social question, you have lots of data on X and Y, and a little bit on Z, and this portion was captured as part of an experimental design. That means we can infer A, but we can't infer B..."<br />
<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIVTG2z4mwRkqIBjnUF1B4MNBNLY0Fw92rrJJYEIKqC5IrGGpzMXN8KeTTscyWLDQZkQYfr5286TCUqCMSRzX7iuOQLtxj_mo7s-mYSsuCb1u5SmPXrw5rrqzaq4UODTi_z5ue6IaFHQw0/s1600/under-35118_640.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="275" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjIVTG2z4mwRkqIBjnUF1B4MNBNLY0Fw92rrJJYEIKqC5IrGGpzMXN8KeTTscyWLDQZkQYfr5286TCUqCMSRzX7iuOQLtxj_mo7s-mYSsuCb1u5SmPXrw5rrqzaq4UODTi_z5ue6IaFHQw0/s320/under-35118_640.png" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">"I still say we're both entitled to our own methods of fixing the car."</td></tr>
</tbody></table>
<br />
Data can't do everything. Not even close. But we live in a world swimming in data of increasingly useful types. It seems reasonable to think that we'll be able to do <i>more</i> with that data once we figure out what it's good for. And we can't do that by burning the strawman of omnipotent data, or by trading in mushy platitudes. We need to get specific about real questions and data structure.<br />
Unknownnoreply@blogger.com2tag:blogger.com,1999:blog-301218875078503995.post-71170805931947212462013-02-05T22:23:00.001-08:002013-02-05T22:23:16.583-08:00...and we're back!<br />
I'm back! Life took a sudden turn this summer: I went to the Bay Area for a wedding, lined up some informational interviews around data science, and found myself hired by a fantastic little startup, which then got acquired by a fantastic big startup. I spent the summer working feverishly on my dissertation, then moved out to San Francisco at the beginning of December. It's been an awesome and surprising ride: like crashing a bicycle into a pool full of bacon.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjH1DwijyFnCUPER14B8-oCh4HDIaX-G5B6EKZZBRgA0eMFc1TZi1Y951igP8kNsV0z9SVIAQ8uZgLIY9XwgOi5Viy1tyqqssvSvORykogVIXrsAdEFspMg8iDYEXLG7lICxkNaE5MZCUP/s1600/Pool+of+bacon.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="264" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjjH1DwijyFnCUPER14B8-oCh4HDIaX-G5B6EKZZBRgA0eMFc1TZi1Y951igP8kNsV0z9SVIAQ8uZgLIY9XwgOi5Viy1tyqqssvSvORykogVIXrsAdEFspMg8iDYEXLG7lICxkNaE5MZCUP/s320/Pool+of+bacon.jpg" width="320" /></a></div>
<br />
<br />
Up until now, I couldn't tell the story, because the acquisition of <a href="http://www.massivehealth.com/">Massive</a> Health (the fantastic little startup) by <a href="https://jawbone.com/">Jawbone</a> (the fantastic big startup) wasn't finalized or public <a href="http://www.engadget.com/2013/02/05/jawbone-buys-massive-health-and-visere-to-boost-its-app-design/">until yesterday</a>. Also, I was super busy.<br />
<br />
With these issues resolved, I mean to reopen this blog. The focus is still data, especially opportunities for better living through data, and the day-to-day work of professional data science.<br />
<br />
Really, there are two questions I expect to come back to over and over:<br />
<ul>
<li>What is data science, how is it practiced, and how should it be practiced?</li>
<li>How can personal data make life better for people? Like me, for example.</li>
</ul>
One thing I <b>won't</b> write much about is products and data systems at Jawbone. The company is doing awesome, forward-looking R&D in several areas, but we are supposed to keep a lid on it until the proper moment. Because of that, I'll focus more on process, possibilities, introspection, and nifty stuff popping up around data science writ large.<br />
<br />
If you have thoughts, questions, or ideas on these topics, please engage! The world is full of data, and we're still learning how to make it useful. I'm looking forward to the conversation. Cheers!<br />
<br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-301218875078503995.post-64407724417079130102012-07-06T06:30:00.000-07:002012-07-06T06:30:03.419-07:00Check out (and like, tweet and +1!) the civilometer prototype site<br />
In the spirit of "show > tell", I spent a good chunk of last week building a prototype site for my civilometer proposal for the Knight Foundation's news challenge. <span style="background-color: white;">Please check it out and support us on twitter, facebook and google! Tweets with the #newschallenge hashtag would be particularly appreciated.</span><br />
<span style="background-color: white;"><br /></span><br />
<span style="background-color: white;">Here's the link: <a href="http://www.civilometer.com/">www.civilometer.com</a></span><br />
<br />
Here's a screenshot:<br />
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhW0C2CbsRAo5ITdetyskjGuPHJM_ZwRAQKNOk00QtZlEt7TMNv50rWdflOIOIwrA0zStDLmddx0hewfRhXxoW6VqOW77EBwpum_RoeemgG7Iac_OyjAufHK-BDOhxpttdEiFMh_-Yj_v0b/s1600/civilometer-screenshot.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="178" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhW0C2CbsRAo5ITdetyskjGuPHJM_ZwRAQKNOk00QtZlEt7TMNv50rWdflOIOIwrA0zStDLmddx0hewfRhXxoW6VqOW77EBwpum_RoeemgG7Iac_OyjAufHK-BDOhxpttdEiFMh_-Yj_v0b/s400/civilometer-screenshot.png" width="400" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">A screenshot, for your viewing enjoyment.<br /></td></tr>
</tbody></table>
For those of you joining the story late, <a href="http://newschallenge.tumblr.com/post/25545682065/political-civil-o-meter">the proposed project is a public-facing site for political civility</a>. The site is designed as a data playground to hold politicians and newsmakers accountable for what they say. We would take in real-time media feeds, and apply scientific civility-measuring techniques from my dissertation. A suite of data visualization tools would enable users to ask data-driven questions about civility, and create and share cool graphs of their findings.<br />
<br />
There's *a lot* that you could do with all this data. My hope is that by building a public site (rather than hiding our findings in obscure academic journals) we can inject a bit more accountability into public discourse. I'm really excited about the chance to build something genuinely productive with the research I've been doing the last five years of my life.<br />
<br />
To make all this happen, I've applied for a grant from the Knight Foundation. One of their judging criteria is public support. Judging is happening <i>right now</i>. (*bites nails in trepidation*) If you like this idea, please head over to the site (<a href="http://www.civilometer.com/">www.civilometer.com</a>)<span style="background-color: white;">, and tweet, like, and share the idea with everyone you know.</span><br />
<br />
Thanks!<br />
<br />
<br />
<br />
Warning: the site looks best in recent versions of Firefox and Chrome. I haven't really tested it on IE or Safari. It looks decent on my kindle, though! If we get funded, I'll make sure it looks good for all you poor corporate Microsoft slaves as well.<br />
<br />
<div>
<span style="background-color: white;"><br /></span></div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-301218875078503995.post-45509310137808707962012-06-29T06:58:00.000-07:002012-06-29T06:58:00.168-07:00Word cloud of Knight News Challenge Data proposalsLast night, I scraped the ~800 "data" proposals from the <a href="http://newschallenge.tumblr.com/">Knight News Challenge</a> and turned them into word soup*. As Mike says: Sorry, science! Still, you get a sense of the themes shared across the proposals.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2EDFRy0dlPMVR4Kh2fFikwbASWjlJktPysIW6KavJzU5LGnPnt-ZzAz7NbVE_IBtENycjaePmjVQZfG_FpzlWVTGtjKmSq7v0wgfcIpbW52Z_9LywBlXvtA5jBBHPaSrzacuYri89YH94/s1600/news-challenge-words.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="196" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2EDFRy0dlPMVR4Kh2fFikwbASWjlJktPysIW6KavJzU5LGnPnt-ZzAz7NbVE_IBtENycjaePmjVQZfG_FpzlWVTGtjKmSq7v0wgfcIpbW52Z_9LywBlXvtA5jBBHPaSrzacuYri89YH94/s400/news-challenge-words.png" width="400" /></a></div>
<br />
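The counting step behind a cloud like this can be sketched in a few lines of Python. This is not the code I actually ran (the NLP was done in R); the stopword list, function name, and sample strings are placeholders for illustration.<br />

```python
import re
from collections import Counter

# A tiny, illustrative stopword list; a real one would be much longer
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "is", "on", "with"}

def word_frequencies(documents, top_n=10):
    """Count lowercase word tokens across documents, skipping stopwords."""
    counts = Counter()
    for text in documents:
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(top_n)

# Placeholder strings standing in for scraped proposals
sample = [
    "Open data for local news",
    "A data platform for community news and civic data",
]
print(word_frequencies(sample, top_n=3))
```

A layout tool such as wordle then only needs the word/count pairs.<br />
<br />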
I'm excited about the contest, and the (<span style="background-color: white;">realistically, </span><span style="background-color: white;">slim) chance that our </span><span style="background-color: white;">civil-o-meter </span><span style="background-color: white;">proposal will get funded. This is a really nifty time to be working in this area.</span><br />
<span style="background-color: white;"><br /></span><br />
<span style="background-color: white;"><b>If you like the idea of holding politicians and newsmakers to a fair and accurate standard for civility, please <a href="http://newschallenge.tumblr.com/post/25545682065/political-civil-o-meter">like us on tumblr</a>, or tweet about us using the <a href="https://twitter.com/#!/search/realtime/%23newschallenge">#newschallenge hashtag</a>.</b></span><br />
<br />
<br />
<span style="font-size: x-small;">*I used python for the scraping and R for the very lightweight NLP. The layout is by wordle.</span><br />Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-301218875078503995.post-69610300505531018052012-06-21T04:55:00.001-07:002012-06-21T04:56:56.239-07:00A shameless plug for a worthy cause<style type="text/css">
.my-ask-button {
-moz-box-shadow:inset 0px 1px 0px 0px #bbdaf7;
-webkit-box-shadow:inset 0px 1px 0px 0px #bbdaf7;
box-shadow:inset 0px 1px 0px 0px #bbdaf7;
background:-webkit-gradient( linear, left top, left bottom, color-stop(0.05, #79bbff), color-stop(1, #378de5) );
background:-moz-linear-gradient( center top, #79bbff 5%, #378de5 100% );
filter:progid:DXImageTransform.Microsoft.gradient(startColorstr='#79bbff', endColorstr='#378de5');
background-color:#79bbff;
-moz-border-radius:6px;
-webkit-border-radius:6px;
border-radius:6px;
border:1px solid #84bbf3;
display:inline-block;
color:#ffffff;
font-family:arial;
font-size:15px;
font-weight:bold;
padding:6px 24px;
text-decoration:none;
text-shadow:1px 1px 0px #528ecc;
}.my-ask-button:hover {
background:-webkit-gradient( linear, left top, left bottom, color-stop(0.05, #378de5), color-stop(1, #79bbff) );
background:-moz-linear-gradient( center top, #378de5 5%, #79bbff 100% );
filter:progid:DXImageTransform.Microsoft.gradient(startColorstr='#378de5', endColorstr='#79bbff');
background-color:#378de5;
}.my-ask-button:active {
position:relative;
top:1px;
}
/* This imageless css button was generated by CSSButtonGenerator.com */
</style>
<br />
<div style="text-align: center;">
<div style="background-color: #dddddd; margin: auto; padding: 40px; text-align: left; width: 80%;">
<b>Please "like" my proposal for a political civil-o-meter</b> <a class="my-ask-button" href="http://newschallenge.tumblr.com/post/25545682065/political-civil-o-meter">here</a><br />
<small>If you don't have a tumblr account already,</small><br />
<small>you'll need to take two minutes and create one.</small></div>
</div>
<span style="background-color: white;"><br /></span><br />
<span style="background-color: white;"><b>Details</b></span><br />
<span style="background-color: white;">I've just put in an application for funding through the Knight Foundation's civic media <a href="http://newschallenge.tumblr.com/">news challenge</a>. They want to "accelerate media innovation by funding breakthrough ideas in news and information." This round in the grant competition focuses on the role of data in civic engagement -- right up my alley.</span><br />
<br />
To meet that challenge, I'm proposing a <b>political civil-o-meter</b> -- a crowdsourced site to generate fair and accurate civility ratings for political speech (think campaign ads, newspaper op-eds, and blog posts). Most of the tools to build such a site will already be developed as part of my dissertation; this grant would help me make them available to the public. This site would provide a really cool way to explore civility in public discourse, and hold public officials and media personalities accountable for the civility (or lack thereof) of what they say.<br />
<br />
<span style="background-color: white;">I'd appreciate it if you'd head on over to the Knight Foundation's tumblr blog and "like" </span><a href="http://newschallenge.tumblr.com/post/25545682065/political-civil-o-meter" style="background-color: white;">the civil-o-meter proposal</a><span style="background-color: white;">. (If you don't have a tumblr account already, you'll need to create one -- a quick, painless, and spam-free process.) Even if you don't like the proposal or just don't get it, you can ask clarifying questions in the comments section, and I'll do my best to explain things better. Awards aren't made strictly on the basis of voting, but I figure a little extra attention in this category can't hurt.</span><br />
<br />
Thanks!<br />
<div>
<br /></div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-301218875078503995.post-83564349503006896032012-06-04T07:00:00.000-07:002012-06-04T07:00:08.694-07:00Design patterns for data-centric software<br />
<div>
I wrote a few days ago about software design patterns, including the thought that <b>we're going to discover new patterns for data-centric software</b>. Let me unpack that concept.</div>
<div>
<br /></div>
<div>
First, by data-centric software, I don't mean software intended for data analysis (e.g. R, excel, or google charts). I mean any software that <i>collects and/or responds to data</i> in the course of doing whatever else it does.</div>
<div>
<br /></div>
<div>
Web analytics are a great example of this. The primary purpose of a web page is to serve content. But at the same time, it's easy to track pageviews and traffic. Compared to an untracked web site, a site instrumented with google analytics is more data-centric, because it's generating data in the background.</div>
<div>
<br /></div>
<div>
As I read it, the original design patterns are intended mainly to minimize long-term development costs. The key question is "<i>How should code be structured to make it easy to read, debug, maintain, extend, etc.?</i>" It's all about saving developers' time in the long run.*</div>
<div>
<br /></div>
<div>
Five years after the original set of design patterns was popularized, <a href="http://www.amazon.com/Pattern-Oriented-Software-Architecture-Volume-Concurrent/dp/0471606952">another book</a> was published, focusing on design patterns for distributed software. This time, the key questions expanded to include bandwidth and concurrency: "How should we structure code to make the best use of distributed computing resources?"</div>
<div>
<br /></div>
<div>
I think we're due for another expansion, because data-centric code introduces another optimization target: useful information.** Just as the list of patterns expanded to deal with networking and multiprocessing, it will expand again as data processing and analytics become integral to software design.***</div>
<div>
<br /></div>
<div>
Off the top of my head, here's a quick list of data-centric patterns.</div>
<div>
<ul>
<li>A/B testing</li>
<li>Funnel analysis</li>
<li>Recommender systems (very broad category!)</li>
<li>Top hits (most visited, emailed, etc.)</li>
<li>Automatic bug reports</li>
<li>Likes, +1s, Retweets</li>
</ul>
</div>
<div>
This list isn't complete, and it's clear that best practice is still evolving. For example, A/B testing has been industry-standard for a long time, but I recently read a good argument that <a href="http://untyped.com/untyping/2011/02/11/stop-ab-testing-and-make-out-like-a-bandit/">a multi-armed bandit algorithm is better than A/B testing</a>, because it gathers all the same information while also feeding that information directly back into the site design. It's a very natural extension and improvement over an older data-centric design pattern. I'm sure that many other such improvements are possible.</div>
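<div>
<br /></div>
<div>
To make the bandit idea concrete, here's a rough sketch of epsilon-greedy, one of the simplest bandit strategies (not necessarily the one the linked article recommends). The class name, the two "page variants," and their conversion rates are all invented for the simulation.</div>

```python
import random

class EpsilonGreedyBandit:
    """Pick among page variants, mostly exploiting the best observed rate."""

    def __init__(self, n_arms, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.shows = [0] * n_arms      # times each variant was shown
        self.successes = [0] * n_arms  # conversions per variant
        self.rng = random.Random(seed)

    def rate(self, arm):
        """Observed conversion rate for one variant (0.0 if never shown)."""
        return self.successes[arm] / self.shows[arm] if self.shows[arm] else 0.0

    def choose(self):
        # Explore with probability epsilon; otherwise exploit the best arm so far
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(self.shows))
        return max(range(len(self.shows)), key=self.rate)

    def update(self, arm, converted):
        self.shows[arm] += 1
        self.successes[arm] += int(converted)

# Simulation with made-up conversion rates for two page variants
true_rates = [0.05, 0.10]
bandit = EpsilonGreedyBandit(n_arms=2, epsilon=0.1, seed=42)
for _ in range(5000):
    arm = bandit.choose()
    bandit.update(arm, bandit.rng.random() < true_rates[arm])
print(bandit.shows)
```

<div>
The counts in <code>shows</code> and <code>successes</code> are the same data an A/B test would collect; the difference is that <code>choose</code> acts on them while the experiment is still running.</div>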
<div>
<br /></div>
<div>
Anyway, I think it's still too early to try to write a comprehensive list. But I'd still like to expand this list to cover as many cases as possible. What else belongs here?</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
<div>
*A few of the patterns address things like limited memory and processing power, but they're the exceptions.</div>
</div>
<div>
** Defining "useful" opens up a whole new can of worms, which I won't get into here.</div>
<div>
***This relates back to the concept that I've written about before: <a href="http://compsocsci.blogspot.com/2011/12/software-design-for-analytics-manifesto.html">software design for analytics</a>.</div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-301218875078503995.post-42511111703518545582012-06-01T23:44:00.001-07:002012-06-01T23:44:12.262-07:00Software design patternsFollowing a tip from an experienced software developer, I've been reading up on <a href="http://en.wikipedia.org/wiki/Software_design_pattern">software design patterns</a>: flyweights, factories, facades, etc. These are general patterns for object-oriented programming that show up again and again. The original canon included 23 patterns; that list has since expanded to include patterns for networking and multiprocessing.<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://vis.berkeley.edu/papers/infovis_design_patterns/pattern_map.gif" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="196" src="http://vis.berkeley.edu/papers/infovis_design_patterns/pattern_map.gif" width="320" /></a></div>
<br />
<br />
These design patterns remind me of <a href="http://en.wikipedia.org/wiki/Go_proverb">Go proverbs</a> -- high-level heuristics for better strategy, sometimes contradictory. Knowing them can be extremely helpful, but it's no guarantee that you can deploy them correctly. (Here's <a href="http://senseis.xmp.net/?GoProverbs">a good list of common go proverbs</a>.)<br />
<br />
Anyway, reading the <a href="http://en.wikipedia.org/wiki/Design_Patterns_(book)">original Design Patterns book</a>, I've had three main reactions:<br />
<br />
<div>
<br /></div>
<div>
<b>1. Data-centric software development is going to discover its own list of software design patterns.</b></div>
<div>
<div>
<b>2. There are patterns for research design, just like there are patterns for software design.</b></div>
</div>
<div>
<div>
<b>3. I already know most of the software patterns -- yay!*</b></div>
</div>
<div>
<b><br /></b></div>
<div>
Since I just can't sleep tonight, I figured I'd queue up a few blog posts talking about the first two. Look for those in a couple days.</div>
<div>
<br /></div>
<div>
*Given my very ad hoc background in software design, I've been pleasantly surprised to find that most of the software design patterns are already familiar. For example, python is already very good with iterators and decorators. And working with web frameworks has taught me a lot about factories. And many of the others are much less important in python because objects are dynamically typed. Anyway, it's nice to discover that I've picked a lot of this up by osmosis. (Pat self on the back.)</div>
<div>
<br /></div>Unknownnoreply@blogger.com7tag:blogger.com,1999:blog-301218875078503995.post-53849552902193583962012-05-31T07:22:00.002-07:002012-05-31T07:23:11.305-07:00Bay Area data science people, events<br />
<div style="font-family: arial; font-size: small;">
A quick favor: I'm headed out to Palo Alto for a family event in a couple weeks. While I'm there, I'd love to meet people and find out more about the Bay Area data science scene.</div>
<div style="font-family: arial; font-size: small;">
<br /></div>
<div style="font-family: arial; font-size: small;">
Where should I go? Who should I meet?</div>
<div style="font-family: arial; font-size: small;">
<br /></div>
<div style="font-family: arial; font-size: small;">
I'm free mainly on Monday the 11th through Wednesday the 13th, with some time on Tuesday evening <a href="http://www.meetup.com/SVBigData/">here</a>.</div>
<div style="font-family: arial; font-size: small;">
<br /></div>
<table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"><tbody>
<tr><td style="text-align: center;"><a href="http://philporterart.com/gallery/main.php?g2_view=core.DownloadItem&g2_itemId=76&g2_serialNumber=2" imageanchor="1" style="margin-left: auto; margin-right: auto;"><img border="0" height="257" src="http://philporterart.com/gallery/main.php?g2_view=core.DownloadItem&g2_itemId=76&g2_serialNumber=2" width="320" /></a></td></tr>
<tr><td class="tr-caption" style="text-align: center;">This picture is the first result for gImages: "going to the big city." I like it.</td></tr>
</tbody></table>
<div style="font-family: arial; font-size: small;">
<br /></div>Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-301218875078503995.post-92118224250343730922012-05-29T08:13:00.001-07:002012-05-29T08:13:59.602-07:00Live streaming of Northeastern/Harvard/MIT workshop on computational social science @ IQSS, May 30-June 1<br />
Tomorrow, <a href="http://www.iq.harvard.edu/">IQSS</a> is running a <a href="http://events.iq.harvard.edu/events/node/2858">conference on computational social science</a>. I can't attend this year, but the conference organizers have kindly offered to livestream the sessions. Here's the email from <a href="http://www.hks.harvard.edu/davidlazer/">David Lazer</a>.<br />
<br />
<hr />
<br />
Hi all,<br />
<br />
Please note that we will be live streaming the workshop on computational social science (program below). The url:<br />
<br />
http://video.isites.harvard.edu/liveVideo/liveView.do?name=Comp_Soc_Science<br />
<br />
The Twitter hashtag is: #compsocsci12. We will monitor this hashtag during the workshops to enable remote Q&A.<br />
<br />
If you would like to embed the stream in your website, use this code:<br />
<br />
<pre><iframe src="http://video.isites.harvard.edu/liveVideo/liveEmbed.do?name=Comp_Soc_Science&width=auto&height=auto" width="640" height="360" style='border: 0px;'></iframe></pre>
<br />
Please feel free to forward this e-mail on to interested parties, and if this has been forwarded to you, and you would like to be added to the list, please contact m.lee@neu.edu.<br />
<br />
best,<br />
<br />
DavidUnknownnoreply@blogger.com1tag:blogger.com,1999:blog-301218875078503995.post-69302321874665425892012-05-18T07:00:00.000-07:002012-05-18T07:00:04.671-07:00Will crunch numbers for foodI don't like self-promotion. Makes me feel greasy, if you know what I mean. But graduation is looming, it's a <a href="http://www.smartplanet.com/blog/business-brains/big-data-market-set-to-explode-this-year-but-what-is-8216big-data/22126">boom year for big data</a>, and there's <a href="http://www.bsos.umd.edu/gvpt/graduate/placement/PhDs%20Ten%20Years%20Study.pdf">no hiring pipeline from political science to fun jobs in tech</a>. So I figure it's time to hang out my shingle as a <a href="http://mashable.com/2012/01/13/career-of-the-future-data-scientist-infographic/">data scientist</a>.<br />
<br />
Earlier this week, I bought the domain name <a href="http://www.abegong.com/">abegong.com</a> and worked up a digital resume. Like I said, I'm not a big self-promotion guru, so I'd be grateful for feedback (or job leads).<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="http://mynewbodynewlife.com/blog/wp-content/uploads/2011/07/will_work_for_food1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://mynewbodynewlife.com/blog/wp-content/uploads/2011/07/will_work_for_food1.jpg" /></a></div>
<br />Unknownnoreply@blogger.com21085 S University Ave, Ann Arbor Charter Township, MI 48109, USA42.275459137225027 -83.73619437217712442.274724637225027 -83.737428372177121 42.276193637225028 -83.734960372177127tag:blogger.com,1999:blog-301218875078503995.post-28047530635626615302012-05-17T11:30:00.001-07:002012-05-17T11:30:16.731-07:00Nifty tools for playing with wordsHere are a bunch of sites I use to play with words -- whether brainstorming or trying to accomplish something specific with text analysis.<br />
<br />
<a href="http://www.rhymezone.com/r/rhyme.cgi">A rhyming dictionary</a>. Helpfully splits up the word list by syllables, so you can finish that sonnet you've been working on.<br />
<br />
<br />
Here's a nifty little site for generating <a href="http://en.wikipedia.org/wiki/Portmanteau">portmanteaus</a> (word splices): <a href="http://www.werdmerge.com/?word=charades">http://www.werdmerge.com/</a><br />
<br />
<br />
<a href="http://www.leandomainsearch.com/">http://www.leandomainsearch.com</a>: generates themed domain names, and checks to make sure they're unclaimed by URL squatters.<br />
<br />
Online <a href="http://www.lipsum.com/">lorem generator</a>. Here's the <a href="http://pypi.python.org/pypi/loremipsum/1.0.2">same thing in python</a>.<br />
<br />
Markov text generation: <a href="http://www.beetleinabox.com/markov.html">http://www.beetleinabox.com/markov.html</a>.<br />
<br />
<a href="http://textmechanic.com/Permutation-Generator.html">Permute words and letters</a>. This seems less useful to me... It gives all the combinations, not just the ones that make some kind of sense.<br />
<br />
<a href="http://www.lavarnd.org/demo/index.html">Lavarand</a> used to do random haikus and corporate memos, but it looks like they've broken down.<br />
<br />
<a href="http://aws.amazon.com/datasets/8172056142375670">Google ngrams on AWS public data sets</a>. These are combinations of words that commonly co-occur in English.<br />
<br />
Yes, yes. And then there's <a href="http://www.wordle.net/">wordle</a>. Too pretty for the rest of us.<br />
<br />
What else belongs on this list?Unknownnoreply@blogger.com0tag:blogger.com,1999:blog-301218875078503995.post-73576438017604306262012-05-15T06:38:00.001-07:002012-05-15T06:38:37.516-07:00Python mapreduce on EC2Last week, I wrote about getting AWS public datasets onto an EC2 cluster, and then into HDFS for MapReduce. Now let's get to hello world (or rather, countWords) with python scripts, starting with the mapper.
<br />
<br />
<br />
<pre>#!/usr/bin/env python
# mapper2.py
import sys

# input comes from STDIN, one line of text at a time
for line in sys.stdin:
    line = line.lower()
    words = line.split()
    #--- output tuples [word, 1] in tab-delimited format---
    for word in words:
        print('%s\t%s' % (word, "1"))
</pre>
<br />
<br />
Here's the reducer script:<br />
<br />
<pre>#!/usr/bin/env python
# reducer.py
import sys

# maps words to their counts
word2count = {}

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    try:
        # parse the input we got from mapper2.py
        word, count = line.split('\t', 1)
        # convert count (currently a string) to int
        count = int(count)
    except ValueError:
        # skip lines with no tab or a non-numeric count
        continue
    word2count[word] = word2count.get(word, 0) + count

# write the tuples to stdout
# Note: they are unsorted
for word in word2count:
    print('%s\t%s' % (word, word2count[word]))
</pre>
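<br />
The reducer above keeps every word in a dictionary. Since hadoop streaming hands each reducer its input sorted by key, an alternative is to sum one run of identical words at a time. Here's a sketch of that variant; the function names are my own, and the sample list stands in for sys.stdin.<br />

```python
from itertools import groupby

def parse(lines):
    """Yield (word, count) pairs from tab-delimited lines, skipping bad ones."""
    for line in lines:
        try:
            word, count = line.rstrip('\n').split('\t', 1)
            count = int(count)
        except ValueError:
            continue
        yield word, count

def reduce_sorted(lines):
    """Sum counts per word, assuming lines arrive grouped by word."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)

# Local check with hand-sorted input; under hadoop streaming you would
# pass sys.stdin instead of this list.
sample = ["a\t1\n", "a\t1\n", "b\t1\n"]
for word, total in reduce_sorted(sample):
    print('%s\t%s' % (word, total))
```

This only holds one word's running total in memory, at the cost of relying on the sort order.<br />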
<br />
<br />
The command to execute all this in hadoop is a bit of a monster, mainly because of all the filepaths. Note the usage of the -file parameter, which tells hadoop to ship each script to the cluster nodes so the -mapper and -reducer arguments can refer to it by name. Also, I used -jobconf to set mapred.output.compress to false, because I didn't have a handy LZO decompressor installed.<br />
<br />
<pre>bin/hadoop jar contrib/streaming/hadoop-0.19.0-streaming.jar \
    -input wex-data \
    -output output/run9 \
    -file /usr/local/hadoop-0.19.0/my_scripts/mapper2.py \
    -file /usr/local/hadoop-0.19.0/my_scripts/reducer.py \
    -mapper mapper2.py \
    -reducer reducer.py \
    -jobconf mapred.output.compress=false</pre>
<div>
<br /></div>
<br />
NB: As I dug into this task, I discovered several pretty good python/hadoop-streaming tutorials online. The scripts here were modified from: <a href="http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_3.2_--_Using_Your_Own_Streaming_WordCount_program">http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_3.2_--_Using_Your_Own_Streaming_WordCount_program</a><br />
<br />
<br />
Other sources:<br />
<br />
<a href="http://www.protocolostomy.com/2008/03/20/hadoop-ec2-s3-and-me/">http://www.protocolostomy.com/2008/03/20/hadoop-ec2-s3-and-me/</a><br />
<a href="http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/">http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/</a><br />
<a href="http://www.larsgeorge.com/2010/10/hadoop-on-ec2-primer.html">http://www.larsgeorge.com/2010/10/hadoop-on-ec2-primer.html</a><br />
<a href="http://www.princesspolymath.com/princess_polymath/?p=137">http://www.princesspolymath.com/princess_polymath/?p=137</a><br />
<a href="http://arunxjacob.blogspot.com/2009/04/configuring-hadoop-cluster-on-ec2.html">http://arunxjacob.blogspot.com/2009/04/configuring-hadoop-cluster-on-ec2.html</a><br />
<br />
<a href="http://wiki.apache.org/hadoop/AmazonS3">http://wiki.apache.org/hadoop/AmazonS3</a><br />
<br />
<a href="http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html">http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html</a><br />
<br />
<a href="http://www.cloudera.com/blog/2009/07/advice-on-qa-testing-your-mapreduce-jobs/">http://www.cloudera.com/blog/2009/07/advice-on-qa-testing-your-mapreduce-jobs/</a><br />
<a href="http://hadoop.apache.org/common/docs/r0.20.2/streaming.html">http://hadoop.apache.org/common/docs/r0.20.2/streaming.html</a><br />
<div>
<br /></div>Unknownnoreply@blogger.com6tag:blogger.com,1999:blog-301218875078503995.post-33231884175463526992012-05-03T10:51:00.002-07:002012-05-03T10:52:09.312-07:00Running mapreduce on Amazon's publicly available datasets with python<br />
On Monday, I had a preliminary interview at a really interesting tech startup. In the course of the conversation, the interviewer mentioned that he'd used some of the technical notes from compSocSci in his own work. And I thought nobody was reading!<br />
<br />
Anyway, I've been sitting on some old EC2/hadoop/python notes for a while. The talk gave me the motivation to clean up and post them, just in case they can help somebody else. The goal here is threefold:<br />
<br />
<ol>
<li>Fire up a <a href="http://wiki.apache.org/hadoop/AmazonEC2">hadoop cluster on EC2</a></li>
<li>Import data from an <a href="http://aws.amazon.com/ebs/">EBS volume</a> with one of <a href="http://aws.amazon.com/publicdatasets/">AWS' public data sets</a></li>
<li>Use <a href="http://hadoop.apache.org/common/docs/r0.15.2/streaming.html">hadoop streaming</a> and <a href="http://www.python.org/">python</a> for quick scripting</li>
</ol>
<br />
In other words, we want to set up a tidy, scalable data pipeline as fast as possible. My target project is to do word counts on wikipedia pages -- the classic "hello world" of mapReduce. This isn't super-hard, but I haven't seen a good soup-to-nuts guide that brings all of these things together.<br />
<br />
<b>Phase 1:</b><br />
Follow the notes below to get to the digits-of-pi test. Except for a little trouble with AWS keys, this all went swimmingly, so I see no need to duplicate. If you run into trouble with this part, we can troubleshoot in the comments.<br />
<br />
<a href="http://wiki.apache.org/hadoop/AmazonEC2#Running_a_job_on_a_cluster">http://wiki.apache.org/hadoop/AmazonEC2#Running_a_job_on_a_cluster</a><br />
<a href="http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/">http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/</a><br />
<br />
<b>Phase 2:</b><br />
Now let's attach an external dataset. Here's the dataset we'll use: <a href="http://aws.amazon.com/datasets/2345">Wikipedia Extraction (WEX)</a>. It's a processed dump of the English language Wikipedia, hosted publicly on Amazon Web Services under snapshot ID snap-1781757e.<br />
<br />
This dataset contains a dump of 1,000 popular English wikipedia articles. It's about 70GB. At Amazon's $.12/GB rate, maintaining this volume costs about $8 for a whole month -- cheap! If you want to scale up to full-size wikipedia (~500GB), you can do that too. After all, we're in big data land.<br />
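The arithmetic behind that estimate is simple enough to script. The size and price below come from the paragraph above. The 30-day month and the idea that the charge scales in proportion to hours used are assumptions for illustration, not a statement of how AWS actually bills.<br />

```python
# Figures from the post: volume size and price per GB per month.
VOLUME_GB = 70
PRICE_PER_GB_MONTH = 0.12

# Assumption for illustration: a 30-day month, charged in proportion to time used.
HOURS_PER_MONTH = 30 * 24


def storage_cost(size_gb, hours):
    """Estimated storage cost in dollars for keeping a volume for a number of hours."""
    return size_gb * PRICE_PER_GB_MONTH * hours / HOURS_PER_MONTH


monthly = storage_cost(VOLUME_GB, HOURS_PER_MONTH)
one_hour = storage_cost(VOLUME_GB, 1)
print("One month: $%.2f" % monthly)
print("One hour:  $%.4f" % one_hour)
```

Under those assumptions the month comes to roughly $8.40 and a single hour to roughly a cent.<br />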
<br />
Here's the command sequence to create an EBS volume for this snapshot and attach it to an instance. You can look up the IDs using ec2-describe-volumes and ec2-describe-instances, or get them from the AWS console at <a href="https://console.aws.amazon.com/">https://console.aws.amazon.com</a>. (Hint: they're not vol-aaaaaaaa and i-bbbbbbbb.)<br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;"> ec2-create-volume -snapshot snap-1781757e -z us-east-1a</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> ec2-attach-volume vol-aaaaaaaa -i i-bbbbbbbb -d /dev/sdf</span><br />
<br />
It took a while for these commands to execute. Attaching the volume got stuck in "attaching" status for several minutes. I finally got tired of waiting and mounted the volume, and then the status switched right away. Can't say whether that was cause-and-effect or coincidence, but it worked.<br />
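If you'd rather script the waiting than keep re-checking by hand, here's a rough sketch in python. Note that it swaps the ec2-* command line tools used above for the boto3 library, which is not what I ran. The function names are made up, and the ec2 argument is assumed to be a boto3 EC2 client, e.g. boto3.client("ec2"). It doesn't explain why mounting seemed to unstick the status. It just polls until the attachment reports "attached" or a timeout passes.<br />

```python
import time


def wait_for_attachment(ec2, volume_id, interval=5, timeout=600):
    """Poll until the volume's attachment state is 'attached'.

    Returns True once attached, or False if the timeout passes first.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        volumes = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"]
        attachments = volumes[0].get("Attachments", [])
        if attachments and attachments[0]["State"] == "attached":
            return True
        time.sleep(interval)
    return False


def create_and_attach(ec2, snapshot_id, zone, instance_id,
                      device="/dev/sdf", interval=5, timeout=600):
    """Create a volume from a snapshot, attach it, and wait for the attachment."""
    volume_id = ec2.create_volume(
        SnapshotId=snapshot_id, AvailabilityZone=zone)["VolumeId"]

    # Wait until the new volume has finished being created from the snapshot.
    ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

    ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)
    if not wait_for_attachment(ec2, volume_id, interval, timeout):
        raise RuntimeError("volume %s did not attach in time" % volume_id)
    return volume_id
```

A call would look like create_and_attach(boto3.client("ec2"), "snap-1781757e", "us-east-1a", "i-bbbbbbbb"), with your real instance ID in place of the placeholder.<br />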
<br />
Once you've attached the EBS volume, log in to the instance (instructions <a href="http://wiki.apache.org/hadoop/AmazonEC2#Running_a_job_on_a_cluster">here</a>) and mount the volume as follows. This should be pretty much instantaneous.<br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> mkdir /mnt/wex_data</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> mount /dev/sdf /mnt/wex_data</span><br />
<br />
Now import the data into the Hadoop file system:<br />
<br />
<span style="font-family: 'Courier New', Courier, monospace;"> cd /usr/local/hadoop/</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> hadoop fs -copyFromLocal /mnt/wex_data/rawd/freebase-wex-2009-01-12-articles.tsv wex-data</span><br />
<br />
If you want, you can now detach and delete the EBS volume. The articles file is stored in the distributed filesystem across the EC2 instances in your hadoop cluster. The nice thing is that you can get to this point in less than an hour, meaning that you only have to pay a tiny fraction of the monthly storage cost.<br />
<span style="font-family: 'Courier New', Courier, monospace;"><br /></span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> ec2-detach-volume vol-aaaaaaaa -i i-bbbbbbbb -d /dev/sdf</span><br />
<span style="font-family: 'Courier New', Courier, monospace;"> ec2-delete-volume vol-aaaaaaaa</span><br />
<br />
I had some trouble detaching volumes until I used the force flag: -f. Maybe I was just being impatient again.<br />
<br />
That's enough for the moment. I'll tackle python in my next post.<br />
<div>
<br /></div>Unknownnoreply@blogger.com0