Friday, March 15, 2013

Hacking Scrabble with Cascalog


We've barely met, but I am in love with Cascalog.  So elegant.  So powerful.  So easy to ship from testing to staging to production.  So perfect for the workflow abstraction where most data science happens.  Let me count the ways...



The only downside is the shortage of documentation and examples. There's an active google group, yes, but only a handful of cascalog questions on StackOverflow.  So I thought I'd pay it forward by tossing out a bunch of toy examples that I used in my early experiments with the language.  These examples cover clojure's basic syntax and regular expressions, plus cascalog's filtering and basic aggregations.  I'll save tests, joins, etc. for a later post.

All these examples use the Scrabble Word dictionary, which provides a nice bite-sized playground for lots of MapReduce fun.  The Tournament Word List is a list of all the legal words for Scrabble in U.S. tournament play.  Words are in all caps and separated by line breaks.  I downloaded and saved the file as /Users/agong/Data/scrabble-list-twl06.txt.  To speed up testing, I also sampled rows at random to create a 10% sample (/Users/agong/Data/scrabble-list-twl06-10pct.txt) and a 1% sample (/Users/agong/Data/scrabble-list-twl06-1pct.txt).
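If you want to reproduce the samples, here's one way to do it from a clojure REPL.  The sample-file helper below is just a sketch I'm adding for illustration (it isn't part of the original workflow); any line-sampling trick would do.

```clojure
(require 'clojure.string)

; Hypothetical helper: write a random ~rate fraction of in-path's lines to out-path
(defn sample-file [in-path out-path rate]
 (let [lines (clojure.string/split-lines (slurp in-path))]
  (spit out-path
        (clojure.string/join "\n" (filter (fn [_] (< (rand) rate)) lines)))))

; e.g. (sample-file "/Users/agong/Data/scrabble-list-twl06.txt"
;                   "/Users/agong/Data/scrabble-list-twl06-10pct.txt" 0.1)
```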




First, let's try a vanilla query: just print all the lines in the file. This will help make sure clojure and hadoop are set up properly.

; vanilla query: print all lines
(?<- (stdout) [?word]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06-1pct.txt") ?word))

If that works, we can do some basic filtering.

; basic filter: print words starting with 'Z'
(?<- (stdout) [?word]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06-1pct.txt") :> ?word)
(clojure.core/re-matches #"Z.*" ?word))

So far so good.  Hello World runs, and basic filters check out.  Let's use the count aggregation to count the number of words in the sample:

; count words
(require 'cascalog.ops)
(?<- (stdout) [?count]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06-1pct.txt") ?word)
(cascalog.ops/count ?count))

Now let's build up to letter counts (a twist on the classic MapReduce word count example).  To get there, we need to define a mapcat operator:

; split words into letters
; https://github.com/sritchie/cascalog-class/blob/master/src/cascalog_class/core.clj
(defmapcatop split
 "Accepts a word and emits a single 1-tuple for each letter."
 [word]
 (clojure.core/re-seq #"." word))
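If re-seq is new to you, the regex "." matches any single character, so the body of split just breaks a word into one-character strings:

```clojure
; What split emits for one word
(clojure.core/re-seq #"." "CAT")
; => ("C" "A" "T")
```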

; count letters
(require 'cascalog.ops)
(?<- (stdout) [?letter ?count]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06-1pct.txt") ?word)
(split ?word :> ?letter) (cascalog.ops/count ?count))

With these very simple tools, we can do a surprising number of interesting things. Let's create a function to do n-gram counting, modified slightly from this example.

; count ngrams
(defmapcatop ngrams
 "Accepts a word and an n parameter and emits a single 1-tuple for each n-gram."
 [word n]
 (map (partial apply str) (partition n 1 word)))
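To see the moving parts: partition builds overlapping windows of characters, and applying str glues each window back into a string.  That gluing is all the joining step needs to do:

```clojure
; Overlapping windows of 2 characters...
(partition 2 1 "WORD")
; => ((\W \O) (\O \R) (\R \D))

; ...glued back into strings
(map (partial apply str) (partition 2 1 "WORD"))
; => ("WO" "OR" "RD")
```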

Now we can count bigrams and trigrams:

; Character bigrams
(?<- (stdout) [?letter ?count]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06-1pct.txt") ?word)
(ngrams ?word 2 :> ?letter) (cascalog.ops/count ?count))

; Character trigrams - no sampling
(?<- (stdout) [?letter ?count]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06.txt") ?word)
(ngrams ?word 3 :> ?letter) (cascalog.ops/count ?count))

Let's get word lengths...

; get word lengths
(defmapcatop get-len
 "Accepts a word and emits a single 1-tuple with its length."
 [word]
 [(.length word)])

; distribution of word lengths
(?<- (stdout) [?length ?count]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06-1pct.txt") ?word)
(get-len ?word :> ?length)
(cascalog.ops/count ?count))

I guess it makes sense that the longest words in scrabble are 15 letters long...

Now let's combine filters and aggregators.  We need to create a filter operation for this...

(deffilterop len-n?
 "Keep only words of length n"
 [word n]
 (= (.length word) n))

; distribution of lengths for 7-letter words (a silly example to make sure it worked)
(?<- (stdout) [?length ?count]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06-1pct.txt") :> ?word)
(len-n? ?word 7)
(get-len ?word :> ?length)
(cascalog.ops/count ?count))

Vowel dumps are an important part of scrabble tactics: words that let you get rid of extra vowels without wasting a turn to exchange your hand.  First, we can do pure vowel dumps -- words that include no consonants at all.  (We'll grant Y vowel status, even though it only qualifies sometimes.)

; vowel dumps
(deffilterop pure-vowel-dump?
 "Keep only words containing only vowels (and sometimes Y)"
 [word]
 (every?
  (into #{} (clojure.core/re-seq #"." "AEIOUY"))
  (clojure.core/re-seq #"." word)))
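The set built by into acts as a membership predicate here, and every? applies it to each one-character string in the word.  At the REPL (the two words below are just illustrations):

```clojure
; Membership tests against the vowel set
(def vowels (into #{} (clojure.core/re-seq #"." "AEIOUY")))

(every? vowels (clojure.core/re-seq #"." "EAU"))
; => true
(every? vowels (clojure.core/re-seq #"." "CAT"))
; => false
```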

(?<- (stdout) [?word]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06.txt") :> ?word)
(pure-vowel-dump? ?word))

Hm. There aren't very many of these pure vowel dumps.  How about a more flexible function that calculates the proportion of vowels in the word?

(defmapop vowel-ratio
 "Emit the fraction of a word's letters that are vowels (counting Y)"
 [word]
 (/
    (.length (clojure.string/replace word #"[^AEIOUY]" ""))
    (.length word)))
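As a quick sanity check on the arithmetic, clojure's ratio type keeps the fraction exact.  For "QUEUE", four of the five letters are vowels:

```clojure
(require 'clojure.string)

; Vowel ratio of "QUEUE", computed by hand: strip the non-vowels, then divide
(/
 (.length (clojure.string/replace "QUEUE" #"[^AEIOUY]" ""))
 (.length "QUEUE"))
; => 4/5
```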

Now we can look up words that are more than 70% vowels.  For good measure, let's show the ratio of vowels in each.

;Ratios for vowel dumps
(?<- (stdout) [?word ?ratio]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06-1pct.txt") :> ?word)
(vowel-ratio ?word :> ?ratio )
(> ?ratio 7/10))

Among these vowel dumps, what's the distribution of lengths?  The distribution of letters?

;Length distribution for vowel dumps
(?<- (stdout) [?length ?count]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06.txt") :> ?word)
(vowel-ratio ?word :> ?ratio )
(> ?ratio 7/10)
(get-len ?word :> ?length)
(cascalog.ops/count :> ?count))

;Letter distribution for vowel dumps
(?<- (stdout) [?letter ?count]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06.txt") :> ?word)
(vowel-ratio ?word :> ?ratio )
(> ?ratio 7/10)
(split ?word :> ?letter)
(cascalog.ops/count :> ?count))

Okay, a few more fun examples.  Palindromes...

;Palindromes
(deffilterop palindrome?
 "Return true for palindromes"
 [word]
 (= word (clojure.string/reverse word)))

(?<- (stdout) [?word]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06.txt") :> ?word)
(palindrome? ?word))

;Distribution of letters in palindromes
(?<- (stdout) [?letter ?count]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06.txt") :> ?word)
(palindrome? ?word)
(split ?word :> ?letter)
(cascalog.ops/count ?count))

Not a whole lot of these either.  What about vowel-consonant palindromes?  That is, words where the back-to-front and front-to-back ordering of vowels and consonants is the same?

;Vowel-consonant palindromes
(deffilterop vc-palindrome?
 "Return true for vowel-consonant palindromes"
 [word]
 (let [vc-word (clojure.string/replace (clojure.string/replace word #"[AEIOUY]" "A") #"[^A]" "B")]
  (= vc-word (clojure.string/reverse vc-word))))
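To see the transformation at work, here's the vowel-consonant skeleton of ANALYZE, one of the words this query turns up: vowels (including Y) become A, everything else becomes B, and the result reads the same in both directions.

```clojure
(require 'clojure.string)

; The vowel-consonant skeleton of "ANALYZE"
(clojure.string/replace
 (clojure.string/replace "ANALYZE" #"[AEIOUY]" "A") #"[^A]" "B")
; => "ABABABA"
```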

(?<- (stdout) [?word]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06-10pct.txt") :> ?word)
(vc-palindrome? ?word))

There we go!  From ANALYZE to ZYMOSIS.  (Your words will probably be different, because of the sampling.  But they will still be v/c palindromes. :) )

I had fun spending an afternoon putting together these examples.  It was a great, well-bounded way to get my feet wet with clojure and cascalog.

HTH

Monday, March 11, 2013

Definitions for data science


Since I'm rebooting this blog, this seems like a good moment to lay out a framework for data science.  I'll tackle definitions now, and process next time.



Pinning down scope and definitions is important for data science, because the field is growing rapidly, with a sense that the sky is the limit.  Without priorities and a grasp of what data science isn't, we run the risk of overreaching, wasting our time, and leaving everyone disappointed.  I won't claim that my definition is the only definition, or even the best definition.  But it works for me, and it has some virtues worth discussing.

Essentially, I think of data science as "answering questions with data," or more precisely, "providing empirical answers to well-posed questions." By empirical, I mean "based on information that all participants can observe in common."  By well-posed, I mean "admitting a definitive answer": once we see the right answer, we can all agree that it's the right one. In the language of formal logic, a well-posed question is one that admits a deductively valid conclusion.  So, data = empirical, science = questions.

The main difference between my definition and most of the others floating around (e.g. here) is that I focus on the goal of data science (answering questions), not the tools or methods for getting there (e.g. data munging, predictive analytics, writing MapReduce queries).

I find that defining data science by goals instead of tools adds clarity, for two reasons.  First, goals usually provide a more defined boundary than tools.  Almost none of the tools of data science are unique to data science. Software engineers do lots of "hacking"; forecasters do lots of statistical modeling; DB admins use plenty of NoSQL.  None of these things on its own provides a bright line for determining who is a data scientist or not, so we have to take a fuzzy average over lots of categories, and end up with a large gray area of jobs that are "kind of" data science. In contrast, it's usually pretty clear if your goal is answering questions (a.k.a. "providing insight," "running analytics," "informing decisions") or not.

Second, focusing on goals lets us differentiate approaches by effectiveness. Without a clear understanding of the job of data science, it's impossible to tell the difference between professionals who choose the right tools to get the job done, and bandwagoners who are just playing with every shiny new toy.  Since the bandwagoning has started already, I think we'll be well served to differentiate between effective data scientists and the tools they use.

Analogy: I'm in the hospital for an appendectomy as I write this, going under the knife in a few hours.  I find it much more comforting to think of the doctors in terms of goals ("People who help you regain your health") than tools and methods ("People who cut holes in you with scalpels and wires").  Similarly, I'd be much happier hiring a data scientist who is good at answering questions than one who is good with mongoDB or Bayesian models.  Having the tools is necessary but not sufficient to accomplish the goals.

With those ideas on the table, here are some comparisons I'd like to explore in the future:

  • How is data science different from "big data"?
  • How is data science different from statistics?
  • How is data science different from data analysis?
  • How is data science different from science in general?
  • How is data science different from software engineering?

What do you think?  Discuss.

Wednesday, March 6, 2013

Notes on Data-driven Design in the 2012 Obama campaign


Just got out of a presentation by Josh Higgins and Dan Ryan, two of the technical leads on the 2012 Obama campaign.  Really great presentation on designing and optimizing campaign tools using the latest and greatest in web development techniques.



Takeaways:

  • Campaigns have three goals: (1) raise money, (2) persuade voters, (3) get out the vote.
  • The tools built in 2012 made that campaign 28% more effective than the '08 campaign.
  • The campaign was won by volunteers' "boots on the ground," but really good tech acted as a "force multiplier" for limited volunteers.
  • The team would run 16 or more A/B tests in a day.  More than a thousand tests over the course of the campaign.
  • Each test needed a clear question and hypothesis so that there would be permanent learning from the experiment.
  • The campaign raised $690,000,000 online, an average of a little over $100 from each of 4 million donors.  This was the first campaign to raise more money online than off.
  • $125,000,000 of that total was due to testing -- improvements in interventions that earned more money.
  • Spending lots of time on beautiful design didn't always work. "Sometimes ugly sells."
  • Lots of predictions were wrong. "Don't think you know anything until the data tells you so."
  • The facebook tool for "social canvassing" was "creepy awesome": given user permissions, the tool would crawl your timeline and all your friends' profiles, to identify friends who are (1) socially close, and (2a) physically close or (2b) in a battleground state.  Close to election day, the app sent reminder messages (lots of them!) urging volunteers to remind their friends to vote.
  • Based on the voter file, 5 million facebook volunteers mobilized 7 million facebook-only voters -- voters for whom facebook was the campaign's only method of contact.  In the end, Obama won the popular vote by 5 million votes.  "This was our way of knocking on doors." "We won the popular vote with facebook."
  • Targeted splash pages during the conventions doubled the take from fundraising with half the asks.
  • "Drunk emails before midnight": on drinking holidays (New Years Eve, St. Patrick's Day, etc.) the campaign would send to previous donors mails (subject line: "Hey!") with a one-click donation link.  "We raised millions."  Way to target the perfect moment!
  • Custom data tools that worked particularly well: narwhal, the campaign's unified data warehouse; quickPayment, a database for storing required FEC info to take the friction out of donations.


Great example of the power of data and fast development cycles.

PS: I want a picture of a triumphant Obama riding a narwhal, in the style of Abe Lincoln riding a bear.