Friday, March 15, 2013

Hacking Scrabble with Cascalog


We've barely met, but I am in love with Cascalog.  So elegant.  So powerful.  So easy to ship from testing to staging to production.  So perfect for the workflow abstraction where most data science happens.  Let me count the ways...



The only downside is the shortage of documentation and examples. There's an active google group, yes, but only a handful of cascalog questions on StackOverflow.  So I thought I'd pay things forward by tossing out a bunch of toy examples that I used in my early experiments with the language.  These examples cover clojure's basic syntax and regular expressions, plus cascalog's filtering and basic aggregations.  I'll save tests, joins, etc. for a later post.

All these examples use the Scrabble Word dictionary, which provides a nice bite-sized playground for lots of MapReduced fun.  The Tournament Word List is a list of all the legal words for Scrabble in U.S. tournament play.  Words are in all caps and separated by line breaks.  I downloaded and saved the file as /Users/agong/Data/scrabble-list-twl06.txt.  To speed up testing, I also sampled rows at random to create a 10% sample (/Users/agong/Data/scrabble-list-twl06-1pct.txt) and a 1% sample (/Users/agong/Data/scrabble-list-twl06-1pct.txt).




First, let's try a vanilla query: just print all the lines in the file. This will help make sure clojure and hadoop are set up properly.

; vanilla query: print all lines
(?<- (stdout) [?word]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06-1pct.txt") ?word))

If that works, we can do some basic filtering.

; basic filter: print lines starting with 'z'
(?<- (stdout) [?word]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06-1pct.txt") :> ?word)
(clojure.core/re-matches #"Z.*" ?word) ))

So far so good. We have Hello world, and basic filters check out.  Let's use the count aggregation to count the number of words in the sample:

; count words
(?<- (stdout) [?count]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06-1pct.txt") ?word)
(cascalog.ops/count ?count) ))

Now let's build up to letter counts (a twist on the classic MapReduce word count example.)  To get there, we need to define a map cat operator:

; split words into letters
; https://github.com/sritchie/cascalog-class/blob/master/src/cascalog_class/core.clj
(defmapcatop split
 "Accepts a word and emits a single 1-tuple for each letter."
 [word]
 (clojure.core/re-seq #"." word))

; count letters
(require `cascalog.ops)
(?<- (stdout) [?letter ?count]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06-1pct.txt") ?word)
(split ?word :> ?letter) (cascalog.ops/count ?count))

With these very simple tools, we can do a surprising number of interesting things. Let's create a function to do n-gram counting, modified slightly from this example.

; count ngrams
(defmapcatop ngrams
 "Accepts a word and n-parameter and emits a single 1-tuple for each n-gram."
 [word n]
 (map my-join (partition n 1 word)))

Now we can count bigrams and trigrams:

; Character bigrams
(?<- (stdout) [?letter ?count]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06-1pct.txt") ?word)
(ngrams ?word 2 :> ?letter) (cascalog.ops/count ?count))

; Character trigrams - no sampling
(?<- (stdout) [?letter ?count]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06.txt") ?word)
(ngrams ?word 3 :> ?letter) (cascalog.ops/count ?count))

Let's get word lengths...

; get word lengths
(defmapcatop get-len
 "Accepts a word and emits a single 1-tuple with its length."
 [word]
 [(.length word)])

; distribution of word lengths
(?<- (stdout) [?length ?count]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06-1pct.txt") ?word)
(get-len ?word :> ?length)
(cascalog.ops/count ?count))

I guess it makes sense that the longest words in scrabble are 15 letters long...

Now let's combine filters and aggregators.  We need to create a filter operation for this...

(deffilterop len-n? [word n]
 "Keep only words not of length n"
 (= (.length word) n))

; distribution of lengths for 7-letter words (a silly example to make sure it worked)
(?<- (stdout) [?length ?count]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06-1pct.txt") :> ?word)
(len-n? ?word 7)
(get-len ?word :> ?length)
(cascalog.ops/count ?count))

Vowel dumps are an important part of scrabble tactics: words that let you get rid of extra vowels without wasting a turn to exchange your hand.  First, we can do pure vowel dumps -- words that include no consonants at all.  (We'll grant Y vowel status, even though it only qualifies sometimes.)

; vowel dumps
(deffilterop pure-vowel-dump? [word]
 "Keep only words containing only vowels (and sometimes y)"
 (every?
  (into #{} (clojure.core/re-seq #"." "AEIOUY"))
  (clojure.core/re-seq #"." word)))

(?<- (stdout) [?word]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06.txt") :> ?word)
(pure-vowel-dump? ?word))

Hm. There aren't very many of these pure vowel dumps.  How about a more flexible function that calculates the proportion of vowels in the word?

(defmapop vowel-ratio [word]
(/
    (.length (clojure.string/replace word #"[AEIOUY]" ""))
    (.length word)))

Now we can look up words with 70% or more vowels.  For good measure, let's show the ratio of vowels in each.

;Ratios for vowel dumps
(?<- (stdout) [?word ?ratio]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06-1pct.txt") :> ?word)
(vowel-ratio ?word :> ?ratio )
(> ?ratio 7/10))

Among these vowel dumps, what's the distribution of lengths?  The distribution of letters?

;Length distribution for vowel dumps
(?<- (stdout) [?length ?count]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06.txt") :> ?word)
(vowel-ratio ?word :> ?ratio )
(> ?ratio 7/10)
(get-len ?word :> ?length)
(cascalog.ops/count :> ?count))

;Letter distribution for vowel dumps
(?<- (stdout) [?letter ?count]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06.txt") :> ?word)
(vowel-ratio ?word :> ?ratio )
(> ?ratio 7/10)
(split ?word :> ?letter)
(cascalog.ops/count :> ?count))

Okay, a few more fun examples.  Palindromes...

;Palindromes
(deffilterop palindrome? [word]
 "Return true for palindromes"
 (= word (clojure.string/reverse word)) )

(?<- (stdout) [?word]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06.txt") :> ?word)
(palindrome? ?word))

;Distribution of letters in palindromes
(?<- (stdout) [?letter ?count]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06.txt") :> ?word)
(palindrome? ?word)
(split ?word :> ?letter)
(cascalog.ops/count ?count))

Not a whole lot of these either.  What about vowel-consonant palindromes?  That is, words where the back-to-front and front-to-back ordering of vowel consonants is the same?

;Vowel-consonants palindromes
(deffilterop vc-palindrome? [word]
 "Return true for vowel-consonant palindromes"
 (let [vc-word (clojure.string/replace (clojure.string/replace word #"[AEIOUY]" "A") #"[^A]" "B")]
  (= vc-word (clojure.string/reverse vc-word)) ))

(?<- (stdout) [?word]
((lfs-textline "/Users/agong/Data/scrabble-list-twl06-10pct.txt") :> ?word)
(vc-palindrome? ?word))

There we go!  From ANALYZE to ZYMOSIS.  (Your words will probably be different, because of the sampling.  But they will still be v/c palindromes. :) )

I had fun spending an afternoon putting together these examples.  It was a great, well-bounded way to get my feet wet with clojure and cascalog.

HTH

2 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. i got interested in playing scrabble specially when my mom introduced it to me..well i do learned a lot from her specially she also gives me some helpful tips on how to play scrabble..


    Scrabble Consonant Words

    ReplyDelete