In the midst of all the hullabaloo about Ron Paul's decades-old racist newsletters, and his claim that he "never read that stuff," I ran across an interesting attempt to add some data to the discussion. Filed under "pointless data exercises" and "politics," blogger Peter Larson has used text analysis to compare his own blog, Ron Paul's recent speeches, and the original newsletters. He calls his results a smoking gun, with a question mark tacked on, and argues that Ron Paul wrote most of the original newsletters.
Before I go on, let me say that this is a really cool application. Instead of the he-said-she-said debate that's running in the media, this piece brings some actual data to bear on the conversation. Bravo!
That said, now I'm going to harp on statistics and inference. The problem with Larson's analysis is that he never addresses the question, "If Ron Paul didn't write the newsletters, who did?" Without answering that question, and putting some probabilities behind it, it's going to be very hard for text analysis to settle the issue one way or the other. (Larson admits this on his blog.) Right now, his analysis shows that, between Larson himself and Ron Paul, Paul is far more likely to have written the letters. Not exactly a smoking gun.
That said, it's interesting data. For the record, I'm mostly convinced. In my mind, the statistics are flawed, but they still lend some weight against Paul. Oddly enough, Larson's finding that several of the letters were probably *not* written by Ron Paul was particularly persuasive. It feels human and messy, the way I would expect this kind of thing to be.
Final thoughts, mainly intended for my statistically minded friends: yes, this is an unabashedly Bayesian perspective. I'm demanding priors, which probably can't be specified to anyone's satisfaction.
IMHO, the frequentist approach has even deeper problems. From a frequentist perspective (which is where Larson's original, deeply confusing p-values come from), we use Paul's recent speeches and writings to estimate some parameters of his current text-generating process. We then compare the newsletters to that process and estimate the probability that the older text was generated by the same process.
Problem: we *know* the old text was not generated by the same process. It was written (allegedly) by a younger Ron Paul, on different topics, speaking into a different political climate. Without a broader framework, it's impossible to determine whether the differences are important. The Bayesian approach provides a direct way of assessing that framework. The frequentist approach doesn't -- at least not that I can see, without jumping through a lot of hoops -- and in the meantime, it obscures the test that's actually being conducted.
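To make the Bayesian bookkeeping concrete, here's a minimal sketch of what "demanding priors" means for authorship. Everything in it is invented for illustration: the candidate authors, the per-word frequencies, the observed counts, and the 50/50 prior are all hypothetical, not Larson's actual data or method. The point is only that the prior over candidate authors is an explicit input you must specify.

```python
import math

# Hypothetical per-author word frequencies, as if estimated from
# known writing samples. All numbers are made up for illustration.
author_models = {
    "paul":  {"liberty": 0.030, "fed": 0.020, "race": 0.001},
    "ghost": {"liberty": 0.010, "fed": 0.005, "race": 0.010},
}

# The prior: how likely each candidate is to have written the
# disputed text, before looking at it. This is the input the
# frequentist analysis leaves unstated.
priors = {"paul": 0.5, "ghost": 0.5}

# Made-up word counts observed in one disputed newsletter.
observed = {"liberty": 4, "fed": 2, "race": 3}

def log_likelihood(model, counts):
    """Log-probability of the observed counts under a simple
    bag-of-words model of the author."""
    return sum(n * math.log(model[w]) for w, n in counts.items())

# Unnormalized log-posterior per author: log prior + log likelihood.
log_post = {a: math.log(priors[a]) + log_likelihood(m, observed)
            for a, m in author_models.items()}

# Normalize (subtracting the max first for numerical stability).
z = max(log_post.values())
weights = {a: math.exp(lp - z) for a, lp in log_post.items()}
total = sum(weights.values())
posterior = {a: w / total for a, w in weights.items()}
```

Shift the prior toward the ghostwriter and the posterior shifts with it; that sensitivity is exactly why priors "probably can't be specified to anyone's satisfaction," and exactly what the frequentist framing sweeps under the rug.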