Monday, January 9, 2012

Data set: pretty much every candidate in any of the US primary elections (house, senate, governor, president)

Working on a project on social media and U.S. primary elections, I couldn't find a good, machine-readable listing of candidates.  (Project Vote Smart doesn't do primaries.)  So I crowd-sourced it on mturk.  Here's the data, in JSON format (documented below).

For each election, we asked for as many candidate names, parties, and campaign websites as turkers could find.  We also asked for websites to verify the information, with the stern warning, "To receive credit for your work, you must include stable URLs to credible sites where you found your information."

[Screenshot of the mturk task]

Some details:

We vetted the data by running each task twice and comparing responses.  Wherever we found discrepancies (there weren't very many), we fixed misspellings, checked to make sure candidates were real, etc.  Overall, it worked pretty well.  We found nearly 2,000 candidates across the almost-500 elections.  I'm guessing the final data contain a handful of mistakes, but not many.  It is, as they say, good enough for government work.
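For anyone curious what "comparing responses" could look like in code, here is a rough sketch in Python. It is not the script we actually used; the function names and the normalization rule (lowercase, collapse whitespace) are assumptions made up for illustration. It flags any candidate name that shows up in only one of the two runs, which is where a human would then step in to fix spellings or check that the candidate is real.

```python
def normalize(name):
    """Lowercase a name and collapse runs of whitespace (an assumed, simple rule)."""
    return " ".join(name.lower().split())


def find_discrepancies(response_a, response_b):
    """Return the normalized names that appear in exactly one of the two responses."""
    names_a = {normalize(n) for n in response_a}
    names_b = {normalize(n) for n in response_b}
    # Symmetric difference: names seen by one turker but not the other.
    return sorted(names_a ^ names_b)


# Two hypothetical responses to the same task (invented names).
run_1 = ["Jane Doe", "John  Smith"]
run_2 = ["jane doe", "Jon Smith"]
flagged = find_discrepancies(run_1, run_2)
print(flagged)
```

Anything in `flagged` would still need a manual look; a likely misspelling such as "Jon" vs. "John" shows up as two separate entries.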

Here's the data format.  The main file is an array of election objects:

election :
    id : a unique ID for the election
    office : "president" / "senate" / "house" / "governor"
    state : The name of the state where the election is being held.  N/A for president.
    district : The congressional district where the election is being held.  N/A for everything except house races.
    candidates : an array of candidate objects.

candidate :
    name : the candidate name
    party : The candidate's party (Republican / Democrat / Other)
    websites : an array of website URLs.
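
Here's a small Python sketch of reading data in this format. The sample records are invented placeholders, not real entries from the file, and I'm assuming here that "N/A" is stored as a literal string and that district is a string; check the actual file before relying on either. The helper names are also just for illustration.

```python
import json
from collections import Counter

# Hypothetical sample in the documented format (invented values).
SAMPLE = """
[
  {"id": "e1", "office": "house", "state": "Ohio", "district": "3",
   "candidates": [
     {"name": "Candidate A", "party": "Republican",
      "websites": ["http://example.com/a"]},
     {"name": "Candidate B", "party": "Democrat", "websites": []}
   ]},
  {"id": "e2", "office": "president", "state": "N/A", "district": "N/A",
   "candidates": [
     {"name": "Candidate C", "party": "Other",
      "websites": ["http://example.com/c"]}
   ]}
]
"""


def count_by_party(elections):
    """Tally candidates by party across all elections."""
    counts = Counter()
    for election in elections:
        for candidate in election.get("candidates", []):
            counts[candidate.get("party", "Other")] += 1
    return counts


def candidates_for_office(elections, office):
    """List candidate names for every election of the given office."""
    return [c["name"]
            for e in elections if e["office"] == office
            for c in e["candidates"]]


elections = json.loads(SAMPLE)
party_counts = count_by_party(elections)
house_names = candidates_for_office(elections, "house")
print(party_counts, house_names)
```

To run this against the real file, swap `json.loads(SAMPLE)` for `json.load(open(path))` with whatever path you saved the data to.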

I'm putting this out in case anybody else was searching for the same kind of data.  I'd love to hear about any mistakes and/or useful applications for it.  Cheers!
