Here are highlights from a recent email exchange with a tech-minded friend in the polisci department. His questions were similar to others I've been fielding recently, and very to-the-point, so I thought the conversation might be useful for others looking at getting into web scraping for research (and fun and profit).
> ... I have an unrelated question: What python libraries do you use to scrape websites?
urllib2, lxml, and re do almost everything I need. Sometimes I use wget to mirror a site, then use glob and lxml to pull out the salient pieces. For the special case of directed crawls, I built snowcrawl.
> I need to click on some buttons, follow some links, parse (sometimes ugly) html, and convert html tables to csv files.
Ah. That's harder. I've done less of this, but I'm told mechanize is good. I don't know of a good table-to-csv converter, but that's definitely a pain point in some crawls -- if you find anything good, I'd love to hear about it!
It strikes me that you could do some nice table-scraping with cleverly deployed xpath to pull out rows and columns -- the design would look a little like functional programming, although you'd still have to do use loops. Python is good for that, though.
> Is mechanize the way to go for browsing operations?
From hearsay, yes.
> What's your take on BeautifulSoup vs. lxml vs. html5lib for parsing?
I got into lxml early and it works wonderfully. The only pain point is (sometimes) installation, but I'm sure you can handle that. My impression is that lxml does everything BeautifulSoup does, faster and with slightly cleaner syntax, but that it's not so much better that everyone has switched. I don't know much about html5lib.
> Should I definitely learn xpath?
Definitely. The syntax is quite easy, very similar to jquery/css selectors. It also makes for faster development: get a browser plugin for xpath and you can test your searches directly on the pages you want to crawl. This will speed up your inner loop for development time tremendously -- much better than editing and re-running scripts, or running tests from a console.