Saturday, February 18, 2012

Q&A about web scraping with python

Here are highlights from a recent email exchange with a tech-minded friend in the polisci department. His questions were similar to others I've been fielding recently, and very to-the-point, so I thought the conversation might be useful for others looking at getting into web scraping for research (and fun and profit).

> ... I have an unrelated question: What python libraries do you use to scrape websites?

urllib2, lxml, and re do almost everything I need. Sometimes I use wget to mirror a site, then use glob and lxml to pull out the salient pieces. For the special case of directed crawls, I built snowcrawl.
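
To make the mirror-then-extract pattern concrete, here is a minimal sketch in Python 3. It assumes the site has already been mirrored to a local directory (for example with wget) and uses only glob and re to pull the page title out of each file. The function name `extract_titles` and the choice of the `<title>` tag as the "salient piece" are illustrative assumptions; for anything more structured than this, lxml would likely be the better tool.

```python
import glob
import os
import re

# Non-greedy match for the contents of the <title> element.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_titles(mirror_dir):
    """Map each mirrored .html file path to its <title> text (or None).

    `mirror_dir` is a hypothetical local directory produced by mirroring a site.
    """
    titles = {}
    # "**" with recursive=True also matches files directly inside mirror_dir.
    pattern = os.path.join(mirror_dir, "**", "*.html")
    for path in sorted(glob.glob(pattern, recursive=True)):
        with open(path, encoding="utf-8", errors="replace") as f:
            match = TITLE_RE.search(f.read())
        titles[path] = match.group(1).strip() if match else None
    return titles
```

Calling `extract_titles("some/mirror/dir")` returns a dict keyed by file path, which is easy to dump to CSV or inspect by hand.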

> I need to click on some buttons, follow some links, parse (sometimes ugly) html, and convert html tables to csv files.

Ah. That's harder. I've done less of this, but I'm told mechanize is good. I don't know of a good table-to-csv converter, but that's definitely a pain point in some crawls -- if you find anything good, I'd love to hear about it!

It strikes me that you could do some nice table-scraping with cleverly deployed xpath to pull out rows and columns -- the design would look a little like functional programming, although you'd still have to use loops. Python is good for that, though.
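
A rough sketch of that idea, in Python 3: select the rows with one path expression, select the cells within each row, and hand the nested lists to the csv module. To keep it dependency-free, this uses the standard library's `xml.etree.ElementTree` (which supports a limited XPath subset) as a stand-in for lxml, so it only works on well-formed markup. The sample table and the function names are made up for illustration, and it makes no attempt to handle `rowspan`/`colspan` or nested tables.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample markup; a real page would need a forgiving HTML parser.
SAMPLE_HTML = """
<table>
  <tr><th>Name</th><th>Votes</th></tr>
  <tr><td>Alice</td><td>120</td></tr>
  <tr><td>Bob</td><td><b>95</b></td></tr>
</table>
"""


def table_to_rows(table):
    """Return the table as a list of rows, each a list of cell strings."""
    return [
        # itertext() flattens nested markup such as <b>95</b> into plain text.
        ["".join(cell.itertext()).strip()
         for cell in row if cell.tag in ("td", "th")]
        for row in table.findall(".//tr")
    ]


def rows_to_csv(rows):
    """Serialize rows to CSV text; the csv module handles quoting."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


table = ET.fromstring(SAMPLE_HTML.strip())
rows = table_to_rows(table)
csv_text = rows_to_csv(rows)
print(csv_text)
```

The nested comprehension is where the "functional" flavor comes from: rows and cells are each one expression, and the loops are hidden inside the comprehensions.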

> Is mechanize the way to go for browsing operations?

From hearsay, yes.

> What's your take on BeautifulSoup vs. lxml vs. html5lib for parsing?

I got into lxml early and it works wonderfully. The only pain point is (sometimes) installation, but I'm sure you can handle that. My impression is that lxml does everything BeautifulSoup does, faster and with slightly cleaner syntax, but that it's not so much better that everyone has switched. I don't know much about html5lib.

> Should I definitely learn xpath?

Definitely. The syntax is quite easy, very similar to jquery/css selectors. It also makes for faster development: get a browser plugin for xpath and you can test your searches directly on the pages you want to crawl. This will speed up your inner development loop tremendously -- much better than editing and re-running scripts, or running tests from a console.
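
To give a feel for how close the two syntaxes are, here is a small Python 3 sketch that pairs each path expression with a roughly equivalent CSS selector in the comments. The page snippet is invented, and the standard library's `xml.etree.ElementTree` stands in for lxml, so only a limited XPath subset is available and the markup has to be well-formed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page snippet for illustration only.
PAGE = """
<html><body>
  <ul id="results">
    <li><a href="/item/1">First</a></li>
    <li><a href="/item/2">Second</a></li>
  </ul>
  <a class="next" href="/page/2">Next page</a>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# CSS: ul#results > li > a      XPath: .//ul[@id='results']/li/a
item_links = [a.get("href") for a in root.findall(".//ul[@id='results']/li/a")]

# CSS: a.next                   XPath: .//a[@class='next']
# Note: the XPath form requires the class attribute to equal "next" exactly,
# while the CSS form would also match class="next big".
next_link = root.find(".//a[@class='next']").get("href")

print(item_links, next_link)
```

The same expressions can be pasted into a browser xpath tool (minus the leading dot) to check them against a live page before they go into a script.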

HTH.  Cheers!


  1. Hi Dude,

    Web scraping is the set of techniques used to get some information, structured only for presentation purposes, from a website automatically instead of copying it manually. Thanks a lot...

    Web Scraping

  2. Looks like we've got some link-farming going on. It looks like these guys are trying to run a business around web scraping. It's relevant, so I guess I'll leave the link here, but...

    "Carly" - next time you leave a comment, don't just copy the text from wikipedia!

  3. Haha, I know this comment isn't providing much value... but Abe, your response just made me laugh. It's pretty funny how a spam comment/business is... "relevant".

  4. Hi...

    The web scraping process focuses on transforming unstructured web content into structured data. This allows easy analysis and storage in a database or even a spreadsheet. It is regarded as a technique for extracting information or data from websites.

    Using Web Scraping in Business