Wednesday, July 10, 2013

Speed up Hadoop development with progressive testing

Debugging Hadoop jobs can be a huge pain.  The cycle time is slow, and error messages are often uninformative --- especially if you're using Hadoop streaming or working on EMR.

I once found myself trying to debug a job that took a full six hours to fail.  It took more than a week -- a whole week! -- to find and fix the problem.  Of course, I was doing other things at the same time, but the need to constantly check up on the status of the job was a huge drain on my energy and productivity.  It was a Very Bad Week.

Painful experiences like this have taught me to follow a test-driven approach to Hadoop development.  Whenever I'm working on a new Hadoop-based data pipe, my goal is to isolate six distinct kinds of problems:

  1. Explore the data: The pipe must accept data in a given format, which might not be fully understood at the outset.
  2. Test basic logic: The pipe must execute the intended data transformation for "normal" data.
  3. Test edge cases: The pipe must deal gracefully with edge cases: missing or malformed fields, rare divide-by-zeros, etc.
  4. Test deployment parameters: The pipe must be deployable on Hadoop, with all the right filenames, code dependencies, and permissions.
  5. Test cluster performance: For big enough jobs, the pipe must run efficiently.  If it doesn't, the cluster needs to be tuned or scaled up.
  6. Test scheduling parameters: Once the pipes are built, routine jobs must be scheduled and executed.
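Steps 2 and 3 go much faster if the transformation logic lives in plain functions you can exercise without a cluster.  As a minimal sketch --- the record format, field names, and function names here are all hypothetical, not from any particular pipe:

```python
# Sketch: keep the core transformation in plain functions so steps 2-3
# can be tested locally, with no Hadoop involved.  Assumed (hypothetical)
# record format: tab-separated "user_id<TAB>clicks<TAB>impressions".

def parse_record(line):
    """Parse one input line; return None for malformed records (step 3)."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 3:
        return None
    user_id, clicks, impressions = parts
    try:
        return user_id, int(clicks), int(impressions)
    except ValueError:
        return None

def click_through_rate(clicks, impressions):
    """Core transformation (step 2), guarding the divide-by-zero edge case."""
    if impressions == 0:
        return 0.0
    return float(clicks) / impressions

# Local test harness: normal data, a malformed row, and an edge case.
assert parse_record("u1\t3\t10") == ("u1", 3, 10)
assert parse_record("garbage line") is None        # malformed record
assert click_through_rate(3, 10) == 0.3
assert click_through_rate(5, 0) == 0.0             # divide-by-zero guard
```

Once functions like these pass locally, the streaming mapper or reducer is just a thin wrapper around them, and any failure on the cluster points at deployment rather than logic.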

Each of these steps requires different test data and different methods for trapping and diagnosing errors.  The goal, therefore, is to (1) tackle the problems one at a time, and (2) solve each kind of problem in the environment with the fastest cycle time.

Steps 1 through 3 should be solved locally, using progressively larger data sets.  Steps 4 and 5 must be run remotely, again using progressively larger data sets.
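One cheap way to build those "progressively larger data sets" is to carve samples off the head of the real input.  A sketch (the filenames are hypothetical):

```python
# Sketch: carve progressively larger local samples from the real input,
# so steps 1-3 run in seconds instead of hours.  Paths are hypothetical.
from itertools import islice

def sample_input(src_path, dst_path, n_lines):
    """Copy the first n_lines of src_path to dst_path."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in islice(src, n_lines):
            dst.write(line)

# e.g. test against 100 rows, then 10,000, then 1,000,000:
# for n in (100, 10000, 1000000):
#     sample_input("full_input.tsv", "sample_%d.tsv" % n, n)
```

Head-of-file samples can miss rare edge cases, so it's worth hand-adding a few known-nasty records to the smaller samples as well.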

Step 6 depends on your scheduling system and has a very slow cycle time (i.e. you must wait a day to test whether your daily jobs run on the proper schedule).  However, it's independent of Hadoop, so you can build, test, and deploy it separately.  (There may be some crossover with step 4, but you can test this with small data sets.)
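If the scheduler is something as simple as cron, step 6 reduces to verifying that an entry like this fires on the right schedule --- the paths and script name here are hypothetical:

```
# Hypothetical crontab entry: run the daily pipe at 02:00 and capture
# logs, so scheduling failures surface without touching Hadoop at all.
0 2 * * * /usr/bin/python /opt/pipes/daily_job.py >> /var/log/daily_job.log 2>&1
```

A useful trick is to schedule a stub that merely logs a timestamp; once the stub fires on schedule for a day or two, swap in the real job.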

Going through six different rounds of testing may seem like overkill, but in my experience it's absolutely worth it.  Very likely, you'll encounter at least one new bug/mistake/unanticipated case at each stage.  Progressive testing ensures that each bug is dealt with as quickly as possible, and prevents them from ganging up on you.

Other suggestions:
  • Definitely use an abstraction layer that lets you seamlessly deploy local code to your staging and production clusters.  Cascalog and mrjob are good examples.  Otherwise, you'll find yourself solving steps 2 and 3 all over again in deployment.
  • Config files and object-oriented code can prevent a lot of headaches in step 4.  Most of your deployment hooks can be written once and saved in a config file.  If you have strong naming conventions, most of your filenames can be constructed (and tested) programmatically.  It's amazing how many hours you can waste debugging a simple typo in Hadoop; good OOP will spare you most of that pain.
  • Part of the beauty of Hive and HBase is that they abstract away most of the potential pitfalls on the deployment side, especially in step 4.  By the same token, tools like Azkaban and Oozie can take a lot of the pain out of step 6.  (Be careful, though -- each of these scheduling tools has its limitations.)
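As a sketch of the naming-convention point: filenames can be built (and sanity-checked) from a single config, rather than typed by hand for every job.  The bucket name, pipeline name, and path convention below are all hypothetical:

```python
# Sketch: construct pipeline paths from one config plus a naming
# convention, so a typo is caught by a cheap assertion instead of a
# six-hour job failure.  Config values and convention are hypothetical.
import re

CONFIG = {
    "bucket": "s3://my-data-bucket",   # hypothetical bucket
    "pipeline": "clickstream",
    "env": "staging",                  # or "production"
}

def input_path(date):                  # date as "YYYY-MM-DD"
    return "{bucket}/{env}/{pipeline}/input/dt={date}/".format(date=date, **CONFIG)

def output_path(date):
    return "{bucket}/{env}/{pipeline}/output/dt={date}/".format(date=date, **CONFIG)

def check_path(path):
    """Cheap programmatic test of the naming convention (step 4)."""
    pattern = r"s3://[\w.-]+/(staging|production)/\w+/(input|output)/dt=\d{4}-\d{2}-\d{2}/$"
    assert re.match(pattern, path), "bad path: %s" % path

for fn in (input_path, output_path):
    check_path(fn("2013-07-10"))
```

Switching the whole pipe from staging to production then becomes a one-line config change, with the same checks applied in both environments.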

