Wednesday, February 20, 2013

Why are there three Hadoop svn repositories (common, hdfs and mapreduce)? Where is the repository for YARN?

When developers start reading about Hadoop, one of the first things they learn is:

"The project includes these modules:
  • Hadoop Common: The common utilities that support the other Hadoop modules.
  • Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
  • Hadoop YARN: A framework for job scheduling and cluster resource management.
  • Hadoop MapReduce: A YARN-based system for parallel processing of large data sets."
So it can be a little confusing when, trying to build Hadoop from source, they are told to check out a single repository called hadoop-common.

It might become even more confusing when you realize that there are two other repositories for Hadoop: hadoop-hdfs and hadoop-mapreduce.

So which repository should you use?

The answer is: hadoop-common encompasses all these Hadoop modules.

If you look at the hadoop-hdfs or hadoop-mapreduce repos, you will see that they haven't been modified since 2009 (more or less). Until version 0.21, the Hadoop repositories were divided by module. From version 0.22 on, they were combined into a single SVN repository, as documented on Jira:

What about the YARN thing? And what is MapReduce 2?

A few years ago, there was a "split" between Hadoop releases: release 1.x continued classic Hadoop from the 0.20 line, and release 2.x grew out of the 0.23 line, with different features.

Hadoop 2.x includes a couple of new modules that enable MapReduce to run on top of a general resource management system for distributed applications. This system is YARN, and the MapReduce that runs on it is called MapReduce 2.

The 2.x release does not contain the old classic MapReduce, only MapReduce 2.

The hadoop-common repository includes all of these modules: YARN, HDFS, MapReduce, MapReduce 2 and the Common utility libraries. Note that which of these modules are present varies from release to release.
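
Since the set of modules differs between releases, a quick way to see what a given checkout contains is to look for the top-level module directories. The following is only a minimal sketch: the function name modules_present is made up for this post, and the directory names are assumptions based on the trunk layout, so older branches may name or nest them differently.

```python
from pathlib import Path

# Assumed top-level directory names (based on the trunk layout).
# Older branches may use other names or keep YARN nested elsewhere.
MODULE_DIRS = {
    "Common": "hadoop-common-project",
    "HDFS": "hadoop-hdfs-project",
    "YARN": "hadoop-yarn-project",
    "MapReduce": "hadoop-mapreduce-project",
}


def modules_present(checkout_root):
    """Map each module name to True if its directory exists in the checkout."""
    root = Path(checkout_root)
    return {
        name: (root / dirname).is_dir()
        for name, dirname in MODULE_DIRS.items()
    }


def report(checkout_root):
    """Print one line per module saying whether it was found."""
    for name, found in modules_present(checkout_root).items():
        print(f"{name:10s} {'present' if found else 'missing'}")
```

Calling report() with the path of your local hadoop-common checkout prints one line per module. A "missing" result only means that the assumed directory name was not found at the top level, not necessarily that the module is absent from that release.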