Tuesday, August 20, 2013

Understanding Apache Hive


  
Introduction

  BigData and Hive

Apache Hive is a software application created to facilitate data analyses on Apache Hadoop. It is a Java framework that helps extracting knowledge from data placed on a HDFS cluster by providing a SQL-like interface to it.

The Apache Hadoop platform is a major project on distributed computing and it is commonly assumed to be the best approach when dealing with BigData challenges.
It is now very well established that great volume of data is produced everyday. Whether it is by system logs or by users purchases, the amount of information generated is such that previous existing Databases and Datawarehouses solutions don’t seem to scale well enough.
The MapReduce programming paradigm was uncovered in 2004 as a new approach on processing large datasets. In 2005 its OpenSource version, Hadoop, was created by Doug Cutting. Although Hadoop is not set for substituting relational databases, it is a good solution for big data analyses.

Hadoop facilitates large data processing, but it still requires skillful programmers to create the Map and Reduce functions to analyze the data. All analyzes made through Hadoop had to be condensed on these two functions. Creating this type of applications might be challenging and difficult to maintain. Previous data developers had difficulty on extracting intelligence from their data. Hive was created to overcome this issues.

Apache Hive

First introduced by Facebook and latter donated to the Apache Software Foundation, it is a data warehouse interface for Hadoop. With Hive, users can create SQL statements that will be automatically converted to MapReduce jobs and run on a HDFS cluster.

Data can be inserted or dealt with on the Hadoop cluster through command line interface using statements from the Hive Quey Language, or HiveQL, such as SELECT, INSERT or CREATE TABLE. Users can also create their own User Defined Functions, by extending the UDF class already provided. Within these statements tables can be defined using primitive types as integers, floating points, strings, dates and booleans. Furthermore, new types can be created by grouping these primitives types into maps and arrays. Please check https://cwiki.apache.org/Hive/languagemanual.html for more information on HiveQL.

Although Hive presents a data warehouse interface for Hadoop, it is still a batch processing framework. As Hive’s data is located on Hadoop, it is limited to Hadoop’s constraints.
Hadoop does not index data it is not made for editing Data. There is no UPDATE on Hive, because this functionality could not be executed on data over HDFS. Hive does not support transactions. If you want these kind of database on top of Hadoop you should look for options such as HBase. Check http://wiki.apache.org/hadoop/HadoopIsNot to read more about this Hadoop limitations.

Even so, Apache Hive made it possible for developers with basic SQL knowledge to create complicated meaningful queries and quickly extract value from big data.

Architecture

Users can start interacting with Hive though a Command Line Interface (CLI), Hive Web Interface (HWI), JDBC or ODBC.
The CLI interface is a command line tool accessed through a terminal. It can be initiated by calling the HIVE_HOME/bin/hive script, inside Hive downloaded source code. Hive also provides a Hive server, so that users can use JDBC or ODBC to communicate with it.




When you type a query through the CLI interface, this HiveQL statement will be handled by the Driver component. The Driver connects a bunch of modules that transform the statement into MapReduce jobs to be run in Hadoop. It is importante to note that the query is not transformed in Java code in this process. Its goes direclty to MapReduce jobs. The modules involved in this process are: Parser, Semantic Analyzes, Logical Plan generator, Optimizer, Physical Plan Generator and Executor.


First, the Driver creates a session to remember details about the process, to maintain dates and statistics.

Some metadata (information about tables and columns) is then collected and stored on Metastore as soon as the input data (tables) are created. This metadata is actually stored in a relational database and it is latter on used on the Semantic Analyses.

ANTLR software is used to create a parser on the Parser module and parse the query. As in a compiler, the statement in broken down into token values and a Abstract Syntax Tree (AST) is created.

The following HiveQL statement

FROM src src1 JOIN src src2 ON (src1.key = src2.key) JOIN src src3 ON (src1.key + src2.key = src3.key)

INSERT OVERWRITE TABLE dest1 SELECT src1.key, src3.value

would became this AST


(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_JOIN (TOK_TABREF (TOK_TABNAME src) src1) (TOK_TABREF (TOK_TABNAME src) src2) (= (. (TOK_TABLE_OR_COL src1) key) (. (TOK_TABLE_OR_COL src2) key))) (TOK_TABREF (TOK_TABNAME src) src3) (= (+ (. (TOK_TABLE_OR_COL src1) key) (. (TOK_TABLE_OR_COL src2) key)) (. (TOK_TABLE_OR_COL src3) key)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB (TOK_TABNAME dest1))) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL src1) key)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL src3) value))))) null

The process follows on with a Semantic Analyses on the generated AST. The information provided on the query is verified to be valid by confronting the schema information from the input tables, stored on the Metastore component.
Type checking is a example of operations performed by the Semantic Analyzes.

After this analyses, a operator tree is created by the Logical Plan Generator, based on the parsed information and on the AST created. This operator tree is then, passed to the Optimizer procedure, which will perform a set of transformations to, not surprisingly, optimize the operations. The improvements accomplished by the Optimizer include column pruning (only column really needed will be fetched) and join reordering (to make sure only small tables are kept in memory).
The Physical Plan Generator gets the optimized operator tree and creates a Directed Acyclic Graph of MapReduce jobs of it. This physical plan is displayed in a XML file, and it is delivered to the Executor to be executed into the Hadoop cluster finally.




Hive and the different Hadoop versions

Hive can be built with Hadoop 1.x or with Hadoop 2.x. It presents interfaces for this purpose, and these interfaces are defined in the Shims interface.

There are three interfaces for Hadoop: 0.20, 0.20S, 0.23. 0.20 is supposed to work with Hadoop 1.x, 0.20s is for a secure version of Hadoop 1.x and 0.23 is for building Hive against Hadoop 2.x.

You can prevent a interface to be built by editing the property shims.include on HIVE_HOME/shims/build.xml:

<property name="shims.include" value="0.20,0.20S,0.23"/>                             

Hive uses a Factory Method to decide which Hadoop interface to use, based on the version of Hadoop on the classpath. This is situated on

HIVE_HOME/shims/src/common/java/org/apache/hadoop/hive/shims/ShimLoader.java.

HIVE_HOME/shims/src/common/java/org/apache/hadoop/hive/shims/HadoopShims.java

encapsulates the interfaces.

hadoop.version is defined on HIVE_HOME/build.properties but you can overwrite it by using the flag -Dhadoop.version.

To build Hive with Hadoop 1.1.2, use


$ ant clean package -Dhadoop.version=1.1.2                   


To build Hive with Hadoop 2.0.4-alpha, use

$ ant clean package -Dhadoop.version=2.0.4-alpha -Dhadoop-0.23.version=2.0.4-alpha -Dhadoop.mr.rev=2


Building

To use some of Hive features such as UDF, or even to adapt Hive’s code to your own needs, you might have to build the source code from source.
Before start, make sure you have a Java JDK, Ant and Subversion installed on your computer.

Then, start by downloading the last stable release version from Hive repository.

$ svn checkout https://svn.opensource.ibm.com/svn/stg-hadoop/hive/0.11.0/trunk hive-0.11.0

Enter on your Hive home directory (which from now on, we will call HIVE_HOME):

$ cd hive-0.11.0                                               

And finally build the code with:

$ ant package                                                  

This will automatically download and install all dependencies required for Hive’s use.

Hive depends on (or uses) other Hadoop-related components. As from Hive 0.11 version, these are:

Apache Hadoop
Apache HBase
Apache Avro
Apache Zookeper

This components will be automatically downloaded by Ant and Ivy, when you run the ant package command. You can check which version of each component will be downloaded in

HIVE_HOME/ivy/libraries.properties

and, as explained on last session, Hadoop version can be chekced here:

HIVE_HOME/build.properties

To check all ant command possibilities with Hive, type:

$ ant -p                                                       

This should show how to built it, test it and even how to create a tar file from the source. The testing will be explained a little bit further next.


Unit Tests

Hive provides several buitltin Unit Tests to verify its own modules and features functionalities. They are constructed using JUnit 4 and run queries (.q files) already provided by the framework.

To create the JUnit classes execute:

$ ant package                                                  

To run the unit tests type:

$ ant test                                                     

To run a specific test run:


$ ant test -Dtestcase=TestCliDriver                            

To run a specific query inside one Unit Test run:

$ ant test -Dtestcase=TestCliDriver -Dqfile=alter5.q           


The command described above will produce a output that will be compared with Hive’s expected output. It will also generate a .xml log file, very helpuf for debbugging purposes:

HIVE_HOME/build/ql/test/TEST-org.apache.hadoop.hive.cli.TestCliDriver.xml

If you are having troubles with a certain testcase, and trying to debug it, pay attention: some java test files (all files under ql module) for Hive are created on build time from Velocity Templates (.vm). If you want to modify this tests you have to change the .vm file, not the .java one.



References:




Book: Programming Hive:Data Warehouse and Query Language for Hadoop. Edward Capriolo, Dean Wampler, Jason Rutherglen



Article: Hive A Warehousing Solution Over a MapReduce

Framework. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao,

Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff and Raghotham Murthy.

Facebook Data Infrastructure Team