Revision as of 15:51, 5 January 2014

Hive

Overview

From the project site: "Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL."

The Fedora Big Data SIG is investigating the requirements to adapt the latest version of Hive as a package in Fedora, now that Hadoop 2.x has been packaged. Although Hive obviously has a significant dependency on Hadoop, the Java project is not Maven-based and instead is built using Ant and Ivy. The xmvn tooling support in Fedora does not directly apply to the Hive build. In many ways this can be viewed as a simplification instead of a challenge since one can configure a local file-system Ivy resolver relatively easily.

Using static build-derived analysis (Ant doesn't really provide something like the Maven dependency plugin), there is a group of dependencies currently missing from Fedora that blocks building Hive against Fedora-only installed versions. Many other dependencies are available in Fedora but are not necessarily version-compatible. However, as with the Hadoop outline, those can hopefully be mitigated in the Hive source where possible.

Build

Version 0.11 is the latest release. It builds from source against the Fedora Hadoop target (2.0.5a) using:

ant very-clean package -Dhadoop.version=2.0.5-alpha -Dhadoop-0.23.version=2.0.5-alpha -Dhadoop.mr.rev=23 -DenhanceModel.notRequired=true -Dmvn.hadoop.profile=hadoop23 -Dshims.include=0.23
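As a minimal sketch, the same invocation can be assembled from a single Hadoop version variable so that the two version properties cannot drift apart. The variable names `hadoop_version` and `ant_args` and the `HADOOP_VERSION` override are assumptions made for illustration, not part of the Hive build. The script only prints the command (a dry run).

```shell
#!/bin/sh
# Hypothetical wrapper: build the Ant argument list from one version variable.
# All -D properties are the ones shown in the command above.
hadoop_version="${HADOOP_VERSION:-2.0.5-alpha}"

ant_args="very-clean package"
ant_args="$ant_args -Dhadoop.version=$hadoop_version"
ant_args="$ant_args -Dhadoop-0.23.version=$hadoop_version"
ant_args="$ant_args -Dhadoop.mr.rev=23"
ant_args="$ant_args -DenhanceModel.notRequired=true"
ant_args="$ant_args -Dmvn.hadoop.profile=hadoop23"
ant_args="$ant_args -Dshims.include=0.23"

# Dry run: print the command instead of executing it.
echo "ant $ant_args"
```

To actually run the build from the Hive source tree, replace the final line with `ant $ant_args` (left unquoted so the shell splits the arguments).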

Note that to do a local build using the SIG branch you must make a directory to store any (currently) unpackaged jars:

mkdir -p ~/hive/lib/missing
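A rough way to see which jars still need to be dropped into that directory is to look for them in the system jar location first. The sketch below assumes jars are installed under `/usr/share/java`; the name list is illustrative only (taken from the dependency table on this page), and the actual jar file names on a given system may differ.

```shell
#!/bin/sh
# Hypothetical helper: report which jars cannot be found in the system jar
# directory, so they can be copied into the "missing" directory by hand.
java_dir="${JAVA_DIR:-/usr/share/java}"              # assumed system jar location
missing_dir="${MISSING_DIR:-$HOME/hive/lib/missing}"
mkdir -p "$missing_dir"

# Illustrative names only; adjust to the dependencies actually being checked.
wanted="javolution libthrift libfb303"

missing=""
for name in $wanted; do
    # Look for any jar whose file name starts with the dependency name.
    found=$(find "$java_dir" -name "${name}*.jar" 2>/dev/null | head -n 1)
    if [ -z "$found" ]; then
        missing="$missing $name"
    fi
done

echo "not found under $java_dir:$missing"
echo "place unpackaged jars in: $missing_dir"
```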

Dependencies

The full Hive dependency list is captured here, but the following table outlines the missing dependencies. The ones in bold are deemed hard dependencies and must be packaged.

Missing/Questionable Dependencies
Project	State	Review BZ	Packager	Notes
avro-ipc, avro-mapred	Complete	~~RHBZ #1009170~~	ricardo	Although avro 1.6.2 is packaged, it does not include the ipc and mapred jars. IPC appears to apply only to the 0.20 shim. MapRed is used by an Avro reader/input/output feature in QL and is based on the legacy mapred API (i.e., org.apache.hadoop.mapred).
datanucleus-core	Complete	~~RHBZ #1011705~~	pmackinn,gil	Forms backbone of metastore layer for different data sinks. Upstream project at http://www.datanucleus.org/
datanucleus-api-jdo	Complete	~~RHBZ #1011962~~	pmackinn,gil	JDO implementation for datanucleus
datanucleus-rdbms	Complete	~~RHBZ #1011960~~	pmackinn,gil	RDBMS plugin adapter for datanucleus
hbase	Active		rrati	hbase-handler can be compiled out but seems like a significant omission
high-scale-lib	Review	RHBZ #865893	gil
javolution	Complete	~~RHBZ #1009153~~	pmackinn	Used by the QL classes: a hard dependency
jdo-api	Complete	~~RHBZ #1011696~~	pmackinn,gil	Dependency for datanucleus-api-jdo. CANNOT substitute existing jdo2-api.
libthrift, libfb303	Complete	~~RHBZ #982285, RHBZ #1000563~~	willb	Will Benton has some RPM artifacts at http://freevariable.com/thrift/
metrics-core	Review	RHBZ #861502	gil
pig	Active		pmackinn	Test and source imports of Pig classes, however they appear to be in the adapter space so may be able to defer.
tempus-fugit	Review	RHBZ #1009654	gil	Concurrency library. May only be a test dep. Upstream at http://tempusfugitlibrary.org/

NB: This list is distilled from the overall set of missing dependencies. Many of the dependencies not listed here are not required for the latest Fedora version of Hadoop (2.0.5a), provided the build properties noted above are specified.

SCM

The BigData SIG is tracking a set of commits here so that the build conforms to the Fedora Packaging Guidelines (FPG). These will eventually be converted into a patch set for a spec file once all the outstanding missing dependencies are in place. The commits include a set of custom Ivy resolvers that only inspect the local filesystem in typical Fedora Java jar locations. An RFE was created to make Fedora Ivy map dependencies into the local filesystem implicitly, thus doing away with custom resolvers (as much as feasible).
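For illustration, a local file-system Ivy resolver of the kind described above might look like the following sketch. This is not the SIG's actual resolver configuration: the resolver name `fedora-local`, the output file name `ivysettings-local.xml`, and the two `/usr/share/java` patterns are assumptions about a typical Fedora jar layout. The script simply writes the settings file.

```shell
#!/bin/sh
# Hypothetical example: write an Ivy settings file whose only resolver looks
# for jars on the local filesystem (assumed layout: /usr/share/java).
settings_file="${1:-ivysettings-local.xml}"

cat > "$settings_file" <<'EOF'
<ivysettings>
  <!-- Resolve everything through the local filesystem resolver. -->
  <settings defaultResolver="fedora-local"/>
  <resolvers>
    <filesystem name="fedora-local">
      <!-- Unversioned jars directly under the system jar directory. -->
      <artifact pattern="/usr/share/java/[artifact].[ext]"/>
      <!-- Jars grouped in a per-module subdirectory. -->
      <artifact pattern="/usr/share/java/[module]/[artifact].[ext]"/>
    </filesystem>
  </resolvers>
</ivysettings>
EOF

echo "wrote $settings_file"
```

Because the patterns omit `[revision]`, a resolver like this matches whatever version is installed, which is why version compatibility has to be handled in the Hive source rather than in the resolver.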



SIGs/bigdata/packaging/Hive: Difference between revisions
