Pig
Overview
From the project site: "Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets."
The Fedora Big Data SIG is investigating what is required to package the latest version of Pig in Fedora, now that Hadoop 2.x has been packaged. Although Pig depends heavily on Hadoop, it is not Maven-based; it is built with Ant and Ivy, so the xmvn tooling support in Fedora does not directly apply to the Pig build. In many ways this is a simplification rather than a challenge, since a local file-system Ivy resolver is relatively easy to configure.
Static analysis derived from the build (Ant has no real equivalent of the Maven dependency plugin) shows a group of dependencies that are currently missing from Fedora, which blocks building Pig against Fedora-only installed versions. Many other dependencies are available but are not necessarily version-compatible. However, as described in the hadoop outline, those can hopefully be mitigated in the Pig source where possible.
Build
Version 0.11 is the latest release and is built from source (against the Fedora Hadoop target of 2.0.5a) using:
ant very-clean package -Dexcludes="**/jython/**,**/jruby/**"
The jython and jruby scripting features are currently compiled out because of API issues with the current versions of those packages.
Note that to do a local build using the SIG branch you must make a directory to store any (currently) unpackaged jars:
mkdir -p ~/pig/lib/missing
Dependencies
The following table outlines the significant missing dependencies. The ones in bold are deemed hard dependencies and must be packaged.
Project | State | Review BZ | Packager | Notes |
---|---|---|---|---|
hbase | Active | | rrati | part of the piggybank contrib, may be optional |
hive | Active | | pmackinn | also part of the piggybank contrib, may be optional and in fact poses a circular package dependency |
libthrift, libfb303 | Review | RHBZ #982285, RHBZ #1000563 | willb | presumably part of the piggybank contrib; Will Benton has some RPM artifacts at http://freevariable.com/thrift/ |
NB: This list is distilled from the overall set of missing dependencies. Many of the dependencies not listed here are not required for the latest Fedora version of Hadoop (2.0.5a), assuming the appropriate build properties noted are specified.
SCM
The Big Data SIG is tracking a set of commits here to build according to the Fedora Packaging Guidelines (FPG). These will eventually be converted into a patch set for a spec file once all the outstanding missing dependencies are in place. The commits include a set of custom Ivy resolvers that only inspect the local filesystem in typical Fedora Java jar locations. An RFE was created to make Fedora Ivy map dependencies into the local filesystem implicitly, thus doing away with custom resolvers as much as feasible.
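As a rough illustration of what a local-filesystem-only resolver can look like, the following ivysettings.xml sketch chains two Ivy filesystem resolvers: one for the system jar directory and one for the directory of unpackaged jars created earlier. It is not taken from the SIG commits. The resolver names and artifact patterns are assumptions, and because Fedora jar names often differ from Ivy module names, real patterns would likely need tuning per dependency.

```xml
<!-- Hypothetical sketch: resolver names and patterns are illustrative
     assumptions, not the settings used in the SIG branch. -->
<ivysettings>
  <!-- Route every lookup through the local-only chain below. -->
  <settings defaultResolver="fedora-local"/>
  <resolvers>
    <!-- returnFirst stops at the first resolver that finds the artifact. -->
    <chain name="fedora-local" returnFirst="true">
      <!-- System-installed jars; Fedora typically installs them unversioned. -->
      <filesystem name="system-jars">
        <artifact pattern="/usr/share/java/[artifact].[ext]"/>
        <artifact pattern="/usr/share/java/[module]/[artifact].[ext]"/>
      </filesystem>
      <!-- Jars that are not yet packaged, kept under ~/pig/lib/missing. -->
      <filesystem name="missing-jars">
        <artifact pattern="${user.home}/pig/lib/missing/[artifact]-[revision].[ext]"/>
      </filesystem>
    </chain>
  </resolvers>
</ivysettings>
```

Because no remote resolver appears in the chain, a dependency that is absent from both locations fails resolution instead of being downloaded, which is the behavior a Fedora-only build needs.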