Revision as of 18:35, 24 April 2013

Apache Hadoop 2.0

Summary

Bring Apache Hadoop, the hottest open source big data platform, to Fedora, the hottest open source distribution. Fedora should be the best distribution for using Apache Hadoop.

This and other big data activities take place in the Big Data SIG.

Owner

Name: Matthew Farrellee
Email: matt@fedoraproject.org

People involved

Name	IRC	Focus	Additional
Matthew Farrellee	mattf	keeping track, integration testing	UTC-5
Peter MacKinnon	pmackinn	packaging	UTC-5
Rob Rati	rsquared	packaging	UTC-5
Timothy St. Clair	tstclair	setup and configuration	UTC-6
Sam Kottler	skottler	packaging	UTC-5
Gil Cattaneo	gil	packaging	UTC+1

Current status

Targeted release: Fedora 20
Last updated: 24 Apr 2013
Percentage of completion: 5%

Detailed Description

Apache Hadoop is a widely used, increasingly complete big data platform, with a strong open source community and growing ecosystem. The goal is to package and integrate the core of the Hadoop ecosystem for Fedora, allowing for immediate use and creating a base for the rest of the ecosystem.

Benefit to Fedora

The Apache Hadoop software will be packaged and integrated with Fedora. The core of the Hadoop ecosystem will be available with Fedora and provide a base for additional packages.

Scope

Package the Apache Hadoop 2.0.2 software
Package all dependencies needed for Apache Hadoop 2.0.2
Skip package dependencies required for unit testing and record them in a dependency backlog for later cleanup

Approach

We are taking an iterative, depth-first approach to packaging. We do not have all the dependencies mapped out ahead of time. Dependencies are being tabulated into two groups:

missing - the dependency requested by a hadoop-common pom has not yet been packaged, reviewed, or built into the Fedora repos
broken - the requested dependency version is out of date with respect to the current Fedora version, and patches must be developed for inclusion in a hadoop rpm build to address any build, API, or source code deltas

Note that a dependency may show up in both of these tables.

Anyone who wants to help should find an available dependency below, then edit the table, changing the state to Active and the packager to yourself.

While packaging a dependency, test dependencies can be skipped. Testing will be done via integration testing periodically during packaging and then after packaging completes. Test dependencies that are skipped must be added to the Skipped dependencies table below.

If you are lucky enough to pick a dependency that itself has unpackaged dependencies, identify the sub-dependencies and add them to the bottom of the Dependencies table below, change your current dependency to Blocked and repeat.
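The depth-first procedure above can be sketched in a few lines of Python. This is only an illustration, not a tool used by the packagers: the function name packaging_order, the shape of the deps mapping, and the example dependency graph are all assumptions made for the sketch. A dependency sits in the visiting set (the equivalent of Blocked) while its sub-dependencies are handled, and is emitted once nothing beneath it is left unpackaged.

```python
from typing import Dict, List, Set


def packaging_order(root: str, deps: Dict[str, List[str]], packaged: Set[str]) -> List[str]:
    """Return the unpackaged dependencies of `root` in an order they can be packaged."""
    order: List[str] = []
    visiting: Set[str] = set()  # dependencies currently "Blocked" on sub-dependencies

    def visit(name: str) -> None:
        if name in packaged or name in order:
            return  # already in Fedora, or already scheduled
        if name in visiting:
            raise ValueError(f"dependency cycle at {name}")
        visiting.add(name)
        for sub in deps.get(name, []):
            visit(sub)  # go depth-first into sub-dependencies
        visiting.discard(name)
        order.append(name)  # everything below is handled, so this one can proceed

    visit(root)
    return order


if __name__ == "__main__":
    # Hypothetical graph for illustration only; not the real hadoop dependency tree.
    example = {
        "hadoop": ["jersey", "zookeeper"],
        "jersey": ["grizzly"],
        "grizzly": ["glassfish-gmbal"],
    }
    print(packaging_order("hadoop", example, packaged={"glassfish-gmbal"}))
```

Because the full graph is not known ahead of time, in practice the deps mapping would grow as each spec file is written; the sketch only shows the ordering rule.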

If your dependency is already packaged but the version is incompatible, contact the package owner and resolve the incompatibility in a mutually satisfactory way. For instance:

If the version available in Fedora is older, explore updating the package. If that is not possible, explore creating a package that includes a version in its name, e.g. pkgnameXY. Ultimately, the most recent version in Fedora should have the name pkgname while older versions have pkgnameXY. It may take a full Fedora release to rationalize package names. Make a note in the Dependencies table.

If the version you need is older than the packaged version, consider creating a patch to use the newer version. If a patch is not viable, proceed by packaging the dependency with a version in its name, e.g. pkgnameXY. Make a note in the Dependencies table.
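The two cases above can be written down as a small decision function. This is a rough sketch under stated assumptions: the helper names (parse_version, versioned_name, resolve) and the can_update / can_patch flags are hypothetical, versions are assumed to be plain dotted numbers, and the comparison is much simpler than RPM's real version ordering. The outcome of the conversation with the package owner is modelled as the two boolean flags.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse a plain dotted numeric version such as '1.8' (simplification)."""
    return tuple(int(part) for part in version.split("."))


def versioned_name(pkgname: str, version: str) -> str:
    """Build a pkgnameXY style name from the major and minor version."""
    major_minor = parse_version(version)[:2]
    return pkgname + "".join(str(n) for n in major_minor)


def resolve(pkgname: str, needed: str, packaged: str,
            can_update: bool = False, can_patch: bool = False) -> str:
    """Suggest an action when the needed and packaged versions differ."""
    n, p = parse_version(needed), parse_version(packaged)
    if n == p:
        return f"use {pkgname}"
    if p < n:
        # Fedora has an older version: prefer updating the existing package.
        if can_update:
            return f"update {pkgname} to {needed}"
        return f"package {versioned_name(pkgname, needed)}"
    # Fedora has a newer version: prefer patching the consumer to use it.
    if can_patch:
        return f"patch to use {pkgname} {packaged}"
    return f"package {versioned_name(pkgname, needed)}"


if __name__ == "__main__":
    # Illustrative inputs only; they do not record decisions made for these packages.
    print(resolve("groovy", "1.5", "1.8"))
    print(resolve("hsqldb", "2.0", "1.8", can_update=True))
```

Whatever the function suggests, the result still needs a note in the Dependencies table, as described above.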

Missing dependency legend
State	Notes
Available	free for someone to take
Active	dependency is actively being packaged if missing, or patch is being developed or tested for inclusion in hadoop-common build
Blocked	pending packages for dependencies
Review	under review, include link to review BZ
Complete	woohoo!

Missing Dependencies
Project	State	Review BZ	Packager	Notes
hadoop	Active		rrati,pmackinn
bookkeeper	Review	RHBZ #948589	gil	Version 4.0 requested. packaged 4.2.1. Patch: BOOKKEEPER-598
glassfish-gmbal	Complete	RHBZ #859112	gil	F18 build
glassfish-management-api	Complete	RHBZ #859110	gil	F18 build
grizzly	Review	RHBZ #859114	gil
groovy	Review	RHBZ #858127	gil	1.5 requested but 1.8 packaged in Fedora. Going forward, the 1.8 series will possibly be known as groovy18 and groovy will be 2.x.
hsqldb	Available			1.8 in Fedora, 2.0 requested. 2.2.8 packaged by gil, but seemingly no review request. Needs follow-up.
jersey	Complete	RHBZ #825347	gil	F18 build. Should be rebuilt with grizzly2 support enabled.
jets3t	Review	RHBZ #847109	gil
jspc-compiler	Active		pmackinn	jspc specfile developed. Adaptations made for incumbent Tomcat 7 within spec. RPMs packaged in local custom repo. Reviewing fit as part of overall hadoop-common compilation/testing.
kfs	Available			gil has packaged 0.5, but no review request. kfs has become Quantcast qfs.
maven-native	Review	RHBZ #864084	gil	Needs patch to build with java7. NOTE: javac target/source is already set by mojo.java.target option
zookeeper	Review	RHBZ #823122	gil

Broken Dependencies
Project	Packager	Notes
ant		Version 1.6 requested, 1.8 currently packaged in Fedora. Needs to be inspected for possible API/functional incompatibilities.
apache-commons-collections	pmackinn	Java import compilation error with existing package. Patches for hadoop-common being tracked at https://github.com/pdmack/hadoop-common/tree/fed-master-collections
apache-commons-math	pmackinn	Current apache-commons-math uses math3 in pom instead of math, and API changes in code. Patches for hadoop-common being tracked at https://github.com/pdmack/hadoop-common/tree/fed-master-math
ecj	rrati	Need ecj version ecj-4.2.1-6 or later to resolve a dependency lookup issue
gmaven	gil	Version 1.0 requested, available 1.4 (but has broken deps) RHBZ #914056
hadoop-hdfs	pmackinn	glibc link error in hdfs native build. Patch for hadoop-common being tracked at https://github.com/pdmack/hadoop-common/tree/fed-master-cmake
jersey	pmackinn	Needs jersey-servlet and version. Tracked at https://github.com/pdmack/hadoop-common/tree/fed-master-jersey
jets3t	pmackinn	Requires 0.6.1. With 0.9.x: hadoop-common Jets3tNativeFileSystemStore.java error: incompatible types S3ObjectsChunk chunk = s3Service.listObjectsChunked(bucket.getName(). Patches for hadoop-common being tracked at https://github.com/pdmack/hadoop-common/tree/fed-master-jets3t
jetty	rrati	jetty8 packaged in Fedora, but 6.x requested. 6 and 8 are incompatible. Patches tracked at https://github.com/pdmack/hadoop-common/tree/fed-master-jetty
slf4j	pmackinn	Package in fedora fails to match in dependency resolution. jcl104-over-slf4j dep in hadoop-common moved to jcl-over-slf4j as part of jspc/tomcat dep. Patch being tracked at https://github.com/pdmack/hadoop-common/tree/fed-master-jasper
tomcat-jasper	pmackinn	Version 5.5.x requested. Adaptations made for incumbent Tomcat 7 via patches at https://github.com/pdmack/hadoop-common/tree/fed-master-jasper. Reviewing fit as part of overall hadoop-common compilation/testing.

Skipped dependencies
JAR	Project	State	Packager	Notes
[jar name]	[package name]	Available	no one	Needed for tests by #N

Workflow

A repo of dependencies that are already packaged and in review state can be found here: http://repos.fedorapeople.org/repos/rrati/hadoop/

Currently, only Fedora 18 x86_64 packages are available.

Packager tips

The mvn-rpmbuild utility will ONLY resolve dependencies from the system repo
mvn-local will resolve from the system repo first, then fall back to Maven if unresolved
- can be used to find the delta between the packages available in the system repo and the missing dependencies, which can then be viewed in the .m2 local Maven repo (e.g. find . -name '*.jar')
-Dmaven.local.debug=true
- reveals how JPP lookups are executed per dependency, which is useful for finding groupId/artifactId (gId, aId) mismatches
-Dmaven.test.skip=true
- tells Maven to skip test compilation
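As a rough sketch of the "find the delta" tip, the jars that end up in the local Maven repo after an mvn-local run can be listed as groupId, artifactId, version triples. The function name downloaded_artifacts is hypothetical, and the location of the .m2 repository is an assumption that must be passed in; the only thing relied on is the standard Maven repository layout (groupId as directories, then artifactId, then version, then the jar file).

```python
from pathlib import Path
from typing import List, Tuple


def downloaded_artifacts(repo: Path) -> List[Tuple[str, str, str]]:
    """List (groupId, artifactId, version) for every jar under a Maven repo directory."""
    found = set()
    for jar in repo.rglob("*.jar"):
        parts = jar.relative_to(repo).parts
        if len(parts) < 4:
            continue  # not in group/artifact/version/file layout
        *group, artifact, version, _filename = parts
        found.add((".".join(group), artifact, version))
    return sorted(found)


if __name__ == "__main__":
    # Assumed location; adjust to wherever the .m2 repo was created.
    for gid, aid, ver in downloaded_artifacts(Path.home() / ".m2" / "repository"):
        print(f"{gid}:{aid}:{ver}")
```

Each triple printed this way is a candidate for the missing or broken tables, since it was not resolved from the system repo.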

TODO: Template spec files to work from

TODO: Set up a staging repository for sharing packages under review

An alternative to gmaven:

Apply a patch with the following content where required.
Test support is not guaranteed and is not expected to work.

     <plugin>
       <groupId>org.apache.maven.plugins</groupId>
       <artifactId>maven-antrun-plugin</artifactId>
       <version>1.7</version>
       <dependencies>
         <dependency>
           <groupId>org.codehaus.groovy</groupId>
           <artifactId>groovy</artifactId>
           <version>any</version>
         </dependency>
         <dependency>
           <groupId>antlr</groupId>
           <artifactId>antlr</artifactId>
           <version>any</version>
         </dependency>
         <dependency>
           <groupId>commons-cli</groupId>
           <artifactId>commons-cli</artifactId>
           <version>any</version>
         </dependency>
         <dependency>
           <groupId>asm</groupId>
           <artifactId>asm-all</artifactId>
           <version>any</version>
         </dependency>
         <dependency>
           <groupId>org.slf4j</groupId>
           <artifactId>slf4j-nop</artifactId>
           <version>any</version>
         </dependency>
       </dependencies>
       <executions>
         <execution>
           <id>compile</id>
           <phase>process-sources</phase>
           <configuration>
             <target>
               <mkdir dir="${basedir}/target/classes"/>
               <taskdef name="groovyc" classname="org.codehaus.groovy.ant.Groovyc">
                 <classpath refid="maven.plugin.classpath"/>
               </taskdef>
               <groovyc destdir="${project.build.outputDirectory}" srcdir="${basedir}/src/main" classpathref="maven.compile.classpath">
                 <javac source="1.5" target="1.5" debug="on"/>
               </groovyc>
             </target>
           </configuration>
           <goals>
             <goal>run</goal>
           </goals>
         </execution>
       </executions>
     </plugin>

How To Test

TODO: NEEDS MORE DEFINITION
yum install X Y Z across one or more nodes
Set up a simple cluster by following TBD
Run http://hadoop.apache.org/docs/stable/gridmix.html

User Experience

Users who are interested in running Apache Hadoop on Fedora will find it available from the Fedora Project yum repositories.

TODO: SPECIFICALLY PACKAGES X Y Z

Dependencies

No other packages currently depend on Apache Hadoop.

Completion of this feature will involve packaging numerous dependencies; see the Dependencies table. Some of the dependencies are already being packaged by others in the Fedora community. Where dependency overlap is found, a negotiation must occur to ensure a satisfactory version and package is available to all parties.

TODO: Is https://fedoraproject.org/wiki/Hypertable ?

Contingency Plan

With no packages depending on Apache Hadoop, no contingency plan is necessary. The biggest risk is not completing packages for all dependencies. In that case, the feature can be removed from the release notes and pushed to the next Fedora release. The dependencies that have already been packaged should remain in the distribution.

Documentation

http://wiki.apache.org/hadoop

Release Notes

TODO

Comments and Discussion

See Talk:Features/Hadoop



Features/Hadoop: Difference between revisions