If you're wondering what Big Data things are in Fedora, or are interested in working on packaging or reviews to help out the Big Data SIG, this is the page to look at!
If you know of a big-data-related package that is already in Fedora, or have one that you'd like to get into Fedora, be sure to list it here, link to the page describing what needs to be done, or link to the Bugzilla bug that needs help.
Packages available in Fedora
Package | Description | Packaged Version | Upstream Version | Sources | Who | Notes |
---|---|---|---|---|---|---|
Apache Hadoop | Batch processing system and core of the Hadoop ecosystem | 2.4.1 | 2.7.1 | hadoop.git | Hadoop packaging | |
Apache HBase | The Apache Hadoop NoSQL Database | 0.98.3 | 1.0.1.1 | hbase.git | HBase packaging | |
Apache Hive | SQL-on-Hadoop query framework, a data warehouse for Hadoop | 0.12.2 | 1.2.1 | hive.git | ||
Apache Pig | Language for expressing data analysis programs that run on MapReduce | 0.13.10 | 0.15.0 | pig.git | Pig packaging | |
Apache Zookeeper | A service for highly reliable distributed coordination | 3.4.6 | 3.4.6 | zookeeper.git | ||
Apache Oozie | Workflow scheduler system to manage Apache Hadoop jobs | 4.0.1 | 4.2.0 | oozie.git | rsquared | Oozie packaging |
Apache Ambari | Hadoop cluster manager | 1.5.1 | 2.1.0 | ambari.git | ||
Apache Accumulo | A software platform for processing vast amounts of data | 1.6.1 | 1.7.0 | accumulo.git | ||
Apache Mesos | Cluster manager for sharing distributed application frameworks | 0.22.1 | 0.23.9 | mesos.git | Mesos packaging | |
Apache Solr | Ultra-fast Lucene-based Search Server | 5.5.0 | 6.0.1 | Retired | ||
Apache Spark | Lightning-fast cluster computing | 0.9.1 | 1.4.1 | spark.git | Spark packaging Scala packaging | |
AMPLab Tachyon | A memory resident, fault tolerant distributed file system | 0.99 | 0.7.0 | tachyon.git | Tachyon packaging |
Packages we're working on
Package | Description | Packaged Version | Upstream Version | Sources | Who | Notes | |
---|---|---|---|---|---|---|---|
Apache Flume | Data ingestion tool for large amounts of log data | 1.6.0 | 1.6.0 | flume-rpm.git | gil | Flume packaging RHBZ#1279201 | |
Cloudera Kite SDK | Kite SDK to simplify the development of data-related systems | 1.0.0 | 1.1.0 | ||||
Apache Crunch | Java library providing a framework for MapReduce pipelines | 0.11.0 | 0.12.0 | crunch-rpm.git | gil | ||
Apache Tez | Generalizes the MapReduce paradigm to a more powerful framework | 0.5.3 | 0.7.0 | tez-rpm.git | gil | ||
Apache Kafka | Publish-subscribe messaging broker for large scale | 0.8.0 | 0.8.2.1 | kafka-rpm.git | jromanes | Kafka packaging | |
Apache Storm | Distributed real-time computation system | 0.9.3 | 0.9.5 | storm-rpm.git | jromanes | Storm packaging | |
Apache Tajo | Low-latency and scalable SQL-on-Hadoop framework | 0.10.0 | 0.10.1 | ||||
Apache Jena | Java framework for building Semantic Web and Linked Data applications | 3.0.0 | 3.0.0 | jena.spec | donpellegrino | ||
Cascading | Data processing workflows on Hadoop using any JVM-based language | 2.6.3 | 2.7.1 | cascading.spec | gil | ||
Apache Sqoop2 | Bulk data transfer between Hadoop and structured datastores | 1.99.3 | 1.99.6 | sqoop.spec | pmackinn | RHBZ #1089675 | |
Neo4j | Java Graph Database | 2.2.8 | 3.0.0-M04 | neo4j.spec | gil | Newer releases (2.3+) use Scala 2.11+ | |
Apache Cassandra | OpenSource database Apache Cassandra | 3.4 | 3.5 | cassandra.git | trepik | RHBZ#1324020 |
Packages we'd like to include
- Shark
- Aurora
- Sparrow
- Presto
- Summingbird
- RHadoop
- Sentry
- Ooyala Job Server
- unicage
- GridGain
- Elephant Bird
- Hadoop-lzo
- CKAN - "The open source data portal software"
- Samza
- Flink
- Geode
- New stuff here!
Becoming a packager
Not yet a packager? Check out the Package Maintainers page or the Join the package collection maintainers page for more information. You can also ask on the Big Data SIG mailing list for assistance and see if you can find a willing helper or sponsor. Before packaging Java software, read the Java packaging guidelines.
Typical workflow (relies on GitHub)
- Clone the original repo if modifications are required.
- Patch where necessary. (Use GitHub tickets where possible if working as a group.)
- Try to organize your patch set into meaningful units, and create tickets to push them upstream where possible.
- Patches that must be carried should be applied to the raw sources where possible.
- Create a package-rpm repo with specs and system integration files (systemd, custom-conf, etc.).
- Use rpmbuild, or hack fedpkg, to enable prototype package building:
  - spectool -g package.spec (downloads the sources)
  - md5sum package-sources.tar.gz > sources
  - fedpkg local
- Once you feel you have a package ready for review, run the following before submitting:
  - Set up Fedora Review
  - rpmlint package.spec
  - mock --clean --init -r fedora-rawhide-x86_64 && fedora-review -m fedora-rawhide-x86_64 -n package.srpm
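The prototype-build steps above can be sketched as a small shell helper. This is only a minimal sketch, not SIG tooling: the function name `write_sources` is hypothetical, `package.spec` and `package-sources.tar.gz` are the placeholder names already used in the list, and only the checksum step is implemented so that the snippet runs without Fedora tooling or network access. The `spectool` and `fedpkg` calls are left as comments in the order the list gives them.

```shell
#!/bin/sh
# Hypothetical helper: write an md5sum-style "sources" file for one tarball,
# mirroring "md5sum package-sources.tar.gz > sources" from the workflow.
write_sources() {
    tarball="$1"
    if [ ! -f "$tarball" ]; then
        echo "missing source tarball: $tarball" >&2
        return 1
    fi
    # md5sum prints "<hash>  <file>", which is what ends up in "sources".
    md5sum "$tarball" > sources
}

# Order of operations from the workflow (commented out here because they
# need Fedora packaging tools and, for spectool, network access):
#   spectool -g package.spec
#   write_sources package-sources.tar.gz
#   fedpkg local
```

Failing early on a missing tarball avoids leaving behind an empty `sources` file that `fedpkg local` would then trip over.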
Packaging Notes
- Fedora Java rpms cannot bundle dependent jars. Every jar file not created by the build must come from an rpm in the Fedora repository.
- All jars must be built from source.
- Fedora build tools: xmvn-resolve (mvn-local, mvn-rpmbuild and mvn-build are no longer available in rawhide and are considered private implementation)
- Fedora rpm macros: %pom_*, %mvn_build, %mvn_install, %mvn_file
- xmvn-subst for dependency jars when packaging
- Fedora Java Packaging guidelines: https://fedoraproject.org/wiki/Packaging:Java
  - JNI handling: System.load replaces System.loadLibrary, jar file in %{_jnidir}
  - Jar files in %{_javadir}
- Fedora build systems have no internet access; avoid DNS lookups where possible.
- Breaking apart or subsuming subelements
  - Depending on how popular a sub-element is as a stand-alone package, it sometimes makes more sense to break it out as a sub-package that can stand alone but doesn't have to live in a separate repository. This choice has to be made by the upstream group and will depend heavily on their ideal workflow, but from a maintenance perspective a sub-package is far easier to maintain. E.g. one project produces multiple libs/jars.
- Fedora ships OpenJDK 7 or higher. You cannot mix and match the Fedora versions of maven and ant with Java 6, since they are themselves compiled with source="1.7".
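The rule in the notes above that no dependent jars may be bundled, and that all jars must be built from source, can be checked against an unpacked upstream tree before writing the spec. The following is a minimal sketch under assumptions: the function name `find_bundled_binaries` is hypothetical, and treating `.class` files the same as `.jar` files is an assumption based on the build-from-source rule, not something this page states.

```shell
#!/bin/sh
# Hypothetical helper: list prebuilt .jar and .class files in an unpacked
# source tree so they can be removed before the package is built from source.
find_bundled_binaries() {
    srcdir="${1:-.}"
    # -type f skips directories that merely end in .jar or .class
    find "$srcdir" -type f \( -name '*.jar' -o -name '*.class' \) | sort
}
```

Running `find_bundled_binaries path/to/unpacked-sources` prints one path per line; empty output means nothing prebuilt was found in that tree.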