statistics++: Making Fedora Project statistics accessible and automated
Ian Weller, Fedora Engineering, Red Hat, Inc.
Version 1.0 (Tue Mar 27 2012)
Executive summary
This document is a specification for statistics++, a set of software components that aggregate and display data and statistics about the Fedora community. Its primary goals are to make data about the Fedora Project easily accessible to the public and to automate statistical analysis that is currently done by hand.
statistics++ is a smaller project in Fedora Engineering's FY13 plan. It depends on a messaging bus existing within Fedora Infrastructure. If we complete all milestones on time, the project's first release will be mid-July 2012.
Revision history
- Version 1.0 — Tue Mar 27 2012
  - Initial specification release
Project overview
Fedora Infrastructure has had a limited foray into the field of statistics. The Statistics page on the Fedora Project Wiki has some limited information about the number of HTTP requests made to various infrastructure applications and the number of wiki edits made per month.
The statistics app in the first version of Fedora Community attempted to improve on the Statistics page, but ultimately failed because of the complexity of adding new and relevant automated queries to the platform and the limited amount of information Fedora's application servers could access.
With the planned messaging infrastructure for infrastructure applications, we can create a statistics application to listen on the message bus, record activity, and store activity in a database for later retrieval. We call this program statistics++.
statistics++ consists of three components:
- datanommer, a server daemon that listens on the infrastructure message bus and records activity to a database
- datagrepper, an HTTP application that provides a RESTful web API for downloading data stored in the database based on a simple query syntax
- dataviewer, an HTTP application that produces automated data displays such as tables or charts
Target audience
Component | Target audience |
---|---|
datanommer | Fedora Infrastructure application developers that want to make application data available for use in datagrepper and dataviewer |
datagrepper | Programmers that want to generate queries on datanommer-provided data for personal use or for inclusion in dataviewer |
dataviewer | Any user interested in statistics about the Fedora Project, including Fedora users and developers, Red Hat executives, and journalists |
Goals
This project aims to solve the following problems:
- Data on the Statistics wiki page can only be generated and validated by those who have access to Fedora log servers
- Data on the Statistics wiki page requires a human to generate the data each week
- Data on the Statistics wiki page does not encompass all infrastructure applications
- Anybody who can edit the wiki can change data on the Statistics wiki page
- Programmers must write different code to generate data for each infrastructure application
To solve these problems, statistics++ has the following functionality:
- Open, read-only access to any anonymous data collected by infrastructure applications
- A standard RESTful API for downloading data
- Flexible schemas for storing and retrieving data from infrastructure applications
- Live updates of statistical data from infrastructure applications
- An interface for creating automated queries and representing data in tables or charts
Non-goals
This project should not attempt to solve the following problems:
- Live pushing of data to other applications (the purpose of the messaging bus)
Details / design overview
Modularity
I broke statistics++ into three components. There are some benefits to this:
- We can version and update each component separately
- Other projects can decide to use the project as a whole or as separate components (such as using datanommer alone to avoid the TurboGears 2 stack)
- I get to reuse the name datanommer (the name of a statistics project that was started about two years ago and put on indefinite hold)
datanommer
datanommer is a system daemon written in Python. At a basic level, its purpose is to connect to a message bus, listen for interesting messages, and store data from those messages into a database.
datanommer includes a SysV-style init script or systemd service file.
A configuration file defines the data stored for each application in units called schemas. Each schema belongs to a single application, but an application can have multiple schemas. Each schema consists of this information:
- The namespace to check messages against
- The fields stored in the database and their types (SQLAlchemy field types, most likely)
- (optional) A regular expression for reading data in from log files using the datanommer-logread utility
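As a rough illustration of the three pieces of schema information listed above, the sketch below expresses one schema as a Python dictionary and uses its regular expression to read a log line. The configuration format is still an open issue, so the key names (`namespace`, `fields`, `log_regex`), the namespace string, and the `parse_log_line` helper are all assumptions made for this example, and plain Python types stand in for SQLAlchemy field types to keep it dependency-free.

```python
import re
from datetime import datetime

# Hypothetical schema for Apache httpd requests. The key names and the
# namespace string are assumptions; the spec does not define them.
HTTPD_SCHEMA = {
    "namespace": "org.fedoraproject.httpd.request",
    # Plain Python types stand in for SQLAlchemy field types here.
    "fields": {"host": str, "date": datetime, "path": str, "status": int},
    # Matches the start of an Apache common log format line.
    "log_regex": re.compile(
        r'(?P<host>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
        r'"\S+ (?P<path>\S+) [^"]*" (?P<status>\d{3})'
    ),
}


def parse_log_line(schema, line):
    """Return a dict of typed field values, or None if the line does not match."""
    match = schema["log_regex"].match(line)
    if match is None:
        return None
    row = {}
    for name, field_type in schema["fields"].items():
        raw = match.group(name)
        if field_type is datetime:
            # Apache timestamp, e.g. 27/Mar/2012:10:00:00 +0000
            row[name] = datetime.strptime(raw, "%d/%b/%Y:%H:%M:%S %z")
        else:
            row[name] = field_type(raw)
    return row
```

A log-reading utility along these lines could call `parse_log_line` once per line and pass each non-None result to the same storage path used for bus messages.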
When enabled, datanommer checks each message on the bus against its list of namespaces. If the message matches a namespace that datanommer knows, it extracts the data and stores it in the database.
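The matching step can be sketched in a few lines. This is only an illustration under stated assumptions: the message shape (a dictionary with `namespace` and `body` keys), the namespace strings, and the `handle_message` and `store` names are hypothetical, and `store` stands in for whatever database insert datanommer ends up using.

```python
# Hypothetical schemas: namespace -> names of the fields to keep.
SCHEMAS = {
    "org.fedoraproject.wiki.edit": ["user", "page", "date"],
    "org.fedoraproject.bodhi.karma": ["user", "update", "karma"],
}


def handle_message(message, schemas, store):
    """Store the schema's fields from a message if its namespace is known.

    Returns True when a row was stored and False when the message was ignored.
    """
    fields = schemas.get(message.get("namespace"))
    if fields is None:
        return False  # Not a namespace we know; ignore the message.
    # Keep only the fields named by the schema; missing fields become None.
    row = {name: message["body"].get(name) for name in fields}
    store(message["namespace"], row)
    return True
```

Passing the storage function in as an argument keeps the matching logic independent of the database choice, which the spec leaves open.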
datagrepper
datagrepper is a web frontend written in the TurboGears 2 framework, run with Apache httpd and WSGI. It accepts queries to the statistics database and returns the requested information.
Depending on implementation, datagrepper may or may not need access to datanommer's configuration file. If the database is SQL-backed (e.g. PostgreSQL), datagrepper can determine the schema for each database based on table layouts. If a NoSQL database is used, datanommer could put information about the schema in the database. Alternatively, datagrepper can simply have access to datanommer's configuration file.
The index page of datagrepper shows available schemas and what fields can be fetched or searched. It outputs this list in HTML or JSON.
The /query URI accepts a query string as either a GET or POST request. Query string variable names match those of the database fields. /query accepts Django-like field lookup arguments (for example, sending the query string date__lte=2011-12-31 returns rows in the table where the "date" field is less than or equal to December 31, 2011). /query accepts a __format argument to output data in JSON or CSV.
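To make the lookup syntax concrete, here is a minimal sketch of how such a query string might be interpreted. The set of supported lookups, the `parse_query` and `apply_filters` names, and the in-memory filtering are assumptions for illustration; a real implementation would presumably translate the filters into database queries and convert values to the field types declared in the schema. Values are compared as strings here, which happens to work for ISO-formatted dates.

```python
import operator
from urllib.parse import parse_qsl

# Lookup suffixes borrowed from Django's field lookups. Which of these
# datagrepper would actually support is an assumption.
LOOKUPS = {
    "exact": operator.eq,
    "lt": operator.lt,
    "lte": operator.le,
    "gt": operator.gt,
    "gte": operator.ge,
}


def parse_query(query_string):
    """Split a query string into (filters, options).

    Filters are (field, comparison, value) triples. Options are the
    arguments prefixed with a double underscore, such as __format.
    """
    filters, options = [], {}
    for key, value in parse_qsl(query_string):
        if key.startswith("__"):
            options[key[2:]] = value
            continue
        field, _, lookup = key.partition("__")
        # A bare field name means an exact match; unknown lookups raise KeyError.
        filters.append((field, LOOKUPS[lookup or "exact"], value))
    return filters, options


def apply_filters(rows, filters):
    """Keep the rows (dicts) that satisfy every filter."""
    return [
        row for row in rows
        if all(compare(row[field], value) for field, compare, value in filters)
    ]
```

For example, `parse_query("date__lte=2011-12-31&__format=csv")` yields one filter on the `date` field and the option `{"format": "csv"}`.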
datagrepper client Python library
A Python library for accessing datagrepper will automate the intricacies of downloading data via HTTP, using gzip compression, continuing queries, and converting the JSON output to a Python object.
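A minimal sketch of what such a client might do, using only the Python standard library, follows. The function names and the base URL are hypothetical, and the sketch assumes the server honors an `Accept-Encoding: gzip` header and the `__format` argument described for /query. Query continuation is left out because its mechanism is still an open issue.

```python
import gzip
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_request(base_url, **params):
    """Build a GET request for /query that asks for gzip-compressed JSON."""
    params["__format"] = "json"
    query_string = urlencode(sorted(params.items()))
    url = "%s/query?%s" % (base_url.rstrip("/"), query_string)
    return Request(url, headers={"Accept-Encoding": "gzip"})


def decode_body(body, content_encoding=None):
    """Decompress the body if needed and convert the JSON to Python objects."""
    if content_encoding == "gzip":
        body = gzip.decompress(body)
    return json.loads(body.decode("utf-8"))


def query(base_url, **params):
    """Run one query against a datagrepper instance (needs network access)."""
    with urlopen(build_request(base_url, **params)) as response:
        encoding = response.headers.get("Content-Encoding")
        return decode_body(response.read(), encoding)
```

A caller would then write something like `query("https://stats.example.org", date__lte="2011-12-31")`, where the host name is a placeholder.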
dataviewer
dataviewer is a web frontend written in the TurboGears 2 framework, run with Apache httpd and WSGI. It makes queries to datagrepper using the Python client library and displays data in various formats (such as tables or charts).
The specific plan for defining what displays are available and how they get data is being discussed in the #fedora-apps IRC channel on freenode.
Requirements for release
- The following applications must send activity or log messages over the message bus:
  - Apache httpd
  - MediaWiki
  - FAS
  - MirrorManager
  - Bodhi
  - Koji
  - AutoQA
  - Git (pkgs.fedoraproject.org and git.fedorahosted.org)
- The datanommer service must run, connect to a message bus, listen for activity, parse activity messages and store data into a database for all the above services.
- Data from before datanommer began running must be gathered from log files or application databases and placed in the database.
- The datagrepper service must run and respond to basic queries. The data schema for each infrastructure application and the query syntax must have documentation, and examples in that documentation must function. The service must provide responses in JSON and compress a response when requested.
- Queries on the Statistics wiki page using the above application data must exist in dataviewer.
- Documentation must exist for:
  - Adding schemas to datanommer
  - Using the datanommer API
  - Using the datagrepper Python client library
  - Adding displays to dataviewer
Use cases
Within six months, statistics++ should handle the following use cases:
- Adam wants information on wiki edits made in 2011. He doesn't have experience with any programming languages, but he could work with the data if he could import it into a spreadsheet program.
- Brenda needs information on how often Fedora systems requested repodata for different architectures from MirrorManager to provide information to FESCo on the debate of demoting an architecture to secondary.
- Cathy is a journalist and wants to determine the year-by-year growth rate of the Fedora user base and compare that to the year-by-year growth rate of the Fedora contributor base.
- David wants to see how many packages were available at each release's end-of-life and whether the rate of change is increasing or decreasing.
- Ethan of the Websites team wants to see if a certain page was regularly accessed enough to decide whether to remove it.
- Fred wants to determine how many packages required to stay in testing for a certain time actually receive positive or negative karma in Bodhi.
- Giles wants to see information on mailing list user counts over time.
Relationship to other services
statistics++ is directly related to the messaging bus project (fedbus and busmon).
statistics++ is indirectly related to every other infrastructure application, as we wish to include every infrastructure application in statistics++ eventually.
Reviewers
(Subject to change)
- Infrastructure reviewer: Kevin Fenzi
- Code reviewer: Toshio Kuratomi
- Message bus reviewer: Ralph Bean
- Frontend design/usability reviewer: Máirín Duffy
Schedule and milestones
Milestones aren't likely to change, but dates are subject to wild change depending on the status of messaging support in infrastructure.
Date | Milestone |
---|---|
2012-04-13 | |
2012-04-20 | datanommer written: |
2012-04-27 | datanommer packaged for EPEL and in production infrastructure (or staging if during a change freeze) |
2012-05-21 | datagrepper written: (The long amount of time involved here takes into account my inexperience with TurboGears 2, the preferred web framework for Fedora Infrastructure.) |
2012-05-28 | |
2012-06-29 | dataviewer written: |
2012-07-13 | dataviewer packaged for EPEL and in production infrastructure (or staging if during a change freeze) |
Open issues
- How does the datanommer configuration file define data types? (Currently thinking SQLAlchemy types will work best)
- How should messages sent while datanommer is not listening be handled?
- Should datanommer check for duplicate messages (such as reading in log files during a time period when datanommer was running)? If so, should this be configured per-schema?
- How should datagrepper handle excessively large queries? Some large queries may take longer than a normal HTTP timeout to generate. Some ideas:
  - Have a response that means "your query is generating, here's a code you can check to see if you can download it." Advantages: server can process query when it has idle time; downloads have less HTTP request overhead. Disadvantages: user has to wait for data; server has to retain data for some time period.
  - MediaWiki-style query-continue messages that give changes to query string variables to access the next set of results
- Should we use RRD as a secondary database for faster queries and rendering?
- How should dataviewer be configured?
- Should the dataviewer component be a separate web application or should it be part of the Fedora Community web framework?
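Of the ideas listed above for large queries, the MediaWiki-style continuation is the easiest to picture from the client's side. The sketch below is purely illustrative, since no decision has been made: the response shape (a `rows` list plus an optional `query-continue` dictionary of query string changes) and the `fetch_all` and `fetch_page` names are assumptions, and `fetch_page` stands in for a single HTTP request to datagrepper.

```python
def fetch_all(fetch_page, params):
    """Collect rows across pages by following continuation hints.

    fetch_page(params) must return a dict with a "rows" list and, when more
    results remain, a "query-continue" dict of query string changes.
    """
    params = dict(params)  # Work on a copy so the caller's dict is untouched.
    rows = []
    while True:
        page = fetch_page(params)
        rows.extend(page["rows"])
        changes = page.get("query-continue")
        if not changes:
            return rows
        # Apply the server's suggested changes and request the next page.
        params.update(changes)
```

With this approach each response stays small enough to finish within a normal HTTP timeout, at the cost of more round trips for large result sets.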
Resources for information
- Current Statistics wiki page: http://fedoraproject.org/wiki/Statistics
- Why Fedora thinks metrics are important and some discussion on how to count users: http://fedoraproject.org/wiki/Infrastructure/Metrics
- Updates system metrics: https://admin.fedoraproject.org/updates/metrics/?release=F16
- Fedora Messaging SIG: http://fedoraproject.org/wiki/Messaging_SIG
Responsible parties
- Responsible party for initial go-ahead: Tom 'spot' Callaway
- Responsible party for final project sign-off: Tom 'spot' Callaway