From Fedora Project Wiki
(Undo revision 490612 by Jkaluza (talk))
(Undo revision 490611 by Jkaluza (talk))
 
= What is a Focus Document? =


The Factory 2.0 team produces a confusing number of documents.  The first round was about the '''Problem Statements''' we were trying to solve.  Let’s retroactively call them '''Problem Documents'''.  The '''Focus Documents''' (like this one) focus on some system or some aspect of our solutions that cut across different problems.  The content here doesn’t fit cleanly in one problem statement document, which is why we broke it out.


= Introduction =
= Background on ResultsDB =


With all the new possible combinations of content (RPM packages, modules, Docker images, ...), it is no longer feasible to rely on humans to remember, for example, to rebuild containers after they have built an RPM package intended for that container.  There are just too many.
* ResultsDB is a database for storing results.  Unsurprising!
* It is a passive system; it doesn’t actively do anything.
* It has an HTTP REST interface. You POST new results to it and GET them out.
* It was written by Josef Skladanka of Tim Flink’s Fedora QA team.
* It was originally written as a part of the larger Taskotron system, but we’re using it independently here.


With the introduction of modules, the dependencies between artifacts will become even more complex: modules depend on other modules and on RPMs, and Docker images can depend on modules, because modules can be used as building blocks for Docker images.
Links


The goal of the Continuous Compose Service is to address this issue by automatically rebuilding artifacts when their sources or dependencies are updated. This ensures, for example, that the RPM packages in the latest Docker image are always up to date, or that modules are automatically rebuilt after a change to their modulemd specification and can be automatically tested by the CI.
* Live, [https://taskotron.fedoraproject.org/resultsdb_api/api/v1.0/ production API] in Fedora.
* [http://docs.resultsdb.apiary.io/ API documentation]
* [https://pagure.io/taskotron/resultsdb Source code]


= Background on Freshmaker =
= What problems can we solve with it? =


* Freshmaker is a service scheduling rebuilds of artifacts as new content becomes available.
In formal Factory 2.0 problem statement terms, this helps us solve the '''Serialization''' and '''Automation''' problems directly, and indirectly all of the problems that depend on those two.
** With all the new possible combinations of content, it is no longer possible to require humans to remember to rebuild containers after they have built an rpm intended for that container.
* It listens on the fedmsg bus for messages from other build systems like Koji or the Module Build Service (MBS) and triggers the rebuild of a higher-level artifact when a lower-level artifact it depends on is built.
** For example, when a new release of the “foobar” RPM is built in Koji, Freshmaker handles that event and triggers the rebuild of all modules which require this RPM. Later, when those modules are rebuilt, it triggers the rebuild of all Docker images based on those modules, and so on. More detail is given further on in this document.
* It does not have a visible web UI or public API to manage or invoke builds, but does have an API to query the status of the service.
* It cooperates with other services like PDC and PolicyEngine to make decisions about what to rebuild and if it even should do the rebuild.
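
The cascade described in the example above can be pictured as a walk over a reverse-dependency map. The following is only a minimal sketch: the <code>DEPENDENTS</code> mapping, the artifact names and the <code>rebuild_order</code> function are hypothetical, and a real deployment would derive the mapping from a service such as PDC rather than hard-coding it.

```python
from collections import deque

# Hypothetical reverse-dependency map: artifact -> artifacts built on top of it.
# In practice this information would come from a service such as PDC.
DEPENDENTS = {
    "rpm:foobar": ["module:web-stack"],
    "module:web-stack": ["image:web-app"],
    "image:web-app": [],
}


def rebuild_order(changed, dependents):
    """Return every artifact affected by `changed`, nearest dependents first."""
    order = []
    seen = {changed}
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                order.append(dependent)
                queue.append(dependent)
    return order


print(rebuild_order("rpm:foobar", DEPENDENTS))
# → ['module:web-stack', 'image:web-app']
```

Note that the sketch computes the whole chain up front, whereas the service as described reacts to one build event at a time: the image rebuild is only triggered once the module rebuild has actually finished.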


= Automatic rebuild events =
Beyond that, let’s look at '''fragmentation'''.  The goal of Central CI in Red Hat was to consolidate all of the fragmentation around various CI solutions.  This was a success in terms of operational and capex costs -- instead of 100 different CI systems running on 100 different servers sitting under 100 different desks, we have one Central CI infrastructure backed by OpenStack serving on-demand Jenkins masters.  Win.  A side-effect of this has been that teams can request and configure their own Jenkins masters, without commonality in their configuration.  While teams are incentivized to move to a common test execution tool (Jenkins), there’s no common way to organize jobs and results.  While we reduced fragmentation at one level, it remains untouched at another.  People speak of this as the problem of “the fourteen Jenkins masters” of Platform QE.


This subchapter describes the events which trigger the automatic rebuild of other artifacts by Freshmaker, and hence also the use cases for Freshmaker. The use cases can be summarized in the following chart. A full description of each part of the chart is given later in this subchapter.
Beyond Jenkins, some PnT DevOps tools perform tasks that are ''QE-esque'' yet are not a part of the Central CI infrastructure.  Notably, the Errata Tool directly runs jobs like covscan, rpmgrill, rpmdiff, and TPS/insanity that are unnecessarily tied to the “release checklist” phase of the workflow. They could benefit from the common infrastructure of Central CI.


== About the service sending the event ==
One option could be to attempt to ''corral'' all of the various dev and QE groups into getting onto the same platform and configuring their jobs the same way.  That’s a possibility, but there is a high cost to achieving that level of social coordination.


Although the chart and this subchapter mention particular services which send an event to trigger the rebuilds, the real service sending the event can be different. For example, we will rebuild a Docker image only when the updated module passes its tests. Therefore it is possible that the real service triggering the rebuild of a Docker image will be PolicyEngine or something similar that handles the test results, but the real cause of the rebuild is a module build, and therefore we mention it as the service sending the event.
Instead, we intend to use resultsdb and a small number of messagebus hooks to ''insulate consuming services from the details of job execution''.


Just keep in mind that there might be multiple services on the road influencing whether and when Freshmaker sees the information about a particular artifact being built/updated.
= Wait!  Why not an ELK stack? =


== RPM spec file change ==
ELK is cool, and people are putting data in it anyways.  Why bother standing up resultsdb if some or all of this data is going to be in ELK?


* Event sent by:
* ELK has a schema that is very unopinionated.  You can store ''anything'' in it.  This is attractive, because there is a low barrier to entry for getting stuff in.  When it comes time for scripts to query for results, we worry that we’ll encounter unforeseen costs as we have to handle innumerable undocumented variations in the data: heterogeneous data.
** dist-git (.spec file change)
* On the other hand, resultsdb is actually quite opinionated about its schema.  You ''must'' fit the mold. This is good only as long as that schema remains simple.
* Triggers:
* We support teams in Red Hat populating ELK instances and using them.  However, we want those teams to get that information ''on to the message bus first'' and use that feed to populate ELK.  We can then consume the same event feed to populate resultsdb.  Different storage tools for different purposes. (We can furthermore protect ourselves from future bit-rot in either storage tool if we rely on the bus for our feed abstraction.)
** MBS (rebuilds all modules depending on this SRPM - spec file)
* Mapping can be found in:
** PDC - but needs more work, see below.


Although we have the list of SRPMs from which the module is built in PDC, it is currently not possible to search them according to SRPM. We can only fetch particular modules and then list their SRPMs. However, for this use-case, we would need to list all modules containing the SRPM. This would have to be implemented in the PDC.
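
The missing lookup amounts to inverting the existing module-to-SRPM mapping. As a rough illustration only (the <code>MODULE_SRPMS</code> data and the <code>modules_by_srpm</code> function are hypothetical and do not reflect PDC's actual API or data model):

```python
from collections import defaultdict

# Hypothetical data in the shape described above: each module lists its SRPMs.
MODULE_SRPMS = {
    "module-a": ["foobar", "libbaz"],
    "module-b": ["foobar"],
}


def modules_by_srpm(module_srpms):
    """Invert a module -> SRPMs mapping into an SRPM -> modules index."""
    index = defaultdict(set)
    for module, srpms in module_srpms.items():
        for srpm in srpms:
            index[srpm].add(module)
    return dict(index)


index = modules_by_srpm(MODULE_SRPMS)
print(sorted(index.get("foobar", set())))
# → ['module-a', 'module-b']
```

Doing this on the client side would mean fetching every module first, which is why the query is better implemented in PDC itself.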
This can be summarized in the following mantra: “ELK is for humans.  Resultsdb is for machines.”


== Modulemd metadata change ==
= Getting data out of resultsdb =


* Event sent by:
Resultsdb, unsurprisingly, stores results.  A result must be associated with a ''testcase'', which is just a namespaced name (for example, <code>general.rpmlint</code>).  It must also be associated with an ''item'', which you can think of as the unique name of a build artifact produced by some RCM tool: the <code>nevra</code> of an rpm is a typical value for the ''item'' field, indicating that a particular result is associated with a particular rpm.
** dist-git (modulemd .yaml file change)
* Triggers:
** MBS (rebuilds the module described by the modulemd .yaml file)
* Mapping can be found in:
** No mapping needed - we can directly rebuild the module based on the dist-git fedmsg message.


== Dockerfile change ==
== Generally ==


* Event sent by:
Take a look at some examples of queries to the Fedora QA production instance of taskotron, to get an idea for what this thing can store:
** Dist-git (Dockerfile change)
* Triggers:
** Koji (maybe also support for OSBS) (rebuilds all images based on this Dockerfile)
* Mapping can be found in:
** Dist-git Dockerfile maps directly to single container-build.


== RPM build ==
* A list of known testcases<br/>https://taskotron.fedoraproject.org/resultsdb_api//api/v1.0/testcases
* Information on the <code>dist.depcheck</code> testcase<br/>https://taskotron.fedoraproject.org/resultsdb_api//api/v1.0/testcases/dist.depcheck
* All known results for the <code>dist.depcheck</code> testcase<br/>https://taskotron.fedoraproject.org/resultsdb_api//api/v1.0/testcases/dist.depcheck/results
* Only <code>dist.depcheck</code> results associated with builds<br/>https://taskotron.fedoraproject.org/resultsdb_api//api/v1.0/testcases/dist.depcheck/results?type=koji_build
* All <code>dist.rpmlint</code> results associated with the <code>python-gradunwarp-1.0.3-1.fc24</code> build<br/>https://taskotron.fedoraproject.org/resultsdb_api//api/v1.0/testcases/dist.rpmlint/results?item=python-gradunwarp-1.0.3-1.fc24
* All results of any testcase associated with that same build<br/>https://taskotron.fedoraproject.org/resultsdb_api//api/v1.0/results?item=python-gradunwarp-1.0.3-1.fc24


* Event sent by:
== For the release checklist ==
** koji/brew (RPM is built)
* Triggers:
** Koji (maybe also support for OSBS) (rebuilds all images directly containing this RPM)
* Mapping can be found in:
** PDC - for example here.


== Module build ==
For the '''Errata Tool''' problems described in the introduction, we need to:
* Event sent by:
** MBS (new version of module is built)
* Triggers:
** MBS (rebuilds all the modules depending on this module or including it as “included module”)
** Koji (OSBS) (rebuilds all images containing this module)
* Mapping can be found in:
** Dependencies between modules are tracked in PDC now, but included modules are not. This is mostly the same situation as with tracking “RPM spec file change to module”.
** For Module to container image, there is no mapping like that. Should it be in PDC?


To rebuild the container images automatically, CoCo needs to auto-generate the Dockerfiles based on the built module. We should probably have one Dockerfile per module profile, although the PolicyEngine can influence that.
* Set up Jenkins jobs that do exactly what the Errata Tool processes do today: rpmgrill, covscan, rpmdiff, TPS/Insanity.  Ondrej Hudlicky is working on this.  Those jobs need to:
** Be triggered by appropriate message bus events (build complete, dist-git commit, etc.)
** Publish to the bus using the CI-Metrics format, driven by Jiri Canderle.
* We need to ingest data from the bus about those jobs, and store that in resultsdb.  The Factory 2.0 team will be working on that.
* We also need to write and stand up an accompanying ''waiverdb'' service, that allows overriding an immutable result in resultsdb.
** Should have an audit trail to track who waived and when.
** May need an approval workflow, i.e. a waiver requested by person A then approved or disapproved by person B (with comments about why).
** May need waivers to be related to a purpose somehow.  We may want to waive a result for an advisory, or for a cloud image, or for one product but not another.  Some research should go into thinking about how best to do this.  Referring to PDC’s product/release keys may be a good candidate here.
* The Errata Tool needs to be modified to refer to resultsdb’s stored results instead of its own.
* We can decommission Errata Tool’s scheduling and storage of QE-esque activities.


== Container image build ==
Note that, in Fedora the [https://bodhi.fedoraproject.org/ Bodhi Updates System] already works along these lines to gate updates on their resultsdb status.  A '''subset''' of testcases are declared as ''required''.  However, if a testcase is failing erroneously, a developer must change the requirements associated with the update to get it out the door.  This is silly.  Writing and deploying something like waiverdb will make that much more straightforward.


* Event sent by:
Note also that the [https://github.com/fedora-infra/fedimg fedimg] tool, used to upload newly composed images to AWS, currently has ''no gating'' in place at all.  It uploads everything.  While talking about how we actually want to introduce gating into its workflow, it was proposed that it should query the ''cloud-specific test executor'' called [https://apps.fedoraproject.org/autocloud/compose autocloud].  Our answer here should be ''no''. Autocloud should store its results in resultsdb, and fedimg should consult resultsdb to know if an image is good or not.  This insulates fedimg’s code from the details of autocloud and enables us to more flexibly change out QE methods and tools in the future.
** OSBS (new container image is built)
* Triggers:
** OSBS (rebuilds all the layered images based on this image)


This is out of scope for CoCo. OSBS takes care of rebuilding layered container images when one of the layers changes.
== For rebuild automation ==


== Base Image Builds ==
For Fedora Modularity, we know we need to [https://fedorapeople.org/groups/modularity/sprint-5-demo/sprint5demo-threebean.ogv build and deploy tools to automate rebuilds].  In order to avoid unnecessary rebuilds of Tier 2 and Tier 3 artifacts, we’ll want to first ensure that Tier 1 artifacts are “good”.  The rebuild tooling we design will need to:


* Event sent by:
* '''Refer to resultsdb to gather testcase results.'''  It should not query test-execution systems directly for the reasons mentioned above.
** New RPM is built that goes into the Docker base image
* '''Have configurable policy.'''  Resultsdb gives us access to ''all'' test results.  Do we block rebuilds if one test fails?  How do we introduce new experimental tests while not blocking the rebuild process?  A ''constrained subset'' of the total set of testcases should be used on a per-product/per-component basis to define the rebuild criteria: a policy.
* Triggers:
** A run of Pungi which produces a new base image and other “compose” artifacts.
* Mapping can be found in:
** PDC has a list of all rpms that went into the last compose for each release.


Ralph’s comment: We don’t want to do this *every* time.  It is too much.  Maybe only trigger new composes when further conditions are met.  Take this one on last after we’ve gained experience with the other triggers.
= Putting data in resultsdb =


= Deciding whether to rebuild or not =
== Generally speaking ==


Quality: CoCo should certainly not rebuild all the artifacts all the time. There might be good reasons not to, for example when the underlying artifact did not pass its tests. To decide whether it should rebuild the upstream artifacts, CoCo must query the PolicyEngine.
* Resultsdb receives new results by way of an HTTP POST.
* In Fedora, the [https://taskotron.fedoraproject.org/ Taskotron] system puts results directly into resultsdb.
* Internally, we’ll need a level of indirection due to the social coordination issue described above. Any QE process that wants to have its results stored in resultsdb (and therefore be considered in PnT DevOps rebuild and release processes) will need to publish to the unified message bus or the CI-bus using the CI-Metrics format, driven by Jiri Canderle.
* The Factory 2.0 team will write, deploy and maintain a service that listens for those messages, formats them appropriately, and stores them in resultsdb.


Iterative Enablement:  We also need a way to turn this on slowly for different content types - in particular for container rebuilds.  The system should allow a whitelist of patterns for container names that should be considered for this, so we can start out with only doing automated rebuilds of the sssd container.  Once we’re happy that it is working well, we can expand the whitelist, or remove it altogether to handle all containers.
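
One way to picture such a whitelist is as a list of shell-style patterns matched against container names. This is only a sketch under assumptions: the pattern syntax, the <code>WHITELIST</code> value and the <code>rebuild_allowed</code> function are illustrative and not part of any existing service.

```python
from fnmatch import fnmatchcase

# Hypothetical starting whitelist: only the sssd container is enabled.
WHITELIST = ["sssd*"]


def rebuild_allowed(container_name, patterns):
    """Return True if the container name matches any whitelisted pattern."""
    return any(fnmatchcase(container_name, pattern) for pattern in patterns)


print(rebuild_allowed("sssd", WHITELIST))   # → True
print(rebuild_allowed("httpd", WHITELIST))  # → False
```

In this sketch, expanding the rollout means appending patterns, and a single <code>"*"</code> pattern has the same effect as removing the whitelist.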
== What data on the bus? ==


Detecting Cycles: We need some way to detect cycles, to make sure we’re not in an infinite loop…
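
If the rebuild relationships are available as a graph, a depth-first search is one standard way to find such a cycle before scheduling anything. A minimal sketch, assuming a hypothetical mapping from each artifact to the artifacts whose rebuild it triggers:

```python
def find_cycle(triggers):
    """Return one cycle as a list of nodes (first node repeated at the end), or None."""
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []

    def visit(node):
        state[node] = IN_PROGRESS
        path.append(node)
        for nxt in triggers.get(node, []):
            nxt_state = state.get(nxt, UNVISITED)
            if nxt_state == IN_PROGRESS:
                # nxt is already on the current path: we have closed a loop.
                return path[path.index(nxt):] + [nxt]
            if nxt_state == UNVISITED:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        state[node] = DONE
        return None

    for node in list(triggers):
        if state.get(node, UNVISITED) == UNVISITED:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


print(find_cycle({"a": ["b"], "b": ["c"], "c": ["a"]}))
# → ['a', 'b', 'c', 'a']
```

The recursive form is fine for small graphs; a very deep dependency chain would call for an iterative version.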
* For our MVP, the target is to consume the CI-Metrics data feed coming out of Platform QE, but long-term we don’t want to be limited to just Platform.
* The ship-shift initiative out of CI-ops looks like a very promising source of information. They will publish events about Jenkins job completion to the bus, and a ship-shift worker will pick up that event and archive the job metadata and artifacts into elasticsearch and cold storage.
* Observe that the most expensive part of this project will be “herding the cats”, getting all of the owners of all of the Jenkins masters to start publishing events about their jobs.
* We want to drive the resultsdb-updater process using the ''same'' data feed produced for ship-shift, which means we will only have to solve that coordination problem once.  This further enables us to integrate CI activity across all of the engineering organizations, not just Platform.


Handling Failures: Does CoCo care whether builds that it kicks off pass or fail? If it does care, what actions are taken?
= TODO =
 
* Write up a description of how to translate TAP or xUnit into resultsdb’s expected format.
** We won’t expect any test runners to actually do this themselves.  The Factory 2.0 service that listens on the bus will do it for them. Still, it will be useful to write down here (the request comes from Ari).
** Tim linked to https://bitbucket.org/fedoraqa/resultsdb_api in a comment above, which is useful here.
* Write about handling results for manual tests.  It may make sense for the Errata Tool to gate on those (and show % progress when the gate is closed?)  This would take us closer to eliminating the manual handoff from QE to RCM in the release checklist.

Latest revision as of 11:40, 11 April 2017

What is a Focus Document?

The Factory 2.0 team produces a confusing number of documents. The first round was about the Problem Statements we were trying to solve. Let’s retroactively call them Problem Documents. The Focus Documents (like this one) focus on some system or some aspect of our solutions that cut across different problems. The content here doesn’t fit cleanly in one problem statement document, which is why we broke it out.

Background on ResultsDB

  • ResultsDB is a database for storing results. Unsurprising!
  • It is a passive system; it doesn’t actively do anything.
  • It has an HTTP REST interface. You POST new results to it and GET them out.
  • It was written by Josef Skladanka of Tim Flink’s Fedora QA team.
  • It was originally written as a part of the larger Taskotron system, but we’re using it independently here.

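Links

  • Live, production API in Fedora: https://taskotron.fedoraproject.org/resultsdb_api/api/v1.0/
  • API documentation: http://docs.resultsdb.apiary.io/
  • Source code: https://pagure.io/taskotron/resultsdb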

What problems can we solve with it?

In formal Factory 2.0 problem statement terms, this helps us solve the Serialization and Automation problems directly, and indirectly all of the problems that depend on those two.

Beyond that, let’s look at fragmentation. The goal of Central CI in Red Hat was to consolidate all of the fragmentation around various CI solutions. This was a success in terms of operational and capex costs -- instead of 100 different CI systems running on 100 different servers sitting under 100 different desks, we have one Central CI infrastructure backed by OpenStack serving on-demand Jenkins masters. Win. A side-effect of this has been that teams can request and configure their own Jenkins masters, without commonality in their configuration. While teams are incentivized to move to a common test execution tool (Jenkins), there’s no common way to organize jobs and results. While we reduced fragmentation at one level, it remains untouched at another. People speak of this as the problem of “the fourteen Jenkins masters” of Platform QE.

Beyond Jenkins, some PnT DevOps tools perform tasks that are QE-esque yet are not a part of the Central CI infrastructure. Notably, the Errata Tool directly runs jobs like covscan, rpmgrill, rpmdiff, and TPS/insanity that are unnecessarily tied to the “release checklist” phase of the workflow. They could benefit from the common infrastructure of Central CI.

One option could be to attempt to corral all of the various dev and QE groups into getting onto the same platform and configuring their jobs the same way. That’s a possibility, but there is a high cost to achieving that level of social coordination.

Instead, we intend to use resultsdb and a small number of messagebus hooks to insulate consuming services from the details of job execution.

Wait! Why not an ELK stack?

ELK is cool, and people are putting data in it anyways. Why bother standing up resultsdb if some or all of this data is going to be in ELK?

  • ELK has a schema that is very unopinionated. You can store anything in it. This is attractive, because there is a low barrier to entry for getting stuff in. When it comes time for scripts to query for results, we worry that we’ll encounter unforeseen costs as we have to handle innumerable undocumented variations in the data: heterogeneous data.
  • On the other hand, resultsdb is actually quite opinionated about its schema. You must fit the mold. This is good only as long as that schema remains simple.
  • We support teams in Red Hat populating ELK instances and using them. However, we want those teams to get that information on to the message bus first and use that feed to populate ELK. We can then consume the same event feed to populate resultsdb. Different storage tools for different purposes. (We can furthermore protect ourselves from future bit-rot in either storage tool if we rely on the bus for our feed abstraction.)

This can be summarized in the following mantra: “ELK is for humans. Resultsdb is for machines.”

Getting data out of resultsdb

Resultsdb, unsurprisingly, stores results. A result must be associated with a testcase, which is just a namespaced name (for example, general.rpmlint). It must also be associated with an item, which you can think of as the unique name of a build artifact produced by some RCM tool: the nevra of an rpm is a typical value for the item field, indicating that a particular result is associated with a particular rpm.

Generally

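Take a look at some examples of queries to the Fedora QA production instance of taskotron, to get an idea for what this thing can store:

  • A list of known testcases
    https://taskotron.fedoraproject.org/resultsdb_api//api/v1.0/testcases
  • Information on the dist.depcheck testcase
    https://taskotron.fedoraproject.org/resultsdb_api//api/v1.0/testcases/dist.depcheck
  • All known results for the dist.depcheck testcase
    https://taskotron.fedoraproject.org/resultsdb_api//api/v1.0/testcases/dist.depcheck/results
  • Only dist.depcheck results associated with builds
    https://taskotron.fedoraproject.org/resultsdb_api//api/v1.0/testcases/dist.depcheck/results?type=koji_build
  • All dist.rpmlint results associated with the python-gradunwarp-1.0.3-1.fc24 build
    https://taskotron.fedoraproject.org/resultsdb_api//api/v1.0/testcases/dist.rpmlint/results?item=python-gradunwarp-1.0.3-1.fc24
  • All results of any testcase associated with that same build
    https://taskotron.fedoraproject.org/resultsdb_api//api/v1.0/results?item=python-gradunwarp-1.0.3-1.fc24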
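
Query URLs for this API follow a regular pattern: results, optionally scoped to one testcase, filtered by fields such as item or type. The sketch below only assembles such URLs and does not contact the server. The <code>results_url</code> helper is hypothetical, and the base URL is the Fedora production API address given in the wiki source of this page, which may no longer be live.

```python
from urllib.parse import quote, urlencode

# Base URL of the Fedora production instance as given in this document.
BASE = "https://taskotron.fedoraproject.org/resultsdb_api/api/v1.0"


def results_url(testcase=None, **filters):
    """Build a results query URL, optionally scoped to one testcase."""
    if testcase:
        path = "/testcases/" + quote(testcase, safe="") + "/results"
    else:
        path = "/results"
    # Sort the filters so the generated URL is deterministic.
    query = urlencode(sorted(filters.items()))
    return BASE + path + ("?" + query if query else "")


print(results_url("dist.rpmlint", item="python-gradunwarp-1.0.3-1.fc24"))
print(results_url(item="python-gradunwarp-1.0.3-1.fc24"))
```

The first call asks for the rpmlint results of one build; the second asks for every result recorded against that build, whichever testcase produced it.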

For the release checklist

For the Errata Tool problems described in the introduction, we need to:

  • Set up Jenkins jobs that do exactly what the Errata Tool processes do today: rpmgrill, covscan, rpmdiff, TPS/Insanity. Ondrej Hudlicky is working on this. Those jobs need to:
    • Be triggered by appropriate message bus events (build complete, dist-git commit, etc.)
    • Publish to the bus using the CI-Metrics format, driven by Jiri Canderle.
  • We need to ingest data from the bus about those jobs, and store that in resultsdb. The Factory 2.0 team will be working on that.
  • We also need to write and stand up an accompanying waiverdb service, that allows overriding an immutable result in resultsdb.
    • Should have an audit trail to track who waived and when.
    • May need an approval workflow, i.e. a waiver requested by person A then approved or disapproved by person B (with comments about why).
    • May need waivers to be related to a purpose somehow. We may want to waive a result for an advisory, or for a cloud image, or for one product but not another. Some research should go into thinking about how best to do this. Referring to PDC’s product/release keys may be a good candidate here.
  • The Errata Tool needs to be modified to refer to resultsdb’s stored results instead of its own.
  • We can decommission Errata Tool’s scheduling and storage of QE-esque activities.
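
To make the waiver idea concrete, here is a purely illustrative sketch of how a consumer might combine an immutable result with waivers, including the two-person approval step mentioned above. Waiverdb does not exist yet, so every name here (<code>Result</code>, <code>Waiver</code>, <code>effective_outcome</code>, the field names and the outcome strings) is an assumption rather than a design.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Result:
    """An immutable stored result (hypothetical shape)."""
    id: int
    testcase: str
    item: str
    outcome: str  # illustrative values: "PASSED" or "FAILED"


@dataclass(frozen=True)
class Waiver:
    """A waiver against one result, with a minimal audit trail (hypothetical shape)."""
    result_id: int
    requested_by: str
    approved_by: Optional[str]  # None until a second person approves
    reason: str


def effective_outcome(result, waivers):
    """Return the outcome a gating tool should act on."""
    if result.outcome == "PASSED":
        return "PASSED"
    for waiver in waivers:
        if waiver.result_id == result.id and waiver.approved_by is not None:
            return "WAIVED"
    # The stored result itself is never modified.
    return result.outcome


failed = Result(1, "dist.rpmdiff", "foo-1.0-1.fc24", "FAILED")
pending = Waiver(1, "person-a", None, "known false positive")
approved = Waiver(1, "person-a", "person-b", "known false positive")
print(effective_outcome(failed, [pending]))   # → FAILED
print(effective_outcome(failed, [approved]))  # → WAIVED
```

Scoping a waiver to a purpose (an advisory, a product, a release) would add another field to match on; that part is left out because the document still lists it as open research.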

Note that, in Fedora the Bodhi Updates System already works along these lines to gate updates on their resultsdb status. A subset of testcases are declared as required. However, if a testcase is failing erroneously, a developer must change the requirements associated with the update to get it out the door. This is silly. Writing and deploying something like waiverdb will make that much more straightforward.

Note also that the fedimg tool, used to upload newly composed images to AWS, currently has no gating in place at all. It uploads everything. While talking about how we actually want to introduce gating into its workflow, it was proposed that it should query the cloud-specific test executor called autocloud. Our answer here should be no. Autocloud should store its results in resultsdb, and fedimg should consult resultsdb to know if an image is good or not. This insulates fedimg’s code from the details of autocloud and enables us to more flexibly change out QE methods and tools in the future.

For rebuild automation

For Fedora Modularity, we know we need to build and deploy tools to automate rebuilds. In order to avoid unnecessary rebuilds of Tier 2 and Tier 3 artifacts, we’ll want to first ensure that Tier 1 artifacts are “good”. The rebuild tooling we design will need to:

  • Refer to resultsdb to gather testcase results. It should not query test-execution systems directly for the reasons mentioned above.
  • Have configurable policy. Resultsdb gives us access to all test results. Do we block rebuilds if one test fails? How do we introduce new experimental tests while not blocking the rebuild process? A constrained subset of the total set of testcases should be used on a per-product/per-component basis to define the rebuild criteria: a policy.
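
A policy of this kind can be as small as a set of required testcase names per product or component. The sketch below assumes that shape; the <code>REQUIRED</code> set, the testcase name <code>dist.experimental</code>, the outcome strings and both function names are hypothetical.

```python
# Hypothetical policy: the constrained subset of testcases that gate a rebuild.
REQUIRED = {"dist.rpmlint", "dist.depcheck"}


def blocking_testcases(required, outcomes):
    """Return the required testcases that have not passed, sorted by name.

    `outcomes` maps testcase name -> latest outcome for the artifact.
    A required testcase with no recorded result also blocks.
    """
    return sorted(tc for tc in required if outcomes.get(tc) != "PASSED")


def may_rebuild(required, outcomes):
    """Return True when no required testcase blocks the rebuild."""
    return not blocking_testcases(required, outcomes)


latest = {
    "dist.rpmlint": "PASSED",
    "dist.depcheck": "PASSED",
    "dist.experimental": "FAILED",  # not required, so it does not block
}
print(may_rebuild(REQUIRED, latest))  # → True
```

Because only the required subset is consulted, a new experimental test can report failures into resultsdb without holding up rebuilds until someone adds it to the policy.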

Putting data in resultsdb

Generally speaking

  • Resultsdb receives new results by way of an HTTP POST.
  • In Fedora, the Taskotron system puts results directly into resultsdb.
  • Internally, we’ll need a level of indirection due to the social coordination issue described above. Any QE process that wants to have its results stored in resultsdb (and therefore be considered in PnT DevOps rebuild and release processes) will need to publish to the unified message bus or the CI-bus using the CI-Metrics format, driven by Jiri Canderle.
  • The Factory 2.0 team will write, deploy and maintain a service that listens for those messages, formats them appropriately, and stores them in resultsdb.
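
The last step, storing a result, boils down to an HTTP POST with a JSON body. The sketch below only constructs the request and never sends it. The host name, the <code>/results</code> path and the payload field names (<code>testcase</code>, <code>item</code>, <code>outcome</code>) are assumptions taken from the vocabulary of this document, not from the resultsdb API reference, which should be consulted for the real schema.

```python
import json
from urllib.request import Request


def build_result_request(base_url, testcase, item, outcome):
    """Build (but do not send) a POST request carrying one result."""
    # Field names are assumptions; check the API documentation for the real schema.
    payload = {"testcase": testcase, "item": item, "outcome": outcome}
    return Request(
        base_url.rstrip("/") + "/results",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_result_request(
    "https://resultsdb.example.com/api/v1.0",  # placeholder host
    "dist.rpmlint",
    "foo-1.0-1.fc24",
    "PASSED",
)
print(request.get_method(), request.full_url)
```

A listener service would build one such request per message received from the bus, after translating the message into whatever fields the API actually expects.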

What data on the bus?

  • For our MVP, the target is to consume the CI-Metrics data feed coming out of Platform QE, but long-term we don’t want to be limited to just Platform.
  • The ship-shift initiative out of CI-ops looks like a very promising source of information. They will publish events about Jenkins job completion to the bus, and a ship-shift worker will pick up that event and archive the job metadata and artifacts into elasticsearch and cold storage.
  • Observe that the most expensive part of this project will be “herding the cats”, getting all of the owners of all of the Jenkins masters to start publishing events about their jobs.
  • We want to drive the resultsdb-updater process using the same data feed produced for ship-shift, which means we will only have to solve that coordination problem once. This further enables us to integrate CI activity across all of the engineering organizations, not just Platform.

TODO

  • Write up a description of how to translate TAP or xUnit into resultsdb’s expected format.
    • We won’t expect any test runners to actually do this themselves. The Factory 2.0 service that listens on the bus will do it for them. Still, it will be useful to write down here (the request comes from Ari).
    • Tim linked to https://bitbucket.org/fedoraqa/resultsdb_api in a comment above, which is useful here.
  • Write about handling results for manual tests. It may make sense for the Errata Tool to gate on those (and show % progress when the gate is closed?) This would take us closer to eliminating the manual handoff from QE to RCM in the release checklist.