[...]
This change proposal will likely be compared to the Ubuntu spyware complaints from a decade ago, when Ubuntu desktop users' search queries were sent to Amazon by default. Let's not do that.
== Benefit to Fedora ==
[...]
This proposal will require substantial effort by Community Platform Engineering (CPE) to host the metrics server infrastructure.
* Release engineering: [https://pagure.io/releng/issues/11514 #11514]
* Policies and guidelines: New processes and guidelines are proposed above under the section "How will data collection be approved?"
[...]
Release Notes are not required for initial proposal. We need to write the release notes before change freeze.
== Feedback ==
This section summarizes feedback provided by users after the publication of this change proposal. It was written one week after the rest of this change proposal was published, except for the suggestion regarding reproducible builds, which was added on November 2, 2023.
=== Some users do not want any telemetry collected at all ===
Many Fedora users prefer that Fedora never collect any telemetry data from any users under any conditions. No modifications to the change proposal will satisfy these users.
=== Many users want to require opt-in rather than opt-out ===
A very large number of users complained that the toggle switch in gnome-initial-setup would default to the on position. This issue received approximately as much feedback as all other aspects of the change proposal combined. Users say the original design is a "dark pattern" to trick users into consenting to metrics collection by accident if they skip through the pages without reading. However, defaulting the toggle to the off position would undoubtedly result in few users consenting to data collection, resulting in a non-representative sample. The proposal owner suggests a compromise "suggested opt-in" design, where the UI encourages the user to opt in, but the user must explicitly make a decision to do so or not. This introduces friction into the first boot experience that we would normally prefer to avoid, and the proposal owner is not sure whether an adequate proportion of users will agree to data collection, but it ensures that all data collection would be fully opt-in without precluding the possibility of obtaining consent from a representative quantity of users. Several vocal users are still very unhappy that opting in would be the suggested action. The proposal owner does not believe it will be possible to reach a compromise acceptable to these users.
=== The "Benefit to Fedora" section of the change proposal is too short ===
The proposal's most concrete examples of benefit to Fedora from collecting particular metrics are difficult to find because they are located in the "What data might we collect?" section of the change proposal instead of the "Benefit to Fedora" section, which is short and insufficiently persuasive. It is difficult to show concrete benefit to Fedora without more examples of what metrics would be collected.
=== The proposal should specify how ongoing benefit to Fedora will be demonstrated ===
One user suggested we should publish occasional blog posts describing specific examples of how the collected data has been used to improve Fedora. Fedora Magazine would be a good place for this.
=== The proposal does not specify which metrics would be collected ===
The proposal provides a few examples of metrics that Fedora might collect in the future, but it does not propose collecting any particular metrics. Many users found the few example metrics insufficient to determine whether they were comfortable with allowing Fedora to collect data in general.
=== The proposal does not specify technical mechanisms that would provide anonymity ===
The proposal promises to collect only anonymous data, but there is no technical mechanism to enforce anonymity. At the point when the telemetry server initially receives the data, it is not yet anonymous because the server can see both the user's IP address and an entire batch of metrics. The privacy guarantees rely on the server ignoring the IP address and storing all the metrics separately into its database. These guarantees would fail if the server were compromised by an attacker, or if the server operator were malicious and installed code that does not match the azafea open source releases. Many users do not actually trust that Fedora will operate the server securely and as promised, and it is impossible to guarantee that any service will be secure even if best practices are followed.
Several users were concerned at the possibility that IP addresses would be taken from web server logs and correlated to data in the metrics database. The proposal owner believes the server can be configured such that web server logs are retained only for a very short period of time, but many users do not trust that Fedora will not change log retention policy in the future. Additionally, users familiar with Fedora infrastructure suggested that Fedora's proxy service would keep logs of all traffic, exacerbating this concern.
One way to reduce concerns would be for users to connect directly to a separate proxy server. The proxy server would connect to the real telemetry server, so neither the telemetry server nor Fedora's proxy service would see any user IP addresses. Metrics would be encrypted using a public key encryption scheme such that the proxy server cannot see any of the data that the user is submitting. Then both servers would need to be hacked in order to deanonymize users. However, most likely both services would be deployed on Fedora's OpenShift cluster if hosted by Fedora CPE, so in practice this might not provide much actual benefit unless the proxy server were to be controlled separately.
Several users suggested investigating use of differential privacy techniques to provide provable privacy guarantees. The proposal owner is unfamiliar with the topic but intends to investigate it.
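To give a rough idea of what users mean by differential privacy, here is a minimal sketch of randomized response, one of the simplest such techniques, applied to a single boolean metric. It is only an illustration under stated assumptions: the function names are hypothetical, the coin probabilities are arbitrary, and nothing here is part of the proposal or of the Endless OS metrics system.

```python
import random


def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Report a noisy answer so that no single report reveals the truth.

    With probability 1/2 the true value is reported; otherwise a fair
    coin decides the answer. (Hypothetical helper, for illustration.)
    """
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(reports: list) -> float:
    """Recover the aggregate rate from noisy reports.

    A true "yes" is reported as yes with probability 0.75 and a true
    "no" with probability 0.25, so observed = 0.25 + 0.5 * true_rate.
    """
    observed = sum(reports) / len(reports)
    return 2 * observed - 0.5
```

The server only ever sees the noisy answers, so an individual report is deniable, yet the aggregate rate can still be estimated when enough users participate. The cost is extra statistical noise, which is one reason the technique would need real investigation before being adopted.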
=== The proposal does not clarify that metrics will be collected separately and not correlated together ===
Each metric will be stored in the database separately. For example, say we were to keep track of two metrics: a boolean to indicate whether the user launched GNOME Builder today, and the model of the user's GPU. Fedora would know how many users launched GNOME Builder on a given day, and it would know how many users have particular GPUs, but there would be no way to know that a user with a particular GPU launched GNOME Builder, because the metrics are not stored together.
However, the proposal failed to include this information, leading users to worry that the server could collect enough information to fingerprint them. This is not possible when the server is operating as expected. (However, fingerprinting ''would'' be possible if the server were operating maliciously and not storing the data as designed.)
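The GNOME Builder and GPU example can be sketched as follows. This is a toy in-memory model, not the actual azafea schema; the class name, metric names, and GPU labels are all hypothetical. The point is only that a batch is split on arrival and each metric is tallied in its own table, so no stored record links two metrics from the same upload.

```python
from collections import Counter


class SeparateMetricStore:
    """Toy store that keeps one independent tally per metric."""

    def __init__(self):
        self._tables = {}

    def ingest(self, batch: dict) -> None:
        # The batch arrives as one upload, but each metric is counted
        # in its own table and the batch itself is never stored.
        for metric, value in batch.items():
            self._tables.setdefault(metric, Counter())[value] += 1

    def counts(self, metric: str) -> dict:
        # Per-metric aggregate; there is no query that joins metrics.
        return dict(self._tables.get(metric, {}))


store = SeparateMetricStore()
store.ingest({"launched_builder": True, "gpu_model": "GPU A"})
store.ingest({"launched_builder": False, "gpu_model": "GPU A"})
store.ingest({"launched_builder": True, "gpu_model": "GPU B"})
```

After these three uploads the store can say that two users launched Builder and that two users have "GPU A", but it cannot say which GPU the Builder users have. As the text notes, this only holds if the server really does discard the batch as designed.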
=== The proposal would keep the metrics database private and does not specify who would be allowed to access it ===
The proposal has been criticized by many users for failing to make the database public. Many users will not trust that the data is truly anonymous unless the entire database is publicly available to ensure the server truly does not contain any personal data.
The proposal owner has indicated willingness to open up the database, but is not convinced this is really a good idea. It may be possible to apply unknown statistical techniques to the database to guess that unconnected records were submitted by the same user, which, if successful, would anger users. Additionally, if we fail to exercise due care when defining how metrics will be collected, it is possible we may inadvertently collect a metric that contains sensitive personal data. Users would be very upset if such data were made public.
One suggestion is to start with only a guinea pig group of volunteer users. The database would be made public, then we would wait and see if anybody is able to deanonymize the data. If nobody is able to demonstrate that unconnected records can be correlated, and if nobody finds any private information in the database, then we would have increased confidence that it would be safe to make the entire database public.
Another suggestion is to periodically make public small samples of the database that have been reviewed by humans to ensure they do not inadvertently contain personal data.
=== The proposal provides insufficient user visibility into what data would be collected ===
The proposal says instructions will be provided so you can run your own metrics server to see exactly what data is being collected and verify that it is not creepy. However, most users do not have enough technical skill to set up a server, even if the instructions are relatively easy for technical users to follow. There should be easier ways to see what data is being collected. Many users proposed displaying this data in the OS itself, before asking users to consent to data collection. However, this would clutter the user interface significantly, and incorrectly imply that the data to be collected will not change. Instead, the proposal owner suggests developing an extra application that can be installed by users who wish to have a detailed view of what data would be collected. The OS would link to a wiki page containing detailed descriptions of each metric to be collected and showing examples of what they might look like.
=== Users have requested increased control over what data would be uploaded ===
Many users have suggested that, instead of an on/off switch, we provide a slider to allow adjusting the amount of data to be collected. Other users have requested fine-grained controls to enable or disable each particular metric. The proposal owner hopes that we will collect so little data that users who are willing to enable data collection will not feel the need to disable any particular metric, but acknowledges that it would be nice to have a mechanism to determine which metrics would be submitted.
=== The proposal references policies and procedures that do not exist yet ===
The proposal says metrics should be collected only in accordance with a metrics collection policy that does not exist yet. The proposal owner thought that leaving the policy initially undefined would be useful as it would allow the community to participate in its development. However, not having a proposed policy ready to share makes it harder to understand the possible scope of data collection. (Note this policy is required for legal purposes and will need to be approved by Fedora Legal.)
The proposal additionally says all metrics to be collected should be approved via an undefined community process. The proposal contains only a placeholder suggestion that metrics could be approved by FESCo. This section of the change proposal is ambiguously worded and has been misunderstood to allow a "blank check" for the proposal owners to collect any desired metric without oversight if the change proposal is approved by FESCo. The intended interpretation was that the change proposal would not itself authorize the collection of any metrics; no metrics would be collected until individually approved via some additional process developed by the community. The proposal owner thought that leaving the process undefined would allow the community to participate in defining how the process would work; however, the Fedora community was not actually interested in doing this despite efforts to solicit feedback on this topic. Much confusion could have been avoided had a process for approving collection of metrics been specified from the beginning. The proposal owner now suggests that a thread be posted on Discourse each time a new metric is desired, specifying precisely which data will be collected, including what the database schema and GVariant would look like. The community would have at least two weeks to provide feedback, then FESCo would vote on whether to approve the metric.
=== Collection of particular metrics should be limited to defined periods of time ===
The proposal does not include any limitations on how long particular metrics may be collected. A time limit on collection should be specified when requesting approval to collect a metric. Fedora should discontinue collection of the metric when the end time is reached. We might want to collect some metrics in perpetuity, but most metrics should only need to be collected for a short while to provide enough data to inform design decisions.
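One way such a time limit could be represented is as a collection window attached to each approved metric. The sketch below is an assumption-laden illustration: the `MetricApproval` type, its fields, and the idea that the client checks the window before recording are all hypothetical and not part of the proposal.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class MetricApproval:
    """Hypothetical record of an approved metric and its time window."""

    name: str
    start: date
    end: Optional[date]  # None models a metric approved in perpetuity

    def is_active(self, today: date) -> bool:
        # Collect only inside the approved window (both ends inclusive).
        if today < self.start:
            return False
        return self.end is None or today <= self.end


panel_visits = MetricApproval("panel_visits", date(2024, 1, 1), date(2024, 3, 31))
```

With this shape, `panel_visits.is_active(date(2024, 4, 1))` is false, so a client that checks the window before recording would stop collecting on its own once the end date passes, without needing a software update.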
=== Collection of metrics should be limited to small cohorts of users ===
The proposal envisions collecting a particular data point from 100% of users who consent to data collection. For most metrics, this is overkill. A small percentage of users should be adequate to provide useful data so long as the set of participating users is a representative population. We can significantly reduce load on the telemetry service and avoid the database growing huge and unwieldy by collecting most metrics from only a small, random subset of users.
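A minimal sketch of cohort sampling, assuming each consenting client decides locally and independently, per metric, whether it belongs to the cohort. The function name and the per-metric sampling rates are hypothetical; a real design would also need to decide how long a client keeps its choice.

```python
import random
from typing import Optional


def choose_cohort(metric_rates: dict, rng: Optional[random.Random] = None) -> set:
    """Pick which metrics this client will report.

    metric_rates maps a metric name to the fraction of consenting
    users that should report it (0.0 to 1.0). Each metric is decided
    by an independent random draw on the client.
    """
    if rng is None:
        rng = random.Random()
    return {name for name, rate in metric_rates.items() if rng.random() < rate}


# Example: every consenting user reports the locale, 5% report panel visits.
selected = choose_cohort({"locale": 1.0, "panel_visits": 0.05})
```

Because the draw happens on the client and uses no identifier, the server does not need to know anything about a user to keep the sample random.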
=== Enabling local collection when uploading is disabled is confusing ===
The proposal says that local metrics collection will be initially enabled while uploading will be initially disabled. This means metrics would be collected locally before the user consents, even though they will all be deleted and never uploaded to Fedora if the user does not consent. Users who upgrade from previous versions of Fedora or who install the eos- packages by mistake would have local collection enabled indefinitely, waiting for the user to consent to uploading the data. The change proposal devotes several paragraphs to explaining how this would work. This feature is unpopular and is only needed to collect metrics early in the initial setup process, which we probably won't need to do. It would be simpler to have local collection disabled by default, matching the upload setting.
=== The GDPR section of the proposal is confusing ===
Many users complained that the proposal owner was ignoring GDPR concerns by refusing to respond to posts that mention GDPR. Although it was probably wise for the proposal owner to refuse to debate a topic he is not qualified to discuss, the change proposal did not explain the reasons for this choice, leading some users to conclude the proposal was drafted with a cavalier attitude towards legal requirements. Additionally, the proposal did not directly mention that Fedora Legal had already reviewed and approved the plans for data collection. Several users complained that implementing the change proposal would be illegal in Europe, unaware that the plan had already been approved by Fedora's data protection attorney.
=== The proposal does not specify that the telemetry packages may be removed ===
Some users were concerned it would not be possible to uninstall the eos- packages. The proposal should clarify that this will be possible because the eos- packages will be only weak dependencies of other packages.
=== Clarify that the proposal does not apply to packages that have their own upstream telemetry ===
The change proposal is intended to apply only to data that is collected by Fedora. It has no impact on Fedora packages that have their own upstream telemetry, like Firefox. The proposal should clarify this.
=== Consider collaborating with a trusted third party ===
More users might consent to data collection if Fedora were to collaborate with a trusted digital privacy organization to audit and review the proposal. The proposal owner would like to hear from anyone interested in such a collaboration.
=== The telemetry packages should support reproducible builds ===
The telemetry packages should support [https://reproducible-builds.org/ reproducible builds] to ensure that the built package corresponds to the published source code for the package and has not been maliciously modified. Unlike Debian and some other distributions, Fedora has not historically invested in reproducible builds. The proposal owner is uncertain whether this is currently possible. [https://pagure.io/fedora-reproducible-builds/project/issues A new issue tracker exists], so at least some work seems to be underway.
Latest revision as of 14:39, 24 June 2024
Privacy-preserving Telemetry for Fedora Workstation
Summary
The Red Hat Display Systems Team (which develops the desktop) proposes to enable limited data collection of anonymous Fedora Workstation usage metrics.
Fedora is an open source community project, and nobody is interested in violating user privacy. We do not want to collect data about individual users. We want to collect only aggregate usage metrics that are actually needed to achieve specific Fedora improvement objectives, and no more. We understand that if we violate our users' trust, then we won't have many users left, so if metrics collection is approved, we will need to be very careful to roll this out in a way that respects our users at all times. (For example, we should not collect users' search queries, because that would be creepy.)
We believe an open source community can ethically collect limited aggregate data on how its software is used without involving big data companies or building creepy tracking profiles that are not in the best interests of users. Users will have the option to disable data upload before any data is sent for the first time. Our service will be operated by Fedora on Fedora infrastructure, and will not depend on Google Analytics or any other controversial third-party services. And in contrast to proprietary software operating systems, you can redirect the data collection to your own private metrics server instead of Fedora's to see precisely what data is being collected from you, because the server components are open source too.
Keep in mind this Fedora change proposal is just that: a proposal. It must undergo community review and must be approved by the community-elected Fedora Engineering Steering Committee (FESCo) before it can be implemented, just like any other Fedora change proposal. We welcome community participation and fully expect this proposal may need to be modified significantly depending on Fedora community feedback.
Owner
- Name: Michael Catanzaro
- Email: <mcatanzaro@redhat.com>
Current status
This change proposal was withdrawn and has been obsoleted by a newer version.
- Targeted release: Fedora Linux 40
- Last updated: 2024-06-24
- FESCo issue: <will be assigned by the Wrangler>
- Tracker bug: <will be assigned by the Wrangler>
- Release notes tracker: <will be assigned by the Wrangler>
Detailed Description
We intend to deploy the Endless OS metrics system. This blog post contains a description of how the system works. We do not plan to deploy the eos-phone-home component in Fedora.
How will data collection be approved?
The proposal owners feel it is essential to ensure the Fedora community has ultimate oversight over metrics collection. Community control is required to maintain user trust. If this change proposal is approved, then we'll need new policies and procedures to ensure community oversight over metrics collection and ensure Fedora users can be confident that our metrics collection does not violate their privacy.
We can say "we would never collect personally-identifiable data" and write software that really doesn't collect any such data, but this alone will never be enough to ensure user confidence. We will need a metrics collection policy that describes what sort of data may be collected by Fedora (anonymous, non-invasive), and what sort of data may not be collected. Such a policy does not exist currently. We will also want to ensure the Fedora community has ultimate control over which particular metrics are collected. One option is that each metric to be collected should be separately approved by FESCo. Collection of particular metrics in a particular data format is ultimately an engineering decision, and therefore FESCo seems like an appropriate approval point. Because FESCo members are elected regularly by the Fedora community, this also provides the community with ultimate control over metrics collection via the election process. But other oversight and approval structures would work too.
What data might we collect?
We are not proposing to collect any of these particular metrics just yet, because a process for Fedora community approval of metrics to be collected does not yet exist. That said, in the interests of maximum transparency, we wish to give you an idea of what sorts of metrics we might propose to collect in the future.
One of the main goals of metrics collection is to analyze whether Red Hat is achieving its goal to make Fedora Workstation the premier developer platform for cloud software development. Accordingly, we want to know things like which IDEs are most popular among our users, and which runtimes are used to create containers using Toolbx.
Metrics can also be used to inform user interface design decisions. For example, we want to collect the clickthrough rate of the recommended software banners in GNOME Software to assess which banners are actually useful to users. We also want to know how frequently panels in gnome-control-center are visited to determine which panels could be consolidated or removed, because there are other settings we want to add, but our usability research indicates that the current high quantity of settings panels already makes it difficult for users to find commonly-used settings.
Metrics can help us understand the hardware we should be optimizing Fedora for. For example, our boot performance on hard drives dropped drastically when systemd-readahead was removed. Ubuntu has maintained its own readahead implementation, but Fedora does not because we assume that not many users use Fedora on hard drives. It would be nice to collect a metric that indicates whether primary storage is a solid state drive or a hard disk, so we can see actual hard drive usage instead of guessing. We would also want to collect hardware information, such as laptop model ID, that would be useful for collaboration with hardware vendors such as Lenovo.
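As an illustration of how small such a metric could be, the sketch below reads the Linux kernel's `rotational` flag for a block device from sysfs and reduces it to one of three values. The function names are hypothetical, choosing which device counts as "primary storage" is left out, and the flag can be misleading for some virtual or unusual devices, so treat this as a sketch rather than the way the metric would actually be gathered.

```python
from pathlib import Path
from typing import Optional


def is_rotational(device: str, sysfs_root: str = "/sys/block") -> Optional[bool]:
    """Return True for a spinning disk, False for solid state, None if unknown."""
    flag = Path(sysfs_root) / device / "queue" / "rotational"
    try:
        value = flag.read_text().strip()
    except OSError:
        # Device missing or attribute unreadable.
        return None
    if value == "1":
        return True
    if value == "0":
        return False
    return None


def storage_kind(device: str, sysfs_root: str = "/sys/block") -> str:
    """Reduce the flag to the single coarse value that would be reported."""
    rotational = is_rotational(device, sysfs_root)
    if rotational is None:
        return "unknown"
    return "hdd" if rotational else "ssd"
```

A call such as `storage_kind("sda")` yields only "hdd", "ssd", or "unknown", which is enough to answer the readahead question without recording serial numbers or any other identifying detail about the drive.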
Other Fedora teams may have other metrics they wish to collect. For example, Fedora localization wishes to count users of particular locales to evaluate which locales are in poorer shape relative to their usage.
This is only a small sample of what we might want to know; no doubt other community members can think of many more interesting data points to collect. But note the purpose of all of the above metrics is to inform specific design decisions, not to build tracking profiles. We only need to collect data in aggregate, and have no need to associate the data we collect with particular users.
Metrics transparency
Transparency is required to provide confidence that Fedora metrics collection is not creepy or invasive. Since Fedora is open source, a developer can review the source code to verify exactly what it is doing and what data is being collected. But most Fedora users are not software developers, and few software developers have time or inclination to review the source code of the operating system to see what it is doing. To retain user trust, we need an easy way for users to understand exactly what data we are collecting. We propose to maintain a documentation page showing the current metrics database schema, so users can see exactly which fields are in the database and what example data looks like.
Experienced users may gain additional confidence by building and running their own metrics collection server; all of the components of the server (discussed below) are open source, and we will provide instructions for how to run a simple server yourself and view its metrics database. You can redirect metrics from Fedora's server to your own by changing a URL in a configuration file.
User control
A new metrics collection setting will be added to the privacy page in gnome-initial-setup and also to the privacy page in gnome-control-center. This setting will be a toggle that will enable or disable metrics collection for the entire system. We want to ensure that metrics are never submitted to Fedora without the user's knowledge and consent, so the underlying setting will be off by default in order to ensure metrics upload is not unexpectedly turned on when upgrading from an older version of Fedora. However, we also want to ensure that the data we collect is meaningful, so gnome-initial-setup will default to displaying the toggle as enabled, even though the underlying setting will initially be disabled. (The underlying setting will not actually be enabled until the user finishes the privacy page, to ensure users have the opportunity to disable the setting before any data is uploaded.) This is to ensure the system is opt-out, not opt-in. This is essential because we know that opt-in metrics are not very useful. Few users would opt in, and these users would not be representative of Fedora users as a whole. We are not interested in opt-in metrics.
To make this a little more confusing, metrics collection is actually separate from uploading. Collection is always initially enabled, while uploading is always initially disabled. The graphical toggle enables or disables both at the same time. That is, a newly-installed Fedora system will always collect metrics locally at first, but the collected metrics will be deleted and never submitted to Fedora if the user disables the metrics collection toggle on the privacy page. If the user leaves the toggle enabled, then the collected metrics may be submitted only after finishing the privacy page.
Metrics uploading will be opt-in for users who upgrade from previous versions of Fedora Workstation, because we don't yet have a mechanism to ask the user to consent to data collection after a system upgrade like we do for new installations, but metrics collection will be opt-out. That is, an upgraded system will collect metrics locally but will not submit them to Fedora unless the user opts in. If the user visits the privacy page in gnome-control-center, then both collection and uploading will be enabled or disabled according to the user's selection. Unlike gnome-initial-setup, the switch in gnome-control-center will default to off if the user has not seen the switch in gnome-initial-setup and has not previously selected a value for the setting.
This might sound complicated, but it is consistent. If the user has not yet made a decision whether to allow telemetry, we collect it locally so that it's ready to submit if the user approves telemetry in the future, but we never upload it. Once the user makes a decision, then we either upload it or delete it and stop collecting.
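The rule described above can be written as a small decision table. The following is only an illustrative sketch: the `Consent` enum and the `policy` function are hypothetical names invented for this example and are not part of eos-event-recorder-daemon or any GNOME component.

```python
from enum import Enum


class Consent(Enum):
    """Hypothetical model of the user's telemetry decision."""
    UNDECIDED = "undecided"  # user has not yet answered the toggle
    ACCEPTED = "accepted"    # user left or switched the toggle on
    DECLINED = "declined"    # user switched the toggle off


def policy(consent):
    """Return (collect_locally, upload, delete_stored) for a consent state."""
    if consent is Consent.UNDECIDED:
        # Collect so data is ready if the user later approves; never upload.
        return (True, False, False)
    if consent is Consent.ACCEPTED:
        return (True, True, False)
    # Declined: stop collecting and delete anything already stored.
    return (False, False, True)
```

In this model a freshly installed system and an upgraded system both start in the undecided state; they differ only in how the toggle is first presented to the user.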
GDPR
It is Fedora Legal's obligation to ensure our data collection complies with legal requirements in the jurisdictions in which Red Hat operates. This is not an obligation of the Fedora community, so there is no need to discuss GDPR rules on our mailing lists. The proposal owners will not respond to mailing list posts that discuss GDPR or similar legal obligations during this change proposal discussion. In short, let's keep discussion focused on what Fedora SHOULD or SHOULD NOT do, rather than what we MUST or MUST NOT do.
That said, Fedora Legal has determined that if we collect any personally-identifiable data, the entire metrics system must be opt-in. Since we are only interested in opt-out metrics due to the low value of opt-in metrics, we must accordingly never collect any personally-identifiable data. We must also not collect any data that could become personally-identifiable if combined with other data, which notably means IP addresses must not be stored. We only want to collect anonymous data anyway, but we need to be especially mindful of the possibility that combining two "anonymous" data points could result in the data no longer being anonymous.
Fedora data collection policy
Fedora Legal requires that we publish a Fedora data collection policy separate from the existing Fedora Privacy Policy, which is designed to address usage of Fedora websites. This is currently a work in progress that we're not quite ready to share yet. You can expect it to be very short and very generic.
Metrics server infrastructure
We propose to deploy Azafea, the open source metrics collection server used by Endless OS. An Azafea deployment consists of five components: an nginx proxy server, azafea-metrics-proxy, redis, azafea itself, and a Postgres database. nginx proxies HTTP requests to azafea-metrics-proxy, which is itself a simple HTTP server that adds metrics into the redis database, where they will be fetched by Azafea and stored into Postgres. We will provide instructions on how to set up your own server and see for yourself what data gets collected.
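The data flow through these components can be pictured with a toy in-memory model. This is a sketch only: the function names below are invented for illustration and are not Azafea APIs, a `deque` stands in for redis, and a list stands in for Postgres. That the client IP is discarded at the proxy step is an assumption of this model, chosen to match the stated goal of never storing IP addresses.

```python
from collections import deque

queue = deque()   # stands in for the redis queue
database = []     # stands in for the Postgres database


def proxy_receive(request_body, client_ip):
    """Model of azafea-metrics-proxy: enqueue the payload only.

    client_ip is deliberately not forwarded (assumption of this sketch).
    """
    queue.append(request_body)


def worker_drain():
    """Model of azafea: pull queued payloads and store them."""
    stored = 0
    while queue:
        database.append(queue.popleft())
        stored += 1
    return stored
```

The queue decouples the HTTP-facing proxy from the database writer, so a slow database does not block clients that are submitting metrics.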
Metrics client infrastructure
The client side consists of eos-metrics, eos-event-recorder-daemon, and eos-metrics-instrumentation. eos-metrics is a D-Bus interface that applications and services may use to record events, plus a GObject library that provides a simple API around the D-Bus interface. eos-event-recorder-daemon is the service that actually implements this interface: it collects incoming metrics, batches them together, and sends them to the metrics server at predefined intervals. eos-metrics-instrumentation is the component that actually collects specific metrics. Originally, we had planned to not use this component and instead write our own fedora-metrics-instrumentation that would collect only a few particular metrics that are approved via Fedora community process. However, currently we are planning to ship eos-metrics-instrumentation and instead ensure that it is not collecting more metrics than would be acceptable to the Fedora community. A review process to decide which metrics to collect and which metrics to disable will be required.
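The collect-then-batch behaviour of the recorder daemon can be sketched as follows. This is not the eos-event-recorder-daemon implementation: the `EventRecorder` class, its `max_batch` parameter, and the `send` callback are hypothetical. The real daemon sends at predefined intervals; here `flush()` stands in for whatever timer would trigger an upload, with a size limit added only to keep the example short.

```python
class EventRecorder:
    """Hypothetical batching recorder: buffer events, send them together."""

    def __init__(self, send, max_batch=3):
        self._send = send            # callable that uploads a list of events
        self._max_batch = max_batch  # flush early once this many are pending
        self._pending = []

    def record(self, event_id, payload=None):
        """Queue one event; flush if the batch is full."""
        self._pending.append((event_id, payload))
        if len(self._pending) >= self._max_batch:
            self.flush()

    def flush(self):
        """Send all pending events as one batch, if there are any."""
        if self._pending:
            self._send(list(self._pending))
            self._pending.clear()
```

For example, `EventRecorder(print, max_batch=2)` would print one list containing two events after the second call to `record`.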
Data set considerations
Although we assume the metrics server administrator is not malicious and will not actively attempt to deanonymize users, we will still take reasonable precautions to make it difficult to correlate metrics to a particular user, starting by not storing any IP address information in the metrics database. Additionally, each metric that we collect will be considered individual, non-correlatable data by default, unless approved to be correlated with particular other metrics via future Fedora community process. That is, if a user submits two data points, we usually don't want the ability to know that these data points were both submitted by the same user.
Each metric is stored in the database with a Unix timestamp indicating when it was generated on the client. If abused, this timestamp could allow correlation of data points that are collected at the same time as each other, or at a fixed time offset to other events. For example, if the system were designed to collect two metrics exactly 300 seconds after the system was booted, then just looking at the timestamps would be enough to determine that both metrics recorded at the same time were submitted by the same user. Accordingly, we should consider modifying the metrics server to reduce timestamp granularity at least somewhat.
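One simple way to reduce timestamp granularity is to round each timestamp down to the start of a fixed-size bucket before storing it. The sketch below assumes a one-hour bucket; the function name and the bucket size are illustrative choices, not something Azafea is known to provide.

```python
def coarsen_timestamp(unix_seconds, bucket_seconds=3600):
    """Round a Unix timestamp down to the start of its bucket."""
    return unix_seconds - (unix_seconds % bucket_seconds)
```

With this approach, two events recorded 300 seconds apart usually land in the same bucket as many other users' events, so an exact offset between them is no longer visible. Coarsening makes correlation harder but does not rule it out, particularly when few users submit data in a given bucket.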
History
Currently Fedora's only form of metrics collection is DNF Better Counting, but this only counts Fedora installations. That is useful, but we want to count more than just how many users we have.
Fedora's first metrics collection attempt was Smolt, a precursor to hw-probe which collected data on user hardware. The current proposal is different from Smolt because it will collect more than just hardware data, and also because Smolt collected only opt-in data. The current proposal would be opt-out, not opt-in.
This change proposal will likely be compared to the Ubuntu spyware complaints from a decade ago, when Ubuntu desktop users' search queries were sent to Amazon by default. Let's not do that.
Benefit to Fedora
The main benefit to Fedora is that we will be able to use collected metrics to inform design decisions. It is very common for developers to wish to know something about how Fedora software is used, and we will finally have a way to answer such questions.
Occasionally, Red Hat might need to collect specific metrics to justify additional time spent on contributing to Fedora or additional investment in Fedora.
Scope
- Proposal owners:
This change requires substantial technical and nontechnical work from the change owners. Most notably, we will need to package eos-metrics, eos-event-recorder-daemon, and eos-metrics-instrumentation properly for Fedora; they are currently packaged in a copr. We also still need to modify eos-metrics-instrumentation so that it does not send events not approved for use in Fedora, as we expect to collect less data than Endless OS.
- Other developers:
This proposal will require substantial effort by Community Platform Engineering (CPE) to host the metrics server infrastructure.
- Release engineering: #11514
- Policies and guidelines: New processes and guidelines are proposed above under the section "How will data collection be approved?"
- Trademark approval: N/A (not needed for this Change)
- Alignment with Objectives: This change does not align with any current Fedora Initiatives, which are very limited in scope. That said, one of the main purposes of metrics collection is to determine whether we are achieving other objectives not listed on the wiki page. For example, we want Fedora Workstation to become the premier developer workstation operating system. To that end, we want to know how many of our users are using particular IDEs.
Upgrade/compatibility impact
We would like to enable metrics upload for upgraded systems, but this isn't trivial because we want to obtain user consent before enabling metrics upload. This would require us to design a user interface that would run on upgraded systems and present the setting to users. We have not yet created such a user interface, so for now metrics upload will need to default to disabled for systems upgraded from older versions of Fedora. Since the underlying setting will be off by default, we don't need to do anything special to achieve this.
How To Test
The ultimate goal is to see metrics appear in the Postgres database of a metrics server, but configuring and running the server is not trivial. Accordingly, we propose to publish a separate document detailing how to set up and configure a metrics server for testing purposes, how to redirect metrics to the custom server, and how to force the client to immediately submit metrics to ease testing. Although we don't actually expect many community members to seriously run their own metrics servers, we still want to document the steps involved so that interested developers can see exactly how it works.
User Experience
A new metrics collection setting will be added to the privacy page in gnome-initial-setup and also to the privacy page in gnome-control-center. This setting will be a simple toggle that will enable or disable all metrics upload for the entire system. Users who do not want any metrics upload should feel confident that uploading can be disabled with a simple toggle.
Fedora users should be confident that Fedora metrics collection respects their privacy and collects only limited, anonymous usage data.
Dependencies
Any package that wishes to collect a metric would need to depend on eos-metrics. For example, if we were to collect statistics on which system settings panels are used most frequently, then the gnome-control-center package would need to depend on eos-metrics in order to send a metric to eos-event-recorder-daemon.
Contingency Plan
- Contingency mechanism: We would need to remove the eos-metrics, eos-event-recorder-daemon, and eos-metrics-instrumentation packages from the workstation-product comps group, and rebuild any packages that gained a dependency on eos-metrics.
- Contingency deadline: Beta freeze
- Blocks release? Yes, if the change is incomplete, it will need to be reverted before release.
Documentation
This feature will depend on several different upstream projects with varying amounts of documentation.
The client side consists of eos-metrics, eos-event-recorder-daemon, and eos-metrics-instrumentation. The best documentation of eos-metrics available online is its D-Bus interface XML. eos-metrics also contains normal API documentation that will be built and installed in a docs subpackage, but this is not currently available online. The eos-event-recorder-daemon and eos-metrics-instrumentation components do not appear to have any online documentation.
On the server end, the metrics server consists of azafea-metrics-proxy feeding metrics into redis, where they will be pulled by azafea and then added to a Postgres database. Documentation for azafea-metrics-proxy and azafea can be reviewed online. Events recognized by the server are documented here. Note that this documentation is currently focused on use by Endless OS rather than by Fedora, and includes documentation of many events that are no longer sent by Endless OS. This change proposal does not propose to enable sending any particular events in Fedora.
Release Notes
Release Notes are not required for initial proposal. We need to write the release notes before change freeze.
Feedback
This section summarizes feedback provided by users after the publication of this change proposal. It was written one week after the rest of this change proposal was published, except for the suggestion regarding reproducible builds, which was added on November 2, 2023.
Some users do not want any telemetry collected at all
Many Fedora users prefer that Fedora never collect any telemetry data from any users under any conditions. No modifications to the change proposal will satisfy these users.
Many users want to require opt-in rather than opt-out
A very large number of users complained that the toggle switch in gnome-initial-setup would default to the on position. This issue received approximately as much feedback as all other aspects of the change proposal combined. Users say the original design is a "dark pattern" to trick users into consenting to metrics collection by accident if they skip through the pages without reading. However, defaulting the toggle to the off position would undoubtedly result in few users consenting to data collection, resulting in a non-representative sample. The proposal owner suggests a compromise "suggested opt-in" design, where the UI encourages the user to opt in, but the user must explicitly make a decision to do so or not. This introduces friction into the first boot experience that we would normally prefer to avoid, and the proposal owner is not sure whether an adequate proportion of users will agree to data collection, but it ensures that all data collection would be fully opt-in without precluding the possibility of achieving consent from a representative quantity of users. Several vocal users are still very unhappy that opting in would be the suggested action. The proposal owner does not believe it will be possible to reach a compromise acceptable to these users.
The "Benefit to Fedora" section of the change proposal is too short
The proposal's most concrete examples of benefit to Fedora from collecting particular metrics are difficult to find because they are located in the "What data might we collect?" section of the change proposal instead of the "Benefit to Fedora" section, which is short and insufficiently-persuasive. It is difficult to show concrete benefit to Fedora without more examples of what metrics would be collected.
The proposal should specify how ongoing benefit to Fedora will be demonstrated
One user suggested we should publish occasional blog posts describing specific examples of how the collected data has been used to improve Fedora. Fedora Magazine would be a good place for this.
The proposal does not specify which metrics would be collected
The proposal provides a few examples of metrics that Fedora might collect in the future, but it does not propose collecting any particular metrics. Many users found the few example metrics insufficient to determine whether they were comfortable with allowing Fedora to collect data in general.
The proposal does not specify technical mechanisms that would provide anonymity
The proposal promises to collect only anonymous data, but there is no technical mechanism to enforce anonymity. At the point when the telemetry server initially receives the data, it is not yet anonymous because the server can see both the user's IP address and an entire batch of metrics. The privacy guarantees rely on the server ignoring the IP address and storing all the metrics separately in its database. These guarantees would fail if the server were compromised by an attacker, or if the server operator were malicious and installed code that does not match the azafea open source releases. Many users do not actually trust that Fedora will operate the server securely and as promised, and it is impossible to guarantee that any service will be secure even if best practices are followed.
Several users were concerned at the possibility that IP addresses would be taken from web server logs and correlated to data in the metrics database. The proposal owner believes the server can be configured such that web server logs are retained only for a very short period of time, but many users do not trust that Fedora will not change log retention policy in the future. Additionally, users familiar with Fedora infrastructure suggested that Fedora's proxy service would keep logs of all traffic, exacerbating this concern.
One way to reduce concerns would be for users to connect directly to a separate proxy server. The proxy server would connect to the real telemetry server, so neither the telemetry server nor Fedora's proxy service would see any user IP addresses. Metrics would be encrypted using a public key encryption scheme such that the proxy server cannot see any of the data that the user is submitting. Then both servers would need to be hacked in order to deanonymize users. However, most likely both services would be deployed on Fedora's OpenShift cluster if hosted by Fedora CPE, so in practice this might not provide much actual benefit unless the proxy server were to be controlled separately.
Several users suggested investigating use of differential privacy techniques to provide provable privacy guarantees. The proposal owner is unfamiliar with the topic but intends to investigate it.
Each metric will be stored in the database separately. For example, say we were to keep track of two metrics: a boolean to indicate whether the user launched GNOME Builder today, and the model of the user's GPU. Fedora would know how many users launched GNOME Builder on a given day, and it would know how many users have particular GPUs, but there would be no way to know that a user with a particular GPU launched GNOME Builder, because the metrics are not stored together.
However, the proposal failed to include this information, leading users to worry that the server could collect enough information to fingerprint them. This is not possible when the server is operating as expected. (However, fingerprinting would be possible if the server were operating maliciously and not storing the data as designed.)
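The GNOME Builder and GPU example above can be illustrated with a toy model in which each metric is tallied in its own store with no shared key. This is a sketch under stated assumptions: the `store_batch` function, the dictionary keys, and the GPU names are hypothetical, and simple counters stand in for the separate database tables a real server would use.

```python
from collections import Counter

builder_launched = Counter()  # stands in for one table: value -> count
gpu_model = Counter()         # stands in for a second, unrelated table


def store_batch(batch):
    """Split one uploaded batch into per-metric stores with no shared key."""
    builder_launched[batch["builder_launched"]] += 1
    gpu_model[batch["gpu_model"]] += 1
```

After two users submit `{"builder_launched": True, "gpu_model": "gpu-a"}` and `{"builder_launched": False, "gpu_model": "gpu-b"}`, the stores show that one user launched Builder and that each GPU appears once, but nothing records which GPU belongs to the user who launched Builder.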
The proposal would keep the metrics database private and does not specify who would be allowed to access it
The proposal has been criticized by many users for failing to make the database public. Many users will not trust that the data is truly anonymous unless the entire database is publicly available to ensure the server truly does not contain any personal data.
The proposal owner has indicated willingness to open up the database, but is not convinced this is really a good idea. It may be possible to apply unknown statistical techniques to the database to guess that unconnected records were submitted by the same user, which, if successful, would anger users. Additionally, if we fail to exercise due care when defining how metrics will be collected, it is possible we may inadvertently collect a metric that contains sensitive personal data. Users would be very upset if such data were made public.
One suggestion is to start with only a guinea pig group of volunteer users. The database will be made public, then we would wait and see if anybody is able to deanonymize the data. If nobody is able to demonstrate that unconnected records can be correlated, and if nobody finds any private information in the database, then we have increased confidence that it would be safe to make the entire database public.
Another suggestion is to periodically make public small samples of the database that have been reviewed by humans to ensure they do not inadvertently contain personal data.
The proposal provides insufficient user visibility into what data would be collected
The proposal says instructions will be provided so you can run your own metrics server to see exactly what data is being collected and verify that it is not creepy. However, most users do not have enough technical skill to set up a server, even if the instructions are relatively easy for technical users to follow. There should be easier ways to see what data is being collected. Many users proposed displaying this data in the OS itself, before asking users to consent to data collection. However, this would clutter the user interface significantly, and incorrectly imply that the data to be collected will not change. Instead, the proposal owner suggests developing an extra application that can be installed by users who wish to have a detailed view of what data would be collected. The OS would link to a wiki page containing detailed descriptions of each metric to be collected and showing examples of what they might look like.
Users have requested increased control over what data would be uploaded
Many users have suggested that, instead of an on/off switch, we provide a slider to allow adjusting the amount of data to be collected. Other users have requested fine-grained controls to enable or disable each particular metric. The proposal owner hopes that we will collect so little data that users who are willing to enable data collection will not feel the need to disable any particular metric, but acknowledges that it would be nice to have a mechanism to determine which metrics would be submitted.
The proposal references policies and procedures that do not exist yet
The proposal says metrics should be collected only in accordance with a metrics collection policy that does not exist yet. The proposal owner thought that leaving the policy initially undefined would be useful as it would allow the community to participate in its development. However, not having a proposed policy ready to share makes it harder to understand the possible scope of data collection. (Note this policy is required for legal purposes and will need to be approved by Fedora Legal.)
The proposal additionally says all metrics to be collected should be approved via an undefined community process. The proposal contains only a placeholder suggestion that metrics could be approved by FESCo. This section of the change proposal is ambiguously-worded and has been misunderstood to allow a "blank check" for the proposal owners to collect any desired metric without oversight if the change proposal is approved by FESCo. The intended interpretation was that the change proposal would not itself authorize the collection of any metrics; no metrics would be collected until individually approved via some additional process developed by the community. The proposal owner thought that leaving the process undefined would allow the community to participate in defining how the process would work; however, the Fedora community was not actually interested in doing this despite efforts to solicit feedback on this topic. Much confusion could have been avoided had a process for approving collection of metrics been specified from the beginning. The proposal owner now suggests that a thread be posted on Discourse each time a new metric is desired, specifying precisely which data will be collected including what the database schema and GVariant would look like. The community would have at least two weeks to provide feedback, then FESCo would vote on whether to approve the metric.
Collection of particular metrics should be limited to defined periods of time
The proposal does not include any limitations on how long particular metrics may be collected. A request for approval to collect a metric should specify how long the metric needs to be collected. Fedora should discontinue collection of the metric when the end time is reached. We might want to collect some metrics in perpetuity, but most metrics should only need to be collected for a short while to provide enough data to inform design decisions.
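A time limit of this kind could be enforced on the client by checking each metric against an approved collection window before recording it. The sketch below is purely illustrative: the `APPROVED` table, the metric name, and the dates are made up, and no such mechanism is known to exist in the eos- packages.

```python
from datetime import date

# Hypothetical approval table: metric name -> (first day, last day).
APPROVED = {
    "example-metric": (date(2024, 1, 1), date(2024, 3, 31)),
}


def collection_allowed(metric, today):
    """Return True only if the metric is approved and today is in its window."""
    window = APPROVED.get(metric)
    if window is None:
        return False  # metrics without an approval entry are never collected
    start, end = window
    return start <= today <= end
```

Treating an absent entry as "not allowed" means collection stops by default when an approval lapses or is removed.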
Collection of metrics should be limited to small cohorts of users
The proposal envisions collecting a particular data point from 100% of users who consent to data collection. For most metrics, this is overkill. A small percentage of users should be adequate to provide useful data so long as the set of participating users is a representative population. We can significantly reduce load on the telemetry service and avoid the database growing huge and unwieldy by collecting most metrics from only a small, random subset of users.
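One way to select such a subset without assigning users any identifier is for each client to make a local random draw against a per-metric sampling rate. The following is a minimal sketch; the `in_cohort` function and the idea of a per-metric `rate` are assumptions made for illustration, not part of the proposal or of the eos- packages.

```python
import random


def in_cohort(rate, rng=random):
    """Return True with probability `rate` (0.0 to 1.0)."""
    return rng.random() < rate
```

A client would call, for instance, `in_cohort(0.05)` once per metric to decide whether to take part, so roughly 5% of consenting users would report that metric. Because the draw happens locally, the server never needs to know which users were not selected.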
Enabling local collection when uploading is disabled is confusing
The proposal says that local metrics collection will be initially enabled while uploading will be initially disabled. This means metrics would be collected locally before the user consents, even though they will all be deleted and never uploaded to Fedora if the user does not consent. Users who upgrade from previous versions of Fedora or who install the eos- packages by mistake would have local collection enabled indefinitely, waiting for the user to consent to uploading the data. The change proposal devotes several paragraphs to explaining how this would work. This feature is unpopular and is only needed to collect metrics early in the initial setup process, which we probably won't need to do. It would be simpler to have local collection disabled by default, matching the upload setting.
The GDPR section of the proposal is confusing
Many users complained that the proposal owner was ignoring GDPR concerns by refusing to respond to posts that mention GDPR. Although it was probably wise for the proposal owner to refuse to debate a topic he is not qualified to discuss, the change proposal did not explain the reasons for this choice, leading some users to conclude the proposal was drafted with a cavalier attitude towards legal requirements. Additionally, the proposal did not directly mention that Fedora Legal had already reviewed and approved the plans for data collection. Several users complained that implementing the change proposal would be illegal in Europe, unaware that the plan had already been approved by Fedora's data protection attorney.
The proposal does not specify that the telemetry packages may be removed
Some users were concerned it would not be possible to uninstall the eos- packages. The proposal should clarify that this will be possible because the eos- packages will be only weak dependencies of other packages.
Clarify that the proposal does not apply to packages that have their own upstream telemetry
The change proposal is intended to apply only to data that is collected by Fedora. It has no impact on Fedora packages that have their own upstream telemetry, like Firefox. The proposal should clarify this.
Consider collaborating with a trusted third party
More users might consent to data collection if Fedora were to collaborate with a trusted digital privacy organization to audit and review the proposal. The proposal owner would like to hear from anyone interested in such a collaboration.
The telemetry packages should support reproducible builds
The telemetry packages should support reproducible builds to ensure that the built package corresponds to the published source code for the package and has not been maliciously modified. Unlike Debian and some other distributions, Fedora has not historically invested in reproducible builds. The proposal owner is uncertain whether this is currently possible. A new issue tracker exists, so at least some work seems to be underway.