From Fedora Project Wiki

Latest revision as of 06:11, 17 July 2013

I may not have understood the yum method correctly, but as far as I can tell, the IPs are only unique per release version. I.e. if I install F12 on a system with a static IP, do updates, then wipe it and install F13 instead and do updates to that, my IP gets counted as unique in both the F12 and the F13 number. Thus the total across releases is not really the number of unique IPs.

Jef Spaleta estimated the total number of Fedora clients to be 16 million in May 2009. If the non-smolt-corrected number is now 22 million, Fedora has had a nearly 40% increase in installed base over the past year. That doesn't seem quite realistic. --mhuhtala 2010-06-08

The sum of IP addresses at the right of that chart could be misleading, and I've been considering dropping it. However, the total IP address count is unique across all releases. So the case you mention above would not be counted twice in that number. --pfrields 14:52, 8 June 2010 (UTC)
Then it seems that the Statistics/Commands page doesn't quite describe the whole method. It clearly counts unique IPs per release, and the total (22 million) number is exactly the sum of the per-release numbers. If the total sum is unique IPs only, are the addresses that downloaded F12 excluded from F13, and so on? If not, how can you just add the numbers together without taking overlap into account? --mhuhtala Jun 9 12:35:18 UTC 2010
It is not exactly the sum of the per-release numbers, which you can check for yourself by adding the figures shown in the chart -- they total well over 27,000,000 without counting Rawhide. An IP address that pulls updates from F12 and F13 is counted only once. --pfrields 01:52, 10 June 2010 (UTC)
Oops, my mistake, sorry. I did the sum and thought I got the 22 million number exactly, but on recount I get 27 million. I probably missed one release from the sum on the first attempt. --mhuhtala Jun 10 04:59:13 UTC 2010
On second thought, maybe a time-based number would give a better idea of the current active user base, e.g. the number of unique IPs that have fetched yum updates in the last 12 or 18 months regardless of release version. This would include even updates to EOLed releases (people booting up their Fedoras after a hiatus and getting updates to an EOLed system). --mhuhtala Jun 9 12:44:21 UTC 2010
That is one way of counting, certainly. It excludes people who install and do not regularly update at all, and has a few other drawbacks, but it's no more flawed than anything else I've seen. I'll try to pull these numbers when I get time to write some more scripts for them. --pfrields 01:52, 10 June 2010 (UTC)
One could easily add all the ISO downloader / network installer IPs over the same time period, regardless of release, to the set of yum updaters. Of course, that might still exclude someone who got the N-3 release and uses it but never updates or adds a package. --mhuhtala Jun 10 04:59:13 UTC 2010
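
The time-based count suggested above can be sketched roughly as follows. This is only an illustration, not the script actually used for the statistics: the `(ip, release, timestamp)` tuples, the function name `active_ips`, and the sample addresses are all hypothetical, since the real log format is not described in this discussion.

```python
from datetime import datetime, timedelta


def active_ips(update_hits, download_hits, now, window_days=365):
    """Return the distinct IPs seen inside the window, regardless of release.

    Each hit is a hypothetical (ip, release, timestamp) tuple.
    """
    cutoff = now - timedelta(days=window_days)
    recent = set()
    for ip, _release, seen in list(update_hits) + list(download_hits):
        if seen >= cutoff:
            recent.add(ip)  # a set counts an IP once, even across releases
    return recent


now = datetime(2010, 6, 10)
updates = [
    ("192.0.2.1", "F12", datetime(2010, 1, 5)),
    ("192.0.2.1", "F13", datetime(2010, 6, 1)),   # same IP, newer release
    ("192.0.2.2", "F10", datetime(2008, 12, 1)),  # outside the 12-month window
]
downloads = [("192.0.2.3", "F13", datetime(2010, 5, 26))]  # ISO / net install

print(len(active_ips(updates, downloads, now)))  # prints 2
```

Because the result is a set union rather than a sum of per-release counts, an address that updated both F12 and F13 is counted once, and ISO downloaders are simply merged into the same set.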

To reduce the error that dynamic IPs introduce when estimating the number of active installations, we could show the number of unique IPs for a month alongside the same number for a week. Supposing that most users active in a month log in at least weekly, those two numbers should be very close, apart from the dynamic IP error and the percentage of new users.

If a user logs in at least daily, updates daily, and gets a different IP every time with no IP reused within the month, we see 7 IPs for that user in a week and 31 in a month. For n total users (excluding NAT) and d dynamic IP users, we would have at most n + d*30 unique IPs in a month and n + d*6 in a week, so the difference is d*24. Once we find d, we subtract d*6 from the weekly figure to get the corrected n. If instead the user updates or changes IP once a week (IP recycling has a similar effect), we have n + d*4 and n + d, so the difference is d*3 and we subtract d from the weekly figure. For Ni unique weekly IPs and a monthly-minus-weekly difference Di, we would calculate Ni - (Di/4) in the first case and Ni - (Di/3) in the second. The real value is likely to lie between these two estimates.

The estimate could serve as a fairly accurate lower bound on the number of active installations. It would be best to take the week in the middle of the month, some months after the release. If most new users stay, we can add roughly 65% of the percentage of new users in that month to the corrected weekly estimate to make it more accurate (a large proportion of one-time users means adding somewhat more to the corrected estimate). --mihaiv 10:35, 21 September 2010 (UTC+2)
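
As a worked illustration of the two corrections described above, here is a minimal sketch. The function name and the sample figures are made up for the example, and the new-user adjustment is left out because no exact formula is given for it.

```python
def corrected_weekly_estimates(weekly_ips, monthly_ips):
    """Return (lower, upper) estimates of n from unique IP counts.

    weekly_ips  -- Ni, unique IPs seen in one week
    monthly_ips -- unique IPs seen in the month containing that week
    """
    diff = monthly_ips - weekly_ips  # Di
    if diff < 0:
        raise ValueError("monthly count cannot be smaller than weekly count")
    weekly_change = weekly_ips - diff / 3  # IP changes once a week: Ni - Di/3
    daily_change = weekly_ips - diff / 4   # IP changes daily: Ni - Di/4
    return weekly_change, daily_change


# Made-up figures: n = 1000 users, d = 100 of them on daily-changing IPs,
# so the week shows 1000 + 100*6 = 1600 IPs and the month 1000 + 100*30 = 4000.
print(corrected_weekly_estimates(1600, 4000))  # prints (800.0, 1000.0)
```

With these sample numbers the daily-change correction recovers n = 1000 exactly, because the data were built under that scenario, while the weekly-change correction gives the lower figure. If the monthly count is more than four times the weekly count, the lower estimate goes negative, which would suggest that neither scenario fits the data.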

Correcting for Dynamic IPs and NAT'd connections

While reading the details of how the numbers are gathered, I've come up with a rather simple method of counting unique installations that may be less prone to error from dynamic IPs and NAT'd connections. Instead of counting unique IP visits to the repository, we could count downloads of a core package, such as kernel updates. Once a new install is complete, most users will download any updates, and the most common should be the kernel package. This means that not only will we get a good idea of how many individual installs are behind NAT'd connections, but it will also effectively minimize any concern about inflation from dynamic IPs. Fedup upgrades and PXE/net.iso installs can also be registered this way.

The only caveats I can think of would be users behind a cached repository or caching proxy, offline DVD installations that will rarely, if ever, see an internet connection, and users who may need to redownload the kernel package for any reason. That being said, I would hazard a guess that only the first scenario would be common, though possibly less common than NAT, and there is no acceptable way to account for the second. --ACiDGRiM (talk) 06:11, 17 July 2013 (UTC)
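
A rough sketch of the package-download count proposed above, assuming the mirror logs can be reduced to a list of requested paths. The path layout, the regular expression, and the sample version strings are illustrative assumptions, not the actual repository or log format.

```python
import re
from collections import Counter

# Matches ".../kernel-<version>.rpm"; the digit after "kernel-" keeps
# subpackages such as kernel-devel or kernel-headers out of the count.
KERNEL_RPM = re.compile(r"/kernel-(\d[^/]*)\.rpm$")


def kernel_download_counts(paths):
    """Count download requests per kernel package version."""
    counts = Counter()
    for path in paths:
        match = KERNEL_RPM.search(path)
        if match:
            counts[match.group(1)] += 1
    return counts


# Hypothetical request paths; two machines behind one NAT still give two hits.
requests = [
    "/updates/19/x86_64/kernel-3.9.9-302.fc19.x86_64.rpm",
    "/updates/19/x86_64/kernel-3.9.9-302.fc19.x86_64.rpm",
    "/updates/19/x86_64/kernel-devel-3.9.9-302.fc19.x86_64.rpm",
    "/updates/19/x86_64/bash-4.2.45-1.fc19.x86_64.rpm",
]
print(kernel_download_counts(requests)["3.9.9-302.fc19.x86_64"])  # prints 2
```

Counting requests rather than addresses is what lets NAT'd installs show up individually. It also means a redownload of the same package counts twice, and installs served from a caching proxy do not appear at all, which matches the caveats listed above.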