From Fedora Project Wiki

Latest revision as of 12:17, 30 June 2019

This page was a draft proposal from a long time ago. This is kept for posterity, but users should not use the URLs in this page. Please use mirrorlist for all repo configurations.


Initial thoughts by Matt Domsch

  • Use Reduced Redundancy Storage. All the content can easily be replicated again if it is lost.
  • Use s3cmd sync to keep content in buckets in sync
  • see exclude list below
  • Use bucket policies to limit access to each region (if needed; not implemented yet)
  • Need list of IP addresses for each region to populate MirrorManager (MM). Would be nice if we could get that programmatically.
  • syncing from bapp01 at present because it has r/o access to the trees
  • bucket names of the form s3-mirror-<region>.fedoraproject.org allow for a CNAME from s3-mirror.fedoraproject.org to s3.amazonaws.com in our DNS
Region | Region Server | Bucket Name | CNAME
US Standard | s3-website-us-east-1.amazonaws.com | s3-mirror-us-east-1.fedoraproject.org | s3-mirror-us-east-1.fedoraproject.org CNAME s3-mirror-us-east-1.fedoraproject.org.s3-website-us-east-1.amazonaws.com
US West (Oregon) Region | s3-website-us-west-2.amazonaws.com | s3-mirror-us-west-2.fedoraproject.org |
US West (Northern California) Region | s3-website-us-west-1.amazonaws.com | s3-mirror-us-west-1.fedoraproject.org |
EU (Ireland) Region | s3-website-eu-west-1.amazonaws.com | |
Asia Pacific (Singapore) Region | s3-website-ap-southeast-1.amazonaws.com | |
Asia Pacific (Tokyo) Region | s3-website-ap-northeast-1.amazonaws.com | |
South America (Sao Paulo) Region | s3-website-sa-east-1.amazonaws.com | |
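
Only the US Standard row has a CNAME filled in. As an illustration only, the naming pattern in that row could be generated for the other regions roughly as below. The function name `website_cname` and the assumption that every region follows the same `s3-website-<region>.amazonaws.com` pattern are mine, not part of the original notes, and per the warning at the top these hostnames should not be used in repo configurations.

```python
def website_cname(region: str, domain: str = "fedoraproject.org") -> str:
    """Build a DNS CNAME line following the pattern of the US Standard row.

    Hypothetical helper: assumes each region uses the same naming scheme
    as the single filled-in row of the table.
    """
    bucket = f"s3-mirror-{region}.{domain}"
    # The bucket name is prepended to the region's website endpoint.
    target = f"{bucket}.s3-website-{region}.amazonaws.com"
    return f"{bucket} CNAME {target}"


if __name__ == "__main__":
    for region in ("us-east-1", "us-west-2", "us-west-1"):
        print(website_cname(region))
```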



Torrents:

  • if we upload ISOs, we get .torrent links "for free".
  • no tracker stats :-(
  • Can't group multiple files together into a single torrent
  • we're paying for outbound bandwidth
  • bucket policies keeping traffic in a single region means we need separate buckets for torrent content


Costs:

  • none for all uploads
  • none for intra-region requests
  • $0.093/GB/month for data, ~218GB = $20/month/region. 7 regions, but start in US East 1 only for now.
  • a good number of PUT and LIST requests due to mirroring several times a day. Could be another $20/month just for those.
  • no way to guess the number of GET requests. $40/month assumes 10M requests, while $30/month assumes 1M requests.

Total: ~$280/month, or $3360/yr
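
The arithmetic behind that total can be reproduced with a small sketch. It uses only the figures from the list above ($0.093/GB/month, ~218GB, roughly $20/month of PUT and LIST requests, 7 regions). The function and key names are made up for illustration, and GET requests are left out because the notes say their volume cannot be predicted.

```python
def estimate_costs(data_gb: float = 218.0,
                   storage_rate: float = 0.093,   # dollars per GB per month
                   request_cost: float = 20.0,    # guessed PUT/LIST dollars per month
                   regions: int = 7) -> dict:
    """Rough monthly and yearly cost estimate across all regions."""
    storage = data_gb * storage_rate        # storage cost for one region
    per_region = storage + request_cost     # storage plus request overhead
    monthly_total = per_region * regions
    return {
        "storage_per_region": storage,
        "per_region": per_region,
        "monthly_total": monthly_total,
        "yearly_total": monthly_total * 12,
    }


if __name__ == "__main__":
    for name, dollars in estimate_costs().items():
        print(f"{name}: ${dollars:.2f}")
```

Unrounded, this comes out slightly above the ~$280/month in the notes, which rounds each region to $40 before multiplying by 7.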

Open questions:

  • do we sync to one region, then COPY to others? If so, what tool? That'll cost $ for bandwidth.

Proposed Excludes:

source/
SRPMS/
debug/
beta/
ppc/
ppc64/
repoview/
Fedora/
Live/
isolinux/
images/
EFI/
drpms/
core/
extras/
LiveOS/
updates/8
updates/9
updates/10
updates/11
updates/12
updates/13
updates/14
updates/testing/8
updates/testing/9
updates/testing/10
updates/testing/11
updates/testing/12
updates/testing/13
updates/testing/14
releases/test/
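
A list like this only saves time if the excluded directories are never descended into. A minimal sketch of that idea in Python, pruning `os.walk()` in place, is shown here. It is not the s3cmd or report_mirror code: the function name `walk_with_excludes` and the matching rule (an entry matches a directory whose relative path equals it or ends with it) are assumptions made for illustration.

```python
import os


def walk_with_excludes(root, excluded_dirs):
    """Yield file paths under root without descending into excluded directories."""
    # Normalise entries such as "source/" and "updates/8" to "source", "updates/8".
    excluded = {entry.strip("/") for entry in excluded_dirs}
    for dirpath, dirnames, filenames in os.walk(root):
        rel_dir = os.path.relpath(dirpath, root).replace(os.sep, "/")
        kept = []
        for name in dirnames:
            rel = name if rel_dir == "." else f"{rel_dir}/{name}"
            # Assumed rule: match on the trailing path components.
            if any(rel == e or rel.endswith("/" + e) for e in excluded):
                continue
            kept.append(name)
        # Assigning to the slice edits the list os.walk() is iterating over,
        # so pruned directories are never visited.
        dirnames[:] = kept
        for filename in filenames:
            yield os.path.join(dirpath, filename)


if __name__ == "__main__":
    for path in walk_with_excludes(".", ["source/", "debug/", "updates/8"]):
        print(path)
```

The key design point is the in-place `dirnames[:] = kept` assignment. Filtering the results after the walk would still pay the cost of visiting every excluded directory.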


Problems Encountered

  • s3cmd sync processes excludes after walking the whole local directory tree with os.walk(). This means it recurses over .snapshot/ and all the directories we want to exclude, increasing processing time by 20x (>700k files vs ~35k files we'll actually upload). Matt has an upstream pull request to fix this. https://github.com/s3tools/s3cmd/pull/27
  • On subsequent syncs, got this error from the /pub/epel tree:
 ERROR: no element found: line 1, column 0
 ERROR: Parameter problem: Bucket contains invalid filenames. Please run: s3cmd fixbucket s3://your-bucket/
 
    • This is caused by file names that have plus characters in their name.
    • Upstream bug: https://github.com/s3tools/s3cmd/issues/28
    • this doesn't impact the ability to download files from S3 URLs which have plus signs in their file name
    • Using urlencoding_mode=fixbucket does not work. Yum will download the file w/o URL encoding (e.g. with plus chars in the file name), and S3 won't return that file because it doesn't exist.
  • The MD5 checks don't happen at all for files uploaded via multipart, which seems to affect larger files. This defeats the purpose of MD5 checking. But, we can't disable MD5 checking for all files, because repomd.xml often changes content but doesn't change file size. So, we need MD5 checking only for some files.
  • turns out we need MD5 checking for package re-signing too. I guess we're stuck with using MD5 checking.
    • We could put additional custom metadata (e.g. RPM package field data) into S3 if we wanted to. That would make s3cmd more tailored to our use case.
    • It does store mtime/ctime values in the metadata. Need to add code to check those.
  • upload of initial bucket for EPEL took real 651m57.042s (about 10.9 hours), /pub/fedora took real 892m52.286s (about 14.9 hours).
  • subsequent syncs failed because of the "no element found" error above, but took 12m and 21m respectively (w/o transferring any changes due to the error)
  • MirrorManager's report_mirror program needs to be run after the sync, because this will be a private mirror. But, it also blindly does os.walk(), without a concept of excludes. Solutions are to either make a private copy of the whole content (ugh!), or add --exclude-from=<file> handling to report_mirror. Matt did the latter.
  • no hardlinks. Total /pub/epel and /pub/fedora trees take ~77GB. Bug filed https://github.com/s3tools/s3cmd/issues/29
    • If we use S3->S3 copy within one region, we can get 17-22MB/sec out of the copy, instead of ~700KB/sec doing uploads or S3->S3 inter-region copying. So let's get hardlinks working!!
  • no softlinks. This could pose a problem for EPEL consumers, where the version strings look like '4AS' which we've pointed at '4'. May be able to work around it with MM redirects.
  • No --delete-after option: deletes occur before new content is uploaded, which leaves the repository in an inconsistent state during the upload. Matt has an upstream pull request to fix this: https://github.com/s3tools/s3cmd/pull/30
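
Two of the points above, comparing content by checksum rather than size alone (the repomd.xml case) and deleting only after uploads finish, can be sketched as a small planning step. This is not s3cmd's implementation and it does not talk to S3. `FileInfo` and `plan_sync` are hypothetical names, and the sketch assumes an MD5 is available for every remote object, which the notes say is not the case for multipart uploads.

```python
from typing import Dict, List, NamedTuple, Tuple


class FileInfo(NamedTuple):
    size: int
    md5: str


def plan_sync(local: Dict[str, FileInfo],
              remote: Dict[str, FileInfo]) -> List[Tuple[str, str]]:
    """Return (action, path) pairs with every upload ordered before any delete."""
    uploads = sorted(
        path for path, info in local.items()
        # Upload when the object is missing, or when size or checksum differ.
        # The checksum test catches files that change content but keep their size.
        if path not in remote
        or remote[path].size != info.size
        or remote[path].md5 != info.md5
    )
    deletes = sorted(path for path in remote if path not in local)
    # Uploads first, deletes last, so readers never see a tree with files missing.
    return [("upload", p) for p in uploads] + [("delete", p) for p in deletes]


if __name__ == "__main__":
    local = {"repodata/repomd.xml": FileInfo(10, "new"), "a.rpm": FileInfo(5, "x")}
    remote = {"repodata/repomd.xml": FileInfo(10, "old"), "a.rpm": FileInfo(5, "x"),
              "old.rpm": FileInfo(3, "z")}
    for action, path in plan_sync(local, remote):
        print(action, path)
```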