Proxy01 down issue
Description
Proxy01 went down, triggering a flood of nagios notifications. This also took koji down for internal users (internally, koji.fp.o is mapped only to proxy01).
When the issue presented itself
The first nagios notification arrived on February 18, 2017 at 09:15 UTC.
When it recovered (or got fixed)
09:20 UTC: rerouted internal koji traffic from proxy01 to proxy10 and disabled proxy01 in DNS to prevent user-facing issues.
09:40 UTC: after clearing the koji access logs, httpd on proxy01 was started again.
09:45 UTC: proxy01 was re-enabled in DNS.
Root cause
Koji access logs filled proxy01's drive to 100%. At the time it was unclear why logrotate had not rotated the 20170218 logs out to xz-compressed form; the result was several multi-GB access log files that filled the disk entirely.
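One way to keep this from recurring would be a small guard that forces an early logrotate run whenever an access log balloons or the disk runs low, instead of waiting for the daily cron job. Below is a minimal sketch of that idea; the log glob, thresholds, and logrotate config path are illustrative assumptions, not the actual proxy01 configuration.

    #!/usr/bin/env python3
    """Sketch of a log-growth guard: force an early rotation when an access
    log gets too large or /var/log is nearly full. Paths and thresholds are
    assumptions for illustration only."""

    import glob
    import os
    import shutil
    import subprocess

    LOG_GLOB = "/var/log/httpd/*access*.log"   # assumed location of the koji access logs
    MAX_LOG_BYTES = 2 * 1024**3                # force rotation once a single log passes 2 GiB
    MIN_FREE_RATIO = 0.10                      # or once less than 10% of the filesystem is free

    def needs_rotation() -> bool:
        # Trigger on any single oversized access log...
        oversized = any(os.path.getsize(p) > MAX_LOG_BYTES for p in glob.glob(LOG_GLOB))
        # ...or on the log filesystem running low on space overall.
        usage = shutil.disk_usage("/var/log")
        low_space = usage.free / usage.total < MIN_FREE_RATIO
        return oversized or low_space

    if __name__ == "__main__":
        if needs_rotation():
            # Run logrotate immediately rather than waiting for the daily cron job.
            subprocess.run(["logrotate", "--force", "/etc/logrotate.d/httpd"], check=True)

Run from cron every few minutes, something like this would have rotated and compressed the 20170218 logs well before the disk hit 100%.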
Service owners
Follow-up steps
- Figure out why logrotate didn't work. => It looks like it did work, but the files grew so quickly that logrotate never got a chance to rotate them before the disk filled.
- Figure out why the access logs were bigger than usual (the access logs on proxy01 are gone, so this will need to come from the hubs).
- Make internal traffic use both proxy01 and proxy10 for koji access (see the check sketched below).
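Once internal access is spread across both proxies, a quick resolution check can confirm the name no longer points at a single host. This is a sketch under the assumption that koji.fp.o expands to koji.fedoraproject.org from the internal resolvers; the expected address count is illustrative.

    #!/usr/bin/env python3
    """Sketch: verify the internal koji name resolves to more than one proxy.
    Hostname and expected count are assumptions for illustration."""

    import socket

    HOSTNAME = "koji.fedoraproject.org"  # assumed internal name behind koji.fp.o
    EXPECTED_MIN_ADDRESSES = 2           # proxy01 and proxy10

    def resolved_addresses(host: str) -> set:
        # Collect every distinct address the resolver returns for the name.
        return {info[4][0] for info in socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)}

    if __name__ == "__main__":
        addrs = resolved_addresses(HOSTNAME)
        print(f"{HOSTNAME} -> {sorted(addrs)}")
        if len(addrs) < EXPECTED_MIN_ADDRESSES:
            raise SystemExit("only one proxy behind the name; internal koji is still a single point of failure")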
Future ideas
- Make it easier to disable a proxy by only providing its name in the cmds/ files.