Proxy01 down issue
Description
Proxy01 went down, triggering a flood of nagios notifications. This also took koji down for internal users (internally, koji.fp.o is mapped only to proxy01).
When the issue presented itself
The first nagios notification arrived on February 18, 2017 at 09:15 UTC.
When it recovered (or got fixed)
09:20 UTC: rerouted internal koji traffic from proxy01 to proxy10 and disabled proxy01 in DNS to prevent user-facing issues.
09:40 UTC: after clearing the koji access logs, httpd on proxy01 was started again.
09:45 UTC: proxy01 was re-enabled in DNS.
Root cause
Koji access logs filled proxy01's drive to 100%. At the time it was unclear why logrotate had not rotated the 20170218 logs out to xz-compressed form; the result was several multi-GB access log files that filled the disk entirely.
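One way to keep this from recurring would be a small guard that forces an early logrotate run whenever an access log balloons or the disk runs low, instead of waiting for the daily cron job. Below is a minimal sketch of that idea; the log glob, thresholds, and logrotate config path are illustrative assumptions, not the actual proxy01 configuration.

    #!/usr/bin/env python3
    """Sketch of a log-growth guard: force an early rotation when an access
    log gets too large or /var/log is nearly full. Paths and thresholds are
    assumptions for illustration only."""

    import glob
    import os
    import shutil
    import subprocess

    LOG_GLOB = "/var/log/httpd/*access*.log"   # assumed location of the koji access logs
    MAX_LOG_BYTES = 2 * 1024**3                # force rotation once a single log passes 2 GiB
    MIN_FREE_RATIO = 0.10                      # or once less than 10% of the filesystem is free

    def needs_rotation() -> bool:
        # Trigger on any single oversized access log...
        oversized = any(os.path.getsize(p) > MAX_LOG_BYTES for p in glob.glob(LOG_GLOB))
        # ...or on the log filesystem running low on space overall.
        usage = shutil.disk_usage("/var/log")
        low_space = usage.free / usage.total < MIN_FREE_RATIO
        return oversized or low_space

    if __name__ == "__main__":
        if needs_rotation():
            # Run logrotate immediately rather than waiting for the daily cron job.
            subprocess.run(["logrotate", "--force", "/etc/logrotate.d/httpd"], check=True)

Run from cron every few minutes, something like this would have rotated and compressed the 20170218 logs well before the disk hit 100%.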
Service owners
Follow-up steps
- Figure out why logrotate didn't work. => It looks like it did work, but the files grew so quickly that logrotate never got a chance to rotate them before the disk filled.
- Figure out why the access logs were bigger than usual (the access logs on proxy01 are gone, so this will need to come from the hubs).
- Make internal traffic use both proxy01 and proxy10 for koji access (see the check sketched below).
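Once internal access is spread across both proxies, a quick resolution check can confirm the name no longer points at a single host. This is a sketch under the assumption that koji.fp.o expands to koji.fedoraproject.org from the internal resolvers; the expected address count is illustrative.

    #!/usr/bin/env python3
    """Sketch: verify the internal koji name resolves to more than one proxy.
    Hostname and expected count are assumptions for illustration."""

    import socket

    HOSTNAME = "koji.fedoraproject.org"  # assumed internal name behind koji.fp.o
    EXPECTED_MIN_ADDRESSES = 2           # proxy01 and proxy10

    def resolved_addresses(host: str) -> set:
        # Collect every distinct address the resolver returns for the name.
        return {info[4][0] for info in socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)}

    if __name__ == "__main__":
        addrs = resolved_addresses(HOSTNAME)
        print(f"{HOSTNAME} -> {sorted(addrs)}")
        if len(addrs) < EXPECTED_MIN_ADDRESSES:
            raise SystemExit("only one proxy behind the name; internal koji is still a single point of failure")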
Future ideas
- Make it easier to disable a proxy by only providing its name in the cmds/ files.