| = Nagios: Standard Operating Procedure =
| | {{header|infra}} |
|
| |
|
| | | {{admon/important|All SOPs have been moved to the Fedora Infrastructure [https://pagure.io/infra-docs/ SOP git repository]. Please consult the [https://fedora-infra-docs.readthedocs.io/en/latest/sysadmin-guide/sops/index.html online documentation] for the current version of this document.}} |
| | |
| == Contact Information ==
| |
| Owner: Fedora Infrastructure Team
| |
| | |
| Contact: #fedora-admin, sysadmin-main & sysadmin-noc groups
| |
| | |
| Location: Anywhere
| |
| | |
| Servers: noc1, noc2, puppet1
| |
| | |
| Purpose: This SOP describes the Nagios configurations.
| |
| | |
| = Initial Configuration =
| |
| == CGI Access ==
| |
| To view information in Nagios (anything with cgi-bin in the path), you must first grant yourself access. After checking out the Puppet CVS tree as described in the [[Infrastructure/SOP/Puppet |Puppet SOP]], edit configs/system/nagios/cgi.cfg and append your FAS username to 'authorized_for_system_commands'.
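The edit to cgi.cfg can be sketched in Python. This is a minimal illustration, not part of the SOP tooling: it assumes the directive uses the usual Nagios <code>key=user1,user2</code> form, and the helper name <code>add_authorized_user</code> is hypothetical.

```python
def add_authorized_user(cfg_text, username,
                        directive="authorized_for_system_commands"):
    """Return cfg_text with username appended to the directive's user list.

    Assumes lines of the form 'directive=user1,user2'. Commented-out
    lines (starting with '#') are left untouched.
    """
    out = []
    found = False
    for line in cfg_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == directive:
            found = True
            users = [u.strip() for u in value.split(",") if u.strip()]
            if username not in users:  # keep the edit idempotent
                users.append(username)
            line = f"{directive}={','.join(users)}"
        out.append(line)
    if not found:
        # Directive missing entirely: add it with just this user.
        out.append(f"{directive}={username}")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    sample = "use_authentication=1\nauthorized_for_system_commands=admin\n"
    print(add_authorized_user(sample, "fasname"), end="")
```

Running the function twice with the same username leaves the file unchanged, which makes it safe to re-apply.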
| |
| == Contact Information ==
| |
| {{Admon/caution | You must configure a contacts file to be able to acknowledge [[Infrastructure/SOP/Outage |outages]]}}
| | |
| Create a new file named 'fasname.cfg' in configs/system/nagios/contacts/ with the following details:
| |
| <pre>
| |
| define contact{
| |
| contact_name fasname
| |
| alias Real Name
| |
| service_notification_period 24x7
| |
| host_notification_period 24x7
| |
| service_notification_options w,u,c,r
| |
| host_notification_options d,u,r
| |
| service_notification_commands notify-by-email
| |
| host_notification_commands host-notify-by-email
| |
| email Email address (any)
| |
| }
| |
| </pre>
| |
| {{Admon/warning | If you are a member of sysadmin-main, the 24x7 notification period may cause duplicate messages; in that case, specify 'never' instead.}}
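As a rough sketch, the contact file could be generated rather than typed by hand. The helper below is hypothetical and simply reproduces the fields from the template above; the <code>period</code> parameter covers the 'never' case mentioned in the warning, and the e-mail address in the usage example is a placeholder.

```python
def render_contact(fasname, real_name, email, period="24x7"):
    """Render a Nagios contact definition matching the template above.

    'period' is applied to both notification periods; pass 'never'
    to avoid duplicate messages, as the warning above describes.
    """
    fields = [
        ("contact_name", fasname),
        ("alias", real_name),
        ("service_notification_period", period),
        ("host_notification_period", period),
        ("service_notification_options", "w,u,c,r"),
        ("host_notification_options", "d,u,r"),
        ("service_notification_commands", "notify-by-email"),
        ("host_notification_commands", "host-notify-by-email"),
        ("email", email),
    ]
    # Left-align directive names in a fixed-width column for readability.
    body = "\n".join(f"    {name:<32}{value}" for name, value in fields)
    return "define contact{\n" + body + "\n}\n"


if __name__ == "__main__":
    # Placeholder values; substitute your own FAS details.
    print(render_contact("fasname", "Real Name", "user@example.com"), end="")
```

The output would be saved as fasname.cfg in the contacts directory named above.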
| |
| | |
| Next, append your contact name (your FAS username) to the 'members' section of configs/system/nagios/contactgroups/fedora-sysadmin-email.cfg.
| |
| | |
| == nagios-external ==
| |
| The same changes must also be applied to the nagios-external configuration (configs/system/nagios-external).
| |
| | |
| == Commit Changes ==
| |
| {{Admon/caution | Remember to "cvs add" the contacts/fasname.cfg files}}
| |
| | |
| Commit the changes by running <code>cvs commit -m "Adding fasname to Nagios"</code>, then mark them for distribution by running <code>make install</code>.
| |
| | |
| = Configuration =
| |
| == Instances ==
| |
| The Fedora Project runs two Nagios instances: [https://admin.fedoraproject.org/nagios nagios] (noc1) and [http://admin.fedoraproject.org/nagios-external nagios-external] (noc2). You must be in the 'sysadmin' group to access them.
| |
| | |
| == nagios (noc1) ==
| |
| The nagios configuration on noc1 should only monitor general host statistics - puppet status, uptime, apache status (up/down), SSH etc.
| |
| | |
| The configurations are found at <code>configs/system/nagios/</code> in the puppet tree.
| |
| | |
| == nagios-external (noc2) ==
| |
| The Nagios instance on noc2 is located outside of our main datacenter and should monitor our user websites/applications (fedoraproject.org, FAS, PackageDB, Bodhi/Updates).
| |
| | |
| The configurations are found at <code>configs/system/nagios-external/</code> in the puppet tree.
| |
| | |
| = Understanding the Messages =
| |
| == General ==
| |
| Nagios notifications are generally easy to read, and follow this consistent format:
| |
| <pre>
| |
| ** PROBLEM/ACKNOWLEDGEMENT/RECOVERY alert - hostname/Check is WARNING/CRITICAL/OK **
| |
| ** HOST DOWN/UP alert - hostname **
| |
| </pre>
| |
| The body of the message provides extra information on what is wrong.
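For filtering or scripting, the two subject formats above can be matched with regular expressions. This is a sketch under assumptions: it reads each slash-separated group in the format as "exactly one of these words", and the check name "Disk Space" in the usage example is hypothetical.

```python
import re

# Service alerts: "** PROBLEM alert - hostname/Check is CRITICAL **"
SERVICE_RE = re.compile(
    r"^\*\* (?P<type>PROBLEM|ACKNOWLEDGEMENT|RECOVERY) alert - "
    r"(?P<host>[^/\s]+)/(?P<check>.+) is (?P<state>WARNING|CRITICAL|OK) \*\*$"
)
# Host alerts: "** HOST DOWN alert - hostname **"
HOST_RE = re.compile(r"^\*\* HOST (?P<state>DOWN|UP) alert - (?P<host>\S+) \*\*$")


def parse_subject(subject):
    """Return a dict describing the alert, or None if the format is unknown."""
    subject = subject.strip()
    match = SERVICE_RE.match(subject)
    if match:
        return {"kind": "service", **match.groupdict()}
    match = HOST_RE.match(subject)
    if match:
        return {"kind": "host", **match.groupdict()}
    return None


if __name__ == "__main__":
    print(parse_subject("** PROBLEM alert - noc1/Disk Space is CRITICAL **"))
    print(parse_subject("** HOST DOWN alert - noc2 **"))
```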
| |
| | |
| == Disk Space Warning/Critical ==
| |
| Disk space warnings normally include the following information:
| |
| <pre>
| |
| DISK WARNING/CRITICAL/OK - free space: mountpoint freespace(MB) (freespace(%) inode=freeinodes(%)):
| |
| </pre>
| |
| | |
| A message stating "(1% inode=99%)" means that free disk space, '''not''' inode usage, is critical: only 1% of the space is free while 99% of the inodes are free. This is a sign that more disk space is required.
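The distinction between free space and free inodes can be made explicit by parsing the message. The sketch below assumes the free space is printed as an integer followed by "MB", as in the format above; the function names and the 10% threshold are hypothetical and unrelated to the thresholds Nagios itself is configured with.

```python
import re

DISK_RE = re.compile(
    r"DISK (?P<state>WARNING|CRITICAL|OK) - free space: "
    r"(?P<mount>\S+) (?P<free_mb>\d+) MB "
    r"\((?P<free_pct>\d+)% inode=(?P<inode_free_pct>\d+)%\)"
)


def parse_disk_message(message):
    """Extract state, mount point and free space/inode figures, or None."""
    match = DISK_RE.search(message)
    if match is None:
        return None
    return {
        "state": match["state"],
        "mount": match["mount"],
        "free_mb": int(match["free_mb"]),
        "free_pct": int(match["free_pct"]),
        "inode_free_pct": int(match["inode_free_pct"]),
    }


def scarce_resources(info, threshold_pct=10):
    """List which resources have less than threshold_pct free (hypothetical cutoff)."""
    low = []
    if info["free_pct"] < threshold_pct:
        low.append("disk space")
    if info["inode_free_pct"] < threshold_pct:
        low.append("inodes")
    return low


if __name__ == "__main__":
    info = parse_disk_message("DISK CRITICAL - free space: / 120 MB (1% inode=99%):")
    print(info, scarce_resources(info))
```

Both percentages report what is free, so a low first number points at disk space and a low second number points at inodes.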
| |
| | |
| = Further Reading =
| |
| * [[Infrastructure/SOP/Puppet |Puppet SOP]]
| |
| * [[Infrastructure/SOP/Outage |Outages SOP]]
| |
|
| |
|
| [[Category:Infrastructure SOPs]]