Monitoring Best Practices
The following monitoring processes are considered best practices for reviewing and troubleshooting potential issues with Cumulus RMP environments. Several of the more common issues are also listed, along with potential solutions.
This document aims to provide two sets of outputs:
Metrics that can be polled from Cumulus RMP and used in trend analysis
Critical log messages that can be monitored for triggered alerts
Trend Analysis via Metrics
A metric is a quantifiable measure used to track and assess the status of a specific infrastructure component; it is a sample collected repeatedly over time. Examples of metrics include bytes on an interface, CPU utilization, and the total number of routes.
Metrics are most valuable when used for trend analysis.
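As a sketch of how a polled counter becomes a trend metric, the following computes a byte rate from two counter samples. The interface name and sample values are illustrative; on a switch you would poll the real counter instead of hard-coding it:

```shell
# Illustrative only: two polled samples of an interface byte counter,
# taken 30 seconds apart. On a Linux switch you would read a real
# counter such as /sys/class/net/swp1/statistics/rx_bytes.
sample1=1200000
sample2=1500000
interval=30

# The rate in bytes/sec is the counter delta divided by the polling interval.
echo $(( (sample2 - sample1) / interval ))   # → 10000
```

Storing such rates over time, rather than the raw counters, is what makes trend analysis possible.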
Alerting via Triggered Logging
Triggered issues are normally sent to syslog, but can go to another log file depending on the feature. On Cumulus RMP, rsyslog handles all logging, including local and remote logging. Logs are the best method to use for generating alerts when the system transitions from a stable state to an unstable state. Sending logs to a centralized collector, then creating alerts based on critical logs, is the optimal alerting solution.
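A minimal rsyslog forwarding rule can implement this. The file name and collector address below are placeholders, not Cumulus defaults:

```
# /etc/rsyslog.d/99-remote.conf (hypothetical file name)
# Forward all logs to a central collector; @@ sends over TCP, @ over UDP.
*.* @@192.0.2.50:514
```

Alert rules are then defined on the collector side, keeping the switch configuration simple.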
The smond process provides monitoring functionality for various switch hardware elements. Minimum and maximum values are output, depending on the flags applied to the basic command. The hardware elements and applicable commands/flags are listed in the table below:
Front Panel LED
Fan speed issues
Cumulus RMP includes a number of ways to monitor various aspects of system data. In addition, alerts are issued in high-risk situations.
CPU Idle Time
When a CPU reports five high-CPU alerts within a span of five minutes, an alert is logged.
Short High CPU Bursts
Short bursts of high CPU can occur during switchd churn or routing protocol startup. Do not set alerts for these short bursts.
Cumulus RMP 3.0 and later monitors CPU, memory and disk space via sysmonitor. The thresholds are configured in /etc/cumulus/sysmonitor.conf. More information is available via man sysmonitor.
|CPU measure|Threshold|
|Use|Alert: 90% Crit: 95%|
|Process load|Alarm: 95% Crit: 125%|
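The load figure that sysmonitor compares against can also be polled directly from the kernel for trend analysis. This is a generic Linux sketch, not a Cumulus-specific command:

```shell
# 1-minute load average: the first field of /proc/loadavg.
cut -d ' ' -f 1 /proc/loadavg
```

Graphing this value over time makes it easy to see whether a threshold alert was a spike or a sustained condition.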
In Cumulus RMP 2.5, CPU and memory warnings are generated via jdoo, and a log is created each time a unique threshold is crossed. The thresholds are configured in /etc/jdoo/jdoorc.d/cl-utilities.rc.
When memory utilization exceeds 90%, a warning is logged and a cl-support file is generated.
When monitoring disk utilization, tmpfs can be excluded from monitoring.
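For example, GNU df can exclude tmpfs filesystems when polling disk utilization:

```shell
# Report disk usage, skipping tmpfs mounts (-x excludes a filesystem type).
df -h -x tmpfs
```

This keeps ephemeral RAM-backed filesystems from skewing disk-capacity trends or triggering false alerts.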
In Cumulus RMP 3.0 and later, systemd is responsible for monitoring and restarting processes.
View processes monitored by systemd
Cumulus RMP 2.5.4 through 2.5 ESR uses a forked version of monit, called jdoo, to monitor processes. If a process fails, jdoo invokes init.d to restart it.
View processes monitored by jdoo
View process restarts
View current process state
Layer 1 Protocols and Interfaces
Link and port state interface transitions are logged to /var/log/syslog and /var/log/switchd.log.
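Link flaps can therefore be counted by scanning those logs. The log lines below are fabricated for illustration; the exact message format varies by release:

```shell
# Fabricated sample of link-state log lines; on a switch, grep
# /var/log/syslog or /var/log/switchd.log instead.
cat > /tmp/sample-syslog <<'EOF'
switchd[777]: swp1: link state changed to UP
switchd[777]: swp1: link state changed to DOWN
switchd[777]: swp1: link state changed to UP
EOF

# A high count of DOWN transitions in a short window suggests a flapping link.
grep -c 'link state changed to DOWN' /tmp/sample-syslog   # → 1
```

Alerting on the rate of DOWN transitions, rather than on every single transition, avoids noise from expected maintenance events.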
Interface counters are obtained by querying either the hardware or the Linux kernel. The two outputs should align, although the Linux kernel aggregates the output from the hardware.
Interface Counter Element
Layer 1 Logs
Link failure/Link flap
Prescriptive Topology Manager (PTM) compares LLDP information against a topology.dot file that describes the network. Because it has built-in alerting capabilities, it is preferable to run PTM on the box rather than poll LLDP information regularly. The PTM code is available in the Cumulus Networks GitHub repository, and additional PTM, BFD and associated logs are documented in the code.
Peering information should be tracked through PTM. For more information, refer to the Prescriptive Topology Manager documentation.
Prescriptive Topology Manager
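A topology.dot file is a Graphviz DOT description of the expected cabling. The hostnames and ports below are illustrative:

```
graph G {
    "leaf01":"swp49" -- "spine01":"swp1";
    "leaf02":"swp49" -- "spine01":"swp2";
}
```

PTM raises an alert whenever the LLDP-learned neighbors deviate from this description, so no external poller is needed.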
Layer 2 Protocols
Spanning tree is a protocol that prevents loops in a layer 2 infrastructure. In a stable network, the spanning tree protocol converges and remains stable. Monitoring Topology Change Notifications (TCNs) in STP helps identify when new BPDUs are received.
Interface Counter Element
STP TCN Transitions
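TCN activity can be tracked by counting topology-change entries in the logs. The mstpd-style lines below are fabricated; the real messages and log file depend on the release:

```shell
# Fabricated spanning-tree log lines for illustration only.
cat > /tmp/stp.log <<'EOF'
mstpd: bridge topology change detected, propagating TCN on swp3
mstpd: bridge topology change detected, propagating TCN on swp3
EOF

# A rising TCN count means BPDUs announcing topology changes are arriving.
grep -c 'topology change' /tmp/stp.log   # → 2
```

A sudden jump in this count is a good trigger for investigating which port the changes are arriving on.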
Layer 2 Logs
Spanning Tree Working
Spanning Tree Blocking
Layer 3 Protocols
Layer 3 Logs
Routing protocol process crash
The table below covers the various log files and what they should be used for:
|Log file|Use|
|/var/log/syslog|Catch-all log file. Identifies memory leaks and CPU spikes.|
|/var/log/switchd.log|Hardware Abstraction Layer (HAL) log.|
Protocols and Services
Run the following command to confirm the NTP process is working correctly, and that the switch clock is synced with NTP:
cumulus@switch:~$ /usr/bin/ntpq -p
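In the ntpq output, an asterisk in the first column marks the peer the clock is actually synchronized to. The sample output below is fabricated to show the check:

```shell
# Fabricated ntpq -p output; on the switch, pipe the real command instead.
cat > /tmp/ntpq.out <<'EOF'
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*198.51.100.10   .GPS.            1 u   33   64  377    0.512    0.024   0.011
 198.51.100.11   198.51.100.10    2 u   35   64  377    0.820   -0.113   0.030
EOF

# If no line starts with '*', the clock is not synced to any peer.
grep -c '^\*' /tmp/ntpq.out   # → 1
```

An alert when this count drops to zero catches switches that have silently lost NTP synchronization.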
Device Access Logs
User Authentication and Remote Login
Device Super User Command Logs
Super User Command Logs
Executing commands using sudo
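Commands executed via sudo are recorded with a COMMAND= field in the authentication log. The log entry below is fabricated to show how such entries can be extracted:

```shell
# Fabricated auth log entry; on the switch, grep /var/log/auth.log.
cat > /tmp/auth.log <<'EOF'
Jan  1 00:00:00 switch sudo:  cumulus : TTY=pts/0 ; PWD=/home/cumulus ; USER=root ; COMMAND=/sbin/ifreload -a
EOF

# List super-user commands executed via sudo.
grep -o 'COMMAND=.*' /tmp/auth.log   # → COMMAND=/sbin/ifreload -a
```

Forwarding these entries to the central collector provides an audit trail of privileged changes.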