Monitoring packet buffers and their utilization is vital for proper traffic management on a network. It is quite useful for:
- Identifying microbursts that result in longer packet latency
- Giving early warning signs of packet buffer congestion that could lead to packet drops
- Quickly identifying a network problem with a particular switch, port or traffic class
You can use buffer utilization monitoring to quickly filter out non-problematic switches so you can focus on the ones causing trouble on the network.
Monitoring involves a set of configurable triggers that, when fired, can lead to any or all of the following three actions:
- Log actions, which involve writing to
- Snapshot actions, which involve writing to a file detailing the current state
- Collect actions, where the switch can collect more information
The monitoring is managed by the asic-monitor service, which is in turn managed by
Buffer monitoring is supported on Mellanox switches only.
The Mellanox Spectrum ASIC provides a mechanism to measure and report egress queue lengths in histograms. You can configure the ASIC to measure up to 64 egress queues. Each queue is reported through a histogram with 10 bins, where each bin represents a range of queue lengths.
You configure the histogram with a minimum size boundary (Min) and a histogram size, using the monitor.histogram_pg.histogram.minimum_bytes_boundary and monitor.histogram_pg.histogram.bin_size_bytes settings, which are described in the table below.
You then derive the maximum size boundary (Max) by adding the Min and the histogram size.
The 10 bins are numbered 0 through 9. Bin 0 represents queue lengths up to the Min specified, including queue length 0.
Bin 9 represents queue lengths of Max and above.
Bins 1 through 8 represent equal-sized ranges between the Min and Max. The size of each range is determined by dividing the histogram size by 8.
For example, consider the following histogram queue length ranges, in bytes:
- Min = 960
- Histogram size = 12288
- Max = 13248
- Range size = 1536
- Bin 0: 0:959
- Bin 1: 960:2495
- Bin 2: 2496:4031
- Bin 3: 4032:5567
- Bin 4: 5568:7103
- Bin 5: 7104:8639
- Bin 6: 8640:10175
- Bin 7: 10176:11711
- Bin 8: 11712:13247
- Bin 9: 13248:*
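The bin arithmetic above can be reproduced with a short script. This is a minimal sketch in Python that only restates the rules given in this section; the function names histogram_bins and bin_index are hypothetical and are not part of asic-monitor.

```python
from typing import List, Optional, Tuple


def histogram_bins(min_boundary: int, histogram_size: int) -> List[Tuple[int, Optional[int]]]:
    """Return the (low, high) byte range of bins 0 through 9.

    The high value of bin 9 is None because that bin is open-ended.
    """
    range_size = histogram_size // 8          # bins 1-8 share the histogram size equally
    max_boundary = min_boundary + histogram_size
    bins: List[Tuple[int, Optional[int]]] = [(0, min_boundary - 1)]  # bin 0: below Min
    for i in range(8):
        low = min_boundary + i * range_size
        bins.append((low, low + range_size - 1))
    bins.append((max_boundary, None))         # bin 9: Max and above
    return bins


def bin_index(queue_bytes: int, min_boundary: int, histogram_size: int) -> int:
    """Return the bin number (0-9) that a given queue length falls into."""
    if queue_bytes < min_boundary:
        return 0
    if queue_bytes >= min_boundary + histogram_size:
        return 9
    return 1 + (queue_bytes - min_boundary) // (histogram_size // 8)


if __name__ == "__main__":
    for number, (low, high) in enumerate(histogram_bins(960, 12288)):
        print(f"Bin {number}: {low}:{'*' if high is None else high}")
```

Run with Min = 960 and a histogram size of 12288, the script prints the same ten ranges listed above.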
When using the snapshot action, all of this information is captured in the file specified by the monitor.histogram_pg.snapshot.file setting.
Configuring Buffer Monitoring
The asic-monitor tool has a number of settings you need to configure before you can start monitoring. They are described below.
monitor.port_group_list: A user-defined list of all the port groups in the monitor file. The configuration file contains the following port group names as examples:
You must specify at least one port group. If the port group list is empty, then
monitor.histogram_pg.port_set: The range of ports for which histograms are configured. This setting can take GLOBs and comma-separated lists, like swp1-swp4,swp8,swp10-swp50.
monitor.histogram_pg.stat_type: Each port group monitors one kind of hardware state, in this case, a histogram.
monitor.histogram_pg.cos_list: Each CoS (Class of Service) value in the list has its own histogram on each port.
monitor.histogram_pg.trigger_type: The type of trigger that can initiate state collection. The only valid option is timer. This setting is optional.
If no port group has its trigger type set to timer, the
monitor.histogram_pg.timer: The frequency at which the histogram triggers; for example, a setting of 1s indicates it executes once per second.
The timer can be set to:
monitor.histogram_pg.action_list: State collection is initiated when triggered by the
monitor.histogram_pg.snapshot.file: The prefix file name for the snapshot file. All snapshots use this name, with a sequential number appended to it. For example, /var/lib/cumulus/histogram_stats_0.
monitor.histogram_pg.snapshot.file_count: The number of snapshots that can be created before the first snapshot file is overwritten. While more snapshots can provide you with more data, they can occupy a lot of disk space on the switch. See Caveats and Errata below.
monitor.histogram_pg.histogram.minimum_bytes_boundary: The minimum boundary size for the histogram in bytes. On a Mellanox switch, this number must be a multiple of 96.
Adding this number to the size of the histogram produces the maximum boundary size.
monitor.histogram_pg.histogram.bin_size_bytes: The size of the histogram in bytes.
Adding this number and the minimum_bytes_boundary value together produces the maximum boundary size.
monitor.histogram_pg.histogram.sample_time_ns: The sampling time of the histogram in nanoseconds.
monitor.histogram_pg.log.queue_bytes: The length of the queue in bytes before the log action writes a message to
monitor.histogram_pg.collect.queue_bytes: During state collection, when this queue length (measured in bytes) is reached, the collect action initiates another state collection.
monitor.histogram_pg.collect.port_group_list: The port groups that get triggered by the histogram_pg collect action.
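To make the port range syntax concrete, the following Python sketch expands a value such as swp1-swp4,swp8,swp10-swp50 into individual port names. It is only an illustration: the function expand_port_set is hypothetical, it handles ranges and comma-separated items but not GLOB patterns, and it does not claim to match how asic-monitor itself parses the setting.

```python
import re
from typing import List


def expand_port_set(port_set: str) -> List[str]:
    """Expand 'swp1-swp4,swp8' style values into a flat list of port names.

    Assumption: a range is written as <prefix><first>-<prefix><last>,
    with the same prefix on both sides. GLOBs are passed through unchanged.
    """
    ports: List[str] = []
    for item in port_set.split(","):
        item = item.strip()
        match = re.fullmatch(r"([A-Za-z]+)(\d+)-\1(\d+)", item)
        if match:
            prefix = match.group(1)
            first, last = int(match.group(2)), int(match.group(3))
            ports.extend(f"{prefix}{n}" for n in range(first, last + 1))
        elif item:
            ports.append(item)  # single port or an unexpanded pattern
    return ports


if __name__ == "__main__":
    print(expand_port_set("swp1-swp4,swp8"))
```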
The configuration is stored in the
file. You edit the settings in the file directly with a text editor.
There is no default configuration. Here is a sample configuration:
cumulus@switch:~$ cat /etc/cumulus/datapath/monitor.conf
monitor.port_group_list = [discards_pg,histogram_pg,all_packet_pg,buffers_pg]
# The queue length histograms are collected every second
# and the results are written to a snapshot file.
# Sixty-four snapshot files will be maintained.
monitor.histogram_pg.port_set = swp1-swp50
monitor.histogram_pg.stat_type = histogram
monitor.histogram_pg.cos_list =
monitor.histogram_pg.trigger_type = timer
monitor.histogram_pg.timer = 1s
monitor.histogram_pg.action_list = [snapshot,collect,log]
monitor.histogram_pg.snapshot.file = /var/lib/cumulus/histogram_stats
monitor.histogram_pg.snapshot.file_count = 64
monitor.histogram_pg.histogram.minimum_bytes_boundary = 1024
monitor.histogram_pg.histogram.bin_size_bytes = 1024
monitor.histogram_pg.histogram.sample_time_ns = 1024
monitor.histogram_pg.log.queue_bytes = 500
monitor.histogram_pg.collect.queue_bytes = 500
monitor.histogram_pg.collect.port_group_list = [buffers_pg,all_packet_pg]
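Before restarting the service, it can help to sanity-check a configuration against the rules stated in this section. The Python sketch below parses key = value lines and checks two of those rules: the port group list must not be empty, and the minimum boundary must be a multiple of 96 on a Mellanox switch. The parsing approach and the function names parse_monitor_conf, parse_list and check_settings are assumptions made for illustration, not the behavior of asic-monitor.

```python
from typing import Dict, List


def parse_monitor_conf(text: str) -> Dict[str, str]:
    """Parse 'key = value' lines, ignoring blank lines and # comments."""
    settings: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def parse_list(value: str) -> List[str]:
    """Turn '[a,b,c]' into ['a', 'b', 'c']; '[]' or '' gives an empty list."""
    inner = value.strip().strip("[]")
    return [item.strip() for item in inner.split(",") if item.strip()]


def check_settings(settings: Dict[str, str], group: str = "histogram_pg") -> List[str]:
    """Return a list of human-readable problems; empty means no problem found."""
    problems: List[str] = []
    if not parse_list(settings.get("monitor.port_group_list", "")):
        problems.append("monitor.port_group_list must name at least one port group")
    key = f"monitor.{group}.histogram.minimum_bytes_boundary"
    if key in settings and int(settings[key]) % 96 != 0:
        problems.append(f"{key} = {settings[key]} is not a multiple of 96")
    return problems


if __name__ == "__main__":
    sample = """
    monitor.port_group_list = [discards_pg,histogram_pg,all_packet_pg,buffers_pg]
    monitor.histogram_pg.histogram.minimum_bytes_boundary = 1024
    """
    for problem in check_settings(parse_monitor_conf(sample)):
        print(problem)
```

Note that the value 1024 used for minimum_bytes_boundary in the sample configuration leaves a remainder of 64 when divided by 96, so this check flags it; a value such as 960 passes.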
Restarting the asic-monitor Service
After you modify the configuration in the monitor.conf file, you need to restart the asic-monitor service. This does not disrupt traffic, nor does it require you to restart switchd for the changes to take effect:
cumulus@switch:~$ sudo systemctl restart asic-monitor.service
The service is enabled by default when you boot the switch and is restarted whenever you restart
During state collection, the monitoring service may respond to a threshold being crossed, which triggers a monitoring action.
At this time, the only type of trigger that initiates state collection is a timer. The timer is the frequency at which the histogram triggers and reads the ASIC state.
When a monitoring statistic meets a configured threshold, it can trigger an action. Triggers can include:
- Queue length, as measured by a histogram
- Packet drops due to packet buffer congestion
- Packet drops due to errors
If no trigger is configured for a monitoring action, the action always occurs.
Understanding Monitoring Actions
Monitoring actions are responses to triggers issued by the
There are three types of monitoring actions: collect, log and snapshot. Any or all of these actions can be triggered by one monitoring step.
A collect action triggers the collection of additional ASIC state. Multiple port groups can be daisy chained into a single collect action. A collect action requires a port group.
A log action writes out the state to syslog. For example:
2017-04-26T20:14:41.560840+00:00 cumulus asic-monitor-module INFO: 2017-04-26 20:14:41.559967: Egress queue(s) greater than 500 bytes in monitor port group histogram_pg
A snapshot action takes a snapshot of the current state that was collected and writes it out to a file. You specify the prefix for the snapshot file name, including the path, and the number of snapshots that can be taken before the system starts overwriting the earliest snapshot files. For example, if the snapshot file is called /var/lib/cumulus/snapshot and the snapshot file count is set to 64, then the first snapshot file is named snapshot_0 and the 64th snapshot is named snapshot_63. When the 65th snapshot is taken, the original snapshot file, /var/lib/cumulus/snapshot_0, is overwritten, and subsequent snapshots overwrite the remaining files in sequence.
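The file rotation described above amounts to a modulo on the snapshot counter. The following Python sketch shows the naming rule only; the function snapshot_path is hypothetical and is not taken from asic-monitor.

```python
def snapshot_path(prefix: str, snapshot_number: int, file_count: int) -> str:
    """Return the file written by the Nth snapshot (N starts at 1).

    Suffixes run from 0 to file_count - 1 and then wrap around,
    so the oldest file is the next one to be overwritten.
    """
    if snapshot_number < 1 or file_count < 1:
        raise ValueError("snapshot_number and file_count must be at least 1")
    return f"{prefix}_{(snapshot_number - 1) % file_count}"


if __name__ == "__main__":
    for n in (1, 64, 65, 66):
        print(n, snapshot_path("/var/lib/cumulus/snapshot", n, 64))
```

With a file count of 64, snapshots 1 and 64 map to snapshot_0 and snapshot_63, and snapshot 65 wraps around to snapshot_0 again.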
Caveats and Errata
Keep in mind that collecting this data involves significant overhead on the CPU and the SDK process, which can affect execution of switchd. Snapshots and logging can occupy a lot of disk space if you do not limit the number of files.