Monitor Events

Two event workflows, the Alarms card workflow and the Info card workflow, provide a view into the events occurring in the network. The Alarms card workflow tracks critical severity events, whereas the Info card workflow tracks all warning, info, and debug severity events.
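The split between the two workflows amounts to a rule on event severity. The following sketch is illustrative only; `card_for_severity` is a hypothetical helper name and not part of NetQ.

```python
# Illustrative only: card_for_severity is a hypothetical helper, not a NetQ API.
ALARM_SEVERITIES = {"critical"}
INFO_SEVERITIES = {"warning", "info", "debug"}


def card_for_severity(severity: str) -> str:
    """Return the card workflow that tracks events of the given severity."""
    level = severity.strip().lower()
    if level in ALARM_SEVERITIES:
        return "Alarms"
    if level in INFO_SEVERITIES:
        return "Info"
    raise ValueError(f"unknown severity: {severity!r}")
```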

To focus on events from a single device perspective, refer to Monitor Switches.

Monitor Alarms

You can monitor critical events occurring across your network using the Alarms card. You can determine the number of events for the various system, interface, and network protocols and services components in the network. The content of each card in the workflow is described first, followed by the common tasks you can perform using this card workflow.

Alarms Card Workflow Summary

The small Alarms card displays:

Item

Description

Indicates data is for all critical severity events in the network

Alarm trend

Trend of alarm count, represented by an arrow:

  • Pointing upward and bright pink: alarm count is higher than the last two time periods, an increasing trend

  • Pointing downward and green: alarm count is lower than the last two time periods, a decreasing trend

  • No arrow: alarm count is unchanged over the last two time periods, trend is steady

Alarm score

Current count of alarms during the designated time period

Alarm rating

Count of alarms relative to the average count of alarms during the designated time period:

  • Low: Count of alarms is below the average count; a nominal count

  • Med: Count of alarms is in range of the average count; some room for improvement

  • High: Count of alarms is above the average count; user intervention recommended

Chart

Distribution of alarms received during the designated time period and a total count of all alarms present in the system
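The trend arrow and rating rules described for the small card can be sketched in a few lines of Python. This is an illustrative sketch, not NetQ's implementation: the function names, the treatment of mixed cases (higher than one prior period but lower than the other) as steady, and the 10% `tolerance` band used for "in range of the average" are all assumptions.

```python
from statistics import mean
from typing import Sequence


def alarm_trend(current: int, last_two: Sequence[int]) -> str:
    """Classify the trend arrow from the current count and the two prior periods."""
    if all(current > count for count in last_two):
        return "up"      # upward, bright pink arrow
    if all(current < count for count in last_two):
        return "down"    # downward, green arrow
    # Unchanged counts show no arrow. Treating mixed cases the same way
    # is an assumption made for this sketch.
    return "steady"


def alarm_rating(current: int, history: Sequence[int], tolerance: float = 0.10) -> str:
    """Rate the current count against the average of `history`.

    `tolerance` is a hypothetical band around the average that counts as "Med".
    """
    average = mean(history)
    if current < average * (1 - tolerance):
        return "Low"
    if current > average * (1 + tolerance):
        return "High"
    return "Med"
```

For example, `alarm_trend(5, [3, 4])` yields `"up"`, and `alarm_rating(0, [4, 6])` yields `"Low"` because 0 is below the average of 5.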

The medium Alarms card displays:

Item

Description

Time period

Range of time in which the displayed data was collected; applies to all card sizes

Indicates data is for all critical events in the network

Count

Total number of alarms received during the designated time period

Alarm trend

Trend of alarm count, represented by an arrow:

  • Pointing upward and bright pink: alarm count is higher than the last two time periods, an increasing trend

  • Pointing downward and green: alarm count is lower than the last two time periods, a decreasing trend

  • No arrow: alarm count is unchanged over the last two time periods, trend is steady

Alarm score

Current count of alarms received from each category (overall, system, interface, and network services) during the designated time period

Chart

Distribution of all alarms received from each category during the designated time period

The large Alarms card has one tab.

The System, Trace and Interfaces tab displays:

Item

Description

Time period

Range of time in which the displayed data was collected; applies to all card sizes

Indicates data is for all system, trace and interface critical events in the network

Alarm Distribution

Chart: Distribution of all alarms received from each category (services, NetQ Agents, Cumulus Linux licenses, sensors, ports, links, MTU, LLDP and configuration changes) during the designated time period

Count: Total number of alarms received from each category during the designated time period

Table

Listing of items that match the filter selection for the selected alarm categories:

  • Events by Most Recent: Most recent events are listed at the top

  • Devices by Event Count: Devices with the most events are listed at the top

Show All Events

Opens full screen Events | Alarms card with a listing of all events

The full screen Alarms card provides tabs for all events.

Item

Description

Title

Events | Alarms

Closes full screen card and returns to workbench

Default Time

Range of time in which the displayed data was collected

Results

Number of results found for the selected tab

All Events

Displays all events (both alarms and info) received in the time period. By default, the event list is sorted by the date and time that the event occurred (Time). This tab provides the following additional data about each event:

  • Source: Hostname of the given event

  • Message: Text describing the alarm or info event that occurred

  • Type: Name of network protocol and/or service that triggered the given event

  • Severity: Importance of the event: critical, warning, info, or debug

Export

Enables export of all or selected items in a CSV or JSON formatted file

Enables manipulation of table display; choose columns to display and reorder columns
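Once events are exported, they can be summarized with standard tooling. The sketch below counts events by severity from CSV or JSON text. The column names and sample rows are assumptions based on the All Events columns described above; the actual layout of an exported file may differ.

```python
import csv
import io
import json
from collections import Counter

# Hypothetical export content; real column names and layout may differ.
CSV_EXPORT = """Time,Source,Message,Type,Severity
2019-06-01 10:00:00,leaf01,Agent state changed to rotten,agent,critical
2019-06-01 10:05:00,leaf03,License check failed on leaf03,license,critical
2019-06-01 10:09:00,leaf04,HostName leaf04 changed state from down to up Interface:swp11,link,info
"""


def severity_counts_from_csv(text: str) -> Counter:
    """Count events per severity in CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(text))
    return Counter(row["Severity"] for row in reader)


def severity_counts_from_json(text: str) -> Counter:
    """Count events per severity in JSON text holding a list of event objects."""
    return Counter(item["Severity"] for item in json.loads(text))
```

With the sample above, `severity_counts_from_csv(CSV_EXPORT)` reports two critical events and one info event.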

View Alarm Status Summary

A summary of the critical alarms in the network includes the number of alarms, a trend indicator, a performance indicator, and a distribution of those alarms.

To view the summary, open the small Alarms card.

In this example, the alarm count is low (0), the trend is steady (no arrow), and the current count is below the average number of alarms for this time period, so no further investigation is needed. Note that with so few alarms, the rating may be skewed.

View the Distribution of Alarms

It is helpful to know where and when alarms are occurring in your network. The Alarms card workflow enables you to see the distribution of alarms based on their source: network services, interfaces, or other system services. You can also view the trend of alarms in each source category.

To view the alarm distribution, open the medium Alarms card. Scroll down to view all of the charts.

Monitor System and Interface Alarm Details

The Alarms card workflow enables you to view and track critical severity system and interface alarms occurring anywhere in your network.

View All System and Interface Alarms

You can view the alarms associated with the system and interfaces using the Alarms card workflow. You can sort alarms based on their occurrence or view devices with the most alarms.

To view system and interface alarms, open the large Alarms card.

From this card, you can view the distribution of alarms for each of the categories over time. Scroll down to view any hidden charts. A list of the associated alarms is also displayed.

By default, the list of the most recent alarms for the systems and interfaces is displayed when viewing the large card.

View Devices with the Most Alarms

You can filter instead for the devices that have the most alarms.

To view devices with the most alarms, open the large Alarms card, and then select Devices by event count from the dropdown.
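The two table views, Events by Most Recent and Devices by Event Count, correspond to a sort and a grouped count over the same event list. The sketch below shows the idea on made-up sample events; the field names and helper names are assumptions for illustration and do not reflect how NetQ stores events.

```python
from collections import Counter
from datetime import datetime

# Sample events are made up for illustration.
events = [
    {"time": datetime(2019, 6, 1, 10, 0), "source": "leaf01", "type": "link"},
    {"time": datetime(2019, 6, 1, 10, 5), "source": "leaf03", "type": "license"},
    {"time": datetime(2019, 6, 1, 10, 9), "source": "leaf01", "type": "services"},
]


def events_by_most_recent(events):
    """Return events with the most recent first."""
    return sorted(events, key=lambda event: event["time"], reverse=True)


def devices_by_event_count(events):
    """Return (device, count) pairs with the busiest device first."""
    return Counter(event["source"] for event in events).most_common()
```

Here `devices_by_event_count(events)` places `leaf01` first with two events, followed by `leaf03` with one.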

Filter Alarms by System or Interface

You can focus your view to include alarms for selected system or interface categories.

To filter for selected categories:

  1. Click the checkbox to the left of one or more charts to remove that set of alarms from the table on the right.
  2. Select Devices by event count to view the devices with the most alarms for the selected categories.
  3. Switch back to most recent events by selecting Events by most recent.
  4. Click the checkbox again to return a category’s data to the table.

In this example, we removed the Services from the event listing.

Compare Alarms with a Prior Time

You can change the time period for the data to compare with a prior time. If the same devices are consistently indicating the most alarms, you might want to look more carefully at those devices using the Switches card workflow.

To compare two time periods:

  1. Open a second Alarm Events card. Remember it goes to the bottom of the workbench.
  2. Switch to the large size view.
  3. Move the card to be next to the original Alarm Events card. Note that moving large cards can take a few extra seconds since they contain a large amount of data.
  4. Hover over the card and click .

  5. Select a different time period.

  6. Compare the two cards with the Devices by event count filter applied.

    In this example, both the total alarm count and the devices with the most alarms in each time period are unchanged. You could go back further in time to see if this changes or investigate the current status of the largest offenders.
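The same comparison can be made on per-device alarm counts taken from two time periods. The following sketch assumes you already have the counts for each period (for example, copied from the two cards); `compare_periods` is a hypothetical helper, not a NetQ function.

```python
from collections import Counter


def compare_periods(current: Counter, prior: Counter) -> dict:
    """Return the per-device change in alarm count (current minus prior)."""
    devices = set(current) | set(prior)
    # Counter returns 0 for devices missing from one of the periods.
    return {device: current[device] - prior[device] for device in sorted(devices)}
```

A result of 0 for a device means its alarm count is unchanged between the two periods; devices that stay at the top in both periods are candidates for a closer look.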

View All Events

You can view all events in the network either by clicking the Show All Events link under the table on the large Alarm Events card, or by opening the full screen Alarm Events card.

To return to your workbench, click in the top right corner of the card.

Monitor Info Events

You can monitor warning, info, and debug severity events occurring across your network using the Info card. You can determine the number of events for the various system, interface, and network protocols and services components in the network. The content of each card in the workflow is described first, followed by the common tasks you can perform using this card workflow.

Info Card Workflow Summary

The Info card workflow enables you to view and track informational events occurring anywhere in your network.

The small Info card displays:

Item

Description

Indicates data is for all warning, info, and debug severity events in the network

Info count

Number of info events received during the designated time period

Alarm count

Number of alarm events received during the designated time period

Chart

Distribution of all info events and alarms received during the designated time period

The medium Info card displays:

Item

Description

Time period

Range of time in which the displayed data was collected; applies to all card sizes

Indicates data is for all warning, info, and debug severity events in the network

Types of Info

Chart that displays the services that triggered events during the designated time period. Hover over the chart to view a count for each type.

Distribution of Info

Info Status

  • Count: Number of info events received during the designated time period

  • Chart: Distribution of all info events received during the designated time period

Alarms Status

  • Count: Number of alarm events received during the designated time period

  • Chart: Distribution of all alarm events received during the designated time period

The large Info card displays:

Item

Description

Time period

Range of time in which the displayed data was collected; applies to all card sizes

Indicates data is for all warning, info, and debug severity events in the network

Types of Info

Chart that displays the services that triggered events during the designated time period. Hover over the chart to view a count for each type.

Distribution of Info

Info Status

  • Count: Current number of info events received during the designated time period

  • Chart: Distribution of all info events received during the designated time period

Alarms Status

  • Count: Current number of alarm events received during the designated time period

  • Chart: Distribution of all alarm events received during the designated time period

Table

Listing of items that match the filter selection:

  • Events by Most Recent: Most recent events are listed at the top

  • Devices by Event Count: Devices with the most events are listed at the top

Show All Events

Opens full screen Events | Info card with a listing of all events

The full screen Info card provides tabs for all events.

Item

Description

Title

Events | Info

Closes full screen card and returns to workbench

Default Time

Range of time in which the displayed data was collected

Results

Number of results found for the selected tab

All Events

Displays all events (both alarms and info) received in the time period. By default, the event list is sorted by the date and time that the event occurred (Time). This tab provides the following additional data about each event:

  • Source: Hostname of the given event

  • Message: Text describing the alarm or info event that occurred

  • Type: Name of network protocol and/or service that triggered the given event

  • Severity: Importance of the event: critical, warning, info, or debug

Export

Enables export of all or selected items in a CSV or JSON formatted file

Enables manipulation of table display; choose columns to display and reorder columns

View Info Status Summary

A summary of the informational events occurring in the network can be found on the small, medium, and large Info cards. Additional details are available as you increase the size of the card.

To view the summary with the small Info card, simply open the card. This card gives you a high-level view in a condensed visual, including the number and distribution of the info events along with the alarms that have occurred during the same time period.

To view the summary with the medium Info card, simply open the card. This card gives you the same count and distribution of info and alarm events, but it also provides information about the sources of the info events and enables you to view a small slice of time using the distribution charts.

Use the chart at the top of the card to view the various sources of info events. The four or so types with the most info events are called out separately, with all others collected together into an Other category. Hover over a segment of the chart to view the count for each type.
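The grouping into a few named types plus an Other category can be sketched as follows. This is an assumption-laden illustration: the cutoff of four (`top_n`) mirrors the "four or so" wording, and `top_types_with_other` is a hypothetical helper name.

```python
from collections import Counter


def top_types_with_other(type_counts: Counter, top_n: int = 4) -> dict:
    """Keep the `top_n` most frequent event types and sum the rest into "Other"."""
    top = type_counts.most_common(top_n)
    other = sum(type_counts.values()) - sum(count for _, count in top)
    result = dict(top)
    if other:
        result["Other"] = other
    return result
```

For counts of bgp 10, link 7, ntp 5, lldp 4, clag 2, and evpn 1, the four largest types are kept and the remaining three events are reported as Other.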

To view the summary with the large Info card, open the card. The left side of the card provides the same capabilities as the medium Info card.

Compare Timing of Info and Alarm Events

While you can see the relative relationship between info and alarm events on the small Info card, the medium and large cards provide considerably more information. Open either of these to view individual line charts for the events. Generally, alarms have some corollary info events. For example, when a network service becomes unavailable, a critical alarm is often issued, and when the service becomes available again, an info event of severity warning is generated. For this reason, you might see some level of tracking between the info and alarm counts and distributions. Some other possible scenarios:

  • When a critical alarm is resolved, you may see a temporary increase in info events as a result.
  • When you get a burst of info events, you may see a follow-on increase in critical alarms, as the info events may have been warning you of something beginning to go wrong.
  • When you set logging to debug, you see a large number of info events of severity debug, but you would not expect to see an increase in critical alarms.
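To see whether info and alarm events track each other, it can help to count both per time bucket, which is roughly what the two line charts show. The sketch below buckets events by hour; the one-hour bucket size, the input shape (timestamp and severity pairs), and the function name are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime


def hourly_counts(events):
    """Count alarm (critical) and info (all other severities) events per hour."""
    buckets = defaultdict(lambda: {"alarm": 0, "info": 0})
    for when, severity in events:
        hour = when.replace(minute=0, second=0, microsecond=0)
        kind = "alarm" if severity == "critical" else "info"
        buckets[hour][kind] += 1
    return dict(buckets)
```

Comparing the `alarm` and `info` counts across consecutive hours shows, for example, a burst of info events followed by a rise in critical alarms.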

View All Info Events Sorted by Time of Occurrence

You can view all info events using the large Info card. Open the large card and confirm the Events By Most Recent option is selected in the filter above the table on the right. When this option is selected, all of the info events are listed with the most recently occurring event at the top. Scrolling down shows you the info events that have occurred at an earlier time within the selected time period for the card.

View Devices with the Most Info Events

You can filter instead for the devices that have the most info events by selecting the Devices by Event Count option from the filter above the table.

View All Events

You can view all events in the network either by clicking the Show All Events link under the table on the large Info Events card, or by opening the full screen Info Events card.

To return to your workbench, click in the top right corner of the card.

Events Reference

The following table lists all event messages organized by type.

The messages can be viewed through third-party notification applications. For details about configuring notifications using the NetQ CLI, refer to Integrate NetQ with Notification Applications.

| Type | Trigger | Severity | Message Format | Example |
| --- | --- | --- | --- | --- |
| agent | NetQ Agent state changed to Rotten (not heard from in over 15 seconds) | Critical | Agent state changed to rotten | Agent state changed to rotten |
| agent | NetQ Agent state changed to Fresh | Info | Agent state changed to fresh | Agent state changed to fresh |
| bgp | BGP Session state changed | Critical | BGP session with peer @peer @neighbor vrf @vrf state changed from @old_state to @new_state | BGP session with peer leaf03 leaf04 vrf mgmt state changed from Established to Failed |
| bgp | BGP Session state changed from Failed to Established | Info | BGP session with peer @peer @peerhost @neighbor vrf @vrf session state changed from Failed to Established | BGP session with peer swp5 spine02 spine03 vrf default session state changed from Failed to Established |
| bgp | BGP Session state changed from Established to Failed | Info | BGP session with peer @peer @neighbor vrf @vrf state changed from established to failed | BGP session with peer leaf03 leaf04 vrf mgmt state changed from established to failed |
| bgp | The reset time for a BGP session changed | Info | BGP session with peer @peer @neighbor vrf @vrf reset time changed from @old_last_reset_time to @new_last_reset_time | BGP session with peer spine03 swp9 vrf vrf2 reset time changed from 1559427694 to 1559837484 |
| cable | Link speed is not the same on both ends of the link | Critical | @ifname speed @speed, mismatched with peer @peer @peer_if speed @peer_speed | swp2 speed 10, mismatched with peer server02 swp8 speed 40 |
| cable | The speed setting for a given port changed | Info | @ifname speed changed from @old_speed to @new_speed | swp9 speed changed from 10 to 40 |
| cable | The transceiver status for a given port changed | Info | @ifname transceiver changed from @old_transceiver to @new_transceiver | swp4 transceiver changed from disabled to enabled |
| cable | The vendor of a given transceiver changed | Info | @ifname vendor name changed from @old_vendor_name to @new_vendor_name | swp23 vendor name changed from Broadcom to Mellanox |
| cable | The part number of a given transceiver changed | Info | @ifname part number changed from @old_part_number to @new_part_number | swp7 part number changed from FP1ZZ5654002A to MSN2700-CS2F0 |
| cable | The serial number of a given transceiver changed | Info | @ifname serial number changed from @old_serial_number to @new_serial_number | swp4 serial number changed from 571254X1507020 to MT1552X12041 |
| cable | The status of forward error correction (FEC) support for a given port changed | Info | @ifname supported fec changed from @old_supported_fec to @new_supported_fec | swp12 supported fec changed from supported to unsupported; swp12 supported fec changed from unsupported to supported |
| cable | The advertised support for FEC for a given port changed | Info | @ifname supported fec changed from @old_advertised_fec to @new_advertised_fec | swp24 supported FEC changed from advertised to not advertised |
| cable | The FEC status for a given port changed | Info | @ifname fec changed from @old_fec to @new_fec | swp15 fec changed from disabled to enabled |
| clag | CLAG remote peer state changed from up to down | Critical | Peer state changed to down | Peer state changed to down |
| clag | Local CLAG host MTU does not match its remote peer MTU | Critical | SVI @svi1 on vlan @vlan mtu @mtu1 mismatched with peer mtu @mtu2 | SVI svi7 on vlan 4 mtu 1592 mismatched with peer mtu 1680 |
| clag | CLAG SVI on VLAN is missing from remote peer state | Warning | SVI on vlan @vlan is missing from peer | SVI on vlan vlan4 is missing from peer |
| clag | CLAG peerlink is not operating at full capacity. At least one link is down. | Warning | Clag peerlink not at full redundancy, member link @slave is down | Clag peerlink not at full redundancy, member link swp40 is down |
| clag | CLAG remote peer state changed from down to up | Info | Peer state changed to up | Peer state changed to up |
| clag | Local CLAG host state changed from down to up | Info | Clag state changed from down to up | Clag state changed from down to up |
| clag | CLAG bond in Conflicted state was updated with new bonds | Info | Clag conflicted bond changed from @old_conflicted_bonds to @new_conflicted_bonds | Clag conflicted bond changed from swp7 swp8 to swp9 swp10 |
| clag | CLAG bond changed state from protodown to up state | Info | Clag conflicted bond changed from @old_state_protodownbond to @new_state_protodownbond | Clag conflicted bond changed from protodown to up |
| clsupport | A new CL Support file has been created for the given node | Critical | HostName @hostname has new CL SUPPORT file | HostName leaf01 has new CL SUPPORT file |
| configdiff | Configuration file deleted on a device | Critical | @hostname config file @type was deleted | spine03 config file /etc/frr/frr.conf was deleted |
| configdiff | Configuration file has been created | Info | @hostname config file @type was created | leaf12 config file /etc/lldp.d/README.conf was created |
| configdiff | Configuration file has been modified | Info | @hostname config file @type was modified | spine03 config file /etc/frr/frr.conf was modified |
| evpn | A VNI was configured and moved from the up state to the down state | Critical | VNI @vni state changed from up to down | VNI 36 state changed from up to down |
| evpn | A VNI was configured and moved from the down state to the up state | Info | VNI @vni state changed from down to up | VNI 36 state changed from down to up |
| evpn | The kernel state changed on a VNI | Info | VNI @vni kernel state changed from @old_in_kernel_state to @new_in_kernel_state | VNI 3 kernel state changed from down to up |
| evpn | A VNI state changed from not advertising all VNIs to advertising all VNIs | Info | VNI @vni vni state changed from @old_adv_all_vni_state to @new_adv_all_vni_state | VNI 11 vni state changed from false to true |
| license | License state is missing or invalid | Critical | License check failed, name @lic_name state @state | License check failed, name agent.lic state invalid |
| license | License state is missing or invalid on a particular device | Critical | License check failed on @hostname | License check failed on leaf03 |
| link | Link operational state changed from up to down | Critical | HostName @hostname changed state from @old_state to @new_state Interface:@ifname | HostName leaf01 changed state from up to down Interface:swp34 |
| link | Link operational state changed from down to up | Info | HostName @hostname changed state from @old_state to @new_state Interface:@ifname | HostName leaf04 changed state from down to up Interface:swp11 |
| lldp | Local LLDP host has new neighbor information | Info | LLDP Session with host @hostname and @ifname modified fields @changed_fields | LLDP Session with host leaf02 swp6 modified fields leaf06 swp21 |
| lldp | Local LLDP host has new peer interface name | Info | LLDP Session with host @hostname and @ifname @old_peer_ifname changed to @new_peer_ifname | LLDP Session with host spine01 and swp5 swp12 changed to port12 |
| lldp | Local LLDP host has new peer hostname | Info | LLDP Session with host @hostname and @ifname @old_peer_hostname changed to @new_peer_hostname | LLDP Session with host leaf03 and swp2 leaf07 changed to exit01 |
| lnv | VXLAN registration daemon, vxrd, is not running | Critical | vxrd service not running | vxrd service not running |
| mtu | VLAN interface link MTU is smaller than that of its parent MTU | Warning | vlan interface @link mtu @mtu is smaller than parent @parent mtu @parent_mtu | vlan interface swp3 mtu 1500 is smaller than parent peerlink-1 mtu 1690 |
| mtu | Bridge interface MTU is smaller than the member interface with the smallest MTU | Warning | bridge @link mtu @mtu is smaller than least of member interface mtu @min | bridge swp0 mtu 1280 is smaller than least of member interface mtu 1500 |
| ntp | NTP sync state changed from in sync to not in sync | Critical | Sync state changed from @old_state to @new_state for @hostname | Sync state changed from in sync to not sync for leaf06 |
| ntp | NTP sync state changed from not in sync to in sync | Info | Sync state changed from @old_state to @new_state for @hostname | Sync state changed from not sync to in sync for leaf06 |
| ospf | OSPF session state on a given interface changed from Full to a down state | Critical | OSPF session @ifname with @peer_address changed from Full to @down_state | OSPF session swp7 with 27.0.0.18 state changed from Full to Fail; OSPF session swp7 with 27.0.0.18 state changed from Full to ExStart |
| ospf | OSPF session state on a given interface changed from a down state to Full | Info | OSPF session @ifname with @peer_address changed from @down_state to Full | OSPF session swp7 with 27.0.0.18 state changed from Down to Full; OSPF session swp7 with 27.0.0.18 state changed from Init to Full; OSPF session swp7 with 27.0.0.18 state changed from Fail to Full |
| ptm | Physical interface cabling does not match configuration specified in topology.dot file | Critical | PTM cable status failed | PTM cable status failed |
| ptm | Physical interface cabling matches configuration specified in topology.dot file | Critical | PTM cable status passed | PTM cable status passed |
| runningconfigdiff | Running configuration file has been modified | Info | @commandname config result was modified | @commandname config result was modified |
| sensor | A fan or power supply unit sensor has changed state | Critical | Sensor @sensor state changed from @old_s_state to @new_s_state | Sensor fan state changed from up to down |
| sensor | A temperature sensor has crossed the maximum threshold for that sensor | Critical | Sensor @sensor max value @new_s_max exceeds threshold @new_s_crit | Sensor temp max value 110 exceeds the threshold 95 |
| sensor | A temperature sensor has crossed the minimum threshold for that sensor | Critical | Sensor @sensor min value @new_s_lcrit fall behind threshold @new_s_min | Sensor psu min value 10 fell below threshold 25 |
| sensor | A temperature, fan, or power supply sensor state changed | Info | Sensor @sensor state changed from @old_state to @new_state | Sensor temperature state changed from critical to ok; Sensor fan state changed from absent to ok; Sensor psu state changed from bad to ok |
| sensor | A fan or power supply sensor state changed | Info | Sensor @sensor state changed from @old_s_state to @new_s_state | Sensor fan state changed from down to up; Sensor psu state changed from down to up |
| services | A service status changed from down to up | Critical | Service @name status changed from @old_status to @new_status | Service bgp status changed from down to up |
| services | A service status changed from up to down | Critical | Service @name status changed from @old_status to @new_status | Service lldp status changed from up to down |
| services | A service changed state from inactive to active | Info | Service @name changed state from inactive to active | Service bgp changed state from inactive to active; Service lldp changed state from inactive to active |
| vxlan | Replication list contains an inconsistent set of nodes | Critical | VNI @vni replication list inconsistent with @conflicts diff:@diff | VNI 14 replication list inconsistent with ["leaf03","leaf04"] diff:+:["leaf03","leaf04"] -:["leaf07","leaf08"] |
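When event messages arrive in a notification application, the message formats listed above can be used to pull the variable fields back out of the text. The sketch below converts a format string with `@name` placeholders into a regular expression with named groups. It is an illustration under stated assumptions, not a NetQ utility: the helper names are hypothetical, each placeholder is assumed to match a single token without spaces, and a format that repeats a placeholder name is not handled.

```python
import re


def format_to_regex(message_format: str) -> "re.Pattern":
    """Turn '@name' placeholders into named groups matching one space-free token."""
    # re.split with a capturing group alternates literal text and placeholder names.
    parts = re.split(r"@(\w+)", message_format)
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2:                      # odd positions hold placeholder names
            pattern += f"(?P<{part}>\\S+)"
        else:                              # even positions hold literal text
            pattern += re.escape(part)
    return re.compile(f"^{pattern}$")


def parse_event(message_format: str, message: str):
    """Return the placeholder values found in `message`, or None if it does not match."""
    match = format_to_regex(message_format).match(message)
    return match.groupdict() if match else None
```

For example, parsing `swp9 speed changed from 10 to 40` against `@ifname speed changed from @old_speed to @new_speed` returns the interface name and the old and new speeds as strings. Values that contain spaces, such as `not advertised`, do not fit the single-token assumption and would need a looser pattern.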