This documentation is for an older version of the software. If you are using the current version of Cumulus Linux, this content may not be up to date. The current version of the documentation is available here. If you are redirected to the main page of the user guide, then this page may have been renamed; please search for it there.

Monitor Network Health

As with any network, one of the challenges is keeping track of all of the moving parts. With the NetQ GUI, you can view the overall health of your network at a glance and then delve deeper for periodic checks or as conditions arise that require attention. For a general understanding of how well your network is operating, the Network Health card workflow is the best place to start, as it contains the highest-level view and performance rollups.

Network Health Card Workflow Summary

The small Network Health card displays:

/images/download/thumbnails/10464197/image2019-2-19-15_16_13.png

Item

Description

/images/old_doc_images/lBqcsAlS_csR2RPZ7eyK-SXLOidTjHYZh0JfJJsF6NTR2zZ7bHlrat6pE1t8bUAz8EecWCz6FiryJ6p7kAvHuS2LZ6nHJIFX6fu472Ce7eN1xogZr4ke-4klOEczoAxp3-j7qych

Indicates data is for overall Network Health

Health trend

Trend of overall network health, represented by an arrow:

  • Pointing upward and green: Health score in the most recent window is higher than in the last two data collection windows, an increasing trend

  • Pointing downward and bright pink: Health score in the most recent window is lower than in the last two data collection windows, a decreasing trend

  • No arrow: Health score is unchanged over the last two data collection windows, trend is steady

The data collection window varies based on the time period of the card. For a 24-hour time period (default), the window is one hour. This gives you current, hourly updates about your network health.

Health score

Average of health scores for system health, network services health, and interface health during the last data collection window. The health score for each category is calculated as the percentage of items which passed validations versus the number of items checked.

The data collection window varies based on the time period of the card. For a 24-hour time period (default), the window is one hour. This gives you current, hourly updates about your network health.

Health rating

Performance rating based on the health score during the time window:

  • Low: Health score is less than 40%

  • Med: Health score is between 40% and 70%

  • High: Health score is greater than 70%

Chart

Distribution of overall health status during the designated time period
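The rating and trend rules in the table above can be sketched in a few lines of Python. This is an illustration only, not NetQ code: the function names health_rating and health_trend are hypothetical, and two details the table leaves open are assumptions here. A score of exactly 40% or 70% is treated as Med, and a score that is neither above nor below every earlier window is treated as steady.

```python
from typing import Sequence


def health_rating(score: float) -> str:
    """Map a health score (0 to 100) to the Low/Med/High rating."""
    if score < 40:
        return "Low"
    if score > 70:
        return "High"
    # Assumption: the boundary values 40 and 70 fall in the Med band.
    return "Med"


def health_trend(earlier: Sequence[float], latest: float) -> str:
    """Compare the latest window's score with the earlier windows."""
    if not earlier:
        return "steady"  # nothing to compare against
    if all(latest > score for score in earlier):
        return "up"      # green arrow pointing upward
    if all(latest < score for score in earlier):
        return "down"    # bright pink arrow pointing downward
    return "steady"      # no arrow


print(health_rating(35), health_trend([52.0, 48.0], 35.0))  # prints: Low down
```

With the default 24-hour time period, each value passed to health_trend would be the score for one hourly window.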

The medium Network Health card displays the distribution, score, and trend of the:

/images/download/attachments/10464197/image2019-2-19-15_18_19.png

Item

Description

Time period

Range of time in which the displayed data was collected; applies to all card sizes

/images/old_doc_images/lBqcsAlS_csR2RPZ7eyK-SXLOidTjHYZh0JfJJsF6NTR2zZ7bHlrat6pE1t8bUAz8EecWCz6FiryJ6p7kAvHuS2LZ6nHJIFX6fu472Ce7eN1xogZr4ke-4klOEczoAxp3-j7qych

Indicates data is for overall Network Health

Health trend

Trend of system, network service, and interface health, represented by an arrow:

  • Pointing upward and green: Health score in the most recent window is higher than in the last two data collection windows, an increasing trend

  • Pointing downward and bright pink: Health score in the most recent window is lower than in the last two data collection windows, a decreasing trend

  • No arrow: Health score is unchanged over the last two data collection windows, trend is steady

The data collection window varies based on the time period of the card. For a 24-hour time period (default), the window is one hour. This gives you current, hourly updates about your network health.

Health score

Percentage of devices which passed validation versus the number of devices checked during the time window for:

  • System health: NetQ Agent health, Cumulus Linux license status, and sensors

  • Network services health: BGP, CLAG, EVPN, LNV, NTP, and VXLAN health

  • Interface health: interfaces, MTU, and VLAN health

The data collection window varies based on the time period of the card. For a 24-hour time period (default), the window is one hour. This gives you current, hourly updates about your network health.

Chart

Distribution of overall health status during the designated time period

The large Network Health card contains three tabs.

The System Health tab displays:

/images/download/attachments/10464197/image2019-2-19-15_26_38.png

Item

Description

Time period

Range of time in which the displayed data was collected; applies to all card sizes

/images/old_doc_images/lBqcsAlS_csR2RPZ7eyK-SXLOidTjHYZh0JfJJsF6NTR2zZ7bHlrat6pE1t8bUAz8EecWCz6FiryJ6p7kAvHuS2LZ6nHJIFX6fu472Ce7eN1xogZr4ke-4klOEczoAxp3-j7qych

Indicates data is for overall Network Health

Health trend

Trend of NetQ Agents, Cumulus Linux licenses, and sensor health, represented by an arrow:

  • Pointing upward and green: Health score in the most recent window is higher than in the last two data collection windows, an increasing trend

  • Pointing downward and bright pink: Health score in the most recent window is lower than in the last two data collection windows, a decreasing trend

  • No arrow: Health score is unchanged over the last two data collection windows, trend is steady

The data collection window varies based on the time period of the card. For a 24-hour time period (default), the window is one hour. This gives you current, hourly updates about your network health.

Health score

Percentage of devices which passed validation versus the number of devices checked during the time window for NetQ Agents, Cumulus Linux license status, and platform sensors.

The data collection window varies based on the time period of the card. For a 24-hour time period (default), the window is one hour. This gives you current, hourly updates about your network health.

Charts

Distribution of health score for NetQ Agents, Cumulus Linux license status, and platform sensors during the designated time period

Table

Listing of items that match the filter selection:

  • Most Failures: Devices with the most validation failures are listed at the top

  • Recent Failures: Most recent validation failures are listed at the top

Show All Devices

Opens full screen Network Health card with a listing of all events

The Network Services Health tab displays:

/images/download/attachments/10464197/image2019-2-19-15_28_2.png

Item

Description

Time period

Range of time in which the displayed data was collected; applies to all card sizes

/images/old_doc_images/lBqcsAlS_csR2RPZ7eyK-SXLOidTjHYZh0JfJJsF6NTR2zZ7bHlrat6pE1t8bUAz8EecWCz6FiryJ6p7kAvHuS2LZ6nHJIFX6fu472Ce7eN1xogZr4ke-4klOEczoAxp3-j7qych

Indicates data is for overall Network Health

Health trend

Trend of BGP, CLAG, EVPN, LNV, NTP, and VXLAN services health, represented by an arrow:

  • Pointing upward and green: Health score in the most recent window is higher than in the last two data collection windows, an increasing trend

  • Pointing downward and bright pink: Health score in the most recent window is lower than in the last two data collection windows, a decreasing trend

  • No arrow: Health score is unchanged over the last two data collection windows, trend is steady

The data collection window varies based on the time period of the card. For a 24-hour time period (default), the window is one hour. This gives you current, hourly updates about your network health.

Health score

Percentage of devices which passed validation versus the number of devices checked during the time window for BGP, CLAG, EVPN, LNV, NTP, and VXLAN protocols and services.

The data collection window varies based on the time period of the card. For a 24-hour time period (default), the window is one hour. This gives you current, hourly updates about your network health.

Charts

Distribution of passing validations for BGP, CLAG, EVPN, LNV, NTP, and VXLAN services during the designated time period

Table

Listing of devices that match the filter selection:

  • Most Failures: Devices with the most validation failures are listed at the top

  • Recent Failures: Most recent validation failures are listed at the top

Show All Devices

Opens full screen Network Health card with a listing of all events

The Interfaces Health tab displays:

/images/download/attachments/10464197/image2019-2-19-15_28_22.png

Item

Description

Time period

Range of time in which the displayed data was collected; applies to all card sizes

/images/old_doc_images/lBqcsAlS_csR2RPZ7eyK-SXLOidTjHYZh0JfJJsF6NTR2zZ7bHlrat6pE1t8bUAz8EecWCz6FiryJ6p7kAvHuS2LZ6nHJIFX6fu472Ce7eN1xogZr4ke-4klOEczoAxp3-j7qych

Indicates data is for overall Network Health

Health trend

Trend of interfaces, MTU, and VLAN health, represented by an arrow:

  • Pointing upward and green: Health score in the most recent window is higher than in the last two data collection windows, an increasing trend

  • Pointing downward and bright pink: Health score in the most recent window is lower than in the last two data collection windows, a decreasing trend

  • No arrow: Health score is unchanged over the last two data collection windows, trend is steady

The data collection window varies based on the time period of the card. For a 24-hour time period (default), the window is one hour. This gives you current, hourly updates about your network health.

Health score

Percentage of devices which passed validation versus the number of devices checked during the time window for interfaces, MTUs, and VLANs.

The data collection window varies based on the time period of the card. For a 24-hour time period (default), the window is one hour. This gives you current, hourly updates about your network health.

Charts

Distribution of health score for interfaces, MTUs, and VLANs during the designated time period

Table

Listing of devices that match the filter selection:

  • Most Failures: Devices with the most validation failures are listed at the top

  • Recent Failures: Most recent validation failures are listed at the top

Show All Devices

Opens full screen Network Health card with a listing of all events

The full screen Network Health card displays all events in the network.

/images/download/attachments/10464197/image2019-2-19-15_54_1.png

Item

Description

Title

Network Health

Closes full screen card and returns to workbench

Default Time

Range of time in which the displayed data was collected

Results

Number of results found for the selected tab

All Events

Displays all events (both alarms and info) received in the time period. By default, the event list is sorted by the date and time that the event occurred (Time). This tab provides the following additional data about each event:

  • Source: Hostname of the device associated with the given event

  • Type: Name of network protocol and/or service that triggered the given event

  • Message: Text describing the alarm or info event that occurred

  • Severity: Importance of the event: critical, warning, info, or debug

Export

Enables export of all or selected items in a CSV- or JSON-formatted file

/images/old_doc_images/TxyRotE-Ks3VoU0rMfISNSl_V0m0yXqQyq8cn7CI6da54YIrMvzU8ttAOXmnbpUJdXBIQBG9OothePcEuJ-DoNYR1SdJIpW6RAlGd5wXxJdRcI0HPR3eMMcrSwotbHTrjqUNFH3w

Enables manipulation of table display; choose columns to display and reorder columns

View Network Health Summary

Overall network health is based on successful validation results. The summary includes the percentage of successful results, a trend indicator, and a distribution of the validation results.

To view a summary of your network health, open the small Network Health card.

/images/download/thumbnails/10464197/image2019-2-19-15_16_130.png

In this example, the overall health is quite low and digging further for causes is definitely warranted. Refer to the next section for viewing the key health metrics.
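The percentage of successful validation results that underlies the summary can be sketched as follows. This is a minimal illustration, assuming the score for a category is simply the passed count divided by the checked count; the function name category_score is hypothetical, and returning 0.0 when nothing was checked is a placeholder choice, not documented NetQ behavior.

```python
def category_score(passed: int, checked: int) -> float:
    """Percentage of checked items that passed validation."""
    if checked == 0:
        # Assumption: with nothing checked there is no score to report;
        # 0.0 is used here purely as a placeholder.
        return 0.0
    return 100.0 * passed / checked


print(category_score(9, 12))  # prints: 75.0
```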

View Key Metrics of Network Health

Overall network health is a calculated average of several key health metrics: System, Network Services, and Interface health.

To view these key metrics, open the medium Network Health card. Each metric is shown with the percentage of successful validations, a trend indicator, and a distribution of the validation results.

/images/download/attachments/10464197/image2019-2-19-15_18_190.png

In this example, the health of each of the three key metrics is good. You might choose to dig further on the system health if it did not continue to improve. Refer to the following section for additional details.
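As a rough illustration of how the three key metrics combine into the overall score, the sketch below takes their average. It assumes an unweighted mean, since the text says only "calculated average"; the function name overall_health and the sample values are hypothetical.

```python
from statistics import mean


def overall_health(system: float, network_services: float, interfaces: float) -> float:
    """Average the three key health metrics (each a percentage)."""
    # Assumption: all three metrics carry equal weight.
    return mean([system, network_services, interfaces])


print(overall_health(90.0, 60.0, 75.0))  # prints: 75.0
```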

View System Health

The system health is a calculated average of the NetQ Agent, Cumulus Linux license, and sensor health metrics. In all cases, validation is performed on the agents and licenses. If you are monitoring platform sensors, the calculation includes these as well. You can view the overall health of the system from the medium Network Health card and information about each component from the large Network Health card.

To view information about each system component:

  1. Open the large Network Health card.

  2. Hover over the card and click the System Health tab.

    /images/download/attachments/10464197/image2019-2-19-15_26_380.png

The health of each system component is represented on the left side of the card by a distribution of the health score, a trend indicator, and a percentage of successful results. The right side of the card provides a listing of devices running the services.

View Devices with the Most Issues

It is useful to know which devices are experiencing the most issues with their system services in general, as this can help focus troubleshooting efforts toward selected devices versus the service itself. To view devices with the most issues, select Most Failures from the filter above the table on the right.

/images/download/thumbnails/10464197/image2019-2-19-15_41_13.png

Devices with the highest number of issues are listed at the top. Scroll down to view those with fewer issues. To further investigate the critical devices, open the Event cards and filter on the indicated switches.
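The Most Failures ordering can be reproduced outside the GUI with a simple tally. The sketch below is illustrative only: it assumes you already have one (hostname, service) pair per failed validation, and the function name most_failures and the sample hostnames are hypothetical.

```python
from collections import Counter


def most_failures(failures):
    """Return (hostname, failure count) pairs, highest count first.

    failures: iterable of (hostname, service) pairs, one per failed validation.
    """
    counts = Counter(host for host, _service in failures)
    return counts.most_common()


sample = [
    ("leaf01", "agent"),
    ("leaf02", "license"),
    ("leaf01", "sensors"),
    ("spine01", "agent"),
    ("leaf01", "license"),
]
print(most_failures(sample)[0])  # prints: ('leaf01', 3)
```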

View Devices with Recent Issues

It is useful to know which devices are experiencing issues with their system services right now, as this can help focus troubleshooting efforts toward selected devices versus the service itself. To view devices with recent issues, select Recent Failures from the filter above the table on the right. Devices with the most recent issues are listed at the top. Scroll down to view those with older issues. To further investigate the critical devices, open the Switch card or the Event cards and filter on the indicated switches.
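The Recent Failures filter, which the card summary describes as listing the most recent validation failures at the top, can be sketched as a sort on each device's latest failure time. This is an illustration under assumptions: the input is a list of (hostname, ISO 8601 timestamp) pairs, and the function name recent_failures and the sample data are hypothetical.

```python
from datetime import datetime


def recent_failures(failures):
    """Return hostnames ordered by their most recent failure, newest first.

    failures: iterable of (hostname, ISO 8601 timestamp) pairs.
    """
    latest = {}
    for host, stamp in failures:
        when = datetime.fromisoformat(stamp)
        # Keep only the newest failure time seen for each device.
        if host not in latest or when > latest[host]:
            latest[host] = when
    return sorted(latest, key=latest.get, reverse=True)


sample_times = [
    ("leaf01", "2019-02-19T15:10:00"),
    ("leaf02", "2019-02-19T15:41:13"),
    ("leaf01", "2019-02-19T15:30:00"),
    ("spine01", "2019-02-19T14:05:00"),
]
print(recent_failures(sample_times))  # prints: ['leaf02', 'leaf01', 'spine01']
```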

Filter Results by System Service

You can focus the data in the table on the right by deselecting one or more services. Click the checkbox next to the service you want to remove from the data. In this example, we have unchecked Licenses.

/images/download/attachments/10464197/image2019-4-9-11_3_32.png

This grays out the associated chart and temporarily removes the data related to that service from the table.
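The effect of unchecking a service can be pictured as a filter over the table rows. The following sketch is only an analogy for what the GUI does: the row layout (a dictionary with hostname and service keys), the function name filter_by_service, and the sample rows are all hypothetical.

```python
def filter_by_service(rows, hidden_services):
    """Drop table rows that belong to any unchecked (hidden) service."""
    hidden = {name.lower() for name in hidden_services}
    return [row for row in rows if row["service"].lower() not in hidden]


rows = [
    {"hostname": "leaf01", "service": "Agents"},
    {"hostname": "leaf01", "service": "Licenses"},
    {"hostname": "leaf02", "service": "Sensors"},
]
# Unchecking Licenses leaves only the Agents and Sensors rows.
visible = filter_by_service(rows, ["Licenses"])
print([row["service"] for row in visible])  # prints: ['Agents', 'Sensors']
```

Because the original rows are left untouched, checking the service again restores its data, which matches the temporary removal described above.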

View Network Services Health

The network services health is a calculated average of the individual network protocol and services health metrics. In all cases, validation is performed on NTP. If you are running BGP, CLAG, EVPN, LNV, OSPF, or VXLAN protocols, the calculation includes these as well. You can view the overall health of network services from the medium Network Health card and information about individual services from the large Network Health card.
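To illustrate the idea that only the services you are running contribute to the average, the sketch below averages whatever per-service scores it is given. It assumes an unweighted mean and that NTP is always present, as stated above; the function name network_services_health and the sample scores are hypothetical.

```python
from statistics import mean


def network_services_health(scores: dict) -> float:
    """Average the health scores of the services that are running.

    scores: mapping of service name to health score (percentage).
    """
    if "NTP" not in scores:
        # NTP is validated in all cases, so a score for it is expected.
        raise ValueError("expected a health score for NTP")
    # Assumption: every running service carries equal weight.
    return mean(list(scores.values()))


print(network_services_health({"NTP": 100.0, "BGP": 50.0}))  # prints: 75.0
```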

To view information about each network protocol or service:

  1. Open the large Network Health card.

  2. Hover over the card and click the Network Services Health tab.

    /images/download/attachments/10464197/image2019-2-19-15_33_28.png

The health of each protocol or service is represented on the left side of the card by a distribution of the health score, a trend indicator, and a percentage of successful results. The right side of the card provides a listing of devices running the services.

If you have more services running than fit naturally into the chart area, a scroll bar appears for you to access their data.

Use the scroll bars on the table to view more columns and rows.

View Devices with the Most Issues

It is useful to know which devices are experiencing the most issues with their network services in general, as this can help focus troubleshooting efforts toward selected devices versus the protocol or service. To view devices with the most issues, open the large Network Health card. Select Most Failures from the dropdown above the table on the right.

/images/download/thumbnails/10464197/image2019-2-19-15_41_13.png

Devices with the highest number of issues are listed at the top. Scroll down to view those with fewer issues. To further investigate the critical devices, open the Event cards and filter on the indicated switches.

View Devices with Recent Issues

It is useful to know which devices are experiencing issues with their network services right now, as this can help focus troubleshooting efforts toward selected devices versus the protocol or service. To view devices with recent issues, open the large Network Health card. Select Recent Failures from the dropdown above the table on the right. Devices with the most recent issues are listed at the top. Scroll down to view those with older issues. To further investigate the critical devices, open the Switch card or the Event cards and filter on the indicated switches.

Filter Results by Network Service

You can focus the data in the table on the right by deselecting one or more services. Click the checkbox next to the service you want to remove. In this example, we have removed NTP and LNV and are in the process of removing OSPF.

/images/download/attachments/10464197/image2019-4-9-11_0_18.png

This grays out the associated charts and temporarily removes the data related to those services from the table.

View All Events

The Network Health card workflow enables you to view all of the alarms and info events in the network during the designated time period.

To view all events:

  1. Open the full screen Network Health card.

  2. Click the All Events tab in the navigation panel.

  3. Sort event data by the Time column to view events in most recent to least recent order.

/images/download/attachments/10464197/image2019-2-19-15_54_1.png

Where to go next depends on what data you see, but a few options include:

  • Sort or filter event data by another column instead, for example, severity or type.
  • Export the data for use in another analytics tool, by clicking Export and providing a name for the data file.
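Once the events are exported, a short script can summarize them. The sketch below reads CSV text and counts events per severity. It is an illustration under assumptions: the column names (Time, Source, Type, Message, Severity) mirror the All Events columns described earlier, but the header row of an actual export may differ, and the sample rows and function names are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical export contents; the real column names may differ.
SAMPLE_EXPORT = """Time,Source,Type,Message,Severity
2019-02-19 15:50:01,leaf01,bgp,example message,critical
2019-02-19 15:48:37,leaf02,ntp,example message,warning
2019-02-19 15:47:12,leaf01,clag,example message,critical
"""


def severity_counts(csv_text: str) -> Counter:
    """Count exported events per severity level."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["Severity"] for row in reader)


def events_with_severity(csv_text: str, severity: str) -> list:
    """Return the exported rows that have the given severity."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader if row["Severity"] == severity]


print(severity_counts(SAMPLE_EXPORT)["critical"])  # prints: 2
```

To use it with a real export, read the file contents into a string and pass that string in place of SAMPLE_EXPORT, adjusting the column names to match the file.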