Link Quality SLA Issue - Network Playbook

This playbook describes how to resolve link quality SLA issues for a site.

Overview

Cato constantly monitors the link quality SLA KPIs between a site and the PoP. You can define Quality Health Rules to set the link quality SLA. If there is a link quality issue, a Last-Mile Quality event with the Alert action is generated. When the issue is resolved, another event with the Clear Alert action is generated.

Each Link Quality rule can monitor one or more of these thresholds:

  • Packet Loss

  • Jitter

  • Latency

  • Congestion
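
As an illustration of how a rule with several thresholds can be read, the following minimal sketch checks one measurement sample against a set of limits. It is not Cato's implementation or API. The function name, the metric keys, and the numeric limits are all hypothetical, and congestion is left out because the source does not say how it is measured.

```python
from typing import Dict, List

def breached_thresholds(rule: Dict[str, float], sample: Dict[str, float]) -> List[str]:
    """Return the metrics in `sample` that exceed the limits defined in `rule`.

    Both dictionaries use hypothetical metric names as keys. A metric that
    the rule does not list is simply not monitored.
    """
    return [
        metric
        for metric, limit in rule.items()
        if metric in sample and sample[metric] > limit
    ]

# Hypothetical rule and sample values, for illustration only.
rule = {"packet_loss_pct": 2.0, "jitter_ms": 30.0, "latency_ms": 150.0}
sample = {"packet_loss_pct": 3.5, "jitter_ms": 12.0, "latency_ms": 180.0}

print(breached_thresholds(rule, sample))  # ['packet_loss_pct', 'latency_ms']
```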

There are different ways to discover that a Link Quality issue has occurred:

  • Go to the Stories Workbench page and use the Network XDR preset to find the Link quality SLA stories.

    The story provides information about the current status, incident timeline, and more.

  • Connectivity alert, sent as an email notification to the Admin group.

  • Connectivity event, with the Last-Mile Quality sub-type and the action Alert.

Step 1 - Identifying the Link Quality Issue

This section discusses different Cato tools that you can use to identify the root cause of your link quality SLA issue. Often, an issue with one of the metrics leads to issues with other metrics. For example, what starts as a congestion issue can lead to problems with jitter and packet loss.

Reviewing Site Network Analytics

Use the Network Analytics page for a site to determine the cause of the link quality issue. Network Analytics can help you determine which link is experiencing quality issues, whether the issue is on the upstream or downstream connection, or whether the issue is with the ISP.

By looking at the different graphs, you can identify a trend within your system. For example, when you see a sudden spike in the Distance graph, does it coincide with an increase in the Packet Loss graph? Do both coincide with an increase in the throughput graphs? Is there also a sudden jump in the Flows or Hosts graphs?

Based on the trend in the graphs, use the Events page to look for SLA Quality events that occurred at the same time as the changes shown in Network Analytics. Matching the events to the graphs can help you determine the root cause of the issue.

Suggested steps:

  1. Review the Network Analytics widgets for the site, and identify a link quality trend that appears across different widgets in the same time frame.

  2. In the Events page, search for related events with the Link Health field.

  3. Review the event data.
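
If you have the metric samples outside the Cato Management Application, the manual comparison of graphs can be approximated with a short script. The sketch below is an assumption-laden illustration and not a Cato tool. It assumes that each metric is a list of samples taken at the same timestamps. The metric names, the sample values, and the spike rule (more than k standard deviations above the mean) are all hypothetical choices.

```python
from statistics import mean, pstdev
from typing import Dict, List, Set

def spike_indices(values: List[float], k: float = 2.0) -> Set[int]:
    """Return the indices whose value is more than k standard deviations above the mean."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return set()  # a flat series has no spikes
    return {i for i, v in enumerate(values) if v > mu + k * sigma}

def coinciding_spikes(series: Dict[str, List[float]], k: float = 2.0) -> Dict[int, List[str]]:
    """Map each sample index to the metrics that spike there.

    Only indices where two or more metrics spike together are kept.
    """
    hits: Dict[int, List[str]] = {}
    for name, values in series.items():
        for i in spike_indices(values, k):
            hits.setdefault(i, []).append(name)
    return {i: names for i, names in sorted(hits.items()) if len(names) >= 2}

# Hypothetical samples: latency and packet loss spike together at index 6,
# while throughput spikes alone at index 8.
series = {
    "latency_ms": [20, 21, 19, 20, 22, 20, 95, 21, 20, 19],
    "packet_loss_pct": [0.1, 0.0, 0.2, 0.1, 0.1, 0.0, 4.0, 0.2, 0.1, 0.1],
    "throughput_mbps": [50, 52, 49, 51, 50, 53, 51, 50, 90, 52],
}

print(coinciding_spikes(series))  # {6: ['latency_ms', 'packet_loss_pct']}
```

An index reported by more than one metric is a good time window to search for in the Events page.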

Step 2 - Remediation

This section provides information about steps you can take to remedy your issues.

Reviewing Recent Configuration Changes

Review the changes listed in the Audit Trail page of the Cato Management Application, and check whether a recent configuration change is related to this issue.

Troubleshooting the Various Issues

  • If you are experiencing packet loss, see How to Troubleshoot Socket Site Packet Loss

  • If you are experiencing congestion:

    • Check the application analytics to determine if a specific host, user, or application is causing the congestion

    • Tweak your QoS settings to prioritize your bandwidth. For more information, see Configuring Bandwidth Management Profiles

    • Consider purchasing more bandwidth

  • If you are experiencing jitter or latency:

    • If the issue persists, consider changing your Configuration SLA settings. If the site is reconnecting to a different PoP too often, set your Configuration SLA settings to be more permissive. If the site is not reconnecting to a different PoP despite continued high latency, set your Configuration SLA settings to be more aggressive.

    • By default, Cato ensures that you are connected to the closest PoP for best performance. However, it is sometimes necessary to force the site to manually reconnect to a preferred PoP.

  • Check the status of your Cato PoP to ensure that there are no issues

Step 3 - Adjusting Quality Health Rules

If any of the issues persist after you apply the Remediation steps, consider adjusting your Quality Health Rules to align with the needs of your organization. For example, you can create an exception in the rules for a specific site to allow for greater packet-loss tolerance, or perhaps use the local routing feature on a Socket so traffic on the local networks does not go out to the PoP, but rather is handled by the Socket.

Verifying that the Link Quality is Restored

After the link quality for a site is restored to an acceptable SLA, a Last-Mile Quality event is generated with the Action Alert Cleared. To show the event in the Events page, manually configure the event filter for Action IS Alert Cleared and Source Site IS the site name.
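
The same filter logic can be applied to events that you have already exported, for example as a list of records. The sketch below is only an illustration of the filter described above. The dictionary keys (`sub_type`, `action`, `source_site`) and the site names are hypothetical and do not reflect the actual Cato event schema.

```python
from typing import Dict, List

def cleared_events(events: List[Dict[str, str]], site: str) -> List[Dict[str, str]]:
    """Return the Last-Mile Quality events with the action Alert Cleared for one site."""
    return [
        event
        for event in events
        if event.get("sub_type") == "Last-Mile Quality"
        and event.get("action") == "Alert Cleared"
        and event.get("source_site") == site
    ]

# Hypothetical exported events, for illustration only.
events = [
    {"sub_type": "Last-Mile Quality", "action": "Alert", "source_site": "Branch-A"},
    {"sub_type": "Last-Mile Quality", "action": "Alert Cleared", "source_site": "Branch-A"},
    {"sub_type": "Last-Mile Quality", "action": "Alert Cleared", "source_site": "Branch-B"},
]

print(len(cleared_events(events, "Branch-A")))  # 1
```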
