This article discusses High Availability (HA) configurations and failover conditions for sites using a pair of physical Cato Sockets.
To improve site resiliency, Cato strongly recommends deploying each site with a pair of Sockets that operate in High Availability (HA) mode. This mode of operation ensures service continuity for the site in the event of a single Socket failure. During a failover, the Cato Cloud maintains the flow state and there is minimal impact on the end-user experience.
Supported Socket HA Sites
Cato supports Socket HA for the following environments:
- Physical Socket site
- AWS vSocket site
- Azure vSocket site
This article explains how HA works for a physical Socket site. For more about setting up Socket HA in a few clicks, see Using Sockets in a High Availability (HA) Deployment.
- For more about AWS vSocket HA, see Configuring High Availability (HA) for AWS vSockets
- For more about Azure vSocket HA, see Configuring High Availability for Azure vSockets
Socket HA sites can use two Sockets of the same Socket type: X1500, X1600, X1600 LTE, or X1700. Mixing Socket types is not supported, so a site with an X1600 and an X1700 Socket is not a valid HA site.
You can't use an X1600 Socket and X1600 LTE Socket in the same HA site.
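The pairing rule above can be sketched as a simple check. This is only an illustration: the function name and the set of model strings are assumptions made for the example, not part of any Cato API.

```python
# Hypothetical helper: models listed in this article as usable in a Socket HA site.
SUPPORTED_HA_MODELS = {"X1500", "X1600", "X1600 LTE", "X1700"}


def is_valid_ha_pair(primary_model: str, secondary_model: str) -> bool:
    """Return True only when both Sockets are the same supported model."""
    return primary_model in SUPPORTED_HA_MODELS and primary_model == secondary_model


print(is_valid_ha_pair("X1600", "X1600"))      # same model
print(is_valid_ha_pair("X1600", "X1600 LTE"))  # mixed models are rejected
```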
In a Socket HA deployment, two Cato Sockets are assigned to a site. The first Socket assigned to the site is the primary Socket, and the second one is the secondary Socket. The Sockets operate in HA Active/Standby mode. During a site’s normal operation, the primary Socket has the HA Master status, while the secondary Socket has the HA Standby status. Only the Socket with the HA Master status handles the traffic.
- The secondary (Standby) Socket continuously monitors the state (liveness) of the Master Socket by listening for the periodic keepalive messages that the primary Socket sends. The keepalive messages are sent over the designated interface with the destination set to LAN & VRRP or VRRP (see below LAN Connectivity and Socket HA).
- Once the secondary (Standby) Socket detects that the primary Socket is down, it changes its HA state to Master and starts handling the traffic. This happens after three seconds of missed HA keepalive messages.
- The secondary Socket sends a GARP message to LAN networks to speed up the Layer 2 convergence.
- When the primary Socket recovers and is restored to regular functionality, it preemptively becomes the Master and the secondary Socket returns to Standby status.
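The Active/Standby behavior described above can be modeled as a small state machine. This is a minimal sketch, not the Socket's implementation: the class, the method names, and the way time is passed in are all assumptions for illustration. Only the three-second timeout, the GARP announcement, and the preemption by the primary come from this article.

```python
from dataclasses import dataclass

# From the article: failover happens after three seconds of missed keepalives.
KEEPALIVE_TIMEOUT_S = 3.0


@dataclass
class SecondarySocket:
    """Hypothetical model of the secondary Socket's HA role."""
    role: str = "Standby"
    last_keepalive: float = 0.0
    garp_sent: int = 0

    def on_keepalive(self, now: float) -> None:
        """A keepalive arrived: the primary is alive and preempts the Master role."""
        self.last_keepalive = now
        if self.role == "Master":
            self.role = "Standby"

    def tick(self, now: float) -> None:
        """Promote to Master once keepalives have been missing for the timeout."""
        if self.role == "Standby" and now - self.last_keepalive >= KEEPALIVE_TIMEOUT_S:
            self.role = "Master"
            self.garp_sent += 1  # stands in for the GARP message sent to the LAN


socket = SecondarySocket()
socket.on_keepalive(0.0)
socket.tick(2.0)
print(socket.role)  # still Standby, only two seconds without a keepalive
socket.tick(3.0)
print(socket.role)  # Master after three seconds without a keepalive
socket.on_keepalive(10.0)
print(socket.role)  # Standby again once the primary is back
```

Treating exactly three seconds as the promotion point is a simplification made for the example.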
The following image shows the HA configuration page for X1500 Sockets in the Cato Management Application in Network > {site name} > Site Configuration > Socket:
The following diagrams show an example of an issue in the primary Socket that causes a failover to the secondary Socket. When the secondary Socket discovers that the primary Socket is down, it then changes its status to Master. The Cato Cloud transfers the traffic flows to the WAN links in the secondary Socket.
A split-brain condition is when both Sockets have the Master role at the same time. This can happen due to a LAN connectivity problem between the Sockets that creates a situation where the HA keepalive messages do not reach the secondary Socket.
You can identify a split-brain condition by checking the Socket page (shown above) in the Cato Management Application.
- The primary and secondary Sockets are both shown with the status Master (item 2).
- The Keepalive condition (in item 4) is shown as Failed, which causes the HA Status (item 3) to be shown as NOT READY.
After the LAN connectivity issue is resolved, the secondary Socket identifies that the primary Socket is the Master and the secondary Socket returns to Standby status.
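The indicators listed above can be combined into a single check. This is a sketch only: the function and its string arguments are hypothetical and simply mirror the labels shown on the Socket page. It does not call any Cato API.

```python
def is_split_brain(primary_role: str, secondary_role: str, keepalive: str) -> bool:
    """Return True when both Sockets report Master and the keepalive check has failed."""
    return primary_role == "Master" and secondary_role == "Master" and keepalive == "Failed"


print(is_split_brain("Master", "Master", "Failed"))   # split-brain indicators present
print(is_split_brain("Master", "Standby", "Failed"))  # only one Master
```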
The following process makes sure that, even during a split-brain condition, only the secondary Socket handles the traffic for the site.
- For downstream traffic (from the PoP to the site):
  - The PoP detects that the secondary Socket is now the Master.
  - The PoP sets the preferred metric for the secondary Socket tunnels.
    The downstream traffic is now only routed to the secondary Socket.
- For upstream traffic (from the site to the PoP):
  - When the secondary Socket changes the HA state from Standby to Master, it sends a GARP message to the LAN so that the ARP and MAC tables are updated to show that it is now the Master.
    The upstream traffic from the LAN is now only routed to the secondary Socket.
Both the primary and secondary Sockets establish DTLS tunnels to the same Cato Cloud PoP on each of the WAN ports. In the upstream direction, only the Master Socket sends the traffic to the PoP. In the downstream direction, the PoP uses only the Master Socket tunnels to send the traffic to the site. In case of a Socket HA failover event, the secondary Socket becomes the new Master and the PoP shifts the traffic from the failed primary Socket tunnels to the secondary Socket tunnels. The PoP maintains the flow state and the NAT state to make sure that all user applications continue to operate during and after the failover.
Below are sample physical and logical topologies for the Socket HA:
For optimal WAN connectivity, performance and HA functionality, Cato requires symmetrical (mirrored) cabling layout for both Sockets. For example, if the primary Socket port WAN1 is connected to ISP1 and port WAN2 is connected to ISP2, the secondary Socket must have the same ports connected to the same ISPs as the primary Socket.
These symmetrical topologies can include direct connections to the ISP routers or using a stack of switches.
Note: For standard HA configurations, Cato recommends that you use a symmetrical layout for both the Primary and Secondary Sockets.
When using LTE, there are scenarios where you might want to use SIM cards from different carriers to ensure better coverage, or to use a SIM card only on the secondary Socket.
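As an illustration of the symmetrical WAN cabling requirement, the following sketch compares two port-to-ISP mappings and reports the ports that differ. The function name and the dictionary layout are assumptions made for the example, and the LTE exceptions mentioned above are not modeled.

```python
def asymmetric_ports(primary: dict, secondary: dict) -> list:
    """Return the sorted port names whose uplink differs between the two Sockets."""
    ports = sorted(set(primary) | set(secondary))
    return [port for port in ports if primary.get(port) != secondary.get(port)]


primary_cabling = {"WAN1": "ISP1", "WAN2": "ISP2"}
mirrored = {"WAN1": "ISP1", "WAN2": "ISP2"}
swapped = {"WAN1": "ISP2", "WAN2": "ISP1"}

print(asymmetric_ports(primary_cabling, mirrored))  # [] means the layout is mirrored
print(asymmetric_ports(primary_cabling, swapped))   # both ports are flagged
```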
Cato requires that both the primary and secondary Sockets have a symmetrical (mirrored) cabling layout for the LAN connectivity. For example, LAN port 1 for both the primary and secondary Sockets is connected to the LAN switch (or LAN ports 1 and 2 for configurations with multiple LAN ports).
This section discusses the following LAN connectivity options for Socket HA:
- Single LAN port
- Multiple LAN ports
- LAN link aggregation (recommended option)
- Dedicated port for HA keepalive messages
Some of these options require additional configurations of the site in the Cato Management Application. For example, the LAN port is configured for LAN & VRRP or VRRP.
There are configurations that use a single LAN port to connect the primary and secondary Sockets to the LAN switch. With this configuration, the same port number must be used on both Sockets. The user traffic and the HA keepalive messages run over a single link. This topology doesn’t provide LAN link redundancy.
The following diagram shows a sample Socket HA topology with a single LAN port on each Socket connected to a switch:
This section discusses when both the primary and secondary Sockets are connected to the LAN switches via two or more independent LAN ports. With this configuration, the same ports must be used on both Sockets for the LAN connectivity.
By default, the LAN port with the lowest number is used both for the HA keepalive traffic and for the user traffic. The remaining LAN ports carry only the user traffic.
You can choose any LAN port for the HA keepalive traffic by changing the port Destination from LAN to LAN & VRRP. The following screenshot shows port 3 for LAN user traffic and port 4 for the HA keepalive traffic and for the user traffic.
For more about changing the LAN port for HA keepalive traffic, see Using Sockets in a High Availability (HA) Deployment. This topology doesn’t provide LAN link redundancy.
Socket HA failover (where the secondary Socket becomes the Master) only occurs when both of these conditions are met:
- The secondary Socket stops receiving the HA keepalive messages from the primary Socket for a period of three seconds.
- The LAN & VRRP port on the secondary Socket is in the CONNECTED state.
If the secondary Socket LAN port is DISCONNECTED, the Socket does not become the Master, which avoids a possible split-brain condition.
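The two failover conditions above amount to a logical AND. A minimal sketch, with hypothetical names and with exactly three seconds treated as the threshold for simplicity:

```python
# From the article: keepalives must be missing for three seconds.
KEEPALIVE_TIMEOUT_S = 3.0


def should_take_over(seconds_since_keepalive: float, lan_vrrp_port_connected: bool) -> bool:
    """Return True when the secondary Socket should become the Master."""
    keepalives_missing = seconds_since_keepalive >= KEEPALIVE_TIMEOUT_S
    return keepalives_missing and lan_vrrp_port_connected


print(should_take_over(5.0, True))   # keepalives lost and port connected
print(should_take_over(5.0, False))  # port disconnected, so no failover
```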
In this configuration, both the primary and secondary Sockets are connected to the LAN switches via two or more LAN ports bundled in a link aggregation (LAG). The same ports must be used on both Sockets for the LAN connectivity. This topology provides LAN link redundancy both for the user traffic and for the HA keepalive messages. If one of the LAG member ports fails, the other member ports continue to carry the user traffic and the HA keepalive traffic.
This topology provides both link resiliency and Socket resiliency and is considered a best practice.
To learn more about LAN LAG, see Configuring Link Aggregation for a Socket.
The following diagram is an example of Socket HA LAN connectivity topology using a LAN LAG with a stack of switches:
In this configuration, you isolate the HA keepalive traffic from the LAN traffic. You can allocate a single port (a LAN, WAN, or USB port) only for the HA keepalive traffic while using one or more of the remaining LAN ports for the LAN traffic.
To set the dedicated LAN port for the HA keepalive traffic, set the Destination for the port to VRRP. Then set the HA link between sockets option to Direct or Via Switch.
These are the dedicated port configurations:
- Direct (back-to-back cable between the Sockets): With this configuration, if the secondary Socket stops receiving the HA keepalive messages, it becomes the Master regardless of the VRRP port state.
- Via Switch: With this configuration, the VRRP port on both Sockets is connected to a switch. The failover behavior depends on the secondary Socket VRRP port state:
  - When the secondary Socket port state is Connected but it doesn't receive keepalive messages, the secondary Socket becomes the Master.
    The secondary Socket assumes that the state is caused by primary Socket failure.
  - When the secondary Socket port state is Disconnected, the secondary Socket does not become the Master, because it assumes that there is a local problem between itself and the switch.
    The secondary Socket assumes that the primary Socket is operating correctly, and it does not become the Master to avoid a possible split-brain condition.
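The dedicated port behavior above can be summarized as a decision function. This is an illustrative sketch only: the enum and function names are assumptions, and only the Direct and Via Switch labels come from the HA link setting described in this article.

```python
from enum import Enum


class HaLink(Enum):
    """Hypothetical stand-in for the 'HA link between sockets' setting."""
    DIRECT = "Direct"
    VIA_SWITCH = "Via Switch"


def secondary_becomes_master(link: HaLink, keepalives_missing: bool, vrrp_port_connected: bool) -> bool:
    """Decide whether the secondary Socket takes the Master role."""
    if not keepalives_missing:
        return False  # the primary is alive, nothing to do
    if link is HaLink.DIRECT:
        return True  # back-to-back cable: the port state is ignored
    # Via Switch: a disconnected port is treated as a local problem, so no failover
    return vrrp_port_connected


print(secondary_becomes_master(HaLink.DIRECT, True, False))      # True
print(secondary_becomes_master(HaLink.VIA_SWITCH, True, False))  # False
```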
These are diagrams of the direct and via switch dedicated port configurations:
This section describes the conditions that cause a failover from the primary Socket to the secondary Socket.
This failover scenario is caused by a failure of the primary Socket. The Socket is considered to be down for one of these reasons:
- General Socket failure or a loss of power
- Loss of LAN connectivity (no keepalive for more than three seconds)
- No Internet connectivity for more than ten seconds
There is also a failover scenario that is caused when the secondary Socket does not receive keepalive messages from the primary Socket for a period of three seconds.
When the secondary Socket discovers that the primary Socket is down, it then changes its status to Master. The Cato Cloud transfers the traffic flows to the WAN links in the secondary Socket. The following diagram shows this scenario.
The Sockets use a probing mechanism to determine the Internet connectivity status. If the primary Socket determines that Internet connectivity is down on all the Internet links (Cato links) for more than 10 seconds, then it stops transmitting the HA keepalive messages. This causes a failover to the secondary Socket.
Note: The primary Socket can have Internet connectivity while all of its DTLS tunnels are in the DISCONNECTED state. Because the Sockets have Internet and WAN recovery mechanisms, this situation does not trigger a failover to the secondary Socket. These recovery mechanisms allow the Socket to reconnect to a different PoP in the Cato Cloud.
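The Internet connectivity rule can be sketched as follows. The function name and the input format (seconds of downtime per link, 0 for a link that is up) are assumptions for illustration, and the probing mechanism itself is not modeled. Tunnel state is deliberately not an input, which reflects the note above.

```python
from typing import Mapping

# From the article: all Internet links must be down for more than 10 seconds.
INTERNET_DOWN_THRESHOLD_S = 10.0


def primary_sends_keepalives(link_down_seconds: Mapping[str, float]) -> bool:
    """Return False once every Internet link has been down longer than the threshold."""
    if not link_down_seconds:
        return True  # no links to evaluate in this simplified model
    all_down = all(down > INTERNET_DOWN_THRESHOLD_S for down in link_down_seconds.values())
    return not all_down


print(primary_sends_keepalives({"WAN1": 12.0, "WAN2": 0.0}))   # one link is still up
print(primary_sends_keepalives({"WAN1": 12.0, "WAN2": 15.0}))  # keepalives stop, failover follows
```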
This section discusses different pages in the Cato Management Application that you can use to monitor the status and events for Socket HA.
The following pages in the Cato Management Application show the status of the Socket HA for a site.

| Page Name | Description | Path |
|---|---|---|
| Sites | Shows all the sites in the account. The HA Status column shows the status of Socket HA for each site. | Network > Sites |
| Socket | Shows the details of Socket HA for a site. See above Understanding Socket High Availability and Failover. | Network > <site name> > Site Configuration > Socket |
| Network Analytics | Shows network data for a site and the HA Status. | Network > <site name> > Site Monitoring > Network Analytics |
A Socket Fail-Over event is generated only when a failover occurs and the secondary Socket remains active for more than 35 seconds. For example, if the primary Socket upgrades to a new Socket version, and the upgrade process takes 20 seconds, then a Socket Fail-Over event is NOT generated because the secondary Socket was only active for 20 seconds.
You can see the event in the Cato Management Application in the Monitoring > Events page. Here is a sample event showing a failover from the primary to the secondary Socket.
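The event threshold can be expressed as a one-line check. The function name is hypothetical. The 35-second threshold and the 20-second upgrade example come from this article.

```python
# From the article: the secondary Socket must be active for more than 35 seconds.
FAILOVER_EVENT_THRESHOLD_S = 35.0


def generates_failover_event(secondary_active_seconds: float) -> bool:
    """Return True when a failover lasts long enough to produce a Socket Fail-Over event."""
    return secondary_active_seconds > FAILOVER_EVENT_THRESHOLD_S


print(generates_failover_event(20.0))  # short upgrade window, no event
print(generates_failover_event(60.0))  # longer outage, event generated
```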
You can use the Link Health Rule page (Network > Link Health Rules) to create a Connectivity Health Rule to send email notifications for the Socket HA failover events. The email notifications are sent to all recipients in the Mailing List that you configure in the Cato Management Application. The Mailing List can include email addresses that are not defined for users and admins in the Cato Management Application.
This is a sample Connectivity Health Rule for Socket failover:
For more about configuring a Connectivity Health Rule, see Working with Link Health Rules.
9 comments
How do we force failover from CC2 portal?
Yamin,
You can't force failover from the Cato Management Application. If you physically remove (or disable) the LAN cable that is connected to a Socket, it will failover to the other Socket.
Thanks for your comment!
Yaakov
>If there is no connection between the Primary and the Secondary Socket on the LAN ports, the Secondary Socket will not receive the VRRP messages and will default to becoming the Master.
In this case is Primary socket still master?
How does one force the Sockets to update their "Version"? My Primary socket shows version "13.0.11291" and my Secondary socket shows version "12.0.7955". How do I get the secondary socket to update? I have rebooted it several times hoping that it would update automatically, but i has not!!
Billy,
Thanks for the comment. You can't force the Socket to upgrade to a new version.
Please contact Support and they can help you to upgrade the Socket version.
Yaakov
Takeshitah,
We completely updated this article, and I think this section contains the information that you need, LAN Connectivity and Socket HA.
Thanks,
Yaakov
Added section that describes, Site Traffic During Split-Brain Condition
Previously there was a need to insert a router between the ISP and Socket when configuring a HA configuration. Has that restriction been lifted?
According to this article it seems like it is possible to directly connect the Socket to the ISP in a HA configuration. Can you please confirm?
As per above post from Khai, is it possible to deploy CATO sockets in a HA pair where they are also the ISP CE routers?
We want to use CATO to replace traditional MPLS WAN and that includes the CE routers. If we have to retain traditional CE routers just so we can deploy CATO sockets in a HA pair that is a huge limitation and expense.
What is the reason CATO don't officially support this topology? There are no details in the KB article.