Overview of WAN Recovery
Cato Networks’ WAN Recovery feature is one of multiple recovery options that provide resiliency if your Socket sites can't communicate using the Cato Cloud. If sites can't connect to the Cato Cloud, WAN Recovery uses VPN tunnels over the Internet to preserve the connectivity for the WAN traffic between your sites.
How Does WAN Recovery Work?
This section describes how the WAN Recovery feature works, how the Sockets send the traffic, and when the Sockets activate WAN Recovery.
Explaining WAN Recovery Topology
WAN Recovery is based on a full mesh topology: by default, each Socket creates a direct DTLS tunnel to every other Socket. The Sockets regularly send keepalive messages over each tunnel so that it stays open, which reduces the recovery time. This topology provides maximum resiliency for the Socket sites in your account.
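To get a feel for how a full mesh grows, the following minimal sketch lists the site pairs that need a direct tunnel. The site names and the `full_mesh_pairs` helper are hypothetical and for illustration only. The sketch counts one entry per pair of sites, while the real number of tunnels can be higher when a Socket uses several WAN interfaces.

```python
from itertools import combinations

def full_mesh_pairs(sites):
    """Return every unordered pair of sites; each pair needs a direct tunnel."""
    return list(combinations(sorted(sites), 2))

# Hypothetical account with four Socket sites.
sites = ["A", "B", "C", "D"]
pairs = full_mesh_pairs(sites)

for left, right in pairs:
    print(f"{left} <-> {right}")

# n sites give n * (n - 1) / 2 site pairs.
print(len(pairs))  # 6
```

Because the number of pairs grows quadratically with the number of sites, keeping the tunnels open in advance matters more in larger accounts.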
The following diagram shows an example where one Socket is disconnected from the Cato Cloud. WAN Recovery is enabled for that site to provide a direct connection between the two Sockets:
WAN Recovery Requirements and Limitations
WAN Recovery must be enabled on every site that needs connectivity for off-cloud traffic. For example, if WAN Recovery is enabled on sites A and B but not on site C, then during recovery site C can't communicate with the other sites.
Traffic that is passed over the WAN Recovery transport isn’t transferred through the Cato Cloud. Therefore, this traffic is not protected by Cato’s security services, such as firewalls, MDR, and IPS.
This feature is supported for Socket version 5.2 or later and is enabled by default. You can use the Cato Management Application to disable the WAN Recovery either for a specific site or for the entire account.
Note: Due to regulatory reasons, WAN Recovery is not supported in China.
WAN Recovery in Active/Active or Active/Passive
For all deployments, when WAN Recovery is enabled, each Socket establishes secure DTLS tunnels to the remote Socket site on all WAN interfaces that are enabled for off-cloud traffic. For active/active deployments, the Socket randomly selects one of the active links for WAN recovery. For active/passive deployments, the Socket uses the active link.
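The link selection described above can be sketched as follows. This is an illustrative model only, not the Socket's implementation: the `WanLink` type, its fields, and the `select_recovery_link` function are assumptions made for the example.

```python
import random
from typing import NamedTuple, Optional

class WanLink(NamedTuple):
    # Hypothetical model of a Socket WAN interface.
    name: str
    active: bool             # the link is currently an active link
    off_cloud_enabled: bool  # the interface is enabled for off-cloud traffic

def select_recovery_link(links, mode, rng=random) -> Optional[WanLink]:
    """Pick the link that carries WAN Recovery traffic for a deployment mode."""
    candidates = [l for l in links if l.active and l.off_cloud_enabled]
    if not candidates:
        return None
    if mode == "active/active":
        # Several active links: one of them is selected at random.
        return rng.choice(candidates)
    if mode == "active/passive":
        # Only the active link is a candidate.
        return candidates[0]
    raise ValueError(f"unknown deployment mode: {mode}")

links = [WanLink("wan1", True, True), WanLink("wan2", False, True)]
print(select_recovery_link(links, "active/passive").name)
```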
Using NAT Punching to Connect Sockets
WAN Recovery relies on NAT punching to establish the WAN connectivity between your sites. When a Socket connects to the Cato Cloud, the PoP informs the Socket about all the other endpoints, and the Socket opens a DTLS tunnel to each one of them. The Socket uses the NAT punching technique to establish a direct connection with the other Sockets.
Note: The negotiation of the NAT punching starts over the Cato Cloud. Therefore, the Sockets must be connected to the Cato Cloud to allow the NAT punching.
The NAT punching technique works for each pair of Sockets in the following way:
- The PoP selects one of the Sockets (Socket 1) as the initiator of the direct connection based on the site ID: the site with the highest ID value is the initiator
- The initiator Socket sends a request to the Cato Cloud for the following details: IP address and port number
- The Cato PoP sends Socket 1 its source IP address and port
- Socket 1 sends its IP address and port to Socket 2 over the Cato tunnel
- Socket 2 sends a request to the Cato Cloud for the following details: IP address and port
- The Cato PoP sends Socket 2 its source IP address and port
- Socket 2 sends its IP address and port to Socket 1 over the Cato tunnel
- Socket 1 sends 32 packets to Socket 2 in the range of the source port, each packet with a different port number
- Socket 2 sends 32 packets to Socket 1 in the range of the source port, each packet with a different port number
- Once the correct port is found, the Sockets open a DTLS tunnel with the source IP address and the port number
- From then on, the Sockets send keepalive messages every 15 seconds to keep the connection open
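The steps above can be sketched in a few lines. Only the initiator rule (highest site ID), the 32 probes, and the 15-second keepalive interval come from the description. The exact port range is not specified, so the sketch assumes 32 consecutive ports starting at the reported port. All function names are hypothetical, and the reachability check is a stand-in for sending real packets.

```python
PROBE_COUNT = 32          # each Socket sends 32 packets
KEEPALIVE_SECONDS = 15    # keepalive interval once the tunnel is open
MAX_PORT = 65535

def choose_initiator(site_id_a, site_id_b):
    """The site with the highest ID value initiates the connection."""
    return max(site_id_a, site_id_b)

def candidate_ports(reported_port, count=PROBE_COUNT):
    """ASSUMPTION: probe consecutive ports starting at the reported port.

    The window is shifted down when it would run past the last valid port.
    """
    start = min(reported_port, MAX_PORT - count + 1)
    return list(range(start, start + count))

def find_open_port(ports, is_reachable):
    """Return the first probed port that answers, or None if none does."""
    for port in ports:
        if is_reachable(port):
            return port
    return None

# Simulation: the peer reported port 4444, but its NAT device mapped 4450.
ports = candidate_ports(4444)
open_port = find_open_port(ports, lambda port: port == 4450)
print(choose_initiator(17, 42), len(ports), open_port)
```

If no probed port answers, a real implementation would have to retry or give up. The description above does not say which, so the sketch simply returns `None`.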
The following diagram shows the flow to establish a direct connection between two Sockets for WAN Recovery:
- The PoP selects Socket1 as the connection initiator and sends the IP address and port number to Socket1 (Step #1)
- Socket1 sends a notification to Socket2 with its external IP address and port number. For example: IP address 188.8.131.52 and port number 4444 (Step #2)
- Socket2 performs steps #1 and #2 in the other direction
- Each Socket sends 32 packets, each with a different port number, to the other Socket
- Socket1 opens a tunnel with source IP address: 10.10.10.22 and port number: 4444 (Step #3)
- When Socket2 connects with Socket1, Router1 adds the NAT entry to its routing table
Minimizing Reconnection Time with NAT Punching
After NAT punching succeeds, the Socket saves the resulting NAT data. If the Socket restarts, it can use that data to immediately reconnect to the other Sockets, which significantly reduces the reconnection time. For Sockets that are behind a network firewall or a router, a restart of the firewall or router changes the NAT entries. In that case, the saved NAT data is no longer relevant, and the Sockets must perform the NAT punching process again.
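The reuse of saved NAT data follows a familiar cache-then-fallback pattern, sketched below. The `reconnect` function, the cache layout, and the callbacks are hypothetical, and the addresses come from the ranges reserved for documentation.

```python
def reconnect(peer, nat_cache, try_connect, punch):
    """Reconnect to a peer, preferring saved NAT data over a new punch.

    nat_cache maps a peer name to a saved (ip, port) tuple.
    try_connect((ip, port)) returns True if the saved mapping still works.
    punch(peer) runs the full NAT punching process and returns (ip, port).
    """
    saved = nat_cache.get(peer)
    if saved is not None and try_connect(saved):
        return saved, "reused saved NAT data"
    # Saved data is missing or stale, for example after a router restart.
    fresh = punch(peer)
    nat_cache[peer] = fresh
    return fresh, "performed NAT punching"

cache = {"site-b": ("192.0.2.10", 4444)}

# Case 1: the saved mapping is still valid after a Socket restart.
print(reconnect("site-b", cache, lambda addr: True,
                lambda peer: ("192.0.2.10", 5555)))

# Case 2: the router restarted, so the saved mapping no longer works.
print(reconnect("site-b", cache, lambda addr: False,
                lambda peer: ("192.0.2.10", 5555)))
```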
Recovering WAN Traffic
The Socket keeps an open tunnel for WAN Recovery, so if it loses connectivity with the Cato Cloud, the Socket recovers the connections with the other sites and minimizes the disconnection time. The Socket then immediately starts sending the WAN traffic over the WAN Recovery link.
For accounts that enable recovery via Alt. WAN, if the Socket disconnects from the Cato Cloud, the Alt. WAN link has a higher priority than WAN Recovery. Therefore, the Socket first moves the traffic to the Alt. WAN link. If the Alt. WAN link is unavailable, the Socket then moves the WAN traffic to the WAN Recovery link. Generally, WAN Recovery has the lowest priority as a transport option, and it’s only used when the other transport options are unavailable.
Once connectivity to the Cato Cloud is restored, recovery ends and the traffic is sent over the Cato Cloud.
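The transport priority described in this section can be summarized as a simple ordered check. This is a sketch of the decision order only: the function name and boolean inputs are assumptions, and it models just the recovery scenario described here, not every routing configuration.

```python
def select_wan_transport(cloud_up, alt_wan_up, recovery_up):
    """Return the transport for WAN traffic, or None if nothing is available.

    Order: Cato Cloud first, then Alt. WAN, then WAN Recovery (lowest priority).
    """
    if cloud_up:
        return "Cato Cloud"
    if alt_wan_up:
        return "Alt. WAN"
    if recovery_up:
        return "WAN Recovery"
    return None

# Cloud connectivity is lost and no Alt. WAN link is available.
print(select_wan_transport(cloud_up=False, alt_wan_up=False, recovery_up=True))

# Cloud connectivity is restored, so recovery ends.
print(select_wan_transport(cloud_up=True, alt_wan_up=False, recovery_up=True))
```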
Analyzing WAN Recovery with Event Discovery
The Cato Management Application generates the following events for WAN recovery:
- Off-Cloud Recovery Activated – this event is generated when the Socket starts to send the WAN traffic over the WAN Recovery transport.
- Off-Cloud Recovery Stopped – this event is generated when the connection to the Cato Cloud is restored and the Socket stops sending WAN traffic over the WAN Recovery transport.
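Pairing the two events gives the length of each recovery period. The sketch below assumes that the events were already exported as simple (timestamp, site, event name) tuples sorted by time. That record layout and the `recovery_windows` helper are hypothetical, and only the two event names come from the list above.

```python
from datetime import datetime

ACTIVATED = "Off-Cloud Recovery Activated"
STOPPED = "Off-Cloud Recovery Stopped"

def recovery_windows(events):
    """Pair Activated/Stopped events per site into (site, start, end) tuples.

    events: iterable of (timestamp, site, event_name), sorted by timestamp.
    """
    open_since = {}
    windows = []
    for timestamp, site, name in events:
        if name == ACTIVATED:
            # Keep the earliest activation if the event is repeated.
            open_since.setdefault(site, timestamp)
        elif name == STOPPED and site in open_since:
            windows.append((site, open_since.pop(site), timestamp))
    return windows

# Hypothetical sample data.
events = [
    (datetime(2024, 1, 1, 10, 0, 0), "Site-A", ACTIVATED),
    (datetime(2024, 1, 1, 10, 5, 0), "Site-A", STOPPED),
]
for site, start, end in recovery_windows(events):
    print(site, (end - start).total_seconds(), "seconds off-cloud")
```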