Socket Site Resiliency with WAN Recovery

This article discusses the WAN Recovery feature for Socket sites that provides resiliency in the very unlikely circumstance that there is a connectivity issue with the Cato Cloud.

Overview of WAN Recovery

The WAN Recovery feature is one of multiple recovery options that provide resiliency if your Socket sites can't communicate using the Cato Cloud. WAN Recovery uses VPN tunnels between Socket sites over the Internet to preserve the connectivity for the WAN traffic between your sites if there is a connectivity issue with the Cato Cloud.

How Does WAN Recovery Work?

WAN Recovery is based on a full mesh topology and is enabled by default for all Socket sites. Each Socket creates a direct DTLS tunnel to every other Socket over the public Internet. The Sockets regularly send keep-alive messages over each tunnel so that it stays open, which reduces the recovery time. This topology provides maximum resiliency for the Socket sites in your account.
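
To give a feel for how a full mesh scales, the following minimal sketch lists the site pairs that each need a direct tunnel. The function name is hypothetical and is not part of any Cato tooling. It assumes one tunnel per pair of sites and ignores sites with multiple WAN links.

```python
from itertools import combinations

def full_mesh_tunnels(sites):
    """Return every unordered pair of sites, one direct tunnel per pair."""
    return list(combinations(sorted(sites), 2))

pairs = full_mesh_tunnels(["HQ", "DC", "Branch-1", "Branch-2"])
print(len(pairs))  # 4 sites -> 6 site pairs
```

For n sites the number of pairs is n(n-1)/2, so the tunnel count grows quickly as sites are added.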

The following diagram shows an example where one Socket is disconnected from the Cato Cloud. WAN Recovery is enabled for that site to provide a direct connection between the two Sockets:

blobid0.png

Best Practices for WAN Recovery

To ensure the smoothest transition to WAN Recovery, use a static IP address for the site and define the Public IP and Static Port settings for the Socket interface. These settings make it easier to establish the off-cloud tunnels between the sites.

For accounts where it is difficult to configure the static IP settings for all the Sockets, we recommend that you use static IP settings for a few key sites, such as data centers, that act as hubs for WAN Recovery. The IP addresses of the hub sites are sent to the PoPs and propagated to the other Sockets in your account that are configured for WAN Recovery.

Recovering WAN Traffic

The Socket keeps an open tunnel for WAN Recovery. If the Socket loses connectivity with the Cato Cloud, it recovers the connections with the other sites over that tunnel, which minimizes the disconnection time. The Socket then immediately starts sending the WAN traffic over the WAN Recovery link.

You can use the Cato Management Application to disable the WAN Recovery either for a specific site or for the entire account. For more information, see Working with Advanced Configuration for the Account.

Once connectivity to the Cato Cloud is restored, recovery ends and the traffic is sent over the Cato Cloud.

Reviewing WAN Recovery Status for a Site

The Socket page for a site shows the Off Cloud Status for the WAN links. When the status is Enabled, the links are ready for WAN Recovery.

off_cloud_status.png

To review the WAN Recovery status for a site:

  1. From the navigation menu, select Network > Sites, and select the site.

  2. From the navigation menu, select Site Configuration > Socket.

Configuring Sites for WAN Recovery

We recommend that you use static IP addresses for key sites, such as data centers, that act as hubs for WAN Recovery. Define the off-cloud Public IP and Static Port for each WAN link in the hub sites.

You can use the Monitoring > Best Practices page to confirm that all sites are enabled in the Advanced Configuration settings to support WAN Recovery.

To configure a site for WAN Recovery:

  1. From the navigation menu, select Network > Sites, and select the site.

  2. From the navigation menu, select Site Configuration > Socket.

  3. Configure the WAN link for WAN Recovery:

    1. Click the WAN link. The Edit Socket Interface panel opens.

      off_cloud_publicIP_port.png

    2. Set the Traffic Status to Enabled.

    3. (Optional) Define the static Public IP and Static Port for the link. We recommend this setting for key hub sites.

  4. Repeat step 3 for all Socket WAN links.

  5. Click Apply, and then click Save.

    The site is configured for WAN Recovery.

Analyzing WAN Recovery Events

The CMA generates the following events for WAN Recovery:

  • Off-Cloud Recovery Activated – this event is generated when the Socket starts to send the WAN traffic over the WAN Recovery transport.

  • Off-Cloud Recovery Stopped – this event is generated when the connection to the Cato Cloud is restored and the Socket stops sending WAN traffic over the WAN Recovery transport.
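
If you export these events, pairing each Activated event with the following Stopped event gives the length of each recovery period. The sketch below is only an illustration: the tuple layout `(timestamp, site, event_name)` is an assumption and not the CMA export format. Only the two event names come from this article.

```python
from datetime import datetime

ACTIVATED = "Off-Cloud Recovery Activated"
STOPPED = "Off-Cloud Recovery Stopped"

def recovery_windows(events):
    """Pair each Activated event with the next Stopped event for the same site.

    `events` is an iterable of (timestamp, site, event_name) tuples.
    This record layout is hypothetical and used for illustration only.
    """
    open_since = {}  # site -> time that recovery started
    windows = []     # (site, start, end)
    for ts, site, name in sorted(events):
        if name == ACTIVATED:
            open_since.setdefault(site, ts)
        elif name == STOPPED and site in open_since:
            windows.append((site, open_since.pop(site), ts))
    return windows

events = [
    (datetime(2024, 1, 1, 10, 0), "Branch-1", ACTIVATED),
    (datetime(2024, 1, 1, 10, 7), "Branch-1", STOPPED),
]
for site, start, end in recovery_windows(events):
    print(site, end - start)  # Branch-1 0:07:00
```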

Impact to the Account During WAN Recovery

WAN Recovery is enabled by default for all Socket sites to provide resiliency using off-cloud traffic. If it is disabled for one or more sites, those sites can't communicate with the other sites during the recovery. For example, if WAN Recovery is enabled for sites A and B, but not for site C, then during the recovery site C can't communicate with the other sites, and sites A and B can't communicate with site C.
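
The A, B, C example can be checked with a short sketch. It assumes, as described above, that a pair of sites can communicate during the recovery only when WAN Recovery is enabled for both. The function and the input mapping are hypothetical.

```python
from itertools import combinations

def reachable_pairs(wan_recovery_enabled):
    """Return the site pairs that can still communicate during WAN Recovery.

    `wan_recovery_enabled` maps site name -> bool. A pair is reachable
    only if WAN Recovery is enabled for both sites.
    """
    enabled = sorted(s for s, on in wan_recovery_enabled.items() if on)
    return list(combinations(enabled, 2))

print(reachable_pairs({"A": True, "B": True, "C": False}))  # [('A', 'B')]
```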

The LAN Firewall policy is not impacted and continues to function normally during WAN Recovery because the Socket applies the policy.

Note: Due to regulatory reasons, WAN Recovery is not supported in China.

Don't Reboot the Socket

During WAN Recovery, do NOT reboot the Socket. Rebooting can have a negative impact on the site, and the Socket might not be able to re-establish connectivity with the other sites.

WAN Recovery in Active/Active or Active/Passive

For all deployments, when WAN Recovery is enabled, each Socket establishes secure DTLS tunnels to the remote Socket site on all WAN interfaces that are enabled for off-cloud traffic. For an active/active link configuration, the Socket randomly selects one of the active links for WAN Recovery. For active/passive, the Socket uses the active link.
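
The link selection described above can be sketched as follows. This is an illustration under stated assumptions, not the Socket's actual logic: the dictionary keys (`name`, `role`, `off_cloud`) and the mode strings are hypothetical.

```python
import random

def pick_recovery_link(links, mode, rng=random):
    """Pick the WAN link that carries WAN Recovery traffic.

    `links` is a list of dicts with hypothetical keys:
      name, role ("active" or "passive"), off_cloud (bool).
    `mode` is "active/active" or "active/passive".
    """
    # Only active links that are enabled for off-cloud traffic qualify.
    candidates = [l for l in links if l["off_cloud"] and l["role"] == "active"]
    if not candidates:
        return None
    if mode == "active/active":
        return rng.choice(candidates)["name"]  # random pick among active links
    return candidates[0]["name"]               # active/passive: the active link

links = [
    {"name": "WAN1", "role": "active", "off_cloud": True},
    {"name": "WAN2", "role": "passive", "off_cloud": True},
]
print(pick_recovery_link(links, "active/passive"))  # WAN1
```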

Impact on the CMA During WAN Recovery

The Cato Management Application (CMA) does not receive all site data because it is not connected to the PoP and is not aware of the status of the impacted sites.

You can log in to the Socket WebUI and use the SD-WAN tab to monitor traffic and off-cloud tunnels. This is an example of monitoring traffic with the Socket WebUI:

sokcet_webui_sdWAN.png

PoP Limitations During WAN Recovery

Traffic that is passed over the WAN Recovery off-cloud transport isn’t processed by PoPs in the Cato Cloud. This means that during WAN Recovery, the PoP services are not applied to traffic, including the following items:

  • Security

    • WAN and Internet firewall policies

    • Threat Prevention services (e.g., IPS, Anti-Malware)

  • Networking

    • NAT policy

    • Complex Network Rules

    • DNS Forwarding

    • DHCP Relay

    • Static Range Translation (SRT)

  • Access

    • Client Access (e.g., Client Connectivity policy)

    • Device Posture

WAN Recovery and Alt. WAN Recovery

For accounts that enable recovery via Alt. WAN (e.g., MPLS), if the Socket disconnects from the Cato Cloud, the Alt. WAN link has a higher priority than WAN Recovery. Therefore, the Socket first moves the traffic to the Alt. WAN link. If the Alt. WAN link is unavailable, the Socket then moves the WAN traffic to the WAN Recovery link. Generally, WAN Recovery has the lowest priority as a transport option, and it's only used when the other transport options are unavailable.
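
The failover order can be sketched as a simple priority check. This is a minimal illustration, not the Socket's actual logic: the function name and boolean inputs are hypothetical, and it models only the order stated in this section, not how traffic is steered during normal operation.

```python
def select_wan_transport(cato_cloud_up, alt_wan_up, wan_recovery_up):
    """Return the transport for WAN traffic, highest priority first."""
    if cato_cloud_up:
        return "Cato Cloud"
    if alt_wan_up:
        return "Alt. WAN"       # preferred over WAN Recovery
    if wan_recovery_up:
        return "WAN Recovery"   # lowest priority, used as a last resort
    return None                 # no transport is available

print(select_wan_transport(False, True, True))   # Alt. WAN
print(select_wan_transport(False, False, True))  # WAN Recovery
```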

Understanding NAT Punching to Connect Sockets

WAN Recovery relies on NAT punching to establish the WAN connectivity between your sites. When a Socket connects to the Cato Cloud, the PoP informs the Socket about all the other endpoints, and the Socket opens a DTLS tunnel to each one of them. The Socket uses the NAT punching technique to establish a direct connection with the other Sockets.

Note: The negotiation of the NAT punching starts over the Cato Cloud. Therefore, the Sockets must be connected to the Cato Cloud to allow the NAT punching.

The following diagram shows the flow to establish a direct connection between two Sockets for WAN Recovery:

blobid1.png

The NAT punching technique works for each pair of Sockets in the following way:

  1. The PoP selects one of the Sockets as the initiator to establish a direct connection (Socket 1) based on the site ID (the site with the highest ID value is the initiator).

  2. The initiator Socket (Socket 1) requests its source IP address and port number from the Cato Cloud, for example, IP address 82.128.1.1 and port number 4444.

  3. The Cato PoP sends the source IP address and port to Socket 1.

  4. Socket 1 sends its IP address and port to Socket 2 over the Cato tunnel.

  5. Socket 2 requests its source IP address and port number from the Cato Cloud.

  6. The Cato PoP sends the source IP address and port to Socket 2.

  7. Socket 2 sends its IP address and port to Socket 1 over the Cato tunnel.

  8. Socket 1 sends 32 packets to Socket 2 in the range of the source port, each packet with a different port number.

  9. Socket 2 sends 32 packets to Socket 1 in the range of the source port, each packet with a different port number.

  10. Once the correct port is found, the Sockets open a DTLS tunnel with the source IP address and the port number.

    When Socket 2 connects with Socket 1, the router adds the corresponding entry to its NAT table.

  11. From that point on, the Sockets send keep-alive messages every 15 seconds to keep the connection open.
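
Two parts of this flow, choosing the initiator and building the list of 32 ports to try, can be sketched as follows. The helper names are hypothetical. The exact port range that the Socket uses is not stated in this article, so the window centered on the reported port is purely an assumption for illustration.

```python
def pick_initiator(site_id_a, site_id_b):
    """The site with the highest ID value initiates the direct connection."""
    return max(site_id_a, site_id_b)

def candidate_ports(reported_port, count=32):
    """Return `count` distinct ports near the port that the PoP reported.

    Assumption: the window is centered on the reported port and clamped
    to 1..65535. The real range used by the Socket may differ.
    """
    start = max(1, min(reported_port - count // 2, 65535 - count + 1))
    return list(range(start, start + count))

print(pick_initiator(101, 205))          # 205
ports = candidate_ports(4444)
print(len(ports), ports[0], ports[-1])   # 32 4428 4459
```

Sending to several nearby ports covers the case where the NAT device maps the new connection to a different port than the one the PoP observed.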

Minimizing Reconnection Time with NAT Punching

After NAT punching succeeds, the Socket saves this NAT data. If the Socket restarts, it can immediately reconnect to the other Sockets with that NAT data, which significantly reduces the Socket reconnection time. For Sockets that are behind a network firewall or a router, if your firewall or router restarts, the NAT entries change. The saved NAT data is then no longer relevant, and the Sockets must perform the NAT punching process again.
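
The effect of saved NAT data can be sketched as a lookup per peer. This is only an illustration of the idea: the function name and the shape of the saved data are hypothetical and do not describe how the Socket stores it.

```python
def plan_reconnect(saved_nat, peers):
    """Decide, per peer, whether saved NAT data can be reused.

    `saved_nat` maps peer name -> (ip, port) from an earlier successful
    NAT punch. Peers without saved data need a new NAT punch.
    """
    return {p: ("reuse" if p in saved_nat else "nat-punch") for p in peers}

saved = {"DC": ("82.128.1.1", 4444)}
print(plan_reconnect(saved, ["DC", "Branch-1"]))

# After a firewall or router restart the saved entries are stale,
# so every peer goes through NAT punching again.
saved.clear()
print(plan_reconnect(saved, ["DC", "Branch-1"]))
```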

4 comments

  • Comment author
    Bert-Jan Kamp

    Missing information is the dynamic port range that Cato is using during setup of the tunnel.

  • Comment author
    Jan Van Moere

    Hi,

     

    Why isn't Cato allowing to choose between sites and sdp clients to activate the security features like IPS, NGFW, antimalware or not.  Now we have to choose those features on all sites and sdp clients.  This way we can't even evaluate and compare Cato with other security tooling we have already in place.

    Kind regards,

    Jan

     

  • Comment author
    Dermot - Community Manager

    Hello Bert-Jan!

    My apologies that neither of your comments have been responded to yet!  I will get some answers for you and, if required, ensure the KB article is updated accordingly.

    Kind Regards,

    Dermot Doran (Cato Community Manager)

  • Comment author
    Yoshihiro Toyomasu

    Hi, Cato Team,

    Thank you for a useful article. Let me ask you one question.
    How do DNS and DHCP features work when the WAN Recovery is working?

    [Scenario]
    - Normal
    ・Account's DNS setting is Primary: 10.254.254.1, Secondary: 8.8.8.8.
    ・To resolve the internal domain, the DNS forwarding setting specifies the DNS server under the site as the forwarding destination.

    - While the recovery function is running 
    ・As a DHCP server, is the value distributed by Socket different from normal time?
    ・As a DNS server, how does Socket provide the function?

    Regards,
    Yoshihiro Toyomasu
