Hurricane season 2018 has already left a trail of devastation and outages with Florence sweeping the U.S. East Coast in September and Michael clobbering the Florida Panhandle a month later. Disasters challenge enterprise networks, keeping IT managers up at night. Disaster recovery and high availability (HA) designs are only as good as their last full-scale test, which all too often is too complicated for many enterprises. Here’s where self-healing SD-WANs can help.
The four actions of self-healing SD-WAN
A self-healing SD-WAN dynamically compensates for errors in the network in a way that minimizes disruption to higher-layer services, namely your applications. Self-healing, as it relates to SD-WANs, involves four functions:
- Monitoring the devices and links supporting the SD-WAN, collecting low-level statistics, such as packet loss, latency, and jitter.
- Detecting when line characteristics exceed defined thresholds or parts of the underlay become unavailable.
- Taking action to, ideally, preempt and alert on the network problem.
- Adapting security and other supporting networking services to minimize, and ideally prevent, disruption at the application layer.
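The monitoring and detection steps above can be sketched as a simple threshold check. This is a minimal illustration, not any vendor's implementation: the `LinkStats` type, the threshold values, and the `detect` and `classify` helpers are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LinkStats:
    loss_pct: float    # packet loss, in percent
    latency_ms: float  # round-trip latency, in milliseconds
    jitter_ms: float   # latency variation, in milliseconds


# Hypothetical thresholds; real values depend on the applications carried.
THRESHOLDS = LinkStats(loss_pct=1.0, latency_ms=150.0, jitter_ms=30.0)


def detect(stats: LinkStats, limits: LinkStats = THRESHOLDS) -> List[str]:
    """Return the names of the metrics that exceed their thresholds."""
    breaches = []
    if stats.loss_pct > limits.loss_pct:
        breaches.append("loss")
    if stats.latency_ms > limits.latency_ms:
        breaches.append("latency")
    if stats.jitter_ms > limits.jitter_ms:
        breaches.append("jitter")
    return breaches


def classify(stats: Optional[LinkStats]) -> str:
    """Classify a link; None stands for a probe that got no answer at all."""
    if stats is None:
        return "blackout"
    return "brownout" if detect(stats) else "healthy"
```

A link that answers probes but breaches any threshold is treated as a brownout, which is the case where acting and alerting early matters most.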
These functions must be implemented at all four tiers of a global network: device, site, regional, and global. Let’s take a look at each one.
At the device level, self-healing SD-WANs must protect against component failures, such as failed power supplies or hard disks, in the supporting appliances. At a minimum, that means protecting against component failures in the SD-WAN access device. Should other appliances be essential to the enterprise network, such as firewalls or WAN optimization devices, those too need to be protected against component failure.
At the site level, self-healing SD-WANs must not only recover from the complete failure of the appliances supporting the SD-WAN but also address brownouts and blackouts in the path to those appliances and across the last mile. The network design should include redundant paths to the appliances at the location and dual-homed access lines to the Internet. Availability arithmetic shows that even dual broadband connections can approach the theoretical availability of MPLS, but only if all parts of the broadband infrastructure are made redundant, including the SD-WAN access devices. Practically, this means HA configuration of SD-WAN branch devices must be simple enough to be deployed by anyone and affordable enough for widespread enterprise adoption.
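The availability argument can be illustrated with the standard formulas for redundant (parallel) and chained (series) components. The sketch below assumes failures are independent, which shared conduits or a common upstream provider can violate, and the 99% and 99.9% figures are illustrative inputs rather than measured values.

```python
from math import prod


def parallel_availability(availabilities):
    """Redundant components: the service is down only if all of them are down."""
    return 1 - prod(1 - a for a in availabilities)


def series_availability(availabilities):
    """Chained components: the service is up only if every one of them is up."""
    return prod(availabilities)


# Two independent broadband links at an assumed 99% availability each.
dual_broadband = parallel_availability([0.99, 0.99])

# The same links behind a single, non-redundant SD-WAN device (assumed 99.9%).
single_device_site = series_availability([dual_broadband, 0.999])

# The same links behind an HA pair of those devices.
ha_site = series_availability([dual_broadband,
                               parallel_availability([0.999, 0.999])])
```

Under these assumptions the two links together reach roughly 99.99%, but a single access device in front of them pulls the site back below 99.9%. That is why the redundancy has to extend to the SD-WAN devices themselves.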
Self-healing SD-WANs must also be able to compensate for packet loss, the network condition most common in the last mile. Packet loss correction technologies, such as packet duplication, are well established. Should packet loss rates be too high, or should there be a blackout, the self-healing SD-WAN will steer traffic around the underperforming link in accordance with business priorities. In practice, most SD-WAN solutions provide last-mile self-healing by running links in active/active configurations, monitoring line performance in real time, and using policy-based routing and application profiles to steer traffic to the optimum link.
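A rough sketch of profile-based steering follows. The dictionary layout, the `pick_link` helper, and the threshold values are assumptions for illustration; real products apply richer policies.

```python
def pick_link(links, profile):
    """Choose a link for an application.

    links:   {name: {"up": bool, "loss_pct": float, "latency_ms": float}}
    profile: {"max_loss_pct": float, "max_latency_ms": float}
    Returns the chosen link name, or None if every link is down.
    """
    # A blackout removes the link from consideration entirely.
    usable = {name: s for name, s in links.items() if s["up"]}
    if not usable:
        return None

    # Prefer links that meet the application's requirements.
    compliant = {
        name: s for name, s in usable.items()
        if s["loss_pct"] <= profile["max_loss_pct"]
        and s["latency_ms"] <= profile["max_latency_ms"]
    }

    # If none comply, fall back to the least-bad link that is still up.
    candidates = compliant or usable
    return min(candidates,
               key=lambda n: (candidates[n]["loss_pct"],
                              candidates[n]["latency_ms"]))


# Hypothetical last-mile measurements and a voice profile.
links = {
    "cable": {"up": True, "loss_pct": 3.0, "latency_ms": 20.0},
    "dsl":   {"up": True, "loss_pct": 0.2, "latency_ms": 45.0},
}
voice = {"max_loss_pct": 1.0, "max_latency_ms": 150.0}
```

With these inputs, voice traffic goes to the DSL line despite its higher latency, because the cable line's loss rate breaches the profile. If the DSL line goes dark, the same call falls back to cable.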
Self-healing in regional and global tiers
As you move towards the network core, the issues differ depending on your SD-WAN architecture. When SD-WAN solutions rely only on the Internet, as is the case with SD-WAN edge solutions, there’s less concern about the reachability of the regional and global networks. There’s often, but certainly not always, significant pathing and redundancy built into these networks to prevent complete outages. There is considerable concern, though, around brownouts. The erratic and often poor performance of global Internet connections is well documented, with latencies and loss rates substantially higher than those of privately managed networks.
The opposite is true for SD-WANs that rely on the Internet for access to a local point of presence (PoP) but use a privately managed backbone, not the Internet, to connect those PoPs. With private backbones, performance across the SD-WAN will generally be better than that of the Internet, with lower latency and loss. Reachability, though, is another issue. IT managers must check that sufficient intelligence, pathing, and redundancy are built into these SD-WANs to ensure self-healing.
More specifically, at the regional level, PoP components need to be made redundant, with failover between them in the event of an internal outage. Should sites lose connectivity to a PoP, whether because of a pathing issue or a component failure, they should automatically reconnect to the next closest PoP, repeating the process until a connection succeeds.
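The reconnect behavior can be sketched as a loop over PoPs ordered by measured distance. The `connect_to_nearest_pop` helper, the PoP names, and the round-trip times below are hypothetical, and `try_connect` stands in for whatever tunnel setup a given product uses.

```python
def connect_to_nearest_pop(pops, try_connect):
    """Try PoPs from closest to farthest and return the first that accepts.

    pops:        list of (name, rtt_ms) pairs
    try_connect: callable taking a PoP name and returning True on success
    Returns the connected PoP's name, or None if no PoP is reachable.
    """
    for name, _rtt_ms in sorted(pops, key=lambda pop: pop[1]):
        if try_connect(name):
            return name
    return None


# Hypothetical PoPs with measured round-trip times from one site.
pops = [("frankfurt", 30.0), ("amsterdam", 12.0), ("london", 18.0)]
unreachable = {"amsterdam"}
chosen = connect_to_nearest_pop(pops, lambda name: name not in unreachable)
```

Here the closest PoP is unreachable, so the site lands on the next closest one.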
Within the core of the global network, PoPs themselves need to recover from line outages and steer traffic around failed links. Latency in the global middle-mile is particularly critical. PoPs should be able to consider both direct and indirect routing, selecting the optimum path between them for each application. Should there be a brownout or blackout on one PoP-to-PoP connection, alternative paths should be taken.
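One way to picture direct versus indirect routing between PoPs is a lowest-latency path search over the links that are currently healthy. The sketch below uses Dijkstra's algorithm as a stand-in; the source does not say which algorithm any particular backbone uses, and the PoP names and latencies are made up.

```python
import heapq


def lowest_latency_path(graph, src, dst):
    """Find the lowest-latency path between two PoPs.

    graph: {pop: {neighbor: latency_ms}} containing only healthy links
    Returns (total_latency_ms, [pop, ...]) or None if dst is unreachable.
    """
    queue = [(0.0, src, [src])]
    visited = set()
    while queue:
        latency, pop, path = heapq.heappop(queue)
        if pop == dst:
            return latency, path
        if pop in visited:
            continue
        visited.add(pop)
        for neighbor, link_latency in graph.get(pop, {}).items():
            if neighbor not in visited:
                heapq.heappush(
                    queue, (latency + link_latency, neighbor, path + [neighbor])
                )
    return None


# Hypothetical backbone: the direct A-to-B link is slower than going via C.
backbone = {"A": {"B": 120, "C": 40}, "C": {"B": 50}, "B": {}}

# The same backbone after a blackout on the A-to-C link.
degraded = {"A": {"B": 120}, "C": {"B": 50}, "B": {}}
```

A link in blackout is simply left out of the graph, and a link in brownout can be given its currently measured latency, so the same search covers both cases. A per-application version would weigh loss or jitter alongside latency.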
Self-healing and applications
Ultimately, the goal of self-healing is to prevent network outages from impacting the business. It’s not enough to fail over or fail back between connections; the network has to accommodate changing traffic patterns in line with business priorities. Similarly, self-healing networks must adjust the other devices and services supporting the delivery of applications across global networks, such as security policies and WAN optimization rules.
Failure to do so can prevent users from accessing applications and resources even though the SD-WAN has adapted to an outage. For example, vMotioning workloads between two physical data centers can succeed, yet users may still be unable to access the application. Security rules in the firewall at the second data center need to be updated to permit access to the new IPs now appearing there. Self-healing networks must address this situation, updating the necessary network services to prevent disruption at the application layer.
Anticipating every possible failover scenario is too much to expect when designing an SD-WAN. Instead, we need to build the necessary intelligence into the SD-WAN to anticipate problems and take corrective action. Only then will our SD-WANs continue to work when outages come and, even more importantly, will IT managers get a good night’s sleep during disaster season.
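The firewall adjustment in this example can be pictured as an event handler that rewrites the allow list when a workload moves. Everything in the sketch is an assumption: the rule layout, the `on_workload_moved` helper, and the application name are invented, and the addresses come from the reserved documentation ranges.

```python
def on_workload_moved(allow_rules, app, old_ips, new_ips):
    """Return updated allow rules after an application's workload has moved.

    allow_rules: {app_name: set of destination IPs users may reach}
    The input is left untouched; a modified copy is returned.
    """
    updated = {name: set(ips) for name, ips in allow_rules.items()}
    permitted = updated.setdefault(app, set())
    permitted -= set(old_ips)  # stop permitting addresses that no longer exist
    permitted |= set(new_ips)  # permit the addresses at the second data center
    return updated


# Hypothetical rules before the move.
rules = {"crm": {"192.0.2.10", "192.0.2.11"}}
after_move = on_workload_moved(
    rules, "crm",
    old_ips={"192.0.2.10", "192.0.2.11"},
    new_ips={"198.51.100.10", "198.51.100.11"},
)
```

The point is the trigger rather than the data structure: the rule change is driven by the same event that moved the workload, not by a ticket raised after users complain.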
Dave has spent more than 20 years as an award-winning journalist and independent technology consultant. Today, he works in the office of the CTO as the secure networking evangelist for Cato Networks.