Many organizations with on-premises IT infrastructures have turned to the cloud when configuring for disaster recovery (DR). If a catastrophic event such as an earthquake or flood were to render the on-prem infrastructure unusable, the organization could spin up the cloud-based infrastructure and continue to operate. But what if your primary infrastructure is already in the cloud? You could configure a DR solution with backup infrastructure in a remote region of AWS, Google, or Azure, but what are your options if you want to ensure that your business can keep operating even if the entire AWS, Google, or Azure cloud goes down? If that’s your concern, a multi-cloud DR solution may be worth considering.
Building out a multi-cloud infrastructure
For our purposes, “multi-cloud” means two cloud infrastructures: say, Azure and AWS, or AWS and the Google Cloud Platform (GCP). If your primary infrastructure is running on AWS, for example, your DR infrastructure would be configured on Azure or GCP. If the entire AWS cloud were to go down, your DR infrastructure on Azure or GCP would enable you to continue to operate with minimal interruption.
In reality, though, you’re unlikely to duplicate all of your resources in both clouds. A more likely scenario would involve configuring multi-cloud instances of those systems — say, your production SQL Server systems or your SAP ERP landscape — whose inaccessibility would be most detrimental in terms of revenue generation and cost of downtime.
When it comes to configuring the infrastructure in the second cloud, you’ll want to mirror the infrastructure in the primary cloud as closely as possible – with similar computing resources, storage, and networking services. You’ll need to install and configure all aspects of the application infrastructure you’re mirroring – from the application software to the databases, utilities, support services, and so on – so that the second cloud infrastructure appears operationally the same as the first. You should also consider placing your second cloud infrastructure in a geographic region different from the one in which your primary cloud infrastructure is running. While no event has yet compromised an entire cloud infrastructure – let alone two at the same time – a catastrophe in one region could affect data centers in multiple clouds simultaneously.
Ensuring multi-cloud DR
In each of these cloud environments, you'll then need to deploy the DR services that will keep the two cloud infrastructures synchronized and that will facilitate failover between cloud environments if a disaster takes down or threatens the primary cloud environment.
The critical aspect of infrastructure synchronization lies in the replication of production data from the primary cloud to the secondary cloud. Because you're focused on ensuring the operational resilience of critical systems, you won't want to rely on a replication strategy that depends on recurring database backups copied between the clouds or, worse, some kind of log-shipping scheme that would require you to roll forward a database snapshot from some earlier date. Those approaches would dramatically increase the amount of time it takes to recover your critical systems.
A better approach to DR will rely on a highly efficient replication system, such as block-level replication. Unlike slower file-level replication, this approach ensures that each change to data in the production environment is replicated in the second cloud environment. Be sure your replication solution supports your failover clustering environment. In a high availability (HA) configuration, the tools performing the block-level replication would replicate data synchronously between separate but nearby cloud data centers. In a DR configuration, the replication will likely occur asynchronously to accommodate the latency caused by moving data between clouds and, most likely, between geographic regions. The data in the second cloud infrastructure may be a few seconds out of step with the data in the primary cloud, but if the second cloud were suddenly called into action, you would be able to bring the service online much faster than if you had to restore from backups.
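To make the asynchronous case concrete, here is a minimal Python sketch of the idea, not a model of any particular replication product. The class name, the block size, and the `ship` method are all assumptions made for illustration. Writes are acknowledged on the primary right away, queued, and applied to the replica later in their original order, so the replica may trail the primary by however many writes the inter-cloud link has not yet carried.

```python
from collections import deque

BLOCK_SIZE = 4096  # bytes per block; an assumed value for illustration


class AsyncBlockReplicator:
    """Toy model: block writes are queued on the primary and applied
    to a replica volume later, in the original write order."""

    def __init__(self):
        self.primary = {}       # block number -> block contents
        self.replica = {}       # same layout, standing in for the second cloud
        self.pending = deque()  # writes acknowledged locally but not yet shipped

    def write(self, block_no, data):
        """Acknowledge the write on the primary immediately (asynchronous mode)."""
        if len(data) > BLOCK_SIZE:
            raise ValueError("data larger than one block")
        self.primary[block_no] = data
        self.pending.append((block_no, data))

    def ship(self, max_blocks):
        """Apply up to max_blocks queued writes to the replica, oldest first."""
        shipped = 0
        while self.pending and shipped < max_blocks:
            block_no, data = self.pending.popleft()
            self.replica[block_no] = data
            shipped += 1
        return shipped

    def lag(self):
        """Number of writes by which the replica trails the primary."""
        return len(self.pending)


if __name__ == "__main__":
    vol = AsyncBlockReplicator()
    for i in range(5):
        vol.write(i, b"x" * 16)
    vol.ship(max_blocks=3)   # the link only kept up with three writes
    print("writes behind:", vol.lag())
    vol.ship(max_blocks=10)  # the link catches up
    print("in sync:", vol.primary == vol.replica)
```

In this toy model, whatever is still in `pending` when the primary is lost is the data you would not find in the second cloud. A synchronous HA configuration would instead hold the acknowledgment in `write` until the replica had applied the block.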
Finally, there remains the question of orchestrating the failover between cloud infrastructures in the event of a whole-cloud catastrophe. A variety of tools exist that can streamline and automate failover (including some that can automatically update the DNS servers to reroute inbound traffic to the second cloud). Unlike tools that are built into specific applications, those that are application-agnostic will give you greater flexibility if you are moving more than one critical application into a multi-cloud configuration. Be wary of failover management features that depend on manual steps, too. At some point, the primary cloud infrastructure will become operational again, and your DR tool should make it easy to replicate any recent updates to your databases back to the primary cloud infrastructure and then to move operational traffic back to that infrastructure without any business interruption.
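The failover and failback sequence described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `FailoverController` class and its three hooks (`probe_primary`, `point_dns_at`, `resync_to_primary`) are hypothetical names, not the API of any real DR tool or DNS service, and a real tool would likely wait for several healthy probes before failing back.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FailoverController:
    """Application-agnostic failover sketch; every hook is a hypothetical callable."""
    probe_primary: Callable[[], bool]      # True when the primary cloud responds
    point_dns_at: Callable[[str], None]    # reroutes inbound traffic to a named site
    resync_to_primary: Callable[[], None]  # copies recent updates back before failback
    failure_threshold: int = 3             # consecutive failed probes before failover
    active: str = "primary"
    _failures: int = 0
    log: List[str] = field(default_factory=list)

    def tick(self) -> str:
        """Run one health check and fail over or fail back if warranted."""
        healthy = self.probe_primary()
        if self.active == "primary":
            self._failures = 0 if healthy else self._failures + 1
            if self._failures >= self.failure_threshold:
                self.point_dns_at("secondary")
                self.active = "secondary"
                self.log.append("failover")
        elif healthy:
            # Primary is back: replicate recent updates home, then move traffic.
            self.resync_to_primary()
            self.point_dns_at("primary")
            self.active = "primary"
            self._failures = 0
            self.log.append("failback")
        return self.active


if __name__ == "__main__":
    probes = iter([True, False, False, False, False, True])
    actions = []
    ctl = FailoverController(
        probe_primary=lambda: next(probes),
        point_dns_at=actions.append,
        resync_to_primary=lambda: actions.append("resync"),
    )
    print([ctl.tick() for _ in range(6)])
    print(actions)
```

Because the controller only calls injected hooks, the same logic could in principle sit in front of a database cluster or an ERP landscape, which is the flexibility argument for application-agnostic tooling.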
Dave Bermingham is the Senior Technical Evangelist at SIOS Technology.