Friday, 11 August 2017

Implementing SQL Server Failover Clustering in Azure

Deploying Infrastructure as a Service (IaaS) solutions in Microsoft Azure offers a number of benefits that leverage the agility, resiliency, and scalability built into the underlying platform. However, when dealing with business-critical workloads, customers typically also want to provide high-availability and disaster recovery capabilities in a manner that they can control. Trying to implement this approach in the cloud by following the procedures applicable in on-premises datacenters frequently presents challenges that result from differences between the two environments. In this article, we will focus on these differences in the context of deploying SQL Server Failover Clustering in Azure.

The high-availability capabilities of SQL Server Failover Clustering rely on the Failover Clustering feature built into the Windows Server operating system. In general, the two infrastructure services that affect how Failover Clustering operates are storage and networking.

From the storage perspective, a number of Failover Clustering solutions depend on the ability to provide shared access to the same set of disks from multiple cluster nodes. Historically, this required higher-end hardware-based solutions, such as a Serial Attached SCSI or Fibre Channel Storage Area Network. While it was also possible to use iSCSI storage for this purpose, its support statement did not inspire a tremendous amount of confidence. With advancements in network-based storage access, primarily in the area of the SMB protocol, it became possible to provide shared storage without relying on specialized hardware configurations. Instead, cluster nodes could collectively use commodity storage in a JBOD (just a bunch of disks) configuration hosted by shared disk enclosures. In addition, starting with SQL Server 2012, it became possible to store system and user databases on SMB shares. Both of these options are fully supported in on-premises deployments of SQL Server Failover Clustering.

However, when it comes to Azure, attempts to implement these options encounter one major obstacle: Azure virtual machines place an exclusive lock on every one of their virtual disks, which effectively precludes sharing locally attached storage. There are third-party solutions, such as SIOS DataKeeper, that emulate shared disks by provisioning a cluster resource that represents two separate but synchronously replicated virtual disks. However, they come at an extra licensing cost. Fortunately, with the introduction of Windows Server 2016, there is another approach that provides the same functionality by relying on operating system features. This approach involves implementing Storage Spaces Direct (S2D), which takes SMB-based shared storage to the next level. With S2D, you can create a pool of disks that are attached locally to individual nodes of a cluster and present them as one or more volumes. You can mount these volumes on the same set of cluster nodes (in a so-called hyper-converged configuration) or on nodes of another cluster (in a so-called converged configuration). Since this technology uses local disks, it is fully functional and supported in both on-premises and Azure-resident clusters.

With S2D, you finally have the ability to provision a SQL Server Failover Cluster Instance on Azure virtual machines. Note that without it (or without using third-party products), you could still provide high availability for your SQL Server workloads in the cloud by using SQL Server Always On availability groups or database mirroring, since neither of them requires shared storage.

The other aspect that warrants additional consideration in the context of SQL Server Failover Clustering deployments in Azure is networking. The primary difference between on-premises and cloud-based implementations is that the latter require a load balancer for every client access point that must be accessible from outside of the cluster. The underlying reason is the lack of support in Azure for gratuitous Address Resolution Protocol (ARP), which on-premises clusters rely on to announce that a clustered IP address has moved to a different node following a failover. This applies to all Failover Clustering roles that must be accessible via a network name, which means that a load balancer must be part of your implementation of both Failover Cluster Instances and SQL Server Always On availability groups.
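
To make the role of the load balancer more concrete, the following minimal Python sketch models one way that traffic can follow the clustered role without ARP announcements. It assumes a health-probe arrangement in which only the node currently hosting the clustered role answers the load balancer's probe. The node names, the Node class, and the route and fail_over functions are hypothetical and serve only as an illustration; this is a conceptual model, not Azure or Failover Clustering code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """A cluster node (a virtual machine behind the load balancer)."""
    name: str
    owns_role: bool = False  # True only on the node hosting the clustered role

    def answers_probe(self) -> bool:
        # Assumption: only the current owner of the role responds to the probe.
        return self.owns_role


def route(nodes: List[Node]) -> Optional[str]:
    """Return the name of the node that should receive client traffic."""
    healthy = [n.name for n in nodes if n.answers_probe()]
    return healthy[0] if healthy else None


def fail_over(nodes: List[Node], target: str) -> None:
    """Move ownership of the clustered role to the node named target."""
    for n in nodes:
        n.owns_role = (n.name == target)


# Hypothetical two-node cluster.
nodes = [Node("sqlnode1", owns_role=True), Node("sqlnode2")]
before = route(nodes)
fail_over(nodes, "sqlnode2")
after = route(nodes)
print(before, "->", after)
```

In this model, clients always connect to the load balancer's front-end address, and the probe result alone determines which node receives the traffic, so no node has to announce an address change on the network after a failover.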

In our upcoming articles, we will step through sample implementations that illustrate the use of S2D and Azure internal load balancers in SQL Server Failover Clustering deployments in Azure.