Business continuity and disaster recovery overview

Business continuity and disaster recovery in Azure Data Explorer enables your business to continue operating in the face of a disruption. This article details multiple disaster recovery configurations depending on recoverability requirements (RPO and RTO), needed effort, and cost.

For more information about the reliability options available for Azure Data Explorer, including availability zone support, backup, and protection against some types of human error, see Reliability in Azure Data Explorer.

Disaster recovery configurations

Recovery time objective (RTO) is the maximum acceptable time to recover from a disruption. For example, an RTO of 2 hours means the application has to be up and running within two hours of a disruption. Recovery point objective (RPO) is the maximum interval of time for which data loss is tolerable when a disruption occurs. For example, if the RPO is 24 hours and an application has data going back 15 years, losing only data from the most recent 24 hours before the disruption is still within the parameters of the agreed-upon RPO.
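As a rough illustration of the two definitions, the following Python sketch compares a single incident against agreed objectives. The function name, the field names, and the timestamps are all hypothetical and chosen only for this example.

```python
from datetime import datetime, timedelta


def meets_objectives(disruption_start, last_recoverable_point,
                     service_restored, rpo, rto):
    """Compare one incident against the agreed RPO and RTO."""
    # Data written after the last recoverable point is lost.
    data_loss_window = disruption_start - last_recoverable_point
    # Downtime runs from the disruption until service is restored.
    downtime = service_restored - disruption_start
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_met": data_loss_window <= rpo,
        "rto_met": downtime <= rto,
    }


incident = meets_objectives(
    disruption_start=datetime(2024, 1, 10, 12, 0),
    last_recoverable_point=datetime(2024, 1, 10, 6, 0),  # 6 hours of data lost
    service_restored=datetime(2024, 1, 10, 15, 0),       # 3 hours of downtime
    rpo=timedelta(hours=24),
    rto=timedelta(hours=2),
)
print(incident["rpo_met"], incident["rto_met"])  # True False
```

In this made-up incident, the 6-hour data loss is within the 24-hour RPO, but the 3-hour downtime exceeds the 2-hour RTO.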

Ingestion, processing, and curation processes need diligent design upfront when planning for disaster recovery. Ingestion refers to data integrated into Azure Data Explorer from various sources; processing refers to transformations and similar activities; curation refers to materialized views, exports to the data lake, and so on.

The following are popular disaster recovery configurations:

Active-active-active configuration

This configuration is also called always-on. For critical application deployments with no tolerance for outages, you should use multiple Azure Data Explorer clusters across Azure paired regions. Set up ingestion, processing, and curation in parallel to all of the clusters. The cluster SKU must be the same across regions. Azure ensures that updates are rolled out and staggered across Azure paired regions. An Azure region outage doesn't cause an application outage. You might experience some latency or performance degradation.

Active-active-active-n configuration.

Configuration            RPO      RTO      Effort  Cost
Active-Active-Active-n   0 hours  0 hours  Lower   Highest

Active-Active configuration

This configuration is identical to the active-active-active configuration, but only involves two Azure paired regions. Configure dual ingestion, processing, and curation. Users are routed to the nearest region. The cluster SKU must be the same across regions.

Active-active configuration.

Configuration  RPO      RTO      Effort  Cost
Active-Active  0 hours  0 hours  Lower   High

Active-Hot standby configuration

The Active-Hot standby configuration is similar to the Active-Active configuration in that it uses dual ingestion, processing, and curation. While the standby cluster is online for ingestion, processing, and curation, it isn't available for queries. The standby cluster doesn't need to use the same SKU as the primary cluster. It can use a smaller SKU and scale, which might make it less performant. In a disaster scenario, users are redirected to the standby cluster, which can optionally be scaled up to increase performance.
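The redirection step can be sketched as a small selection routine. This is only an illustration: the `Cluster` type, the role labels, the cluster names and URIs, and the `healthy` flag (assumed to come from a health probe you run yourself) are all hypothetical, and real redirection is often done at the DNS or traffic-routing layer instead.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cluster:
    name: str
    query_uri: str
    role: str      # "primary" or "standby" (hypothetical labels)
    healthy: bool  # result of a health probe supplied by the caller


def pick_query_cluster(clusters: List[Cluster]) -> Cluster:
    """Prefer a healthy primary; fall back to a healthy standby."""
    for role in ("primary", "standby"):
        for cluster in clusters:
            if cluster.role == role and cluster.healthy:
                return cluster
    raise RuntimeError("No healthy cluster is available for queries")


clusters = [
    Cluster("primary-cluster", "https://primary.example.invalid", "primary", healthy=False),
    Cluster("standby-cluster", "https://standby.example.invalid", "standby", healthy=True),
]
chosen = pick_query_cluster(clusters)
print(chosen.name)  # standby-cluster
```

Scaling up the standby cluster after failover is a separate operation and isn't shown here.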

Active-hot standby configuration.

Configuration       RPO      RTO  Effort  Cost
Active-Hot Standby  0 hours  Low  Medium  Medium

On-demand data recovery configuration

This solution offers the least recoverability (highest RPO and RTO), and is the lowest in cost and highest in effort. In this configuration, there's no data recovery cluster. Configure continuous export of curated data (and of raw and intermediate data, if those are also required) to a storage account that is configured for geo-redundant storage (GRS). A data recovery cluster is spun up only when a disaster recovery scenario occurs. At that time, DDLs, configuration, policies, and processes are applied. Data is ingested from storage with the ingestion property kustoCreationTime, which overrides the ingestion time that otherwise defaults to system time.
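To make the re-ingestion step more concrete, here is a minimal Python sketch that builds an `.ingest into` management command which carries the original data time. It rests on several assumptions: that the override is spelled `creationTime` in the command's `with (...)` clause (verify this against the ingestion properties reference for your ingestion method), that the exported files are Parquet, and that the table name and blob URI shown are placeholders. The helper `build_ingest_command` is hypothetical and only formats a string; it doesn't connect to a cluster.

```python
from datetime import datetime, timezone


def build_ingest_command(table, blob_uri, original_time, data_format="parquet"):
    """Build an '.ingest into' command that preserves the original data time.

    Assumption: the creation-time override is passed as 'creationTime'.
    """
    # Normalize to UTC and format as an ISO 8601 timestamp.
    creation_time = original_time.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return (
        f".ingest into table {table} ('{blob_uri}') "
        f"with (format='{data_format}', creationTime='{creation_time}')"
    )


command = build_ingest_command(
    table="Events",  # hypothetical table name
    # Placeholder URI; a real one also needs credentials such as a SAS token.
    blob_uri="https://<account>.blob.core.windows.net/<container>/<blob>",
    original_time=datetime(2023, 6, 1, 8, 30, tzinfo=timezone.utc),
)
print(command)
```

Carrying the original time matters because time-based policies such as retention and caching would otherwise treat all recovered data as if it had just arrived.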

On-demand data recovery cluster configuration.

Configuration                    RPO      RTO      Effort   Cost
On-demand data recovery cluster  Highest  Highest  Highest  Lowest

Summary of disaster recovery configuration options

Configuration                    Recoverability  RPO      RTO      Effort   Cost
Active-Active-Active-n           Highest         0 hours  0 hours  Lower    Highest
Active-Active                    High            0 hours  0 hours  Lower    High
Active-Hot Standby               Medium          0 hours  Low      Medium   Medium
On-demand data recovery cluster  Lowest          Highest  Highest  Highest  Lowest

Best practices

Regardless of which disaster recovery configuration is chosen, follow these best practices:

  • All database objects, policies, and configurations should be persisted in source control so they can be released to the cluster from your release automation tool. For more information, see Azure DevOps support for Azure Data Explorer.
  • Design, develop, and implement validation routines to ensure all clusters are in sync from a data perspective. Azure Data Explorer supports cross-cluster joins. A simple count of rows across tables can help validate.
  • Release procedures should involve governance checks and balances that ensure mirroring of the clusters.
  • Be fully cognizant of what it takes to build a cluster from scratch.
  • Create a checklist of deployment units. Your list is unique to your needs, but should include: deployment scripts, ingestion connections, BI tools, and other important configurations.
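
The row-count validation mentioned in the list above could look roughly like the following Python sketch. It assumes the per-table counts were already collected from each cluster (for example, by running a count query per table) and are passed in as plain dictionaries. The function name, the table names, and the `tolerance` parameter are hypothetical.

```python
from typing import Dict, Optional, Tuple


def find_count_mismatches(
    primary_counts: Dict[str, int],
    secondary_counts: Dict[str, int],
    tolerance: int = 0,
) -> Dict[str, Tuple[Optional[int], Optional[int]]]:
    """Return tables whose row counts differ between two clusters."""
    mismatches = {}
    # Check every table seen on either cluster, in a stable order.
    for table in sorted(set(primary_counts) | set(secondary_counts)):
        primary = primary_counts.get(table)
        secondary = secondary_counts.get(table)
        # A table missing on one side is always a mismatch.
        if primary is None or secondary is None or abs(primary - secondary) > tolerance:
            mismatches[table] = (primary, secondary)
    return mismatches


result = find_count_mismatches(
    {"Events": 1000, "Users": 50, "Sessions": 200},
    {"Events": 1000, "Users": 48},
)
print(result)  # {'Sessions': (200, None), 'Users': (50, 48)}
```

A nonzero `tolerance` can be useful when ingestion is still in flight on one cluster and small, short-lived differences are expected.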
