Azure Operator Nexus is built on basic constructs such as compute servers, storage appliances, and network fabric devices. An Azure Operator Nexus cluster represents an on-premises deployment of the platform. The lifecycle of platform-specific resources depends on the cluster state.
Cluster deployment overview
During deployment, the cluster moves through several lifecycle phases, each with a specific role in reaching the target state.
Hardware Validation phase:
Hardware Validation (HWV) is initiated during the cluster deployment process. During this phase, each machine defined in the cluster’s rack configuration undergoes a thorough hardware check before deployment. HWV evaluates the health and status of all hardware components on these machines. HWV attempts to fix misconfigured BIOS boot settings and firmware components as needed.
Based on the results of these HWV checks and any machines the user chose to skip, Operator Nexus determines whether enough nodes passed, or are available, to meet the thresholds required for deployment to continue. The HWV results for each server are recorded in the Log Analytics Workspace (LAW) created as part of the cluster setup.
Note
Hardware validation thresholds are enforced for various node types to ensure reliable cluster operation. Management nodes are divided into two roles: Kubernetes Control Plane (KCP) nodes and Nexus Management Plane (NMP) nodes.
- KCP nodes: Must achieve a 100% hardware validation success rate because they make up the control plane.
- NMP nodes: These are grouped into two management groups, with each group required to meet a 50% hardware validation success rate.
- Compute nodes: Must meet the thresholds specified by the deployment input.
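A machine with a known hardware fault can be excluded from validation when the deployment action is started, so it doesn't count against these thresholds. As a sketch using the Azure CLI `networkcloud` extension (the cluster, resource group, and machine names below are placeholders):

```shell
# Start cluster deployment, skipping hardware validation for a known-bad
# machine so it doesn't block the threshold check.
# Requires the `networkcloud` Azure CLI extension; names are placeholders.
az networkcloud cluster deploy \
  --name "myNexusCluster" \
  --resource-group "myResourceGroup" \
  --skip-validations-for-machines "rack1compute03"
```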
For a detailed overview of the HWV process, see Hardware Validation Overview.
For instructions on how to check and troubleshoot HWV results, see Troubleshoot Hardware Validation.
Azure Prerequisites Validation phase:
Before cluster deployment proceeds, Operator Nexus validates that user-provided Azure resources are accessible using the configured managed identity. This validation ensures that the cluster can successfully use the following resources during and after deployment:
- Log Analytics Workspace (LAW): Required for software extension installation and metrics collection
- Storage Account: Used for storing run command output
- Key Vault: Used for credential rotation and secret storage
The validation phase runs as the "Validate Azure prerequisites" step in the deployment action. Each resource undergoes a connectivity and permissions check:
| Resource | Validation test |
|---|---|
| Log Analytics Workspace | Verifies the managed identity can call the GetSharedKeys API |
| Storage Account | Verifies the managed identity can upload and commit blobs to the container |
| Key Vault | Verifies the managed identity can write, read, and delete secrets |
If any validation fails, the deployment action reports the failure with an error message indicating which resource failed and why. Common causes include:
- Missing role assignments on the target resource
- Incorrect resource identifiers (workspace ID, container URL, or vault URI)
- Firewall rules blocking access from trusted Azure services
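Missing role assignments are the most common cause. As an illustrative sketch, a role can be granted to the cluster's managed identity with the Azure CLI; the principal ID, scope, and the specific role shown are placeholders, and the roles actually required depend on your configuration:

```shell
# Grant the cluster's managed identity access to the storage container used
# for run command output. Principal ID and scope are placeholders; the role
# name shown is an example, not necessarily the exact role your setup needs.
az role assignment create \
  --assignee "<cluster-managed-identity-principal-id>" \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>/blobServices/default/containers/<container>"
```

After the role assignment propagates, the validation step retries automatically.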
The deployment doesn't proceed until all Azure prerequisites pass validation. After correcting any issues, validation automatically retries.
For detailed configuration steps and troubleshooting guidance, see Cluster Managed Identity and User Provided Resources.
Bootstrap phase:
Once hardware validation succeeds and deployment thresholds are met, a bootstrap image is generated on the cluster manager to initiate cluster deployment. The image's ISO URL is used to bootstrap the ephemeral node, which then deploys the target cluster components: the Kubernetes control plane (KCP), the Nexus management plane (NMP), and the storage appliance. As these stages execute within the ephemeral bootstrap workflow, their states are reflected in the cluster status.
The ephemeral bootstrap node sequentially provisions each KCP node, and if a KCP node fails to provision, the cluster deployment action fails, marking the cluster status as failed. The Bootstrap operator manages the provisioning process for bare-metal nodes using the Preboot Execution Environment (PXE) boot approach.
Once KCP nodes are successfully provisioned, the deployment action proceeds to provision NMP nodes in parallel. Each management group must achieve at least a 50% provisioning success rate. If this requirement isn't met, the cluster deployment action fails, resulting in the cluster status being marked as failed.
Upon successful provisioning of NMP nodes, up to two storage appliances are created before the deployment action proceeds with provisioning the compute nodes. Compute nodes are provisioned in parallel, and once the defined compute node threshold is met, the cluster status transitions from Deploying to Running. However, the remaining nodes continue undergoing the provisioning process until they too are successfully provisioned.
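Progress through these stages can be monitored on the cluster resource itself. As a sketch (resource names are placeholders, and the `detailedStatus`/`detailedStatusMessage` property names are assumed from the Cluster resource schema):

```shell
# Check deployment progress; the status transitions from Deploying to
# Running once the compute node threshold is met. Names are placeholders.
az networkcloud cluster show \
  --name "myNexusCluster" \
  --resource-group "myResourceGroup" \
  --query "{status: detailedStatus, message: detailedStatusMessage}" \
  --output table
```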
Cluster operations
- List cluster: List cluster information in the provided resource group or subscription.
- Show cluster: Get properties of the provided cluster.
- Update cluster: Update properties or tags of the provided cluster.
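These operations map to commands in the Azure CLI `networkcloud` extension. A sketch with placeholder names:

```shell
# List clusters in a resource group.
az networkcloud cluster list --resource-group "myResourceGroup" --output table

# Show properties of a specific cluster.
az networkcloud cluster show --name "myNexusCluster" \
  --resource-group "myResourceGroup"

# Update tags on a cluster.
az networkcloud cluster update --name "myNexusCluster" \
  --resource-group "myResourceGroup" \
  --tags environment="lab"
```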