Unhealthy scale set instance for scale set "FE"

Question


Harini Gopinath 0

Hi Team,

All scale set instances under the VMSS "FE" are unhealthy. I tried restarting the instances to see if it would help, but it didn't. What can be done here?

Tenant is GME. It is in prod environment.

Because of this, automatic upgrades are failing, which is causing issues.

Thanks

2 answers

Answer 1

Hello Harini

We understand that all instances under the Virtual Machine Scale Set “FE” are currently marked as Unhealthy, and restarting the instances did not resolve the issue. As a result, automatic upgrades are failing, which is impacting the production workload.

In Azure Virtual Machine Scale Sets (VMSS), automatic OS upgrades and instance repairs rely on health signals from either:

An Application Health Extension, or
An Azure Load Balancer health probe

If these health checks are misconfigured, not responding with expected results (for example, HTTP 2xx responses), or not enabled at all, the platform marks the instances as Unhealthy. When all instances are unhealthy, Azure intentionally blocks automatic upgrades to prevent potential service downtime.

https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machine-scale-sets/restart-stop/instances-not-repaired

Please review and validate the following configuration:

Confirm Health Monitoring is Enabled
- Ensure that only one health signal source is configured (either Application Health Extension or Load Balancer probe).
- If using the Application Health Extension, verify that the application endpoint consistently returns a healthy response (HTTP 2xx or successful TCP handshake).
  https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-health-extension?tabs=rest-api
Validate the Health Endpoint
- Confirm the configured probe endpoint is reachable and responding as expected from inside the VM instances.
- Any response other than HTTP 2xx (or a failed TCP handshake, for TCP probes) will cause Azure to mark the instance as Unhealthy.
  https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machine-scale-sets/restart-stop/instances-not-repaired
Check Automatic Repair / Upgrade Eligibility
- Automatic OS upgrades and instance repairs will proceed only after instances report a Healthy state.
- Once health is restored, upgrades should resume automatically without manual intervention.
  https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-automatic-upgrade
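
To check the endpoint from inside an instance, something like the following Python sketch may help. It only illustrates the "HTTP 2xx means healthy" rule described above; the function names are hypothetical, and the URL you pass in must match whatever protocol, port, and path your scale set is actually configured with. Note that urllib follows redirects, so a 3xx that lands on a 2xx page would show as healthy here, which may not match how the platform evaluates it.

```python
import urllib.error
import urllib.request


def is_healthy_status(status_code: int) -> bool:
    """Treat any HTTP 2xx status as healthy."""
    return 200 <= status_code < 300


def probe(url: str, timeout_seconds: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 2xx, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return is_healthy_status(response.status)
    except urllib.error.HTTPError as error:
        # urllib raises HTTPError for 4xx/5xx responses; these count as unhealthy.
        return is_healthy_status(error.code)
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, or timeout: the endpoint is unreachable.
        return False
```

For example, running `probe("http://localhost:80/")` on an instance (assuming the application listens on port 80 at path `/`) should return True if the endpoint would pass an HTTP health check.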

If the issue persists after validating the health configuration, we are happy to assist further. Please add your comments in the Comment section, and we will respond promptly to help you move forward.

Thanks,
Manish.

Manish Deshpande 5,255 Reputation points Microsoft External Staff Moderator

2026-03-20T11:34:47.03+00:00

Hello Harini

I wanted to check whether my last response made sense. I'd be glad to assist further or explain anything in more detail. If the answer was helpful, please accept it (select Yes) and upvote it so that it can help others in the community.

Answer 2

For a VM scale set where all instances show as Unhealthy and automatic OS image upgrade is failing, use the following checks and fixes:

Confirm automatic repairs are actually enabled and running
- Check the VMSS orchestrationServices property and verify the automatic repairs serviceState is Running, not NotRunning or Suspended. If it is not Running, enable or re-enable automatic instance repairs on the scale set.
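
As a rough sketch of that first check, the snippet below reads the automatic repairs state out of an instance view response that has already been fetched and parsed into a Python dict. The sample payload is made up for illustration, and the exact shape (a list of entries with serviceName and serviceState) is an assumption you should confirm against your own Get Instance View output.

```python
from typing import Optional


def automatic_repairs_state(instance_view: dict) -> Optional[str]:
    """Return the serviceState of the AutomaticRepairs entry, or None if absent."""
    for service in instance_view.get("orchestrationServices", []):
        if service.get("serviceName") == "AutomaticRepairs":
            return service.get("serviceState")
    return None


# Hypothetical sample payload, not real output from the "FE" scale set.
sample_instance_view = {
    "orchestrationServices": [
        {"serviceName": "AutomaticRepairs", "serviceState": "Suspended"}
    ]
}

state = automatic_repairs_state(sample_instance_view)
if state != "Running":
    print(f"Automatic repairs are not running (state: {state})")
```

A result of None would suggest that automatic repairs have never been enabled on the scale set.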
Verify health monitoring configuration
- If all instances are Unhealthy, this often indicates a misconfigured health probe rather than a real outage.
- Confirm whether the VMSS is using the Application Health extension or Load Balancer health probes.
- For either method, ensure the configured endpoint returns at least a 2xx HTTP(S) status or a successful TCP handshake. Any other response (including unreachable) will mark instances Unhealthy.
- Validate protocol, port, and path (for example, /) configured under Health and repair for the VMSS, and that the application in each instance is listening and responding correctly.
Check for provisioning failures
- Use Get Instance View on the VMSS with API version 2019-12-01 or later and inspect virtualMachine.statusesSummary.
- If any instance shows ProvisioningState/failed, automatic instance repairs will not fix those VMs. Manually:
  - Remove the failed instance from the scale set.
  - Increase capacity (or add a new instance) so the scale set recreates a healthy VM.
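
To illustrate the statusesSummary inspection, here is a small Python sketch that counts failed instances in an already-parsed instance view. The sample data is invented, and the assumption that each summary entry carries a code and a count should be checked against your actual response.

```python
def failed_instance_count(instance_view: dict) -> int:
    """Sum the counts of statusesSummary entries with a failed provisioning state."""
    summary = instance_view.get("virtualMachine", {}).get("statusesSummary", [])
    return sum(
        entry.get("count", 0)
        for entry in summary
        if entry.get("code", "").lower() == "provisioningstate/failed"
    )


# Hypothetical sample payload for illustration only.
sample_instance_view = {
    "virtualMachine": {
        "statusesSummary": [
            {"code": "ProvisioningState/succeeded", "count": 4},
            {"code": "ProvisioningState/failed", "count": 2},
        ]
    }
}

print(failed_instance_count(sample_instance_view))  # prints 2
```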
Consider grace period and repair batching
- Automatic repairs only act after the configured automaticRepairsPolicy.gracePeriod (10–90 minutes). If instances were recently updated or restarted, wait for the grace period to elapse.
- Repairs are batched (max 5% of instances at a time, or one at a time if fewer than 20 instances). If many instances are Unhealthy, full recovery can take time.
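
To get a feel for how long recovery might take, the batching rule above can be sketched in Python as follows. The function names are hypothetical, and rounding 5% down (with a floor of one instance) is an assumption, since the exact rounding is not stated here.

```python
import math


def repair_batch_size(instance_count: int) -> int:
    """Estimate how many instances are repaired per batch."""
    if instance_count <= 0:
        return 0
    if instance_count < 20:
        # Fewer than 20 instances: one instance at a time.
        return 1
    # Otherwise at most 5% of instances (rounding down is an assumption).
    return max(1, instance_count * 5 // 100)


def batches_needed(instance_count: int, unhealthy_count: int) -> int:
    """Estimate how many repair batches it takes to cover all unhealthy instances."""
    size = repair_batch_size(instance_count)
    if size == 0 or unhealthy_count <= 0:
        return 0
    return math.ceil(unhealthy_count / size)


print(repair_batch_size(10))    # prints 1
print(repair_batch_size(100))   # prints 5
print(batches_needed(100, 100)) # prints 20
```

With every instance unhealthy, as in this thread, the number of batches is roughly the instance count divided by the batch size, so a full recovery through automatic repairs alone can be slow.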
Reconfigure or enable automatic repairs via portal (if needed)
- In the Azure portal, open the VMSS → Settings → Health and repair.
- Enable Monitor application health and configure either:
  - Application Health extension: set protocol (HTTP/HTTPS/TCP), port, and health path; ensure the endpoint returns 200 OK when healthy.
  - Load balancer probe: select or create a probe with correct protocol, port, and path.
- Turn Automatic repairs to On, set an appropriate Grace period (min), and save.
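
If you configure the Application Health extension outside the portal, the same three values (protocol, port, health path) end up in the extension settings. The Python sketch below builds and validates such a settings dict; the helper name is hypothetical, and the key names protocol, port, and requestPath, as well as the rule that a path only applies to HTTP/HTTPS, are assumptions to verify against the health extension documentation linked in Answer 1.

```python
from typing import Optional


def build_health_extension_settings(
    protocol: str, port: int, request_path: Optional[str] = None
) -> dict:
    """Build a settings dict for the Application Health extension (illustrative)."""
    protocol = protocol.lower()
    if protocol not in ("http", "https", "tcp"):
        raise ValueError(f"Unsupported protocol: {protocol}")
    if not 1 <= port <= 65535:
        raise ValueError(f"Port out of range: {port}")

    settings = {"protocol": protocol, "port": port}
    if protocol in ("http", "https"):
        # A request path is only meaningful for HTTP(S) probes; default to "/".
        settings["requestPath"] = request_path or "/"
    return settings


print(build_health_extension_settings("HTTP", 80, "/health"))
# prints {'protocol': 'http', 'port': 80, 'requestPath': '/health'}
```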
Re-run or resume automatic OS image upgrade after health is fixed
- Once instances start reporting Healthy and automatic repairs are functioning, re-trigger or resume the automatic OS image upgrade.
- If upgrade errors persist, use Get Instance View, Rolling Upgrades – Get Latest, and Get OS Upgrade History to inspect detailed error codes such as MaxUnhealthyInstancePercentExceededInRollingUpgrade or MaxUnhealthyUpgradedInstancePercentExceededInRollingUpgrade, and adjust health thresholds or fix underlying app issues accordingly.
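
To see why a fully unhealthy scale set trips these errors, here is a simplified Python sketch of a threshold check. The function names are hypothetical, the default of 20 percent is an assumption about the rolling upgrade policy (check the value set on your scale set), and the strict greater-than comparison is a guess at how the platform evaluates it.

```python
def unhealthy_percent(healthy_count: int, unhealthy_count: int) -> float:
    """Percentage of instances currently reporting Unhealthy."""
    total = healthy_count + unhealthy_count
    if total == 0:
        return 0.0
    return 100.0 * unhealthy_count / total


def exceeds_unhealthy_threshold(
    healthy_count: int,
    unhealthy_count: int,
    max_unhealthy_instance_percent: float = 20.0,  # assumed default
) -> bool:
    """Return True if the unhealthy share is above the configured threshold."""
    return unhealthy_percent(healthy_count, unhealthy_count) > max_unhealthy_instance_percent


print(exceeds_unhealthy_threshold(0, 5))  # prints True
print(exceeds_unhealthy_threshold(9, 1))  # prints False
```

When every instance is unhealthy the share is 100 percent, so no realistic threshold lets the upgrade proceed; fixing the health signal is the real remedy rather than raising the threshold.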

These steps should restore instance health, allow automatic repairs to replace or reimage bad instances, and unblock the auto upgrade process in the GME prod environment.
