Troubleshooting vSAN Witness Node Isolation

This week I had a fun one with vSAN stretched clusters. After a failover test, the witnesses of two stretched clusters both stopped working.

Let the troubleshooting begin

First, look at the corresponding KB article: Troubleshooting vSAN Witness Node Isolation (2150433).

Symptoms
  • A vSAN Witness Node (Virtual or Physical) is Isolated.
    To confirm witness node isolation run the command: esxcli vsan cluster get
    If the output of the command returns:
    Sub-Cluster Member Count: 1
    Local Node State: STANDALONE

    Then the Witness is confirmed to be isolated from the vSAN Cluster.
  • The vSAN Witness Host cannot form a Cluster with the remaining vSAN Data Nodes in a Stretched Configuration.
  • Pinging the Witness Host from a vSAN ESXi fails
  • Pinging an ESXi host from a Witness works, but not with a Full TCP Frame
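
The isolation check from the KB symptoms can be scripted. Below is a minimal sketch in Python that looks for the two fields quoted above in the text output of esxcli vsan cluster get. The function names and the sample text are my own illustration. The sample holds only the two lines quoted from the KB, while real output contains more fields.

```python
def parse_cluster_get(output: str) -> dict:
    """Turn the 'Key: Value' lines of `esxcli vsan cluster get` into a dict."""
    fields = {}
    for line in output.splitlines():
        # Split on the first colon only, so values containing ':' stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_witness_isolated(output: str) -> bool:
    """True when both isolation markers named in the KB article are present."""
    fields = parse_cluster_get(output)
    return (
        fields.get("Sub-Cluster Member Count") == "1"
        and fields.get("Local Node State") == "STANDALONE"
    )


# Illustrative sample: only the two fields quoted in the KB article.
SAMPLE = """\
   Sub-Cluster Member Count: 1
   Local Node State: STANDALONE
"""

if __name__ == "__main__":
    print(is_witness_isolated(SAMPLE))
```

To use it, save the command output on the witness and pass the text to is_witness_isolated().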

After running all of those tests, and with every one of them passing, I was still scratching my head over why the witness was isolated and the cluster was split into 2 partitions.

It formed a cluster just fine. Pinging everything over the correct VMkernel interface (vmk) worked. The routes were all there.
The unicast agent list showed all hosts, including the witness.

So why was it still isolated? What was I missing? It worked before the failover.
I even redeployed a new witness on the same physical witness host, and it still did not work!

Something the KB article does not mention

All the tests in the KB article on vSAN Witness Node Isolation only test TCP, not UDP.
And the vSAN Clustering Service uses UDP!

TCP and UDP ports for the VMware vSAN network:

Port                 Protocol  Source           Destination   Service
12345, 23451, 12321  UDP       ESXi hosts       ESXi hosts    vSAN Clustering Service
                               ESXi hosts       vSAN Witness
                               vSAN Witness     ESXi hosts
2233                 TCP       ESXi hosts       ESXi hosts    vSAN Transport: used for storage I/O
                               ESXi hosts       vSAN Witness
                               vSAN Witness     ESXi hosts
8080                 TCP       ESXi hosts       ESXi hosts    vSAN Management Service
                               vCenter          ESXi hosts    VMware vSphere Profile-Driven Storage Service and vSAN Management Service
3260                 TCP       iSCSI initiator  ESXi hosts    Default iSCSI port for the vSAN iSCSI target service
5001                 UDP       ESXi hosts       ESXi hosts    Vsanhealth-multicasttest: vSAN Health Proactive Network test. This port is enabled on demand when the Proactive Network Test is running.
8010                 TCP       Web browser      vCenter       vSAN Observer default port for live statistics. A custom port can also be specified for vSAN Observer.
80                   TCP       ESXi hosts       ESXi hosts    vSAN Performance Service

A blank cell repeats the value from the row above.
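
To turn the port list into a checklist for an ACL review, it helps to filter it down to the flows that touch the witness. The sketch below encodes only the rows that involve the witness, plus the host-to-host clustering row. The Flow type and the witness_flows helper are hypothetical names. It also assumes that all three UDP ports apply to every direction of the vSAN Clustering Service, which the flattened table does not make fully clear.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    ports: tuple
    protocol: str
    source: str
    destination: str
    service: str


# Assumption: all three clustering ports apply to each direction.
CLUSTERING_PORTS = (12345, 23451, 12321)

FLOWS = [
    Flow(CLUSTERING_PORTS, "UDP", "ESXi hosts", "ESXi hosts", "vSAN Clustering Service"),
    Flow(CLUSTERING_PORTS, "UDP", "ESXi hosts", "vSAN Witness", "vSAN Clustering Service"),
    Flow(CLUSTERING_PORTS, "UDP", "vSAN Witness", "ESXi hosts", "vSAN Clustering Service"),
    Flow((2233,), "TCP", "ESXi hosts", "vSAN Witness", "vSAN Transport"),
    Flow((2233,), "TCP", "vSAN Witness", "ESXi hosts", "vSAN Transport"),
]


def witness_flows(protocol: str) -> list:
    """Return the flows of one protocol that start or end at the witness."""
    return [
        flow
        for flow in FLOWS
        if flow.protocol == protocol
        and "vSAN Witness" in (flow.source, flow.destination)
    ]


if __name__ == "__main__":
    for flow in witness_flows("UDP"):
        ports = ", ".join(str(p) for p in flow.ports)
        print(f"{flow.protocol} {ports}: {flow.source} -> {flow.destination}")
```

Each printed line is one rule to look for in the ACL on the witness network path.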

After deploying a new witness in a new network and on another host, it came up instantly. That pointed me in the direction that it was still a network issue!
In the end, the problem was that UDP was disabled in the ACL on the vSAN witness switch as part of network hardening. It had kept working until then because the existing UDP connection stayed open at all times until the failover happened. After that, UDP was blocked and the vSAN Clustering Service died.

So if you run into a similar issue with a vSAN Witness, check UDP traffic!
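
One way to check whether UDP gets through a network path at all is a small echo probe. The sketch below is a generic Python example, not a vSAN tool: it does not speak the vSAN clustering protocol, and the names udp_echo_once and udp_probe are my own. The idea is to run the echo side on a test machine in one network and the probe on a test machine in the other, using a UDP port from the list above. The self-test at the bottom only uses loopback.

```python
import socket
import threading


def udp_echo_once(sock: socket.socket) -> None:
    """Wait for one datagram on an already bound socket and send it back."""
    data, peer = sock.recvfrom(2048)
    sock.sendto(data, peer)


def udp_probe(host: str, port: int, payload: bytes = b"udp-probe",
              timeout: float = 2.0) -> bool:
    """Send one datagram and report whether the same payload came back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(payload, (host, port))
            data, _ = sock.recvfrom(2048)
        except OSError:
            # Covers a timeout as well as an ICMP "port unreachable" error.
            return False
        return data == payload


if __name__ == "__main__":
    # Loopback self-test: bind an echo socket on a free port, then probe it.
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    server.bind(("127.0.0.1", 0))
    port = server.getsockname()[1]
    worker = threading.Thread(target=udp_echo_once, args=(server,), daemon=True)
    worker.start()
    print(udp_probe("127.0.0.1", port))
    worker.join(timeout=2)
    server.close()
```

A False result only says that no echo came back. That can mean a blocked path, but also that no echo listener was running on the other side, so confirm the listener first.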
