Troubleshooting vSAN Witness Node Isolation

This week I had a fun one with vSAN stretched clusters. After a failover test, the witnesses of two stretched clusters both stopped working.

Let the troubleshooting begin

First, you look at the corresponding KB article: Troubleshooting vSAN Witness Node Isolation (2150433).

Symptoms
  • A vSAN Witness Node (Virtual or Physical) is Isolated.
    To confirm witness node isolation run the command: esxcli vsan cluster get
    If the output of the command returns:
    Sub-Cluster Member Count: 1
    Local Node State: STANDALONE

    Then the Witness is confirmed to be isolated from the vSAN Cluster.
  • The vSAN Witness Host cannot form a Cluster with the remaining vSAN Data Nodes in a Stretched Configuration.
  • Pinging the Witness Host from a vSAN ESXi fails
  • Pinging an ESXi host from a Witness works, but not with a Full TCP Frame
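
The isolation check from the KB can be scripted. Below is a minimal sketch that parses the text output of esxcli vsan cluster get and tests for the two values quoted above. The function names are my own, and the sketch assumes the output consists of simple "Key: Value" lines; it has not been checked against every ESXi release.

```python
def parse_cluster_get(output):
    """Turn the 'Key: Value' lines of `esxcli vsan cluster get` into a dict.

    Lines without a colon (such as a heading line) are ignored.
    """
    fields = {}
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_isolated(output):
    """Return True when the output matches the isolation symptoms in the KB:
    a member count of 1 and a local node state of STANDALONE."""
    fields = parse_cluster_get(output)
    return (
        fields.get("Sub-Cluster Member Count") == "1"
        and fields.get("Local Node State") == "STANDALONE"
    )
```

Save the command output on the witness, pass the text to is_isolated(), and the result tells you whether the KB symptoms apply.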

After running all those tests, and with all of them passing, I was still scratching my head over why the witness was isolated, with the cluster split into 2 partitions.

It formed a cluster just fine… Pinging all objects worked on the correct VMK. Routes were all there.
The unicastagent list showed all hosts, including the witness.

So why is it still isolated? What am I missing here? It worked before the failover…
I even redeployed a new witness on the same physical witness host and it still did not work!

Something the KB article does not mention

The tests in the KB article on vSAN Witness Node Isolation only cover ping and TCP, not UDP…
The vSAN Clustering Service uses UDP!

TCP and UDP ports for the VMware vSAN network:

Port                | Protocol | Source          | Destination  | Service
--------------------+----------+-----------------+--------------+---------------------------------------------
12345, 23451, 12321 | UDP      | ESXi hosts      | ESXi hosts   | vSAN Clustering Service
                    |          | ESXi hosts      | vSAN Witness |
                    |          | vSAN Witness    | ESXi hosts   |
2233                | TCP      | ESXi hosts      | ESXi hosts   | vSAN Transport: used for storage IO
                    |          | ESXi hosts      | vSAN Witness |
                    |          | vSAN Witness    | ESXi hosts   |
8080                | TCP      | ESXi hosts      | ESXi hosts   | vSAN Management Service
                    |          | vCenter         | ESXi hosts   | VMware vSphere Profile-Driven Storage Service and vSAN Management Service
3260                | TCP      | iSCSI initiator | ESXi hosts   | Default iSCSI port for the vSAN iSCSI target service
5001                | UDP      | ESXi hosts      | ESXi hosts   | Vsanhealth-multicasttest: vSAN Health Proactive Network test. This port is enabled on demand when the Proactive Network Test is running.
8010                | TCP      | Web browser     | vCenter      | vSAN observer default port number for live statistics. A custom port number can also be specified for vSAN observer.
80                  | TCP      | ESXi hosts      | ESXi hosts   | vSAN Performance Service

After deploying a new witness in a new network and on another host, it came up instantly. So this pointed me in the direction that it was still a network issue!
The problem eventually turned out to be that UDP was disabled in the ACL on the vSAN witness switch due to network hardening. It kept working before because the UDP connection had been open the whole time until the failover happened. After that, UDP was blocked and hence the vSAN Clustering Service died.

So if you run into a similar issue with a vSAN Witness, check UDP traffic!
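
One way to check whether UDP gets through at all is a simple echo probe between two test machines sitting in the same networks as the data hosts and the witness. The sketch below is only an illustration under some assumptions: the function names are made up, and because the real vSAN ports are already bound by the vSAN service on the hosts themselves, you would either run it on separate test machines or pick a free test port. If the ACL filters per port, a probe on another port is not a perfect stand-in for the real rule.

```python
import socket
import threading


def open_udp_listener(bind_host="0.0.0.0", port=0):
    """Bind a UDP socket; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_host, port))
    return sock


def echo_once(sock, timeout=5.0):
    """Wait for one datagram and send it straight back to the sender.

    Returns True if a datagram arrived before the timeout.
    """
    sock.settimeout(timeout)
    try:
        data, peer = sock.recvfrom(2048)
    except OSError:  # includes socket.timeout
        return False
    sock.sendto(data, peer)
    return True


def udp_probe(host, port, payload=b"udp-probe", timeout=2.0):
    """Send one datagram and report whether the same payload came back.

    False means no echo arrived in time: the datagram or the reply was
    dropped or blocked, or nothing is listening on the other side.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(payload, (host, port))
            data, _ = sock.recvfrom(2048)
        except OSError:  # timeout or ICMP-triggered error
            return False
        return data == payload
```

On the listening side you would call open_udp_listener() with the chosen port and then echo_once() on the returned socket; on the sending side udp_probe() with the listener's address. A TCP test passing while this probe fails in one or both directions is a strong hint that an ACL or firewall treats UDP differently.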
