Today I was responding to an e-mail from a systems engineer in the VMware Partner Community about an issue they recently faced in a customer's VMware infrastructure. It concerns the handling of storage LUNs and datastores on vSphere 4.x and 5.x platforms.
Here is the issue:
"We have some unused data stores so we un-mounted those from all ESXi hosts and deleted the LUNs from storage. After that, the deleted datastores goes in a dead path state. One of the hosts in a cluster automatically disconnected from the vCenter. We re-scan HBA and management services; however the dead paths still appear in the vCenter. We are not able to do vMotion and our HA master agent is also not working. How do we fix this??"
I have been asked similar questions time and again by customers, partners and co-workers, so I thought I should share this information with the larger community in this blog post.
How do you end up in this situation?
Once you create a VMFS datastore on a storage LUN, the state of the datastore and its associated LUN is saved in the storage configuration of the ESXi kernel (the vmkernel). From then on, it is the responsibility of the vmkernel to ensure that the VMFS datastore and the associated LUN are always available to the ESXi host.
To ensure availability, the vmkernel sends I/O requests (SCSI commands) to each datastore every few seconds and expects a response confirming that the datastore is up and running. This mechanism lets the host ride out transient storage conditions so that the LUNs and datastores remain available to the virtual machines.
Now consider the situation described above, where you have un-mounted, un-presented and destroyed the LUNs. ESXi cannot tell whether you un-presented these LUNs deliberately or whether it lost them because of a technical failure. Because its default safety behavior is to try to recover lost LUNs, it starts sending requests to bring the missing LUNs back ONLINE. Unfortunately, these LUNs no longer exist, so the requests from the ESXi hosts are never answered.
Because these requests go unanswered and become stale, they cause the hostd service to hang. Hostd keeps track of all the agent-based services and resources available to ESXi. As soon as hostd is in trouble, the vpxa agent, the HA agents and hostd itself start falling apart, causing the host to disconnect from vCenter Server. No vCenter connection means no vMotion and so on.
From the above description, you can see the cascading effect of un-presenting LUNs from ESXi servers without following the proper procedure, which is documented in a VMware KB article. Please note that this issue can surface a day, weeks or even a month after the storage LUNs were un-presented.
How to solve this
Unfortunately, you are already in the situation where the hostd service is crashing. You will have to reboot all the ESXi servers that show APD (All Paths Down) or PDL (Permanent Device Loss) warning messages in the vmkernel / messages logs.
To read more on this, you can refer to the following KB article: Permanent Device Loss (PDL) and All-Paths-Down (APD) in vSphere 5.0
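To find out which hosts need that reboot, you can scan a copy of each host's vmkernel / messages log for APD and PDL warnings. The Python sketch below is only an illustration: the function name and the list of search terms are my own assumptions, and the exact wording of the warnings varies between ESXi versions, so adjust the pattern to what you actually see in your logs.

```python
import re
from typing import Iterable, List

# Assumed search terms: the abbreviations and their expansions as used in
# this post. Real log wording differs between ESXi versions.
STORAGE_LOSS_PATTERN = re.compile(
    r"\b(APD|PDL|All[- ]Paths[- ]Down|Permanent Device Loss)\b",
    re.IGNORECASE,
)


def find_storage_loss_lines(lines: Iterable[str]) -> List[str]:
    """Return the log lines that mention an APD or PDL condition."""
    return [
        line.rstrip("\n")
        for line in lines
        if STORAGE_LOSS_PATTERN.search(line)
    ]
```

You could call it with an open file handle, for example `find_storage_loss_lines(open(path_to_log_copy))`, where `path_to_log_copy` is wherever you saved the log. A host whose log returns matches is a candidate for the reboot described above.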
How to avoid this in the future
To avoid getting into such a situation in the future, please follow the proper procedure described in the KB article: Un-mounting a LUN or Detaching a Datastore/Storage Device from multiple ESXi 5.x hosts
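The KB article remains the authoritative reference. As a rough summary of the order of operations on ESXi 5.x, the Python sketch below builds a checklist: unmount the datastore on every host, detach the device on every host, only then un-present the LUN on the array, and finally rescan. The function name, host names, datastore label and NAA ID are hypothetical placeholders, and the esxcli commands are written as I recall them from the ESXi 5.x command line, so please verify them against the KB before running anything. The sketch only prints the plan; it does not execute any command.

```python
import shlex
from typing import List, Tuple

# The array-side step is performed with your storage vendor's tools, not esxcli.
ARRAY_STEP = "un-present the LUN on the storage array"


def decommission_plan(
    datastore_label: str, naa_id: str, hosts: List[str]
) -> List[Tuple[str, str]]:
    """Return (where, action) pairs in the order they should be carried out."""
    label = shlex.quote(datastore_label)
    device = shlex.quote(naa_id)
    plan: List[Tuple[str, str]] = []

    # 1. Unmount the datastore on every host that can see it.
    for host in hosts:
        plan.append((host, f"esxcli storage filesystem unmount -l {label}"))

    # 2. Detach the underlying device on every host, so ESXi knows the
    #    removal is intentional.
    for host in hosts:
        plan.append(
            (host, f"esxcli storage core device set --state=off -d {device}")
        )

    # 3. Only now remove the LUN on the storage side.
    plan.append(("storage array", ARRAY_STEP))

    # 4. Rescan so each host drops the paths cleanly.
    for host in hosts:
        plan.append((host, "esxcli storage core adapter rescan --all"))

    return plan


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    for where, action in decommission_plan(
        "old_datastore", "naa.EXAMPLE", ["esx01", "esx02"]
    ):
        print(f"{where}: {action}")
```

The point of the ordering is that every host is told about the removal before the LUN disappears, which is exactly the information ESXi was missing in the scenario described above.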
I hope this helps you understand the basic concepts of how storage LUNs need to be treated on an ESXi server, so that storage operations do not affect how your VMware infrastructure performs.