Friday, February 1, 2013

ESXi/ESX host becomes Inaccessible/Not-Responding in vCenter after Un-Presenting Storage LUNs

Today, I was responding to an e-mail from a system engineer in the VMware Partner Community about an issue they recently faced in a customer's VMware infrastructure. It concerns the handling of storage LUNs and datastores on vSphere 4.x and 5.x platforms.

Here is the issue:

"We have some unused data stores so we un-mounted those from all ESXi hosts and deleted the LUNs from storage. After that, the deleted datastores goes in a dead path state. One of the hosts in a cluster automatically disconnected from the vCenter. We re-scan HBA and  management services; however the dead paths still appear in the vCenter. We are not able to do vMotion and our HA master agent is also not working. How do we fix this??"

I have been asked similar questions time and again by customers, partners and co-workers, so I thought I should share this information with the larger community in this blog post.

How do you end up in this situation?

Once you create a VMFS datastore on a storage LUN, the state of this datastore and the associated LUN is saved in the storage configuration of the ESXi kernel. From then on, it is the responsibility of the vmkernel to ensure that the VMFS datastore and the associated LUN are always available to the ESXi hosts.

To ensure availability, the vmkernel sends I/O requests (SCSI commands) to each datastore every few seconds and expects a response confirming that the datastore is up and running. This mechanism lets ESXi ride out transient storage conditions and keep the LUNs and datastores available to the virtual machines.

Now consider the situation described above, where you have un-mounted, un-presented and destroyed the LUNs. ESXi cannot tell whether you un-presented these LUNs yourself or whether it lost them due to a technical failure. Because its default safety behavior is to try to recover missing LUNs, it keeps sending requests to bring them back online. Unfortunately, these LUNs no longer exist, so the requests from the ESXi hosts are never honored.

Since these requests go stale, they cause the hostd service to hang. Hostd keeps track of all the agent-based services and resources available to ESXi. As soon as the hostd service is in trouble, the vpxa agent, the HA agents and hostd itself start falling apart, causing the host to disconnect from the vCenter Server. No vCenter connection means no vMotion, and so on.

From the above description, you can see the cascading effect of un-presenting LUNs from the ESXi servers without following the proper procedure, which is documented in a VMware KB article. Please note that this issue can surface within a day, weeks or even a month of un-presenting the storage LUNs.

How to solve this

Unfortunately, if the hostd service is already crashing, you have already hit this situation. You will have to reboot all the ESXi servers which are showing APD (All Paths Down) or PDL (Permanent Device Loss) warning messages in the vmkernel / messages logs.
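To get a quick idea of which devices the logs are complaining about, something like the following Python sketch could help. It scans log text for NAA identifiers on lines that mention APD or PDL. The keyword pattern and the sample lines are illustrative assumptions, not actual ESXi log output, so adjust them to the messages your version really logs.

```python
import re
from collections import Counter

# NAA device identifiers are "naa." followed by hexadecimal digits.
NAA_RE = re.compile(r"\bnaa\.[0-9a-fA-F]+\b")

# Assumed keywords; tune this to the wording in your own vmkernel logs.
TROUBLE_RE = re.compile(
    r"\b(apd|pdl|all[- ]paths[- ]down|permanent device loss)\b",
    re.IGNORECASE,
)


def affected_devices(log_lines):
    """Count how often each NAA ID appears on a line that mentions APD/PDL."""
    counts = Counter()
    for line in log_lines:
        if TROUBLE_RE.search(line):
            counts.update(naa.lower() for naa in NAA_RE.findall(line))
    return counts


if __name__ == "__main__":
    # Hypothetical lines, made up for illustration only.
    sample = [
        "example: device naa.6000000000000000000000000000aaaa entered APD state",
        "example: naa.6000000000000000000000000000aaaa reported PDL",
        "example: heartbeat fine on naa.6000000000000000000000000000bbbb",
    ]
    for naa, hits in affected_devices(sample).most_common():
        print(naa, hits)
```

On a real host you would feed it the lines of a copied vmkernel log file instead of the sample list.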

To read more on this you can refer to the following KB Article - Permanent Device Loss (PDL) and All-Paths-Down (APD) in vSphere 5.0

How to avoid this in the future

To avoid getting into such a situation in the future, please follow the proper procedure mentioned in the KB article: Un-mounting a LUN or Detaching a Datastore/Storage Device from multiple ESXi 5.x hosts
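The essence of that procedure is the ordering: unmount the datastore and detach the device on every host first, only then un-present the LUN on the array, and rescan afterwards. The Python sketch below simply builds a checklist in that order. The host names, datastore label and NAA ID are placeholders, and the esxcli commands are the ESXi 5.x forms as far as I know, so please verify them against the KB article for your version before running anything.

```python
def removal_plan(hosts, datastore_label, naa_id):
    """Return (where, action) steps in the order they should be carried out."""
    steps = []
    # 1. Unmount the datastore on every host that can see it.
    for host in hosts:
        steps.append((host, f"esxcli storage filesystem unmount -l {datastore_label}"))
    # 2. Detach the underlying device on every host.
    for host in hosts:
        steps.append((host, f"esxcli storage core device set --state=off -d {naa_id}"))
    # 3. Only now should the storage administrator remove the LUN.
    steps.append(("storage array", f"un-present LUN {naa_id} from all hosts"))
    # 4. Rescan so the hosts drop the device cleanly.
    for host in hosts:
        steps.append((host, "esxcli storage core adapter rescan --all"))
    return steps


if __name__ == "__main__":
    # Placeholder values for illustration only.
    plan = removal_plan(["esx01", "esx02"], "old_datastore", "naa.6000000000000000000000000000aaaa")
    for number, (where, action) in enumerate(plan, start=1):
        print(f"{number}. [{where}] {action}")
```

The sketch only prints the checklist; it deliberately does not connect to any host.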


I hope this helps you understand the basic concepts of how storage LUNs need to be treated on an ESXi server, so that storage operations do not affect how your VMware infrastructure performs.

5 comments:

  1. Hi Sunny,
    Thanks for the blog, you've got some good stuff here.

    I wanted to let you know that you may have an issue with your RSS feeds. I just tried to add it to my RSS reader (feedburner) and it threw this error "Unable to download http://feeds.feedburner.com/blogspot/wunQMP. Please make sure the URL is correct"

    I haven't had any trouble with other recent subscriptions.

    Cheers!
    GS

    Replies
    1. Could you please try subscribing to the RSS feeds again.. Somehow the url which was saved was bad... It should work now..

  2. Thanks buddy.. I will look into it.. In the meantime, please use the subscribe by email option and you should be able to get all the latest articles...

    Regards,
    Sunny

  3. I would like to share something here which might help your readers.
    If ESX hosts get disconnected after LUN removal, immediately get your storage administrator to re-present that LUN. In case you are not aware of the LUN ID, read it from the vmkernel logs and look for the NAA ID that hostd keeps looking for. It is very easy to locate, as the vmkernel continuously keeps looking for it.

    The longer the delay in re-presenting this LUN, the greater the chance of losing the ESXi host and its VMs. This is more commonly seen in 4.1, but 5.1 does a great job and saves the system administrator's day.
    Please follow the article which Sunny has mentioned for 5.1. Thank you, Sunny, for sharing.

  4. Absolutely agreed... As I mentioned, the outstanding I/O that can lead to APD comes either from the user world (hostd, vmkernel; in short, from the ESXi hypervisor) or from virtual machine I/O.

    In case of planned activities, the suggestion in the above comment by @techstarts will help tremendously.

    -Sunny
