The Story
So the other day I tested the latest VMware vCenter patch that was released, as blogged about here.
Then I ran the patch on a client's setup, which was on 6.5 instead of 6.7. I didn't think it would be much different, and in terms of the steps to follow, it wasn't.
First thing to note, though, is to validate the vCenter root password and make sure it isn't expired (on 6.7 U1 and newer); otherwise the updater will tell you the upgrade can't continue.
Log into vCenter (SSH/console). Once in the shell, you can reset the root password if needed:
passwd
To see the expiry status of the account:
chage -l root
To set the root password to never expire (do so at your own risk, and only if your policies allow it):
chage -I -1 -m 0 -M 99999 -E -1 root
Install the patch, and reboot vCenter.
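For reference, if you patch the appliance from the appliance shell rather than the VAMI web UI, the flow is roughly this (a sketch assuming the default VMware online repository is configured and reachable; double-check the flags on your build):
software-packages stage --url --acceptEulas
software-packages list --staged
software-packages install --staged
Once the install finishes, reboot the appliance (the VAMI is the easiest place to do that).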
All is good until…
ERROR: HA Down
So after I logged back into the vCenter server, an older cluster was fine, but a newer cluster with newer hosts showed a couple of errors.
For the cluster itself:
“cannot find vSphere HA master”
For the ESXi hosts:
“Cannot install the vCenter Server agent service”
So off to the internet I go! I also asked people on IRC if they had come across this, and… crickets. I found this blog post, but all of its troubleshooting steps led to no real solution, unfortunately. It was a bit annoying that it said “it could be due to many reasons such as…” and listed them off, with a vCenter update being one of them, but then went through the common, standard troubleshooting steps. Which is nice, but none of them are analytical enough to determine which root cause you're actually dealing with so you can resolve it, instead of throwing darts at a dartboard.
Anyway, I decided to create an SR with VMware and uploaded the logs. Meanwhile I kept looking for an answer, and found this VMware KB.
Funnily enough, the resolution states… “This issue is resolved in vCenter Server 6.5.x, available at VMware Downloads.”
That's ironic: I just updated, and that's what caused this problem, hahaha.
Anyway, my colleague noticed the “workaround”…
“To work around this issue in earlier versions, place the affected host(s) in maintenance mode and reboot them to clear the reboot request.”
I didn't exactly check the logs and wasn't sure if there actually was a pending reboot, but I figured it was worth a shot.
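As an aside, if a host is still responsive, that same workaround can be driven from the ESXi shell instead of the vSphere client. A rough sketch (the reason text is just an example):
esxcli system maintenanceMode set --enable true
esxcli system shutdown reboot --reason "clear pending reboot after vCenter patch"
The shutdown command refuses to run unless the host is already in maintenance mode, which is a nice sanity check.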
The Reboot
So: vMotion all the VMs off the host, no problem; put it into maintenance mode, no problem; send the host for a reboot…
Watching the screen, it's still at the ESXi console login… the monitoring sensors indicate the host is inaccessible, pings are still up, and the Embedded Host Client (EHC) is unresponsive… ugghhhh, ok…
Pressing F2/F12 at the console gets “direct management has been disabled”… like, uhhh, ok…
I found this, a command to force a reboot, but I can't SSH in and I can't access the EHC… so there's no way to enter it…
reboot -n -f
Then I found this with the same problem… the solution: like any computer in a stuck state, hard shutdown. So I pressed the power button for 10-20 seconds, until the server was fully off, then powered it back on.
The Unexpected
At this point I was figuring the usual: it comes back up and shows up in vCenter. Nope; instead the server showed as disconnected in vCenter, in a downed state. I managed to log into the EHC, but found the VMs I had vMotioned off were still on it in a ghosted state. I figured this wouldn't be a problem; after reconnecting to vCenter, it should pick up on the clean state of those VMs being on the other hosts.
Click reconnect host…
Error: failed to login with the vim admin password
Not gonna lie, at this point I got pretty upset. You know, HULK SMASH! type of deal. However, instead of smashing my monitors, which wouldn't have been helpful, I went back to Google.
I found this VMware KB along with this thread post, and pieced together a resolution from both. The main difference was that the KB wanted to reinstall the agents, while in the thread post it seemed most people just needed the services restarted.
So I removed the host from vCenter (Remove from Inventory), removed the ghosted VMs via the EHC, enabled SSH, and restarted the hostd and vpxa services:
/etc/init.d/hostd restart
/etc/init.d/vpxa restart
Then I re-added the host to vCenter and to the cluster, and it worked just fine.
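Side note: the ghosted VMs can also be cleaned up from the shell instead of clicking through the EHC; something like this should do it (the VM ID 42 is just an example pulled from the getallvms output):
vim-cmd vmsvc/getallvms
vim-cmd vmsvc/unregister 42
Unregistering only removes the VM from the host's inventory; it doesn't touch the files on the datastore.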
The Next Server
Alright, so now vMotion all the VMs over to this freshly rebooted host, so we can do the same thing on the other ESXi host to make sure they are both good.
I went to set the host into maintenance mode and reboot it, and sure enough this server hung at the reboot just like the other host. I figured the process was going to be the same here; however, the results actually were not.
This time the host actually did reconnect to vCenter after the reboot, but it was not in maintenance mode… wait, what?
I figured that was weird and decided to give it another reboot, but when I went to put it into maintenance mode, it got stuck at 2%… I was like, ughhhh, wat? The weird part was it even mentioned orphaned/ghosted VMs, so I thought maybe this host had some at this point.
Googling this, I didn't find an answer, and just when I was about to hard reboot the host again (after 20 minutes), it succeeded. I was like, wat?
Then I sent a reboot, which I think took like 5 minutes to apply; all kinds of weird things were happening. While it was rebooting, I disconnected the host from vCenter (not removed it), waited for the reboot, then accessed this host's EHC.
It was at this point I got a bit curious about how you determine if a host needs a reboot, since vCenter didn't tell me and the EHC didn't tell me… How was I supposed to know, considering I didn't install any additional VIBs after deployment? I found this reddit post asking the same question.
Some weird answers, the best being:
vim-cmd hostsvc/hostsummary|grep -i reboot
The thing that really made me raise my brow was this bit of the conversation:
Like, wat?!?!?! Hahaha. Anyway, by this time I had an answer from VMware support, and they simply asked when the error happened, if I had a snippet of the error, and if I had rebooted the vCenter server…
Like, really… ok, don't look at the logs I provided then. So I ignored the email for now to actually fix the problem. At this point I looked at the logs myself for the host I was currently working on, and noticed one entry that is also shown on the summary page of the host:
“Scratch location not set”… well, poop… you can see this KB. After correcting that and rebooting the server again, it seemed to be working perfectly fine.
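For anyone hitting the same thing, the fix in that KB roughly boils down to pointing the scratch location at a persistent directory on a datastore and rebooting (the datastore name and folder below are placeholders for your environment):
mkdir /vmfs/volumes/datastore1/.locker-esxi01
vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string /vmfs/volumes/datastore1/.locker-esxi01
The new scratch location only takes effect after the reboot.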
So I removed it from the inventory, ensured no vpxuser existed on the host, restarted the services, and re-added the host.
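If you are wondering how to check for a leftover vpxuser from the shell, listing the local accounts is enough; removing it is only needed if it is actually still hanging around, since vCenter recreates it when the host is re-added (a sketch):
esxcli system account list
esxcli system account remove --id vpxuser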
Moment of Truth
So, after ALL that, I got down on my knees, put my head down on my chair, locked my hands together, and prayed to some higher power to let this work.
I proceeded to enable HA on the cluster. The process of configuring HA on both hosts lingered at 8% for a while. I took a short walk in preparation for the failure, and to my amazement, it worked!
WOOOOOOOOO!!!
Summary
After this, I'd almost recommend rebooting hosts to validate them before doing a vCenter update, but that's also a bit excessive. So maybe at least run the command above on your ESXi hosts to ensure there's no pending reboot before initiating a vCenter update.
I hope this blog post helps anyone experiencing the same type of issue.