I’ll keep this post short, as I find it a rather interesting turn of events…
I was ensuring config files were being backed up, as good sysadmin posture would suggest, and I noticed an issue on one of the hosts when I went to run the configuration backup: I was hit with a “general system error”. I wish I would have saved the actual error message, but when I went to research it (I grabbed the most specific line from the callback in hopes of getting something), most of the results were people complaining about the same error when trying to vMotion, and simply pointed to restarting the ESXi host’s management agents. I decided to see whether I could vMotion, or whether the host showed any other symptoms, but everything showed green in vCenter and vMotions worked without an issue.
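For reference, this is roughly what I mean by a config backup and by restarting the management agents; a minimal sketch from the ESXi shell, assuming SSH or console access to the host (your backup method may differ):

# Back up the host configuration; this prints a URL to download the backup bundle
# (replace the * in the URL with the host's name or IP)
vim-cmd hostsvc/firmware/backup_config

# The fix most of those search results suggested: restart the management agents
/etc/init.d/hostd restart
/etc/init.d/vpxa restart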
I left it for the night and decided to update the host, since it was running 6.5-U2 and needed to be brought up to 6.5-U3. I was hoping that would resolve the config backup issue (I had an older copy of the config on hand, but wanted the most up-to-date one), so I decided to do the update via an ISO image and a host reboot. Moved VMs, no problem; placed the host into maintenance mode, no issues; sent the host for shutdown (I wasn’t using iLO or iDRAC, so I needed to manually mount the USB hosting the installation media). While I was at the console waiting for the host to show “shutting down”, it simply… didn’t. After 10 minutes, which is far more than generous, I decided to press F2 to get into the console, only to be told… ugghhh, I wish I would have taken a picture of the error, but it was something along the lines of “you can’t use the console because you’ve been locked out”. It was weird. After waiting another 10 minutes with nothing happening (I’m assuming there may have been some actual underlying issue with the datastore holding the scratch?), I decided to just do a hard shutdown (hold power for 5 seconds); I could rebuild this server faster if need be. Powering it back up went perfectly fine through all POST checks. I booted the ESXi 6.5-U3 installer, which came up fine, checked for logical drives, and found all of them, including the SD card holding the OS (scratch had been changed to a persistent location: another datastore on the local host built from multiple disks). I selected it, the installer saw there was an existing installation, and I chose to update it. Successful, reboot. It rebooted fine, but the host didn’t come back into vCenter…
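As an aside, the maintenance mode step can also be done from the host’s shell rather than through vCenter; a minimal sketch, assuming direct console or SSH access:

# Put the host into maintenance mode before the upgrade/reboot
esxcli system maintenanceMode set --enable true

# Confirm the state before rebooting into the installer
esxcli system maintenanceMode get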
Logging into the host directly, I noticed it stated it had 18 VMs, all with a status of error. I had already moved them, and they were still running on another host (thankfully it didn’t do anything stupid and try to hijack them). At this point I called VMware support and opened an SR. Once I had a tech on the line, I discussed my symptoms and issues, and asked what the best course of action was, as I didn’t want to rebuild the host (I could, but time). In this case I simply un-registered all the VMs that were in an error state (as they were no longer associated with the host; the shell equivalent is sketched a little further down), then attempted to re-connect the host. It first failed, complaining about a bad username and password (I’m not sure if this was the root account used to add it, or the vpxuser account vCenter uses to manage the host), but it prompted with the same wizard as adding a new host. Since all the other settings were still fine on the host (networking and vSwitches with VM port groups and VMkernel ports, and all datastores), after this wizard the host was re-added back into the cluster with the latest updates, and running the backup config command:
esxcli storage vmfs extent list
worked without error. So yay! All fixed, but one thing still bothered me… what was the root cause?
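For the un-registering step mentioned above: if you ever need to do the same cleanup from the host’s shell instead of the UI, here’s a minimal sketch (the VM ID is a placeholder; take the real IDs from the getallvms output):

# List every VM registered on this host, with its inventory ID in the first column
vim-cmd vmsvc/getallvms

# Remove a stale/errored VM from the host's inventory; this does not touch the VM's files
# 42 is a placeholder ID taken from the getallvms listing
vim-cmd vmsvc/unregister 42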
I decided to check my scratch settings real quick, and sure enough they held good and point at a redundant datastore on the local ESXi host, so I’m still not sure what the root cause was, but it was fixed. I’m posting this for my own reference, just in case this issue re-arises.
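For my own reference, this is roughly how to check (and, if needed, move) the scratch location from the shell; a minimal sketch, and the datastore path shown is just an example:

# Show where scratch is configured to live, and where it currently is
esxcli system settings advanced list -o /ScratchConfig/ConfiguredScratchLocation
esxcli system settings advanced list -o /ScratchConfig/CurrentScratchLocation

# Point scratch at a persistent datastore location (example path; takes effect after a reboot)
esxcli system settings advanced set -o /ScratchConfig/ConfiguredScratchLocation -s /vmfs/volumes/datastore1/.locker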