Veeam: NFC storage connection is unavailable.
Failed to create NFC download stream.

I’ll keep this post short as well.

Run replication Job… ERROR. Check error, huh haven’t this one before…

4/16/2021 11:22:16 AM :: Processing VMName Error: NFC storage connection is unavailable. Storage: [stg:datastore-3,nfchost:host-2,conn:vcenter]. Storage display name: [ESXi-Local-Datastore].
Failed to create NFC upload stream. NFC path: [nfc://conn:vCenter,nfchost:host-23982,stg:datastore-23983@VMName_replica/VMName.vmx].

To note on this, I did some changes, I changed a route between sites (as I needed to reduce a entire network from being improperly routed, but some of services still required access from the main location, thus some dedicated /32 routes were put in their place).

I had also just patched the host in question and was testing jobs after the patch. Since I wasn’t exactly sure which was the cause. I decided to do regular troubleshooting to get more details to the root cause.

I love Veeam, they got a nice KB to help with this. So I followed along, checking the main Veeam server log areas didn’t have the log file in question, so was pretty confident it was still using the proxy at the alt site.

Checking the log as mentioned by the KB, sure enough the same error line showed up which indicated it was DNS related. Checking the proxie’s DNS settings…. DOH. It was using a server within the routes I had removed, and didn’t create a dedicated /32 for, as I wasn’t expecting any systems here to need to communicate to that subnet.

Now that I know what the issue is… this feels familiar… oh yeah the Veeam Soap Fault issue I had to deal with

The funny part about this is… 1) it’s the same server/proxy 2) Again DNS related 3) Again going to stick with host file to avoid dependencies on DNS servers

In my case the error showed the ESXi server by the hostname WAS fully qualified, but access to a DNS server to resolve it was unavailable. As soon as I saw this I had two options:

  1. Create a route to allow the Proxy to reach required DNS servers (which won’t be available in a DR case) OR
  2. Just add a static record in the Proxy host file. (DNS server not required, but if hostname or IP changes needs to be adjusted manually here)

As you can see I have the exactly 2 options similar to my first post, the difference is now it was fully dependent on DNS. Since this is a self hosted instance of a Veeam proxy, there’s a good chance DNS access might not be available when time comes, so this option was chosen.

It’s important to note that when these types of choices are made it is well documented WHY they were made.

So in this case… to resolve it I added a record in the Proxy’s local host file

172.x.x.x     ESXi.domain.postfix

You may notice that the ESXi hostname is not within the initial error, it only tells your the datastore, the Veeam logs will indicate which lookup failed. More than likely the hostname look up for the ESXi host in which the VM will be created on.

I really hope this post helps someone. Honestly I just followed the Veeam KB which was a great source reference to troubleshooting the issue. Your case maybe different, depending on the root cause your resolution maybe different then what was discussed here.

Cheers stay safe everyone.

VMware HA down after 6.5 patch

The Story

So the other day I tested the latest VMware patch that was released as blogged about here.

Then I ran the patch on a clients setup which was on 6.5 instead of 6.7. Didn’t think would be much different and in terms of steps to follow it wasn’t.

First thing to note though is validating the vCenter root password to ensure it isn’t expired. (On 6.7u1 a newer)Else the updater will tell you the upgrade can’t continue.

Logged into vCenter (SSH/Console) once in the shell:

passwd

To see the status of the account.

chage -l root

To set the root password to never expire (do so at your own risk, or if allowed by policies)

chage -I -1 -m 0 -M 99999 -E -1 root

Install patch update, and reboot vCenter.

All is good until…

ERROR: HA Down

So after I logged into the vCenter server, an older cluster was fine, but a newer cluster with newer hosts showed a couple errors.

For the cluster itself:

“cannot find vSphere HA master”

For the ESXi hosts

“Cannot install the vCenter Server agent service”

So off to the internet I go! I also ask people on  IRC if they have come across this, and crickets. I found this blog post, and all the troubleshooting steps lead to no real solution unfortunately. It was a bit annoying that “it could be due to many reason such as…” and list them off with vCenter update being one of them, but then goes throw common standard troubleshooting steps. Which is nice, but non of them are analytical to determine which of the root causes caused it, as to actual resolve it instead of “throwing darts at a dart board”.

Anyway I decided to create an SR with VMware, and uploaded the logs. While I kept looking for an answer, and found this VMware KB.

Which funny the resolution states… “This issue is resolved in vCenter Server 6.5.x, available at VMware Downloads.”

That’s ironic, I Just updated to cause this problem, hahaha.

Anyway, my Colleague notices the “work around”…

“To work around this issue in earlier versions, place the affected host(s) in maintenance mode and reboot them to clear the reboot request.”

I didn’t exactly check the logs and wasn’t sure if there actually was a pending reboot, but figured it was worth a shot.

The Reboot

So, vMotion all VMs off the host, no problem, put into maintenance mode, no problem, send host for reboot….

Watching screen, still at ESXi console login…. monitoring sensors indicate host is inaccessible, pings are still up and the Embedded Host Controller (EHC) is unresponsive…. ugghhhh ok…..

Press F2/F12 at console “direct management as been disabled” like uhhh ok…

I found this, a command to hard reboot, but I can’t SSH in, and I can’t access the Embedded Host Controller… so no way to enter it…

reboot -n -f

Then found this with the same problem… the solution… like computer in a stuck state, hard shutdown. So pressed the power button for 10-20 seconds, till the server was fully off. Then powered it back on.

The Unexpected

At this point I was figuring the usual, it comes back up, and shows up in vCenter. Nope, instead the server showed disconnected in vcenter, downed state. I managed to log into the Embedded Host Controller, but found the VMs I had vMotion still on it in a ghosted state. I figured this wouldn’t be a problem after reconnecting to vCenter it should pick up on the clean state of those VM’s being on the other hosts.

Click reconnect host…

Error: failed to login with the vim admin password

Not gonna lie, at  this point I got pretty upset. You know, HULK SMASH! Type deal. However instead of smashing my monitors, which wouldn’t have been helpful, I went back to Google.

I found this VMware KB, along with this thread post and pieced together a resolution from both. The main thing was the KB wanted to reinstall the agents, the thread post seemed most people just need the services restarted.

So I removed the host from vCenter (Remove from inventory), also removed the ghosted VM’s via the EHC, enabled SSH, restarted the VPXA and HOSTD services.

/etc/init.d/hostd restart

/etc/init.d/vpxa restart

Then re-added the host to vCenter and to the cluster, and it worked just fine.

The Next Server

Alright now so now vMotion all the VMs to this now rebooted host. So we can do the same thing on the alternative ESXi host to make sure they are all good.

Go to set the host into maintenance mode, and reboot, this server sure enough hangs at the reboot just like the other host. I figured the process was going to be the same here, however the results actually were not.

This time the host actually did reconnect to vCenter after the reboot but it was not in Maintenance mode…. wait what?

I figured that was weird and would give it another reboot, when I went to put it into Maintenance Mode, it got stuck at 2%… I was like ughhhh wat? weird part was they even stated orphaned ghosted VM’s so I thought maybe it had them at this point.

Googling this, I didn’t find of an answer, and just when I was about to hard reboot the host again (after 20 minutes) it succeeded. I was like wat?

Then sent a reboot which I think took like 5 minutes to apply, all kinds of weird were happening. While it was rebooting I disconnected the host from vCenter (not removed), and waited for the reboot, then accessed this hosts EHC.

It was at this point I got a bit curious about how you determine if a host needs a reboot, since the vCenter didn’t tell, and the EHC didn’t tell… How was I suppose to know considering I didn’t install any additional VIBs after deployment… I found this reddit post with the same question.

Some weird answers the best being:

vim-cmd hostsvc/hostsummary|grep -i reboot

The real thing that made me raise my brow was this convo bit:

Like Wat?!?!?! hahaha Anyway, by this time I got an answer from VMware support, and they simply asked when the error happened, and if I had a snippet of the error, and if I rebooted the vCenter server….

Like really…. ok don’t look at the logs I provided. So ignoring the email for now to actually fix the problem. At this point I looked at the logs my self for the host I was currently working on and noticed one entry which should be shown at the summary page of the host.

“Scratch location not set”… well poop… you can see this KB so after correcting that, and rebooting the server again, it seemed to be working perfectly fine.

So removed from the inventory, ensured no VPXuser existed on the host, restarted the services, and re-added the host.

Moment of Truth

So after ALL that! I got down on my knees, I put my head down on my chair, I locked my hands together, and I prayed to some higher power to let this work.

I proceeded to enable HA on the cluster. The process of configuring HA on both host lingered @ 8% for a while. I took a short walk, in preparation for the failure, to my amazement it worked!

WOOOOOOOOO!!!

Summary

After this I’d almost recommend validating rebooting hosts before doing a vCenter update, but that’s also a bit excessive. So maybe at least try the commands on ESXi servers to ensure there’s no pending reboot on ESXi hosts before initiating a vCenter update.

I hope this blog posts helps anyone experiencing the same type of issue.

 

MSI installer – Network Resource Unavailable

The Start

The other day my boss came in with a colleagues laptop and told me that the VPN software failed to update, and by fail to update completely removed the old version and didn’t update to the latest version. Checking the laptops ‘Programs and Features’ sure enough showed no signs of the application.

I simply grabbed the latest version of the software installer and attempt to reinstall the application, yet to my dismay the latest installer complains with the following error ‘The feature you are trying to use is on a network resource that is unavailable.’ How insightful…

I figured it was a registry issue, as the registry is known to hold old settings from applications and do not get cleaned up properly. This place can be a cease pool on old machines that have been constantly upgraded.

The Dig

There are plenty of references online when it comes to this error. The main one people reference is the HKLM/Software/Classes/installer/products.

Even though I cleaned everything in this based on the application I was installing, it was still failing with with same error.

The Answer

Lucky for me it kept specifying what it was expecting for a network path, I decided to search the registry based on this, and found there was a key.

After removing the parent GUID based key, the installer ran successfully.

If my memory serves me correctly I believe the problematic key was under:

HKEY_CLASSES_ROOT\Installer\Products