Fixing [400] An error occurred while sending an authentication request to the vCenter Single Sign-On server

So after the last two blog posts about fixing vCenter 7's access issues stemming from its certificate management workflows, I was greeted with this error when trying to sign into the vCenter web UI.

[400] An error occurred while sending an authentication request to the vCenter Single Sign-On server- An error occurred when processing meta data during vCenter Single Sign-On setup: the service provider validation failed. Verify that the server URL is correct and is in FQDN format, or that the hostname is a trusted service provider alias.

After a quick Google search I found yet another VMware KB discussing it.

Resolution

This is an expected behavior.
VMware vSphere 7.0 enforces an FQDN, or an IP address that is reverse-resolvable to an FQDN, to allow authentication for Single Sign-On.
Greeeeeeeeeeeeeeaaaaaaaaat! Thanks VMware, just another example of security destroying functionality.
What did I do? Exactly what it stated: I navigated to the web UI URL using the hostname's fully qualified domain name, e.g. Hostname.domain.end.
That's because I had been attempting to access it by the short hostname alone, with the domain info auto-resolved by the DNS suffix during queries.
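If you want to sanity-check that your environment meets the requirement, forward and reverse lookups are enough; the hostname and IP below are placeholders for your own:

nslookup vcenter.domain.end
nslookup 192.168.1.10

The first should return the vCenter's IP, and the second (a reverse lookup) should come back with the same FQDN.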

Fixing vCenter [500] An error occurred while fetching identity providers.

Story

So the other day I posted about upgrading vCenter to 7.0.x, and everything went fine during the upgrade. For some odd reason, a couple of days later when I navigated to the vCenter login page I was greeted with:

[500] An error occurred while fetching identity providers.

Kind of wish I had read this Reddit post right off the hop, because the first reply turns out to be my answer at the end of this post.

I did, however, first hit this KB about it. I was a bit thrown off, as it indicated to only apply the fix if you see the following in the logs:

(/var/log/vmware/trustmanagement/trustmanagement-svcs.log)

2021-03-10T09:27:03.474Z [tomcat-exec-14  INFO  com.vmware.identity.token.impl.X509TrustChainKeySelector  opId=] Failed to find trusted path to signing certificate <STS Certificate Subject, example - C=US,CN=ssoserverSign\,dc\=vsphere\,dc\=local>
java.security.cert.CertPathBuilderException: Unable to find certificate chain.
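If you want to quickly check your own logs for that entry, a grep against the same file does the trick:

grep -i "Failed to find trusted path" /var/log/vmware/trustmanagement/trustmanagement-svcs.log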

That entry I could not see, so I wasn't sure if this was the issue or not. What I did see in my logs was the following:

2021-09-17T23:58:03.945Z [tomcat-exec-14 WARN com.vmware.vcenter.trustmanagement.impl.VcIdentityProviders opId=] com.vmware.sso.interop.ldap.NoSuchObjectLdapException: No such object
LDAP error [code: 32]

and

2021-09-18T01:19:01.322Z [tomcat-exec-26 INFO com.vmware.vapi.security.AuthenticationFilter opId=] Not successful authentication
java.lang.RuntimeException: Authentication data not found
Caused by: com.vmware.vapi.dsig.json.SignatureException: Cannot verify the signature over the provided data

So it wasn't a match. Looking at my firewall, I couldn't see any LDAP connections from vCenter to my LDAP server since the upgrade. So I decided to try a reboot instead. This simply made things worse.

No Healthy Upstream

Now when I'd try to access the vCenter web UI, I was greeted with a blank white page with simple text stating "No Healthy Upstream". Looking into this, people reach this problem for several different reasons, as mentioned here and here, and for some odd reason this guy just changed his IP address?! Weird.

For me, I checked the local hosts file and it was fine, and I tried a couple of the other mentioned fixes, but none of them worked for me.
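For reference, the check on the VCSA is quick; you're looking for a line mapping the appliance's IP to its exact system name (PNID). The address and name here are placeholders:

cat /etc/hosts

In the output, expect an entry along the lines of 192.168.1.10 vcenter.domain.end.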

Try Anyway

For some reason, at this point I decided to try the workaround from the initial VMware KB I found anyway, as the main login symptom was exactly the same, even though I couldn't validate the same log entries in my own logs.

How to Copy Files to VCSA via WinSCP

Now, a couple of real quick things to note here. You need to copy a script to the VCSA. If you get an 'unable to agree on a cipher suite' error, you'll need to update your copy of WinSCP to a newer version. Also, instead of changing the shell on the VCSA the way VMware says, do what this guy suggests instead:

“In the new connection dialog, specify the Host name, User name and then click the Advanced button,

(VCSA 6.5)

Choose the Environment/SFTP option

Specify for SFTP server: shell /usr/libexec/sftp-server”

So much easier.

I decided to take a look at the script after copying it to the VCSA, and it had this line, which made me hopeful it would actually resolve my issue:

/opt/likewise/bin/ldapmodify -x -h localhost -p 389 -D "cn=administrator,cn=users,$DOMAINCN" -w "$DOMAINPASSWORD" -f sso-sts.ldif | tee -a $LOGFILE
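For those unfamiliar with ldapmodify's flags: -x is a simple bind, -D is the bind DN (the SSO administrator here), -w the password, and -f the LDIF file containing the changes to apply. In other words, the script writes corrected STS data back into the local VMware Directory, which lined up nicely with the "No such object" LDAP error I was seeing.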

So I followed along with the workaround specified in the KB…

1) Download the attached fixsts.sh script from this article and upload to the impacted PSC or vCenter Server with Embedded PSC to the /tmp folder.

2) If the connection to upload to the vCenter by the SCP client is rejected, run this from an SSH session to the vCenter:

chsh -s /bin/bash

3) Connect to the PSC or vCenter Server with an SSH session if you have not already per Step 2.

4) Navigate to the /tmp directory:

cd /tmp

5) Run chmod +x fixsts.sh to make the file executable.

chmod +x ./fixsts.sh

6) Run ./fixsts.sh.

./fixsts.sh

Restart the services on all vCenters and/or PSCs in your SSO domain using the commands below:

service-control --stop --all
service-control --start --all
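Once they're back, the same tool can confirm everything actually started:

service-control --status --all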

My results:

To my amazement, it actually worked, and I was able to log into the vCenter server!! Wooo!

*Update* Here’s a great blog post covering managing or creating custom certificates with vCenter 7

Kinda funny that 7.0 is reported as 6.8 in the scripts… mhmm.

Veeam: NFC storage connection is unavailable.
Failed to create NFC download stream.

I’ll keep this post short as well.

Run the replication job… ERROR. Check the error… huh, haven't seen this one before…

4/16/2021 11:22:16 AM :: Processing VMName Error: NFC storage connection is unavailable. Storage: [stg:datastore-3,nfchost:host-2,conn:vcenter]. Storage display name: [ESXi-Local-Datastore].
Failed to create NFC upload stream. NFC path: [nfc://conn:vCenter,nfchost:host-23982,stg:datastore-23983@VMName_replica/VMName.vmx].

To note on this: I had made some changes. I changed a route between sites (I needed to stop an entire network from being improperly routed, but some services still required access from the main location, so some dedicated /32 routes were put in their place).

I had also just patched the host in question and was testing jobs after the patch. Since I wasn't exactly sure which change was the cause, I decided to do regular troubleshooting to get more details on the root cause.

I love Veeam; they have a nice KB to help with this. So I followed along. The main Veeam server's log locations didn't have the log file in question, so I was pretty confident the job was still using the proxy at the alternate site.

Checking the log as directed by the KB, sure enough the same error line showed up, which indicated it was DNS related. Checking the proxy's DNS settings… DOH. It was using a server within the routes I had removed and hadn't created a dedicated /32 for, as I wasn't expecting any systems here to need to communicate with that subnet.

Now that I knew what the issue was… this felt familiar… oh yeah, the Veeam Soap Fault issue I had to deal with.

The funny part about this is… 1) it's the same server/proxy, 2) it's again DNS related, and 3) I'm again going to stick with the hosts file to avoid dependencies on DNS servers.

In my case the error showed that the ESXi server's hostname WAS fully qualified, but access to a DNS server to resolve it was unavailable. As soon as I saw this, I had two options:

  1. Create a route to allow the proxy to reach the required DNS servers (which won't be available in a DR case), OR
  2. Just add a static record to the proxy's hosts file. (No DNS server required, but if the hostname or IP changes it needs to be adjusted manually here.)

As you can see, I had exactly two options, similar to my first post; the difference is that this time it was fully dependent on DNS. Since this is a self-hosted instance of a Veeam proxy, there's a good chance DNS access won't be available when the time comes, so the second option was chosen.

It's important that when these types of choices are made, it is well documented WHY they were made.

So in this case, to resolve it, I added a record to the proxy's local hosts file:

172.x.x.x     ESXi.domain.postfix
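To confirm the proxy picks up the new record, ping it by name (the hostname here is the same placeholder as above); note that nslookup won't work for this test, since it queries the DNS server directly and bypasses the hosts file:

ping ESXi.domain.postfix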

You may notice that the ESXi hostname is not in the initial error; it only tells you the datastore. The Veeam logs will indicate which lookup failed, more than likely the hostname lookup for the ESXi host on which the replica VM will be created.

I really hope this post helps someone. Honestly, I just followed the Veeam KB, which was a great reference for troubleshooting the issue. Your case may be different; depending on the root cause, your resolution may differ from what was discussed here.

Cheers, stay safe everyone.

VMware HA down after 6.5 patch

The Story

So the other day I tested the latest VMware patch that was released, as blogged about here.

Then I ran the patch on a client's setup, which was on 6.5 instead of 6.7. I didn't think it would be much different, and in terms of the steps to follow, it wasn't.

The first thing to note, though, is to validate the vCenter root password to ensure it isn't expired (on 6.7U1 and newer), else the updater will tell you the upgrade can't continue.

Logged into vCenter (SSH/console); once in the shell, to change the root password:

passwd

To see the status of the account:

chage -l root

To set the root password to never expire (-M 99999 maxes out the password age, -E -1 removes the account expiry date, -I -1 disables the inactivity lock, -m 0 clears the minimum age); do so at your own risk, or only if allowed by your policies:

chage -I -1 -m 0 -M 99999 -E -1 root

Install patch update, and reboot vCenter.

All is good until…

ERROR: HA Down

So after I logged into the vCenter server, an older cluster was fine, but a newer cluster with newer hosts showed a couple of errors.

For the cluster itself:

“cannot find vSphere HA master”

For the ESXi hosts

“Cannot install the vCenter Server agent service”

So off to the internet I go! I also asked people on IRC if they had come across this, and crickets. I found this blog post, and unfortunately all the troubleshooting steps led to no real solution. It was a bit annoying that it said "it could be due to many reasons such as…", listed them off with a vCenter update being one of them, but then went through common, standard troubleshooting steps. Which is nice, but none of them are analytical enough to determine which root cause you're actually facing, so you can resolve it instead of throwing darts at a dartboard.

Anyway, I decided to create an SR with VMware and uploaded the logs, while I kept looking for an answer myself, and found this VMware KB.

Funnily enough, the resolution states… "This issue is resolved in vCenter Server 6.5.x, available at VMware Downloads."

That's ironic; I just updated, which is what caused this problem, hahaha.

Anyway, my colleague noticed the workaround…

“To work around this issue in earlier versions, place the affected host(s) in maintenance mode and reboot them to clear the reboot request.”

I didn’t exactly check the logs and wasn’t sure if there actually was a pending reboot, but figured it was worth a shot.

The Reboot

So, vMotion all VMs off the host, no problem, put into maintenance mode, no problem, send host for reboot….

Watching the screen: still at the ESXi console login… monitoring sensors indicated the host was inaccessible, pings were still up, and the Embedded Host Client (EHC) was unresponsive… ugghhhh, ok…

Pressed F2/F12 at the console: "direct management has been disabled"… like, uhhh, ok…

I found this, a command to force a reboot, but I couldn't SSH in, and I couldn't access the Embedded Host Client… so there was no way to enter it…

reboot -n -f

Then I found this, with the same problem… the solution: like any computer in a stuck state, hard shutdown. So I pressed the power button for 10-20 seconds until the server was fully off, then powered it back on.

The Unexpected

At this point I was figuring the usual: it comes back up and shows up in vCenter. Nope; instead the server showed disconnected in vCenter, in a downed state. I managed to log into the Embedded Host Client, but found the VMs I had vMotioned off still on it, in a ghosted state. I figured this wouldn't be a problem; after reconnecting to vCenter it should pick up on the clean state of those VMs being on the other hosts.

Click reconnect host…

Error: failed to login with the vim admin password

Not gonna lie, at this point I got pretty upset. You know, HULK SMASH! type deal. However, instead of smashing my monitors, which wouldn't have been helpful, I went back to Google.

I found this VMware KB, along with this thread post, and pieced together a resolution from both. The main difference: the KB wanted the agents reinstalled, while in the thread post it seemed most people just needed the services restarted.

So I removed the host from vCenter (Remove from Inventory), removed the ghosted VMs via the EHC, enabled SSH, and restarted the vpxa and hostd services:

/etc/init.d/hostd restart

/etc/init.d/vpxa restart
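If you want to confirm both services came back before re-adding the host, the same init scripts accept a status argument (at least on the builds I've touched):

/etc/init.d/hostd status

/etc/init.d/vpxa status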

Then re-added the host to vCenter and to the cluster, and it worked just fine.

The Next Server

Alright, so now vMotion all the VMs onto this freshly rebooted host, so we can do the same thing on the other ESXi host to make sure they're both good.

I went to set the host into maintenance mode and reboot, and sure enough this server hung at the reboot just like the other host. I figured the process was going to be the same here; however, the results actually were not.

This time the host actually did reconnect to vCenter after the reboot, but it was not in maintenance mode… wait, what?

I figured that was weird and decided to give it another reboot. When I went to put it into maintenance mode, it got stuck at 2%… I was like, ughhhh, wat? The weird part was it even reported orphaned, ghosted VMs, so I thought maybe it had them at this point.

Googling this, I didn't find an answer, and just when I was about to hard reboot the host again (after 20 minutes), it succeeded. I was like, wat?

Then I sent a reboot, which I think took like 5 minutes to apply; all kinds of weird things were happening. While it was rebooting, I disconnected the host from vCenter (not removed), waited for the reboot, then accessed this host's EHC.

It was at this point I got a bit curious about how you determine whether a host needs a reboot, since vCenter didn't tell me, and the EHC didn't tell me… How was I supposed to know, considering I hadn't installed any additional VIBs after deployment? I found this Reddit post asking the same question.

Some weird answers, the best being:

vim-cmd hostsvc/hostsummary | grep -i reboot
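If memory serves, hostsummary prints the summary object in VIM notation, so on a host with nothing pending the grep should return something like:

rebootRequired = false,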

The real thing that made me raise my brow was this convo bit:

Like wat?!?!?! hahaha. Anyway, by this time I had an answer from VMware support; they simply asked when the error happened, whether I had a snippet of the error, and whether I had rebooted the vCenter server…

Like, really… ok, don't look at the logs I provided then. So, ignoring the email for now to actually fix the problem, I looked at the logs myself for the host I was currently working on and noticed one entry that should also have been shown on the host's summary page:

"Scratch location not set"… well, poop… you can see this KB. After correcting that and rebooting the server again, it seemed to be working perfectly fine.
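For reference, the scratch location can also be checked and set from the ESXi shell; the datastore path below is a placeholder, and the change only takes effect after a reboot:

vim-cmd hostsvc/advopt/view ScratchConfig.ConfiguredScratchLocation

vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string /vmfs/volumes/datastore1/.locker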

So I removed it from the inventory, ensured no vpxuser existed on the host, restarted the services, and re-added the host.

Moment of Truth

So after ALL that! I got down on my knees, put my head down on my chair, locked my hands together, and prayed to some higher power to let this work.

I proceeded to enable HA on the cluster. The process of configuring HA on both hosts lingered at 8% for a while. I took a short walk in preparation for the failure, and to my amazement, it worked!

WOOOOOOOOO!!!

Summary

After this, I'd almost recommend validating that hosts reboot cleanly before doing a vCenter update, but that's also a bit excessive. So maybe at least run the reboot-pending check above on your ESXi servers to ensure there's no pending reboot before initiating a vCenter update.

I hope this blog post helps anyone experiencing the same type of issue.


MSI installer – Network Resource Unavailable

The Start

The other day my boss came in with a colleague's laptop and told me that the VPN software had failed to update, and by "failed to update" I mean it completely removed the old version and never installed the latest one. Checking the laptop's 'Programs and Features' sure enough showed no sign of the application.

I simply grabbed the latest version of the software installer and attempted to reinstall the application, yet to my dismay the installer complained with the following error: 'The feature you are trying to use is on a network resource that is unavailable.' How insightful…

I figured it was a registry issue, as the registry is known to hold stale settings from applications that never get cleaned up properly. It can be a cesspool on old machines that have been constantly upgraded.

The Dig

There are plenty of references online when it comes to this error. The main one people point to is HKLM\Software\Classes\Installer\Products.

Even though I cleaned everything under that key related to the application I was installing, it was still failing with the same error.

The Answer

Lucky for me, the error kept specifying the network path it was expecting. I searched the registry for that path, and found there was a matching key.

After removing its parent GUID-named key, the installer ran successfully.

If my memory serves me correctly, I believe the problematic key was under:

HKEY_CLASSES_ROOT\Installer\Products
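If you hit the same thing, you can hunt down the offending key from an elevated command prompt instead of clicking through regedit; the search string is whatever path fragment the installer error names (the one below is a placeholder), and exporting the key first gives you a rollback:

reg export HKCR\Installer\Products products-backup.reg

reg query HKCR\Installer\Products /s /f "\\server\share" /d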