vCenter Appliance Failed File Based Backup

Story Time

*UPDATE* VMware has pulled this garbage mess of an update version of vSphere. Why?

1) They PSOD ESXi Hosts...

2) Broke more shit then they fixed...

3) Broke and silently removed protocols for File Based Backups (This post)

As much as the backup failed, I failed along with it,

Task. Backup the vCenter Server using VAMI to create a file based backup.

Now for a ESXi host, you can do this super easy (at least the config so install new and simply load the config)

For a deep and better understanding of backing up and restoring ESXi host’s please read this really amazing blog post by Michael Bose from NAKIVO.

Back up ESXi configuration:

vim-cmd hostsvc/firmware/backup_config

and You will get a simple URL to download the file right to your management machine/computer.

Does vCenter have something like this? (from my research…) No.

You use the vCenter Server Interface to perform a file-based backup of the vCenter Server core configuration, inventory, and historical data of your choice. The backed-up data is streamed over FTP, FTPS, HTTP, HTTPS, SFTP, NFS, or SMB to a remote system. The backup is not stored on the vCenter Server.

Which hasn’t been updated since 2019. Let’s make a couple things here clear:

  1. The HTTP and HTTPS mentioned above are not like the ESXi style mentioned above where it creates a nice backup file locally on the VCSA and presents you with a simple URL to navigate to, to download it. It expects the HTTP/HTTPS to be a file based server to accept file transfers to (like dropbox).
  2. Lots of these “supported” protocols have pretty bad bugs, or simply don’t even work at all. Which well see below.

Doing the Theory

So OK, l log into VAMI, Click the Backup tab on the left hand nav, try to add a open SMB path I have available to use cause, why not, make my life some what easy…

Looking this up I get: VAMI Backup with SMB reports error: “Path not exported by the remote filesystem” (86069) (vmware.com) dated Oct 28,2021. Nice, nice.

Alrighty then, I’ll just spin up a dedicated FTP service on my freeNas box I guess. I learnt a couple things about chroot and local users via FTP, but the short and sweet was I created a local account on the FreeNAS box, created a Dataset under than existing mounted logical volume, and granted that account access to the path. Then enabled local user login for the FTP server, and specified that path as the user’s home path, and enabled chroot on the FTP service, so when this user logs in all they can see is their home path, which to that user appears as root. This (I felt) was a fair bit of security on it, even though its a lab and not needed, just nice…. ANYWAY… Once I had an FTP server ready….

Now I went to Start a File based backup of the vcenter server:

First Error: Service Not Running

In my case I got an error that the PSC Health service was not running, this might just be cause my lack of decent hardware for good performance might have caused some services to not start up in a timely manner. Either way, Navigating to Services in VAMI and started the PSC Health service. Lucky for me there was no further errors on this part.

If you have service errors you will have to check them out and get the required services up and running, which is out the scope of this post.

Second Error: Number of Connections

The next error I got complained about the allowed number of connections to the target.

Which in my case there was an option on the FreeNAS FTP service configurations for this, I adjusted it to “0” or unlimited in hopes to resolve this problem:

restart the service, and try again…

Third Error: Unknown

This is starting to get annoying…

What kind of vague error is that?!

Guy in this thread states the path has to be empty? what?

I tried that, cleared some more space, and it seems to have sorta worked?

Clear the FTP users home path, and try again:

Fourth Problem: Stuck @ 95%

The Job appeared to run but I noticed a couple things:

1) Even though the backup config said the overall size would only be roughly 400MB, the job ran to around 1.8 Gigs.

2)  All I/O appeared to stop and all Resources returned to an idle state, while the job remained stuck processing at 95%.

OK… I found this thread, which suggested to restart the autodeploy service, tried that and it didn’t work, the job remained stuck @ 95%.

I also found this VMware KB,  however,

1) I have a tiny deployment so no chance my DB would be 300Gigs.

2) When I went to check the “buggy python script” the “workaround” seemed to already have been implemented. So the versions of vCenter I was on (7.0u3a) already had this “fix” in place

3) The symptoms still remain to be exactly the same and the python scripts remain in a “sleeping” state.

FFS already….

Try Anyway

Well I saw the files were created, so I decided to try the restore method on the VCSA deployment wizard anyway…

I forgot to take a snippet here, but it basically stated there was a missing metafile.json file. I can only assume that when the backup process was stuck at 95% it never created this required json file…

FUCK….

One Scheduled Run

I noticed that I suppose overnight a scheduled job tried to run and provided yet a different error message:

Well that’s still pretty vague, as far as I know there should be no connectivity issues since file were created all the way up to 1.8 gigs, so I don’t see how it’s network, or permissions related, or even available space in this case, since all files were cleared, up to the already possible and shown to be written 1.8 gigs, which have been deleted to empty the path every time.

Liek seriously, wtf gives here. The fact there’s an entirely new KB with an entire Table of list of shit that apparently is wrong with this file based backup honestly begs the question, Where the FUCK is the QA in software these days? This shit is just fucking ridiculous already…

Check the Logs

*This Log file only gets created the first time you click “configure” under the backup section of VAMI.

Here’s how to access the logs:

Using putty or similar, SSH in as root on the appliance.
Type Shell at the prompt.
Type cd /var/log/vmware/applmgmt.
Type more backup.log or tail backup.log.

[VCDB-WAL-Backup:PID-42812] [VCDB::_backup_wal_files:VCDB.py:797] INFO: VCDB backup WAL start not received yet.

Checking the entry I find this thread. Along with this Reddit Post. Which leads right back to the first shared thread, which states some bitching about the /etc/issues files… and I have a strange feeling, just like the stuck @ 95% issue, I’ll look at the file and it will probably be correct just like the guy who created the Reddit post.

Try Alternative Protocols

When I tried alternative protocols I came across more issues:

NFS – Had the same path issue SMB did “Path not exported by remote system”

SCP – Was apparently silently dropped, much like what this thread mentioned. The amount of silence on that thread speaks volumes to me.

TFTP was also dropped.

You are so Fucked

Soo I wonder if I try to “upgrade” aka downgrade using the UI installer of a supposed version that works (7.0u2b)…

Alright so let me get this straight… I upgraded, and now I can’t make a backup cause the upgraded version is completely broken it terms of its File Basked Backups.

I can’t Roll back the upgrade without having kept the old VCSA, which was removed in my case since all other services was working, vSphere itself.

I can’t “downgrade” and existing one, I can’t make a backup to restore my old ones. OK fine well how about a huge FUCK YOU VMWARE. while I try to come up with some sort of work around for this utter fucking mess.

Infected Mushroom – U R So F**ked [HQ & 1080p] – YouTube

Work around option #1

Build a brand new vCenter, add hosts, and reconfigure.

The main issue here is the fact if you rely on CBT, you will be fucked and all the VM-IDs will have changed, so you will have to:

1) Edit and adjust all back up jobs to point to the new VM, via it’s new VM-IM.

2) Let the delta files be all recalculated (which can be major I/O on storage units depending on many different factors (# of VM, Size of VMs, change of files on VMs, etc)

Not and option I want to explore just yet.

Work Around option #2

Back and restore the config database?

Let’s try.. first backup…

copy python scripts (hope they not all buggy and messed up too..)

Stop required services:

service-control --stop vmware-vpxd
service-control --stop vmware-content-library

change the script permissions

chmod +x backup_lin.py

Run it:

Make a copy of it via WinSCP.

run the restore script… and

well was worth a shot but that failed too….

Lets try PG dump for shits…

I’d really recommend to read this blog post by Florian Grehl on Virden.net for great information around using postgres on vCenter.

Connect to server via SSH (SSH enabled required on vCenter).

“To connect to the database, you have to enable SSH for the vCenter Server, login as root, and launch the bash shell. When first connecting to the appliance, you see the “Appliance Shell”. Just enter “shell” to enter the fully-featured bash shell.

The simplest way to connect to the databases is by using the “postgres” user, which has no password. It is convenient to also use the -d option to directly connect to the VCDB instance.”

# /opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB

Cool, this lets us know the postgres DB service is running. The most important take away from Florian’s post is:

“When connecting, make sure that you use the psql binaries located in /opt/vmware/vpostgres/current/bin/ and not just the psql command. The reason is that VMware uses a more recent version than it is provided by the OS. In vSphere 7.0 for example, the OS binaries are at version 10.5 while the Postgres server is running 11.6”

Kool, I could use pg_dumpall but I found it didn’t work (maybe that was wrong version of vcenter being mixed, not sure) either way lets try just the VCDB instance…

interesting, lol, as you see I got an error about version mismatch. I found this thread about it and with the info from Florians post, had an idea, tried it out, and it actually worked. Mind… BLOWN.

rm /usr/bin/

OK let’s take this file and place it on the newly deployed vcenter.

even though restore appeared to have worked the vCenter instance booted and showed to be like new install. Was worth a shot I guess, but did not work.

Work Around Option #3

I’m not sure this is even a fair option, as it only works if you have existing backup of alternative types. In my case I use Veeam and its saved my bacon I don’t know how many times.

Sure enough Veeam saved my bacon again. I ended up restoring a copy of my vCenter before the 7.0u3a, which happened to be on 7.0u2d.

I managed to add a SMB path without it erroring, and unreal, I ran a File Based Backup and it actually succeeded!!

Now I just simply run the deploy wizard, and pick restore to build a new vCenter server from this backup.

Ahhh VMware… dammit you got me again!

alright fine… grabs yet another copy of vCenter…

and this time…

are you fucking kidding me? Mhmmm interesting… VCSA 7.0 restore issue – VMware Technology Network VMTN

ok… good to know…

From this… to this….

then Deploy again…

It stated it failed, due to user auth. However I was able to login and verify it worked, but sadly it also instantly expired the license as well. I was hoping I could get another 60 days without creating a new center, reconfiguring and breaking my VM-IDs and CBT delta points for my backup software.

Even this link states what I’m trying to do is not possible… ugh the struggles are real!

In the end just started from scratch, Ugh,

When to VMware Snapshot

OK real quick short post here. I figured I’d take a snapshot of my vCenter server (reason will be next blog post).  In this case I decided to snapshot the VM with memory saving, I figured it would be faster than bringing the VM back up from a shutdown state as that’s what a normal snapshot would do.

In most cases that would probably be a fair assumption, but boy was I wrong.

It Took a short time save the snapshot but almost 15 min or more to bring the VM back to full operational status with all memory back in tact… just check out these charts:

Here you can see it took maybe 5 min to save the memory state to disk, this would have been a time of 0 minutes since a normal snapshot doesn’t save memory to disk. Then you can see the slower longer recovery time it took to get the same memory from disk and put it back into memory.

Of course taking a solid guess that disk I/O is much slower than Memory I/O the bottle would have to be non other than the actual disk….

Yup there’s the same matching results of fast disk writes, and slow disk reads…

and there’s the disk being 100% bust on the read requests. I’m not sure why the read performance on this drive was as bad as it was, but I have a feeling a regular boot would have been faster… I’ll update this post if I do an actual test.

Meh, pretty much same amount of time… I think I need some super fast local storage… yet I’m so cheap I never do…. cheap bastard…

Changing vCenter Hostname

Changing vCenter Hostname

Why?!?! Cause I gotta!

Source: Changing your vCenter Server’s FQDN – VMware vSphere Blog

PreReqs, AKA Checklist

  • Backup all vCenter Servers that are in the SSO Domain before changing the FQDN of the vCenter Server(s)
  • Supports Enhanced Linked Mode (ELM)
  • Changing the FQDN is only supported for embedded vCenter Server nodes
  • Products which are registered with vCenter Server will first need to be unregistered prior to an FQDN change. Once the FQDN change is complete they can then be reregistered.
  • vCenter HA (VCHA) should be destroyed prior to an FQDN change and reconfigured after changes
  • All custom certificates will need to be regenerated
  • Hybrid Linked Mode with Cloud vCenter Server must be recreated
  • vCenter Server that has been renamed will need to be rejoined back to Active Directory
  • Make sure that the new FQDN/Hostname is resolvable to the provided IP address (DNS A records)

NOTE: If the vCenter Server was deployed using the IP as PNID/FQDN, then the following should also be considered:

  • The PNID change workflow cannot be used to change the IP address of vCenter Server
  • The PNID change workflow cannot be used to change the FQND of vCenter Server

In this scenario, use the vCenter Server Appliance Management Interface (VAMI) to update hostnames or IP changes directly. 

The main thing I was expecting was the certificate issue. In my home lab, I removed SSO domain before this change (just using vpshere.local), no ELM, already using embedded (all-in-one), no VCHA, no Hybird, oh yeah…. not sure if you “leave an SSO domain”, before joining back to AD…

My Only Pre-Req

I went into DNS and pre-created A host records for the new server hostname: vCenter.zewwy.ca

Steps

Basically log into VAMI, and change the name.

Then

and and…. well WTF…

No matter what I do it’s greyed out… I thought maybe the untrusted cert, might be an issue so tried from a machine with full trusted chain, and same issue!

Like…. Why… why is Next greyed out? It’s like whatever Button Validation code is written for it is not being triggered, is this a browser version issue? I can’t find anything online with anyone having this issue…. Why? Cause I was right, it was the input validation…

Honestly, this is one of those MASSIVE facepalm moments in my life. I only realized after the fact the username field was NOT auto filled, it was only a label that was greyed and provided as a suggestion… Fill both fields and the next is ungreyed…

Step 4, check the checkbox to acknowledge the warning, and away… she goes!

At which point I clicked redirect now (both web addresses were still available as it didn’t seem to matter which you came from, the cert was untrusted either way, cause the CA not in my trusted ca store)

5 minutes later….

I tell ya nothing more annoying than a spinning circle and the warning “don’t refresh” when the status bar simply does not move… sure got some conflicting messages here….

*Starts to sweat*…

after about 10 minutes time…

More Certificate Fun!

Alright so after this, quick take always… when I went to check the site it was “untrusted” but not for the reason I had thought, I thought it would have been from the same issue as the source blog, and be the hostname on the cert but that was not the case, instead it was imply the the cert chain seemed to be missing, and the issuer could not be verified:

as well as:

So what to do about this… You can download the CA cert from vcenter/certs/download.zip (some reason I had to use IE). Then install the CA cert. (I noticed even after I did this I still had cert warning, error, but after the next day, maybe cache clearing or update, it reported green in the web browser).

Now when I logged in, I got the ol Cert Alert in the vCenter UI

first thing to try is removing old CA’s

Which I did, following this VMware KB

I simply followed my other post about this, and just cleared reset to green on the alert. (Still good days later).

Backup Solutions

Don’t forget to change the server in your backup software, such as I had to do this in Veeam.

These were my results…

Which go figure errored out…

So right click, go to properties of the object… Next, next…

Accept the certs new certificate

Now you figure all is well, but when I went to create a new backup job, when I attempted to expand the vcenter server in Veeam. It just hung there…

I ended up rebooting the server, and then waiting for all the Veeam services to be started. I reopened Veeam, and when to Inventory, clicked the vCenter server, took a second and then showed all the hosts, and the VMs. I clicked it and rescanned to be safe and got this result which was a bit different then the applied settings confirmation above. I think maybe I forgot to rescan the host after applying the new settings, assuming it would have done that as part of the properties change wizard.

which lucky for me now worked, and I was able to select a VM in the Veeam backup wizard, and it successfully backed up the VM.

Final Caveats

like what the heck, everywhere else its changed except at the shell. Let’s see if we change change this.

Well that was easy enough, no reboot required. 🙂

I also found the local hosts file doesn’t update either, in the file it states it managed by VAMI, so many have to look there for potential solutions:

I noticed this since I had to do a work around for something else, and sure enough caught it. I’ll change it manually with vi for now and see what changes after a reboot.

Summary

Overall, literally quick n easy.

  1. Verify DNS records exist.
  2. Use VAMI to edit hostname via editing the Network MGMT settings and change the hostname, click apply and wait.
  3. Manually clear out the old Certs that were created under the old hostname.
  4. Reconfigure you backup solution, which is vender specific (I provided step for Veeam as that is the Backup Vender I like to use)

Overall the task seemed to go pretty smooth. I’ll follow up with any other issue I might come across in the future. Cheers.

 

 

Change vCenter FQDN or IP on Veeam

Story

I recently did a infrastructure upgrade on my home lab, which included moving all my esxi hosts into a dedicate subnet, and making them all more dependent on DNS. This has it’s pros n cons, after all my ESXi host had their IP addresses changed. I also moved my vCenter and changed it IP address, which is now supported yay.

Now I had to move Veeam along with it, originally it was in the same subnet as the esxi hosts, and vCenter which have all moved, instead of trying to manage cross subnet comms, I changed Veaam’s IP address and pointed it’s DNS settings to my AD DNS which has all the ESXi and vCenter host records. Was easy enough just changing the Windows NIC Ip address, and changing the VM’s VMPG.

 How to

Now when I went to scan the vCenter instance in Veeam, it complained about the certificate, since it was renewed from the vCEnter upgrade. I decided I’d change it to be based on DNS now that everything else is as well. When I went to edit the object in Veeam it was greyed out.  Lucky for me Veeam had a KB ready to go.

Challenge

The Name/FQDN/IP of the vCenter Server has changed, and needs to be updated within Veeam Backup & Replication.

Solution

Solves Name Change Only
This solution applies ONLY if the vCenter Server database has not changed.
(I did an upgrade so yes, which you’d want to preserve VM-IDs, and chains)

If the Name/FQDN/IP of the vCenter changed due to a reinstall or upgrade, and a new vCenter database was used, the Ref-IDs will have changed. Due to the changed Ref-IDs you will need to follow the documented process in www.veeam.com/KB1299

Step 1

Prior to running the commands below you need to identify the Name\FQDN\IP Veeam is using to communicate with the VC currently. To do this, edit the entry for the vCenter under Backup Infrastructure and note the “Name:”.

Next perform the following steps to change that VMware Server’s name.

Step 2

Launch PowerShell from inside the Veeam Backup & Replication console. You can find the “PowerShell” button under the File-menu’s “Console” section.
openpsconsole
The Veeam Backup & Replication PowerShell Tookit will load.

Step 3

Run the following command:

$Servers = Get-VBRServer -name "old-name"

Replace old-name with the “Name” current set for the vCenter in the Veeam Backup and Replication Console

Step 4

Run the following command next to change the name:

$Servers.SetName("new-name")

Replace new-name with the new name for the vCenter, this can be an IP, Hostname, or FQDN.
Do not remove the quotes on either side.
This change will go in to effect as soon as the command in Step 4 completes.

How I did it – One Liner

Verify:

Get-VBRServer -name "Name from Step 1"

Change:

(Get-VBRServer -name "Name From Step 1").SetName("new.domain.com")

Results:

Now you can click next, Apply, should get right past checking certificate if the certificates are all good… and end up with the follow after rescan:

That was easy enough, I don’t fully understand why the grey out the UI to make this change, but there you have it. Happy Backups!

vSphere HA Agent cannot be correctly installed or configured… again

Story

Another vCenter Patch, Another problem 😀

This seems to be a reoccurring story these last couple posts…

Error on Host

This time after updating again a host in the cluster had the error message.

Troubleshooting

Un like the last time this happened, the event log wasn’t as blatant (flooded) complaining about the /tmp being full. and checking the host with

vdf -h

which showed only 90% full, which was still pretty high, which might have explained the one log event that I did see about it:

The ramdisk 'tmp' is full. As a result, the file /tmp/img-stg/data/vmware_f.v00 could not be written

Which was in the log right after this event of attempting to install a base ESXi image?

Installing image profile '(Updated) HPE-ESXi-Image' with acceptance level checking disabled

This seemed a bit weird but I could find any info other than what’s usuallly a very Microsoft type answer of “you can just ignore it” or “usually this is not an issue, just it says vCenter saying it is connecting to esxi host and installing it’s agent

OK I guess… moving on… the very next error event was:

Could not stage image profile '(Updated) HPE-ESXi-Image': ('VMware_bootbank_vmware-fdm_7.0.2-18455184', '[Errno 28] No space left on device')

Huh, Now note this host was installed running the official VMware Image provided by HPE for this exact hardware supported by the VMware HCL. So there should be no funny business. However I feel maybe there’s a bit of the known HPE bug as mentioned the last time this happened. It just hasn’t fully flooded /tmp just yet.

Lil Side Trail

So couple things to note here, first the ESXi image is installed on a USB/SD Card style setup as such it should be well know to define the persistent log location, as well as the scratch location. However, not many source specify changing the system swap location.

  1. Persistent Log; VMware KB; Tech Blogger
    (Most standard ESXi Log info)
  2. Scratch Log: VMware KB; Tech Blogger 1; Tech Blogger 2
    (Crash Logs, Support log creations)
  3. Swap Location: VMware Doc 1 (Configure), VMware Doc2 (About), Tech Blogger Who seem to regurgitate the exact about page from VMware.

However, researching this even more lots of posts on reddit mentioned the swap file for VM’s being on their VM directories, so if using a shared datastore they will reside there, and I shouldn’t see issues around swap usage at all at the host level.

Which if you look on the vCenter Web UI on a ESXi hosts there are two options available: VM – Swap, and System Swap.

The VMware docs doesn’t seem to describe accurately the difference between these two options.

Lookup up the error about not being able to stage the file I found this one blog post which of course mentioned changing the swap location to get past the error…

The main thing mentioned by the blogger is “The problem is caused by ESXi not having enough free space available to extract the installation packages.” but failed to specify where that exactly is, and the event log didn’t specify that either. Now since his solution was to adjust the system swap location, it begs the question. Is the package extraction location the System Swap location?

Since the host settings seem to be only specified with the alternative option checkboxes as:

Can use host cache
Can use datastore specified by host for swap files

It’s still not fully clear to me where the swap is actually located with these, assumed default settings. Or if extraction of the image actually using swap, or why the same imagine already on the ESXi host is being re-applied when your upgrade vCenter?

Resolution

So many question, so little answers, so unfortunately I’m going to go on a bit of a whim, and simply try exactly what I did before, clear the file from the /tmp location that was takin up a lot of it’s space, install the HPE patch for the known bug, in hopes it resolves the issue….

Sure enough the exact same thing happened, as in my initial post it just seems it wasn’t fully full. So the symptoms were just a bit different.

  1. vMotion all VMs to another host in the cluster (amazing vMotion works without issue)
  2. Ignore the HA warning on the VMs migrated
  3. Place Host into Maintenance mode (This clears the HA warnings on the VMs and cluster)
  4. Verify /tmp has room. Update any ESXi packages from the hardware vendor if applicable.
  5. Reboot the host.
  6. Exit Maintenance mode.

Hope this helps someone who might see the same type of error events in their ESXi event logs.

Clear vCenter Alert Certificate Status

Story

So lately updated a couple vCenter server servers, and in my process I hit a couple errors that required some resolving…

  1. Expired Certs on Source vCenter
  2. Error [500] Auth Provider, due to something, potentially bad certs.
  3. An HPE Bug, filling up ramdisk, causing HA config issues.
  4. Change in security process; preventing login.

The Problem

So a couple hiccups along the way. And now it’s time to resolve this one…

Yeahhhh and alert on Certificates… Seems like VMware and certificate management is like Oil n Water. They don’t mix well.

I’ve had some terrible times managing certificates  with VMware. However as blogged about here, seems there’s finally a way to use your own certificates via the WebUI.

Anyway… to the point, you figured you simply navigate to the vCenter WebUI -> Home -> Administration -> Certificates. Only to realize there’s nothing reporting as invalid or expired.

Checking for Expired Certs

What gives? Ahhh yes, more hidden secret stuff that is not in your face when it comes to the WebUI. Can you guess? That’s right another VMware KB

So while the other issues I’ve mentioned does have references and script in relation to certs, the only “check” in those previous posts was using openssl on the VCSA shell to grab the certificate from the listening service on the dedicated port. Which was based on a particular symptom which spurred that check. So here’s the KB telling you how to actually check the certificates the easiest way I found so far (no check.py; python script needed)

for store in $(/usr/lib/vmware-vmafd/bin/vecs-cli store list | grep -v TRUSTED_ROOT_CRLS); do echo "[*] Store :" $store; /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store $store --text | grep -ie "Alias" -ie "Not After";done;

That’s it! :D…. which just like the KB indicated which cert was bad, in this case, an old Root CA that was used in previous deployments of vCenter before upgrades, So it turns out even though you follow the required KB to get past the pre-check of expired certs. It doesn’t delete the old certificates CA Cert.

There it is, the second CA Cert with expiry in 2019… OK so… You figured it would be easy to clean this up, but remember you couldn’t even see it in the WebUI, so you best believe there is no WebUI way to do this that protects you from human error.

Removing old Expired Certs

Instead, very brilliantly, you get… yes another KB! Booo Yeah… So let’s do this!

The main thing to note about this is…

Certificates are copied back to the VECS store because the CA Certificate which is expiring is published to the VMware Directory Service (VMDIR). When the Certificate is removed from VECS, VMDIR adds the Certificate back to VECS during a sync operation. This is done in order to ensure the integrity of the TRUSTED_ROOTS Certificate store, as deletion of an incorrect Certificate from this store could cause the environment to be irreparably damaged.

OK…. All I take away from this is Certs are important so they have a second cert store as a backup to the first cert store… that’s all I can take away form this odd statement.

/usr/lib/vmware-vmafd/bin/vecs-cli entry list --store TRUSTED_ROOTS --text | less

“Find the Certificate you wish to remove and make a note of the Alias and the X509v3 Subject Key Identifier.

Note: There Could be several Certificates to remove. Any expired and not in use certificates should be removed to avoid certificate related alarms.”

Yes that is the plan…

List the trusted certs published to the VMware Directory Service using the following command (administrator@vsphere.local password required). This command is in the same location as vecs-cli:

/usr/lib/vmware-vmafd/bin/dir-cli trustedcert list

Huh… in this case it looks like it is not here, so I should be safe to delete it from the normal store and it shouldn’t auto populate back in.

If you do see it (CN equal to x509v3 Key Identifier) then follow the linked KB to remove it, which seems to save a copy of the cert and use that saved copy to run another command to remove it from the store… super weird.

/usr/lib/vmware-vmafd/bin/vecs-cli entry delete --store TRUSTED_ROOTS --alias 3276134ad93b3688b5dc5dcfaa402e9bfd7af12f

Restart all services on the PSCs and on the vCenter Servers and ensure that all services start and respond normally and that you can log in and manage the environment.

service-control --stop --all
service-control --start --all

Took a liil while, then logging in… alert still there, I guess I just have to Reset to Green?

For Now Clicked the Reset to Green link. Even after Yet another vCenter patch, it still did not show up anymore. Yay.

Fixing [400] An error occurred while sending an authentication request to the vCenter Single Sign-On server

So After the last two blog posts about fixing vCenter7’s access issues due to it’s due certificate monument work flows. I was greeted with this error when trying to sign into the web UI on vCenter.

[400] An error occurred while sending an authentication request to the vCenter Single Sign-On server- An error occurred when processing meta data during vCenter Single Sign-On setup:the service provider validation failed. Verify that the server URL is correct and is in FQDN format, or that the hostname is a trusted service provider alias.

After a a quick google search I found yet another VMware KB discussing it.

 Resolution
This is an expected behavior.
VMware vSphere 7.0 enforce FQDN or IP address reverse resolvable to FQDN to allow authentication for Single-Sign on.
Greeeeeeeeeeeeeeaaaaaaaaat! Thanks VMware, just another example of security destroying functionality.
What did I do? Exactly what it stated, I navigated to the WebUI URL using the hostnames Fully Qualified Domain Name E.G: Hostname.domain.end
Cause I was attempting to access it just by just the hostname as domain info was being auto resolved by the domain suffix during queries.

vSphere HA Agent cannot be correctly installed or configured

I updated a vCenter server to 7.0.x when logging into the newly updated vCenter one host in the cluster state the following alert.

Error: “vSphere HA agent cannot be correctly installed or configured” (2056299) (vmware.com)

The KB didn’t sound promising. Checking the hosts event logs. a bunch of errors about /tmp ramdisk being full…

The ramdisk ‘tmp’ is full – VMware ESXi on HPE ProLiant – Davoud Teimouri – Virtualization and Data Center

For real? Wow, not gettin’ lucky last couple weeks. Sure enough exact same issue, cleared /tmp temporality, and downloaded the patch. When I vMotion the VMs from this host onto another host the VMs themselves showed alerts.

Virtual machine failed to become vSphere HA Protected and HA may not attempt to restart it after a failure.

I kept chugging alone in hopes I’d resolve each VM later. However as soon as I placed the issued host into maintenance mode, the alerts from all VMs disappeared. Applied to patch exactly as the HPE KB stated for the ESXi version it was on.

With luck on my side, the host came up clean, and came out of maintenance mode without an issue, and all error and alerts were resolved. Woooo!

Hope this helps anyone doing a vCenter upgrade to 7.x

Fixing vCenter [500] An error occurred while fetching identity providers.

Story

So The other day I posted about upgrading vCenter to 7.0.x while everything went fine during the upgrade. For some odd reason a couple days later when I went to navigate to the vCenter login page I was greeted with:

[500] An error occurred while fetching identity providers.

Kind of wished I had read this reddit post right off the hop, cause the first reply was is going to be my answer at the end of this post.

I did however first hit this KB about it as well I was a bit thrown off has it indicated to only do it if you see the following in the logs:

(/var/log/vmware/trustmanagement/trustmanagement-svcs.log)

2021-03-10T09:27:03.474Z [tomcat-exec-14  INFO  com.vmware.identity.token.impl.X509TrustChainKeySelector  opId=] Failed to find trusted path to signing certificate <STS Certificate Subject, example - C=US,CN=ssoserverSign\,dc\=vsphere\,dc\=local>
java.security.cert.CertPathBuilderException: Unable to find certificate chain.

Which I could not see, so I wasn’t sure if this was the issue or not. What I did see in my logs was the following:

2021-09-17T23:58:03.945Z [tomcat-exec-14 WARN com.vmware.vcenter.trustmanagement.impl.VcIdentityProviders opId=] com.vmware.sso.interop.ldap.NoSuchObjectLdapException: No such object
LDAP error [code: 32]

and

2021-09-18T01:19:01.322Z [tomcat-exec-26 INFO com.vmware.vapi.security.AuthenticationFilter opId=] Not successful authentication
java.lang.RuntimeException: Authentication data not found
Caused by: com.vmware.vapi.dsig.json.SignatureException: Cannot verify the signature over the provided data

So it wasn’t matching. Looking at my firewall I couldn’t see any LDAP connections from vCenter to my LDAP server since the upgrade. So I decided instead to try a reboot. This simply made things worse.

No Healthy Upsteam

Now when I’d try access vCEnter Web UI I was greeted with a blank white web page with simple text stating “No Healthy Upstream”, now looking into this, people reached this problem for several different reasons. As mentioned here and here and for some odd reason this guy just changed his IP address?! Weird.

For me I checked the local Hosts file and it was fine, and couple other mentioned fixes and they all didn’t work for me.

Try Anyway

For some reason at this point I decided to double the mentioned work around in the initial VMware KB I found as the main login symptom was exactly the same even though I couldn’t validate the same log entries within the logs.

How to Copy Files to VCSA via WinSCP

Now a couple real quick things to note here. You need to copy a script to the VCSA. If you get unable to agree on a cipher suite, you’ll need to update your copy of WinSCP to a newer version. Also instead of doing what VMware says to change the shell on the VCSA, do what this guy suggests instead:

“In the new connection dialog, specify the Host name, User name and then click the Advanced button,

(VCSA 6.5)

Choose the Environment/SFTP option

Specify for SFTP server: shell /usr/libexec/sftp-server”

so much easier.

I decided to take a look at the script after copying it to the VCSA, and it had this line which had me hopeful it would actually work to resolve my issue:

/opt/likewise/bin/ldapmodify -x -h localhost -p 389 -D "cn=administrator,cn=users,$DOMAINCN" -w "$DOMAINPASSWORD" -f sso-sts.ldif | tee -a $LOGFILE

So I followed along with the workaround specified in the KB…

1) Download the attached fixsts.sh script from this article and upload to the impacted PSC or vCenter Server with Embedded PSC to the /tmp folder.

2) If the connection to upload to the vCenter by the SCP client is rejected, run this from an SSH session to the vCenter:

chsh -s /bin/bash

3) Connect to the PSC or vCenter Server with an SSH session if you have not already per Step 2.

4) Navigate to the /tmp directory:

cd /tmp

5) Run chmod +x fixsts.sh to make the file executable.

chmod +x ./fixsts.sh

6) Run ./fixsts.sh.

./fixsts.sh

Restart services on all vCenters and/or PSCs in your SSO domain by using below commands:

service-control --stop --all
service-control --start --all

my results:

To my Amazement it actually worked, and I was able to login into the vCenter server!! Wooo!

*Update* Here’s a great blog post covering managing or creating custom certificates with vCenter 7

Kinda funny that 7.0 is stated as 6.8 in the scripts.. mhmm