vCenter Appliance Failed File Based Backup

Story Time

*UPDATE* VMware has pulled this garbage mess of a vSphere update. Why?

1) It PSODs ESXi hosts...

2) It broke more shit than it fixed...

3) It broke and silently removed protocols for File Based Backups (this post)

As much as the backup failed, I failed along with it.

The task: back up the vCenter Server using VAMI to create a file-based backup.

Now for an ESXi host, you can do this super easily (at least for the config: install fresh, then simply load the config).

For a deeper and better understanding of backing up and restoring ESXi hosts, please read this really amazing blog post by Michael Bose from NAKIVO.

Back up ESXi configuration:

vim-cmd hostsvc/firmware/backup_config

and you will get a simple URL to download the file right to your management machine.
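A minimal sketch of that flow, assuming the command prints its download link with a `*` where the host's address belongs (treat that output format as an assumption and check what your host actually prints). The helper name `bundle_url` and the hostname `esxi01.lab.local` are made up for illustration.

```shell
# Turn the line printed by backup_config into a usable download URL.
# Assumed output format:
#   Bundle can be downloaded at : http://*/downloads/<id>/configBundle-<host>.tgz
# where "*" stands in for the host's address.
bundle_url() {
    host="$1"
    # Read the command output on stdin, keep only the URL, swap "*" for the host.
    sed -n 's#.*\(https\{0,1\}://\)\*\(/[^ ]*\).*#\1'"$host"'\2#p'
}

# On the ESXi host itself (not run here):
#   vim-cmd hostsvc/firmware/sync_config
#   vim-cmd hostsvc/firmware/backup_config | bundle_url esxi01.lab.local
```

The printed URL can then be opened from the management machine to pull the bundle down.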

Does vCenter have something like this? (from my research…) No.

You use the vCenter Server Interface to perform a file-based backup of the vCenter Server core configuration, inventory, and historical data of your choice. The backed-up data is streamed over FTP, FTPS, HTTP, HTTPS, SFTP, NFS, or SMB to a remote system. The backup is not stored on the vCenter Server.

That page hasn’t been updated since 2019. Let’s make a couple of things clear:

  1. The HTTP and HTTPS mentioned above are not like the ESXi method, where a nice backup file is created locally and you get a simple URL to download it from. The VCSA expects the HTTP/HTTPS target to be a file server that accepts uploads (like Dropbox).
  2. Lots of these “supported” protocols have pretty bad bugs, or simply don’t work at all, as we’ll see below.
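For what it's worth, here is a rough sketch of a sanity check on the backup location string, assuming it is entered as a URL of the form protocol://server/path. It only tests the scheme against the list quoted from the docs; the function name and the example hosts are hypothetical, and passing this check says nothing about whether the protocol actually works on a given build.

```shell
# Return 0 if the backup location uses one of the protocols the
# documentation lists (FTP, FTPS, HTTP, HTTPS, SFTP, NFS, SMB).
backup_scheme_listed() {
    # Lower-case the input so "SMB://..." and "smb://..." are treated alike.
    url=$(printf '%s' "$1" | tr 'A-Z' 'a-z')
    case "$url" in
        ftp://*|ftps://*|http://*|https://*|sftp://*|nfs://*|smb://*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   backup_scheme_listed "ftp://nas.lab.local/vcsa" && echo "scheme is on the list"
```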

Doing the Theory

So OK, I log into VAMI, click the Backup tab in the left-hand nav, and try to add an open SMB path I have available because, why not, make my life somewhat easy…

Looking this up I get: VAMI Backup with SMB reports error: “Path not exported by the remote filesystem” (86069) (vmware.com), dated Oct 28, 2021. Nice, nice.

Alrighty then, I’ll just spin up a dedicated FTP service on my FreeNAS box I guess. I learnt a couple of things about chroot and local users via FTP, but the short and sweet of it is: I created a local account on the FreeNAS box, created a dataset under an existing mounted logical volume, and granted that account access to the path. Then I enabled local user login for the FTP server, specified that path as the user’s home path, and enabled chroot on the FTP service, so when this user logs in all they can see is their home path, which to that user appears as root. This (I felt) was a fair bit of security, even though it’s a lab and not needed, just nice…. ANYWAY… Once I had an FTP server ready….

Now I went to start a file-based backup of the vCenter server:

First Error: Service Not Running

In my case I got an error that the PSC Health service was not running. This might just be because my lack of decent hardware kept some services from starting up in a timely manner. Either way, I navigated to Services in VAMI and started the PSC Health service. Lucky for me there were no further errors on this part.

If you have service errors you will have to check them out and get the required services up and running, which is out of scope for this post.

Second Error: Number of Connections

The next error I got complained about the allowed number of connections to the target.

In my case there was an option for this in the FreeNAS FTP service configuration, so I set it to “0” (unlimited) in hopes of resolving the problem:

Restart the service, and try again…

Third Error: Unknown

This is starting to get annoying…

What kind of vague error is that?!

A guy in this thread states the path has to be empty? What?

I tried that, cleared some more space, and it seems to have sorta worked?

Clear the FTP users home path, and try again:

Fourth Problem: Stuck @ 95%

The Job appeared to run but I noticed a couple things:

1) Even though the backup config said the overall size would only be roughly 400MB, the job ran to around 1.8 Gigs.

2)  All I/O appeared to stop and all Resources returned to an idle state, while the job remained stuck processing at 95%.

OK… I found this thread, which suggested restarting the autodeploy service. I tried that and it didn’t work; the job remained stuck @ 95%.

I also found this VMware KB, however:

1) I have a tiny deployment so no chance my DB would be 300 gigs.

2) When I went to check the “buggy python script”, the “workaround” seemed to already have been implemented. So the version of vCenter I was on (7.0u3a) already had this “fix” in place.

3) The symptoms remain exactly the same and the python scripts remain in a “sleeping” state.

FFS already….

Try Anyway

Well I saw the files were created, so I decided to try the restore method on the VCSA deployment wizard anyway…

I forgot to take a snippet here, but it basically stated there was a missing metafile.json file. I can only assume that when the backup process was stuck at 95% it never created this required json file…

FUCK….
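Before pointing the restore wizard at a backup folder, a quick check for the metadata file can save a wasted run. This is only a rough sketch: the function name is made up, and the default file name `backup-metadata.json` is an assumption (the wizard's exact wording wasn't captured above), so pass whatever name the error message actually shows as the second argument.

```shell
# Return 0 if the backup folder contains a non-empty metadata file,
# 1 otherwise.
# $1 = backup folder
# $2 = metadata file name (optional; the default below is an assumption)
backup_has_metadata() {
    dir="$1"
    name="${2:-backup-metadata.json}"
    [ -s "$dir/$name" ]
}

# Example, run wherever the backup folder is reachable:
#   backup_has_metadata /mnt/tank/vcsa-backup || echo "backup looks incomplete"
```

A job that died at 95% would fail this check, which matches what the wizard complained about.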

One Scheduled Run

I noticed that overnight, I suppose, a scheduled job tried to run and provided yet another error message:

Well that’s still pretty vague. As far as I know there should be no connectivity issues, since files were created all the way up to 1.8 gigs, so I don’t see how it’s network or permissions related. It can’t be available space either, since the path has been emptied before every run and has already been shown to take 1.8 gigs.

Like seriously, what gives here? The fact there’s an entirely new KB with an entire table of shit that’s apparently wrong with this file based backup honestly begs the question: where the FUCK is the QA in software these days? This shit is just fucking ridiculous already…

Check the Logs

*This Log file only gets created the first time you click “configure” under the backup section of VAMI.

Here’s how to access the logs:

Using putty or similar, SSH in as root on the appliance.
Type shell at the prompt.
Type cd /var/log/vmware/applmgmt.
Type more backup.log or tail backup.log.

[VCDB-WAL-Backup:PID-42812] [VCDB::_backup_wal_files:VCDB.py:797] INFO: VCDB backup WAL start not received yet.
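The log gets chatty, so a small filter helps. A minimal sketch, assuming every entry carries its level as `] LEVEL: ` the way the INFO line above does, and that the levels of interest are named WARNING, ERROR and CRITICAL (both are assumptions; adjust the pattern to what your log shows). The function name is made up.

```shell
# Print the last few non-INFO lines from the appliance backup log.
# $1 = log file (defaults to the path used in the steps above)
backup_log_problems() {
    log="${1:-/var/log/vmware/applmgmt/backup.log}"
    grep -E '\] (WARNING|ERROR|CRITICAL): ' "$log" | tail -n 20
}

# On the appliance, from the bash shell:
#   backup_log_problems
```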

Checking the entry I find this thread, along with this Reddit post, which leads right back to the first shared thread, which has some bitching about the /etc/issue file… and I have a strange feeling that, just like the stuck @ 95% issue, I’ll look at the file and it will be correct, just like it was for the guy who created the Reddit post.

Try Alternative Protocols

When I tried alternative protocols I came across more issues:

NFS – Had the same path issue SMB did “Path not exported by remote system”

SCP – Was apparently silently dropped, much like what this thread mentioned. The amount of silence on that thread speaks volumes to me.

TFTP was also dropped.

You are so Fucked

Soo I wonder if I try to “upgrade” aka downgrade using the UI installer of a supposed version that works (7.0u2b)…

Alright so let me get this straight… I upgraded, and now I can’t make a backup because the upgraded version is completely broken in terms of its File Based Backups.

I can’t roll back the upgrade without having kept the old VCSA, which I had removed since everything else, vSphere itself, was working.

I can’t “downgrade” an existing one, I can’t make a backup to restore my old ones. OK fine, well how about a huge FUCK YOU VMWARE while I try to come up with some sort of workaround for this utter fucking mess.

Infected Mushroom – U R So F**ked [HQ & 1080p] – YouTube

Work around option #1

Build a brand new vCenter, add hosts, and reconfigure.

The main issue here is that all the VM-IDs will have changed, so if you rely on CBT you will be fucked and will have to:

1) Edit and adjust all backup jobs to point to the new VM, via its new VM-ID.

2) Let all the delta files be recalculated, which can be major I/O on storage units depending on many different factors (# of VMs, size of VMs, rate of change on the VMs, etc.).

Not an option I want to explore just yet.

Work Around option #2

Back up and restore the config database?

Let’s try.. first backup…

Copy the python scripts (hope they’re not all buggy and messed up too..)

Stop required services:

service-control --stop vmware-vpxd
service-control --stop vmware-content-library

change the script permissions

chmod +x backup_lin.py

Run it:

Make a copy of it via WinSCP.

run the restore script… and

well was worth a shot but that failed too….
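The stop, run, start sequence above can be wrapped so the services always come back, even when the script in the middle fails. A sketch only: the function name and the `SERVICE_CONTROL` override are made up for illustration (the override just allows a dry run with `echo`), and the start order shown is an assumption.

```shell
# Run a command with vpxd and the content library stopped, then start
# both again whether or not the command succeeded.
with_vc_services_stopped() {
    sc="${SERVICE_CONTROL:-service-control}"
    "$sc" --stop vmware-vpxd
    "$sc" --stop vmware-content-library
    # Remember the command's exit status without aborting the function.
    "$@" && rc=0 || rc=$?
    "$sc" --start vmware-vpxd
    "$sc" --start vmware-content-library
    return "$rc"
}

# Dry run that only prints the service-control calls:
#   SERVICE_CONTROL=echo with_vc_services_stopped true
# On the appliance (script arguments as given in the KB):
#   with_vc_services_stopped ./backup_lin.py ...
```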

Let’s try PG dump for shits…

I’d really recommend reading this blog post by Florian Grehl on virten.net for great information on using Postgres on vCenter.

Connect to server via SSH (SSH enabled required on vCenter).

“To connect to the database, you have to enable SSH for the vCenter Server, login as root, and launch the bash shell. When first connecting to the appliance, you see the “Appliance Shell”. Just enter “shell” to enter the fully-featured bash shell.

The simplest way to connect to the databases is by using the “postgres” user, which has no password. It is convenient to also use the -d option to directly connect to the VCDB instance.”

# /opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB

Cool, this lets us know the postgres DB service is running. The most important take away from Florian’s post is:

“When connecting, make sure that you use the psql binaries located in /opt/vmware/vpostgres/current/bin/ and not just the psql command. The reason is that VMware uses a more recent version than it is provided by the OS. In vSphere 7.0 for example, the OS binaries are at version 10.5 while the Postgres server is running 11.6”
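The same advice applies to the other client tools, not just psql. A small sketch of a helper that prefers the bundled binaries and only falls back to whatever is on PATH; the function name and the output file `/tmp/vcdb.sql` are hypothetical.

```shell
# Print the path to a Postgres client tool, preferring VMware's bundled
# binaries over the OS ones.
# $1 = tool name (psql, pg_dump, ...)
# $2 = directory to prefer (defaults to the path quoted above)
vpg_bin() {
    tool="$1"
    dir="${2:-/opt/vmware/vpostgres/current/bin}"
    if [ -x "$dir/$tool" ]; then
        printf '%s\n' "$dir/$tool"
    else
        # Fall back to PATH; returns non-zero if the tool is not found.
        command -v "$tool"
    fi
}

# On the appliance (not run here):
#   "$(vpg_bin pg_dump)" -U postgres -d VCDB -f /tmp/vcdb.sql
```

Calling the tool by its full path sidesteps a client/server version mismatch without touching anything under /usr/bin.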

Kool. I could use pg_dumpall, but I found it didn’t work (maybe that was mixed vCenter versions, not sure). Either way, let’s try just the VCDB instance…

Interesting, lol, as you see I got an error about a version mismatch. I found this thread about it and, with the info from Florian’s post, had an idea, tried it out, and it actually worked. Mind… BLOWN.

rm /usr/bin/

OK, let’s take this file and place it on the newly deployed vCenter.

Even though the restore appeared to have worked, the vCenter instance booted up looking like a new install. It was worth a shot I guess, but it did not work.

Work Around Option #3

I’m not sure this is even a fair option, as it only works if you have existing backups of another type. In my case I use Veeam and it’s saved my bacon I don’t know how many times.

Sure enough Veeam saved my bacon again. I ended up restoring a copy of my vCenter from before the 7.0u3a upgrade, which happened to be on 7.0u2d.

I managed to add a SMB path without it erroring, and unreal, I ran a File Based Backup and it actually succeeded!!

Now I just simply run the deploy wizard, and pick restore to build a new vCenter server from this backup.

Ahhh VMware… dammit you got me again!

alright fine… grabs yet another copy of vCenter…

and this time…

are you fucking kidding me? Mhmmm interesting… VCSA 7.0 restore issue – VMware Technology Network VMTN

ok… good to know…

From this… to this….

then Deploy again…

It stated it failed, due to user auth. However I was able to log in and verify it worked, but sadly it also instantly expired the license. I was hoping I could get another 60 days without creating a new vCenter, reconfiguring, and breaking my VM-IDs and CBT delta points for my backup software.

Even this link states what I’m trying to do is not possible… ugh the struggles are real!

In the end I just started from scratch. Ugh.

When to VMware Snapshot

OK, real quick short post here. I figured I’d take a snapshot of my vCenter server (the reason will be in the next blog post). In this case I decided to snapshot the VM with its memory saved, figuring it would be faster than bringing the VM back up from a shutdown state, which is what reverting to a normal snapshot would do.

In most cases that would probably be a fair assumption, but boy was I wrong.

It took a short time to save the snapshot but almost 15 minutes or more to bring the VM back to full operational status with all memory back intact… just check out these charts:

Here you can see it took maybe 5 minutes to save the memory state to disk; this would have been 0 minutes for a normal snapshot, since it doesn’t save memory to disk. Then you can see the slower, longer recovery time it took to read the same memory from disk and put it back into memory.

Of course, taking a solid guess that disk I/O is much slower than memory I/O, the bottleneck would have to be none other than the actual disk….

Yup, there are the same matching results of fast disk writes and slow disk reads…

and there’s the disk being 100% busy on the read requests. I’m not sure why the read performance on this drive was as bad as it was, but I have a feeling a regular boot would have been faster… I’ll update this post if I do an actual test.

Meh, pretty much same amount of time… I think I need some super fast local storage… yet I’m so cheap I never do…. cheap bastard…

SharePoint and AD Groups

Story Time

How I got here: new site created, using the same IDGLA group nesting as the old sites, with cross-forest users. Access denied, while the old sites work. If you want the short answer please see the Summary section. Otherwise read through and join me on a crazy carpet ride.

Reddit’s short and blunt answer states:

“No, because in ForestA, that user from ForestB is a Foreign Security Principal, which SharePoint does not support. You would have to add a user from ForestB to a security group in ForestB and then add that group to SharePoint.”

Which would make sense, if not for my old sites working in this exact manner.

To quench my ignorance I decided to remove the cross-domain group from SharePoint’s domain local group that was granted access to the site. At first it appeared as if removing the group did nothing to change the user’s access, until the next day, when it appeared to have worked and the user lost access to the site (test user in test enviro). I was a bit confused here and decided to use the permission checker on the site’s permissions page to see what permissions were actually given to the user.

Which did not show the domain local resource group which was supposed to be granting the user access. Instead the following was presented:

Limited Access          Given through "Style Resources Readers" group.

Rabbit Hole #1: Fundamentals

Of course looking up this issue also shows my TechNet post that was answered by the amazing Trevor Seward, which was a required dependency (at least for Kerberos). So that wasn’t the answer, and even in there we discussed the issue of nested groups, which in this case again followed the same IDGLA standard I used for the other sites. Something still smells off.

Digging a bit deeper I found some blog posts from a high-tier MS Senior Support Engineer (one can only dream) by the name of Josh Roark.

These revolved around “SharePoint: Cross-forest group memberships not reflected by Profile Import”, which brought me down some sad memory lanes about the pain and grind of SharePoint’s FIM/ADI and profile sync stuff. (*Future me: guess what, it has nothing to do with the problem.*) Funny enough, much like the Reddit answer, it ends up saying to use groups from each domain directly at the resource (SharePoint page/library, etc.) instead of relying on nested groups cross-forest.

Which again still doesn’t fully explain why old sites with the same design are in fact working. I’m still confused, if the answers from all those sites were correct, me removing a cross forest nested group should not affect users permissions on the resource, but in my test it did.

So the only thing I can think of is there some other odd magic going on here with the users profiles, and what groups SharePoint thinks they are a member of?

Following Josh’s post let’s see what matches we have in our design…

Consider the following scenario:

  • You have an Active Directory Forest trust between your local forest and a remote forest.
  • You create a “domain local” type security group in Active Directory and add users from both the local forest and the remote trusted forest as members.
  • You configure SharePoint Profile Synchronization to use Active Directory Import and import users and groups from both forests.

Check, check, oh wait: in this case there might have only ever been one import done, for the user domain and not the resource domain, as at one point in time they were separate. I’m not sure if this plays a role here or not, and it’s not exactly like I can talk to Josh directly. *Pipe Dreams*

And the differences in our design/problem.

  • You create an Audience within the User Profile Service Application (UPA) using the “member of” option and choose your domain local group.
  • You compile the audience and find that the number of members it shows is less than you expect.
  • You click “View membership” on the audience and find that the compiled audience shows only members from your local forest.

Nope, nope, and nope, in my case it’s:

  • You grant access to a SharePoint resource via an AD-based domain local group.
  • You attempt to access the resource as a user from a cross forest nested group.
  • You get Access Denied

He also states the Following:

“! Important !
My example above covers a scenario involving Audiences, but this also impacts claims augmentation and OAuth scenarios.

For example, let’s say you give permission to a list or a site using an Active Directory security group with cross-forest membership. You can do that, it works. Those users from the trusted forest will be able to access the site. However, if you run a SharePoint 2013-style workflow against that list, it looks up the user and their group memberships within the User Profile Service Application (UPA). Since the UPA does not show the trusted forest users as members of the group, the claims augmentation function does not think the user belongs to that group and the Workflow fails with 401 – Unauthorized.”

Alright so maybe this is still at play, HOWEVER, the answer is the same as the Reddit post, and that still doesn’t explain why the design is actually working for my existing sites. I must dig deeper, and in doing so you might just find what you are looking for as well. Which is funny, because he states you can do that and it does work. Then why is it not working for the Reddit user, why is it not working for my new site, and why is it working for the old sites? The amount of conflicting information on this topic is a bit frustrating.

Let’s see what we can find out.

Why does this happen?

“It’s a limitation of the way that users and groups are imported into the User Profile Service Application and group membership is calculated.”

Mhmmm, ok, I do remember setting this up, and again was only for the trusted forest, not for the local forest, as initially no users resided there.

“If you look at the group in Active Directory, you’ll notice that the members from your trusted forest are not actually user objects, instead they are foreign security principals.”

Here we go again, back to the FSP’s. In my case instead of user based SIDs in the FSP OU, I had Group SIDs, either way let’s keep going.

How do we work around this?

“The only solution is to use two separate groups in your audiences and site permissions.

Use a group from the local domain to include your local forest users and use a group from the trusted domain to include your trusted forest users.”

Well here we go again, same answer, but I already mentioned it is working on my old site, and even Josh himself initially even stated “For example, let’s say you give permission to a list or a site using an Active Directory security group with cross-forest membership. You can do that, it works. ” so is this my issue or not. Either way there was one last thing he provided after this statement…

Rabbit Hole #2 : UPSA has nothing to do with Permissions

Group / Audience Troubleshooting tips:

“It can be a bit difficult to tell which groups the User Profile Service Application (UPA) thinks a certain user is a member of, or which users are members of a certain group.

You can use these SQL queries against the Profile database to get that info.

Note: These queries are written for SharePoint 2016 and above. For SharePoint 2013, you would need to drop the “upa.” part from in front of the table names: userprofile_full, usermemberships, and membergroup. You only need to supply your own group or user name.”

OK sweet, this is useful, however I know not everyone who manages SharePoint will have direct access to the SQL server and its databases to do such lookups. The good news is I have some experience writing scripts that run queries from the SharePoint front end, as most FEs will have access to the DBs and tables they need. Thus no need for direct SQL access.

Let’s create a GitHub for these here. (I have to recreate this script as it got wiped in the test enviro rebuild.) *Note to self: don’t write code in a test enviro.*

So the first issue I had to overcome was knowing what DB the service was using, since it’s possible to have multiple service applications for similar services. Sure enough I got lucky and found a TechNet post of someone asking the same question, and lo and behold it’s none other than Trevor Seward answering the question on his own web site (I didn’t even know about this one). A little painful, but done. Unfortunately, since they could be named anything, I didn’t jump through more hoops to find and list the names of these UPAs, but I did code a line to inform users of my script of that issue and what to run to get the required name.

So with the UPA name in place, it’s scripted to locate the Profile DB, and run the same query against it.

OK, so after running my script, and validating it against the actual query that is run against the profile db, here’s what I found.

*Note* For the group name I simply entered %, the SQL wildcard (where you would usually expect *), since the group name is just a variable fed to the T-SQL LIKE operator.

anyway, the total groups returned was only 6, and only half of them were actually involved with SharePoint at all. I know there are WAY more groups within that user domain… so… what gives here?

*Note* Josh mentions the “Default Group Problem“, which after reading I do not use this group for permissions access and I do not believe it to be of concern or any root cause to my problem.

*Note 2* I lost the reference link, but I found you can use the following PowerShell (which, for unknown reasons, has to be run as the farm account):

$profileManager = New-Object Microsoft.Office.Server.UserProfiles.UserProfileManager(Get-SPServiceContext((Get-SPSite)[0]))
foreach($usrProfile in $profileManager.GetEnumerator()) { Write-Host $usrProfile.AccountName "|" $usrProfile.DisplayName; }

Well that didn’t help me much, other than to show me there’s a pretty stale list of users… which brought me right back to Josh Roark…

SharePoint: The complete guide to user profile cleanup – Part 4 – 2016 – Random SharePoint Problems Explained… Intermittently (joshroark.com)

ughhhhhh, **** me…..

So first thing, I check the User Profile Service on the Central Admin page of SharePoint. I noticed it states 63 profiles.

(($profileManager.GetEnumerator()).AccountName).Count

which matches the command I ran as the farm account, HOWEVER, what I noticed was that not all accounts were from the user domain; several of them were from the resource domain even though no ADI existed for it. Ohhh, the questions just continue to mount.

At this point I came back to this the next day, and had to re-orient myself as to where I was in this rabbit hole. That’s when I realized I was covering User Profile import/syncing between AD and SharePoint, and I asked myself “Why?”. AFAIK User Profiles have nothing to do with permissions?

Let’s find out. It’s test, so let’s wipe everything in the UPA and its services, all imports, and try to access the site…

Since I wasn’t too keen on this process I did a Google search and sure enough found another usual SharePoint blogger here

and look at that, the same command I mentioned above that you need to run as the farm account for some odd reason… so I created another script.

Well… I still have access to the old SP sites via that test account, and still not to the new site I created using the same cross-forest group structure… so this seems to support my assumption that UPA profiles have nothing to do with permission access…

One thing I did notice was once I attempted to access a site, the user showed up in the UPA user profile DB without having run a sync or import task.

Well since we are this deep…. let’s delete the UPA service all together and see what happens. Under the Central Admin navigated to manage service applications, click the UPA service, and delete at the top ribbon, there was an option to delete all associated data with the service application, yes please… and…

Everything still works exactly as it did before, proving this has nothing to do with permissions.

On a side note though, I did notice nothing changed in terms of my user details in the upper right corner, and while I have been down this other rabbit hole, I’m going to avoid it here. Lucky for me, someone else by the name of Mohammad Tahir has gone down this rabbit hole for me and has delightfully blogged about the entire thing himself, here. I really suggest you read it for a full understanding.

In short, that information is in the “User Information List” UIL, which is different from the data known by the UPSA, the service I just destroyed, however I will share the part where they link:

“The people picker does not extract information from the User Profile Service Application (UPSA). The UPSA syncs information from the AD and updates the UIL but it does not delete the user from it.”

Again in short, I basically broke what would be user information as seen on the sites if someone were to change their name and that change was only done on the authing source (Microsoft AD in this case). That change would not be reflected in SharePoint. At least in theory.

If you made it this far, the above was nothing more than a waste of time, other than to find out the UPSA has no bearing on permissions granted via AD groups. But if you need to clean up user information shown on SharePoint sites then you probably learnt a lot, but that’s not what this post is about.

So all in all, this is probably why there are resources online confusing the two as being connected, when it turns out… they are not.

Rabbit Hole #3: The True Problem – The Secure Token Service, or is it?

So I decided to Google a bit more and I found this thread with the same question: permissions – Users added to AD group not granted access in SharePoint – SharePoint Stack Exchange

and sure enough, what I’ve pretty much realized myself from testing so far appears to hold true on the provided answer:

“Late answer but, The User Profile is not responsible in this case.

SharePoint recognizes AD security groups and attaching permissions to these groups will cause the permissions to be granted to the User.

Unfortunately, due to SharePoint caching the user’s memberships on login, changes made to a security group are identified only after the cache has expired, possibly taking as long as 10 hours (by default) for it to happen.

This causes the unexpected behavior of adding a user to a Group and the user still being shown the access denied or lack of interface feedback related to the new permissions he should have received.

The user not having his tokens cached prior to being added to the group would cause him to receive access immediately.”

And that’s exactly the symptom I usually get: apply an AD group permission and it takes effect after some time (I assumed 24 hours because I’d test the next day, but this answer states “10 hours”). My question now would be, what cache is he talking about? Kerberos? Web browser?

“SharePoint caching the user’s memberships on login”? What logon, Computer/Windows Logon, or SharePoint if you don’t use SSO?

OK I’m so confused now. I did the same thing in my test enviro, and it seemed to work almost instantly; I did the same thing in production and it’s not applying. God I hate SharePoint….

I attempted an Incognito window, and that didn’t work… so not browser cache…

Logged into a workstation fresh, nope, so not Kerberos cache it seems, so what cache is he referring to?

So I decided to tighten my Google query, and I found plenty of similar issues stemming back a LONG time. security – Why are user permissions set in AD not updated immediately to SharePoint? – SharePoint Stack Exchange

In there, there’s conflicting information where someone again mentions the UPSA, which we’ve discovered ourselves to have no impact, and even that answer has been edited to indicate it may be false. The more common answer appears to match the “10 hours” “cache” mentioned above, which turns out to be…. *drum roll* … the “Security Token Service”.

Funny enough, when I went to Google a source for the SharePoint STS, I got a bunch of troubleshooting and error-related articles (*Google tryin’ to help me?*). Either way, sure enough I find an article by none other than our new SharePoint hero, Josh Roark (sorry Trevor). To my dismay it didn’t cover my issue, or how to clear or reset its cache… ok let’s keep looking…

A random GitHub page with some insights into the design ideology... useless, nothing about cache…

Found someone who posted a bunch of links around troubleshooting STS, but didn’t write anything themselves; all I found was a link to an MS blog post which literally copied and pasted Josh’s work. I guess with Josh being an MS employee, MS can take his work as their own without issue? Anyway let’s keep looking…

Funny, I finally found someone asking the question, and for the exact same reason I wrote this whole blog post about…. Also funny: the topics I find this deep have super low “like” counts, because so few people get this far down into the nitty gritty. And here’s the answer. Third funny thing: their question wasn’t how to do it, but what the effects of doing it are.

“When using AD groups, and adding or removing a user to that group, the permissions may not update as intended, given the default 10 hour life of a token.

I’ve read of an unofficial recommendation to shorten the token lifetime, but others have cautioned it can have adverse affect. With that, I’d rather leave it alone.

Is it safe to purge on demand?

Clear-SPDistributedCacheItem –ContainerType DistributedLogonTokenCache 

…or does it too have adverse affect?”

Answer of “It will degrade performance, but it is otherwise safe.”

OK finally, lets try this out.

I removed the group the test user was a member of in AD, which granted it contribute rights on the site.

After removing the Group I replicated it to all AD servers.

I checked the user permission via SharePoint permission checker, still showed user had contribute rights.

Ran the cmdlet mentioned above on the only SP FE server that exists, with all services running on it, including the STS.

Refreshed the permission page for SharePoint, checked user permissions… C’mon! It still says contribute rights; navigating the page as the user, yup… what gives?!?!?!!

Seriously, that was supposed to have solved it, what is it?!?!

Even going deep into the thread, he never responds as to whether it worked or not, only what others do; and as I read on, yes, lowering the cache threshold seems to be mentioned often. For the same reasons, I want permissions to apply more instantly instead of having this stupid 10/24 hour wait period between permission changes.

If manually clearing the cache doesn’t work, what is it? And again they bring up the misconception of the UPS/UPSA.

OMG and sure enough a TechNet post with the EXACT same problem, trying to do the exact same thing, and having the EXACT same Issue!!!

Wow…

Clear-SPDistributedCacheItem -ContainerType DistributedLogonTokenCache

didn’t work….

cmdlet above + iisreset

didn’t work

Reboot FE Server

Didn’t work

Another post with the same conclusion to bump the cache timeouts.

OK so here’s an article that the “MS tech” who answered the TechNet question referenced. I’ll give credit where it’s due: providing an answer and sourcing it is nice. Active Directory Security Groups and SharePoint Claims Based Authentication | Viorel Iftode

OK Mr. Iftode, what ya got for me…

The problem

The tokens have a lifetime (by default 10 hours). More than that, SharePoint by default will cache the AD security group membership details for 24 hours. That means, once the SharePoint will get the details for a security group, if the AD security group will change, SharePoint will still use the cache.

So the same 10 hour / 24 hour problem we’ve been facing this whole time, regardless of cross-forest, or single forest design.

Solution

When your access in SharePoint rely on the AD security groups you have to adjust the caching mechanism for the tokens and you have to adjust it properly everywhere (SharePoint and STS).

Add-PSSnapin Microsoft.SharePoint.PowerShell;

$CS = [Microsoft.SharePoint.Administration.SPWebService]::ContentService;
#TokenTimeout value before
$CS.TokenTimeout;
$CS.TokenTimeout = (New-TimeSpan -minutes 2);
#TokenTimeout value after
$CS.TokenTimeout;
$CS.update();

$STSC = Get-SPSecurityTokenServiceConfig
#WindowsTokenLifetime value before
$STSC.WindowsTokenLifetime;
$STSC.WindowsTokenLifetime = (New-TimeSpan -minutes 2);
#WindowsTokenLifetime value after
$STSC.WindowsTokenLifetime;
#FormsTokenLifetime value before
$STSC.FormsTokenLifetime;
$STSC.FormsTokenLifetime = (New-TimeSpan -minutes 2);
#FormsTokenLifetime value after
$STSC.FormsTokenLifetime;
#LogonTokenCacheExpirationWindow value before
$STSC.LogonTokenCacheExpirationWindow;
#DO NOT SET LogonTokenCacheExpirationWindow LARGER THAN WindowsTokenLifetime
$STSC.LogonTokenCacheExpirationWindow = (New-TimeSpan -minutes 1);
#LogonTokenCacheExpirationWindow value after
$STSC.LogonTokenCacheExpirationWindow;
$STSC.Update();
IISRESET
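Before pasting values like these into a farm, the warning in that script's comment can be sanity checked with plain arithmetic. The sketch below is ordinary Python, not a SharePoint API; the function name is made up, and treating the rule as "the cache expiration window must be strictly smaller than each token lifetime" (and applying it to the Forms lifetime too) is my assumption based on that comment.

```python
from datetime import timedelta


def check_sts_timeouts(windows_token_lifetime, forms_token_lifetime,
                       cache_expiration_window):
    """Return a list of problems; an empty list means the values look consistent.

    Assumption (from the script comment): LogonTokenCacheExpirationWindow
    must stay smaller than the token lifetimes it is paired with.
    """
    problems = []
    lifetimes = (
        ("WindowsTokenLifetime", windows_token_lifetime),
        ("FormsTokenLifetime", forms_token_lifetime),
    )
    for name, lifetime in lifetimes:
        if cache_expiration_window >= lifetime:
            problems.append(
                f"LogonTokenCacheExpirationWindow ({cache_expiration_window}) "
                f"must be smaller than {name} ({lifetime})"
            )
    return problems


# The values used in the script above: 2 minute lifetimes, 1 minute window.
result = check_sts_timeouts(
    timedelta(minutes=2), timedelta(minutes=2), timedelta(minutes=1)
)
print(result or "timeouts look consistent")
```

This only checks that the numbers are self-consistent. It says nothing about whether shortened values play nicely with Publishing sites.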

Well, that’s the exact same answer the MS tech provided. There’s no simple solution like clearing a cache on the STS or restarting the STS; none of it seems to work. Its cache is insanely persistent, apparently even across reboots.

I’ll try this out in the test enviro and see what it does. I hope it doesn’t break my site like it did the guy who asked the question on TechNet…. Here goes…

So the same thing happened to my sites. I’m not sure if it’s for the same reason….

Rabbit Hole #4: Publishing Sites

So just like the OP from that TechNet post I shared above, I set the timeouts back to default and the site started working, but that doesn’t answer the OP’s question, and it was left at a dead end there…

Also much like the OP’s, these sites had the “publishing feature” enabled and were using it.

I decided to look at the sources he shared to see if I could find anything else in more detail.

On the first link the OP shared, the writer made the following statement:

“Publishing pages use the Security Token Service to validate pages. If the validation fails the page doesn’t load. Team sites without Publishing enabled are OK as they don’t do this validation.”

Now I can’t find any white paper type details from MS on why this might be the case, but let’s just take this bit of fact as true.

The poster also made this statement just before that one:

“Initially we had installed the farm with United States timezone, when a change was made to use New Zealand time, the configuration didn’t fully update on all servers and the Security Token Service was responding with US Date format making things very unhappy.”

Here’s a thing that happened, the certs expired, but I also got an alert in my test stating “My clock was ahead”. At first I thought this was due to the expired certs. So I went and updated the certs, and also changed the timeout values back to default which made everything work again. However now that this info is brought to my attention I’m wondering if there’s something else at play here.

Especially since the second shared resource makes a similar suggestion…

Rabbit Hole #5: SharePoint Sites and the Time Zones

OK so here’s the first solution from this alternative shared post that had the same issue of sites not working after lowering the STS timeout threshold:

“Check if time zone for each web application in General Settings is same as your server time zone. Update time zone if nothing selected, run IISRESET and check if the issue is resolved.”

The second solution is the one we already showed worked, and was the answer the OP provided back to the TechNet post: set the thresholds back to default. That simply leaves you with the same permission issue of waiting 10/24 hours for new permissions to apply when they are changed in AD and not managed at SharePoint directly.

Now unlike the OP… I’m going to take a quick peek to see if the time zones on each site differ from the FE’s own time zone…

Here we gooo, ugghhhh

System Time Zone

w32tm /tz

For me it was CST and, gross, CDT (that terrible thing we call daylight saving time). I really hope that doesn’t come into play….

SharePoint Time Zone

Well I read up a bit on this from yet another SharePoint Expert “Gregory Zelfond” from SharePoint Maven.

Long story short: there’s a Region Time Zone (admin based, per site) and a Personal Time Zone. I’m not sure if messing with the Personal Time Zone matters, but I’ve been down enough rabbit holes, so I hope I can ignore that for now.

OK, so quick recap: I checked the site’s regional settings and the time zone matched the host machine, at least for CST. I couldn’t see any setting for daylight saving time in the SharePoint site time zone settings, so for all I know that could also be a contributing factor here. But for now we’ll just say it matches.

I also couldn’t find the “About me” option under the top right profile area, so I couldn’t directly check the Personal Time Zone that way. I was, however, able to go into the User Profile Service Application and use “Manage User Profiles” to verify there was no time zone set for the account I was testing with. I can again only assume that means it defaults to the site’s time zone.

If so, then there’s nothing wrong with any of the sites’ or server’s time zone settings.

So, checking my logs, I see the same out-of-range exceptions in the ULS logs:

SharePoint Foundation Runtime tkau Unexpected System.ArgumentOutOfRangeException: The added or subtracted value results in an un-representable DateTime. Parameter name: value at System.DateTime.AddTicks(Int64 value) at Microsoft.SharePoint.Publishing.CacheManager.HasTimedOut() at Microsoft.SharePoint.Publishing.CacheManager.GetManager(SPSite site, Boolean useContextSite, Boolean allowContextSiteOptimization, Boolean refreshIfNoContext) at Microsoft.SharePoint.Publishing.TemplateRedirectionPage.ComputeRedirectionVirtualPath(TemplateRedirectionPage basePage) at Microsoft.SharePoint.Publishing.Internal.CmsVirtualPathProvider.CombineVirtualPaths(String basePath, String relativePath) at System.Web.Hosting.VirtualPathProvider.CombineVirtualPaths(VirtualPath basePath, VirtualPath relativePath) at System.Web.UI.Dep..

OK… soooo….

Summary

We now know the following:

    1. The root issue is with the Security Token Service (STS), cause of:
      – Token lifetime is 10 hours ((Get-SPSecurityTokenServiceConfig).WindowsTokenLifetime)
      – SharePoint caches the AD security group details for 24 hours (([Microsoft.SharePoint.Administration.SPWebService]::ContentService).TokenTimeout)
    2. The only command we found to forcefully clear the STS cache didn’t work.
      – Clear-SPDistributedCacheItem -ContainerType DistributedLogonTokenCache
    3. The only other alternative suggestion was to shorten the STS and SharePoint cache settings, which breaks the SharePoint sites if they are using the Publishing feature.
      – No real answer as to why.
      – Maybe due to time zone.
      – Most likely due to the shortened cache times set.
    4. The User Profile Service HAS NO BEARING on site permissions.

So overall, it seems if you

A) Use AD groups to manage SharePoint Permissions and

B) Use the Publishing Feature

You literally have NO OPTIONS other than to wait 24 hours for permissions to be applied to SharePoint resources when the access permissions are managed strictly via Active Directory groups.

Well, after all those rabbit holes, I’m still left with a shitty taste in my mouth. Thanks MS for making a system that inherently has a stupid permission application system with a ridiculous caveat. I honestly can’t thank you enough, Microsoft.

*Update* I have a plan, which is to run the cache clear PowerShell cmdlet (the one mentioned above and linked to a TechNet post stating it doesn’t work), then recycle the STS app pool, and report my results. Fingers crossed…

Changing vCenter Hostname

Why?!?! Cause I gotta!

Source: Changing your vCenter Server’s FQDN – VMware vSphere Blog

PreReqs, AKA Checklist

  • Backup all vCenter Servers that are in the SSO Domain before changing the FQDN of the vCenter Server(s)
  • Supports Enhanced Linked Mode (ELM)
  • Changing the FQDN is only supported for embedded vCenter Server nodes
  • Products which are registered with vCenter Server will first need to be unregistered prior to an FQDN change. Once the FQDN change is complete they can then be reregistered.
  • vCenter HA (VCHA) should be destroyed prior to an FQDN change and reconfigured after changes
  • All custom certificates will need to be regenerated
  • Hybrid Linked Mode with Cloud vCenter Server must be recreated
  • vCenter Server that has been renamed will need to be rejoined back to Active Directory
  • Make sure that the new FQDN/Hostname is resolvable to the provided IP address (DNS A records)

NOTE: If the vCenter Server was deployed using the IP as PNID/FQDN, then the following should also be considered:

  • The PNID change workflow cannot be used to change the IP address of vCenter Server
  • The PNID change workflow cannot be used to change the FQDN of vCenter Server

In this scenario, use the vCenter Server Appliance Management Interface (VAMI) to update hostnames or IP changes directly. 

The main thing I was expecting was the certificate issue. In my home lab, I removed the SSO domain before this change (just using vsphere.local), no ELM, already using embedded (all-in-one), no VCHA, no Hybrid. Oh yeah…. not sure if you “leave an SSO domain” before joining back to AD…

My Only Pre-Req

I went into DNS and pre-created A host records for the new server hostname: vCenter.zewwy.ca
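Since the whole rename hinges on the new name resolving to the appliance's IP, it's worth checking that from the management machine first. Here's a minimal sketch in Python using only the standard library. The function name, the example hostname and the 192.0.2.10 address are placeholders of mine (that range is reserved for documentation), and the resolver is injectable so the example below can run without touching real DNS.

```python
import socket


def resolves_to(fqdn, expected_ip, resolver=socket.getaddrinfo):
    """Return True if fqdn resolves to expected_ip, False otherwise."""
    try:
        # getaddrinfo returns (family, type, proto, canonname, sockaddr) tuples
        infos = resolver(fqdn, None)
    except socket.gaierror:
        # Name does not resolve at all
        return False
    addresses = {info[4][0] for info in infos}
    return expected_ip in addresses


def demo_resolver(host, port):
    """Stand-in for real DNS so the example runs offline (hypothetical data)."""
    return [(socket.AF_INET, socket.SOCK_STREAM, 6, "", ("192.0.2.10", 0))]


print(resolves_to("vcenter.example.test", "192.0.2.10", resolver=demo_resolver))
```

Against real DNS you would drop the resolver argument and pass the new FQDN along with the vCenter's actual IP.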

Steps

Basically log into VAMI, and change the name.

Then

and…. well WTF…

No matter what I do it’s greyed out… I thought maybe the untrusted cert, might be an issue so tried from a machine with full trusted chain, and same issue!

Like…. Why… why is Next greyed out? It’s like whatever Button Validation code is written for it is not being triggered, is this a browser version issue? I can’t find anything online with anyone having this issue…. Why? Cause I was right, it was the input validation…

Honestly, this is one of those MASSIVE facepalm moments in my life. I only realized after the fact the username field was NOT auto filled, it was only a label that was greyed and provided as a suggestion… Fill both fields and the next is ungreyed…

Step 4, check the checkbox to acknowledge the warning, and away… she goes!

At which point I clicked redirect now (both web addresses were still available, and it didn’t seem to matter which you came from; the cert was untrusted either way, cause the CA is not in my trusted CA store)

5 minutes later….

I tell ya nothing more annoying than a spinning circle and the warning “don’t refresh” when the status bar simply does not move… sure got some conflicting messages here….

*Starts to sweat*…

after about 10 minutes time…

More Certificate Fun!

Alright, so after this, some quick takeaways… When I went to check the site it was “untrusted”, but not for the reason I had thought. I thought it would be the same issue as in the source blog, the hostname on the cert, but that was not the case. Instead it was simply that the cert chain seemed to be missing, and the issuer could not be verified:

as well as:

So what to do about this… You can download the CA cert from vcenter/certs/download.zip (some reason I had to use IE). Then install the CA cert. (I noticed even after I did this I still had cert warning, error, but after the next day, maybe cache clearing or update, it reported green in the web browser).

Now when I logged in, I got the ol Cert Alert in the vCenter UI

first thing to try is removing old CA’s

Which I did, following this VMware KB

I simply followed my other post about this, and just clicked Reset to Green on the alert. (Still good days later).

Backup Solutions

Don’t forget to change the server in your backup software. I had to do this in Veeam.

These were my results…

Which go figure errored out…

So right click, go to properties of the object… Next, next…

Accept the new certificate

Now you’d figure all is well, but when I went to create a new backup job and attempted to expand the vCenter server in Veeam, it just hung there…

I ended up rebooting the server, and then waiting for all the Veeam services to start. I reopened Veeam, went to Inventory, and clicked the vCenter server; it took a second and then showed all the hosts and the VMs. I clicked it and rescanned to be safe, and got this result, which was a bit different than the applied settings confirmation above. I think maybe I forgot to rescan the host after applying the new settings, assuming it would have done that as part of the properties change wizard.

which lucky for me now worked, and I was able to select a VM in the Veeam backup wizard, and it successfully backed up the VM.

Final Caveats

Like, what the heck: everywhere else it’s changed except at the shell. Let’s see if we can change this.

Well that was easy enough, no reboot required. 🙂

I also found the local hosts file doesn’t update either. The file states it’s managed by VAMI, so I may have to look there for potential solutions:

I noticed this since I had to do a work around for something else, and sure enough caught it. I’ll change it manually with vi for now and see what changes after a reboot.

Summary

Overall, literally quick n easy.

  1. Verify DNS records exist.
  2. Use VAMI to change the hostname under the Network MGMT settings, click apply and wait.
  3. Manually clear out the old certs that were created under the old hostname.
  4. Reconfigure your backup solution, which is vendor specific (I provided steps for Veeam, as that is the backup vendor I like to use)

Overall the task seemed to go pretty smooth. I’ll follow up with any other issue I might come across in the future. Cheers.

 

 

Activating Windows Offline

Story

Quick Story here, Installed a copy of Server 2019. System is completely offline, how to activate it?

I found a couple guides to help along the way, and even a nice thread post.

Issue

Main thing I found was a command to get you started:

slui 4

To my dismay I was greeted with an error, much like the thread poster:

"Can't activate Windows by phone."

If you keep reading, there are other potential reasons for activation to fail, but those usually show up afterwards with a dedicated error code, e.g. attempting to activate an eval edition, using a MAK key instead of a retail one, or using the wrong key with the wrong edition (Standard vs Datacenter).

The first example makes sense, as does the last one. In my case I was using the proper image downloaded from VLSC with the key from the same web portal, so I knew I was good on the first and last examples. The middle example, requiring a retail key, didn’t seem right, as I would assume any version would suffice. *NOTE* At this point I was merely assuming, as I couldn’t fully verify my key since I wasn’t using a VAMT server. Again, this is an offline activation.

Solution

Now for my realization: I had made yet another assumption, which was that slui 4 would provide a pop-up that lets you enter your product key before starting. The error message doesn’t exactly convey that, since all it says is: “Can’t activate Windows by phone.”

When in reality it should have simply stated “Please set a product key first”.

As it turns out, you have to use the Windows Software Licensing Management Tool, which can be accessed from an elevated command line using slmgr.vbs.

Yes that’s right a Visual Basic script. ;P.

C:\Windows\System32> slmgr.vbs /ipk <Key>

/dli (This will show basic license and activation information.)
/dlv (This will show detailed license and activation information.)
/xpr (This will show the current expiration date of the license installed which is most useful when using a KMS key with a local KMS activation server on the network.)
/upk (Be careful with this one as it will uninstall your current license key.)
/cpky (Also be careful with this one as this removes license key information from the registry.)
/ipk *****-*****-*****-*****-***** (This will change your license key to the one entered. If there was no key entered previously this command will also attempt to activate the license based on the license key type.)
/ato (This will force an online activation immediately. This could be useful if you have already entered the new key but was not online with either the KMS server for the network or unable to reach Microsoft’s activation servers.)

After doing this, then running slui 4 again, I was prompted with a screen asking me to select my Region. I then proceeded to use a Phone to call the toll free number provided and follow the IVR prompts to get a confirmation ID.

After entering the confirmation ID, I successfully activated Windows Offline. I did note one thing, that I told the IVR I did not have a smartphone so I did not get the web link as mentioned by others in the comment area. You can save this link and use it to do offline activations without calling in to the phone number from another machine that is online. If I manage to get this link I will share it as the commenters in that other post did not do the same.

Hope this helps someone.

 

Change vCenter FQDN or IP on Veeam

Story

I recently did an infrastructure upgrade on my home lab, which included moving all my ESXi hosts into a dedicated subnet and making them all more dependent on DNS. This has its pros n cons; after all, my ESXi hosts had their IP addresses changed. I also moved my vCenter and changed its IP address, which is now supported, yay.

Now I had to move Veeam along with it. Originally it was in the same subnet as the ESXi hosts and vCenter, which have all moved. Instead of trying to manage cross subnet comms, I changed Veeam’s IP address and pointed its DNS settings to my AD DNS, which has all the ESXi and vCenter host records. It was easy enough: just changing the Windows NIC IP address, and changing the VM’s VMPG.

 How to

Now when I went to scan the vCenter instance in Veeam, it complained about the certificate, since it was renewed by the vCenter upgrade. I decided I’d change it to be based on DNS now that everything else is as well. When I went to edit the object in Veeam, the field was greyed out. Lucky for me, Veeam had a KB ready to go.

Challenge

The Name/FQDN/IP of the vCenter Server has changed, and needs to be updated within Veeam Backup & Replication.

Solution

Solves Name Change Only
This solution applies ONLY if the vCenter Server database has not changed.
(I did an upgrade, so yes, this applies; you’d want this to preserve VM-IDs and chains)

If the Name/FQDN/IP of the vCenter changed due to a reinstall or upgrade, and a new vCenter database was used, the Ref-IDs will have changed. Due to the changed Ref-IDs you will need to follow the documented process in www.veeam.com/KB1299

Step 1

Prior to running the commands below you need to identify the Name\FQDN\IP Veeam is using to communicate with the VC currently. To do this, edit the entry for the vCenter under Backup Infrastructure and note the “Name:”.

Next perform the following steps to change that VMware Server’s name.

Step 2

Launch PowerShell from inside the Veeam Backup & Replication console. You can find the “PowerShell” button under the File-menu’s “Console” section.
The Veeam Backup & Replication PowerShell Toolkit will load.

Step 3

Run the following command:

$Servers = Get-VBRServer -name "old-name"

Replace old-name with the “Name” currently set for the vCenter in the Veeam Backup and Replication Console

Step 4

Run the following command next to change the name:

$Servers.SetName("new-name")

Replace new-name with the new name for the vCenter, this can be an IP, Hostname, or FQDN.
Do not remove the quotes on either side.
This change will go into effect as soon as the command in Step 4 completes.

How I did it – One Liner

Verify:

Get-VBRServer -name "Name from Step 1"

Change:

(Get-VBRServer -name "Name From Step 1").SetName("new.domain.com")

Results:

Now you can click Next, then Apply. It should get right past checking the certificate if the certificates are all good… and you end up with the following after a rescan:

That was easy enough. I don’t fully understand why they grey out the UI for making this change, but there you have it. Happy Backups!

vSphere HA Agent cannot be correctly installed or configured… again

Story

Another vCenter Patch, Another problem 😀

This seems to be a reoccurring story these last couple posts…

Error on Host

This time, after updating again, a host in the cluster had the error message.

Troubleshooting

Unlike the last time this happened, the event log wasn’t as blatantly flooded with complaints about /tmp being full. I checked the host with

vdf -h

which showed /tmp only 90% full. That’s still pretty high, and might explain the one log event that I did see about it:

The ramdisk 'tmp' is full. As a result, the file /tmp/img-stg/data/vmware_f.v00 could not be written

Which was in the log right after this event of attempting to install a base ESXi image?

Installing image profile '(Updated) HPE-ESXi-Image' with acceptance level checking disabled

This seemed a bit weird, but I couldn’t find any info other than what’s usually a very Microsoft type answer of “you can just ignore it” or “usually this is not an issue, it’s just vCenter saying it is connecting to the ESXi host and installing its agent”.

OK I guess… moving on… the very next error event was:

Could not stage image profile '(Updated) HPE-ESXi-Image': ('VMware_bootbank_vmware-fdm_7.0.2-18455184', '[Errno 28] No space left on device')

Huh. Now note this host was installed running the official VMware image provided by HPE for this exact hardware, supported by the VMware HCL, so there should be no funny business. However, I feel maybe there’s a bit of the known HPE bug at play, as mentioned the last time this happened. It just hasn’t fully flooded /tmp yet.

Lil Side Trail

So, a couple things to note here. First, the ESXi image is installed on a USB/SD card style setup, so it should be well known that you need to define the persistent log location as well as the scratch location. However, not many sources mention changing the system swap location.

  1. Persistent Log; VMware KB; Tech Blogger
    (Most standard ESXi Log info)
  2. Scratch Log: VMware KB; Tech Blogger 1; Tech Blogger 2
    (Crash Logs, Support log creations)
  3. Swap Location: VMware Doc 1 (Configure), VMware Doc 2 (About), Tech Blogger, who seems to regurgitate the exact About page from VMware.

However, researching this even more lots of posts on reddit mentioned the swap file for VM’s being on their VM directories, so if using a shared datastore they will reside there, and I shouldn’t see issues around swap usage at all at the host level.

If you look at an ESXi host in the vCenter Web UI, there are two options available: VM – Swap, and System Swap.

The VMware docs don’t seem to accurately describe the difference between these two options.

Looking up the error about not being able to stage the file, I found this one blog post, which of course mentioned changing the swap location to get past the error…

The main thing mentioned by the blogger is “The problem is caused by ESXi not having enough free space available to extract the installation packages.” but he failed to specify where exactly that is, and the event log didn’t specify it either. Now, since his solution was to adjust the system swap location, it raises the question: is the package extraction location the System Swap location?

Since the host settings seem to be only specified with the alternative option checkboxes as:

Can use host cache
Can use datastore specified by host for swap files

It’s still not fully clear to me where the swap is actually located with these (assumed default) settings, whether extraction of the image actually uses swap, or why the same image already on the ESXi host is being re-applied when you upgrade vCenter.

Resolution

So many questions, so few answers. Unfortunately I’m going to go on a bit of a whim and simply try exactly what I did before: clear the file from the /tmp location that was taking up a lot of its space and install the HPE patch for the known bug, in hopes it resolves the issue….

Sure enough the exact same thing happened, as in my initial post it just seems it wasn’t fully full. So the symptoms were just a bit different.

  1. vMotion all VMs to another host in the cluster (amazing vMotion works without issue)
  2. Ignore the HA warning on the VMs migrated
  3. Place Host into Maintenance mode (This clears the HA warnings on the VMs and cluster)
  4. Verify /tmp has room. Update any ESXi packages from the hardware vendor if applicable.
  5. Reboot the host.
  6. Exit Maintenance mode.

Hope this helps someone who might see the same type of error events in their ESXi event logs.

Clear vCenter Alert Certificate Status

Story

So lately I updated a couple vCenter Servers, and in the process I hit a couple errors that required some resolving…

  1. Expired Certs on Source vCenter
  2. Error [500] Auth Provider, due to something, potentially bad certs.
  3. An HPE Bug, filling up ramdisk, causing HA config issues.
  4. Change in security process; preventing login.

The Problem

So a couple hiccups along the way. And now it’s time to resolve this one…

Yeahhhh, an alert on certificates… Seems like VMware and certificate management are like Oil n Water. They don’t mix well.

I’ve had some terrible times managing certificates with VMware. However, as blogged about here, it seems there’s finally a way to use your own certificates via the WebUI.

Anyway… to the point, you figured you simply navigate to the vCenter WebUI -> Home -> Administration -> Certificates. Only to realize there’s nothing reporting as invalid or expired.

Checking for Expired Certs

What gives? Ahhh yes, more hidden secret stuff that is not in your face when it comes to the WebUI. Can you guess? That’s right another VMware KB

So while the other issues I’ve mentioned do have references and scripts in relation to certs, the only “check” in those previous posts was using openssl on the VCSA shell to grab the certificate from the listening service on the dedicated port. That was based on a particular symptom which spurred that check. So here’s the KB telling you how to actually check the certificates, the easiest way I’ve found so far (no check.py Python script needed)

for store in $(/usr/lib/vmware-vmafd/bin/vecs-cli store list | grep -v TRUSTED_ROOT_CRLS); do echo "[*] Store :" $store; /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store $store --text | grep -ie "Alias" -ie "Not After";done;
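That one-liner prints every alias and expiry date, and you still have to eyeball which ones are in the past. If you copy the output off the appliance, a few lines of Python can flag the expired entries for you. This is a rough sketch: the function name is mine, the sample text is made up, and it assumes the lines look like "[*] Store : NAME", "Alias : VALUE" and "Not After : Mon DD HH:MM:SS YYYY GMT" (the openssl-style date with English month names), so adjust the parsing if your output differs.

```python
from datetime import datetime, timezone


def find_expired(listing, now=None):
    """Return (store, alias, not_after) for each entry already expired.

    `now` must be a timezone-aware datetime; it defaults to the current UTC time.
    """
    now = now or datetime.now(timezone.utc)
    expired = []
    store = alias = None
    for raw in listing.splitlines():
        line = raw.strip()
        lowered = line.lower()
        if line.startswith("[*] Store"):
            store = line.partition(":")[2].strip()
        elif lowered.startswith("alias"):
            alias = line.partition(":")[2].strip()
        elif lowered.startswith("not after"):
            # Collapse padding such as "Jan  5" before parsing the date
            stamp = " ".join(line.partition(":")[2].split())
            not_after = datetime.strptime(
                stamp, "%b %d %H:%M:%S %Y GMT"
            ).replace(tzinfo=timezone.utc)
            if not_after < now:
                expired.append((store, alias, not_after))
    return expired


# Made-up sample in the assumed format, not real appliance output
SAMPLE = """[*] Store : TRUSTED_ROOTS
Alias : aaaa1111
            Not After : Jan  5 12:00:00 2019 GMT
Alias : bbbb2222
            Not After : Mar 14 08:30:00 2031 GMT
"""

for store_name, cert_alias, when in find_expired(SAMPLE):
    print(f"{store_name}: {cert_alias} expired {when:%Y-%m-%d}")
```

It only reports. Deleting anything is still a manual, deliberate step.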

That’s it! :D…. Just like the KB indicated, this showed which cert was bad: in this case, an old Root CA that was used in previous deployments of vCenter before upgrades. So it turns out that even though you follow the required KB to get past the pre-check of expired certs, it doesn’t delete the old certificate’s CA cert.

There it is, the second CA cert with expiry in 2019… OK so… You’d figure it would be easy to clean this up, but remember you couldn’t even see it in the WebUI, so you best believe there is no WebUI way to do this that protects you from human error.

Removing old Expired Certs

Instead, very brilliantly, you get… yes another KB! Booo Yeah… So let’s do this!

The main thing to note about this is…

Certificates are copied back to the VECS store because the CA Certificate which is expiring is published to the VMware Directory Service (VMDIR). When the Certificate is removed from VECS, VMDIR adds the Certificate back to VECS during a sync operation. This is done in order to ensure the integrity of the TRUSTED_ROOTS Certificate store, as deletion of an incorrect Certificate from this store could cause the environment to be irreparably damaged.

OK…. All I take away from this is certs are important, so they have a second cert store as a backup to the first cert store… that’s all I can take away from this odd statement.

/usr/lib/vmware-vmafd/bin/vecs-cli entry list --store TRUSTED_ROOTS --text | less

“Find the Certificate you wish to remove and make a note of the Alias and the X509v3 Subject Key Identifier.

Note: There Could be several Certificates to remove. Any expired and not in use certificates should be removed to avoid certificate related alarms.”

Yes that is the plan…

List the trusted certs published to the VMware Directory Service using the following command (administrator@vsphere.local password required). This command is in the same location as vecs-cli:

/usr/lib/vmware-vmafd/bin/dir-cli trustedcert list

Huh… in this case it looks like it is not here, so I should be safe to delete it from the normal store and it shouldn’t auto populate back in.

If you do see it (CN equal to x509v3 Key Identifier) then follow the linked KB to remove it, which seems to save a copy of the cert and use that saved copy to run another command to remove it from the store… super weird.
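The "CN equal to Key Identifier" comparison is easy to fumble by eye, since the two tools don't necessarily print the hex the same way. A tiny Python helper can normalize both before comparing. This is just a sketch under assumptions: the function names and the sample values are hypothetical, and I'm assuming one side may be colon separated and differently cased than the other.

```python
def normalize_key_id(value):
    """Strip colons and whitespace and lowercase the hex so formats compare equal."""
    return "".join(ch for ch in value if ch not in ": \t").lower()


def is_published(subject_key_id, published_ids):
    """Return True if the key identifier matches any ID from the directory listing."""
    wanted = normalize_key_id(subject_key_id)
    return any(normalize_key_id(entry) == wanted for entry in published_ids)


# Hypothetical values for illustration only
print(is_published("AB:CD:EF:01", ["abcdef01", "11223344"]))  # match
print(is_published("AB:CD:EF:01", ["11223344"]))              # no match
```

If it reports no match, that lines up with the case above where the cert isn't in VMDIR and can be deleted from the VECS store directly.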

/usr/lib/vmware-vmafd/bin/vecs-cli entry delete --store TRUSTED_ROOTS --alias 3276134ad93b3688b5dc5dcfaa402e9bfd7af12f

Restart all services on the PSCs and on the vCenter Servers and ensure that all services start and respond normally and that you can log in and manage the environment.

service-control --stop --all
service-control --start --all

Took a lil while, then logging in… alert still there. I guess I just have to Reset to Green?

For now I clicked the Reset to Green link. Even after yet another vCenter patch, the alert did not show up anymore. Yay.