ESXi /tmp is Full

I’ll keep this post short and to the point. I got errors in the alerts.

I was like, huh, interesting… so I went to validate it on the host by logging in via SSH and typing the command:

vdf -h

At the bottom you can see /tmp space usage:

I then found out about this cool command from this thread:

find /tmp/ -exec ls -larth '{}' \;

This will list all the files and their sizes to gander at, and that’s when I noticed a really large file:
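If the listing is too long to eyeball, here’s a variant I like (a sketch; du, sort, and head are in ESXi’s busybox, though flag support can vary by build) that sizes everything and shows the biggest files first:

```shell
# Size every file under /tmp (in KB) and show the five biggest, largest
# first. On a full /tmp the culprit usually jumps straight to the top.
find /tmp -type f -exec du -k {} + 2>/dev/null | sort -rn | head -n 5
```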

I decided to look up this file and found this lovely VMware KB:

The Workaround:

echo > /tmp/ams-bbUsg.txt
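Worth noting why the workaround truncates the file instead of deleting it (my understanding, not stated in the KB): the ams process still holds the file open, so rm would leave the space allocated until the handle closes, while truncating in place frees it immediately. A quick sketch of the same idea on a demo file:

```shell
# "echo > file" truncates in place (leaving a single newline);
# ": > file" does the same but leaves exactly 0 bytes. Either way the
# inode survives, so a process writing to it carries on happily.
F=/tmp/ams-bbUsg-demo.txt        # demo path, not the real ams file
echo "pretend this is megabytes of bloat" > "$F"
: > "$F"
ls -l "$F"
```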

The solution:

To fix the issue, upgrade VMware AMS to version 11.4.5 (included in the HPE Offline Bundle for ESXi version 3.4.5), available at the following URLs:

HPE Offline Bundle for ESXi 6.7 Version 3.4.5

https://www.hpe.com/global/swpublishing/MTX-a38161c3e8674777a8c664e05a

HPE Offline Bundle for ESXi 6.5 Version 3.4.5

https://www.hpe.com/global/swpublishing/MTX-7d214544a7e5457e9bb48e49af

HPE Offline Bundle for ESXi 6.0 Version 3.4.5

https://www.hpe.com/global/swpublishing/MTX-98c6268c29b3435e8d285bcfcc

Procedure

  1. Power off any virtual machines that are running on the host and place the host into maintenance mode.
  2. Transfer the offline bundle onto the ESXi host local path, or extract it onto an online depot.
  3. Install the bundle on the ESXi host.
    1. Install remotely from a client, with the offline bundle contents on an online depot:
      esxcli -s <server> -u root -p mypassword software vib install -d <depotURL/bundle-index.xml>
    2. Install remotely from client, with offline bundle on ESXi host:
      esxcli -s <server> -u root -p mypassword software vib install -d <ESXi local path><bundle.zip>
    3. Install from ESXi host, with offline bundle on ESXi host:
      esxcli software vib install -d <ESXi local path><bundle.zip>
  4. After the bundle is installed, reboot the ESXi host for the updates to take effect.
  5. (Optional) Verify that the vibs on the bundle are installed on your ESXi host.
    esxcli software vib list
  6. (Optional) Remove individual vibs. <vib name> can be identified by listing the vibs as shown in #5.
    esxcli software vib remove -n <vib name>

Summary

Use the commands shown to trace the source of the usage; your case may not be as easy as mine. Once the culprit is found, hopefully a solution follows. In my case I got super lucky: other people had already found both the problem and the solution.

Veeam – More Than One Replica Candidate Found

Story Time!

The Problem!

So, a real quick one here. I edited a replication job and changed its source from production to a backup dataset within the Veeam replication job settings. I went to run the replication job and was presented with an error I had not seen before…

I had an idea of what happened (I believe the original ESXi host might have been rebuilt). I’m not 100% sure, just speculating, but I was pretty sure the change I made on the job was not the source of the problem.

Since I wasn’t concerned about the target VM being re-created entirely, I decided to go to Veeam’s Replicas, right-click the target VM, and pick Delete from Disk… to my amazement the same error was presented…

Alright… kind of sucks, but here’s how I resolved it.

The Solution

Sadly I had to right-click the target VM under Veeam’s Replicas and instead pick “Remove from Configuration”. What’s really annoying about this is that it removes the source VM from the replication job itself as well.

Why? Dunno, Veeam’s coding choices...

So after successfully removing the target VM from Veeam’s configuration, I manually deleted the target VM on the ESXi host. Then I had to reconfigure the replication job and point it at the source VM again. Again, if you’re interested in why that’s the case, see the link above.

After that the job ran successfully. Hope this helps someone.

Exchange Certificates and SMTP

Exchange and the Certificates

Quick post here… if you need to change certificates on an SMTP receiver using TLS, how do you do it?

You might be inclined to search and find this MS Doc source: Assign certificates to Exchange Server services | Microsoft Docs

What you might notice is how strangely the UI is designed: you simply find the certificate and, in its settings, check off SMTP to use it for that service.

Then in the connectors options, you simply check off TLS.

Any sensible person might soon wonder… if you have multiple certificates that can all have the SMTP checkbox enabled, and you can have multiple connectors with the TLS checkbox enabled… then… which cert is being used?

If you have any familiarity with IIS, you know that you can have multiple sites, and when you enable HTTPS per site you define which cert to use (usually implying the use of SNI).

When I googled this I found someone with a similar question who was receiving an unexpected cert when testing their SMTP connections.

I was also curious how you even check that, and couldn’t find anything native to Windows; just Python or OpenSSL binaries were required.
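For the record, the OpenSSL route isn’t bad: s_client can speak STARTTLS on port 25 and show you which certificate the connector actually served. A sketch (the mail host below is a placeholder, and the demo cert just stands in for what a real server would hand back):

```shell
# Against a live server you'd run the s_client half and pipe it into
# x509 to decode the served certificate:
#   openssl s_client -connect mail.example.com:25 -starttls smtp </dev/null \
#     | openssl x509 -noout -subject -issuer -dates
# The decode half, demonstrated on a throwaway self-signed cert:
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=smtp.demo.local" \
  -keyout /tmp/smtp-demo-key.pem -out /tmp/smtp-demo-cert.pem -days 1 2>/dev/null
openssl x509 -in /tmp/smtp-demo-cert.pem -noout -subject -issuer -dates
```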

Anyway, from the first post it seems my question was answered. In short: “Magic”…

“The Exchange transport will pick the certificate that “fits” the best, based on the if its a third party certificate, the expiration date and if a subject name on the certificate matches what is set for the FQDN on the connector used.” -AndyDavid

Well that’s nice…. and a bit further down the thread someone mentions you can do it manually, where they cite none other than the Exchange guru himself, Paul Cunningham.

So that’s nice to know.

The Default Self Signed Certificate

You may have noticed a fair amount of chatter in that first thread about the default certificate. You may have even noticed some stern warnings:

“You can’t unless you remove the cert. Do not remove the built-in cert however. ” “Yikes. Ok, as I mentioned, do not delete that certificate.”-AndyDavid

Well, the self-signed cert looked like it was due to expire soon, and I was kind of curious: how do you create a new self-signed certificate?

So I followed along, and annoyingly you need an SMB shared path accessible to the Exchange server to accomplish this task. (I get it; for clustered deployments)

Anyway, I did this, used the UI to assign the certificate to all the required services, deleted the old self-signed cert, waited a bit, closed the ECP, reopened it and….

I managed to find this ms thread with the same issue.

The first main answer was to “wait an hour or more”; yeah, I didn’t think that was going to fix it…

KarlIT700 – ”

Our cert is an externally signed cert that is due to expire next year so we wanted to keep using it and not have to generate a new self sign one.

We worked around this by just running the three PS commands below in Exchange PS

Set-AuthConfig -NewCertificateThumbprint <WE JUST USED OUR CURRENT CERT THUMPRINT HERE> -NewCertificateEffectiveDate (Get-Date)
Set-AuthConfig -PublishCertificate
Set-AuthConfig -ClearPreviousCertificate


Note: that we did have issues running the first command because our cert had been installed NOT allowing the export of the cert key. once we reinstalled the same cert back into the (local Computer) personal cert store but this time using the option to allow export of the cert key, the commands above worked fine.

We then just needed to restart ISS and everything was golden. :D”

Huh, sure enough, there’s this MS KB on the same issue…

The odd part is running the validation cmdlet:

(Get-AuthConfig).CurrentCertificateThumbprint | Get-ExchangeCertificate | Format-List

did return the certificate I renewed via the ECP web UI… even then I decided to follow the rest of the steps, just as Karl mentioned, using the thumbprint from the only self-signed cert that was there.

Which sure enough worked, and everything was working again with the new self-signed cert.

Anyway, figured maybe this post might help someone.

vCenter Appliance Failed File Based Backup

Story Time

*UPDATE* VMware has pulled this garbage mess of an update version of vSphere. Why?

1) They PSOD ESXi Hosts...

2) Broke more shit than they fixed...

3) Broke and silently removed protocols for File Based Backups (This post)

As much as the backup failed, I failed along with it.

Task: back up the vCenter server using VAMI to create a file-based backup.

Now for an ESXi host, you can do this super easily (at least for the config: install fresh and simply load the config back).

For a deeper and better understanding of backing up and restoring ESXi hosts, please read this really amazing blog post by Michael Bose from NAKIVO.

Back up ESXi configuration:

vim-cmd hostsvc/firmware/backup_config

and you will get a simple URL to download the file right to your management machine/computer.

Does vCenter have something like this? (from my research…) No.

You use the vCenter Server Interface to perform a file-based backup of the vCenter Server core configuration, inventory, and historical data of your choice. The backed-up data is streamed over FTP, FTPS, HTTP, HTTPS, SFTP, NFS, or SMB to a remote system. The backup is not stored on the vCenter Server.

Which hasn’t been updated since 2019. Let’s make a couple of things clear here:

  1. The HTTP and HTTPS mentioned above are not like the ESXi style, where a nice backup file is created locally on the VCSA and you’re presented with a simple URL to navigate to and download it. It expects the HTTP/HTTPS target to be a file server that accepts file transfers (like Dropbox).
  2. Lots of these “supported” protocols have pretty bad bugs, or simply don’t work at all. Which we’ll see below.

Doing the Theory

So OK, I log into VAMI, click the Backup tab on the left-hand nav, and try to add an open SMB path I have available to use because, why not, make my life somewhat easy…

Looking this up I get: VAMI Backup with SMB reports error: “Path not exported by the remote filesystem” (86069) (vmware.com) dated Oct 28,2021. Nice, nice.

Alrighty then, I’ll just spin up a dedicated FTP service on my FreeNAS box, I guess. I learnt a couple of things about chroot and local users via FTP, but the short and sweet of it: I created a local account on the FreeNAS box, created a dataset under the existing mounted logical volume, and granted that account access to the path. Then I enabled local user login for the FTP server, specified that path as the user’s home path, and enabled chroot on the FTP service, so when this user logs in all they can see is their home path, which to that user appears as root. This (I felt) was a fair bit of security on it, even though it’s a lab and not needed, just nice…. ANYWAY… once I had an FTP server ready….
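For anyone doing the same by hand instead of through the FreeNAS UI, the toggles roughly map to a proftpd.conf fragment like this (FreeNAS’s FTP service is proftpd under the hood; the account name is just my example):

```
DefaultRoot ~              # chroot: logged-in users see their home as /
RequireValidShell off      # allow a no-shell local service account
<Limit LOGIN>
  AllowUser ftpbackup      # the dedicated local account
  DenyAll
</Limit>
```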

Now I went to start a file-based backup of the vCenter server:

First Error: Service Not Running

In my case I got an error that the PSC Health service was not running; this might just be because my lack of decent hardware caused some services to not start up in a timely manner. Either way, I navigated to Services in VAMI and started the PSC Health service. Lucky for me there were no further errors on this part.

If you have service errors you will have to check them out and get the required services up and running, which is out of the scope of this post.

Second Error: Number of Connections

The next error I got complained about the allowed number of connections to the target.

In my case there was an option for this in the FreeNAS FTP service configuration; I adjusted it to “0” (unlimited) in hopes of resolving the problem:

Restart the service, and try again…

Third Error: Unknown

This is starting to get annoying…

What kind of vague error is that?!

A guy in this thread states the path has to be empty? What?

I tried that, cleared some more space, and it seems to have sort of worked?

Clear the FTP user’s home path, and try again:

Fourth Problem: Stuck @ 95%

The Job appeared to run but I noticed a couple things:

1) Even though the backup config said the overall size would only be roughly 400MB, the job ran to around 1.8 Gigs.

2) All I/O appeared to stop and all resources returned to an idle state, while the job remained stuck processing at 95%.

OK… I found this thread, which suggested restarting the Auto Deploy service; tried that and it didn’t work, the job remained stuck @ 95%.

I also found this VMware KB, however:

1) I have a tiny deployment, so there’s no chance my DB would be 300 gigs.

2) When I went to check the “buggy Python script”, the “workaround” seemed to have already been implemented. So the version of vCenter I was on (7.0u3a) already had this “fix” in place.

3) The symptoms still remain exactly the same, and the Python scripts remain in a “sleeping” state.

FFS already….

Try Anyway

Well I saw the files were created, so I decided to try the restore method on the VCSA deployment wizard anyway…

I forgot to take a snippet here, but it basically stated there was a missing metafile.json file. I can only assume that when the backup process got stuck at 95% it never created this required JSON file…

FUCK….

One Scheduled Run

I noticed that overnight a scheduled job tried to run and produced yet another, different error message:

Well, that’s still pretty vague. As far as I know there should be no connectivity issues, since files were created all the way up to 1.8 gigs. So I don’t see how it’s network or permissions related, or even available space in this case, since the path was emptied every time, and 1.8 gigs had already been shown to write successfully.

Like seriously, wtf gives here. The fact there’s an entirely new KB with an entire table listing the shit that’s apparently wrong with this file-based backup honestly begs the question: where the FUCK is the QA in software these days? This shit is just fucking ridiculous already…

Check the Logs

*This Log file only gets created the first time you click “configure” under the backup section of VAMI.

Here’s how to access the logs:

Using PuTTY or similar, SSH in as root on the appliance.
Type shell at the prompt.
Type cd /var/log/vmware/applmgmt.
Type more backup.log or tail backup.log.
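Once you’re in there, backup.log gets noisy fast; a quick filter (just a sketch using busybox-friendly grep/tail) pulls out the interesting lines:

```shell
# Point LOG at the VAMI backup log; override the variable to run the
# same pipeline against any file.
LOG=${LOG:-/var/log/vmware/applmgmt/backup.log}
grep -iE 'error|warn|fail' "$LOG" 2>/dev/null | tail -n 20
```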

[VCDB-WAL-Backup:PID-42812] [VCDB::_backup_wal_files:VCDB.py:797] INFO: VCDB backup WAL start not received yet.

Checking the entry, I find this thread, along with this Reddit post, which leads right back to the first shared thread, where there’s some bitching about the /etc/issues files… and I have a strange feeling that, just like the stuck-at-95% issue, I’ll look at the file and it will probably be correct, just like it was for the guy who created the Reddit post.

Try Alternative Protocols

When I tried alternative protocols I came across more issues:

NFS – Had the same path issue SMB did: “Path not exported by the remote filesystem”.

SCP – Was apparently silently dropped, much like what this thread mentioned. The amount of silence on that thread speaks volumes to me.

TFTP was also dropped.

You are so Fucked

Soo, I wonder if I try to “upgrade” (aka downgrade) using the UI installer to a version that supposedly works (7.0u2b)…

Alright, so let me get this straight… I upgraded, and now I can’t make a backup because the upgraded version is completely broken in terms of its file-based backups.

I can’t roll back the upgrade without having kept the old VCSA, which was removed in my case since all other services were working, vSphere itself included.

I can’t “downgrade” an existing one, and I can’t make a backup to restore my old one. OK, fine, well how about a huge FUCK YOU VMWARE, while I try to come up with some sort of workaround for this utter fucking mess.

Infected Mushroom – U R So F**ked [HQ & 1080p] – YouTube

Work around option #1

Build a brand new vCenter, add hosts, and reconfigure.

The main issue here is the fact that if you rely on CBT, you will be fucked: all the VM-IDs will have changed, so you will have to:

1) Edit and adjust all backup jobs to point to the new VM, via its new VM-ID.

2) Let the delta files all be recalculated (which can be major I/O on storage units depending on many different factors: # of VMs, size of VMs, change rate of files on VMs, etc.).

Not an option I want to explore just yet.

Work Around option #2

Back up and restore the config database?

Let’s try… first, the backup…

Copy the Python scripts (hoping they’re not all buggy and messed up too…).

Stop required services:

service-control --stop vmware-vpxd
service-control --stop vmware-content-library

Change the script permissions:

chmod +x backup_lin.py

Run it:

Make a copy of it via WinSCP.

run the restore script… and

Well, it was worth a shot, but that failed too….

Let’s try pg_dump for shits…

I’d really recommend reading this blog post by Florian Grehl on Virten.net for great information about using Postgres on vCenter.

Connect to the server via SSH (SSH must be enabled on vCenter).

“To connect to the database, you have to enable SSH for the vCenter Server, login as root, and launch the bash shell. When first connecting to the appliance, you see the “Appliance Shell”. Just enter “shell” to enter the fully-featured bash shell.

The simplest way to connect to the databases is by using the “postgres” user, which has no password. It is convenient to also use the -d option to directly connect to the VCDB instance.”

# /opt/vmware/vpostgres/current/bin/psql -U postgres -d VCDB

Cool, this lets us know the Postgres DB service is running. The most important takeaway from Florian’s post is:

“When connecting, make sure that you use the psql binaries located in /opt/vmware/vpostgres/current/bin/ and not just the psql command. The reason is that VMware uses a more recent version than it is provided by the OS. In vSphere 7.0 for example, the OS binaries are at version 10.5 while the Postgres server is running 11.6”

Kool, I could use pg_dumpall, but I found it didn’t work (maybe that was the wrong version of vCenter binaries being mixed, not sure); either way, let’s try just the VCDB instance…

Interesting, lol. As you can see, I got an error about a version mismatch. I found this thread about it and, with the info from Florian’s post, had an idea, tried it out, and it actually worked. Mind… BLOWN.

rm /usr/bin/

OK, let’s take this file and place it on the newly deployed vCenter.

Even though the restore appeared to have worked, the vCenter instance booted and showed up like a new install. Worth a shot I guess, but it did not work.

Work Around Option #3

I’m not sure this is even a fair option, as it only works if you have existing backups of alternative types. In my case I use Veeam, and it’s saved my bacon I don’t know how many times.

Sure enough Veeam saved my bacon again. I ended up restoring a copy of my vCenter before the 7.0u3a, which happened to be on 7.0u2d.

I managed to add an SMB path without it erroring, and, unreal, I ran a file-based backup and it actually succeeded!!

Now I just run the deploy wizard and pick restore to build a new vCenter server from this backup.

Ahhh VMware… dammit you got me again!

alright fine… grabs yet another copy of vCenter…

and this time…

are you fucking kidding me? Mhmmm interesting… VCSA 7.0 restore issue – VMware Technology Network VMTN

ok… good to know…

From this… to this….

then Deploy again…

It stated it failed due to user auth. However, I was able to log in and verify it worked, but sadly it also instantly expired the license. I was hoping I could get another 60 days without creating a new vCenter, reconfiguring, and breaking my VM-IDs and CBT delta points for my backup software.

Even this link states what I’m trying to do is not possible… ugh, the struggle is real!

In the end I just started from scratch. Ugh.

When to VMware Snapshot

OK, real quick short post here. I figured I’d take a snapshot of my vCenter server (the reason will be in the next blog post). In this case I decided to snapshot the VM with its memory, as I figured it would be faster than bringing the VM back up from a shutdown state, which is what reverting a normal snapshot would do.

In most cases that would probably be a fair assumption, but boy was I wrong.

It took a short time to save the snapshot, but almost 15 min or more to bring the VM back to fully operational status with all memory intact… just check out these charts:

Here you can see it took maybe 5 min to save the memory state to disk; this would have been 0 minutes, since a normal snapshot doesn’t save memory to disk. Then you can see the slower, longer recovery time it took to read the same memory back from disk and put it back into RAM.

Of course, taking a solid guess that disk I/O is much slower than memory I/O, the bottleneck would have to be none other than the actual disk….

Yup, there are the same matching results: fast disk writes and slow disk reads…

And there’s the disk being 100% busy on the read requests. I’m not sure why the read performance on this drive was as bad as it was, but I have a feeling a regular boot would have been faster… I’ll update this post if I do an actual test.

Meh, pretty much the same amount of time… I think I need some super fast local storage… yet I’m so cheap I never buy it…. cheap bastard…

SharePoint and AD Groups

Story Time

How I got here: new site created, using the same IDGLA group nesting as the old sites, with cross-forest users. Access denied, while the old sites work. If you want the short answer, please see the Summary section. Otherwise read through and join me on a crazy carpet ride.

Reddit’s short and blunt answer states:

“No, because in ForestA, that user from ForestB is a Foreign Security Principal, which SharePoint does not support. You would have to add a user from ForestB to a security group in ForestB and then add that group to SharePoint.”

Which would make sense, if not for my old sites working in this exact manner.

To quench my ignorance, I decided to remove the cross-domain group from the SharePoint local domain group that was granted access to the site. At first it appeared as if the removal of the group did nothing to change the user’s access, until the next day, when it appeared to have worked and the user lost access to the site (test user in a test enviro). I was a bit confused here and decided to use the permission checker on the site’s permissions page to see what permissions were actually given to the user.

It did not show the local domain resource group which was supposed to be granting the user access. Instead the following was presented:

Limited Access          Given through "Style Resources Readers" group.

Rabbit Hole #1: Fundamentals

Of course, looking up this issue also surfaces my TechNet post that was answered by the amazing Trevor Seward, which covered a required dependency (at least for Kerberos). So that wasn’t the answer, and even in there we discussed the issue of nested groups, which in this case again followed the same IDGLA standard I used for the other sites. Something still smells off.

Digging a bit deeper, I found some blog posts from a high-tier MS Senior Support Engineer (one can only dream)… by the name of Josh Roark.

This revolved around “SharePoint: Cross-forest group memberships not reflected by Profile Import”, which brought me down some sad memory lanes about the pain and grind of SharePoint’s FIM/ADI and profile sync stuff. (*Future me: guess what, it has nothing to do with the problem.*) Which also, funnily enough, much like the Reddit post, says to eventually use groups from each domain directly at the resource (SharePoint page/library, etc.) instead of relying on nested groups cross-forest.

Which again still doesn’t fully explain why old sites with the same design are in fact working. I’m still confused: if the answers from all those sites were correct, removing a cross-forest nested group should not have affected the user’s permissions on the resource, but in my test it did.

So the only thing I can think of is that there’s some other odd magic going on here with the users’ profiles and which groups SharePoint thinks they are members of.

Following Josh’s post, let’s see what matches we have in our design…

Consider the following scenario:

  • You have an Active Directory Forest trust between your local forest and a remote forest.
  • You create a “domain local” type security group in Active Directory and add users from both the local forest and the remote trusted forest as members.
  • You configure SharePoint Profile Synchronization to use Active Directory Import and import users and groups from both forests.

Check, check… oh wait, in this case there might have only ever been one import done, which was for the user domain and not the resource domain, as at one point in time they were separate. I’m not sure if this plays a role here or not; it’s not exactly like I can talk to Josh directly. *Pipe dreams*

And the differences in our design/problem:

  • You create an Audience within the User Profile Service Application (UPA) using the “member of” option and choose your domain local group.
  • You compile the audience and find that the number of members it shows is less than you expect.
  • You click “View membership” on the audience and find that the compiled audience shows only members from your local forest.

Nope, nope, and nope, in my case it’s:

  • You grant access to a SharePoint resource via an AD-based domain local group.
  • You attempt to access the resource as a user from a cross-forest nested group.
  • You get Access Denied.

He also states the Following:

“! Important !
My example above covers a scenario involving Audiences, but this also impacts claims augmentation and OAuth scenarios.

For example, let’s say you give permission to a list or a site using an Active Directory security group with cross-forest membership. You can do that, it works. Those users from the trusted forest will be able to access the site. However, if you run a SharePoint 2013-style workflow against that list, it looks up the user and their group memberships within the User Profile Service Application (UPA). Since the UPA does not show the trusted forest users as members of the group, the claims augmentation function does not think the user belongs to that group and the Workflow fails with 401 – Unauthorized.”

Alright, so maybe this is still at play. HOWEVER, the answer is the same as the Reddit post’s, and that still doesn’t explain why the design is actually working for my existing sites. I must dig deeper, but in doing so you might just find what you are looking for as well. Which is funny, because he states you can do that and it does work; then why is it not working for the Reddit user, why is it not working for my new site, and why is it working for the old sites? The amount of conflicting information on this topic is a bit frustrating.

Let’s see what we can find out.

Why does this happen?

“It’s a limitation of the way that users and groups are imported into the User Profile Service Application and group membership is calculated.”

Mhmmm, OK. I do remember setting this up, and again it was only for the trusted forest, not for the local forest, as initially no users resided there.

“If you look at the group in Active Directory, you’ll notice that the members from your trusted forest are not actually user objects, instead they are foreign security principals.”

Here we go again, back to the FSPs. In my case, instead of user-based SIDs in the FSP OU, I had group SIDs; either way, let’s keep going.

How do we work around this?

“The only solution is to use two separate groups in your audiences and site permissions.

Use a group from the local domain to include your local forest users and use a group from the trusted domain to include your trusted forest users.”

Well, here we go again, same answer. But I already mentioned it is working on my old site, and even Josh himself initially stated: “For example, let’s say you give permission to a list or a site using an Active Directory security group with cross-forest membership. You can do that, it works.” So is this my issue or not? Either way, there was one last thing he provided after this statement…

Rabbit Hole #2: UPSA has nothing to do with Permissions

Group / Audience Troubleshooting tips:

“It can be a bit difficult to tell which groups the User Profile Service Application (UPA) thinks a certain user is a member of, or which users are members of a certain group.

You can use these SQL queries against the Profile database to get that info.

Note: These queries are written for SharePoint 2016 and above. For SharePoint 2013, you would need to drop the “upa.” part from in front of the table names: userprofile_full, usermemberships, and membergroup. You only need to supply your own group or user name.”

OK sweet, this is useful. However, I know not everyone who manages SharePoint will have direct access to the SQL server and its databases to do such lookups. The good news is I have some experience writing scripts, so you can run these queries from a SharePoint front end, as most FEs will have access to the DBs and tables they need. Thus no need for direct SQL access.

Let’s create a GitHub repo for these here. (I have to recreate this script as it got wiped in the test enviro rebuild.) *Note to self: don’t write code in a test enviro.*

So the first issue I had to overcome was knowing which DB the service was using, since it’s possible to have multiple service applications for similar services. Sure enough, I got lucky and found a TechNet post of someone asking the same question, and lo and behold it’s none other than Trevor Seward answering the question on his own website (I didn’t even know about this one). A little painful, but done. Unfortunately, since they could be named anything, I didn’t jump through more hoops to find and list the names of these UPAs, but I did code a line to inform users of my script about that issue and what to run to get the required name.

So with the UPA name in place, it’s scripted to locate the Profile DB, and run the same query against it.

OK, so after running my script, and validating it against the actual query that is run against the profile db, here’s what I found.

*Note* I simply entered a group name of %, which is SQL syntax for a wildcard (usually * elsewhere); the group name is just a variable fed into T-SQL’s “LIKE” statement.

Anyway, the total number of groups returned was only 6, and only half of them were actually involved with SharePoint at all. I know there are WAY more groups within that user domain… so… what gives here?

*Note* Josh mentions the “Default Group Problem“, which, after reading about it, I found I do not use that group for permissions access, and I do not believe it to be of concern or any root cause of my problem.

*Note 2* Somewhere I lost the reference link, but I found you can use a PowerShell cmdlet as follows (which, for unknown reasons, has to be run as the farm account):

$profileManager = New-Object Microsoft.Office.Server.UserProfiles.UserProfileManager(Get-SPServiceContext((Get-SPSite)[0]))
foreach($usrProfile in $profileManager.GetEnumerator()) { Write-Host $usrProfile.AccountName "|" $usrProfile.DisplayName; }

Well, that didn’t help me much, other than showing me there’s a pretty stale list of users… which brought me right back to Josh Roark…

SharePoint: The complete guide to user profile cleanup – Part 4 – 2016 – Random SharePoint Problems Explained… Intermittently (joshroark.com)

ughhhhhh, **** me…..

So first thing, I check the User Profile Service on the Central Admin page of SharePoint. I noticed it states 63 profiles.

(($profileManager.GetEnumerator()).AccountName).Count

which matches the command I ran as the farm account. HOWEVER, what I noticed was that not all accounts were from the user domain; several of them were from the resource domain even though no ADI existed for them. Ohhh, the questions just continue to mount.

At this point I came back to this the next day, and when I did I had to re-orient myself to where I was in this rabbit hole. That’s when I realized I was covering user profile import/syncing with AD and SharePoint, and I asked myself “why?” AFAIK, user profiles have nothing to do with permissions.

Let’s find out. It’s a test enviro; let’s wipe everything to do with the UPA and its services, all imports, and try to access the site…

Since I wasn't too keen on this process, I did a Google search, and sure enough found another usual SharePoint blogger here

and look at that, the same command I mentioned above that you need to run as the farm account for some odd reason… so I created another script.

Well… I still have access to the old SP sites via that test account, and still none on the new site I created utilizing the same cross-forest group structure… so this seems to support my assumption that UPA profiles have nothing to do with permission access…

One thing I did notice: once I attempted to access a site, the user showed up in the UPA user profile DB without my having run a sync or import task.

Well, since we are this deep… let's delete the UPA service altogether and see what happens. Under Central Admin I navigated to Manage Service Applications, clicked the UPA service, and hit Delete on the top ribbon; there was an option to delete all data associated with the service application, yes please… and…

Everything still works exactly as it did before, proving this has nothing to do with permissions.

On a side note, I did notice nothing changed in terms of my user details in the upper-right corner, and while I have gone down that other rabbit hole, I'm going to avoid it here. Lucky for me, it seems in my wake someone else by the name of Mohammad Tahir has gone down this rabbit hole for me, and has even more delightfully blogged about the entire thing himself, here. I really suggest you read it for a full understanding.

In short, that information is in the "User Information List" (UIL), which is different from the data known by the UPSA, the service I just destroyed. However, I will share the part where they link:

“The people picker does not extract information from the User Profile Service Application (UPSA). The UPSA syncs information from the AD and updates the UIL but it does not delete the user from it.”

Again, in short: I basically broke the user information shown on sites in the case where someone changes their name and the change is only made at the authing source (Microsoft AD in this case). That change would not be reflected in SharePoint. At least in theory.

If you made it this far, the above was nothing more than a waste of time, other than finding out the UPSA has no bearing on permissions granted via AD groups. But if you need to clean up the user information shown on SharePoint sites, then you probably learnt a lot; that's just not what this post is about.

So, all in all, this is probably why there are resources online confusing the two as being connected, when it turns out… they are not.

Rabbit Hole #3: The True Problem – The Security Token Service, or is it?

So I decided to Google a bit more and I found this thread with the same question: permissions – Users added to AD group not granted access in SharePoint – SharePoint Stack Exchange

and sure enough, what I’ve pretty much realized myself from testing so far appears to hold true on the provided answer:

“Late answer but, The User Profile is not responsible in this case.

SharePoint recognizes AD security groups and attaching permissions to these groups will cause the permissions to be granted to the User.

Unfortunately, due to SharePoint caching the user’s memberships on login, changes made to a security group are identified only after the cache has expired, possibly taking as long as 10 hours (by default) for it to happen.

This causes the unexpected behavior of adding a user to a Group and the user still being shown the access denied or lack of interface feedback related to the new permissions he should have received.

The user not having his tokens cached prior to being added to the group would cause him to receive access immediately.”

And that's exactly the symptom I usually get: apply an AD group permission, and it takes effect only after some time (I assumed 24 hours, since I'd test the next day), though this answer states "10 hours". My question now would be: what cache is he talking about? Kerberos? Web browser?

"SharePoint caching the user's memberships on login"? What logon: computer/Windows logon, or SharePoint's, if you don't use SSO?

OK, I'm so confused now. I did the same thing in my test enviro and it seemed to work almost instantly; I did the same thing in production and it's not applying. God, I hate SharePoint…

I attempted an Incognito window, and that didn't work… so not browser cache…

Logged into a workstation fresh: nope, so not the Kerberos cache it seems. So what cache is he referring to?
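To rule out the Kerberos side more directly (rather than relying on a fresh logon), the client's ticket cache can be purged by hand; a quick sketch, run on the client workstation:

```powershell
# Purge all cached Kerberos tickets for the current logon session (Windows).
# New tickets, with current group membership, are requested on the next access.
klist purge
```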

So I decided to tighten my Google query, and I found plenty of similar issues stemming back a LONG time. security – Why are user permissions set in AD not updated immediately to SharePoint? – SharePoint Stack Exchange

In there, there's conflicting information: someone again mentions the UPSA, which we've discovered ourselves to have no impact, and even that answer was later adjusted to indicate it may be false. The more common answer appears to match the "10 hours" cache mentioned above, which turns out to be… *drum roll* … the "Security Token Service".

Funny enough, when I went to Google for sources on the SharePoint STS, I got a bunch of troubleshooting and error-related articles *Google tryin' to help me?*. Either way, sure enough I found an article by none other than our new SharePoint hero, Josh Roark (sorry Trevor); to my dismay it didn't cover my issue, or how to clear or reset its cache… OK, let's keep looking…

A random GitHub page with some insights into the design ideology… useless, nothing about the cache…

Found someone who posted a bunch of links around troubleshooting the STS but didn't write anything themselves; all I found was a link to MS's blog post, which literally copied and pasted Josh's work. I guess with Josh being an MS employee, MS can take his work as their own without issue? Anyway, let's keep looking…

Funny, I finally found someone asking the question, and for the exact same reason I wrote this whole blog post. Also funny: the topics I find this deep down have super-low like counts, because of just how few people get down into this nitty-gritty. And here's the answer. The third funny thing is their question wasn't how to do it, but what the effects of doing it are.

“When using AD groups, and adding or removing a user to that group, the permissions may not update as intended, given the default 10 hour life of a token.

I’ve read of an unofficial recommendation to shorten the token lifetime, but others have cautioned it can have adverse affect. With that, I’d rather leave it alone.

Is it safe to purge on demand?

Clear-SPDistributedCacheItem –ContainerType DistributedLogonTokenCache 

…or does it too have adverse affect?”

Answer of “It will degrade performance, but it is otherwise safe.”

OK, finally, let's try this out.

I removed the group the test user was a member of in AD, the group which granted it contribute rights on the site.

After removing the group, I replicated the change to all AD servers.

I checked the user's permissions via the SharePoint permission checker: it still showed the user had contribute rights.

Ran the cmdlet mentioned above on the only SP FE server that exists, with all services running on it, including the STS.

Refreshed the permission page for SharePoint and checked the user's permissions… C'mon! It still says contribute rights; navigating the page as the user, yup… what gives?!?!?!!

Seriously, that was supposed to have solved it. What is it?!?!

Even going deep into the thread, he doesn't respond as to whether it worked or not, just what others do; and as I read it, yes, lowering the cache threshold is often suggested. For the same reason: I want permissions to apply near-instantly instead of having this stupid 10/24-hour wait period between permission changes.

If manually clearing the cache doesn't work, what is it? And again they bring up the misconception of the UPS/UPSA.

OMG and sure enough a TechNet post with the EXACT same problem, trying to do the exact same thing, and having the EXACT same Issue!!!

Wow…

Clear-SPDistributedCacheItem –ContainerType DistributedLogonTokenCache

didn’t work….

cmdlet above + iisreset

didn’t work

Reboot FE Server

Didn’t work

Another post with the same conclusion to bump the cache timeouts.

OK, so here's the article that the "MS tech" who answered the TechNet question referenced. I'll give credit where it's due: providing an answer and sourcing it is nice. Active Directory Security Groups and SharePoint Claims Based Authentication | Viorel Iftode

OK, Mr. Iftode, what ya got for me…

The problem

The tokens have a lifetime (by default 10 hours). More than that, SharePoint by default will cache the AD security group membership details for 24 hours. That means, once the SharePoint will get the details for a security group, if the AD security group will change, SharePoint will still use the cache.

So the same 10 hour / 24 hour problem we’ve been facing this whole time, regardless of cross-forest, or single forest design.

Solution

When your access in SharePoint rely on the AD security groups you have to adjust the caching mechanism for the tokens and you have to adjust it properly everywhere (SharePoint and STS).

Add-PSSnapin Microsoft.SharePoint.PowerShell;

$CS = [Microsoft.SharePoint.Administration.SPWebService]::ContentService;
#TokenTimeout value before
$CS.TokenTimeout;
$CS.TokenTimeout = (New-TimeSpan -minutes 2);
#TokenTimeout value after
$CS.TokenTimeout;
$CS.update();

$STSC = Get-SPSecurityTokenServiceConfig
#WindowsTokenLifetime value before
$STSC.WindowsTokenLifetime;
$STSC.WindowsTokenLifetime = (New-TimeSpan -minutes 2);
#WindowsTokenLifetime value after
$STSC.WindowsTokenLifetime;
#FormsTokenLifetime value before
$STSC.FormsTokenLifetime;
$STSC.FormsTokenLifetime = (New-TimeSpan -minutes 2);
#FormsTokenLifetime value after
$STSC.FormsTokenLifetime;
#LogonTokenCacheExpirationWindow value before
$STSC.LogonTokenCacheExpirationWindow;
#DO NOT SET LogonTokenCacheExpirationWindow LARGER THAN WindowsTokenLifetime
$STSC.LogonTokenCacheExpirationWindow = (New-TimeSpan -minutes 1);
#LogonTokenCacheExpirationWindow value after
$STSC.LogonTokenCacheExpirationWindow;
$STSC.Update();
IISRESET

Well, the exact same answer the MS tech provided, with no simple solution of just clearing a cache on the STS or restarting the STS; none of that seems to work. Its cache is insanely persistent, apparently even across reboots.

I'll try this out in the test enviro and see what it does. I hope it doesn't break my site like it did for the guy who asked the question on TechNet… Here goes…

So, the same thing happened to my sites; I'm not sure if it's for the same reason…

Rabbit Hole #4: Publishing Sites

So, just like the OP from that TechNet post I shared above, I set the timeouts back to default and the site started working. But that doesn't answer the OP's question, and it was left at a dead end there…
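For reference, reverting the values from that earlier script back to their defaults looks roughly like this. The defaults below (24-hour TokenTimeout, 10-hour token lifetimes, 10-minute expiration window) are the commonly documented SharePoint defaults; verify against the "before" values your own farm printed:

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Restore SharePoint's AD group membership cache to the 24-hour default
$CS = [Microsoft.SharePoint.Administration.SPWebService]::ContentService
$CS.TokenTimeout = (New-TimeSpan -Hours 24)
$CS.Update()

# Restore the STS token lifetimes and cache expiration window to defaults
$STSC = Get-SPSecurityTokenServiceConfig
$STSC.WindowsTokenLifetime = (New-TimeSpan -Hours 10)
$STSC.FormsTokenLifetime = (New-TimeSpan -Hours 10)
$STSC.LogonTokenCacheExpirationWindow = (New-TimeSpan -Minutes 10)
$STSC.Update()

iisreset
```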

Also much like the OP, these sites too had the "publishing feature" enabled.

I decided to look at the sources he shared to see if I could find anything else in more detail.

On the first link the OP shared the writer made the following statement:

“Publishing pages use the Security Token Service to validate pages. If the validation fails the page doesn’t load. Team sites without Publishing enabled are OK as they don’t do this validation.”

Now, I can't find any white-paper-type details from MS on why this might be the case, but let's just take this bit as true.

The poster also made this statement just before that one:

“Initially we had installed the farm with United States timezone, when a change was made to use New Zealand time, the configuration didn’t fully update on all servers and the Security Token Service was responding with US Date format making things very unhappy.”

Here's a thing that happened: the certs expired, but I also got an alert in my test enviro stating "My clock was ahead". At first I thought this was due to the expired certs, so I went and updated the certs, and also changed the timeout values back to default, which made everything work again. However, now that this info has been brought to my attention, I'm wondering if there's something else at play here.

Especially since the second shared resource makes a similar suggestion…

Rabbit Hole #5: SharePoint Sites and the Time Zones

OK, so here's the first solution from that alternative post, which had the same issue of sites not working after lowering the STS timeout threshold:

“Check if time zone for each web application in General Settings is same as your server time zone. Update time zone if nothing selected, run IISRESET and check if the issue is resolved.”

and the second solution is the one we already showed worked, and was the answer provided back to the TechNet post by the OP: set the thresholds back to default, which simply leaves you with the same permission issue of waiting 10/24 hours for new permissions to apply when they're changed in AD and not managed in SharePoint directly.

Now, unlike the OP… I'm going to take a quick peek to see if my time zones are different on each site vs the FE's own time zone…

Here we gooo, ugghhhh

System Time Zone

w32tm /tz

for me it was CST, gross, with CDT (that terrible thing we call daylight savings time); I really hope that doesn't come into play…

SharePoint Time Zone

Well, I read up a bit on this from yet another SharePoint expert, Gregory Zelfond of SharePoint Maven.

Long story short: there's a Regional Time Zone (admin-based, per site) and a Personal Time Zone. I'm not sure if messing with the Personal Time Zone matters, but I've been down enough rabbit holes; I hope I can ignore it for now.

OK, so, quick recap: I checked the site's regional settings and the time zone matched the host machine, at least for CST. I couldn't see any setting in the SharePoint site time zone settings for daylight savings time, so for all I know that could also be a contributing factor here. But for now we'll just say it matches.

I also couldn't find the "About me" option under the top-right profile area, so I couldn't directly check the "Personal Time Zone" that way. I was, however, able to go into the User Profile Service Application, under "Manage User Profiles", to verify there was no time zone set for the account I was testing with; I can again only assume that means it defaults to the site's time zone.

If so, then there's nothing wrong with any of the sites' or servers' time zone settings.

So, checking my logs, I see the same out-of-range exceptions in the ULS logs:

SharePoint Foundation Runtime tkau Unexpected System.ArgumentOutOfRangeException: The added or subtracted value results in an un-representable DateTime. Parameter name: value at System.DateTime.AddTicks(Int64 value) at Microsoft.SharePoint.Publishing.CacheManager.HasTimedOut() at Microsoft.SharePoint.Publishing.CacheManager.GetManager(SPSite site, Boolean useContextSite, Boolean allowContextSiteOptimization, Boolean refreshIfNoContext) at Microsoft.SharePoint.Publishing.TemplateRedirectionPage.ComputeRedirectionVirtualPath(TemplateRedirectionPage basePage) at Microsoft.SharePoint.Publishing.Internal.CmsVirtualPathProvider.CombineVirtualPaths(String basePath, String relativePath) at System.Web.Hosting.VirtualPathProvider.CombineVirtualPaths(VirtualPath basePath, VirtualPath relativePath) at System.Web.UI.Dep..

OK… soooo….

Summary

We now know the following:

    1. The root issue is with the Security Token Service (STS), because:
      – The token lifetime is 10 hours ((Get-SPSecurityTokenServiceConfig).WindowsTokenLifetime)
      – SharePoint caches the AD security group details for 24 hours (([Microsoft.SharePoint.Administration.SPWebService]::ContentService).TokenTimeout)
    2. The only command we found to forcefully clear the STS cache didn't work.
      – Clear-SPDistributedCacheItem –ContainerType DistributedLogonTokenCache
    3. The only other alternative suggestion was to shorten the STS and SharePoint cache settings, which breaks SharePoint sites if they use the Publishing feature.
      – No real answer as to why.
      – Maybe due to time zones.
      – Most likely due to the shortened cache times.
    4. The User Profile Service HAS NO BEARING on site permissions.

So overall, it seems if you:

A) use AD groups to manage SharePoint permissions, and

B) use the Publishing feature,

you literally have NO OPTIONS other than to wait up to 24 hours for permissions to apply to SharePoint resources when access is managed strictly via Active Directory groups.

Well, after all those rabbit holes, I'm still left with a shitty taste in my mouth. Thanks, MS, for making a system with an inherently stupid permission-application design and a ridiculous caveat. I honestly can't thank you enough, Microsoft.

*Update* I have a plan: run the cache-clear PowerShell cmdlet (the one mentioned above, linked to a TechNet post stating it doesn't work), then recycle the STS app pool, and report my results. Fingers crossed…
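A rough sketch of that plan in PowerShell; "SecurityTokenServiceApplicationPool" is the default name of the STS application pool in IIS on a SharePoint box, so verify yours in IIS Manager before running:

```powershell
Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

# Step 1: clear the distributed logon token cache across the cache cluster
Clear-SPDistributedCacheItem -ContainerType DistributedLogonTokenCache

# Step 2: recycle the Security Token Service app pool to drop in-process caches
Import-Module WebAdministration
Restart-WebAppPool -Name "SecurityTokenServiceApplicationPool"
```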

Changing vCenter Hostname


Why?!?! Cause I gotta!

Source: Changing your vCenter Server’s FQDN – VMware vSphere Blog

PreReqs, AKA Checklist

  • Backup all vCenter Servers that are in the SSO Domain before changing the FQDN of the vCenter Server(s)
  • Supports Enhanced Linked Mode (ELM)
  • Changing the FQDN is only supported for embedded vCenter Server nodes
  • Products which are registered with vCenter Server will first need to be unregistered prior to an FQDN change. Once the FQDN change is complete they can then be reregistered.
  • vCenter HA (VCHA) should be destroyed prior to an FQDN change and reconfigured after changes
  • All custom certificates will need to be regenerated
  • Hybrid Linked Mode with Cloud vCenter Server must be recreated
  • vCenter Server that has been renamed will need to be rejoined back to Active Directory
  • Make sure that the new FQDN/Hostname is resolvable to the provided IP address (DNS A records)

NOTE: If the vCenter Server was deployed using the IP as PNID/FQDN, then the following should also be considered:

  • The PNID change workflow cannot be used to change the IP address of vCenter Server
  • The PNID change workflow cannot be used to change the FQDN of vCenter Server

In this scenario, use the vCenter Server Appliance Management Interface (VAMI) to update hostnames or IP changes directly. 

The main thing I was expecting was the certificate issue. In my home lab, I removed the SSO domain concerns before this change (just using vsphere.local), no ELM, already using embedded (all-in-one), no VCHA, no Hybrid… oh yeah… not sure if you "leave an SSO domain" before joining back to AD…

My Only Pre-Req

I went into DNS and pre-created an A record for the new server hostname: vcenter.zewwy.ca
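To sanity-check the records before kicking off the rename (the hostname below is from my lab and the IP is a placeholder; substitute your own), something like:

```powershell
# Confirm the forward (A) record resolves before starting the rename
Resolve-DnsName vcenter.zewwy.ca -Type A

# Optionally check the reverse (PTR) record for the vCenter IP
# (192.168.1.10 is a hypothetical address; use your vCenter's actual IP)
Resolve-DnsName 192.168.1.10 -Type PTR
```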

Steps

Basically, log into VAMI and change the name.

Then

and and…. well WTF…

No matter what I did it was greyed out… I thought maybe the untrusted cert might be an issue, so I tried from a machine with a fully trusted chain: same issue!

Like…. why… why is Next greyed out? It's like whatever button-validation code is written for it is not being triggered. Is this a browser version issue? I couldn't find anything online about anyone having this issue…. Why? Because I was right: it was the input validation…

Honestly, this is one of those MASSIVE facepalm moments in my life. I only realized after the fact that the username field was NOT auto-filled; the greyed-out text was just a placeholder suggestion… Fill in both fields and Next is un-greyed…

Step 4: check the checkbox to acknowledge the warning, and away… she goes!

At which point I clicked Redirect Now (both web addresses were still available, and it didn't seem to matter which you came from; the cert was untrusted either way, since the CA was not in my trusted CA store).

5 minutes later….

I tell ya, nothing's more annoying than a spinning circle and a "don't refresh" warning while the status bar simply does not move… sure got some conflicting messages here…

*Starts to sweat*…

after about 10 minutes time…

More Certificate Fun!

Alright, so after this, some quick takeaways… when I went to check the site, it was "untrusted", but not for the reason I had thought. I assumed it would be the same issue as the source blog (the hostname on the cert), but that was not the case; instead, the cert chain simply seemed to be missing, and the issuer could not be verified:

as well as:

So what to do about this… You can download the CA cert from vcenter/certs/download.zip (for some reason I had to use IE). Then install the CA cert. (I noticed even after I did this I still had a cert warning/error, but by the next day, maybe from cache clearing or an update, it reported green in the web browser.)

Now when I logged in, I got the ol' cert alert in the vCenter UI.

First thing to try is removing old CAs.

Which I did, following this VMware KB

I simply followed my other post about this, and the alert cleared and reset to green. (Still good days later.)

Backup Solutions

Don't forget to change the server in your backup software; I had to do this in Veeam.

These were my results…

Which, go figure, errored out…

So right click, go to properties of the object… Next, next…

Accept the server's new certificate.

Now you'd figure all was well, but when I went to create a new backup job and attempted to expand the vCenter server in Veeam, it just hung there…

I ended up rebooting the server and waiting for all the Veeam services to start. I reopened Veeam, went to Inventory, and clicked the vCenter server; it took a second and then showed all the hosts and VMs. I rescanned it to be safe and got a result a bit different from the applied-settings confirmation above. I think maybe I forgot to rescan the host after applying the new settings, assuming the properties-change wizard would have done that itself.

which, lucky for me, now worked, and I was able to select a VM in the Veeam backup wizard and successfully back it up.

Final Caveats

Like, what the heck, everywhere else it's changed except at the shell. Let's see if we can change this.

Well that was easy enough, no reboot required. 🙂

I also found the local hosts file doesn't update either; the file states it's managed by VAMI, so I may have to look there for potential solutions:

I noticed this since I had to do a workaround for something else, and sure enough caught it. I'll change it manually with vi for now and see what changes after a reboot.

Summary

Overall, it was literally quick n' easy.

  1. Verify DNS records exist.
  2. Use VAMI to edit the hostname (under the network management settings), click apply, and wait.
  3. Manually clear out the old certs that were created under the old hostname.
  4. Reconfigure your backup solution, which is vendor-specific (I provided steps for Veeam, as that's the backup vendor I like to use).

Overall, the task seemed to go pretty smoothly. I'll follow up with any other issues I might come across in the future. Cheers.