Configure Certificate-Based Administrator Authentication on a Palo Alto Networks Firewall

Source

As a “more secure” alternative to password-based authentication to the firewall web interface, you can configure certificate-based authentication for administrator accounts that are local to the firewall. Certificate-based authentication involves the exchange and verification of a digital signature instead of a password.
Configuring certificate-based authentication for any administrator disables the username/password logins for all administrators on the firewall; administrators thereafter require the certificate to log in.
To avoid any issues, I took a snapshot of the PA VM first. That knocked out my internet for roughly 30 seconds.

Step 1) Generate a certificate authority (CA) certificate on the firewall.
You will use this CA certificate to sign the client certificate of each administrator.
Create a Self-Signed Root CA Certificate.
Alternatively, Import a Certificate and Private Key from your enterprise CA or a third-party CA.

I do have a PKI I can use, but no specific key pair suited to this purpose, so for ease of testing I’ll create a local CA cert on the PAN FW.

Step 2) Configure a certificate profile for securing access to the web interface.
Configure a Certificate Profile.
Set the Username Field to Subject.
In the CA Certificates section, Add the CA Certificate you just created or imported.

Now for ease of use and testing I’m not defining CRL or OCSP.

Step 3) Configure the firewall to use the certificate profile for authenticating administrators.
Select Device -> Setup -> Management and edit the Authentication Settings.
Select the Certificate Profile you created for authenticating administrators and click OK.

Step 4) Configure the administrator accounts to use client certificate authentication.
For each administrator who will access the firewall web interface, Configure a Firewall Administrator Account and select Use only client certificate authentication.
If you have already deployed client certificates that your enterprise CA generated, skip to Step 8. Otherwise, go to Step 5.

Step 5) Generate a client certificate for each administrator.
Generate a Certificate. In the Signed By drop-down, select a self-signed root CA certificate.

Step 6) Export the client certificate.
Export a Certificate and Private Key. (I saved it as PKCS12, with a password.)
Commit your changes. The firewall restarts and terminates your login session. Thereafter, administrators can access the web interface only from client systems that have the client certificate you generated.

File was in my downloads folder.

Step 7) Import the client certificate into the client system of each administrator who will access the web interface.

Refer to your web browser documentation. I am using Windows, so I’m assuming the browser (Edge) will use the Windows certificate store. I installed the certificate into my user cert store by simply double-clicking the file and providing the password in the import wizard prompt, then checked my local user cert store.
Time to commit and see what happens…
As soon as I committed, I got a prompt for the cert:
If I open a new InPrivate window and don’t offer the certificate I get blocked:
If I provide the certificate the usual FBA login page loads.
So now any access to the firewall requires this key and known login creds. Though the notice stated it “disables the username/password logins for all administrators on the firewall”, my testing showed that not to be true; it simply locks down access to the FBA page, requiring the use of the created certificate.

Using Internal PKI

Let’s try to set this up again, but instead of self-signed, let’s use an internal PKI, in this case a Windows PKI using Windows-based CAs.

Pre-reqs: it is assumed you already have a Windows domain, PKI and CA configured. If you require assistance, please see my blog post on how to set up such an environment from scratch here: Setup Offline Root CA (Part 1) – Zewwy’s Info Tech Talks

This post also assumes you have a Palo Alto Networks firewall in which you want to secure the mgmt web interface with increased authentication mechanisms.

Step 1) Import all certificates into the PA firewall so it shows a valid stack:

Step 2) Configure a certificate profile for securing access to the web interface.
Configure a Certificate Profile.
Set the Username Field to Subject.
In the CA Certificates section, Add the CA Certificate you just created or imported.
Step 3) Generate a client certificate for each administrator.
Generate a Certificate. In the Signed By drop-down, select a CSR.
Now I’m not 100% certain how this all works, so I set the common name and SAN to the same value as the local admin account I wanted to secure.
Then export the CSR, sign it with your internal CA server, and import it back into the PA firewall. In my case I decided (for testing purposes and simply due to pure ignorance) to create the certificate using the Web Server template, even though I know this is going to be a certificate used for user authentication. *shrug* The final result should look like this:
Step 4) Configure the firewall to use the certificate profile for authenticating administrators. Pick the Cert profile created in Step 2.
Select Device -> Setup -> Management and edit the Authentication Settings.
Select the Certificate Profile you created for authenticating administrators and click OK. (At this point I recommend to not commit until at least the certificate created earlier is exported.)
Step 5) Configure the administrator accounts to use client certificate authentication.
For each administrator who will access the firewall web interface, Configure a Firewall Administrator Account and select Use only client certificate authentication.
This is where things start to feel weird in this whole process…. It seems that as soon as you check this checkbox, the password fields disappear:
Before:
After:
Which makes it seem like it just changes the account from password-based to certificate-based, and not 2FA as expected. On top of that, why can’t I specify which certificate to use? Does this mean any certificate that exists within the PA store is good enough? I guess I’ll have to test to see if that’s the case anyway…
Step 6) Export the client certificate.
Export a Certificate and Private Key. (I saved it as PKCS12, with a password.)
Step 7) Commit, and watch it behave like before, where the web login won’t even show an FBA page until you present a certificate. Which again suggests the firewall doesn’t associate certain certificates with certain users; instead it seems to lock down the FBA page to require ANY certificate (with key?) that is configured or signed by the CAs specified in the Certificate Profile.
Which seems like such a dumb design. It would be way better if, when you check off the certificate-based option for a user, you had to pick which cert; then, instead of blocking the FBA page as a whole, the firewall would check/ask for that specific certificate once that user’s credentials are entered into the FBA page.
I seemed to be getting stuck at 400 Bad Request even with the certificate in my personal store. My only guess is that it’s because I picked the Web Server template when I signed the certificate; as you can see, client auth is missing from the usage field:
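For what it’s worth, this is something you can check before committing instead of finding out via a 400. Below is a minimal sketch in Python (not part of the original steps) that looks for client authentication in the Extended Key Usage of a certificate. It assumes you have dumped the certificate as text with something like openssl x509 -noout -text; the function names and the two sample dumps are made up for illustration.

```python
# Sketch only: parses the text printed by `openssl x509 -noout -text`.
# The sample dumps below are trimmed, made-up examples, not real certificates.

CLIENT_AUTH = "TLS Web Client Authentication"


def extended_key_usages(openssl_text: str) -> list[str]:
    """Return the EKU names listed under 'X509v3 Extended Key Usage'."""
    lines = openssl_text.splitlines()
    for i, line in enumerate(lines):
        if "X509v3 Extended Key Usage" in line and i + 1 < len(lines):
            # OpenSSL prints the usages comma-separated on the following line.
            return [u.strip() for u in lines[i + 1].split(",") if u.strip()]
    return []  # no EKU extension found in the dump


def allows_client_auth(openssl_text: str) -> bool:
    """True if the dump explicitly lists client authentication."""
    return CLIENT_AUTH in extended_key_usages(openssl_text)


WEB_SERVER_DUMP = """\
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage:
                TLS Web Server Authentication
"""

USER_DUMP = """\
        X509v3 extensions:
            X509v3 Extended Key Usage:
                TLS Web Client Authentication, E-mail Protection
"""

if __name__ == "__main__":
    print(allows_client_auth(WEB_SERVER_DUMP))  # Web Server template style cert
    print(allows_client_auth(USER_DUMP))        # client-auth capable cert
```

A dump with no EKU extension at all returns an empty list here; whether the firewall accepts such a certificate is not something I tested.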
What if I didn’t make a snapshot (or you may be running a physical firewall); how do I fix this? Well… access the console directly (for a VM, use the hypervisor console; for a physical box, use the console port), or if you configured SSH access, SSH in and revert the config. I figured “load config last-saved” would have worked, but it didn’t; I guess last-saved is the running config, so the command feels useless to me. I could be missing something on that. Instead I had to pick a config from a couple of months ago. The first time around it didn’t state anything about restarting the web mgmt service, but when picking the older config it does:
This must be because of the Cert Profile binding option in the auth section of the mgmt settings, further validating my assumptions about the design choice.
Now I was able to log back in to the MGMT web interface, load the config with all my work on it (so I didn’t have to redo all the steps above). Let’s simply recreate the “user” cert but using a client template, and see how that goes…
1) Delete the old cert (check)
2) Create new cert (check)
This time, no additional fields (not even a SAN):
Signed using User template:
Import it into the firewall… (Check) (No clue where that TLSv1.3 cert came from…)
Export it from the firewall… (check)
Import into client machine, user’s personal store… (check) (Interesting: it shows as assigned to the admin account that requested the certificate)
Double check the Mgmt auth settings (check), so only main difference is the client cert now and… error 400… ****
I reverted again, then loaded the config above again, this time changing the cert profile selected in the mgmt auth section to the self-signed one that worked in my original post about this stuff. Oddly enough, after the commit I couldn’t get the web interface to load in my regular web browser (400 error), but in an incognito/InPrivate window I got the prompt for the admin cert and then the FBA page.
So, testing one last time to get the internal PKI cert to work, I decided to make one last change. When I made the certificate I set the subject name to that of the account (in this case I had an account on the PA firewall named akamin). I also decided to use the template I created for making user certs for GlobalProtect, which was set up for client auth. The final results on the PA looked like this:
and exporting, and importing into client machine cert store looked like this:
As you can see, this looks much cleaner than all my previous attempts, and shows everything assigned to the user we want to log in as. The only other change was that I created another Certificate Profile, but did not check off any of the Blocked options. Once I committed this change I got a 400 on my regular web browser, but opening an InPrivate window I got:
Finally! Picking it we can see it auto populated the username:
However, don’t be fooled by this: I was easily able to change the name in the field and log in as another user. In this case I changed the name to another local admin, entered the password of that user, and logged in just fine. Further validating that all it’s doing is restricting access to the FBA page to anyone who has any cert signed by the CAs listed in the Certificate Profile.
Now I want to figure out the regular browser 400 error problem so I don’t have to open an in-private window each time. Usually this means just cleaning the cache, but when picking what to clear I picked last hour and everything but browsing history, that didn’t work. Reboot did work.
The next task is to see if I can load this certificate onto a YubiKey and use its ability to act as a certificate key store.

Yubikey

Source

First annoying thing this source is missing is that you need the YubiKey MiniDriver installed in order to complete this task.

The next thing that burnt me was that when I went to import the certificate it kept saying my PIN was wrong. That first led to my PIN becoming blocked, which led me to reading all this stuff. Since my PIN was blocked after 3 attempts, I nearly locked the PUK too, as my first two entries were wrong; lucky me, it was set to the default, and I managed to unblock the PIN. I then set a new PUK and PIN using ykman, which was available in Fedora’s native repo, so I did this from a Fedora live session.

What I don’t get is, is there a different PIN for WebAuthn vs this one for certificates? It seems like it, because even when the certificate PIN was blocked my WebAuthn was still working.

Back to my Windows machine

Plug in Yubikey.. then:
certutil -csp "Microsoft Base Smart Card Crypto Provider" -importpfx C:\Path\to\your.pfx

When prompted, enter the PIN. If you have not set a PIN, the default value is 123456.

This time it worked, yay…

OK, now how to test this…. I’ll try to access the mgmt web interface from a random computer, one that’s not the machine I tested above, which already has the key installed in the user’s Windows cert store. Mhmmm, what do I have… how about my old-ass Acer netbook running Windows 7 32-bit? There’s no way that’ll work… would it…

I try to access the web console and sure enough, 400 Bad Request… plug in the YubiKey…

There’s no way….

No freaking way…. and try to access the web mgmt…

No way! it actually worked, that’s unbelievable!

Enter the PIN I configured above (not the WebAuthn PIN)…

crazy… I can’t believe that works… so yeah this is a feasible solution, but it’s still not as good as WebAuthN, which I hope will be supported soon.

Weird… I went to access the web interface from my machine that has the cert in my cert store, but now it seems to want the yubikey even though the cert is in my user store, I tried an in-private window but same problem… do I have to reboot my machine again? Fuck no, that didn’t work either… like WTF.

Tried another browser, Chrome, SAME THING! It’s like when running the command to import a certificate into a YubiKey it overrides the one on the local store and always asks for the YubiKey when picking that certificate. Which doesn’t make any sense…

I grabbed the cert, imported it into the user store on another machine, and bam, it works as intended… it just seems that on the machine where you import it into a YubiKey, it then always wants the YubiKey, regardless of the certificate being in the user’s cert store… which still doesn’t make sense…

OK, so I deleted the certificate from my user cert store, re-imported it, open an in-private window and now it accepted the cert without asking for the YubiKey. I still don’t understand what’s going on here…. but that fixed it….

Things I still don’t understand though… if I set this user option to require a certificate and the password fields disappear in the PAN web mgmt interface, then why is it still asking for a password for the user? Why is a certificate required before login if there’s a toggle for certificate-based login in the user’s settings? Wouldn’t it make more sense for the Web UI to stay available, and once you enter your creds, PAN-OS looks up whether that checkbox is enabled for them and then asks for the certificate? And you’d have to configure which certificate in the user settings, so that it actually ties a specific user to a specific cert and you don’t end up with any cert being good for any admin? So many questions…. so few answers…

Wait a second, I can’t remember this user’s password, and I can’t log in. Ah nuts, I made a typo in the cert.. FFS man…

What makes it even dumber is that it states “No auth profile found”, but what it really means is that the user doesn’t exist. Now instead of mucking around creating a new cert, import/export/import and all that jazz, let’s create a user akamin, check off cert-based (which means no password set) and see what happens…

Oh interesting….

Now that the user was properly defined as the common name when the cert was created, you can no longer specify a user account; it forces the one in the certificate. But if this is the case, how does an admin log in who isn’t defined to use certificate-based login? This does explain which user is supposed to use which cert without having to define it in the user’s settings. However, it doesn’t explain the forced certificate requirement before the FBA page, or how admins not configured for certificate-based login can even log in now.

¯\_(ツ)_/¯

I lost my keys… what do I do?

If you have the default admin account and left it as normal (no cert requirement), you can sign in via SSH or direct console and remove the config from the auth settings:

configure
delete deviceconfig system certificate-profile
commit

That should be all it takes to get back to normal web login, but you’d still need to have accounts configured without the certificate checkbox in those users’ settings.

It seems like this can work as long as you leave the default admin account configured for regular auth (username and password). Maybe you can still make it work as long as there are a lot of certs and redundancy. I haven’t exactly tested that out.

OK so above I simply reverted cause it was the only change I had. This only works in two conditions:

  1. You know exactly when the change was implemented.
  2. You have the latest running config saved.

Thinking about this, I think the latest running config will always be there, so you just have to know when the change was implemented: revert to that, then load the last running config, and turn off the cert profile in the mgmt auth settings area.

But what if you don’t know when that change was made? Well, let’s see if we can make the change via the CLI…

So I found the location on where to set it….

set deviceconfig system certificate-profile

I can’t seem to set it to null… I found a similar question here, which only further validated my concern above about other admins who aren’t configured for cert login…

“However at the very beginning of the Web Page I can read:

“Configuring certificate-based authentication for any administrator disables the username/password logins for all administrators on the firewall; administrators thereafter require the certificate to log in.””

Unfortunately, the linked source is dead, but I’m sure it’s still in play. With the thread having no real answer to the question, it seems my only option is the steps I did before… revert to a config before it was implemented, load the old running config, and within the web UI remove the Cert profile, which totally fucking sucks ass…. However, as we discovered, if we configure a cert with a common name that isn’t a user on the PAN, then we can use that to access the FBA page with accounts that are not checked off with the setting in the users’ setting. I wonder if this is something that wasn’t intended and I discovered it simply by chance which kinda shows the poor implementation design here.

I think I covered everything I can about this topic here… Now since this account I created was a superuser (read only) and now that the user exists… I’ll revert…. or wait….

maybe I can delete that user, and then go back to just needing the cert and I can sign in with another account to fix this… let’s try that LOL.

Finally direct guidance! Woo

and….

and….

No matter what browser or machine I try to connect to it, it just error 400.

This… shit… sucks.

Can you make this idea work… yes.

Can you fix this if you lose all your keys? Not easily: you’d have to know the exact commit where the change was made, and if other changes were made after that, they would temporarily not be applied during the recovery period.

Facepalm…. I don’t know why I didn’t think of this sooner… you don’t use set, you use delete in the cli to set it to none.

Upgrading Windows 10 2016 LTSB to 2019 LTSC

*Note 1* – This retains the Channel type.
*Note 2* – Requires a new Key.
*Note 3* – You can go from LTSB to SA, keeping files if you specify new key.
*Note 4* – LTSC versions.
*Note 5* – Access to ISOs. This is hard and most places state to use the MS download tool. I, however, managed to get the image and key thanks to having an MSDN aka Visual Studio subscription.

I attempted to grab the 2021 Eval copy and ran the setup exe. When it got to the point of wanting to keep existing files (aka upgrading) it would grey them all out… 🙁

So I said no to that, and grabbed the 2019 copy which when running the setup exe directly asks for the key before moving on in the install wizard… which seems to let me keep existing files (upgrade) 🙂

My enjoyment was short-lived when I was presented with a nice “Windows update failed” window.

Classic. So the usual, “sfc /scannow”

Classic. So fix it, “dism /online /cleanup-image /restorehealth”

Stop, Disable Update service, then clear cache:

Scan system files again, “sfc /scannow”

Reboot, make sure the system still boots fine (check), do another sfc /scannow, which returns 100% clean. Run Windows Update (after re-enabling the service), which comes back saying 100% up to date. Run the installer….

For… Fuck… Sakes… what logs are there for this dumb shit? Log files created when you upgrade Windows 11/10 to a newer version (thewindowsclub.com)

setuperr.log (location: same as setupact.log): Data about setup errors during the installation. Use it to review all errors encountered during the installation phase.

Coool… where is this dumb shit?

Log files created when an upgrade fails during installation before the computer restarts for the second time.

  • C:\$Windows.~BT\Sources\panther\setupact.log
  • C:\$Windows.~BT\Sources\panther\miglog.xml
  • C:\Windows\setupapi.log
  • [Windows 10:] C:\Windows\Logs\MoSetup\BlueBox.log
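Rather than eyeballing each file, a small script can pull out just the error lines. This is only a sketch in Python under my own assumptions: the paths are the ones from the list above, the files are read as UTF-8 with bad bytes replaced (a log written in another encoding would need a different encoding argument), and matching on the word "error" is a crude filter rather than anything official. The helper names are made up.

```python
from pathlib import Path

# Candidate log locations, copied from the list above.
LOG_PATHS = [
    r"C:\$Windows.~BT\Sources\panther\setupact.log",
    r"C:\$Windows.~BT\Sources\panther\miglog.xml",
    r"C:\Windows\setupapi.log",
    r"C:\Windows\Logs\MoSetup\BlueBox.log",
]


def error_lines(text, needle="error"):
    """Return (line_number, line) pairs containing `needle`, case-insensitively."""
    needle = needle.lower()
    return [
        (number, line.rstrip())
        for number, line in enumerate(text.splitlines(), 1)
        if needle in line.lower()
    ]


def scan_logs(paths=LOG_PATHS):
    """Yield (path, line_number, line) for matching lines in the logs that exist."""
    for path in map(Path, paths):
        if path.is_file():  # silently skip logs this upgrade attempt did not create
            text = path.read_text(encoding="utf-8", errors="replace")
            for number, line in error_lines(text):
                yield str(path), number, line


if __name__ == "__main__":
    for path, number, line in scan_logs():
        print(f"{path}:{number}: {line}")
```

Calling error_lines(text, "PidGenX") narrows the output to one specific message once you know what you are hunting for.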

OK checking the log…..

Lucky me, something exists as documented; count my graces. What has this file got for me?

PC Load Letter? WTF does that mean?! While it’s not shown in this image (it must have been resolved by then), I had a line that stated “required profile hive does not exist”. I managed to find this MS thread about the same problem, and thankfully someone in the community came back with an answer: create a new local temp account, and remove all old profiles and accounts on the system (this might be hard for some; it was not an issue for me). Sadly, I still got “Windows 10 install failed”.

For some reason the next one that sticks out like a sore thumb for me is “PidGenX function failed on this product key”. Which led me to this thread all the way back from 2015.

While there’s a useless comment by “SaktinathMukherjee”, don’t be this dink saying they downloaded some third party software to fix their problem, gross negligent bullshit. The real hero is a comment by a guy named “Nathan Earnest” – “I had this same problem for a couple weeks. Background: I had a brand new Dell Optiplex 9020M running Windows 8.1 Pro. We unboxed it and connected it to the domain. I received the same errors above when attempting to do the Windows 10 upgrade. I spent about two weeks parsing through the setup error logs seeing the same errors as you. I started searching for each error (0x800xxxxxx) + Windows 8.1. Eventually I found one suggesting that there is a problem that occurs during the update from Windows 8 to Windows 8.1 in domain-connected machines. It doesn’t appear to cause any issues in Windows 8.1, but when you try to upgrade to Windows 10… “something happened.”

In my case, the solution: Remove the Windows 8.1 machine from the domain, retry the Windows 10 upgrade, and it just worked. Afterwards, re-join the machine to the domain and go about your business.

Totally **** dumb… but it worked. I hope it helps someone else.”

Again, I’m free to try stuff, so since I was testing I cloned the machine and left it disconnected from the network, then under computer properties changed from domain to workgroup (which means it doesn’t remove the computer object from AD, it just removes itself from being part of a domain). After this I ran another sfc /scannow just to make sure no issue happened from the VM cloning, with 100% green I ran the installer yet again, and guess what… Nathan was right. The update finally succeeded, I can now choose to rename the PC and rejoin the domain, or whatever, but the software on the machine shouldn’t need to be re-installed.

Another fun dumb day in paradise, I hope this blog post ends up helping someone.

 

Move Linux Swap and Extend OS File System

Story

So, you go to run updates, in this case some Linux servers. So, you dust off your old dusty fingers and type the blissful phrase, “apt update” followed by the holier than thou “apt upgrade”….

You watch as the text scrolls past the screen in beautiful green text console style, as you whisper, “all I see is blonde, and brunette”, having seen the same text so many times you glaze over them following up with “ignorance is bliss”.

Your sweet dreams of living in the Matrix come to a halt as, instead of success, you see the dreaded red text on the screen and realize the Matrix has no red text. Shucks, this is reality, and the update has just failed.

Reality can be a cruel place, and it can also be unforgiving. In this case the application that failed to update is not the problem (I mean, you could assign blame here if the devs and maintainers didn’t do any due diligence on efficiencies, but I digress); the problem was simply one as old as computers themselves: “Not enough storage space”.

Now, you might be wondering at this point… what does this have to do with Linux Swap?!?!? Like any good ol’ storyteller, I’m gettin’ to that part. Now where was I… oh yes, that pesky no-space issue. Normally this would be a very simple endeavor, either:

A) Go clear up needless crap.
Trust me I tried, ran apt autoclean, and apt clean. Looked through the File System, nothing was left to remove.

B) Add more storage.
This is the easiest route: if virtualized, simply expand the VM’s HDD on the host that’s serving it; if physical, dd the contents to a larger drive on a similar bus.

Lucky for me the server was virtual, now comes the kicker, even after expanding the hard drive, the Linux machine was configured to have a partition-based Swap. In both situations, virtual and physical, this will have to be dealt with in order to expand the file system the Linux OS is utilizing.

Swap: What is it?

Swap is space on a hard drive reserved for temporarily holding memory contents when a request for memory is made and there is no more actual RAM (Random Access Memory) available for it. The system simply takes less frequently accessed memory and kind of “sweeps it under the rug” to be cleaned up or used later.

If you were running a system with massive amounts of memory, you could, in theory, run without swap, just remove it and life’s good. However, in lots of cases memory is a scarce commodity vs something like hard drive storage, the difference is merely speed.

Anyway, in this case I attempted to remove swap entirely (steps will be provided shortly). However, this system was provisioned with just enough memory that several MB of RAM were actually being placed into swap. As such, when I removed the swap and all the services began to spin up, the VM became unusable, as running commands would fail, unable to allocate memory. So instead, the swap was simply changed from a partition to a file-based swap.

Step 1) Stop Services

This step may or may not be required; it depends on your system’s current resource allocations. If you’re in the same boat I was in, where commands won’t run because the system is at max memory usage, then this is needed to ensure the system doesn’t become unusable during the transition, since swap has to be disabled for a short time.

The commands to stop services will depend on both the Linux distro used and the service being managed. This is beyond the scope of this post.

Step 2) Verify Swap

Run the command:

swapon -s

This is an old Linux machine I plan on decommissioning, but as we can see here, a shining example of a partition-based swap, and the partition it’s assigned to. /dev/sda3. We can also see some of the swap is actually used. During my testing I found Linux wouldn’t disable swap if it is unable to allocate physical memory for its content, which makes sense.

Step 3) Create Swap File

Create the swap file before disabling the current swap partition, or apparently the dd command will fail due to memory buffers.

dd if=/dev/zero of=/swapfile count=1 bs=1GiB

This also depends on the size of your old swap; change the command accordingly, based on the size of the partition you plan to remove. In my case, roughly a gig.
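If you’d rather not guess the size, it can be worked out from the swapon -s output. Here is a rough Python sketch, assuming the usual column layout (Filename, Type, Size in KiB, Used, Priority); the helper names and the sample output are made up. It builds the dd command with a 1 MiB block size, which is my own swap-in: a small block size keeps dd’s buffer small, which matters on a box that is already short on memory.

```python
def swap_sizes_kib(swapon_output: str) -> dict[str, int]:
    """Map each swap device or file to its size in KiB, from `swapon -s` text."""
    sizes = {}
    for line in swapon_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 3:
            sizes[fields[0]] = int(fields[2])  # third column is the size in KiB
    return sizes


def dd_count_mib(size_kib: int) -> int:
    """Number of 1 MiB blocks needed to hold at least size_kib (rounds up)."""
    return -(-size_kib // 1024)


# Made-up sample in the shape `swapon -s` prints.
SAMPLE = """\
Filename                                Type            Size    Used    Priority
/dev/sda3                               partition       1046524 5120    -2
"""

if __name__ == "__main__":
    for name, size_kib in swap_sizes_kib(SAMPLE).items():
        count = dd_count_mib(size_kib)
        print(f"{name}: dd if=/dev/zero of=/swapfile bs=1M count={count}")
```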

chmod -v 0600 /swapfile
chown -v root:root /swapfile
mkswap /swapfile

Step 4) Disable Swap

Now it’s time for us to disable swap so we can convert it to a file-based version. If it states it can’t move the data to memory because memory is full, go back to Step 1 and stop services to make room in memory. If this can’t be done due to service requirements, then you’d have to schedule a maintenance window, since without enough memory on the host, service interruption is inevitable… Mr. Anderson.

swapoff /dev/sda3

easy peasy.

Step 5) Enable Swap File

swapon /swapfile

Step 6) Edit fstab

Now it looks like we’re done, but don’t forget swap is handled by fstab after a reboot. Just ask me how I know…. Yeah, I found out the hard way… let’s check the existing fstab file…

cat /etc/fstab
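The actual fstab change isn’t shown above, so here is a hedged sketch of one way to do it, written in Python so it can be tried on a copy of the file first. It assumes a conventional fstab where the third field is the filesystem type; the function name is made up, and the entry it appends (/swapfile none swap sw 0 0) is the common form for a swap file, not something copied from my server.

```python
def switch_swap_to_file(fstab_text: str, swapfile: str = "/swapfile") -> str:
    """Comment out existing swap entries and append one for the swap file."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        is_entry = bool(fields) and not fields[0].startswith("#")
        if is_entry and len(fields) >= 3 and fields[2] == "swap":
            if fields[0] == swapfile:
                continue  # drop an old swap-file entry; it is re-added below
            out.append("# " + line)  # keep the old partition entry for reference
        else:
            out.append(line)
    out.append(f"{swapfile} none swap sw 0 0")
    return "\n".join(out) + "\n"


# Made-up example content; the UUIDs are placeholders.
SAMPLE_FSTAB = """\
UUID=aaaa / ext4 errors=remount-ro 0 1
UUID=bbbb none swap sw 0 0
"""

if __name__ == "__main__":
    print(switch_swap_to_file(SAMPLE_FSTAB), end="")
```

Running it twice gives the same result, so it is safe to re-run after a mistake. Write the output to a temporary file and compare it with the original before replacing /etc/fstab.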

Step 7) Reboot and Verify Services

Wait both mounted as swap… what??!?

To fix this, I removed the partition, had the kernel re-read the partition table, and updated the initramfs, then rebooted:

fdisk /dev/sda
d
3
w
partprobe
update-initramfs -u

Rebooted and swapon showed just the file swap being used. Which means the deleted partition is no longer in the way of the sectors to allow for a full proper expansion of the OS file system. Not sure what was with the error… didn’t seem to affect anything in terms of the services being offered by the server.

Step 8) Extending the OS File System

If you’ve ever done this on Windows, you’ll know how easy it is with Disk Management. On Linux it’s a bit… interesting… you delete the partition and create it again but, unlike what we’ve all come to expect in the Windows realm, that doesn’t delete the data.

fdisk /dev/sda
d
[enter]
n
[enter]
[enter]
[enter]
w
partprobe
resize2fs /dev/sda2

The above simply deletes the second partition, then recreates it using all available sectors on the disk. The data survives as long as the new partition starts at the same sector as the old one. The final commands then let the file system use all the available sectors, as extended by fdisk.
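Since the whole trick depends on the recreated partition starting where the old one did, it doesn’t hurt to record the geometry before running fdisk and compare it afterwards, before resize2fs. A minimal Python sketch, assuming the standard sysfs files /sys/block/<disk>/<partition>/start and size (both in 512-byte sectors); the function names are my own.

```python
from pathlib import Path


def partition_geometry(disk: str, part: str, sys_block: str = "/sys/block") -> tuple[int, int]:
    """Return (start_sector, sector_count) for a partition, as reported by sysfs."""
    base = Path(sys_block) / disk / part
    start = int((base / "start").read_text())
    size = int((base / "size").read_text())
    return start, size


def safe_to_resize(before: tuple[int, int], after: tuple[int, int]) -> bool:
    """True if the partition kept its start sector and did not shrink."""
    return after[0] == before[0] and after[1] >= before[1]


if __name__ == "__main__":
    # Run once before fdisk and note the numbers, then again after partprobe.
    print(partition_geometry("sda", "sda2"))
```

If the start sector changed, stop and fix the partition table before touching the file system.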

Summary

Have fun doing whatever you need to do with all the new extra space you have.  Is there any performance impact from doing this? Again, if you have a system with adequate memory, the swap should never be used. If you want to go down that rabbit hole.. here.. Swap File vs Swap Partition : r/linux4noobs (reddit.com) have fun. Could I have removed the swap partition, created it at the end of the new extended hard drive…. yes… I could have but that would have required calculating the sectors, and extending the new file system to the sector that would be the start of the new swap partition, and I much rather press enter a bunch of times and have the computer do it all for me, I can also extend a file easier than a partition, so read the reddit thread… and pick your own poisons…

Upgrading Windows Server 2016 Core AD to 2022

Goal

Upgrade a Windows Server 2016 Core that’s running AD to Server 2022.

What actually happened

Normally, if the goal is to stay core to core, this should be as easy as an in-place upgrade. When I attempted this myself, the first issue was that it would get all the way to the end of the wizard, then error out telling me to look at some bizarre path I wasn’t familiar with (C:\$windows.~bt\sources\panther\ScanResults.xml). Why? Why can’t the error just be displayed on the screen? Why can’t it be coded for in the dependency checks? Ugh. Anyway, since it was Core, I had to attach a USB stick to my machine, pass it through to the VM, save the file, and open it up. Nested deep in there, it basically stated “Active Directory on this domain controller does not contain Windows Server 2022 ADPREP /FORESTPREP updates.” Seriously? OK, apparently it requires schema updates before upgrading, since it’s an AD server.

Get-ADObject (Get-ADRootDSE).schemaNamingContext -Property objectVersion
d:\support\adprep\adprep.exe /forestprep
d:\support\adprep\adprep.exe /domainprep
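The first command just prints a number, so here is a small lookup to make sense of it. This is a sketch in Python with the schema versions as I know them from Microsoft’s published values; double-check them against the official documentation before relying on it. The helper name and the default target are my own.

```python
# objectVersion of the schema naming context -> the release that introduced it.
SCHEMA_VERSIONS = {
    13: "Windows 2000 Server",
    30: "Windows Server 2003",
    31: "Windows Server 2003 R2",
    44: "Windows Server 2008",
    47: "Windows Server 2008 R2",
    56: "Windows Server 2012",
    69: "Windows Server 2012 R2",
    87: "Windows Server 2016",
    88: "Windows Server 2019 / 2022",
}


def needs_forestprep(object_version: int, target_version: int = 88) -> bool:
    """True if the forest schema is older than the target release requires."""
    return object_version < target_version


if __name__ == "__main__":
    current = 87  # what a Server 2016 forest reports before adprep is run
    print(SCHEMA_VERSIONS.get(current, "unknown"), needs_forestprep(current))
```

After adprep /forestprep completes, re-running the Get-ADObject command should show the higher number.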

Even after all that, the install wizard got past the error, but then after rebooting, and getting to around 30% of the install, it would reboot again and say reverting the install, and it would boot back into Server 2016 core.

Note, you can’t change versions during an upgrade (Standard vs Datacenter) or (Core vs Desktop). For all limitations, see this MS page. The “Keep existing files and apps” option was greyed out and not selectable if I picked Desktop Experience. I had this same issue when I was attempting to upgrade a desktop server and was entering a license key for Standard, not realizing the server had a Datacenter-based key installed.

New Plan

I didn’t look at any logs since I wasn’t willing to track them down at this point to figure out what went wrong. Since I also wanted to go Desktop Experience, I had to come up with an alternative route.

Seems my only option is going to be:

  1. Install a clean copy of Server 2016 Desktop (update completely). (Run sysprep, clone for later)
  2. Add it as a domain controller in my domain.
  3. Migrate the FSMO roles. (If I wanted a clustered AD, I could be done but that wouldn’t allow me to upgrade the original AD server that’s failing to upgrade)
  4. Decommission the old Server 2016 Core AD server.
  5. Install a clean copy of Server 2016 Desktop (update completely). (The cloned copy, should be at OOBE stage)
  6. Add to Domain.
  7. Upgrade to 2022.
  8. Migrate FSMO roles again. (Done if cluster of two AD servers is wanted).
  9. Decommission other AD servers to go back to single AD system.

Clean Install

Using a Windows Server 2016 ISO image and a newly spun up VM, the install went rather quick, taking only 15 minutes to complete.

Check for updates. KB5023788 and KB4103720. This is my biggest pet peeve, Windows updates.

RANT – The Server 2016 Update Race

As someone who’s a resource hall monitor, I like to see what a machine is doing and I use a variety of tools and methods to do so, including Resource Monitor, Task Manager (for Windows), Htop (linux) and all the graphs available under the Monitor tab of vSphere. What I find is always the same, one would suspect high Disk, and high network (receive) when downloading updates (I see this when installing the bare OS, and the disk usage and throughput is amazing, with low latency, which is why the install only took 15 minutes).

Yet when I click check for updates, it’s always the same: a tiny bit of bandwidth usage, low disk usage, and just endless high CPU usage. I see this ALL THE TIME. Another thing I see is that once it’s done and rebooted, you think the install is done, but no, the Windows Update service will kick off and continue to process “whatever” in the background for at least another half hour.

Why is Windows updates such Dog Shit?!?! Like yay we got monthly Cumulative updates, so at least one doesn’t need to install a rolling ton of updates like we did with the Windows 7 era. But still the lack of proper reporting, insight on proper resource utilization and reliance on “BITS”… Just Fuck off wuauclt….

Ughhh, as I was getting snippets ready to show this, and I wanted to get the final snip of it still showing to be stuck at 4%, it stated something went wrong with the update, so I rebooted the machine and will try again. *Starting to get annoyed here*.

*Breathe* Ok, go grab the latest ISO available for Windows Server 2016 (Updates Feb 2018). So I’m guessing it has KB4103720 already baked in, but then I check the system resources and it’s different.

But as I’m writing this it seems the same thing is happening, updates stalling at 5%, and CPU usage stays at 50%, Disk I/O drops to next to nothing.

*Breaks* Man Fuck this! An announcer is born! Fuck it, we’ll do it live!

I’ll let this run, and install another VM with the latest ISO I just downloaded, and let’s have a race, see if I can install it and update it faster than this VM…. When the new VM finished installing, I set a couple config settings. Check for updates:

Check for updates. KB5023788 and KB4103723. Seriously?

Install. Wow, the downloading of updates is going much quicker. Well, the download did; after clicking install it’s sticking @ 0%, and the other VM is finishing installing KB4103720. I wonder if it needs to install KB4103723 as well, if so then the new VM is technically already ahead… man this race is intense.

I can’t believe it, the second server, which I gave more memory to and which used the latest available image from Microsoft, does the exact same thing as the first one.. gets stuck at 5%.. CPU usage 50% for almost an hour.. and errors.

lol No fucking way… reboot check for updates, and:

at the same time on the first VM that has been checking for updates forever which said it completed the first round of updates…

This is unreal…

Shit pea one, and shit pea 2, both burning up the storage backend in 2 different ways…. for the same update:

Turd one really rips the disk:

Turd two does a bit too, but more just reads:

I was going to say both turds are still at 0%, but Turd one, like it did before, spontaneously burst back into “Checking for updates” while the second one seems to have moved up to 5%… mhmm feel like I’ve been down this road before.

Damn this sucks, just update already FFS, stupid Windows. *Announcer* “Get your bets here!, Put in your bets here!” Mhmmm I know turd one did the same thing as turd 2, but it did complete one round of updates, and shows a higher version than turd 2, even though turd 2 was the latest downloadable ISO from Microsoft.

I’m gonna put my bets on Turd 1….

Current state:

Turd 1: “Checking for Updates”… Changed to Downloading updates 5%.. shows signs of some Disk I/O. Heavy CPU usage.

Turd 2: “Preparing Updates 5%” … 50% CPU usage… lil to no Disk I/O.

We are starting to see a lot more action from Turd 1, this race is getting real intense now folks. Indeed, just noticing that Turd one is actually preparing a new set of updates, now past the peasant KB4103720. While Turd 2 shows no signs of changing as it sits holding on to that 5%.

Ohhhh!!! Turd one hits 24% while Turd 2 hit the same error it hit the first time, is it stuck in a failed loop? Let’s just retry this time without a reboot.. and go..! Back on to KB4103720 preparing @ 0%. Not looking good for Turd 2. Turd 1 has hit 90% on the new update download.

And coming back from the break, Turd one is expecting a reboot while Turd 2 hits the same error, again! Stop the Windows Update service, clear the SoftwareDistribution folder. Start the update service, check for updates, try, fail, reboot, retry:

racing past the download stage… Download complete… preparing to install updates… oh boy… While Turd one is stuck at a blue screen “Getting Windows Ready”. The race between these two can’t get any hotter.

Turd one is now at 5989 from 2273. While Turd 2 stays stuck on 1884. Turd 2 managed to get up to 2273, but I wasn’t willing to watch the hours it takes to get to the next jump. Turd 1 wins.

Checking these build numbers looks like Turd 1 won the update race. I’m not interested in what it takes to get Turd 2 going. Over 4 hours just to get a system fully patched. What a Pain in the ass. I’m going to make a backup, then clear the current snap shot, then create a new snapshot, then sysprep the machine so I can have a clean OOBE based image for cloning, which can be done in minutes instead of hours.

END RANT

Step 2) Add as Domain Controller.

Wow amazing no issues.

Step 3) Move FSMO Roles

Transfer PDCEmulator

Move-ADDirectoryServerOperationMasterRole -Identity "ADD" PDCEmulator

Transfer RIDMaster

Move-ADDirectoryServerOperationMasterRole -Identity "ADD" RIDMaster

Transfer InfrastructureMaster

Move-ADDirectoryServerOperationMasterRole -Identity "ADD" Infrastructuremaster

Transfer DomainNamingMaster

Move-ADDirectoryServerOperationMasterRole -Identity "ADD" DomainNamingmaster

Transfer SchemaMaster

Move-ADDirectoryServerOperationMasterRole -Identity "ADD" SchemaMaster

Step 4) Demote Old DC

Since it was a Core server, I had to demote it via Server Manager from a remote client machine (Windows 10). Again no problem.

As the final part of the wizard said, it became a member server. So not only did I delete it under Sites and Services, I deleted it under ADUC as well.

Step 5) Create new server.

I recovered the system above, changed hostname, sysprepped.

This took literally 5 minutes, vs the 4 hours to create from scratch.

Step 6) Add as Domain Controller.

Wow amazing no issues.

Step 7) Upgrade to 2022.

Since we got 2 AD servers now, and all my servers are pointing to the other one, let’s see if we can update the Original AD server that is now on Server 2016 from the old Core.

Ensure Schema is upgraded first:

d:\support\adprep\adprep.exe /forestprep

d:\support\adprep\adprep.exe /domainprep

run setup!

It took over an hour, but it succeeded…

Summary

If I had an already updated system that was already on Desktop Experience, this might have been faster. I’m still not sure why the in-place upgrade didn’t work for Server Core, but here’s how you can get it to Desktop Experience and then up to 2022. It does unfortunately require a brand new install, with service migrations.

Veeam Backup Encryption

Story

So, a couple posts back I blogged about getting an NTFS USB drive shared to a Windows VM via SMB to store backups on, so that the drive could easily be plugged into a Windows machine with Veeam on it to recover the VMs if needed. However, you don’t want to make it this easy if it were to be stolen. What’s the solution? Encryption… and remembering passwords. Woooooo.

Veeam’s Solution; Encryption

Source: Backup Job Encryption – User Guide for VMware vSphere (veeam.com)

I find it strange in their picture they are still using Windows Server 2012, weird.

Anyway, so I find my Backup Copy job and sure enough find the option:

Mhmmm, so the current data won’t be converted I take it then…

Here’s the backup files before:

and after:

As you can see the old files are completely untouched and a new full backup file is created when an Active full is run. You know what that means…

Not Retroactive

“If you enable encryption for an existing job, except the backup copy job, during the next job session Veeam Backup & Replication will automatically create a full backup file. The created full backup file and subsequent incremental backup files in the backup chain will be encrypted with the specified password.

Encryption is not retroactive. If you enable encryption for an existing job, Veeam Backup & Replication does not encrypt the previous backup chain created with this job. If you want to start a new chain so that the unencrypted previous chain can be separated from the encrypted new chain, follow this Veeam KB article.”

What the **** does that even mean…. To start, I’d prefer not to have a new chain, but since an Active full was required there’s the start of a new chain, so… so much for that. Second… Why would I want to separate the unencrypted chain from the new encrypted chain? Wouldn’t it be nice to have those same points still exist and be selectable but just be encrypted? Whatever… let’s read the KB to see if maybe we can get some context to that odd sentence. It’s literally talking about disassociating the old backup files from that particular backup job. Now with such misdirected answers it would seem it straight up is not possible to encrypt old backup chains.

Well, that’s a bummer….

Even changing the password is not possible, while they state it is, it too is not retroactive as you can see by this snippet of the KB shared. Which is also mentioned in this Veeam thread where it’s being asked.

So, if your password is compromised but the backup files have not been, you can’t change the password and keep your old backup restore points without going through a nightmare procedure or restoring all points and backing them up again somehow?

Also, be cautious checking off this option, as it encrypts the metadata file and can prevent import of non-encrypted backups. “You can enter password and read data from it, but you cannot “remove the lock” retroactively.

Reason why Veeam asks for passwords even on non-encrypted chains, is because backupdata metadata (holding information about all restore points in the chain, including encrypted and non encrypted ones) is encrypted too!”

“Metadata will be un-encrypted when last encrypted restore point it describes will be gone by retention.”

Huh, that’s good to know… this lack of retroactive ability is starting to really suck ass here. Like I get the limitation that there’d be high I/O switching between them, but if BitLocker for Windows can do it for a whole O/S drive LIVE, nonetheless, why can’t Veeam do it for backup sets?

Summary

  • Veeam Supports Encryption
    • Easy, Checkbox on Backup Job
    • Uses Passwords
    • Non Retroactive

I’ll start off by saying it’s nice that it’s supported, to some extent. What would be nice is:

  1. Openness of what Encryption algos are being used.
  2. Retroactive encryption/decryption on backup sets.
  3. Support for Certificates instead of passwords.

I hope this review helps someone. Cheers.

No coredump target has been configured. Host core dumps cannot be saved.

ESXi on SD Card

Ohhh ESXi on SD cards, it got a little controversial but we managed to keep you. Doing the latest install I was greeted with the nice warning “No coredump target has been configured. Host core dumps cannot be saved.”

What does this mean you might ask. Well in short, if there ever was a problem with the host, log files to determine what happened wouldn’t be available. So it’s a pick your poison kinda deal.

Store logs and possibly burn out the SD/USB drive storage, which isn’t good at that sort of thing, or point it somewhere else. Here’s a nice post covering the same problem and the comments are interesting.

Dan states “Interesting solution as I too faced this issue. I didn’t know that saving coredump files to an iSCSI disk is not supported. Can you please provide your source for this information. I didn’t want to send that many writes to an SD card as they have a limited number (all be it a very large number) of read/writes before failure. I set the advanced system setting, Syslog.global.logDir to point to an iSCSI mounted volume. This solution has been working for me for going on 6 years now. Thanks for the article.”

with the OP responding “Hi Dan, you can definately point it to an iscsi target however it is not supported. Please check this KB article: https://kb.vmware.com/s/article/2004299 a quarter of the way down you will see ‘Note: Configuring a remote device using the ESXi host software iSCSI initiator is not supported.’”

Options

Option 1 – Allow Core Dumps on USB

Much like the source I mentioned above: VMware ESXi 7 No Coredump Target Has Been Configured. (sysadmintutorials.com)

Edit the boot options to allow Core Dumps to be saved on USB/SD devices.

Option 2 – Set Syslog.global.logDir

You may have some other local storage available; in that case set the variable above to that local or shared storage (shared storage being “unsupported”).

Option 3 – Configure Network Coredump

As mentioned by Thor – “Apparently the “supported” method is to configure a network coredump target instead rather than the unsupported iSCSI/NFS method: https://kb.vmware.com/s/article/74537”

Option 4 – Disable the notification.

As stated by Clay – “

The environment that does not have Core Dump Configured will receive an Alarm as “Configuration Issues :- No Coredump Target has been Configured Host Core Dumps Cannot be Saved Error”.
In the scenarios where the Core Dump partition is not configured and is not needed in the specific environment, you can suppress the Informational Alarm message, following the below steps,

Select the ESXi Host >

Click Configuration > Advanced Settings

Search for UserVars.SuppressCoredumpWarning

Then locate the string and and enter 1 as the value

The changes takes effect immediately and will suppress the alarm message.

To extract contents from the VMKcore diagnostic partition after a purple screen error, see Collecting diagnostic information from an ESX or ESXi host that experiences a purple diagnostic screen (1004128).”

Summary

In my case it’s a home lab, I wasn’t too concerned so I followed Option 4, then simply disabled file core dumps following the second steps in Permanently disable ESXi coredump file (vmware.com)

Note* Option 2 was still required to get rid of another message: System logs are stored on non-persistent storage (2032823) (vmware.com)

Not sure, but maybe still helps with I/O to disable coredumps. Will update again if new news arises.

Word document always opening with properties sidebar

SharePoint Story

The Problem

User: “I get this properties panel opening every time I open a word doc from this SharePoint site.”

The Burdensome Solution

Me Google, find Fix:

Here’re steps:

  1. Go to Library Settings -> Advanced Settings -> Set “Allow management of Content Types?” as “Yes”.
  2. Go to Library Settings -> Click the Document content type under content types section -> Document Information panel Settings -> Uncheck the box “Always Show the Document Information Panel”.

I think we can do better than that

Based on the answer, this requires two steps: enabling management of content types, then flipping the switch. As you may have guessed, that does not scale well and would be really time consuming against hundreds of lists. If you know front ends, they always rely on some backend, so how’s the backend doing it? How to fix via backend

This didn’t work. Why? Cause the first linked site, which is the real answer, is doing it per list’s document content type, whereas the answer above is doing it just at the site level. The difference is noted at the beginning of this long TechNet post.

What it did tell me is how the SchemaXml property is edited, which seems to be by editing the XmlDocuments array property.

So with these three references in mind, we should now be able to actually fix the problem via the backend.

The Superuser Solution

First we need to build some variables:

$plainSchema = "http://schemas.microsoft.com/office/2006/metadata/customXsn"

While this variable may not change (in this case it’s for SharePoint 2016), how was it derived? From this:

((((Get-spweb http://spsite.consoto.com).Lists) | ?{($_.ContentTypes["Document"]).SchemaXml -match "<openByDefault>True"})[0].ContentTypes["Document"]).XmlDocuments[1]

This takes a SharePoint web site and goes through all its lists, but only grabs the lists which have a content type of Document associated with them.
All of these objects will have a “SchemaXml” property; now only grab the ones whose schema has “openByDefault” set to true.
From this list of objects only grab the first one “[0]”, grab its Document content type object, and spit out the second XmlDocument “.XmlDocuments[1]”.
From this string XML output we want the xmlns value of the “customXsn” element:

<customXsn xmlns="http://schemas.microsoft.com/office/2006/metadata/customXsn"><xsnLocation></xsnLocation><cached>True</cached><openByDefault>True</openByDefault><xsnScope></xsnScope></customXsn>

Why? For some reason the Content Type’s SchemaXml property can not be directly edited.

Why? Unsure, but it is this field property that gets changed when doing the fix via the front end.

$goodSchema = '<customXsn xmlns="http://schemas.microsoft.com/office/2006/metadata/customXsn"><xsnLocation></xsnLocation><cached>True</cached><openByDefault>False</openByDefault><xsnScope></xsnScope></customXsn>'

This can also be derived by using a Replace operation, flipping true to false.
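To show what I mean by the Replace operation, here’s a minimal sketch in Python (just to illustrate the string handling, it’s not SharePoint code). The function name is made up. The one thing to watch: the XML also contains <cached>True</cached>, so a blind replace of “True” with “False” would flip that too. Replacing the whole openByDefault element avoids it.

```python
# The customXsn XML as pulled from the affected content type (from the post above).
PLAIN_XML = (
    '<customXsn xmlns="http://schemas.microsoft.com/office/2006/metadata/customXsn">'
    "<xsnLocation></xsnLocation><cached>True</cached>"
    "<openByDefault>True</openByDefault><xsnScope></xsnScope></customXsn>"
)


def disable_open_by_default(schema_xml: str) -> str:
    """Flip only the openByDefault element from True to False (hypothetical helper)."""
    return schema_xml.replace(
        "<openByDefault>True</openByDefault>",
        "<openByDefault>False</openByDefault>",
    )


good_xml = disable_open_by_default(PLAIN_XML)

if __name__ == "__main__":
    print(good_xml)
```

The same idea should carry over to a PowerShell Replace on the string, as long as the full element is matched rather than just the word True.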

After this is done we need to build an object (type of array) that will hold all lists that are affected (have openByDefault set to true):

$problemDroids = ((Get-spweb http://spsite.consoto.com).Lists) | ?{($_.ContentTypes["Document"]).SchemaXml -match "<openByDefault>True"}
$problemDroids | %{($_.ContentTypes["Document"]).XmlDocuments.Delete($plainSchema)}
$problemDroids | %{($_.ContentTypes["Document"]).XmlDocuments.Add($GoodSchema)}
$problemDroids | %{($_.ContentTypes["Document"]).Update()}

Not good Enough

User: “It’s not working on all sites”

Solution: The above code will go through all document libraries affected at the root site, if you have subsites you simply have to add .Webs[0].Webs to the initial call for creating the “problemDroids” variable. The level of how deep you need to go depends on how many subsites your SharePoint implementation has.

$problemDroids = ((Get-spweb http://spsite.consoto.com).Webs[0].webs.Lists) | ?{($_.ContentTypes["Document"]).SchemaXml -match "<openByDefault>True"}
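Chaining .Webs[0].Webs only reaches one fixed depth, so if your subsites are nested unevenly a recursive walk is the more general idea. Here’s a rough sketch of that walk in Python, using a made-up Web class as a stand-in for a SharePoint web (title, lists, subsites). None of these names come from the SharePoint object model; it only shows the traversal.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Web:
    """Hypothetical stand-in for a SharePoint web: a title, its lists, its subsites."""
    title: str
    lists: List[str] = field(default_factory=list)
    webs: List["Web"] = field(default_factory=list)


def walk_webs(web: Web) -> Iterator[Web]:
    """Yield the web itself, then every subsite below it, depth first."""
    yield web
    for child in web.webs:
        yield from walk_webs(child)


def all_lists(root: Web) -> List[str]:
    """Collect the lists of the root web and of every nested subsite."""
    return [name for web in walk_webs(root) for name in web.lists]


# Made-up site structure with subsites two levels deep.
root = Web("root", ["Documents"], [
    Web("hr", ["Policies"], [Web("hr-archive", ["Old Policies"])]),
    Web("it", ["Runbooks"]),
])

if __name__ == "__main__":
    print(all_lists(root))
```

With a walk like this you don’t have to know ahead of time how deep the subsites go.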

Summary

Something that should have been a boolean type property on the object was really a boolean nested in XML, which was of a string type.

Standing Ovation. Fun had by all parties involved. Tune in next week when I post more SharePoint content.

Xbox One No Video Output
Replace Xbox One HDD

Expectation: Existing Slot, Easy pull out, Plug in new HDD, have USB stick with offline installer that you plug into unit and power on, and done.

Reality:

First off, Existing Slot, Easy Pull out. hahahahah. Try complicated clipped casing, and a caddy for a caddy with 11+ screws for just mounting the HDD to the chassis. If you need a video on that process you can watch this one by Joe:

Microsoft Xbox One S Hard Drive HDD Replacement | Repair Tutorial – YouTube

Then there’s the expectation that the OS install will partition and format the disk… no, you have to preformat it. Does MS give you a tool? No, the community had to do it:

Xbox One Windows and Linux Internal Hard Drive Partitioning Script | GBAtemp.net – The Independent Video Game Community

Then, you need an 8GB USB stick formatted to NTFS, to copy the Offline OS Installer onto.

Perform an offline system update | Xbox Support

For the most comprehensive step by step, watch this video by XFiX:

Xbox One Internal Hard Drive Repair or Replace Using Windows Series 7 – YouTube

Unfortunately for me the system I was working on had no video output after booting, and no matter what I did, including installing a new HDD, I couldn’t get video to work.

If you have any thoughts or suggestions on how to fix a no video display issue (I already did the eject and power on hold for 10 seconds to default video output, didn’t work), please leave a comment. 🙂

*Update* the Video problem was related to the HDD.

I tried a couple more times and had the following results.

Using the old HDD would seem good enough to boot, but it would fail on all update attempts and end up in a 200 or 106 error state. If I got a boot into the maintenance window, hot swapped the HDD and did an offline update, I’d get a 101 error, and if rebooted, a 102 or 106 error.

I didn’t have any good 500GB or bigger 2.5″ HDDs around, only smaller ones, so I ended up finding this video by XFiX on using smaller drives. I followed along with the video, and when the step to copy the data came up the process came to a halt, on the SYSTEM UPDATE partition nonetheless. Since I knew it had completed up to this point, I hard killed the script, and it hung the Linux machine. After a reboot, I completed the last part, “stage 3”, defining the GUIDs.

I then popped the HDD into the Xbox and it actually showed the maintenance screen almost instantly, and doing the offline update then actually succeeded without issue.

After a reboot, the box was fixed and fully working!

Manually Fix Veeam Backup Job after VM-ID change

The Story

There’s been a couple of times where my VM-IDs changed:

  • A vSphere server has crashed beyond a recoverable state.
  • A server has been removed and added back into the inventory in vSphere.
  • Manually move a VM to a new ESXi host.
    • VM removed from inventory, and readded.
  • Loss of vCenter Server.
  • Full VM Recovery via Veeam.

What sucks is when you go to run the Job in Veeam after any of the above, the job simply fails to find the object. You can edit the job by removing the VM and re-adding it, but this will build a whole new chain, which you can see in the repo of Veeam after such events occur:

As you can see two chains, this has been an annoyance for a long time for me, as there’s no way to manually set the VM-ID in vCenter, it’s all automanaged.

I found this Veeam thread discussing the same issue, and someone mentioned “an old trick” which may apply, and linked to a blog post by someone named “Ideen Jahanshahi”.

I had no idea about this, let’s try…

Determine VM-ID on vCenter

The source uses PowerCLI, which I’ve covered installing, but it’s easier to just use the Web UI: select the VM and grab the ID from the address bar, after the vms parameter.
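If you’d rather not eyeball the address bar, a tiny regex does the job. This is a minimal Python sketch; the example URL is made up (your vSphere client’s URL layout may differ between versions), and all it assumes is that the managed object ID shows up somewhere in the URL in the vm-<number> form.

```python
import re
from typing import Optional

# Matches a vSphere VM managed object ID such as vm-55633.
VM_ID_PATTERN = re.compile(r"\bvm-\d+\b")


def extract_vm_id(url: str) -> Optional[str]:
    """Return the first vm-<number> token found in the URL, or None."""
    match = VM_ID_PATTERN.search(url)
    return match.group(0) if match else None


# Hypothetical URL, for illustration only.
example_url = (
    "https://vcenter.example.com/ui/app/vm;nav=v/"
    "urn:vmomi:VirtualMachine:vm-55633:00000000-0000-0000-0000-000000000000/summary"
)

if __name__ == "__main__":
    print(extract_vm_id(example_url))
```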

Determine VM-ID in Veeam

The source installs SSMS, and much like my fixing WSUS post, I don’t like installing heavy stuff on my servers to do managerial tasks. Lucky for me, SQLCMD is already installed on the Veeam server so no extra software needed.

Pre-reqs for SQLCMD

You’ll need the hostname. (run command hostname).

You’ll need the Instance name. (Use services.msc to list SQL services)

Connect to Veeam DB

Open CMD as admin

sqlcmd -E -S Veeam\VEEAMSQL2012

use VeeamBackup
:setvar SQLCMDMAXVARTYPEWIDTH 30
:setvar SQLCMDMAXFIXEDTYPEWIDTH 30
SELECT bj.name, bo.object_id FROM bjob bj INNER JOIN ObjectsInJobs oij ON bj.id = oij.job_id INNER JOIN Bobjects bo ON bo.id = oij.object_id WHERE bj.type=0
go

For some reason the above code wouldn’t work on my latest build/install of Veeam, but this one worked:

SELECT name, job_id, bo.object_id FROM bjobs bj INNER JOIN ObjectsInJobs oij ON bj.id = oij.job_id INNER JOIN BObjects bo ON bo.id = oij.object_id WHERE bj.type=0

In my case, after removing the VM from inventory and re-adding it:

As you can see they do not match, and when I check the VM size in the job properties the size can’t be calculated cause the link is gone.

Fix the Broken Job

UPDATE bobjects SET object_id = 'vm-55633' WHERE object_id='vm-53657'

After this I checked the VM size in the job properties and it was calculated. To my amazement it fully worked; it even retained the CBT points, and the backup job ran perfectly. Woo-hoo!
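Since this is a direct edit of the Veeam database, it’s worth confirming the UPDATE will hit exactly the row you expect before running it. Here’s a minimal sketch of that guard in Python, using an in-memory SQLite table as a stand-in. It is not the real Veeam SQL Server schema (the real table has more columns), and the function name is made up. The idea is simply: count the matches first, make sure the new ID isn’t already present, then update inside a transaction.

```python
import sqlite3


def remap_vm_id(conn: sqlite3.Connection, old_id: str, new_id: str) -> int:
    """Swap old_id for new_id in bobjects, but only if exactly one row matches."""
    cur = conn.cursor()
    cur.execute("SELECT COUNT(*) FROM bobjects WHERE object_id = ?", (old_id,))
    matches = cur.fetchone()[0]
    if matches != 1:
        raise ValueError(f"expected exactly 1 row for {old_id}, found {matches}")
    cur.execute("SELECT COUNT(*) FROM bobjects WHERE object_id = ?", (new_id,))
    if cur.fetchone()[0] != 0:
        raise ValueError(f"{new_id} already exists, refusing to create a duplicate")
    with conn:  # commits on success, rolls back if the UPDATE raises
        conn.execute(
            "UPDATE bobjects SET object_id = ? WHERE object_id = ?",
            (new_id, old_id),
        )
    return matches


# Simplified stand-in table; the IDs are the ones from the post above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bobjects (id INTEGER PRIMARY KEY, object_id TEXT)")
conn.executemany(
    "INSERT INTO bobjects (object_id) VALUES (?)",
    [("vm-53657",), ("vm-11111",)],
)
conn.commit()
changed = remap_vm_id(conn, "vm-53657", "vm-55633")

if __name__ == "__main__":
    print(conn.execute("SELECT object_id FROM bobjects ORDER BY id").fetchall())
```

In sqlcmd the equivalent would be running a SELECT with the same WHERE clause first and checking the row count before the UPDATE, plus having a backup of the database on hand.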

This info is for educational purposes only, what you do in your own environment is on you. Cheers, hope this helps someone.

vCLS High CPU usage

The Story

So I went to vMotion a VM to do some maintenance work on a host. Target machine well over 50% CPU usage.. what?! That can’t be right, it’s not running anything…

I tried hard powering the VM off, but it just came right back up suckin CPU cycles with it….

The Hunt

Alright Google, what ya got for me… I found this blog post by “Tripp W Black”. He mentions a vCenter service called “VMware ESX Agent Manager”, which he stops before deleting the offending VMs. Sounds like a plan. Let’s try it, so log in to VAMI. (vcenter.consonto.com:5480)

K, let’s stop it… let me hard power off the VM now… ehh the VM is staying dead and host CPU:

K let’s go kill the other droid I have causing an issue…

OK, I got them all down now, but the odd part is I can’t delete them from disk like Sir Black mentioned in their blog post. The option is greyed out for me. Let’s start the service and see what happens…

The Pain

Well, that was extremely annoying. It seemed to have worked only for a moment and the CPU usage came right back, so I stopped the service again, but I can’t delete the VMs…

There are similar issues in vSphere 8, and even suggestions to stay running in retreat mode, which I’ll get to in a moment. So, if you are unfamiliar, vCLS VMs are small VMs that are distributed to ESXi hosts to keep HA and DRS features operational, even if vCenter itself goes down. The thing is, I’m not even using HA or DRS. I created a cluster merely for EVC purposes, so I can move VMs between hosts live at my own leisure and without downtime. What’s annoying is I shouldn’t have to spend half my weekend day trying to solve a bug in my HomeLab due to poor design choices.

The Constructive Criticism

VMware…. do not assume a cluster alone requires vCLS. Instead, enable vCLS only when HA or DRS features are enabled.

Now that we have that very simple thing out of the way.

The Fix

So, as we mentioned we are able to stop the vCLS VMs when we stop the EAM service on vCenter, but that won’t be a solution if the server gets rebooted. I decided to Google to see how other people delete vCLS when it doesn’t seem possible.

I found this reddit thread, in which they discuss the same thing mentioned above, “Retreat Mode”. However, after setting the required settings (which are apparently tattooed once done), I still couldn’t delete the VMs, even after restarting the vpxd service. Much like ‘bananna_roboto’ I ended up deleting the vCLS VMs from the ESXi host UI directly; however, when checking the vCenter UI they still showed on all the hosts.

After rebooting the vCenter server, all the vCLS VMs were gone, at first, I thought they’d come back, but since the retreat mode setting was applied it seems they do not get recreated. Hence, I will leave Retreat mode enabled as suggested in the reddit thread for now, since I am not using HA or DRS.

So if you want to use EVC in a cluster, but not HA and DRS and would like to skim even more memory from your hosts, while saving on buggy CPU cycles, apparently “Retreat mode” is what you need.

If you do need those features, and you are unable to delete the old vCLS VMs, and restarting the EAM service doesn’t resolve your issue (which it didn’t for me), you may have to open a support case with VMware.

Anyway, I hope this helped someone. Cheers.