Manually Fix Veeam Backup Job after VM-ID change

The Story

There have been a couple of times where my VM-IDs changed:

  • A vSphere server has crashed beyond a recoverable state.
  • A server has been removed and added back into the inventory in vSphere.
  • Manually moving a VM to a new ESXi host.
    • The VM is removed from inventory and re-added.
  • Loss of the vCenter Server.
  • Full VM Recovery via Veeam.

What sucks is when you go to run the job in Veeam after any of the above, the job simply fails to find the object. You can edit the job by removing the VM and re-adding it, but this will build a whole new chain, which you can see in the Veeam repository after such events occur:

As you can see, there are two chains. This has been an annoyance of mine for a long time, as there's no way to manually set the VM-ID in vCenter; it's all auto-managed.

I found this Veeam thread discussing the same issue, and someone mentioned “an old trick” which may apply, and linked to a blog post by someone named “Ideen Jahanshahi”.

I had no idea about this, let’s try…

Determine VM-ID on vCenter

The source uses PowerCLI, which I've covered installing, but it's easier to just use the Web UI and grab the ID from the address bar after the vms parameter.
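If you do want the PowerCLI route, here's a minimal sketch (assumes the VMware.PowerCLI module is installed; the vCenter and VM names are placeholders for your own):

# Assumes VMware.PowerCLI is installed; server and VM names below are placeholders
Connect-VIServer -Server vcenter.zewwy.ca
# Id comes back like "VirtualMachine-vm-55633"; the vm-#### part is the same ID
# you'd see after the vms parameter in the Web UI address bar
Get-VM -Name "MyVM" | Select-Object Name, Id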

Determine VM-ID in Veeam

The source installs SSMS, and much like my fixing WSUS post, I don’t like installing heavy stuff on my servers to do managerial tasks. Lucky for me, SQLCMD is already installed on the Veeam server so no extra software needed.

Pre-reqs for SQLCMD

You'll need the hostname (run the hostname command).

You'll need the SQL instance name (use services.msc to list the SQL services).
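Both pre-reqs can also be grabbed from one elevated PowerShell prompt on the Veeam server; a quick sketch (the instance name shown is just an example):

# Hostname of the Veeam server
hostname
# SQL services; the instance name is the part after the $, e.g. MSSQL$VEEAMSQL2012 -> VEEAMSQL2012
Get-Service -Name "MSSQL*" | Select-Object Name, DisplayName, Status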

Connect to Veeam DB

Open CMD as admin

sqlcmd -E -S Veeam\VEEAMSQL2012

use VeeamBackup
:setvar SQLCMDMAXVARTYPEWIDTH 30
:setvar SQLCMDMAXFIXEDTYPEWIDTH 30
SELECT bj.name, bo.object_id FROM bjob bj INNER JOIN ObjectsInJobs oij ON bj.id = oij.job_id INNER JOIN Bobjects bo ON bo.id = oij.object_id WHERE bj.type=0
go

For some reason the above query wouldn't work on my latest build/install of Veeam, but this one did:

SELECT name, job_id, bo.object_id FROM bjobs bj INNER JOIN ObjectsInJobs oij ON bj.id = oij.job_id INNER JOIN BObjects bo ON bo.id = oij.object_id WHERE bj.type=0
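If you'd rather not use the interactive prompt at all, the same lookup can be run as a one-shot from the same admin prompt; a sketch, assuming the instance name from the pre-reqs above (-d picks the database, -Q runs the query and exits):

sqlcmd -E -S Veeam\VEEAMSQL2012 -d VeeamBackup -Q "SELECT name, job_id, bo.object_id FROM bjobs bj INNER JOIN ObjectsInJobs oij ON bj.id = oij.job_id INNER JOIN BObjects bo ON bo.id = oij.object_id WHERE bj.type=0"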

In my case, after removing the VM from inventory and re-adding it:

As you can see they do not match, and when I check the VM size in the job properties the size can't be calculated because the link is gone.

Fix the Broken Job

UPDATE bobjects SET object_id = 'vm-55633' WHERE object_id='vm-53657'

After this I checked the VM size in the job properties and it was calculated. To my amazement it fully worked: it even retained the CBT points, and the backup job ran perfectly. Woo-hoo!

This info is for educational purposes only; what you do in your own environment is on you. Cheers, hope this helps someone.

Changing vCenter Hostname

Why?!?! Cause I gotta!

Source: Changing your vCenter Server’s FQDN – VMware vSphere Blog

PreReqs, AKA Checklist

  • Backup all vCenter Servers that are in the SSO Domain before changing the FQDN of the vCenter Server(s)
  • Supports Enhanced Linked Mode (ELM)
  • Changing the FQDN is only supported for embedded vCenter Server nodes
  • Products which are registered with vCenter Server will first need to be unregistered prior to an FQDN change. Once the FQDN change is complete they can then be reregistered.
  • vCenter HA (VCHA) should be destroyed prior to an FQDN change and reconfigured after changes
  • All custom certificates will need to be regenerated
  • Hybrid Linked Mode with Cloud vCenter Server must be recreated
  • vCenter Server that has been renamed will need to be rejoined back to Active Directory
  • Make sure that the new FQDN/Hostname is resolvable to the provided IP address (DNS A records)

NOTE: If the vCenter Server was deployed using the IP as PNID/FQDN, then the following should also be considered:

  • The PNID change workflow cannot be used to change the IP address of vCenter Server
  • The PNID change workflow cannot be used to change the FQDN of vCenter Server

In this scenario, use the vCenter Server Appliance Management Interface (VAMI) to update hostnames or IP changes directly. 

The main thing I was expecting was the certificate issue. In my home lab I removed the SSO domain before this change (just using vsphere.local); no ELM, already using embedded (all-in-one), no VCHA, no Hybrid… oh yeah, not sure if you "leave an SSO domain" before joining back to AD…

My Only Pre-Req

I went into DNS and pre-created A host records for the new server hostname: vCenter.zewwy.ca
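If you prefer to script that part, the record can be pre-created from PowerShell on the Windows DNS server; a sketch with my lab's zone and a placeholder IP:

# Run on (or remotely against) the AD DNS server; zone, name, and IP are placeholders for your environment
Add-DnsServerResourceRecordA -ZoneName "zewwy.ca" -Name "vCenter" -IPv4Address "192.168.1.50" -CreatePtr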

Steps

Basically log into VAMI, and change the name.

Then

and and…. well WTF…

No matter what I do it's greyed out… I thought maybe the untrusted cert might be an issue, so I tried from a machine with a fully trusted chain, and same issue!

Like…. Why… why is Next greyed out? It's like whatever button validation code is written for it is not being triggered. Is this a browser version issue? I couldn't find anything online from anyone having this issue…. Why? Because I was right, it was the input validation…

Honestly, this is one of those MASSIVE facepalm moments in my life. I only realized after the fact that the username field was NOT auto-filled; it was only a greyed-out label provided as a suggestion… Fill in both fields and Next is no longer greyed out…

Step 4, check the checkbox to acknowledge the warning, and away… she goes!

At which point I clicked redirect now (both web addresses were still available, as it didn't seem to matter which you came from; the cert was untrusted either way because the CA was not in my trusted CA store).

5 minutes later….

I tell ya, nothing is more annoying than a spinning circle and the warning "don't refresh" when the status bar simply does not move… sure got some conflicting messages here….

*Starts to sweat*…

After about 10 minutes…

More Certificate Fun!

Alright, so after this, some quick takeaways… when I went to check the site it was "untrusted", but not for the reason I had thought. I assumed it would be the same issue as the source blog (the hostname on the cert), but that was not the case; instead it was simply that the cert chain seemed to be missing and the issuer could not be verified:

as well as:

So what to do about this… You can download the CA cert from vcenter/certs/download.zip (for some reason I had to use IE), then install the CA cert. (I noticed even after I did this I still had a cert warning/error, but by the next day, maybe after cache clearing or an update, it reported green in the web browser.)
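If you'd rather script the import than click through the certificate wizard, something like this should do it on a Windows box (the file path is a placeholder for wherever you extracted download.zip):

# Import the vCenter root CA into the local machine's Trusted Root store
# (path is a placeholder for the extracted contents of vcenter/certs/download.zip)
Import-Certificate -FilePath "C:\Temp\certs\win\vmca-root.crt" -CertStoreLocation Cert:\LocalMachine\Root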

Now when I logged in, I got the ol Cert Alert in the vCenter UI

The first thing to try is removing old CAs.

Which I did, following this VMware KB

I simply followed my other post about this and just clicked Reset to Green on the alert. (Still good days later.)

Backup Solutions

Don't forget to change the server in your backup software; I had to do this in Veeam.

These were my results…

Which go figure errored out…

So right click, go to properties of the object… Next, next…

Accept the vCenter's new certificate.

Now you'd figure all is well, but when I went to create a new backup job and attempted to expand the vCenter server in Veeam, it just hung there…

I ended up rebooting the server and then waiting for all the Veeam services to start. I reopened Veeam, went to Inventory, and clicked the vCenter server; it took a second and then showed all the hosts and the VMs. I clicked it and rescanned to be safe, and got this result, which was a bit different than the applied-settings confirmation above. I think maybe I forgot to rescan the host after applying the new settings, assuming it would have done that as part of the properties change wizard.

Which, lucky for me, now worked: I was able to select a VM in the Veeam backup wizard, and it successfully backed up the VM.
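As a side note, that rescan can also be kicked off from the Veeam PowerShell toolkit instead of the console; a rough sketch (the server name is a placeholder for whatever name the vCenter is registered under in Veeam):

# Rescan the vCenter object in Veeam so the new name/certificate and inventory get picked up
$vc = Get-VBRServer -Name "vcenter.zewwy.ca"   # placeholder name
Rescan-VBREntity -Entity $vc -Wait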

Final Caveats

Like, what the heck, everywhere else it's changed except at the shell. Let's see if we can change this.

Well that was easy enough, no reboot required. 🙂

I also found the local hosts file doesn't update either; the file states it is managed by VAMI, so I may have to look there for potential solutions:

I noticed this since I had to do a workaround for something else, and sure enough caught it. I'll change it manually with vi for now and see what changes after a reboot.

Summary

Overall, literally quick n easy.

  1. Verify DNS records exist.
  2. Use VAMI to edit the hostname via the network management settings, click apply, and wait.
  3. Manually clear out the old Certs that were created under the old hostname.
  4. Reconfigure your backup solution, which is vendor specific (I provided steps for Veeam, as that is the backup vendor I like to use).

Overall the task seemed to go pretty smoothly. I'll follow up with any other issues I might come across in the future. Cheers.


Change vCenter FQDN or IP on Veeam

Story

I recently did an infrastructure upgrade on my home lab, which included moving all my ESXi hosts into a dedicated subnet and making them more dependent on DNS. This has its pros and cons; after it all, my ESXi hosts had their IP addresses changed. I also moved my vCenter and changed its IP address, which is now supported, yay.

Now I had to move Veeam along with it. Originally it was in the same subnet as the ESXi hosts and vCenter, which have all moved; instead of trying to manage cross-subnet comms, I changed Veeam's IP address and pointed its DNS settings to my AD DNS, which has all the ESXi and vCenter host records. It was easy enough: just changing the Windows NIC IP address and the VM's VMPG.

How to

Now when I went to scan the vCenter instance in Veeam, it complained about the certificate, since it was renewed during the vCenter upgrade. I decided I'd change the entry to be based on DNS now that everything else is as well. When I went to edit the object in Veeam it was greyed out. Lucky for me, Veeam had a KB ready to go.

Challenge

The Name/FQDN/IP of the vCenter Server has changed, and needs to be updated within Veeam Backup & Replication.

Solution

Solves Name Change Only
This solution applies ONLY if the vCenter Server database has not changed.
(I did an upgrade, so yes, the database was preserved, which is what you want in order to keep VM-IDs and chains.)

If the Name/FQDN/IP of the vCenter changed due to a reinstall or upgrade, and a new vCenter database was used, the Ref-IDs will have changed. Due to the changed Ref-IDs you will need to follow the documented process in www.veeam.com/KB1299

Step 1

Prior to running the commands below you need to identify the Name\FQDN\IP Veeam is using to communicate with the VC currently. To do this, edit the entry for the vCenter under Backup Infrastructure and note the “Name:”.
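If the console isn't handy, the registered name can also be pulled from the Veeam PowerShell toolkit (covered in Step 2 below); a small sketch that just lists the vCenter entries Veeam knows about:

# List vCenter entries and the exact Name Veeam uses for each
Get-VBRServer -Type VC | Select-Object Name, Type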

Next perform the following steps to change that VMware Server’s name.

Step 2

Launch PowerShell from inside the Veeam Backup & Replication console. You can find the “PowerShell” button under the File-menu’s “Console” section.

The Veeam Backup & Replication PowerShell Toolkit will load.

Step 3

Run the following command:

$Servers = Get-VBRServer -name "old-name"

Replace old-name with the "Name" currently set for the vCenter in the Veeam Backup and Replication console.

Step 4

Run the following command next to change the name:

$Servers.SetName("new-name")

Replace new-name with the new name for the vCenter; this can be an IP, hostname, or FQDN.
Do not remove the quotes on either side.
This change will go into effect as soon as the command in Step 4 completes.

How I did it – One Liner

Verify:

Get-VBRServer -name "Name from Step 1"

Change:

(Get-VBRServer -name "Name From Step 1").SetName("new.domain.com")

Results:

Now you can click Next, then Apply; it should get right past checking the certificate if the certificates are all good… and you end up with the following after a rescan:

That was easy enough. I don't fully understand why they grey out the UI for this change, but there you have it. Happy backups!

vSphere HA Agent cannot be correctly installed or configured… again

Story

Another vCenter Patch, Another problem 😀

This seems to be a reoccurring story these last couple posts…

Error on Host

This time, after updating, a host in the cluster again had the error message.

Troubleshooting

Unlike the last time this happened, the event log wasn't as blatantly flooded with complaints about /tmp being full, and checking the host with

vdf -h

which showed it only 90% full, which was still pretty high and might have explained the one log event that I did see about it:

The ramdisk 'tmp' is full. As a result, the file /tmp/img-stg/data/vmware_f.v00 could not be written

Which was in the log right after this event of attempting to install a base ESXi image?

Installing image profile '(Updated) HPE-ESXi-Image' with acceptance level checking disabled

This seemed a bit weird, but I couldn't find any info other than what's usually a very Microsoft-type answer of "you can just ignore it" or "usually this is not an issue, it's just vCenter connecting to the ESXi host and installing its agent".

OK I guess… moving on… the very next error event was:

Could not stage image profile '(Updated) HPE-ESXi-Image': ('VMware_bootbank_vmware-fdm_7.0.2-18455184', '[Errno 28] No space left on device')

Huh. Now note this host was installed running the official VMware image provided by HPE for this exact hardware, supported by the VMware HCL, so there should be no funny business. However, I feel maybe there's a bit of the known HPE bug, as mentioned the last time this happened; it just hasn't fully flooded /tmp yet.

Lil Side Trail

So, a couple of things to note here. First, the ESXi image is installed on a USB/SD-card style setup, so it is well known that you should define the persistent log location as well as the scratch location (a PowerCLI sketch for those two follows the list below). However, not many sources specify changing the system swap location.

  1. Persistent Log; VMware KB; Tech Blogger
    (Most standard ESXi Log info)
  2. Scratch Log: VMware KB; Tech Blogger 1; Tech Blogger 2
    (Crash Logs, Support log creations)
  3. Swap Location: VMware Doc 1 (Configure), VMware Doc 2 (About), a tech blogger who seems to regurgitate the exact About page from VMware.
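As mentioned above, here is a minimal PowerCLI sketch for the first two (persistent log and scratch); the host name and datastore paths are placeholders, and the scratch change only takes effect after a host reboot:

# Point persistent logs and the scratch location at a real datastore on a USB/SD-booted host
$vmhost = Get-VMHost -Name "esxi01.zewwy.ca"   # placeholder host name
Get-AdvancedSetting -Entity $vmhost -Name "Syslog.global.logDir" | Set-AdvancedSetting -Value "[Datastore1] logs/esxi01" -Confirm:$false
Get-AdvancedSetting -Entity $vmhost -Name "ScratchConfig.ConfiguredScratchLocation" | Set-AdvancedSetting -Value "/vmfs/volumes/Datastore1/.locker-esxi01" -Confirm:$false
# Reboot the host for the scratch location change to take effect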

However, researching this even more, lots of posts on Reddit mentioned the swap files for VMs being in their VM directories, so if using a shared datastore they will reside there, and I shouldn't see issues around swap usage at the host level at all.

If you look in the vCenter Web UI on an ESXi host, there are two options available: VM Swap and System Swap.

The VMware docs don't seem to accurately describe the difference between these two options.

Looking up the error about not being able to stage the file, I found this one blog post which, of course, mentioned changing the swap location to get past the error…

The main thing mentioned by the blogger is "The problem is caused by ESXi not having enough free space available to extract the installation packages.", but he failed to specify exactly where that is, and the event log didn't specify it either. Now, since his solution was to adjust the system swap location, it begs the question: is the package extraction location the system swap location?

Since the host settings seem to be only specified with the alternative option checkboxes as:

Can use host cache
Can use datastore specified by host for swap files

It's still not fully clear to me where the swap is actually located with these assumed default settings, or whether extraction of the image actually uses swap, or why the same image already on the ESXi host is being re-applied when you upgrade vCenter.

Resolution

So many questions, so few answers. So, unfortunately, I'm going to go on a bit of a whim and simply try exactly what I did before: clear the file from /tmp that was taking up a lot of its space and install the HPE patch for the known bug, in hopes it resolves the issue….

Sure enough, the exact same thing happened as in my initial post; it just seems /tmp wasn't completely full yet, so the symptoms were a bit different.

  1. vMotion all VMs to another host in the cluster (amazingly, vMotion works without issue)
  2. Ignore the HA warning on the VMs migrated
  3. Place Host into Maintenance mode (This clears the HA warnings on the VMs and cluster)
  4. Verify /tmp has room. Update any ESXi packages from the hardware vendor if applicable.
  5. Reboot the host.
  6. Exit Maintenance mode.

Hope this helps someone who might see the same type of error events in their ESXi event logs.

Clear vCenter Alert Certificate Status

Story

So lately I updated a couple of vCenter servers, and in the process I hit a couple of errors that required some resolving…

  1. Expired Certs on Source vCenter
  2. Error [500] Auth Provider, due to something, potentially bad certs.
  3. An HPE Bug, filling up ramdisk, causing HA config issues.
  4. Change in security process; preventing login.

The Problem

So a couple hiccups along the way. And now it’s time to resolve this one…

Yeahhhh, an alert on certificates… Seems like VMware and certificate management are like oil and water: they don't mix well.

I've had some terrible times managing certificates with VMware. However, as blogged about here, it seems there's finally a way to use your own certificates via the WebUI.

Anyway… to the point. You'd figure you simply navigate to the vCenter WebUI -> Home -> Administration -> Certificates, only to realize there's nothing reporting as invalid or expired.

Checking for Expired Certs

What gives? Ahhh yes, more hidden secret stuff that is not in your face when it comes to the WebUI. Can you guess? That's right, another VMware KB.

So while the other issues I've mentioned do have references and scripts relating to certs, the only "check" in those previous posts was using openssl on the VCSA shell to grab the certificate from the listening service on its dedicated port, which was based on a particular symptom that spurred that check. So here's the KB telling you how to actually check the certificates, the easiest way I've found so far (no check.py Python script needed):

for store in $(/usr/lib/vmware-vmafd/bin/vecs-cli store list | grep -v TRUSTED_ROOT_CRLS); do echo "[*] Store :" $store; /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store $store --text | grep -ie "Alias" -ie "Not After";done;

That's it! :D…. which, just like the KB said, indicated which cert was bad: in this case, an old root CA that was used in previous deployments of vCenter before upgrades. So it turns out that even though you follow the required KB to get past the pre-check of expired certs, it doesn't delete the old CA cert.

There it is, the second CA cert with expiry in 2019… OK so… You'd figure it would be easy to clean this up, but remember, you couldn't even see it in the WebUI, so you best believe there is no WebUI way to do this that protects you from human error.

Removing old Expired Certs

Instead, very brilliantly, you get… yes another KB! Booo Yeah… So let’s do this!

The main thing to note about this is…

Certificates are copied back to the VECS store because the CA Certificate which is expiring is published to the VMware Directory Service (VMDIR). When the Certificate is removed from VECS, VMDIR adds the Certificate back to VECS during a sync operation. This is done in order to ensure the integrity of the TRUSTED_ROOTS Certificate store, as deletion of an incorrect Certificate from this store could cause the environment to be irreparably damaged.

OK…. All I take away from this is that certs are important, so they have a second cert store as a backup to the first cert store… that's all I can take away from this odd statement.

/usr/lib/vmware-vmafd/bin/vecs-cli entry list --store TRUSTED_ROOTS --text | less

“Find the Certificate you wish to remove and make a note of the Alias and the X509v3 Subject Key Identifier.

Note: There Could be several Certificates to remove. Any expired and not in use certificates should be removed to avoid certificate related alarms.”

Yes that is the plan…

List the trusted certs published to the VMware Directory Service using the following command (administrator@vsphere.local password required). This command is in the same location as vecs-cli:

/usr/lib/vmware-vmafd/bin/dir-cli trustedcert list

Huh… in this case it looks like it is not here, so I should be safe to delete it from the normal store and it shouldn’t auto populate back in.

If you do see it (CN equal to the X509v3 Key Identifier), then follow the linked KB to remove it, which seems to save a copy of the cert and then use that saved copy to run another command to remove it from the store… super weird.

/usr/lib/vmware-vmafd/bin/vecs-cli entry delete --store TRUSTED_ROOTS --alias 3276134ad93b3688b5dc5dcfaa402e9bfd7af12f

Restart all services on the PSCs and on the vCenter Servers and ensure that all services start and respond normally and that you can log in and manage the environment.

service-control --stop --all
service-control --start --all

Took a little while; then, logging in… the alert was still there. I guess I just have to Reset to Green?

For now I clicked the Reset to Green link. Even after yet another vCenter patch, it did not show up anymore. Yay.

vSphere HA Agent cannot be correctly installed or configured

I updated a vCenter server to 7.0.x. When logging into the newly updated vCenter, one host in the cluster stated the following alert.

Error: “vSphere HA agent cannot be correctly installed or configured” (2056299) (vmware.com)

The KB didn't sound promising. Checking the host's event logs: a bunch of errors about the /tmp ramdisk being full…

The ramdisk ‘tmp’ is full – VMware ESXi on HPE ProLiant – Davoud Teimouri – Virtualization and Data Center

For real? Wow, not gettin' lucky these last couple of weeks. Sure enough, exact same issue; I cleared /tmp temporarily and downloaded the patch. When I vMotioned the VMs from this host onto another host, the VMs themselves showed alerts.

Virtual machine failed to become vSphere HA Protected and HA may not attempt to restart it after a failure.

I kept chugging along in hopes I'd resolve each VM later. However, as soon as I placed the problem host into maintenance mode, the alerts on all the VMs disappeared. I applied the patch exactly as the HPE KB stated for the ESXi version it was on.

With luck on my side, the host came up clean, came out of maintenance mode without an issue, and all errors and alerts were resolved. Woooo!

Hope this helps anyone doing a vCenter upgrade to 7.x

Fixing vCenter [500] An error occurred while fetching identity providers.

Story

So the other day I posted about upgrading vCenter to 7.0.x, and everything went fine during the upgrade. For some odd reason, a couple of days later when I went to navigate to the vCenter login page I was greeted with:

[500] An error occurred while fetching identity providers.

Kind of wish I had read this Reddit post right off the hop, because the first reply is going to be my answer at the end of this post.

I did, however, first hit this KB about it. I was a bit thrown off, as it indicated to only do it if you see the following in the logs:

(/var/log/vmware/trustmanagement/trustmanagement-svcs.log)

2021-03-10T09:27:03.474Z [tomcat-exec-14  INFO  com.vmware.identity.token.impl.X509TrustChainKeySelector  opId=] Failed to find trusted path to signing certificate <STS Certificate Subject, example - C=US,CN=ssoserverSign\,dc\=vsphere\,dc\=local>
java.security.cert.CertPathBuilderException: Unable to find certificate chain.

Which I could not see, so I wasn’t sure if this was the issue or not. What I did see in my logs was the following:

2021-09-17T23:58:03.945Z [tomcat-exec-14 WARN com.vmware.vcenter.trustmanagement.impl.VcIdentityProviders opId=] com.vmware.sso.interop.ldap.NoSuchObjectLdapException: No such object
LDAP error [code: 32]

and

2021-09-18T01:19:01.322Z [tomcat-exec-26 INFO com.vmware.vapi.security.AuthenticationFilter opId=] Not successful authentication
java.lang.RuntimeException: Authentication data not found
Caused by: com.vmware.vapi.dsig.json.SignatureException: Cannot verify the signature over the provided data

So it wasn't matching. Looking at my firewall, I couldn't see any LDAP connections from vCenter to my LDAP server since the upgrade. So I decided instead to try a reboot. This simply made things worse.

No Healthy Upstream

Now when I'd try to access the vCenter Web UI I was greeted with a blank white web page with simple text stating "No Healthy Upstream". Looking into this, people hit this problem for several different reasons, as mentioned here and here, and for some odd reason this guy just changed his IP address?! Weird.

For me, I checked the local hosts file and it was fine, and I tried a couple of the other mentioned fixes, and none of them worked for me.

Try Anyway

For some reason, at this point I decided to double down and try the workaround mentioned in the initial VMware KB I found, as the main login symptom was exactly the same even though I couldn't validate the same log entries within the logs.

How to Copy Files to VCSA via WinSCP

Now, a couple of real quick things to note here. You need to copy a script to the VCSA. If you get "unable to agree on a cipher suite", you'll need to update your copy of WinSCP to a newer version. Also, instead of doing what VMware says and changing the shell on the VCSA, do what this guy suggests instead:

“In the new connection dialog, specify the Host name, User name and then click the Advanced button,

(VCSA 6.5)

Choose the Environment/SFTP option

Specify for SFTP server: shell /usr/libexec/sftp-server”

so much easier.

I decided to take a look at the script after copying it to the VCSA, and it had this line which had me hopeful it would actually work to resolve my issue:

/opt/likewise/bin/ldapmodify -x -h localhost -p 389 -D "cn=administrator,cn=users,$DOMAINCN" -w "$DOMAINPASSWORD" -f sso-sts.ldif | tee -a $LOGFILE

So I followed along with the workaround specified in the KB…

1) Download the attached fixsts.sh script from this article and upload to the impacted PSC or vCenter Server with Embedded PSC to the /tmp folder.

2) If the connection to upload to the vCenter by the SCP client is rejected, run this from an SSH session to the vCenter:

chsh -s /bin/bash

3) Connect to the PSC or vCenter Server with an SSH session if you have not already per Step 2.

4) Navigate to the /tmp directory:

cd /tmp

5) Run chmod +x fixsts.sh to make the file executable.

chmod +x ./fixsts.sh

6) Run ./fixsts.sh.

./fixsts.sh

Restart services on all vCenters and/or PSCs in your SSO domain by using below commands:

service-control --stop --all
service-control --start --all

my results:

To my amazement it actually worked, and I was able to log into the vCenter server!! Wooo!

*Update* Here’s a great blog post covering managing or creating custom certificates with vCenter 7

Kinda funny that 7.0 is stated as 6.8 in the scripts.. mhmm

Upgrade and Migrate a vCenter Server

Intro

Hello everyone! Today I'll be doing a test in my home lab where I will be upgrading, not to be confused with updating, a vCenter server. If you are interested in staying on the version your vCenter is currently on and just patching to the latest build, see my other blog post: VMware vCenter Updates using VAMI – Zewwy's Info Tech Talks

Before I get into it, there are a couple of things expected from you:

  1. An existing instance of vCenter deployed (for me yup, 6.7)
  2. A backup of the config or whole server via a backup product
  3. A Copy of the latest vCenter ISO (either from VMware directly or for me from VMUG)

Side Story

*Interesting side note:* the VM creation date property is only a thing since vCenter 6.7. Before that it was in the events table, which gets rotated out by retention policies. 🙂

*Side note 2:* I was doing some vMotions of VMs to prepare for rebooting a storage device hosting some datastores before the vCenter update, and oddly, even though the task hadn't completed it would disappear from the recent tasks view. Clicking All Tasks showed the task in progress but at 0%, so no indication of the progress. The only trick that worked for me was to log off and back in.

A quick little side story: it had been a while since I had logged into VMUG for anything, and I have to admit the site is unbelievably badly designed. It's so unintuitive I had to Google, again, how to get the ISOs I need from VMUG.

Also, for some reason, I don't know why, when I went to log in it stated my username and password were wrong. Considering I use a password manager, I was very confident it was something wrong on their end. Attempting a password reset sent no email to my email address.

Distraught, I decided to make another account with the same email, which, oddly enough, when created brought me right back to my old account on first login. Super weird. According to Reddit I was not the only one to experience oddities with the VMUG site.

Also, on the note of VMware certification, I totally forgot you have to take one of the mandatory classes before you can challenge or take any of the VMware exams.

“Without the mandatory training? Yes, they represent a reasonable value proposition. With mandatory training? No, they do not. Requiring someone who’s been using your products for a decade to attend a class which covers how to spell ESXi is patronizing if not downright condescending. I only carry VMware certifications because I was able to attain them without going through the nonsense mandatory training.”

“The exam might as well cost $3500 and “include” the class for “free”.”

I don't fully agree with that last one, because you can take any one class (AFAIK) and take all the exams. I get the annoyance of the barrier to entry; gotta keep the poor out. 😛

Simple Summary about VMUG.

  1. Create account and Sign up for Advantage from the main site.
  2. Download Files from their dedicated Repo Site.

Final gripes about VMUG:

  1. You can’t get Offline Bundles to create custom ESXi images.
  2. You can’t seem to get older versions of the software from there.
  3. The community response is poor.
  4. The site is unintuitive and buggy.

So, now that we finally have the vCenter 7 ISO…

For a more technical coverage of updating vCenter see VMware’s guide.

For shits… moving ESXi hosts and vCenter to a new subnet.

1) Build the subnet, firewall rules, and VLANs
2) Configure all hosts with a new VMPG for the new VLAN
3) Move each host one at a time to the new subnet; ensure again that the network will be allowed to reach the vCenter server after migration
4) You can't change the management VMK to use the new VLAN from the vCenter GUI; you have to do it at the host level (a rough PowerCLI sketch of this remove/re-add dance follows below).
i) Place the host into maintenance mode, remove it from inventory (if the host was added by IP; otherwise just disconnect)
ii) Update the host's IP address via the host's console, and update DNS records
iii) Re-add the host to the cluster via its new DNS hostname
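As referenced in step 4, a rough PowerCLI sketch of the remove/re-add dance (host, cluster, and credentials are placeholders; the IP/VLAN change itself still happens at the host console):

# Placeholder names/credentials; run with PowerCLI connected to vCenter
Get-VMHost -Name "192.168.0.21" | Set-VMHost -State Maintenance
Get-VMHost -Name "192.168.0.21" | Remove-VMHost -Confirm:$false
# ...change the management VMK IP/VLAN at the host console and update DNS, then re-add by FQDN:
Add-VMHost -Name "esxi01.zewwy.ca" -Location (Get-Cluster "Lab") -Credential (Get-Credential) -Force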

Changing vCenter Server IP address

Source: How to change vCenter and vSphere IP Address ( embedded PSC ) – Virtualblog.nl

I changed the IP address in the VAMI, and it even changed the vpxa config's serverIp address to the new IP automatically. It worked. :O

Upgrading vCenter

Using the vCenter ISO

The ISO is not a bootable one, so I mount it on a Windows machine that has access to the vCenter server.

Run the installer exe file…

Click Upgrade

I didn't enter the source ESXi host IP… let's see.

Nope, it wants all the info; fill in all fields, including the source ESXi host info.

Yes.

Target ESXi Host for new VCSA deployment. Next

Target VCSA VM info. Next

Would you like, large or eXtra large?

Pick the VM's datastore location, next.

VM temporary network info; again, ensure network connections are open between subnets if working with segregated networks.

Ready to deploy.

Deploying the VM to the target ESXi host. Once this was done I got a message to move on to Stage 2, which can be done later; I clicked next.

Note right here: when you get a prompt for entering the root password, I found it to be the target root password, not actually the source.

Second Note: Resolving the Certs Expired Pre-Check

While working on a client upgrade, this was more in my face: when doing the source server pre-checks it would not continue, stating the certificates were expired.

I was wondering how to check existing certs, and while this KB states you can check them via the WebUI, there could be a couple of issues.

1) You might not even be able to log into the WebUI, as mentioned in this blog, a bit of a catch-22. (Note: the same goes for SSO domains; they can't be managed by VAMI, so if there's an AD issue with a source, you often get a 503 service error attempting to log on to the WebUI.)

2) It might not even show up in that area of the WebUI.

In these cases I managed to find this blog post… which, shockingly enough, is by the very guy who wrote the fixsts script used to fix my problem in this very blog post :O

Checking Certs via the CLI

Grab Script from This VMware KB

Download the checksts.py script attached to the above KB article.
Upload the attached script to the VCSA or external PSC.

For example, /tmp

Once the script has been successfully uploaded to VCSA, change the directory to /tmp.

For example:

cd /tmp

Run python checksts.py.

OK dokie then, I guess this script doesn't check the required cert… so instead I followed along with this VMware KB (yes, another one).

In which case I ran the exact commands as specified in the KB, saved the certificate to a text file, and opened it up in Windows by double-clicking the .crt file.

openssl s_client -connect MGMT-IP:7444 | more
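If you'd rather check from a Windows box than pipe openssl output around, the same certificate can be pulled with a few lines of PowerShell (the hostname is a placeholder; 7444 is the port from the KB above):

# Grab the cert presented on the STS port and show its validity dates
$vc   = "vcsa.zewwy.ca"   # placeholder for your vCenter MGMT IP/FQDN
$port = 7444
$tcp  = New-Object System.Net.Sockets.TcpClient($vc, $port)
$ssl  = New-Object System.Net.Security.SslStream($tcp.GetStream(), $false, { $true })
$ssl.AuthenticateAsClient($vc)
$cert = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2($ssl.RemoteCertificate)
$cert | Select-Object Subject, NotBefore, NotAfter
$ssl.Close(); $tcp.Close()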

So now, instead of running the fixsts script, this KB states to run the following to reset this certificate to use the machine cert (self-signed with valid date stamps, at least that's what this server showed when checking them via the certificate management area in the vCenter WebUI).

For the appliance (I don't deal with the Windows Server version as it's EOL):

/usr/lib/vmware-vmafd/bin/vecs-cli entry getcert --store MACHINE_SSL_CERT --alias __MACHINE_CERT > /var/tmp/MachineSSL.crt
/usr/lib/vmware-vmafd/bin/vecs-cli entry getkey --store MACHINE_SSL_CERT --alias __MACHINE_CERT > /var/tmp/MachineSSL.key
/usr/lib/vmware-vmafd/bin/vecs-cli entry getcert --store STS_INTERNAL_SSL_CERT --alias __MACHINE_CERT > /var/tmp/sts_internal_backup.crt
/usr/lib/vmware-vmafd/bin/vecs-cli entry getkey --store STS_INTERNAL_SSL_CERT --alias __MACHINE_CERT > /var/tmp/sts_internal_backup.key
/usr/lib/vmware-vmafd/bin/vecs-cli entry delete --store STS_INTERNAL_SSL_CERT --alias __MACHINE_CERT -y
/usr/lib/vmware-vmafd/bin/vecs-cli entry create --store STS_INTERNAL_SSL_CERT --alias __MACHINE_CERT --cert /var/tmp/MachineSSL.crt --key /var/tmp/MachineSSL.key

Then:

  • service-control --stop --all
  • service-control --start --all

In my case for some odd reason I saw a bunch of these… when stopping and starting the services

2021-09-20T18:35:47.049Z Service vmware-sts-idmd does not seem to be registered with vMon. If this is unexpected please make sure your service config is a valid json. Also check vmon logs for warnings.

I was nervous at first that I may have broken it; for a while it didn't complete the startup command sequence, but after some time the WebUI was fully accessible again. Let's validate the cert with the same odd method we used above.

Which, sure enough, showed a date-valid cert that is the machine cert, self-signed.

Running the Update Wizard… Boooo Yeah!


Uhhh, ok….


Ok dokie?

I didn’t care too much about old metrics.

nope.

Let’s go!

After some time…

Nice! and it appears to have worked. 🙂

Another Side Trail

I was excited because I deployed this new VCSA off the FreeNAS datastore I wanted to bring down and reboot, but lo and behold, some new random VMs are on the datastore…

Doing some research, I found this simple explanation of them; however, it wasn't till I found this VMware article that I got the info I was more after.

Datastore selection for vCLS VMs

The datastore for vCLS VMs is automatically selected based on ranking all the datastores connected to the hosts inside the cluster. A datastore is more likely to be selected if there are hosts in the cluster with free reserved DRS slots connected to the datastore. The algorithm tries to place vCLS VMs in a shared datastore if possible before selecting a local datastore. A datastore with more free space is preferred and the algorithm tries not to place more than one vCLS VM on the same datastore. You can only change the datastore of vCLS VMs after they are deployed and powered on.

If you want to move the VMDKs for vCLS VMs to a different datastore or attach a different storage policy, you can reconfigure vCLS VMs. A warning message is displayed when you perform this operation.

You can perform a storage vMotion to migrate vCLS VMs to a different datastore. You can tag vCLS VMs or attach custom attributes if you want to group them separately from workload VMs, for instance if you have a specific meta-data strategy for all VMs that run in a datacenter.

In vSphere 7.0 U2, new anti-affinity rules are applied automatically. Every three minutes a check is performed, if multiple vCLS VMs are located on a single host they will be automatically redistributed to different hosts.

Note:When a datastore is placed in maintenance mode, if the datastore hosts vCLS VMs, you must manually apply storage vMotion to the vCLS VMs to move them to a new location or put the cluster in retreat mode. A warning message is displayed.

The enter maintenance mode task will start but cannot finish because there is 1 virtual machine residing on the datastore. You can always cancel the task in your Recent Tasks if you decide to continue.
The selected datastore might be storing vSphere Cluster Services VMs which cannot be powered off. To ensure the health of vSphere Cluster Services, these VMs have to be manually vMotioned to a different datastore within the cluster prior to taking this datastore down for maintenance. Refer to this KB article: KB 79892.

Select the checkbox Let me migrate storage for all virtual machines and continue entering maintenance mode after migration. to proceed.

Huh, the checkbox is greyed out and I can't click it.
I vMotioned them and the progress kept moving up.

VMware HA down after 6.5 patch

The Story

So the other day I tested the latest VMware patch that was released as blogged about here.

Then I ran the patch on a client's setup, which was on 6.5 instead of 6.7. I didn't think it would be much different, and in terms of steps to follow it wasn't.

The first thing to note, though, is validating the vCenter root password to ensure it isn't expired (on 6.7u1 and newer), else the updater will tell you the upgrade can't continue.

Logged into vCenter (SSH/Console) once in the shell:

passwd

To see the status of the account.

chage -l root

To set the root password to never expire (do so at your own risk, or if allowed by policies)

chage -I -1 -m 0 -M 99999 -E -1 root

Install patch update, and reboot vCenter.

All is good until…

ERROR: HA Down

So after I logged into the vCenter server, an older cluster was fine, but a newer cluster with newer hosts showed a couple errors.

For the cluster itself:

“cannot find vSphere HA master”

For the ESXi hosts

“Cannot install the vCenter Server agent service”

So off to the internet I go! I also asked people on IRC if they had come across this, and crickets. I found this blog post, and all the troubleshooting steps led to no real solution, unfortunately. It was a bit annoying that it says "it could be due to many reasons such as…" and lists them off, with a vCenter update being one of them, but then goes through common standard troubleshooting steps. Which is nice, but none of them are analytical enough to determine which root cause caused it, so as to actually resolve it instead of "throwing darts at a dart board".

Anyway, I decided to create an SR with VMware and uploaded the logs. While I kept looking for an answer, I found this VMware KB.

Which, funnily enough, has a resolution that states… "This issue is resolved in vCenter Server 6.5.x, available at VMware Downloads."

That's ironic, I just updated and that's what caused this problem, hahaha.

Anyway, my colleague noticed the "workaround"…

“To work around this issue in earlier versions, place the affected host(s) in maintenance mode and reboot them to clear the reboot request.”

I didn’t exactly check the logs and wasn’t sure if there actually was a pending reboot, but figured it was worth a shot.

The Reboot

So, vMotion all VMs off the host, no problem, put into maintenance mode, no problem, send host for reboot….

Watching the screen, still at the ESXi console login…. monitoring sensors indicate the host is inaccessible, pings are still up, and the Embedded Host Controller (EHC) is unresponsive…. ugghhhh ok…..

Pressing F2/F12 at the console: "direct management has been disabled"… like, uhhh, ok…

I found this, a command to hard reboot, but I can't SSH in and I can't access the Embedded Host Controller… so no way to enter it…

reboot -n -f

Then I found this with the same problem… the solution… like a computer in a stuck state: hard shutdown. So I pressed the power button for 10-20 seconds till the server was fully off, then powered it back on.

The Unexpected

At this point I was figuring the usual: it comes back up and shows up in vCenter. Nope, instead the server showed as disconnected in vCenter, in a downed state. I managed to log into the Embedded Host Controller, but found the VMs I had vMotioned still on it in a ghosted state. I figured this wouldn't be a problem; after reconnecting to vCenter it should pick up on the clean state of those VMs being on the other hosts.

Click reconnect host…

Error: failed to login with the vim admin password

Not gonna lie, at this point I got pretty upset. You know, HULK SMASH! type deal. However, instead of smashing my monitors, which wouldn't have been helpful, I went back to Google.

I found this VMware KB, along with this thread post, and pieced together a resolution from both. The main thing was the KB wanted to reinstall the agents, while in the thread post it seemed most people just needed the services restarted.

So I removed the host from vCenter (remove from inventory), also removed the ghosted VMs via the EHC, enabled SSH, and restarted the vpxa and hostd services.

/etc/init.d/hostd restart

/etc/init.d/vpxa restart

Then re-added the host to vCenter and to the cluster, and it worked just fine.

The Next Server

Alright, so now vMotion all the VMs to this now-rebooted host, so we can do the same thing on the other ESXi host to make sure they are all good.

I go to set the host into maintenance mode and reboot; sure enough this server hangs at the reboot just like the other host. I figured the process was going to be the same here; however, the results actually were not.

This time the host actually did reconnect to vCenter after the reboot but it was not in Maintenance mode…. wait what?

I figured that was weird and would give it another reboot. When I went to put it into maintenance mode, it got stuck at 2%… I was like, ughhhh, wat? The weird part was it even stated orphaned/ghosted VMs, so I thought maybe it had them at this point.

Googling this, I didn't find an answer, and just when I was about to hard reboot the host again (after 20 minutes), it succeeded. I was like, wat?

Then I sent a reboot, which I think took like 5 minutes to apply; all kinds of weird things were happening. While it was rebooting I disconnected the host from vCenter (not removed), waited for the reboot, then accessed this host's EHC.

It was at this point I got a bit curious about how you determine if a host needs a reboot, since vCenter didn't tell me and the EHC didn't tell me… How was I supposed to know, considering I didn't install any additional VIBs after deployment? I found this Reddit post with the same question.

Some weird answers, the best being:

vim-cmd hostsvc/hostsummary|grep -i reboot
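If you'd rather check the same flag from PowerCLI than from the ESXi shell, it's exposed on the host summary object (the host name is a placeholder):

# Returns True if the host has a pending reboot
(Get-VMHost -Name "esxi01.zewwy.ca").ExtensionData.Summary.RebootRequired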

The real thing that made me raise my brow was this convo bit:

Like Wat?!?!?! hahaha Anyway, by this time I got an answer from VMware support, and they simply asked when the error happened, and if I had a snippet of the error, and if I rebooted the vCenter server….

Like, really…. OK, don't look at the logs I provided then. So, ignoring the email for now to actually fix the problem, I looked at the logs myself for the host I was currently working on and noticed one entry which should be shown on the summary page of the host.

"Scratch location not set"… well, poop… you can see this KB; after correcting that and rebooting the server again, it seemed to be working perfectly fine.

So I removed it from the inventory, ensured no vpxuser existed on the host, restarted the services, and re-added the host.

Moment of Truth

So after ALL that! I got down on my knees, I put my head down on my chair, I locked my hands together, and I prayed to some higher power to let this work.

I proceeded to enable HA on the cluster. The process of configuring HA on both hosts lingered at 8% for a while. I took a short walk in preparation for the failure; to my amazement, it worked!

WOOOOOOOOO!!!

Summary

After this I'd almost recommend rebooting hosts before doing a vCenter update, but that's also a bit excessive. So maybe at least run the command above on your ESXi servers to ensure there's no pending reboot before initiating a vCenter update.

I hope this blog post helps anyone experiencing the same type of issue.


vCenter 503 Service Unavailable

I was going to test an auditing script from a DEF CON presenter on my AD server, when adding the USB controller and the USB stick I was passing through to get the script into my VM started being weird.

First, the USB 3.0 controller connected just fine, and I connected the USB device to the VM, but diskpart was not showing it. So I went to remove it and try a USB 2.0 controller; that failed to connect since the USB 3.0 controller was still showing there, and I selected to remove it again, which errored with another concurrent task. Makes sense, until refreshing the page told me "unprivileged account". I wasn't sure what this was about, so I decided to open another window and navigate to my vCenter web app… 503 service unavailable:

“503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http20NamedPipeServiceSpecE:0x000055aec30ef1d0] _serverNamespace = / action = Allow _pipeName =/var/run/vmware/vpxd-webserver-pipe)”

What the… rebooting the VCSA showed no success, still the same error even with an incognito window… ughh.

I found this thread: https://communities.vmware.com/thread/588755

I was going through this and decided to try to renew the certs, even though my internal PKI certs were still valid (AFAIK, and checking the cert provided when accessing the page). Now here's the thing: while I ran the certificate-manager script and renewed all the certs, I noticed my AD server somehow was down. I booted it back up. I'm not exactly sure which fixed it. So I decided to take another snapshot while it was in this "fixed state" and revert to the broken state. After restoring to the broken state nothing was responding at all on the HTTPS service from the VCSA, so I gave it a simple reboot (which I did initially, before I noticed my AD server was down for some reason). Sure enough, after the reboot everything was working fine with my internal PKI certs.

I guess if you set vCenter to use MS AD as the primary login domain and that domain is not available, the web management service becomes unavailable… that kind of sucks. I should have noticed my AD was not operational, but I didn't have monitoring on it 😉 or use my local workstation as an AD member; mostly I just have random VMs for testing.

Like most people, I should have looked at the logs for a better idea of what the root cause was. I threw two darts at a dartboard and had to revert to find the true root cause. Not the best way to troubleshoot, but sometimes, if logs are not available, it is another method…