Remove Orphaned Datastore in vCenter Again

Story

I did this once before, but that time it was due to rebuilding an ESXi host and not removing the old datastore. This time, however, it’s due to the storage server failing.

The Precursor

I did NOT want to write this blog post. This post comes from the fact that VMware is not perfect and I’m here to air some dirty laundry…. Let’s get started.

What Happened?

The short answer is I don’t fully know. All I know is that the backend storage server (FreeNAS 11.1u7) running iSCSI started showing weird signs of problems (Reporting Graphs not rendering). Since I wanted to possibly do some Frankenstein surgery on the unit (iOmega px12-350r), I started to vMotion the primary VMs I needed onto local ESXi storage.

Even though I checked the logs, I couldn’t determine what was causing all the services to not start. Trying to manually start it just showed gibberish in the system log.

The Problem

Since I couldn’t get it back up, the datastores show as inaccessible in vCenter:

Attempting to unmount them results in an error:

Not sure what that means. I even put the host in maintenance mode and it gives the same error. Attempting to remove the iSCSI configuration that hosts those datastores also errors out with:

Strange. How can there be active sessions when the target is literally dead?

I tried following my old blog post on a similar case, but I was only able to unmount the datastores via esxcli; the Web GUI would still show them…

esxcli storage filesystem list
esxcli storage filesystem unmount -u UID

Any attempt to set them as offline failed, as their status was already dead anyway…

As you can see, no difference:

Solutions?

I looked for solutions and found one post of a similar nature here:

How to remove unmounted/inaccessible datastore from ESXi Host (tomaskalabis.com)

When I attempted to run the command,

esxcli storage core device detached remove -d naa.ID

it sadly failed for me:

I was at a dead end… I could see the dead devices with no files or I/O bound to them, but I couldn’t seem to remove them… they show as detached…

esxcli storage core device detached list

As a last-ditch effort, I rescanned one last time and then ran the command to check for devices.

esxcli storage core adapter rescan --all
esxcli storage core device list

Checking the Web GUI, I could see the datastores were gone but the iSCSI config was still there, and attempting to remove it would result in the same error as above. Then I realized there were still static target records defined. Once I deleted them, everything was finally clean on the host.

Do It Again!

Since this seems to be a per-host thing, let’s see if we can fix it without maintenance mode or moving VMs. Test host with the broken datastores, check:

Turns out it’s even easier… just remove the static iSCSI targets, remove the dynamic target, then rescan storage and adapters:
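For anyone who prefers the shell to the Web GUI, the same three steps can be sketched with esxcli. This is only a sketch under assumptions, not the method used above (that was the GUI): the adapter name vmhba64, the portal address and the IQN are placeholders, and the helper just builds the command lines so they can be reviewed before being pasted into an SSH session.

```python
def build_iscsi_cleanup_commands(adapter, portal, static_iqns):
    """Return esxcli command lines mirroring the GUI steps, in order.

    adapter, portal and static_iqns are placeholders to fill in from
    your own host. Nothing is executed here.
    """
    commands = []
    # 1) Remove every static target that still points at the dead portal.
    for iqn in static_iqns:
        commands.append(
            "esxcli iscsi adapter discovery statictarget remove "
            f"--adapter={adapter} --address={portal} --name={iqn}"
        )
    # 2) Remove the dynamic (send targets) discovery entry.
    commands.append(
        "esxcli iscsi adapter discovery sendtarget remove "
        f"--adapter={adapter} --address={portal}"
    )
    # 3) Rescan all adapters so the dead devices can drop off.
    commands.append("esxcli storage core adapter rescan --all")
    return commands


# Hypothetical values, for illustration only.
for line in build_iscsi_cleanup_commands(
    "vmhba64", "192.0.2.10:3260", ["iqn.2005-10.org.example:target0"]
):
    print(line)
```

Removing the static entries first matters here, since in the story above the leftover static records were what kept the iSCSI config from being removed.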

I guess sometimes you just overthink things and get led down rabbit holes when a simple solution already exists. I followed these simple steps on the final host and, oddly, one datastore lingered:

Well let’s enable SSH and see what’s going on here…

esxcli storage filesystem list
esxcli storage filesystem unmount -u 643e34da-56b15cb2-0373-288023d8f36f

esxcli storage core device list
esxcli storage core device set -d naa.6589cfc0000005e95e5e4104f101a307 --state=off

“Unable to set device’s status. Error was: Unable to change device state, the device is marked as ‘busy’ by the VMkernel.: Busy”

Mhmmm, different than last time, which might explain why it wasn’t auto removed.

esxcli storage core device world list -d naa.6589cfc0000005e95e5e4104f101a307

It’s hostd-worker, and nothing shows up when I run the command to list VM processes, which makes me think it’s the old scratch/core dump location…

I’m not sure what restarting hostd does, so I’ll move critical VMs off just to be safe and then test restarting that service to see if it releases its stranglehold…

/etc/init.d/hostd restart

After this, the host did show as disconnected from vCenter for a short while, then came back, and the old datastore was gone.

Although the datastore was gone.. the disk remained, and I couldn’t get rid of it.

I don’t get it… do I have to reboot this host….

ughh reboot worked… what a pain though.

If you want to know which datastore/UUID is linked to which disk, run:

esxcli storage vmfs extent list
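If you save that output to a file, a few lines of Python can turn it into a UUID-to-disk lookup. This is a rough sketch: it assumes the usual five-column table (Volume Name, VMFS UUID, Extent Number, Device Name, Partition), and the sample row below is made up for illustration, not taken from my hosts.

```python
# Made-up sample in the assumed column layout of the command's output.
SAMPLE_OUTPUT = """\
Volume Name  VMFS UUID                            Extent Number  Device Name   Partition
-----------  -----------------------------------  -------------  ------------  ---------
Lab Store 1  00000000-00000000-0000-000000000001              0  naa.example1          1
"""


def parse_vmfs_extents(output):
    """Map VMFS UUID -> (volume name, device name) from the table text."""
    mapping = {}
    for line in output.splitlines():
        stripped = line.strip()
        # Skip blank lines, the header row and the dashed separator row.
        if not stripped or stripped.startswith("Volume Name"):
            continue
        if set(stripped) <= {"-", " "}:
            continue
        # Volume names may contain spaces, so peel the four trailing
        # columns off from the right instead of splitting from the left.
        parts = stripped.rsplit(None, 4)
        if len(parts) != 5:
            continue
        name, uuid, _extent, device, _partition = parts
        mapping[uuid] = (name.strip(), device)
    return mapping


print(parse_vmfs_extents(SAMPLE_OUTPUT))
```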

Now for G9-SSD2: I tried to remove it since it showed signs of being on the way out, and I couldn’t… seems like an ongoing story here. I could only unmount it from the CLI.

Weird. I deleted G9-SSD3 normally, then I detached the disk containing G9-SSD2. Then, when I recreated G9-SSD3, G9-SSD2 just disappeared. The drive still shows as unconsumed and detached.

Now I have to go rebuild my shared storage server…

VMware Patches May 2024

Yup this shit never ends:

VMSA-2024-0011: VMware ESXi, Workstation, Fusion and vCenter Server updates address multiple security vulnerabilities

Patching vCenter

Log in to VAMI, let’s see what I’m on:

Here’s the fix Matrix:

Can you tell if I’m good? No, because the Matrix uses a different version coding (7.0 u3q) than the version shown in VAMI (7.0.3.01700). You can either look it up by googling the version (which I did, and it’s 7.0 u3o), or click the link in the KB and check the build number.

VMware: constructive criticism.. make the Matrix have the same versioning syntax as VAMI so it’s easy to know, and verify.
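Until the two notations line up, the comparison can be scripted. A minimal sketch, assuming update letters sort alphabetically within the same update number; the lookup table holds only the one pair confirmed above (7.0.3.01700 is 7.0 U3o), and anything else would have to be added from the KB build table. The function names are made up for this example.

```python
import re

# Only the pair confirmed in this post; extend from the KB build table.
VAMI_TO_UPDATE = {"7.0.3.01700": "7.0 U3o"}


def update_key(label):
    """Turn a label such as '7.0 U3q' into a sortable tuple (7, 0, 3, 'q')."""
    match = re.fullmatch(r"(\d+)\.(\d+)\s*[Uu](\d+)([a-z]?)", label.strip())
    if match is None:
        raise ValueError(f"unrecognised update label: {label!r}")
    major, minor, update, letter = match.groups()
    return (int(major), int(minor), int(update), letter)


def needs_patch(vami_version, fixed_label):
    """True when the running version sorts before the fixed version."""
    current = VAMI_TO_UPDATE[vami_version]
    return update_key(current) < update_key(fixed_label)


print(needs_patch("7.0.3.01700", "7.0 u3q"))
```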

Anyway, in VAMI click Update. There it is….

Accept the EULA, Pass pre-update checks, Installing…

It’s chugging along…

At this point the regular vCenter web interface was unresponsive, and I had to use the host that was running the VCSA to get the CPU usage. However, as you can see, VAMI appears to be up and showing status just fine.

45 Minutes later…

alright… 1% woo, woo, woo! Why does this seem oddly familiar…. mhmm anyway. After about an hour…

Re-log into VAMI.

Looks good, going to the main mgmt page… mhmm shows 404, but by the time I wanted to get a snip, it refreshed to show the FBA page, so I logged in like normal.

Yay it worked.

Patching ESXi

In vCenter, go to the host, pick updates, then baseline, and check compliance.

On the two baselines, select them and pick remediate.

The server went into maintenance mode and was done after about 20 minutes. (I think it rebooted, but I didn’t have an active ping on it, so I’m not sure. I’ll check on the next one.)

My PA-ESXi is a special beast, it for some reason needs a helping hand during boot, so we’ll know if it reboots this time…

yup… it rebooted.

Fun times had by all.

Configuring shared LVM over iSCSI on Proxmox

So, I’ve been recently playing with Proxmox for virtualization. It’s pretty nice, but in my cluster (which consisted of two old laptops) whenever I would migrate VMs or containers it would have to migrate the storage over the network as well. Since they are just old laptops, everything connects together at 1 Gbps to switches with the same rated ports.

I’m used to iSCSI so I checked the Proxmox storage guidance to see what I could use.

I was interested in ZFS over iSCSI. However, I temporarily gave up on this cause for some reason… you have to allow root access to the FreeNAS box over SSH, on the same network that the iSCSI is for….

“First of all we need to setup SSH keys to the freenas box, the SSH connection needs to be on the same subnet as the iSCSI Portal, so if you are like me and have a separate VLAN and subnet for iSCSI the SSH connection needs to be established to the iSCSI Portal IP and not to the LAN/Management IP on the FreeNAS box.
The SSH connection is only used to list the ZFS pools”

Also mentioned in this guide.

This was further verified when I attempted to set up ZFS on an iSCSI disk; I got this error message:

I didn’t want to configure my NAS to have root access over SSH on the iSCSI network, but I was still curious what the point of iSCSI was for PVE if you can’t use a shared drive… Reviewing the chart above, and this comment: “i guess the best way to do it, is to create a iscsi storage via the gui and then an lvm storage also via the gui (if you want to use lvm to manage the disks) or directly use the luns (they have to be managed on the storage server side)”

I ended up using LVM on the disk “3: It is possible to use LVM on top of an iSCSI or FC-based storage. That way you get a shared LVM storage”

However, using this model you can’t use snapshots. 🙁
You can use LVM-Thin but that’s not shared.

Step 1) Setup Storage Server

In my case I’m using a FreeNAS server, with spare drive ports, so for this test I took a 2TB drive (3.5″), plugged it in and wiped it from the web UI.

After this I configured a new extent as a raw device share.

Created the associated targets and portals. Once this was done (since I had dynamic discovery on my ESXi hosts) they discovered the disk. I left them be, but probably best to have separate networks…. but I’ll admit… I was lazy.

Step 2) Configure PVE hosts

In my case I had to add the iSCSI network (VLAN tagged) on to my hosts. This is easy enough: Host -> System -> Network -> Create Linux VLAN

OK, so where in ESXi you simply add an iSCSI adapter, in PVE you have to install it first? Sure, OK, let’s do that… Turns out it was already installed.
After reading that and seeing what my ESXi did, I managed to edit my /etc/pve/storage.cfg and added:

iscsi: freenas
        portal 172.16.69.2
        target iqn.2005-10.org.freenass.ctl:proxhdd
        content none

To my surprise… it showed as a storage unit on both my PVE hosts. :O

mhmm doing a df -h, I don’t see anything… but doing a fdisk -l sure enough I see the drive.. so cool 🙂
So now that I got both hosts to see the same disk, I guess it simply comes down to creating a file system on the raw disk.
Or not… when I try to create a ZFS using the WebUI it just says no disks are available.

Step 3) Setup LVM

However, adding an LVM works:

After setting up LVM the data source should show up on all nodes in the cluster that have access to the disk. On one of my nodes it wasn’t showing as accessible until I rebooted the node that had no problems accessing it. ¯\_(ツ)_/¯

So, there’s no option to pick storage when migrating a VM, you have to go into the VM’s hardware settings and “move the disk”.

When I went to do my first live VM migration, I got an error:

I soon realized this was just my mistake of not having selected “delete source”: “moving the disk” had actually converted the disk from qcow2 to raw and didn’t delete the old qcow2 file. So I simply deleted it, then tried again…

and it worked! Now the only problem is no snapshots. I attempted to create an LVM-Thin on top of the LVM, and it did create it, but as noted in the chart both my hosts could not access it at the same time, so not shared.

Guess I’ll have to see how Ceph works. That’ll be a post for another day. Cheers.

*Update* I’ll have to implement a filter on FreeNAS cause Proxmox I guess won’t implement a fix that was given to them for free.

https://forum.proxmox.com/threads/iscsi-reconnecting-every-10-seconds-to-freenas-solution.21205/#post-163412

https://bugzilla.proxmox.com/show_bug.cgi?id=957

Delete Root Certificate from vCenter

In my last two posts, we renewed the Root Certificate on the VCSA.

We then renewed the STS certificate.

But we were left with the old Root certificate on the VCSA. How do we remove it?

You can use the Certificate Management vCenter Trusted Root Chains interface to add, delete and read trusted root certificate chains. This use case demonstrates how to delete a root certificate or certificate chain from the trusted root store of your vCenter Server system.

Deleting certificates is not available through the vSphere Client and you can only do this by using the vSphere Automation API or the CLI tools.

Caution:
Deleting a root certificate or certificate chain that is in use might cause breakage of your systems. Proceed to delete a root certificate only if you are sure it is not in use by your vCenter Server or any connected systems.

The above link may have a good warning, but the steps in it are useless and didn’t work for me, possibly because I didn’t have the “vSphere Automation API server” or something. I’m not sure; putting the GET into a browser simply prompted for creds and didn’t accept them.

So, you can also use PowerCLI or vecs-cli; let’s try the latter.

1) List the certificates using vecs-cli.

/usr/lib/vmware-vmafd/bin/vecs-cli entry list --store TRUSTED_ROOTS --text | less

2) Find the Certificate you wish to remove and make a note of the Alias and the X509v3 Subject Key Identifier.

My case:
Alias : 9eadf42a18387ee983d3dfa4f607eee91a3e5b67
X509v3 Subject Key Identifier: 0B:62:2D:98:7B:28:34:2A:14:81:CD:34:AC:46:40:06:80:DA:84:3E

3) List the trusted certs published to the VMware Directory Service using the following command (administrator@vsphere.local password required). This command is in the same location as vecs-cli:
Windows:
C:\Program Files\VMware\vCenter Server\vmafdd>dir-cli trustedcert list

Linux (VCSA):
/usr/lib/vmware-vmafd/bin/dir-cli trustedcert list

This will output a list of Certificates published to VMDIR. It will look similar to the following output:

4) Locate the Certificate’s CN (thumbprint) which matches the Key Identifier from Step 2 above. In this example, the Certificate will be the first one in the list with the following CN:

0B622D987B28342A1481CD34AC46400680DA843E

5) Using the ID located in Step 4, run the following command, substituting your own ID:

/usr/lib/vmware-vmafd/bin/dir-cli trustedcert get --id 0B622D987B28342A1481CD34AC46400680DA843E --login administrator@vsphere.local --outcert /tmp/oldcert.cer

6) Un-publish the CA Certificate from VMDIR by running the following command:

/usr/lib/vmware-vmafd/bin/dir-cli trustedcert unpublish --cert /tmp/oldcert.cer

7) Delete the Certificate from VECS utilizing the Alias located in Step 2 by running the following command:

/usr/lib/vmware-vmafd/bin/vecs-cli entry delete --store TRUSTED_ROOTS --alias 9eadf42a18387ee983d3dfa4f607eee91a3e5b67

8) Confirm that the Certificate was deleted by running the following command:

/usr/lib/vmware-vmafd/bin/vecs-cli entry list --store TRUSTED_ROOTS --text | grep Alias

9) Force a refresh of VECS by running the following command. This will ensure updates are pushed to the other PSCs in the environment if there is more than one.

/usr/lib/vmware-vmafd/bin/vecs-cli force-refresh

10) Restart all services on the PSCs and on the vCenter Servers and ensure that all services start and respond normally and that you can log in and manage the environment. (aka giver a reboot)
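The matching in Step 4 is just the Subject Key Identifier from Step 2 with the colons stripped out. A tiny sketch of that conversion, using the values from my case (the function name is made up for this example):

```python
def ski_to_vmdir_cn(subject_key_identifier):
    """Convert 'AA:BB:...' from the vecs-cli text output into the CN form."""
    hex_digits = subject_key_identifier.replace(":", "").strip().upper()
    # A SHA-1 based key identifier is 20 bytes, i.e. 40 hex digits.
    if len(hex_digits) != 40 or any(c not in "0123456789ABCDEF" for c in hex_digits):
        raise ValueError("expected a 20-byte key identifier in hex")
    return hex_digits


print(ski_to_vmdir_cn("0B:62:2D:98:7B:28:34:2A:14:81:CD:34:AC:46:40:06:80:DA:84:3E"))
# -> 0B622D987B28342A1481CD34AC46400680DA843E
```

Comparing the converted string against the dir-cli list avoids eyeballing 40 hex characters before unpublishing anything.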

Logged in just fine, and certs are now clean as a whistle:

Looks like Root Certs are good for 10 Years, STS Certs are good for 10 years, machine Cert is good for 2 years.

Hope these last couple posts help someone.

Renew vCenter STS Certificate

Source: Refresh a vCenter Server STS Certificate Using the vSphere Client (vmware.com)

  1. Log in with the vSphere Client to the vCenter Server.
  2. Specify the user name and password for administrator@vsphere.local or another member of the vCenter Single Sign-On Administrators group.
    If you specified a different domain during installation, log in as administrator@mydomain.
  3. Navigate to the Certificate Management UI.
    1. From the Home menu, select Administration.
    2. Under Certificates, click Certificate Management.
  4. If the system prompts you, enter the credentials of your vCenter Server.
  5. Under STS Signing Certificate, click Actions > Refresh with vCenter certificate.

  6. Click Refresh.
    The VMCA refreshes the STS signing certificate on this vCenter Server system and on any linked vCenter Server systems.
  7. (Optional) If the Force Refresh button appears, vCenter Single Sign-On has detected a problem. Before clicking Force Refresh, consider the following potential results.
    • If all the impacted vCenter Server systems are not running at least vSphere 7.0 Update 3, they do not support the certificate refresh.
    • Selecting Force Refresh requires that you restart all vCenter Server systems and can render those systems inoperable until you do so.
    1. If you are unsure of the impact, click Cancel and research your environment.
    2. If you are sure of the impact, click Force Refresh to proceed with the refresh, then manually restart your vCenter Server systems.
I guess my setup had a problem? Or it’s still valid for a long time; I don’t know why my setup says Force Refresh, but let’s do it…
Mhmmm… k, vCenter is still working normally, and there was no forced reboot, just a message saying all systems need to be rebooted….
I navigated away and back and it shows the new cert…
reboot anyway… sign in, no issues…
But the old root still exists, can it be deleted?
Yes… Check out how on my next Blog post.

Renew Root Certificate on vCenter

I’ve always accepted the self-signed cert, but what if I wanted a green checkbox? With a cert signed by an internal PKI…. We can dream; for now I get this…

First off, I did a vCenter rename, and in that post I checked the cert, but that was just the machine cert (the Common Name noted in the snip above); it didn’t renew/replace the root certificate. If I’m going to renew the machine cert, I may as well do a new Root. I’m assuming this will also renew the STS cert, but we’ll validate that.

Source: Regenerate a New VMCA Root Certificate and Replace All Certificates (vmware.com)

Prerequisites

You must know the following information when you run vSphere Certificate Manager with this option.

Password for administrator@vsphere.local.
The FQDN of the machine for which you want to generate a new VMCA-signed certificate. All other properties default to the predefined values but can be changed.

Procedure

Log in to the vCenter Server on an embedded deployment or on a Platform Services Controller and start the vSphere Certificate Manager.
OS Command
For Linux:               /usr/lib/vmware-vmca/bin/certificate-manager
For Windows:      C:\Program Files\VMware\vCenter Server\vmcad\certificate-manager.bat
(*Is Windows still supported? I thought they dropped that a while ago…)

Select option 4, Regenerate a new VMCA Root Certificate and replace all certificates.

ok dokie… 4….

and then….

five minutes later….

Checking the Web UI shows the main sign-in page already has the new cert bound, but attempting to sign in and get the FBA page just reported back that “vmware services are starting”. The SSH session still shows 85%. I probably should have done this via direct console, as I’m not 100% sure whether it affects the SSH session. I’d imagine it wouldn’t….

10 minutes later, it still wasn’t responding. On the ESXi host I could see the VCSA’s CPU up at 100%; it stayed there the whole time and finally subsided 10 minutes later. I brought focus to my SSH session and pressed enter…

Yay and the login…. FBA page loads.. and login… Yay it works….

So even though the Root Cert was renewed, and the machine cert was renewed… the STS was not and the old Root remains on the VCSA….

So the KB title is a bit of a lie and a misnomer “Regenerate a New VMCA Root Certificate and Replace All Certificates”… Lies!!

But it did renew the CA cert and the machine cert. In my next post I’ll cover renewing the STS cert.

Migrate ESXi VM to Proxmox

I’m going to simulate migrating to Proxmox VE in my home lab.

I saw this YT video comparing the two and it gave me the urge to try it out in my home lab.

In this test I’ll take one host from my cluster and migrate it to use Proxmox.

Step one, move all VMs off the target host.
Step two, remove the host from the cluster.
Step three, shut down the host.

In this case it’s an old HP Folio laptop. Next, install PVE.

Step one, download the installer.
Step two, burn the image or flash a USB stick with the image.
Step three, boot the laptop into the PVE installer.

I didn’t have a network cable plugged in, and in my haste I didn’t pay attention to the bridge’s main physical adapter; it was selected as wlo1, the wireless adapter. I found references to the bridge info being in /etc/network/interfaces, but for some reason this only got pings to work; all other ports and services seemed completely unavailable. Much like this person, I simply did a reinstall (this time minding the physical port on network config). Then got it working.

The first issue I had was it popping up saying Error Code 100 on apt-get update.

The built-in shell feature was pretty nice; I used it to follow this and change the sources to use the no-subscription repos.

The next question was, how can I set up another IP that’s VLAN tagged?

I thought I had it when I created a “Linux VLAN”, giving it an IP within that subnet and tagging the VLAN ID. I was able to get ping replies, even from my machine in a different subnet. I couldn’t define the gateway since it stated one was already defined on the bridge, which makes sense for a single stack. I figured pings worked because ICMP is connectionless (it isn’t TCP or UDP, and has no session handshake that depends on replies taking the same path back), and this was probably why the web interface was not loading. I verified this by connecting a different machine into the same subnet and it loaded the web interface fine, further validating my assumptions.

However, when I removed the gateway from the bridge and provided the correct gateway for the VLAN subnet I defined, the web interface still wasn’t loading from my machine in the other subnet. Checking the shell in the web interface, I see it lost connectivity to anything outside its network (I guess the gateway change didn’t apply properly, or some other ignorance on my part of how Proxmox works).

I guess I’ll leave the more advanced networking for later. (I don’t get why all other hypervisors get this part so wrong/hard, when VMware makes it so easy, it’s a checkbox and you simply define the VLAN ID in, it’s not hard…) Anyway I simply reverted the gateway back to the bridge. Can figure that out later.

So how to convert a VM to run on ProxMox?

Option 1) Manually convert from VMDK to QCOW2

or

Option 2) Convert to OVF and deploy that.

In both options it seems you need a mid point to store the data. In option 1 you need to use local storage on a Linux VM, almost twice it seems once to hold the VMDK, and then enough space to also hold the QCOW2 converted file. In option 2 the OP used an external drive source to hold the converted OVF file on before using that to deploy the OVF to a ProxMox host.

I decided to try option 1. So I spun up a Linux machine on my gaming rig (since I still have Workstation and lots of RAM and a spindle drive with lots of storage). I picked Fedora Workstation and installed openssh-server, then (after a while, realizing I had to open the outbound SSH firewall rule on the ESXi server) transferred the vmdk to the Fedora VM:

106 MB/s not bad…

Then installed the tools on the fedora VM:

yum install -y qemu-img

Never mind, it was already installed, so I converted it…
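The conversion itself is a single qemu-img call. Here’s a hedged sketch that only builds the command line; the file names are placeholders, not the ones from my VM, and the helper name is made up for this example.

```python
import shlex


def qemu_img_convert_args(source_vmdk, target_qcow2):
    """Build the argv for converting a VMDK into a QCOW2 image."""
    return [
        "qemu-img", "convert",
        "-f", "vmdk",    # input format
        "-O", "qcow2",   # output format
        source_vmdk,
        target_qcow2,
    ]


# Placeholder file names, for illustration only.
args = qemu_img_convert_args("winvm.vmdk", "winvm.qcow2")
print(shlex.join(args))
```

Keep in mind the point from earlier: the scratch machine needs room for the VMDK and the converted QCOW2 at the same time.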

On Proxmox I couldn’t figure out where the VM files were located (“lvm-thin” on a default install). I found this thread and did the same steps to get a path available on the PVE host itself. Then I used scp to copy the file to the PVE server.

After copying the file to the PVE server, ran the commands to create the VM and attach the hdd.

After which I tried booting the VM and it wouldn’t catch the disk and failed to boot, then I switched the disk type from SCSI to SATA, but then the VM would boot and then blue screen, even after configuring safe mode boot. I found my answer here: Unable to get windows to boot without bluescreen | Proxmox Support Forum

“Thank you, switching the SCSI Controller to LSI 53C895A from VirtIO SCSI and the bus on the disk to IDE got it to boot”.

I also used this moment to uninstall VMware tools.

Then I had no network, and realized I needed the VirtIO drivers.

If you try to run the installer it will say needs Win 8 or higher, but as pvgoran stated “I see. I wasn’t even aware there was an installer to begin with, I just used the device manager.”

That took longer than I wanted and took a lot of data space too, so not an efficient method, but it works.

No coredump target has been configured. Host core dumps cannot be saved.

ESXi on SD Card

Ohhh ESXi on SD cards, it got a little controversial but we managed to keep you. Doing the latest install I was greeted with the nice warning “No coredump target has been configured. Host core dumps cannot be saved.”

What does this mean, you might ask? Well, in short, if there ever was a problem with the host, the diagnostic data (the core dump) needed to determine what happened wouldn’t be available. So it’s a pick-your-poison kinda deal.

Store logs and possibly burn out the SD/USB drive storage, which isn’t good at that sort of thing, or point it somewhere else. Here’s a nice post covering the same problem and the comments are interesting.

Dan states “Interesting solution as I too faced this issue. I didn’t know that saving coredump files to an iSCSI disk is not supported. Can you please provide your source for this information. I didn’t want to send that many writes to an SD card as they have a limited number (all be it a very large number) of read/writes before failure. I set the advanced system setting, Syslog.global.logDir to point to an iSCSI mounted volume. This solution has been working for me for going on 6 years now. Thanks for the article.”

with the OP responding “Hi Dan, you can definately point it to an iscsi target however it is not supported. Please check this KB article: https://kb.vmware.com/s/article/2004299 a quarter of the way down you will see ‘Note: Configuring a remote device using the ESXi host software iSCSI initiator is not supported.’”

Options

Option 1 – Allow Core Dumps on USB

Much like the source I mentioned above: VMware ESXi 7 No Coredump Target Has Been Configured. (sysadmintutorials.com)

Edit the boot options to allow Core Dumps to be saved on USB/SD devices.

Option 2 – Set Syslog.global.logDir

You may have some other local storage available; in that case set the variable above to that local or shared storage (shared storage being “unsupported”).

Option 3 – Configure Network Coredump

As mentioned by Thor – “Apparently the “supported” method is to configure a network coredump target instead rather than the unsupported iSCSI/NFS method: https://kb.vmware.com/s/article/74537”

Option 4 – Disable the notification.

As stated by Clay – “

The environment that does not have Core Dump Configured will receive an Alarm as “Configuration Issues :- No Coredump Target has been Configured Host Core Dumps Cannot be Saved Error”.
In the scenarios where the Core Dump partition is not configured and is not needed in the specific environment, you can suppress the Informational Alarm message, following the below steps,

Select the ESXi Host >

Click Configuration > Advanced Settings

Search for UserVars.SuppressCoredumpWarning

Then locate the string and and enter 1 as the value

The changes takes effect immediately and will suppress the alarm message.

To extract contents from the VMKcore diagnostic partition after a purple screen error, see Collecting diagnostic information from an ESX or ESXi host that experiences a purple diagnostic screen (1004128).”

Summary

In my case it’s a home lab, I wasn’t too concerned so I followed Option 4, then simply disabled file core dumps following the second steps in Permanently disable ESXi coredump file (vmware.com)

Note* Option 2 was still required to get rid of another message: System logs are stored on non-persistent storage (2032823) (vmware.com)
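Both Option 2 and Option 4 can also be set from an SSH session instead of the UI. A minimal sketch that only builds the esxcli lines; the log directory path is a placeholder for whatever datastore you pick, and this is the CLI equivalent as I understand it, not the exact clicks described above.

```python
def build_logging_commands(log_dir):
    """Return esxcli lines for Option 2 (log dir) and Option 4 (suppress alarm).

    log_dir is a placeholder such as a folder on a persistent datastore.
    Nothing is executed here.
    """
    return [
        # Option 2: point Syslog.global.logDir at persistent storage.
        f"esxcli system syslog config set --logdir={log_dir}",
        "esxcli system syslog reload",
        # Option 4: suppress the "No coredump target" warning.
        "esxcli system settings advanced set "
        "-o /UserVars/SuppressCoredumpWarning -i 1",
    ]


for line in build_logging_commands("/vmfs/volumes/datastore1/logs"):
    print(line)
```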

Not sure, but maybe still helps with I/O to disable coredumps. Will update again if new news arises.

Manually Fix Veeam Backup Job after VM-ID change

The Story

There have been a couple of times where my VM-IDs changed:

  • A vSphere server has crashed beyond a recoverable state.
  • A server has been removed and added back into the inventory in vSphere.
  • Manually moving a VM to a new ESXi host.
    • VM removed from inventory, and re-added.
  • Loss of the vCenter Server.
  • Full VM Recovery via Veeam.

What sucks is when you go to run the Job in Veeam after any of the above, the job simply fails to find the object. You can edit the job by removing the VM and re-adding it, but this will build a whole new chain, which you can see in the repo of Veeam after such events occur:

As you can see, there are two chains. This has been an annoyance for a long time for me, as there’s no way to manually set the VM-ID in vCenter; it’s all automanaged.

I found this Veeam thread discussing the same issue, and someone mentioned “an old trick” which may apply, and linked to a blog post by someone named “Ideen Jahanshahi”.

I had no idea about this, let’s try…

Determine VM-ID on vCenter

The source uses PowerCLI, which I’ve covered installing, but it’s easier to just use the Web UI and grab it from the address bar, after the vms parameter.
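If you’d rather not squint at the address bar, a regex can pull the ID out of a pasted URL. This sketch assumes the URL contains a "VirtualMachine:vm-NNNNN" segment, which is how it looks in my vSphere Client; the example URL and host name below are made up.

```python
import re


def vm_id_from_url(url):
    """Return the 'vm-NNNNN' managed object ID found in a client URL, or None."""
    match = re.search(r"VirtualMachine:(vm-\d+)", url)
    return match.group(1) if match else None


# Made-up URL in the assumed shape.
example = (
    "https://vcenter.example.com/ui/app/vm;nav=v/"
    "urn:vmomi:VirtualMachine:vm-55633:00000000-0000-0000-0000-000000000000/summary"
)
print(vm_id_from_url(example))
```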

Determine VM-ID in Veeam

The source installs SSMS, and much like my fixing WSUS post, I don’t like installing heavy stuff on my servers to do managerial tasks. Lucky for me, SQLCMD is already installed on the Veeam server so no extra software needed.

Pre-reqs for SQLCMD

You’ll need the hostname. (run command hostname).

You’ll need the Instance name. (Use services.msc to list SQL services)

Connect to Veeam DB

Open CMD as admin

sqlcmd -E -S Veeam\VEEAMSQL2012

use VeeamBackup
:setvar SQLCMDMAXVARTYPEWIDTH 30
:setvar SQLCMDMAXFIXEDTYPEWIDTH 30
SELECT bj.name, bo.object_id FROM bjob bj INNER JOIN ObjectsInJobs oij ON bj.id = oij.job_id INNER JOIN Bobjects bo ON bo.id = oij.object_id WHERE bj.type=0
go

For some reason the above code wouldn’t work on my latest build/install of Veeam, but this one worked:

SELECT name, job_id, bo.object_id FROM bjobs bj INNER JOIN ObjectsInJobs oij ON bj.id = oij.job_id INNER JOIN BObjects bo ON bo.id = oij.object_id WHERE bj.type=0

In my case, after removing the VM from inventory and re-adding it:

As you can see they do not match, and when I check the VM size in the job properties the size can’t be calculated cause the link is gone.

Fix the Broken Job

UPDATE bobjects SET object_id = 'vm-55633' WHERE object_id='vm-53657'
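Since this is a hand-typed UPDATE against the Veeam database, it’s worth sanity-checking both IDs before pasting anything into sqlcmd. A small sketch that validates the "vm-NNNNN" shape and builds the same statement as above (the helper name is made up for this example):

```python
import re

VM_ID_PATTERN = re.compile(r"vm-\d+")


def build_object_id_update(old_id, new_id):
    """Build the UPDATE statement, refusing anything that isn't vm-<digits>."""
    for value in (old_id, new_id):
        if not VM_ID_PATTERN.fullmatch(value):
            raise ValueError(f"not a VM-ID: {value!r}")
    return (
        f"UPDATE bobjects SET object_id = '{new_id}' "
        f"WHERE object_id = '{old_id}'"
    )


print(build_object_id_update("vm-53657", "vm-55633"))
```

The old ID is the one Veeam still has, the new ID is the one vCenter shows now. A backup of the VeeamBackup database before running the statement is cheap insurance.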

After this I checked the VM size in the job properties and it was calculated. To my amazement it fully worked: it even retained the CBT points, and the backup job ran perfectly. Woo-hoo!

This info is for educational purposes only, what you do in your own environment is on you. Cheers, hope this helps someone.

vCLS High CPU usage

The Story

So I went to vMotion a VM to do some maintenance work on a host. Target machine well over 50% CPU usage.. what?! That can’t be right, it’s not running anything…

I tried hard powering the VM off, but it just came right back up suckin CPU cycles with it….

The Hunt

Alright Google, what ya got for me… I found this blog post by “Tripp W Black”. He mentions stopping a vCenter service called “VMware ESX Agent Manager” and then deleting the offending VMs. Sounds like a plan, let’s try it. So log in to VAMI. (vcenter.consonto.com:5480)

K, let’s stop it… let me hard power off the VM now… ehh the VM is staying dead and host CPU:

K let’s go kill the other droid I have causing an issue…

OK, I got them all down now, but the odd part is I can’t delete them from disk the way Sir Black mentioned in their blog post. The option is greyed out for me. Let’s start the service and see what happens…

The Pain

Well, that was extremely annoying. It seemed to have worked only for a moment and the CPU usage came right back, so I stopped the service again, but I can’t delete the VMs…

Similar issues in vSphere 8, even suggestions to stay running in retreat mode, which I’ll get to in a moment. So, if you are unfamiliar, vCLS are small VMs that are distributed to ESXi hosts to keep HA and DRS features operational, even if vCenter itself goes down. The thing is, I’m not even using HA or DRS, I created a cluster for merely EVC purposes, so I can move VMs between hosts live at my own leisure and without downtime. What’s annoying is I shouldn’t have to spend half my weekend day trying to solve a bug in my HomeLab due to poor design choices.

The Constructive Criticism

VMware…. do not assume a cluster alone requires vCLS. Instead, enable vCLS only when HA or DRS features are enabled.

Now that we have that very simple thing out of the way.

The Fix

So, as we mentioned we are able to stop the vCLS VMs when we stop the EAM service on vCenter, but that won’t be a solution if the server gets rebooted. I decided to Google to see how other people delete vCLS when it doesn’t seem possible.

I found this reddit thread, in which they discuss the same thing mentioned above, “Retreat Mode”. However, after setting the required settings (which are apparently tattooed once done), I still couldn’t delete the VMs, even after restarting the vpxd service. Much like ‘bananna_roboto’ I ended up deleting the vCLS VMs from the ESXi host UI directly; however, when checking the vCenter UI they still showed on all the hosts.

After rebooting the vCenter server, all the vCLS VMs were gone, at first, I thought they’d come back, but since the retreat mode setting was applied it seems they do not get recreated. Hence, I will leave Retreat mode enabled as suggested in the reddit thread for now, since I am not using HA or DRS.

So if you want to use EVC in a cluster, but not HA and DRS and would like to skim even more memory from your hosts, while saving on buggy CPU cycles, apparently “Retreat mode” is what you need.
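For reference, the Retreat Mode setting is a per-cluster vCenter advanced setting keyed on the cluster’s managed object ID. As far as I know the key has the form config.vcls.clusters.domain-cNNNN.enabled with a value of false, but treat that as an assumption and check it against VMware’s KB before relying on it. A tiny sketch that builds the key from a cluster ID (the ID below is made up):

```python
import re


def retreat_mode_setting(cluster_moref):
    """Return the (key, value) pair for putting one cluster into Retreat Mode.

    cluster_moref is the cluster's managed object ID, e.g. 'domain-c8'.
    The key format is an assumption to verify against VMware's KB.
    """
    if not re.fullmatch(r"domain-c\d+", cluster_moref):
        raise ValueError(f"not a cluster ID: {cluster_moref!r}")
    return (f"config.vcls.clusters.{cluster_moref}.enabled", "false")


print(retreat_mode_setting("domain-c8"))
```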

If you do need those features, and you are unable to delete the old vCLS VMs, and restarting the EAM service doesn’t resolve your issue (which it didn’t for me), you may have to open a support case with VMware.

Anyway, I hope this helped someone. Cheers.