vSphere HA Agent cannot be correctly installed or configured… again

Story

Another vCenter Patch, Another problem 😀

This seems to be a reoccurring story these last couple posts…

Error on Host

This time after updating again a host in the cluster had the error message.

Troubleshooting

Un like the last time this happened, the event log wasn’t as blatant (flooded) complaining about the /tmp being full. and checking the host with

vdf -h

which showed only 90% full, which was still pretty high, which might have explained the one log event that I did see about it:

The ramdisk 'tmp' is full. As a result, the file /tmp/img-stg/data/vmware_f.v00 could not be written

Which was in the log right after this event of attempting to install a base ESXi image?

Installing image profile '(Updated) HPE-ESXi-Image' with acceptance level checking disabled

This seemed a bit weird but I could find any info other than what’s usuallly a very Microsoft type answer of “you can just ignore it” or “usually this is not an issue, just it says vCenter saying it is connecting to esxi host and installing it’s agent

OK I guess… moving on… the very next error event was:

Could not stage image profile '(Updated) HPE-ESXi-Image': ('VMware_bootbank_vmware-fdm_7.0.2-18455184', '[Errno 28] No space left on device')

Huh, Now note this host was installed running the official VMware Image provided by HPE for this exact hardware supported by the VMware HCL. So there should be no funny business. However I feel maybe there’s a bit of the known HPE bug as mentioned the last time this happened. It just hasn’t fully flooded /tmp just yet.

Lil Side Trail

So couple things to note here, first the ESXi image is installed on a USB/SD Card style setup as such it should be well know to define the persistent log location, as well as the scratch location. However, not many source specify changing the system swap location.

  1. Persistent Log; VMware KB; Tech Blogger
    (Most standard ESXi Log info)
  2. Scratch Log: VMware KB; Tech Blogger 1; Tech Blogger 2
    (Crash Logs, Support log creations)
  3. Swap Location: VMware Doc 1 (Configure), VMware Doc2 (About), Tech Blogger Who seem to regurgitate the exact about page from VMware.

However, researching this even more lots of posts on reddit mentioned the swap file for VM’s being on their VM directories, so if using a shared datastore they will reside there, and I shouldn’t see issues around swap usage at all at the host level.

Which if you look on the vCenter Web UI on a ESXi hosts there are two options available: VM – Swap, and System Swap.

The VMware docs doesn’t seem to describe accurately the difference between these two options.

Lookup up the error about not being able to stage the file I found this one blog post which of course mentioned changing the swap location to get past the error…

The main thing mentioned by the blogger is “The problem is caused by ESXi not having enough free space available to extract the installation packages.” but failed to specify where that exactly is, and the event log didn’t specify that either. Now since his solution was to adjust the system swap location, it begs the question. Is the package extraction location the System Swap location?

Since the host settings seem to be only specified with the alternative option checkboxes as:

Can use host cache
Can use datastore specified by host for swap files

It’s still not fully clear to me where the swap is actually located with these, assumed default settings. Or if extraction of the image actually using swap, or why the same imagine already on the ESXi host is being re-applied when your upgrade vCenter?

Resolution

So many question, so little answers, so unfortunately I’m going to go on a bit of a whim, and simply try exactly what I did before, clear the file from the /tmp location that was takin up a lot of it’s space, install the HPE patch for the known bug, in hopes it resolves the issue….

Sure enough the exact same thing happened, as in my initial post it just seems it wasn’t fully full. So the symptoms were just a bit different.

  1. vMotion all VMs to another host in the cluster (amazing vMotion works without issue)
  2. Ignore the HA warning on the VMs migrated
  3. Place Host into Maintenance mode (This clears the HA warnings on the VMs and cluster)
  4. Verify /tmp has room. Update any ESXi packages from the hardware vendor if applicable.
  5. Reboot the host.
  6. Exit Maintenance mode.

Hope this helps someone who might see the same type of error events in their ESXi event logs.

Clear vCenter Alert Certificate Status

Story

So lately updated a couple vCenter server servers, and in my process I hit a couple errors that required some resolving…

  1. Expired Certs on Source vCenter
  2. Error [500] Auth Provider, due to something, potentially bad certs.
  3. An HPE Bug, filling up ramdisk, causing HA config issues.
  4. Change in security process; preventing login.

The Problem

So a couple hiccups along the way. And now it’s time to resolve this one…

Yeahhhh and alert on Certificates… Seems like VMware and certificate management is like Oil n Water. They don’t mix well.

I’ve had some terrible times managing certificates  with VMware. However as blogged about here, seems there’s finally a way to use your own certificates via the WebUI.

Anyway… to the point, you figured you simply navigate to the vCenter WebUI -> Home -> Administration -> Certificates. Only to realize there’s nothing reporting as invalid or expired.

Checking for Expired Certs

What gives? Ahhh yes, more hidden secret stuff that is not in your face when it comes to the WebUI. Can you guess? That’s right another VMware KB

So while the other issues I’ve mentioned does have references and script in relation to certs, the only “check” in those previous posts was using openssl on the VCSA shell to grab the certificate from the listening service on the dedicated port. Which was based on a particular symptom which spurred that check. So here’s the KB telling you how to actually check the certificates the easiest way I found so far (no check.py; python script needed)

for store in $(/usr/lib/vmware-vmafd/bin/vecs-cli store list | grep -v TRUSTED_ROOT_CRLS); do echo "[*] Store :" $store; /usr/lib/vmware-vmafd/bin/vecs-cli entry list --store $store --text | grep -ie "Alias" -ie "Not After";done;

That’s it! :D…. which just like the KB indicated which cert was bad, in this case, an old Root CA that was used in previous deployments of vCenter before upgrades, So it turns out even though you follow the required KB to get past the pre-check of expired certs. It doesn’t delete the old certificates CA Cert.

There it is, the second CA Cert with expiry in 2019… OK so… You figured it would be easy to clean this up, but remember you couldn’t even see it in the WebUI, so you best believe there is no WebUI way to do this that protects you from human error.

Removing old Expired Certs

Instead, very brilliantly, you get… yes another KB! Booo Yeah… So let’s do this!

The main thing to note about this is…

Certificates are copied back to the VECS store because the CA Certificate which is expiring is published to the VMware Directory Service (VMDIR). When the Certificate is removed from VECS, VMDIR adds the Certificate back to VECS during a sync operation. This is done in order to ensure the integrity of the TRUSTED_ROOTS Certificate store, as deletion of an incorrect Certificate from this store could cause the environment to be irreparably damaged.

OK…. All I take away from this is Certs are important so they have a second cert store as a backup to the first cert store… that’s all I can take away form this odd statement.

/usr/lib/vmware-vmafd/bin/vecs-cli entry list --store TRUSTED_ROOTS --text | less

“Find the Certificate you wish to remove and make a note of the Alias and the X509v3 Subject Key Identifier.

Note: There Could be several Certificates to remove. Any expired and not in use certificates should be removed to avoid certificate related alarms.”

Yes that is the plan…

List the trusted certs published to the VMware Directory Service using the following command (administrator@vsphere.local password required). This command is in the same location as vecs-cli:

/usr/lib/vmware-vmafd/bin/dir-cli trustedcert list

Huh… in this case it looks like it is not here, so I should be safe to delete it from the normal store and it shouldn’t auto populate back in.

If you do see it (CN equal to x509v3 Key Identifier) then follow the linked KB to remove it, which seems to save a copy of the cert and use that saved copy to run another command to remove it from the store… super weird.

/usr/lib/vmware-vmafd/bin/vecs-cli entry delete --store TRUSTED_ROOTS --alias 3276134ad93b3688b5dc5dcfaa402e9bfd7af12f

Restart all services on the PSCs and on the vCenter Servers and ensure that all services start and respond normally and that you can log in and manage the environment.

service-control --stop --all
service-control --start --all

Took a liil while, then logging in… alert still there, I guess I just have to Reset to Green?

For Now Clicked the Reset to Green link. Even after Yet another vCenter patch, it still did not show up anymore. Yay.

Fixing [400] An error occurred while sending an authentication request to the vCenter Single Sign-On server

So After the last two blog posts about fixing vCenter7’s access issues due to it’s due certificate monument work flows. I was greeted with this error when trying to sign into the web UI on vCenter.

[400] An error occurred while sending an authentication request to the vCenter Single Sign-On server- An error occurred when processing meta data during vCenter Single Sign-On setup:the service provider validation failed. Verify that the server URL is correct and is in FQDN format, or that the hostname is a trusted service provider alias.

After a a quick google search I found yet another VMware KB discussing it.

 Resolution
This is an expected behavior.
VMware vSphere 7.0 enforce FQDN or IP address reverse resolvable to FQDN to allow authentication for Single-Sign on.
Greeeeeeeeeeeeeeaaaaaaaaat! Thanks VMware, just another example of security destroying functionality.
What did I do? Exactly what it stated, I navigated to the WebUI URL using the hostnames Fully Qualified Domain Name E.G: Hostname.domain.end
Cause I was attempting to access it just by just the hostname as domain info was being auto resolved by the domain suffix during queries.

vSphere HA Agent cannot be correctly installed or configured

I updated a vCenter server to 7.0.x when logging into the newly updated vCenter one host in the cluster state the following alert.

Error: “vSphere HA agent cannot be correctly installed or configured” (2056299) (vmware.com)

The KB didn’t sound promising. Checking the hosts event logs. a bunch of errors about /tmp ramdisk being full…

The ramdisk ‘tmp’ is full – VMware ESXi on HPE ProLiant – Davoud Teimouri – Virtualization and Data Center

For real? Wow, not gettin’ lucky last couple weeks. Sure enough exact same issue, cleared /tmp temporality, and downloaded the patch. When I vMotion the VMs from this host onto another host the VMs themselves showed alerts.

Virtual machine failed to become vSphere HA Protected and HA may not attempt to restart it after a failure.

I kept chugging alone in hopes I’d resolve each VM later. However as soon as I placed the issued host into maintenance mode, the alerts from all VMs disappeared. Applied to patch exactly as the HPE KB stated for the ESXi version it was on.

With luck on my side, the host came up clean, and came out of maintenance mode without an issue, and all error and alerts were resolved. Woooo!

Hope this helps anyone doing a vCenter upgrade to 7.x

Fixing vCenter [500] An error occurred while fetching identity providers.

Story

So The other day I posted about upgrading vCenter to 7.0.x while everything went fine during the upgrade. For some odd reason a couple days later when I went to navigate to the vCenter login page I was greeted with:

[500] An error occurred while fetching identity providers.

Kind of wished I had read this reddit post right off the hop, cause the first reply was is going to be my answer at the end of this post.

I did however first hit this KB about it as well I was a bit thrown off has it indicated to only do it if you see the following in the logs:

(/var/log/vmware/trustmanagement/trustmanagement-svcs.log)

2021-03-10T09:27:03.474Z [tomcat-exec-14  INFO  com.vmware.identity.token.impl.X509TrustChainKeySelector  opId=] Failed to find trusted path to signing certificate <STS Certificate Subject, example - C=US,CN=ssoserverSign\,dc\=vsphere\,dc\=local>
java.security.cert.CertPathBuilderException: Unable to find certificate chain.

Which I could not see, so I wasn’t sure if this was the issue or not. What I did see in my logs was the following:

2021-09-17T23:58:03.945Z [tomcat-exec-14 WARN com.vmware.vcenter.trustmanagement.impl.VcIdentityProviders opId=] com.vmware.sso.interop.ldap.NoSuchObjectLdapException: No such object
LDAP error [code: 32]

and

2021-09-18T01:19:01.322Z [tomcat-exec-26 INFO com.vmware.vapi.security.AuthenticationFilter opId=] Not successful authentication
java.lang.RuntimeException: Authentication data not found
Caused by: com.vmware.vapi.dsig.json.SignatureException: Cannot verify the signature over the provided data

So it wasn’t matching. Looking at my firewall I couldn’t see any LDAP connections from vCenter to my LDAP server since the upgrade. So I decided instead to try a reboot. This simply made things worse.

No Healthy Upsteam

Now when I’d try access vCEnter Web UI I was greeted with a blank white web page with simple text stating “No Healthy Upstream”, now looking into this, people reached this problem for several different reasons. As mentioned here and here and for some odd reason this guy just changed his IP address?! Weird.

For me I checked the local Hosts file and it was fine, and couple other mentioned fixes and they all didn’t work for me.

Try Anyway

For some reason at this point I decided to double the mentioned work around in the initial VMware KB I found as the main login symptom was exactly the same even though I couldn’t validate the same log entries within the logs.

How to Copy Files to VCSA via WinSCP

Now a couple real quick things to note here. You need to copy a script to the VCSA. If you get unable to agree on a cipher suite, you’ll need to update your copy of WinSCP to a newer version. Also instead of doing what VMware says to change the shell on the VCSA, do what this guy suggests instead:

“In the new connection dialog, specify the Host name, User name and then click the Advanced button,

(VCSA 6.5)

Choose the Environment/SFTP option

Specify for SFTP server: shell /usr/libexec/sftp-server”

so much easier.

I decided to take a look at the script after copying it to the VCSA, and it had this line which had me hopeful it would actually work to resolve my issue:

/opt/likewise/bin/ldapmodify -x -h localhost -p 389 -D "cn=administrator,cn=users,$DOMAINCN" -w "$DOMAINPASSWORD" -f sso-sts.ldif | tee -a $LOGFILE

So I followed along with the workaround specified in the KB…

1) Download the attached fixsts.sh script from this article and upload to the impacted PSC or vCenter Server with Embedded PSC to the /tmp folder.

2) If the connection to upload to the vCenter by the SCP client is rejected, run this from an SSH session to the vCenter:

chsh -s /bin/bash

3) Connect to the PSC or vCenter Server with an SSH session if you have not already per Step 2.

4) Navigate to the /tmp directory:

cd /tmp

5) Run chmod +x fixsts.sh to make the file executable.

chmod +x ./fixsts.sh

6) Run ./fixsts.sh.

./fixsts.sh

Restart services on all vCenters and/or PSCs in your SSO domain by using below commands:

service-control --stop --all
service-control --start --all

my results:

To my Amazement it actually worked, and I was able to login into the vCenter server!! Wooo!

*Update* Here’s a great blog post covering managing or creating custom certificates with vCenter 7

Kinda funny that 7.0 is stated as 6.8 in the scripts.. mhmm

ESXi Update Network Config Failed
Set ESXi IP via CLI

Real quick post here. I was moving my ESXi hosts and vCenter to a new dedicated subnet. I did the usual; had a temp Windows System in the new subnet, create VMK with temp IP in new subnet, connect to ESXi Web UI via new Temp IP in new Subnet via temp Windows machine. Reconfigure default TCP/IP stack default gateway, change VMK0 IP address (and edit management port group VLAN id if applicable). and Away I’d go.

However on this one host for some unknown stupid reason it would simply fail “Failed – An Error occurred during host configuration”, and the detailed log was just as vague “operation failed diagnostics report unable to set network unreachable” OK… whatever, that shouldn’t matter do as I tell you! Here’s a snippet of the error, and the CLI command that simply worked without bitching.

I just figured let’s try the CLI way and see if it worked, and it turns out it did. The source I used to figure out the command syntax.

The commands I used:

Get IPs:

esxcli network ip interface ipv4 get

Set new IP:

esxcli network ip interface ipv4 set -i vmk1 -I 1.1.1.1 -N 255.255.255.0 -t static

Hope this helps someone.

Upgrade and Migrate a vCenter Server

Intro

Hello everyone! Today I’ll be doing a test in my home lab where I will be upgrading, not to be confused with updating, a vCenter server. If you are interested in staying on the version your vCenter is currently on but just patch to the latest version, see my other blog post: VMware vCenter Updates using VAMI – Zewwy’s Info Tech Talks

Before I get into it, there are a couple thing expected from you:

  1. An existing instance of vCenter deployed (for me yup, 6.7)
  2. A backup of the config or whole server via a backup product
  3. A Copy of the latest vCenter ISO (either from VMware directly or for me from VMUG)

Side Story

*Interesting Side Note* VM Creation dates property is only a thing since vCenter 6.7. Before that it was in the events table that gets rotated out from retention policies. 🙂

*Side Note 2* I was doing some vmotions of VMs to prepare rebooting a storage device hosting some datastores before the vCenter update, and oddly even though the Task didn’t complete it would disappear from the recent task view. Clicking all Tasks showed the task in progress but @ 0% so no indication of the progress. The only trick that worked for me was to log off and back in.

A quick little side story, it was a little while since I had logged into VMUG for anything, and I have to admit the site setup is unbelievably bad designed. It’s so unintuitive I had to Google, again, how to get the ISO’s I need from VMUG.

Also for some reason, I don’t know why, when I went to log in it stated my username and password is wrong. Considering I use a password manager, I was very confident it was something wrong on their end. Attempting to do a password reset, provided no email to my email address.

Distort I decided to make a another account with the same email, which oddly enough when created brought me right back to my old account on first log in. Super weird. According to Reddit I was not the only one to experience oddities with VMUG site.

Also on the note of VMware certification, I totally forgot you have to take one of the mandatory classes before you can challenge, or take any of the VMware exams.

“Without the mandatory training? Yes, they represent a reasonable value proposition. With mandatory training? No, they do not. Requiring someone who’s been using your products for a decade to attend a class which covers how to spell ESXi is patronizing if not downright condescending. I only carry VMware certifications because I was able to attain them without going through the nonsense mandatory training.”

“The exam might as well cost $3500 and “include” the class for “free”.”

Don’t fully agree with that last one cause you can take any one class (AFAIK) and take all the exams. I get the annoyance of the barrier to entry, gotta keep the poor out. 😛

Simple Summary about VMUG.

  1. Create account and Sign up for Advantage from the main site.
  2. Download Files from their dedicated Repo Site.

Final gripes about VMUG:

  1. You can’t get Offline Bundles to create custom ESXi images.
  2. You can’t seem to get older versions of the software from there.
  3. The community response is poor.
  4. The site is unintuitive and buggy.

So now that we finally got the vCenter 7 ISO

For a more technical coverage of updating vCenter see VMware’s guide.

For shits.. moving esxi hosts, and vcenter to new subnet.

1) Build Subnet, and firewall rules and vlans
2) Configure all hosts with new VMPG for new vlan
3) Move each host one at a time to new subent, ensure again that network will be allowed to the vCenter server after migration
4) Can’t change VMK for mgmt to use VLAN from the vCenter GUI, have to do it at host level.
i) Place host into maintenance mode, remove from inventory (if host were added by IP, otherwise just disconnect)
ii) Update hosts IP address via the hosts console, and update DNS records
iii) Re-add the host to the cluster via new DNS hostname

Changing vCenter Server IP address

Source: How to change vCenter and vSphere IP Address ( embedded PSC ) – Virtualblog.nl

changed IP address in the VAMI, it even changed the vpxa config serverIP address to the new IP automatically. it worked. :O

Upgrading vCenter

Using the vCenter ISO

The ISO is not a bootable one, so for me I mount it on to a Windows machine that has access to the vCenter server.

Run the installer exe file…

Click Upgrade

I didn’t enter the source ESXi host IP.. lets see

nope wants all the info, fill all fields including source esxi host info.

Yes.

Target ESXi Host for new VCSA deployment. Next

Target VCSA VM info. Next

Would you like, large or eXtra large?

pick VMs datastore location, next.

VM temp info, again insure network connections are open between subnets if working with segregated networks.

Ready to deploy.

Deploying VM to target ESXi host. Once this was done got a message to move on to Stage 2, which can be done later, I clicked next.

Note right here, when you get a prompt for entering the Root password, I found it to be the target Root password not actually the source.

Second Note Resolving Certs Expired Pre-Check

While working on a client upgrade, it was more in my face when doing the source server pre-checks and would not continue stating certificates expired.

I was wondering how to check Existing certs and while this KB states you can check it via the WebUI There  could be a couple issues.

1) You might not even be able to login into the WebUI as mentioned in this Blog, a bit of a catch 22. (Note* same goes for SSO domains, it can’t be managed by VAMI, so if there’s an AD issue with a source, you often get a service 503 error attempting to log on to the WebUI)

2) It might not even show up in that area of the WebUI.

In these cases I managed to find this blog post… which shockingly enough is the very guy who wrote the fixsts script used to fix my problem in this very blog post :O

Checking Certs via the CLI

Grab Script from This VMware KB

Download the checksts.py script attached to the above KB article.
Upload to attached script to the VCSA or external PSC.

For example, /tmp

Once the script has been successfully uploaded to VCSA, change the directory to /tmp.

For example:

cd /tmp

Run python checksts.py.

OK Dokie then, I guess this script doesn’t check the required cert… so instead I followed along with this VMware KB (Yes another one).

In which case I ran the exact commands as specified in the KB and saved the certificate to a txt, file and opened it up in Windows by double clicking the .crt file.

openssl s_client -connect MGMT-IP:7444 | more

So now instead of running the fixsts script, this KB states to run the following to reset this certificate to use the Machine Cert (self signed with valid date stamps, at least that’s what this server showed when checking them via the Certificate management are in the vCenter WebUI).

For the Appliance (I don’t deal with the Windows Server version as it EOL)

/usr/lib/vmware-vmafd/bin/vecs-cli entry getcert --store MACHINE_SSL_CERT --alias __MACHINE_CERT > /var/tmp/MachineSSL.crt
/usr/lib/vmware-vmafd/bin/vecs-cli entry getkey --store MACHINE_SSL_CERT --alias __MACHINE_CERT > /var/tmp/MachineSSL.key
/usr/lib/vmware-vmafd/bin/vecs-cli entry getcert --store STS_INTERNAL_SSL_CERT --alias __MACHINE_CERT > /var/tmp/sts_internal_backup.crt
/usr/lib/vmware-vmafd/bin/vecs-cli entry getkey --store STS_INTERNAL_SSL_CERT --alias __MACHINE_CERT > /var/tmp/sts_internal_backup.key
/usr/lib/vmware-vmafd/bin/vecs-cli entry delete --store STS_INTERNAL_SSL_CERT --alias __MACHINE_CERT -y
/usr/lib/vmware-vmafd/bin/vecs-cli entry create --store STS_INTERNAL_SSL_CERT --alias __MACHINE_CERT --cert /var/tmp/MachineSSL.crt --key /var/tmp/MachineSSL.key

Then:

  • service-control --stop --all
  • service-control --start --all

In my case for some odd reason I saw a bunch of these… when stopping and starting the services

2021-09-20T18:35:47.049Z Service vmware-sts-idmd does not seem to be registered with vMon. If this is unexpected please make sure your service config is a valid json. Also check vmon logs for warnings.

I was nervous at first I may have broke it, after sometime it didn’t complete the startup command sequence, and after some time the WebUI was fully accessible again. Let’s validate the cert with the same odd method we did above.

Which sure enough showed a date valid cert that is the machine cert, self-signed.

Running the Update Wizard… Boooo Yeah!

 

Uhhh, ok….

 

Ok dokie?

I didn’t care too much about old metrics.

nope.

Let’s go!

After some time…

Nice! and it appears to have worked. 🙂

Another Side Trail

I was excited cause I deployed this new VCSA off the FreeNAS Datastore I wanted to bring and reboot. but low and behold some new random VMs are on the Datastore…

doing some research I found this simple explanation of them however it wasn’t till I found this VMware article with the info I was more after.

Datastore selection for vCLS VMs

The datastore for vCLS VMs is automatically selected based on ranking all the datastores connected to the hosts inside the cluster. A datastore is more likely to be selected if there are hosts in the cluster with free reserved DRS slots connected to the datastore. The algorithm tries to place vCLS VMs in a shared datastore if possible before selecting a local datastore. A datastore with more free space is preferred and the algorithm tries not to place more than one vCLS VM on the same datastore. You can only change the datastore of vCLS VMs after they are deployed and powered on.

If you want to move the VMDKs for vCLS VMs to a different datastore or attach a different storage policy, you can reconfigure vCLS VMs. A warning message is displayed when you perform this operation.

You can perform a storage vMotion to migrate vCLS VMs to a different datastore. You can tag vCLS VMs or attach custom attributes if you want to group them separately from workload VMs, for instance if you have a specific meta-data strategy for all VMs that run in a datacenter.

In vSphere 7.0 U2, new anti-affinity rules are applied automatically. Every three minutes a check is performed, if multiple vCLS VMs are located on a single host they will be automatically redistributed to different hosts.

Note:When a datastore is placed in maintenance mode, if the datastore hosts vCLS VMs, you must manually apply storage vMotion to the vCLS VMs to move them to a new location or put the cluster in retreat mode. A warning message is displayed.

The enter maintenance mode task will start but cannot finish because there is 1 virtual machine residing on the datastore. You can always cancel the task in your Recent Tasks if you decide to continue.
The selected datastore might be storing vSphere Cluster Services VMs which cannot be powered off. To ensure the health of vSphere Cluster Services, these VMs have to be manually vMotioned to a different datastore within the cluster prior to taking this datastore down for maintenance. Refer to this KB article: KB 79892.

Select the checkbox Let me migrate storage for all virtual machines and continue entering maintenance mode after migration. to proceed.

huh, the checkbox is greyed out and I can’t click it.
vmotioned them and the process kept moving up.

How to remove a Datastore from a vSphere Cluster

How to Remove a Datastore

Intro

Hey everyone,

I figured I’d write up a quick little help guide on removing a Datastore. Now this isn’t new and likely to be buried on the internet because of it. However in my searches I have found the following sources to be great reads. I highly recommend you check them out.

1)  Official Source VMware KB2004605.

2) A Blog guide by Sam McGeown, here.

3) A post by Mike on cswitchzero.

Now let’s go through the checklist from the official source one by one.

Check List

  • If the LUN is being used as a VMFS datastore, all objects (for example, virtual machines, templates, and Snapshots) stored on the VMFS datastore must be unregistered or moved to another datastore.-This one is pretty easy navigate to the datastore files and check. You may find some remanence from the following though.
  • All CD/DVD images located on the VMFS datastore must also be unmounted/unregistered from the virtual machines.-This shouldn’t even be the case if you did check one.
  • The datastore is not used for vSphere HA heartbeat.-This setting will use a folder labeled “.vSphere-HA”
    For a Quick overview of Datastore Heart beating See here
    To “remove” aka change them See here
  • The datastore is not part of a datastore cluster.-You can find useless help on this process from VMware here. I’m assuming it’s an easy task via the WebUI
  • The datastore is not managed by Storage DRS.-If you removed it from the datastore cluster, how could this be an issue?
  • The datastore is not configured as a diagnostic coredump Partition/File and Scratch Partition. For more information, see the following:
  • Storage I/O Control is disabled for the datastore.-See here on how to enable (disabling is the exact reverse)
  • No third-party scripts or utilities running on the ESXi host can access the LUN that has issue.-Honestly I’m not sure how you could check this… even when doing some quick research, you can have scripts I guess that are not on the hosts, but run by alternative machines via PowerCLI. As described in this community post. I guess you’d have to know, either way the scripts would just fail, shouldn’t affect the vSphere cluster.
  • If the LUN is being used as an RDM, remove the RDM from the virtual machine. Click Edit Settings, highlight the RDM hard disk, and click Remove. Select Delete from disk if it is not selected and click OK.Note: This destroys the mapping file but not the LUN content.

    – This is more involving the removing of the backend physical device. Which in my case is the final goal. Though if yours was just to remove a datastore while keeping the physical storage in place this can be ignored.

  • As noted by Sam but not the official source or Mike is if you see a .dvsData folder. as stated by SAM “The .vdsData folder is created on any VMFS store that has a Virtual Machine on it that also participates in the VDS – so by migrating your VMs off the datastore you’ll be ensuring the configuration data is elsewhere.”
  • Check that there are no processes locking the VMFS with this command:
esxcli storage core device world list -d

Datastore Removal Steps

Step 1) Follow the Checklist above.

Make sure no files reside on the Datastore.

Step 2) Unmount Datastore from all ESXi hosts.

As noted by SAM blog post even in vSphere 5.x using the C# phat client, this was possible to do via a wizard against all hosts that have the datastore mounted. Even on the newer HTML5 WebUI this is still possible (I think everyone wants to fully forget that VMware chose flash for a short time).

At this point the Datastore will show up as inaccessible to vSphere. As noted by both Mike and Sam. This will be the same anywhere from 5.x-7.x (As noted by Mike it might be slightly more important to follow procedures with earlier versions of ESXi 3 or 4). If the Check list was followed, there should be no issues unmounting the datastore.

If you need to do this via esxcli (Source):

# esxcli storage filesystem list

Unmount the datastore by running the command:

# esxcli storage filesystem unmount [-u UUID | -l label | -p path ]

For example, use one of these commands to unmount the LUN01 datastore:

# esxcli storage filesystem unmount -l LUN01

# esxcli storage filesystem unmount -u 4e414917-a8d75514-6bae-0019b9f1ecf4

# esxcli storage filesystem unmount -p /vmfs/volumes/4e414917-a8d75514-6bae-0019b9f1ecf4

Step 3) Detach the LUN from all hosts.

As noted by Sam, if you are on 5.x you might want to automate this via PowerCLI. Then noted by Mike, newer 7.x can now do this in bulk via the Management WebUI.

6/7 WebUI -> Hosts n Clusters -> Hosts -> Cluster -> Host -> Configure Tab -> Storage Device (left side tree) -> Highlight Device -> Detach

for esxcli

Obtaining the NAA ID of the LUN to be removed

esxcli storage vmfs extent list

To detach the device/LUN, run the command:

# esxcli storage core device set --state=off -d NAA_ID

6. To verify that the device is offline, run the command:

# esxcli storage core device list -d NAA_ID

The output, which shows that the Status of the disk is off.

Step 4) Rescan HBAs

At this point, if you rescan all HBAs on all hosts the inaccessible datastore should be gone from the WebUI.

At this point you can remove the LUN from being seen (disc from showing up under devices) this will either be iSCSI based configurations (remove static and dynamic IPs from the iSCSI initiator settings on each host.) Mostly likely for a shared VMFS datastore.

It could be a local disc over a local storage controller (such as a logical drive created in RAID) such as behind a Pxxx storage controller.

Removing the source device will always be dependent on how it was configured in the first place.

Summary

So today we covered removing a Datastore. The important thing to remember is removing a Datastore takes a lot more steps than removing one, cause so many different VM’s and services can be applied to a datastore once it has started being used.

In many cases, the SysLog and Scratch partition are big hang ups, and should be looked at closely. Which, however, as stated if you are actually checking for files on the datastore this stuff will be pretty evident.

In most cases, ensure you follow the check list and the process should be pretty smooth. Hope this helps someone.

*Note* I often provide screen shots to provide some context, in this case I decided to leave it more generic to span multiple versions of vSphere.

ESXi new install; failed to create new Datastore

Well I booted up a new server, created a new logical drive, bot ESXi and Failed to create datastore… what is this?

Google help? Yeah Forms help.

1. Show connected disks.

ls -lha /vmfs/devices/disks/

(Verify the disk is seen. You will probably see your disk ID then :1. This is a partition on the disk. We only need to work about the main disk ID.)

Neat. next

2. Show the error on disk.

partedUtil getptbl /vmfs/devices/disks/(disk ID)

(It will probably indicate that the GPT is located beyond the end of the disk.)

Ohhh yeah, huh… fix it

3. Wipe disk and rewrite with a basic MSDOS partion.

partedUtil setptbl /vmfs/devices/disks/(disk ID) msdos

(The output from this should be similar to msdos and the next line will be o o o o)

Go to create data store after this, yay it worked. Please note to use your own values, images are just for reference.

*UPDATE* I went to reuse some old drives from an old RAID controller. In this case I had removed the logical drive from the old RAID configuration, pulled the disks. Since they were same Caddy as an alternative server, and went on to create some new logical drives to use as an alternative datastore on this particular host.

In the examples above, it would fail at creation of the datastore. In this example it failed at the point in the wizard to define the partition to create. with an error as follows:

“Either the selected disk already has a VMFS datastore or the host cannot perform a partition table conversion. Select another disk” in a nice red banner.

Now attempting My usual fix as mentioned above resulted in…

… to be updated (i have such a headache right now from the endless issues)

Had to clear the drives to fix this problem (delete logical drive) rip Drives out of server, use a USB enclosure to use “diskpart” and the “clean command on windows to clean the drives.

Then after that the health light on the server went off, saying my one disk or caddy is “unauthentic” even though it was just working. Apparently terrible engineer caddy’s.

Which to find out this issue I had to get into iLO which the admin password was unknown so had to run up my old blog post to get into that. and now after all that.. I have a headache.

Good job computers, you managed to make my day fantastic… again.

ESXi VM disconnected after applying patch

Keep this short.

I had to update a ESXi host locally as it’s mgmt connection would drop with all VMs having to go down on it. As it’s a single host with virtualized network components.

On one of the VMs this was an opportunity to update it since it needed to be shutdown temperately anyway. I took a snapshot of the VM, updated it, validated updates were fine, removed the snapshot. Then proceeded to update the host:

vim-cmd hostsvc/Maintenance_mode_enter
esxcli software vib update -d "/path/to/file.zip"
vim-cmd hostsvc/Maintenance_mode_exit
reboot

Nothing special here. However once the host came back up and I was able to access it via vCenter, one of the VM’s was shown as “disconnected” I’ve seen this with ESXi hosts before, but not particularly with a VM.

Oddly enough there’s only one datastore on the host and all other VMs are fine, and checking the datastore, all files are where they should be.

I figured maybe remove the VM from inventory and just re-add it via the vmx file, however the option was greyed out.

It turned out there was apparently still a snapshot left on the VM (noticed via delta files existing within the VMs folder path).

Removing all the snapshots resolved the issue. Turns out the VM was also running, but didn’t show the green play icon, thus I wasn’t aware of it’s powered on state. Which also explains the greyed out context menu for removing from inventory.

Hope this helps someone.