ESXi VM disconnected after applying patch

Keep this short.

I had to update a ESXi host locally as it’s mgmt connection would drop with all VMs having to go down on it. As it’s a single host with virtualized network components.

On one of the VMs this was an opportunity to update it since it needed to be shutdown temperately anyway. I took a snapshot of the VM, updated it, validated updates were fine, removed the snapshot. Then proceeded to update the host:

vim-cmd hostsvc/Maintenance_mode_enter
esxcli software vib update -d "/path/to/file.zip"
vim-cmd hostsvc/Maintenance_mode_exit
reboot

Nothing special here. However once the host came back up and I was able to access it via vCenter, one of the VM’s was shown as “disconnected” I’ve seen this with ESXi hosts before, but not particularly with a VM.

Oddly enough there’s only one datastore on the host and all other VMs are fine, and checking the datastore, all files are where they should be.

I figured maybe remove the VM from inventory and just re-add it via the vmx file, however the option was greyed out.

It turned out there was apparently still a snapshot left on the VM (noticed via delta files existing within the VMs folder path).

Removing all the snapshots resolved the issue. Turns out the VM was also running, but didn’t show the green play icon, thus I wasn’t aware of it’s powered on state. Which also explains the greyed out context menu for removing from inventory.

Hope this helps someone.

VMware HA down after 6.5 patch

The Story

So the other day I tested the latest VMware patch that was released as blogged about here.

Then I ran the patch on a clients setup which was on 6.5 instead of 6.7. Didn’t think would be much different and in terms of steps to follow it wasn’t.

First thing to note though is validating the vCenter root password to ensure it isn’t expired. (On 6.7u1 a newer)Else the updater will tell you the upgrade can’t continue.

Logged into vCenter (SSH/Console) once in the shell:

passwd

To see the status of the account.

chage -l root

To set the root password to never expire (do so at your own risk, or if allowed by policies)

chage -I -1 -m 0 -M 99999 -E -1 root

Install patch update, and reboot vCenter.

All is good until…

ERROR: HA Down

So after I logged into the vCenter server, an older cluster was fine, but a newer cluster with newer hosts showed a couple errors.

For the cluster itself:

“cannot find vSphere HA master”

For the ESXi hosts

“Cannot install the vCenter Server agent service”

So off to the internet I go! I also ask people on  IRC if they have come across this, and crickets. I found this blog post, and all the troubleshooting steps lead to no real solution unfortunately. It was a bit annoying that “it could be due to many reason such as…” and list them off with vCenter update being one of them, but then goes throw common standard troubleshooting steps. Which is nice, but non of them are analytical to determine which of the root causes caused it, as to actual resolve it instead of “throwing darts at a dart board”.

Anyway I decided to create an SR with VMware, and uploaded the logs. While I kept looking for an answer, and found this VMware KB.

Which funny the resolution states… “This issue is resolved in vCenter Server 6.5.x, available at VMware Downloads.”

That’s ironic, I Just updated to cause this problem, hahaha.

Anyway, my Colleague notices the “work around”…

“To work around this issue in earlier versions, place the affected host(s) in maintenance mode and reboot them to clear the reboot request.”

I didn’t exactly check the logs and wasn’t sure if there actually was a pending reboot, but figured it was worth a shot.

The Reboot

So, vMotion all VMs off the host, no problem, put into maintenance mode, no problem, send host for reboot….

Watching screen, still at ESXi console login…. monitoring sensors indicate host is inaccessible, pings are still up and the Embedded Host Controller (EHC) is unresponsive…. ugghhhh ok…..

Press F2/F12 at console “direct management as been disabled” like uhhh ok…

I found this, a command to hard reboot, but I can’t SSH in, and I can’t access the Embedded Host Controller… so no way to enter it…

reboot -n -f

Then found this with the same problem… the solution… like computer in a stuck state, hard shutdown. So pressed the power button for 10-20 seconds, till the server was fully off. Then powered it back on.

The Unexpected

At this point I was figuring the usual, it comes back up, and shows up in vCenter. Nope, instead the server showed disconnected in vcenter, downed state. I managed to log into the Embedded Host Controller, but found the VMs I had vMotion still on it in a ghosted state. I figured this wouldn’t be a problem after reconnecting to vCenter it should pick up on the clean state of those VM’s being on the other hosts.

Click reconnect host…

Error: failed to login with the vim admin password

Not gonna lie, at  this point I got pretty upset. You know, HULK SMASH! Type deal. However instead of smashing my monitors, which wouldn’t have been helpful, I went back to Google.

I found this VMware KB, along with this thread post and pieced together a resolution from both. The main thing was the KB wanted to reinstall the agents, the thread post seemed most people just need the services restarted.

So I removed the host from vCenter (Remove from inventory), also removed the ghosted VM’s via the EHC, enabled SSH, restarted the VPXA and HOSTD services.

/etc/init.d/hostd restart

/etc/init.d/vpxa restart

Then re-added the host to vCenter and to the cluster, and it worked just fine.

The Next Server

Alright now so now vMotion all the VMs to this now rebooted host. So we can do the same thing on the alternative ESXi host to make sure they are all good.

Go to set the host into maintenance mode, and reboot, this server sure enough hangs at the reboot just like the other host. I figured the process was going to be the same here, however the results actually were not.

This time the host actually did reconnect to vCenter after the reboot but it was not in Maintenance mode…. wait what?

I figured that was weird and would give it another reboot, when I went to put it into Maintenance Mode, it got stuck at 2%… I was like ughhhh wat? weird part was they even stated orphaned ghosted VM’s so I thought maybe it had them at this point.

Googling this, I didn’t find of an answer, and just when I was about to hard reboot the host again (after 20 minutes) it succeeded. I was like wat?

Then sent a reboot which I think took like 5 minutes to apply, all kinds of weird were happening. While it was rebooting I disconnected the host from vCenter (not removed), and waited for the reboot, then accessed this hosts EHC.

It was at this point I got a bit curious about how you determine if a host needs a reboot, since the vCenter didn’t tell, and the EHC didn’t tell… How was I suppose to know considering I didn’t install any additional VIBs after deployment… I found this reddit post with the same question.

Some weird answers the best being:

vim-cmd hostsvc/hostsummary|grep -i reboot

The real thing that made me raise my brow was this convo bit:

Like Wat?!?!?! hahaha Anyway, by this time I got an answer from VMware support, and they simply asked when the error happened, and if I had a snippet of the error, and if I rebooted the vCenter server….

Like really…. ok don’t look at the logs I provided. So ignoring the email for now to actually fix the problem. At this point I looked at the logs my self for the host I was currently working on and noticed one entry which should be shown at the summary page of the host.

“Scratch location not set”… well poop… you can see this KB so after correcting that, and rebooting the server again, it seemed to be working perfectly fine.

So removed from the inventory, ensured no VPXuser existed on the host, restarted the services, and re-added the host.

Moment of Truth

So after ALL that! I got down on my knees, I put my head down on my chair, I locked my hands together, and I prayed to some higher power to let this work.

I proceeded to enable HA on the cluster. The process of configuring HA on both host lingered @ 8% for a while. I took a short walk, in preparation for the failure, to my amazement it worked!

WOOOOOOOOO!!!

Summary

After this I’d almost recommend validating rebooting hosts before doing a vCenter update, but that’s also a bit excessive. So maybe at least try the commands on ESXi servers to ensure there’s no pending reboot on ESXi hosts before initiating a vCenter update.

I hope this blog posts helps anyone experiencing the same type of issue.

 

Creating Custom ESXi Image

Follow these steps

  1. Download Offline Bundle of ESXi Image
  2. Download Drivers E.G The Native ESXi USB NIC drivers
  3. Install PowerCLI (Set-ExecutionPolicy Remotesigned; Import-Module PowershellGet; Install-Module -Name VMware.PowerCLI)
  4. In PowerCLI connect the standard SoftwareDepot by typing:

    Add-EsxSoftwareDepot -DepotUrl <Path to zip>

  5. Get the ImageProfile list:

    Get-EsxImageProfile

  6. Clone standard ImageProfile:

    New-EsxImageProfile -CloneProfile ESXi-6.7.0-8169922-standard -Name MyProfile -Vendor <vendor>

  7.  [Only If Required] If your vib file has Acceptance Level – CommunitySupported, we need to set this Acceptance Level for our ImageProfile:

    Set-EsxImageProfile -ImageProfile MyProfile -AcceptanceLevel CommunitySupported

  8. Add our vib to SoftwareDepot:

    Get-EsxSoftwarePackage -PackageUrl <path to vib>

  9. Add our vib to ImageProfile:

    Add-EsxSoftwarePackage -PackageUrl

Error:

Search result.

Answer driver for specfic version (7.1, need 6.5)

So I downloaded the proper driver but I couldn’t figure out how to pick the right software package since the “get” command was actually already loaded the other driver, so it kept trying to add the 7.1 driver. Only thing I could think of was to close the powershell windows and start fresh…

10. Export ImageProfile to ISO image:

Export-EsxImageProfile -ImageProfile MyProfile -ExportToIso -FilePath

That was it! Sadly the laptop I wanted to use this on was still boot looping, and sadly the USB NIC “Insagnia” didn’t seem to work and was getting NFS4 client failed to load, and not network adapters found on the machine. But was worth a shot.

VMware vCenter Updates using VAMI

This is a quick post on the latest security release notification from VMware.

VMSA-2021-0002 (vmware.com)

If for whatever reason an update is not possible you can follow these workarounds.

While you can use VUM to distribute updates and patches to ESXi hosts.

You’ll have to use VAMI for updating vCenter.

You can download the latest patches here (vmware account required).

I did this on my lab vCenter,  took a lil while but not bad.

  1. Made a backup of the VCSA using Veeam
  2. Shutdown Veeam or any other backup solution that might use vCenter
  3. Notified anyone that might use vCenter that it would be inaccessible during update
  4. Attached ISO to VCSA VM (You can do as 4sysops did and upload to a datastore, or you can simply open the VCSA console via VMRC, and attach the ISO from your Downloads folder)
  5. Log into VAMI (https://vcsa:5480)
  6. Click Update on left nav, then Update -> Check CD-ROM
  7. The update should be available as the option, then click Stage and Install
  8. Accept the EULA, use/don’t use CEIP, Check I have a backup, Click Install.

It could take an hour or so, then everything is back to running state, here’s the summary page after completion:

You can read the alternative methods such as using CLI, or how to handle a vCenter HA cluster upgrade using the link above to 4sysops guide on upgrading vCenter.

Sorry this post is not as extensive as usual, just a heads up about the latest VMware patches. Stay Safe out there.

 

ESXi 6.7 on HPE DL380 G7

I had this long blog post I was going to write about HP screwing me on a ESXi upgrade, but in a nut shell you can read these ones about that how shebang:

  1. MonsterMuffin (crude)
  2. Claud “Admin” (Less crude)

As both of them mention you have to do a clean install, and you probably won’t have a config saved from that exact version as you are just updating to it, so your config on the old 5.x or 6.0 won’t work either. If you have dvswitches and all that fun jazz probably not a huge deal but if you have standard vswitches and lots of custom configurations around them including vlan tagging, well this can be crappy.

I did manage through all my trial n errors to get a working copy but it required workarounds I don’t think would have been supported, so meh just follow those…

I tried everything to get a ESXi system upgraded to 6.7 without loosing or reconfiguring the host, you figure just do anew install and reload the config.

However you can only load config for the same version a backup of that one was created. I eventually came across other things other PSOD’s and had to even at one point edit the boot file to remove a HP dedicated driver from loading. After all that meh, just install new with the custom images mentioned in the above Blogs.

I’m really sorry I would have covered these tasks in far more detail but spent a good couple days smashing my head just trying to get it to work. something are just not worth the effort, and blogging every annoying error and steps along the way on this one… is just one of those things.

Cheers.

Managing HPE Storage controllers on VMware ESXi

HPE Storage on ESXi

Quick Overview

Assumptions, Device drivers and tools are already on the ESXi host as servers such as these running on ESXi should be using authorized images from the vendor and on the Hardware Compatibility List (HCL).

If not use this guys blog on how to manually install the tools that should otherwise already be on the server in question.

I recently decided to double check some server setups running for testing. Since it was all tests I figured I’d talk about some of the implications of simple misconfigurations or even just the unexpected.

Most of these commands I used from following Kalle’s blog and the command list was super useful.

List PCI Devices

To start if you are in a hoop and need to find what storage controller is in use by the hypervisor, run this to list all the devices (least the ones on the PCI bus)

lspci -vvv

This will present you this a long list of devices for my test device (an HP DL385 Gen8) it turned out to be an HP Smart Array P420i:

That’s cool.

Storage Config

To see the current config run:

/opt/smartstorageadmin/ssacli/bin/ssacli ctrl all show config

This shows to me what I already knew, I have 2 logical drives both created with RAID 1+0 tolerance with different amount of different sized drives. In this case one from 4 900 Gig SAS drives, and the other from 12 300 Gig SAS drives.

From this information we can’t determine the speed of the drives.

Controller Status

To view the status of the controller:

 /opt/smartstorageadmin/ssacli/bin/ssacli ctrl all show status

From this we can tell the type of controller, double verifying the results from the lspci command and that there is cache available. Still not sure at this point what type of cache we are dealing with. Our goal is to use the Battery Based Write Cache for the logical volumes.. but we still have some things to cover before we get there.

 /opt/smartstorageadmin/ssacli/bin/ssacli ctrl all show detail

with these details we get to see more of the juicy information, here we can tell we have a cache board for the controller available in “slot 0” as indicated by the “slot” attribute.

Also note the Drive Write Cache, which is when the physical drive itself enables cacheing. However, we again, want to use the BBWC to prevent data loss in the event of a power outage as to not leave our VM’s with corrupted virtual drives. Read this thread on a bit more details about this.

Physical Disk Status

To view all the disks and if they are OK:

/opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 pd all show status

in my case they were all OK.

Physical Disk Details

Now this is where we get to see more details on those SAS disks I talked about ealier:

 /opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 pd all show detail

here we can tell now that the 300 Gig SAS disk is a 10K SAS disk, not bad… 🙂

Logical Drive Status

Run this to get a very basic status report of the logical drives created from all the physical drives.

 /opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 ld all show status

 

Logical Drive Details

Change the all to the logical volume ID number, in this case 2 for the 300 Gig based array.

/opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 ld 2 show

Just to how the difference against the logical disk I know I enabled cache on and has unreal better performance…

/opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 ld 1 show

Now I created these logical drives during the boot of the server using the BIOS/EFUI tools on the system. Lucky though we can adjust these settings right from the esxcli. 🙂

Enable Logical Write Cache

Just like magic:

/opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 logicaldrive 2 modify arrayaccelerator=enable

Being specific to change logical drive 2 which was the one that did not have cache enabled originally… checking it after running the above command shows it has cache! 🙂

All Commands

Just incase Kalle’s site goes down here’s the list he shared for both ESXi 5.x and 6.x

Show configuration
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl all show config
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl all show config
Controller status
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl all show status
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl all show status
Show detailed controller information for all controllers
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl all show detail
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl all show detail
Show detailed controller information for controller in slot 0
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl slot=0 show detail
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 show detail
Rescan for New Devices
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli rescan
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli rescan
Physical disk status
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl slot=0 pd all show status
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 pd all show status
Show detailed physical disk information
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl slot=0 pd all show detail
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 pd all show detail
Logical disk status
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl slot=0 ld all show status
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 ld all show status
View Detailed Logical Drive Status
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl slot=0 ld 2 show
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 ld 2 show
Create New RAID 0 Logical Drive
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl slot=0 create type=ld drives=1I:1:2 raid=0
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 create type=ld drives=1I:1:2 raid=0
Create New RAID 1 Logical Drive
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl slot=0 create type=ld drives=1I:1:1,1I:1:2 raid=1
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 create type=ld drives=1I:1:1,1I:1:2 raid=1
Create New RAID 5 Logical Drive
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl slot=0 create type=ld drives=1I:1:1,1I:1:2,2I:1:6,2I:1:7,2I:1:8 raid=5
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 create type=ld drives=1I:1:1,1I:1:2,2I:1:6,2I:1:7,2I:1:8 raid=5
Delete Logical Drive
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl slot=0 ld 2 delete
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 ld 2 delete
Add New Physical Drive to Logical Volume
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl slot=0 ld 2 add drives=2I:1:6,2I:1:7
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 ld 2 add drives=2I:1:6,2I:1:7
Add Spare Disks
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl slot=0 array all add spares=2I:1:6,2I:1:7
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 array all add spares=2I:1:6,2I:1:7
Enable Drive Write Cache
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl slot=0 modify dwc=enable
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 modify dwc=enable
Disable Drive Write Cache
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl slot=0 modify dwc=disable
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 modify dwc=disable
Erase Physical Drive
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl slot=0 pd 2I:1:6 modify erase
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 pd 2I:1:6 modify erase
Turn on Blink Physical Disk LED
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl slot=0 ld 2 modify led=on
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 ld 2 modify led=on
Turn off Blink Physical Disk LED
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl slot=0 ld 2 modify led=off
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 ld 2 modify led=off
Modify smart array cache read and write ratio (cacheratio=readratio/writeratio)
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl slot=0 modify cacheratio=100/0
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 modify cacheratio=100/0
Enable smart array write cache when no battery is present (No-Battery Write Cache option)
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl slot=0 modify nbwc=enable
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 modify nbwc=enable
Disable smart array cache for certain Logical Volume
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl slot=0 logicaldrive 1 modify arrayaccelerator=disable
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 logicaldrive 1 modify arrayaccelerator=disable
Enable smart array cache for certain Logical Volume
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl slot=0 logicaldrive 1 modify arrayaccelerator=enable
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 logicaldrive 1 modify arrayaccelerator=enable
Enable SSD Smart Path
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl slot=0 array a modify ssdsmartpath=enable
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 array a modify ssdsmartpath=enable
Disable SSD Smart Path
ESXi 5.5 -> /opt/hp/hpssacli/bin/hpssacli ctrl slot=0 array a modify ssdsmartpath=disable
ESXi 6.5 -> /opt/smartstorageadmin/ssacli/bin/ssacli ctrl slot=0 array a modify ssdsmartpath=disable

vCenter 503 Service Unavailable

I was going to test a auditing script from a DefCon presenter on my AD server, when I was adding the USB controller and the USB stick I was passing thorugh to get the script in my VM was being weird.

First USB 3.0 connected just fine, and connected the USB device to the VM, but diskpart was not showing it. So I went to remove it and try a USB 2.0 controller, that failed to connect since the USB 3.0 was still showing there and I selected to remove it again, which it errored another concurrent task. Makes sense, till refreshing the page told me unprivileged account. I wasn’t sure what this was about, so I decided to open another window and navigate to my center web app… 503 service unavailable:

“503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http20NamedPipeServiceSpecE:0x000055aec30ef1d0] _serverNamespace = / action = Allow _pipeName =/var/run/vmware/vpxd-webserver-pipe)”

What the… rebooting the VCSA showed no success still same error even with an incognito window.. ughh.

I found this thread: https://communities.vmware.com/thread/588755

I was going through this, and decided to try to renew the certs, even though my internal PKI certs were still valide (AFAIK, and checking the cert provided when accessing the page). Now here’s the thing, while I ran the certificate-manager script and renewed all the certs, I noticed my AD server somehow was down. I booted it back up. I’m not exactly sure which fixed it. So I decided to take another snapshot while it was in this “fixed state” and revert to the  broken state. After restoring o the broken state nothing was responding at all on the https service from the VCSA, so I gave it a simple reboot (which I did initially before I noticed my AD server was down, for some reason). Sure enough after the reboot everything was working fine with my internal PKI certs.

I guess if you set vCenter to use MS AD as the primary login domain and that domain is not available the web management service becomes unavailable… that kind of sucks. I should have noticed my AD was not operational but I didn’t have monitoring on it 😉 or use my local workstation as a AD member. Mostly just random VMs I have for testing.

Like most people, should have looked at the logs for a better idea of what the root cause was. I threw 2 darts at a dart board and had to revert to find the true root cause. Not the best way to troubleshoot, but sometimes if logs are not available it is another method…

Installing PowerCLI 12.0 Offline

PowerCLI 12.0

Offline Install

Checking VMwares source wasn’t too insightful…

Just this with the “Download” button redirecting to an alternative site non-other than powershellgallery.com …clicking manual Download gives you the raw nuget package let’s try to install first normally.

Install-Module -Name VMware.PowerCLI

No way it failed, expected, and it even states a warning about the network.

Alright so using an online computer copy the nuget package to the offline (use USB sticks, Floppy drives, Zip Drives, serial modem if that’s what it takes…)

In my case I was testing this on a VM and simply used a USB stick to mount it to the VM from the VMRC console, and copied the nuget file to c:\temp\PowerCLI

This from this MS Doc page on the cmdlet, is for Visual Studio, we are using powershell only…

This topic describes the command within the Package Manager Console in Visual Studio on Windows. For the generic PowerShell Install-Package command, see the PowerShell PackageManagement reference.

Sure enough this is where I gave up on this path. All the new stuff is nice with it all being connected makes life super easy, but in those locked down situations this is a hassel. Since I wasn’t sure how to install the nuget package via a simple ID option like Install-Package for VS PS, there wasn’t one for the regular PS Install-Package cmdlet. Then I went to google how to accomplish this and was a bit annoyed at all the steps required to do it via the package manager… Read this by William on Stackoverflow for more details.

Lucky for me I found an alternative blog post, which does an alternative offline install and much, much simpler.

From the online system instead of saving the nuget package we save the modules files themselves directly.

 Save-Module -Name VMware.PowerCLI -Path C:\temp\PSModules

Copy the entire contents of the PSModules folder to a storage medium of your choice (e.g. USB flash drive) and transfer the files to the desired offline system where PowerCLI is needed.

If you have admin rights on the target system, you can copy files to the location below.

 C:\Program Files\WindowsPowerShell\Modules

At this point he goes on about some settings and stuff, I wasn’t exactly sure how to use PowerCLI, as usually it opens up in a custom PS window before. Now you simple import-module *modulename*

Import-Module VMware.PowerCLI

Now creating custom ESXi images should be a breeze!

Extra Bits

Customer Experience Improvement Program (CEIP)

The VMware Customer Experience Improvement Program collects data about the use of VMware products. You can either agree (true) or disagree (false). For offline systems, only the rejection (false) makes sense. The command shown below suppresses future notifications within PowerCLI.

Set-PowerCLIConfiguration -Scope AllUsers -ParticipateInCeip $false

Ignore invalid SSL certificates

When using self-signed certificates in vCenter, PowerCLI will deny the connection. This behavior can be suppressed with the command:

Set-PowerCLIConfiguration -Scope AllUsers -InvalidCertificateAction Warn

Found the types from this old 5.1 documentation you can also set it to ignore instead of warn. 🙂 Cheers!

ESXi Upgrade Failure

Upgrading one of my ESXi hosts in my lab failed on me, sure enough I figured this might happened and put a head on my usually headless server. This means I plugged in a monitor. at the screen I was this:

well that sucks, googling I found this thread from VMware.

looking closer at the boot error before this it stated:

system does not have secure boot enabled. This being an old mini desktop from the mid 2000’s it had uEFI but did not have the “feature” of secure boot. Clearly an after thought of the time. Now the odd part is when I hit the boot menu key “f12” in my case, I had the “legacy” BIOS style, list as P0: Hard Disk and EFI: Hard Disk. When I picked P0 one it booted just fine. So I figured just a simple boor order fix adjust some settings much like the thread (disable EFI boot and stick with legacy). I couldn’t see a way in my EFI/BIOS options to disable the alternative boot types, so I put the legacy type at the top of the list and the EFI one at the bottom, yet every time I booted it would boot the EFI one. When I check the vCentre system it wouldn’t remediate aka update to the new version, so I had to click remediate, run downstairs, and ensure I was there to pick the Legacy Disk boot, even after setting the boot order in the BIOS it wouldn’t stick to legacy and this was the only way I could get the upgrade to succeed.

Dang Computers…

Oh yeah.. this happened to me to, while I was trying to migrate some servers, I wanted to move some VM’s vNic into different VMPGs so I decided to rename the one they were currently using. I created the new VMPG in the alternative vSwitch, and i was a bit stumped to see them already there. I had presumed that once I renamed the VMPG it would reflect as the new name on the VM settings and still be on that old vSwitch (in secret it is). When I went to delete the vSwitch it told me error failed to delete “a specified parameter is not correct”. Googling I found this 10 year old blog that still relevant in ESXi 6.5.

Had to simply edit the VMs vNics and change them back. Dang Computers…

VMware ESXi boot and the Config

Sadly this post will be really short as again, lots going on. Recovering a host that failed after a regular reboot, which had a superblock corruption on it’s main OS drive. Also, the BELK series will be done, I just need a bit more time. Sorry for the delays.

“Failed to load /sb.v00” [Inconsistent Data]

Since this drive was not on the main datastore on the host all the VMs were unaffected.

Now loading linux showed the drive data was till accessible, but I also had a feeling this USB drive was on it’s way out. I created a copy using DD, *sadly I didn’t do it the smart way and place it on a drive big enough to save it as a image file, but instead directly to another drive of the same size.

I tried to install the same image of ESXi on top of the current one in hopes it would fix the boot partition files along the way. This only made the host get past /sb.v00 and vault randomly past it with “Fatal Error: 6 [Buffer Too Small]”

I was pretty tired at this point since the server boot times are rather long and attempts were becoming tedious. I did another DD operation of the drive, to the same drive (still not having learned my lesson) and when I awoke to my dismay, it failed only transferring 5 gigs with an I/O error. This really made me sure the drive was on the way out, but it was still mountable (the boot partitions 5, 6 and 8)

At this point you might be wondering, why doesn’t he just re-install and reload a backup config? Which is fair question, however one was not on hand, but surely it must be somewhere on the drive. I know how to create and recover on a working host but a one that can’t boot? Then I found this gem.

Now through out my attempts I did discover the boot partitions to be 5 and 6 and I did even copy them from a new install to my copied version I made about and it did boot but was a stock config. I was stumped till I read the section from the above blog post on “How to recover config from a system that doesn’t boot”. Line 7 was what nailed it on the head for me:

“mount /dev/sda5 /mnt/sda5

7. In the /mnt/sda5 directory, you can find the state.tgz file that contains ESXi configuration. This directory (in which state.tgz is stored) is called /bootblank/ when an ESXi host is booted.”

I was just like … wat? That’s it. Grabbed the bad main drive mounted on a linux system, saw the state.tgz file and made a copy of it, connected the new drive that had a base ESXi config, replaced the state.tgz file with the one I copied, booted it and there was the host in full working state with all network configs and registered VMs and everything.

Not sure why the config is stored in the boot partition, but there you go. Huge Shout out to Michael Bose for his write I suggest you check it out. I have saved it case it disappears from the internet and I can re-publish it. For now just visit the link. 🙂