How to vMotion a VM without vCenter WITHOUT Shared Storage

While I have covered this in the past here:
How to vMotion a VM without vCenter – Zewwy’s Info Tech Talks

That was using shared network storage between hosts…. what if you have no vCenter AND no shared storage? In my previous post I suggested checking out VMware Arena's post, but that just covers how to copy files from one host to another, and what I've noticed is that while it does work if you let it complete, the vmdk is no longer thin and takes up the full space specified by its defined size. This is also mentioned in this serverfault thread: "Solutions like rsync or scp will be rate-limited and have no knowledge of the content (e.g. sparse VMDK files, thin-provisioned volumes, etc.)"

So options provided there are:

  1. Export the VM as an OVF file, move it to a local system, then reimport the OVF on your destination ESXi. I attempted this, but on the host I could only export the vmdk. While attempting to do so I got network issues (the browser asked to download multiple files, but I must not have noticed in time and it timed out? not sure). This also requires an intermediary device and a double download/upload on the network; I'm hoping for a way between hosts directly.
  2. Use vSphere and perform a host/storage migration. This post is how to do it without. Also note I attempted this, but in my case I'm using the abomination ESXi host I created in my previous blog post, and vCenter fails the task with errors. (Again, SCP succeeds but doesn't retain thin provisioning.) Not sure why SCP succeeds where vCenter fails; SCP seems to be more resilient to a poor connection and keeps going, which matters when the WiFi NICs are under load in those situations.
  3. Leverage one of Veeam’s free products to handle the ad hoc move.

I love Veeam, but in this case I'm limited in resources; let's see if we can do it with native ESXi here.

So that exhausts all those options. What else we got…

Move VMware ESXi VM to new datastore – preserve thin-provisioning – Server Fault

Oh, someone figured out what I did in my initial post all the way back in 2013… wonder how I missed that one.. oh well, same answer as my initial post, though it required shared storage… moving on…

LOL no way… William Lam, all the way back from over 14 years ago, answering the question I had about compressing the files, and saying the OVF export is still the best option.. mhmmm…

I don't want to stick to just scp; man, did it suck getting to 97% done on a 60 gig provisioned VMDK (that's only taking up roughly 20 gigs), only to have it not work cause I put my machine to sleep, thinking it was just a remote connection (SSH) and the remote machine was doing the actual transfer… just to wake my machine the next morning to a "corrupt" vmdk that fails to boot or svMotion back to thin. I have machines with fast local storage but poor network; it's a problem from back in the day of poor, slow internet speeds. So what do we have? We've got gzip and tar; what's the diff?

In conclusion, GZIP is used to compress individual files, whereas TAR is used to combine numerous files and directories into a single archive. They are frequently used together to create compressed archive files, often with the “.tar.gz” extension.

Also answered here.

“If you come from a Windows background, you may be familiar with the zip and rar formats. These are archives of multiple files compressed together.

In Unix and Unix-like systems (like Ubuntu), archiving and compression are separate.

tar puts multiple files into a single (tar) file.
gzip compresses one file (only).
So, to get a compressed archive, you combine the two, first use tar or pax to get all files into a single file (archive.tar), then gzip it (archive.tar.gz).

If you have only one file, you need to compress (notes.txt): there’s no need for tar, so you just do gzip notes.txt which will result in notes.txt.gz. There are other types of compression, such as compress, bzip2 and xz which work in the same manner as gzip (apart from using different types of compression of course).”

OK, so from this it would seem like a lot of wasted I/O to create a tar file of the main VMDK flat file, but we could gain from compressing it. Let's just do a test of simple compression and monitor the host performance while doing so.

Another thing I noticed that I didn't seem to cover in my previous post on this trick: the -ctk.vmdk files, which are changed block tracking files, as noted here:

“Version 3 added support for persistent changed block tracking (CBT), and is set when CBT is enabled for a virtual disk. This version first appeared in ESX/ESXi 4.0 and continues unchanged in recent ESXi releases. When CBT is enabled, the version number is incremented, and decremented when CBT is disabled. If you look at the .vmdk descriptor file for a version 3 virtual disk, you can see a pointer to its *-ctk.vmdk ancillary file. For example: version=3

# Change Tracking File
changeTrackPath="Windows-2008R2x64-2-ctk.vmdk"
The changeTrackPath setting references a file that describes changed areas on the virtual disk.
If you want to back up the changed area information, then your software should copy the *-ctk.vmdk file and preserve the "Change Tracking File" line in the .vmdk descriptor file. If you do not want to back up the changed area information, then you can discard the ancillary file, remove the "Change Tracking File" line, read the VMDK file data as if it were version 1, and roll back the version number on restore."

I'll have to consider this when running some of the commands coming up. Now we still don't know how much, if any, space we'll save from compression alone, or the time it'll take to create the compressed file… from my research I found this resource pretty helpful:

Which Linux/UNIX compression algorithm is best? (privex.io)

Since we want to keep it native, quick tests via the command line show ESXi has both gzip and xz, but not lz4 or lbzip2, which kind of sucks as those showed the best performance in terms of compression speeds… as quoted by the article "As mentioned at the start of the article, every compression algorithm/tool has it's tradeoffs, and xz's high compression is paid for by very slow decompression, while lz4 decompresses even faster than it compressed." Which is exactly what I want to see in the end result; if we save no space, then the process just burns I/O and expected life of the drive being used, for pretty much zero gains.

Highest overall compression ratio: XZ. If we're gonna do this, this is what we want, but how long it takes and how much resource (CPU cycles, and thus overall WATTS) it trades off will come into question (though I'm not actually taking measurements and doing calculations; I'm looking at it at points in time and making assumed guesses at overall returns).

Time to find out what we can get from this (I'm so glad I looked up xz examples, cause it def is not intuitive; no input-then-output parameters, read this to know what I mean):

xz -c /vmfs/volumes/SourceDatastore/VM/vm-flat.vmdk > /vmfs/volumes/TargetDatastore/whereever/vmvmdk.xz

Mhmmm, no progress… crap, I didn't read far enough along; I should have specified the -v flag. Not sure why that wouldn't be the default, having no response from the console kind of sucks… but checking the host resources via the web GUI shows CPU being used, and the write speed….. sad….

CPU usage:

and Disk I/O:

Yeah… maybe 4 MB/s, and this is against SSD storage on a SATA bus; there's no way the storage drive or the controller is at fault here… this is not going to be worth it…

Kill the command, check the compressed file: less than 300 MB in size. OI, that's def not going to pay off here…
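For anyone retrying this, here's the invocation I wish I'd used; note I'm assuming the xz build on your ESXi version supports threading for -T0 (check with xz --version):

# -v prints progress, -1 favors speed over compression ratio, -T0 uses all cores (if this build supports it)
xz -cv1 -T0 /vmfs/volumes/SourceDatastore/VM/vm-flat.vmdk > /vmfs/volumes/TargetDatastore/whereever/vmvmdk.xz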

I decided to try tarring everything into one file without compression, hoping to simply get it to one file roughly 20 gigs in size with max I/O. As mentioned here:

“When I try the same without compression, then I seem to get the full speed of my drive. ”
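For reference, the no-compression tar attempt was roughly this shape (the datastore paths are placeholders for mine):

# -c create archive, -v list files as they are added, -f output file; -C changes into the datastore dir first
tar -cvf /vmfs/volumes/TargetDatastore/vm.tar -C /vmfs/volumes/SourceDatastore VM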

However, to my dismay (maybe it ripped the SSD's cache too hard? I dunno), I'd get an I/O error even though the charts showed insane throughput. I decided to switch to another datastore, a spindle drive on the ESXi host, and you can see the performance just sucks compared to the SSD itself.

Which is now again stuck waiting, cause instead of amazing throughput it's stuck going only 20 MB/s apparently… uggghhhh.

To add to this frustration, I figured I'd try the OVF export option again, but I guess cause the tar operation has a read on the file (I assume a file lock), attempting the OVF export just spits out a web response "File Not Found". So I can't even have a race, knowing full well the SSD could read much faster than what it's currently operating at. I don't really know what the bottleneck is at this point…

Even at this rate it’s feeling almost pointless, but man just to keep a vmdk thin, why, oh WHY SCP can’t you just copy the file as the size it is… mhmmm there has to be a way other than all this crap….

I don't think this guy had any idea he went from thin to thick on the VM….

I thought about SSHFS, but it's not available on the ESXi server….

Forgot about William Lam's project ghettoVCB. Great if I actually wanted more of a backup solution… considered for a future blog, but overkill to just move a VM.

The deeper I go here, the more simply exporting to an OVF template and importing is seeming reaaaaaalll appealing.

Awww man, this tar operation looks like it's taking more space than the source. Doing a du -h on the source shows 19.7 gigs… the tar file has now surpassed 19.8 gigs in size… with no sign of slowing down or stopping lol. Fuck man, I think tar is also completely unaware of thin disks, and I think it'll make the whole tar file whatever the provisioned size was (aka thick). Shiiiiiiiiiiiiit!

Trying the Export VM option looked so promising,

until the usual like always… ERROR!!

FFS man!!! Can't you just copy the files via SSH between hosts? Yeah, but only if you're willing to copy the whole disk and, if you're lucky, hole-punch it back to thin at the destination… can't you do it with the actual size on disk… NO!
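For the record, the "holepunch" trick I'm referring to: if the descriptor still says thin but the flat file got fully inflated by the copy, vmkfstools can deallocate the zeroed blocks again (VM powered off; the path is a placeholder):

# scan the thin disk and punch out blocks that are all zeros
vmkfstools -K /vmfs/volumes/TargetDatastore/VM/vm.vmdk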

Try the basic answer on almost all posts about this: just export as a template and import… Browser download ERROR… like, Fuck!!!

Firefox… nope, same problem… Fuck…. Google, what ya got for me? Well, seems like almost the same as my initial move of using SCP, but using WinSCP via my client machine, inserting a middle man in the process; but I guess using the web interface to download/upload was already a man-in-the-middle process anyway… fine, let's see if I can do that… my gawd is this ever getting ridiculous… what a joke… Export VM from ESXi embedded host client Failed – Network Error or network interruption – Server Fault

And of course when I connect via WinSCP it sees the hard drive as being 60 gigs, so even though transfer speeds are good, it's taking way more space than needed and thus wasting data over the bus… FUCK MAN!!!!!

If only there was a way to change that, oh wait there is, I blogged about it before here: How to Shrink a VMDK – Zewwy’s Info Tech Talks

OK. Make a clone just to be safe (you should always have real backups, but this will do). And amazingly, this operation on the SSD was fast and didn't fail.
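If you'd rather do the clone from the shell, a thin-destination clone with vmkfstools looks roughly like this (paths are placeholders):

# -i clones the input disk; -d thin makes the destination thin provisioned
vmkfstools -i /vmfs/volumes/SourceDatastore/VM/vm.vmdk -d thin /vmfs/volumes/SourceDatastore/VM-clone/vm-clone.vmdk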

Woo almost 300 MB/s and finished in under 4 minutes. Now let’s edit the size.

Well, I tried the edit size, but only after doing a vmkfstools conversion of the vmdk would it show the new size in WinSCP; even then, I transferred the files and it was still corrupted in the end..

ESXi 6.5 standalone host help export big VM ? | MangoLassi

Mhmmm, another link to William Lam's site, covering the exact same thing, but this time using a tool, ovftool….

and wait a second… he also said there's a way to use ovftool on the ESXi server itself, in this post here….. mhmmmm, if I install the Linux OVF tool on the ESXi host, I should be able to transfer the VM while keeping the thin disk, all "native" on ESXi… close enough anyway…

Step 1) Download the OVF tool, Linux zip.

Step 2) Upload the zip file via the Web GUI to a datastore (source ESXi).

Step 3) Unzip the tool (unzip ovftool.zip), then delete the zip.

Step 4) Open outbound 443 on the source ESXi server; otherwise you get an error from the tool.

Step 5) Run the command to clone the VM (see the sketch after this list); get an error that the host is managed.

Step 6) Remove the host from vCenter management and run the command again… fails with a network error (much like the OVF export error; it seems that happens over port 443/HTTPS).
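For reference, the clone command I was running was along these lines (host names, VM name, and datastore are placeholders; --diskMode=thin keeps the destination disk thin):

# clone straight from source host to target host over 443
ovftool --diskMode=thin --datastore=TargetDatastore vi://root@source-esxi/VMName vi://root@target-esxi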

Man Fuck I can’t fucking win here!!!

I think I'm gonna have to do it the old-fashioned way… via "seeding": plug a drive into the source ESXi host, and physically move it to the target.

Tooo beeeee continued……

I grabbed the OVFtool for Windows (on the machine I was doing all the mgmt work from anyway), yet it too failed with network issues.

I decided to reboot the mgmt services on the host:
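For anyone wondering, restarting the management agents from the ESXi shell goes something like this (services.sh restarts the whole stack and will briefly drop your session):

/etc/init.d/hostd restart   # host agent; what the web UI and ovftool talk to
/etc/init.d/vpxa restart    # vCenter agent, if the host is vCenter-managed
# or the sledgehammer:
services.sh restart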

Then gave it one last shot…

Holy efff man, the first ever success yet… don't know if this would have fixed all my other issues (the export failing over https, and all the others?). And the resulting OVA was only about 8 gigs. Time to see if I can deploy it now on the target host.

I deployed the OVA to the target via the WebGUI without issue.

I also tested the ESXi web GUI Export VM option, and this time it also succeeded without failure. Checking the host resources, CPU is fairly high during both the ovftool export and the web GUI export option. Using esxtop showed the hostd process taking up most of the CPU usage during the processes, further making me believe restarting that service is what fixed my issues…

Wireless ESXi Host

The Story

So, the other day I pondered an idea. I wanted to start making some special art pieces made from old motherboards, and then I also started to wonder: could I actually make such an art piece… and have it functional?

I took apart my old build, a 1U server I made from an old PA-500 and a motherboard I repurposed from a colleague who gifted me their old broken system. Since it was a 1U system, I had purchased 2 special pieces to make it work: a special CPU heatsink (complete solid copper, with a side blower fan) and a 300 watt 1U PSU, both of which made lots of noise.

I also have another project going called "Operation Shut the fuck up", in which all the noisy servers I run will be either shut down or modified to make zero noise. I hope with this project to also reduce my overall power consumption.

So I started by simply benching the mobo and working off that, which spurred a whole interest in open-case computer designs. I managed to find some projects on Thingiverse for 2020 extrusions and corner braces, cable ties… the works. The build was coming along swimmingly. There was just one thing that kept bugging me about the build… the wires…

Now I know the power cable will be required regardless, but my hope was to have/install an outlet at the level the art piece was going to be placed at, and have it nicely nested behind the art piece to hide it. Now, there were a couple ways to resolve this.

  1. Use an Ethernet over Power (Powerline) adapter to use the existing copper power lines already installed in the house. (Not to be confused with PoE).
    There was just one problem with this: my existing Powerline kit died right when I wanted to use it for this purpose. (Looking inside, it looks like the fuse soldered to the board blew; it might be as simple as replacing that, but a component behind the fuse could have failed, and replacing it would simply blow the new fuse.)
    *This is still a very solid option as the default physical port can be used and no other software/configuration/hackery needs to be done (Plug n Play).
  2.  The next best option would be to use one of these RJ45 to Wireless adapters:
    Wireless Portable WiFi Repeater/Bridge/AP Modes, VONETS VAP11G-300.
    VONETS VAP11G-500S Industrial 2.4GHz Mini WiFi Bridge Wireless Repeater/Router Ethernet to WiFi Adapter
    This option is not as good, as the signal quality over wireless is not as good as physical, even when using Powerline adapters. However, this option, much like the Powerline option, again allows the use of the default NIC, and only the device itself would need to be preconfigured using another system; otherwise, again, no software/configuration/hackery needs to be done.
  3.  Straight up use a WiFi Adapter on the ESXi host.

Now if you look up this option you’ll see many different responses from:

  1. It can’t be done at all. But USB NICs have community drivers.
    This is true, and I've used it for ESXi hosts that didn't have enough NICs for the different networks that were available (and VLANs were not a viable option for the network design). But I digress; that's not what we are after. WiFi ESXi, yes?
  2.  It can't be done. But option 1, Powerline, is mentioned, as well as option 2, to use a WiFi bridge to connect to the physical port.
  3.  Can't be done, use a bridge. Option 2 specified above. And finally…
  4.  Yeah, ESXi doesn’t support Wifi (as mentioned many times) but….. If you pass the WiFi hardware to a VM, then use the vSwitching on the host.. Maybe…

As directly quoted by.. “deleted” – “I mean….if you can find a wifi card that capable, or you make a VM such as pfsense that has a wifi card passed through and that has drivers and then you router all traffic through some internal NIC thats connected to pfsense….”

It was this guy's comment that I ran with on this crazy idea, to see if it could be done…. Spoiler alert: yes, that's why I'm writing this blog post.

The Tasks

The Caveats

While going through this project I was hit with one pretty big hiccup, which really sucks, but I was able to work past it. That is… it won't be possible to bridge the WAN/LAN network segments in OPNsense/pfSense with this setup. Which really sucked to have to find out the hard way… as mentioned by pfSense's parent company here:

“BSS and IBSS wireless and Bridging

Due to the way wireless works in BSS mode (Basic Service Set, client mode) and IBSS mode (Independent Basic Service Set, Ad-Hoc mode), and the way bridging works, a wireless interface cannot be bridged in BSS or IBSS mode. Every device connected to a wireless card in BSS or IBSS mode must present the same MAC address. With bridging, the MAC address passed is the actual MAC of the connected device. This is normally a desirable facet of how bridging works. With wireless, the only way this can function is if all the devices behind that wireless card present the same MAC address on the wireless network. This is explained in depth by noted wireless expert Jim Thompson in a mailing list post.

As one example, when VMware Player, Workstation, or Server is configured to bridge to a wireless interface, it automatically translates the MAC address to that of the wireless card. Because there is no way to translate a MAC address in FreeBSD, and because of the way bridging in FreeBSD works, it is difficult to provide any workarounds similar to what VMware offers. At some point pfSense® software may support this, but it is not currently on the roadmap.”

Cool, what does that mean? It means that if you are running a flat /24 network, as most people in home networks run a private subnet of 192.168.0.0/24, this device will not be able to participate in the layer 2 broadcast domain. The good news is ESXi doesn't need, or utilize features of, broadcast domains to work. It does, however, mean that we will need to manage routes, as communications to the host using this method will have to be on its own dedicated subnet and be routed accordingly based on your network infrastructure. If you have no idea what I'm talking about here, then it's probably best not to continue on with this blog post.

Let's get started. Oh, another thing: at the time of this writing a physical port is still required to get this set up, as lots of initial configuration still needs to take place on the ESXi host via the Web GUI, which can initially only be accessed via the physical port. Maybe when I'm done I can make a micro image of the ESXi hdd with the required VM, but even then the passthrough would have to be configured… ignore this rambling, I'm just thinking stupid things…

Step 1) Have an ESXi host with a PCI-e based WiFi card.

I've tested this with both a desktop mobo with a PCI-e WiFi card, and a laptop with a built-in WiFi card; in both cases this process worked.

As you can see here, I have a very basic ESXi server with some old hardware, but otherwise still perfectly usable. For this setup it will be ESXi on a USB stick, and for fun I made a datastore on the remaining space on the USB stick, since it was a 64 gig stick. This is generally a bad idea, again for the same reasons mentioned above: USB sticks are not good at HIGH random I/O, and persistent I/O on top of that. But since this whole blog post is about getting an ESXi host managed via WiFi, which is also frowned upon, why not just go the extra mile and really piss everyone off.

Again, I could have done everything on the existing SATA based SSD and avoided so much potential future issue…. but here I am… anyway…

You may also note that at this point in the post I am connecting to a physical adapter on the ESXi host, as noted by the IP addresses… once complete, these IP addresses will not be used but will remain bound to the physical NIC.

Step 2) Create VM to manage the WiFi.

Again I’m choosing to use OPNsense cause they are awesome in my opinion.

I found I was able to get away with 1 GB of memory (even though the stated minimum is 2) and a 16 GB HDD; if I tried 8 GB, the OPNsense installer would fail, even though it claims to be able to install on 4 GB SD cards.

Also note I manually changed the boot type from BIOS to EFI, which has long been supported. At this stage, also check off booting into the EFI menu; this allows the VMRC tool to connect ISO images from the desktop machine that I'm using to manage the ESXi host at this time.

Installing OPNsense

Now this would be much faster had I simply used the SSD, but since I’m doing everything the dumbest way possible, the max speed here will be roughly 8 MB/s… I know this from the extensive testing I’ve done on these USB drives from the ESXi install. (The install caused me so much grief hahah).

Wow, 22 MB/s, amazing. Just remember though that this will be the HDD for just the OPNsense server, which won't need storage I/O; it'll simply boot and manage the traffic over the WiFi card.

And much like how ESXi was installed on the exact same USB drive, we are going to configure OPNsense to not burn out the drive, by following the suggestions in this thread.

Configuring OPNsense

Much like the ESXi host itself, at this point I have this VM connected to the same VMPG that connects to my flat 192.168 network. This will allow us to gain access to the web interface to configure the OPNsense server in exactly the same manner we are currently configuring the ESXi host. However, for some reason the main interface, while it will default-assign to LAN, won't be configured for DHCP and assumes a 192.168.1.1/24 IP… cool, so log into the console and configure the LAN IP address to be reachable per your config; in my case I'm going to give it an IP address in my 192.168.0.0/24 network.

Again, this IP will be temporary, to configure the VM via the Web GUI. Technically the next couple steps can be done via the CLI, but this is just a preference for me at this time; if you know what you are doing, feel free to configure these steps as you see fit.

I'm in! At this point I configure SSH access and allow root and password login. Since this is a WiFi-bridged VM, and not one acting as a firewall between my private network and the public-facing internet, this is fine for me and allows more management access. Change these how you see fit.

At this point, I skip the GUI wizard. Then I configured the settings per the link above.

Even with only 1 GB of memory defined for the VM, I wondered if this would cause any issues; reboot, system seems to have come up fine… moving on.

Holy crap we finally have the pre-reqs in place. All we have to do now is configure the WiFi card for PCI passthrough, give it to the VM, and reconfigure the network stacks. Let’s go!

Locate WiFi card and Configure Passthrough

So back on the ESXi web interface, go to… Host -> Manage -> Hardware and configure the device for passthrough until… you find all devices are greyed out? What the… I've done this 3 times, what happened….

All PCI Passthrough devices grayed out on ESXi 6.7U3 : r/vmware (reddit.com)

FFS. OK, I should have mentioned this in the pre-reqs, but I guess in all my previous test builds this setting must have been enabled and available on the boards I was using… I hope I'm not hooped here yet again in this dang project…

Great, went into the BIOS and could find nothing specific for VT-d or VT-x (kind of amazed VMs were working on this thing the whole time). I found one option called XD bit or something; it was enabled, I changed it to disabled, and it caused the system to go into a boot loop. It would start the ESXi boot-up and then halfway in randomly reboot; I changed the setting back and it works just fine again.

I'm trying super hard right now not to get angry, cause everything I have tried to get this server up and running without having to use the physical NIC has failed… even though I know it's possible, cause I did this 2 other times successfully, and now I'm hung up cause of another STUPID ****ING technicality.

K, I have one other dumb idea up my ass… I have a USB based WiFi NIC; maybe, just maybe, I can pass that to OPNsense…

VMware seems to possibly allow it: Add USB Devices from an ESXi Host to a Virtual Machine (vmware.com)

OPNsense… Maybe? compatible USB Wifi (opnsense.org)

Here goes my last and final attempt at this hardware….

Attempting USB WiFi Passthrough

Add device, USB Controller 2.0.

Add Device, Find USB device on host from drop down menu.

Boot the VM….. (my heart's racing right now, cause I'm in a HAB (Heightened Anger Baseline) and I have no idea if this final workaround is going to work or not).

Damn, it doesn't seem to be showing under interfaces… checking dmesg on the shell…

I mean, it has the same name as the PCI-e based WiFi card I was trying to use, but that one is 1) pulled from the machine, and 2) we couldn't pass it through; and dmesg shows this one is on usbus1… that has to be it… but why can't I see it in the OPNsense GUI?

OMG… I think this worked… I went to Interfaces -> Wireless, then added the run0 I saw in dmesg….

I then added it as an available interface….

For some weird reason it gave it a weird assignment as WifIBridge… I went back into the console and selected option 2 to assign interfaces:

Yay, now I can see an assignable interface for WAN. I pick run0.

Now back into OPNsense GUI… OMG… there we go I think we can move forward!

Once you see this we can FINALLY start to configure the wireless connection that will drive this whole design! Time for a quick break.

Configuring WiFi on OPNsense

No matter if you did PCI-e passthrough or USB passthrough, you should now have an accessible OPNsense via LAN, with the WiFi device interface assigned to WAN. Now we need to get WAN connected to the actual WiFi.

So… Step 1) Remove all blocking options to prevent any network issues; again, this is an internal bridge/router, and not an edge firewall/NAT.

Uncheck Block Private Networks (Since we will be assigning the WAN interface a Private IP), and uncheck Block bogon networks.

Step 2) Define your IP info. In my case I'm going to be providing it a static IP. I want to give it the one that is currently being used to access it, which is bound to the vNIC, but since it's already bound and in use, we'll give it another IP in the same subnet and move the IP once it's released from the other interface. For now we will also leave it as a slash 32, to prevent a network overlap with the interface bound on LAN that's configured for a /24.

No IPv6.

Step 3) Define SSID to connect to and Password.

I did this, clicked apply, and to my dismay.. I couldn't get a ping response… I SSH'd into the device by the current VMX NIC IP, and even the device itself couldn't ping it (the interface is down, something is wrong).

Checking the OPNsense GUI under Interface Assignments, I noticed 2 WiFi interfaces (somehow, I guess, from me creating it above and then running the wizard on the console?).

Dang, I wanted to grab a snip, but after picking the main one (the other one was called a clone), it has now been removed from the dropdown, and after picking that one the pings started working!

Not sure what to say here, but at this point you should have an OPNsense server accessible by LAN (192.168.0.x) and WAN (192.168.0.x). The next thing is we need to make the web interface accessible via the WAN (wireless) interface.

Basically, something as horrendous as this drawing here:

Anyway… the first goal is to see if the WiFi holds up. To test this I simply unplug the physical cable from the beautiful diagram above and make sure the pings to the WAN interface stay up… and they both went down….

This happened to me on my first go-around testing this setup… I know I fixed it.. I just can't remember how… maybe a reboot of the VM, re-plug the physical cable. Before I reboot this device I'll configure a gateway as well.

Interesting, so yup, that fixed the WiFi issue. OPNsense now came up clean, and the WiFi still gives a ping response even when the physical NIC is removed from the ESXi host… we are gonna make it!

Interesting, the LAN IP did not come up and disappeared. But that's OK, cause I can access the Web GUI via the WAN IP (wirelessly).

OK, we finally have our wireless connection. Now we just need to create a new vSwitch and MGMT network on the ESXi host, which we will connect to OPNsense on the VMX0 side (LAN) that you can see is free to reconfigure. This also freed the IP address I wanted to use for the WAN, but since I've had so many issues… I'm just going to keep the one I got working and move on.

Configure the Special Management Network

I'm going to go on record and say I'm doing it this way simply cause I got this way to work; if you can make it work by using the existing vSwitch and MGMT interfaces, by all means, giver! I'm keeping my existing IPs and MGMT interfaces on the default vSwitch0 and creating a new one for the wireless connection, simply so that if I ever want to physically connect to the existing connection.. I simply plug in the cable.

Having said that on the ESXi host it’s time to create a new vSwitch:

Now create the new VMK; the IP given here is in the new subnet that will be routed behind the OPNsense WAN. In my example I created a new subnet, 192.168.68.0/24; this will be routed to the WAN IP address given to OPNsense, which in my example is 192.168.0.33. (Outside the scope of this blog post, I have created routes for this on my gateway devices. Also, since my machine is in the same subnet as the OPNsense WAN IP, but the OPNsense WAN IP address is not my subnet's gateway IP, this can cause what is known as asymmetric routing; to resolve this you simply have to add the same route I just mentioned to the machine managing the devices. You have been warned; design your stuff better than I'm doing here… this is all simply for educational purposes… don't ever do this in production.)
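On a Windows management box, the route I'm describing looks roughly like this, using the subnet and WAN IP from my example (-p makes it persist across reboots):

route add 192.168.68.0 mask 255.255.255.0 192.168.0.33 -p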

Now we need to create a VMPG for the VM, to connect the VMX0 IP into the new vSwitch and provide the gateway IP for that new subnet (192.168.68.1/24).

Now we can finally configure the vNIC on the OPNsense VM to this new VMPG:

Before we configure the OPNsense box to have this new IP address, let's configure the ESXi gateway to be that:
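If you prefer the shell over the UI, setting the host's default gateway is a one-liner (gateway from my example):

esxcfg-route 192.168.68.1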

OK, finally back on the OPNsense side, let's configure the IP address…

Now to validate this, it should simply be a matter of making sure the ESXi host can ping this IP…
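A quick way to test from the ESXi shell, forcing the ping out the new VMK (vmk1 here is an assumption; list yours with esxcli network ip interface list):

vmkping -I vmk1 192.168.68.1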

All I should have to do now is configure the route on the machine I'm doing all this work from, and I should also be able to ping it…

More success… final step.. unplug the physical NIC; do the pings stay up?? OMG, they do!!! hahaha:

As you can see the physical NIC IP drops but the new secret MGMT IPs behind the WiFi stay up! There’s one final thing we need to do though.

Configure Auto Start of OPNsense

This is a critical step in the design, as OPNsense needs to come up automatically in order to be able to manage the ESXi host if the host is ever rebooted.

Then simply configure the auto start setting for this VM:

I also go in and change the auto start delay to 30 seconds.

Summary

And there you have it… an ESXi host completely managed via WiFi….

There are a ton of limitations:

  1. No Bridging so you can’t keep a flat layer 2 broadcast domain. Thus:
  2. Requires dedicated routes and complex networking.
  3. All VM traffic is best handled directly on the internal vSwitch; otherwise all other VM traffic will share the same WiFi gateway, providing a terrible experience.
  4. The Web interface will become sluggish when the network interface is under load.
  5.  However it is overall actually possible.
  6. * Using PCI-e passthrough disallows snapshots/vMotions of the OPNsense VM, but USB does allow them; when doing a storage vMotion the VM crashed on me, and for some reason auto start got disabled too, so I had to manually start the VM back up. (I did this by re-IPing the ESXi server via the console and plugging in a physical cable.)
  7. With USB WiFi Nic connections can be connected/disconnected from the host, but with PCI-e Passthrough these options are disabled.
  8. With a USB NIC you can add more vNICs to OPNsense and configure them; it just brings down the network overall for about 4-5 min, but be patient, it does work. Here's a Speedtest from a Windows virtual machine on the ESXi host.

Hope you all enjoyed this blog post. See ya all next time!

*UPDATE* Remember when I stated I wanted to keep those VMKs in place in case I ever wanted to plug the physical cable back in? Yeah, that burnt me pretty hard. If you want a backup physical IP, make it something different than your existing network subnets and write it down on the NIC…

For some really strange reason HTTPS would work, but all other connections such as SSH would time out, very similar to an asymmetric routing issue; and actually, it kind of was one. I'm kinda shocked that HTTPS even managed to work… huh…

Here's a conversation I had with others on the VMware IRC channel trying to troubleshoot the issue. Man, I felt so dumb when I finally figured out what was going on.

*Update 2* I noticed that the CPU usage on the OPNsense VM would be very high when traffic through it was taking place (and not even high bandwidth here either), AND with the pf filter service disabled, meaning it was working in pure routing mode.

High CPU load with 600Mbit (opnsense.org)

Poor speeds and high CPU usage when going through OPNsense?

“Furthermore, set the CPU to 1 core and 4 sockets. Make sure you use VirtIO nics and set Multiqueue to 4 or 8. There is some debate going on if it should be 4 or 8. By my understanding, setting it to 4 will force the amount of queues to 4, which in this case matches your amount of CPU cores. Setting it to 8 will make OPNsense/FreeBSD select the correct amount.” Says Mars

“In this case this is also comparing a linux-based router to a BSD based one. Linux will be able to scale throughput much easily with less CPU power required when compared to the available BSD-based routers. Hopefully with FreeBSD 13 we’ll see more optimization in this regard and maybe close the gap a bit compared to what Linux can do.” Says opnfwb

Mhmmm, OK, I guess the first thing I can try is upping the CPU core count. But this VM also hosts the connection I need to manage it… seems others have hit this problem too…

Can you add CPU cores to VM at next restart? : r/vmware (reddit.com)

While the script is decent, the comment by cowherd is exactly what I was thinking I was going to do here: "Could you clone the firewall, add cores to the clone, then start it powering up and immediately hard power off the original?"

I’ll test this out when time permits and hopefully provide some charts and stats.

No coredump target has been configured. Host core dumps cannot be saved.

ESXi on SD Card

Ohhh, ESXi on SD cards; it got a little controversial, but we managed to keep you. Doing the latest install I was greeted with the nice warning "No coredump target has been configured. Host core dumps cannot be saved."

What does this mean, you might ask? Well, in short, if there ever was a problem with the host, the log files to determine what happened wouldn't be available. So it's a pick-your-poison kind of deal.

Store logs and possibly burn out the SD/USB drive storage, which isn't good at that sort of thing, or point them somewhere else. Here's a nice post covering the same problem, and the comments are interesting.

Dan states “Interesting solution as I too faced this issue. I didn’t know that saving coredump files to an iSCSI disk is not supported. Can you please provide your source for this information. I didn’t want to send that many writes to an SD card as they have a limited number (all be it a very large number) of read/writes before failure. I set the advanced system setting, Syslog.global.logDir to point to an iSCSI mounted volume. This solution has been working for me for going on 6 years now. Thanks for the article.”

with the OP responding “Hi Dan, you can definately point it to an iscsi target however it is not supported. Please check this KB article: https://kb.vmware.com/s/article/2004299 a quarter of the way down you will see ‘Note: Configuring a remote device using the ESXi host software iSCSI initiator is not supported.’”

Options

Option 1 – Allow Core Dumps on USB

Much like the source I mentioned above: VMware ESXi 7 No Coredump Target Has Been Configured. (sysadmintutorials.com)

Edit the boot options to allow Core Dumps to be saved on USB/SD devices.

Option 2 – Set Syslog.global.logDir

You may have some other local storage available; in that case, set the variable above to that local or shared storage (shared storage being "unsupported").
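For what it's worth, the same log redirection can be done from the ESXi shell (the datastore path is a placeholder):

# point logs at persistent storage, then reload the syslog daemon
esxcli system syslog config set --logdir=/vmfs/volumes/datastore1/logs
esxcli system syslog reload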

Option 3 – Configure Network Coredump

As mentioned by Thor – “Apparently the “supported” method is to configure a network coredump target instead rather than the unsupported iSCSI/NFS method: https://kb.vmware.com/s/article/74537

Option 4 – Disable the notification.

As stated by Clay – ”

The environment that does not have Core Dump Configured will receive an Alarm as “Configuration Issues :- No Coredump Target has been Configured Host Core Dumps Cannot be Saved Error”.
In the scenarios where the Core Dump partition is not configured and is not needed in the specific environment, you can suppress the Informational Alarm message, following the below steps,

Select the ESXi Host >

Click Configuration > Advanced Settings

Search for UserVars.SuppressCoredumpWarning

Then locate the string and enter 1 as the value

The changes takes effect immediately and will suppress the alarm message.

To extract contents from the VMKcore diagnostic partition after a purple screen error, see Collecting diagnostic information from an ESX or ESXi host that experiences a purple diagnostic screen (1004128).”
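If you'd rather do that from the ESXi shell instead of clicking through the UI, the equivalent should be:

esxcli system settings advanced set -o /UserVars/SuppressCoredumpWarning -i 1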

Summary

In my case it's a home lab and I wasn't too concerned, so I followed Option 4, then simply disabled file core dumps following the second set of steps in Permanently disable ESXi coredump file (vmware.com)

Note* Option 2 was still required to get rid of another message: System logs are stored on non-persistent storage (2032823) (vmware.com)

Not sure, but maybe disabling coredumps still helps with I/O. Will update again if new news arises.

Manually Fix Veeam Backup Job after VM-ID change

The Story

There have been a couple of times where my VM-IDs changed:

  • A vSphere server has crashed beyond a recoverable state.
  • A server has been removed and added back into the inventory in vSphere.
  • Manually move a VM to a new ESXi host.
    • VM removed from inventory, and readded.
  • Loss of the vCenter Server.
  • Full VM Recovery via Veeam.

What sucks is when you go to run the job in Veeam after any of the above, the job simply fails to find the object. You can edit the job by removing the VM and re-adding it, but this will build a whole new chain, which you can see in the Veeam repo after such events occur:

As you can see: two chains. This has been an annoyance of mine for a long time, as there's no way to manually set the VM-ID in vCenter; it's all auto-managed.

I found this Veeam thread discussing the same issue, and someone mentioned “an old trick” which may apply, and linked to a blog post by someone named “Ideen Jahanshahi”.

I had no idea about this, let’s try…

Determine VM-ID on vCenter

The source uses PowerCLI, which I've covered installing, but it's easier to just use the Web UI and grab it from the address bar after the vms parameter.
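Depending on your client version the URL shape differs, but with the VM selected it contains the MoRef ID; something like this (the hostname is a placeholder, and vm-55633 is the ID you want):

https://vcenter.local/ui/app/vm;nav=v/urn:vmomi:VirtualMachine:vm-55633/summary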

Determine VM-ID in Veeam

The source installs SSMS, and much like in my fixing-WSUS post, I don't like installing heavy stuff on my servers to do managerial tasks. Lucky for me, SQLCMD is already installed on the Veeam server, so no extra software is needed.

Pre-reqs for SQLCMD

You'll need the hostname (run the command hostname).

You'll need the instance name (use services.msc to list SQL services).

Connect to Veeam DB

Open CMD as admin

sqlcmd -E -S Veeam\VEEAMSQL2012

use VeeamBackup
:setvar SQLCMDMAXVARTYPEWIDTH 30
:setvar SQLCMDMAXFIXEDTYPEWIDTH 30
SELECT bj.name, bo.object_id FROM bjob bj INNER JOIN ObjectsInJobs oij ON bj.id = oij.job_id INNER JOIN Bobjects bo ON bo.id = oij.object_id WHERE bj.type=0
go

For some reason the above query wouldn't work on my latest build/install of Veeam, but this one worked:

SELECT name, job_id, bo.object_id FROM bjobs bj INNER JOIN ObjectsInJobs oij ON bj.id = oij.job_id INNER JOIN BObjects bo ON bo.id = oij.object_id WHERE bj.type=0

In my case, after removing the VM from inventory and re-adding it:

As you can see, they do not match, and when I check the VM size in the job properties, the size can't be calculated cause the link is gone.

Fix the Broken Job

UPDATE bobjects SET object_id = 'vm-55633' WHERE object_id='vm-53657'
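To confirm the change took, re-run the earlier SELECT; the Veeam object_id should now match what vCenter shows:

SELECT name, job_id, bo.object_id FROM bjobs bj INNER JOIN ObjectsInJobs oij ON bj.id = oij.job_id INNER JOIN BObjects bo ON bo.id = oij.object_id WHERE bj.type=0
go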

After this I checked the VM size in the job properties and it was calculated. To my amazement it fully worked; it even retained the CBT points, and the backup job ran perfectly. Woo-hoo!

This info is for educational purposes only; what you do in your own environment is on you. Cheers, hope this helps someone.

vCLS High CPU usage

The Story

So I went to vMotion a VM to do some maintenance work on a host. The target machine was well over 50% CPU usage.. what?! That can't be right, it's not running anything…

I tried hard-powering the VM off, but it just came right back up, sucking CPU cycles with it….

The Hunt

Alright Google, what ya got for me… I found this blog post by "Tripp W Black"; he mentions stopping a vCenter service called "VMware ESX Agent Manager", which he stops, and then deletes the offending VMs. Sounds like a plan. Let's try it: log into VAMI (vcenter.consonto.com:5480).

K, let's stop it… let me hard power off the VM now… ehh, the VM is staying dead, and the host CPU:

K let’s go kill the other droid I have causing an issue…

OK, I got them all down now, but the odd part is I can't delete them from disk like Sir Black mentioned in his blog post. The option is greyed out for me; let's start the service and see what happens…

The Pain

Well, that was extremely annoying; it seemed to have worked only for a moment, and the CPU usage came right back, so I stopped the service again, but I still can't delete the VMs…

There are similar issues in vSphere 8, even suggestions to stay running in retreat mode, which I'll get to in a moment. So, if you are unfamiliar, vCLS VMs are small VMs distributed to ESXi hosts to keep HA and DRS features operational even if vCenter itself goes down. The thing is, I'm not even using HA or DRS; I created a cluster merely for EVC purposes, so I can move VMs between hosts live at my own leisure and without downtime. What's annoying is I shouldn't have to spend half my weekend day trying to solve a bug in my HomeLab due to poor design choices.

The Constructive Criticism

VMware…. do not assume a cluster alone requires vCLS. Instead, enable vCLS only when HA or DRS features are enabled.

Now that we have that very simple thing out of the way.

The Fix

So, as mentioned, we are able to stop the vCLS VMs when we stop the EAM service on vCenter, but that won't be a solution if the server gets rebooted. I decided to Google how other people delete vCLS VMs when it doesn't seem possible.

I found this reddit thread, in which they discuss the same thing mentioned above, "Retreat Mode". However, after setting the required settings (which are apparently tattooed once done), I still couldn't delete the VMs, even after restarting the vpxd service. Much like 'bananna_roboto', I ended up deleting the vCLS VMs from the ESXi host UI directly; however, when checking the vCenter UI they still showed on all the hosts.
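For reference, retreat mode is set by adding a per-cluster key under the vCenter object's Advanced Settings; it should look roughly like this, where domain-c1234 is a made-up example of your cluster's domain ID (grab the real one from the browser URL with the cluster selected):

config.vcls.clusters.domain-c1234.enabled = false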

After rebooting the vCenter server, all the vCLS VMs were gone. At first I thought they'd come back, but since the retreat mode setting was applied, it seems they do not get recreated. Hence, I will leave retreat mode enabled for now, as suggested in the reddit thread, since I am not using HA or DRS.

So if you want to use EVC in a cluster, but not HA and DRS and would like to skim even more memory from your hosts, while saving on buggy CPU cycles, apparently “Retreat mode” is what you need.

If you do need those features, and you are unable to delete the old vCLS VMs, and restarting the EAM service doesn’t resolve your issue (which it didn’t for me), you may have to open a support case with VMware.

Anyway, I hope this helped someone. Cheers.

USB NICs on ESXi hosts

Quick post here. I wanted to use a USB based NIC to allow one of my hosts to be able to host the firewall used for internet access; this would allow for host upgrades without downtime.

My first concern was the USB bus on the host; being a bit older, I double-checked, and sad days, it was only USB 2.0. Checking my internet speed, it turns out it's 300 mbps, and USB 2.0 is 480 mbps, so while I may only be able to use less than half of the full speed of the gig NIC, it was still within spec of the backend, and thus won't be a bottleneck.

Now, when I plugged in the USB NIC, I sadly was not presented with a new NIC option on the host.

When I googled this, I found an awesome post by none other than one of my online heroes, William Lam, in which he states the following:

“With the release of ESXi 7.0, a USB CDCE (Communication Device Class Ethernet) driver was added to enable support for hardware platforms that now leverages a Virtual EEM (Ethernet Emulation Module) for their out-of-band (OOB) management interface, which was the primary motivation for this enhancement.

One interesting and beneficial side effect of this enhancement is that for any USB network adapters that conforms to the CDCE specification, they would automatically get claimed by ESXi and show up as an available network interface demonstrated in my homelab with the screenshot below.”

Then he shows a snippet of running a command:

esxcfg-nics -l

Which for me listed the same results as the UI:

Considering I'm running the latest build of 7.x, I guess the device doesn't "conform to the CDCE specification".

A bit further in the post he shows running:

lsusb

When run, it shows the device is seen by the host:

Let's try to install the Fling's USB driver and see if it works.

“This Fling supports the most popular USB network adapter chipsets found in the market. The ASIX USB 2.0 gigabit network ASIX88178a, ASIX USB 3.0 gigabit network ASIX88179, Realtek USB 3.0 gigabit network RTL8152/RTL8153 and Aquantia AQC111U.”

Step 1 – Download the ZIP file for the specific version of your ESXi host and upload to ESXi host using SCP or Datastore Browser. Done

Luckily the error message was clickable, and it provided a helpful hint to navigate to the host, as it may be due to the certificate not being trusted; sure enough, that was the case.

Step 2 – Place the ESXi host into Maintenance Mode using the vSphere UI or CLI (e.g. esxcli system maintenanceMode set -e true)

For some reason the command line wasn't returning from the command above, and I had to enable maintenance mode via the UI. Done.

Step 3 – Install the ESXi Offline Bundle (6.5/6.7) or Component (7.0)

For (7.0+) – Run the following command on ESXi Shell to install ESXi Component:

esxcli software component apply -d /path/to/the component zip

For (6.5/6.7) – Run the following command on ESXi Shell to install ESXi Offline Bundle:

esxcli software vib install -d /path/to/the offline bundle zip

and my results:

Ohhh FFS… Google!!!!!! HELP!!!! Only one hit…

The only 2 responses close to an answer are… "Ok I can confirm that if you create a 7u1 ISO and upgrade to that first, you can then add the latest fling module to it. Key bit of info that is not in the installation instructions" and "Workaround: Update the ESXi host to 7.0 Update 1. Retry the driver installation."

Uhhhh I thought I just updated my hosts to the latest patches… what am I running?

“7.0.3, 21686933”… checking the source Fling's page, oh… it's a dropdown menu… *facepalm*

I had downloaded the ESXi 8 version; let me try the 703 one…

ESXi703-VMKUSB-NIC-FLING-55634242-component-19849370.zip
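So, the same install command as before, just pointed at the 703 build (the datastore path is a placeholder for wherever you uploaded it):

esxcli software component apply -d /vmfs/volumes/datastore1/ESXi703-VMKUSB-NIC-FLING-55634242-component-19849370.zip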

Reboot! and?

Ehhh, it worked! I can now bind it to a vSwitch. I hope this helps someone :). I'm also wondering if this will burn me on future ESXi updates/upgrades. I'll post updates if it does.

TPM security on a ESXi VM

A great part about vSphere 7 is it introduced the ability to add TPM-based hardware to a VM.

Let’s see if we can pull it off in our lab.

What I need is a Key Provider. Lucky for us, with 7.0.3 VMware provides a "Native Key Provider".

During my deployment of the NKP, one requirement is to make a backup of the key, which was failing for me. I found this VMware thread with someone having the same issue.

Sure enough, the comment by "acartwright" was pretty helpful, as I too opened the browser console and noticed the CORS errors. The only diff was I wasn't using CNAMEs per se, but I had done a pilot of vCenter renaming; the fact that the names were showing up as not matching the ones listed in the console reminded me of that. When I went to check the hostname and the local hosts file, sure enough, they had the incorrect name in there.

So, after following the steps in my old blog post to fix the hostname and the local hosts file, I tried to back up the NKP and it worked this time. 😀

Sure enough, after this I went to add the TPM and couldn't find it. Oh right, it's a newer feature; I'll have to update the VM's compatibility mode.

Made a snapshot, updated to the latest hardware ID, boots fine; let's add the TPM hardware. Error: can't add a TPM with snapshots. Ugh, fine, delete the snapshot (tested the VM boots fine before doing this), add the TPM, success.

Before changing the VM boot option to EFI, boot the VM into Windows RE and use the mbr2gpt command to convert the boot partitions to the proper type supported by EFI.
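From the Windows RE command prompt, that conversion is roughly as follows (it defaults to the system disk; validate first, and add /allowFullOS only if running from the booted OS instead of RE):

mbr2gpt /validate
mbr2gpt /convert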

Once completed, change VM boot options to EFI, and check off secure boot.

Congrats, you just configured an ESXi VM with a vTPM module. 🙂

 

ESXi 6.x Datastore Not Mounted

Quick post here. I had to recover from a flooded basement (sorry for the day outage). I had to put my discs in another server and load FreeNAS, import my ZFS volumes, recreate the iSCSI targets, and then add them to my ESXi hosts; rescanning the HBAs shows the disks…

but the datastores were not visible…

So I googled and found this VMware thread with some helpful commands to try. (I do kind of agree with the OP that it's annoying they removed the front-end UI for import that could handle this.)

esxcli storage vmfs snapshot list

esxcfg-volume -M UUID
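(For context: the first command lists volumes the host detected as snapshots/replicas, along with their UUIDs; esxcfg-volume -M then force-mounts the volume with that UUID persistently, so it survives reboots.)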

Ehh it worked!

Hope this helps someone. If this doesn't work, you might have some other underlying issue.

vSphere HA Agent cannot be correctly installed or configured… again

Story

Another vCenter Patch, Another problem 😀

This seems to be a reoccurring story these last couple posts…

Error on Host

This time, after updating, a host in the cluster again had the error message.

Troubleshooting

Unlike the last time this happened, the event log wasn't as blatantly (flooded) complaining about /tmp being full. Checking the host with

vdf -h

which showed it only 90% full, still pretty high, which might explain the one log event that I did see about it:

The ramdisk 'tmp' is full. As a result, the file /tmp/img-stg/data/vmware_f.v00 could not be written

Which was in the log right after this event, of attempting to install a base ESXi image?

Installing image profile '(Updated) HPE-ESXi-Image' with acceptance level checking disabled

This seemed a bit weird, but I couldn't find any info other than what's usually a very Microsoft-type answer of "you can just ignore it", or "usually this is not an issue, it's just vCenter saying it is connecting to the ESXi host and installing its agent".

OK I guess… moving on… the very next error event was:

Could not stage image profile '(Updated) HPE-ESXi-Image': ('VMware_bootbank_vmware-fdm_7.0.2-18455184', '[Errno 28] No space left on device')

Huh. Now note, this host was installed running the official VMware image provided by HPE for this exact hardware, supported by the VMware HCL, so there should be no funny business. However, I feel maybe there's a bit of the known HPE bug, as mentioned the last time this happened; it just hasn't fully flooded /tmp yet.

Lil Side Trail

So, a couple things to note here. First, the ESXi image is installed on a USB/SD card style setup; as such, it's well known that you should define the persistent log location, as well as the scratch location. However, not many sources specify changing the system swap location.

  1. Persistent Log; VMware KB; Tech Blogger
    (Most standard ESXi Log info)
  2. Scratch Log: VMware KB; Tech Blogger 1; Tech Blogger 2
    (Crash Logs, Support log creations)
  3. Swap Location: VMware Doc 1 (Configure), VMware Doc2 (About), Tech Blogger Who seem to regurgitate the exact about page from VMware.

However, researching this even more, lots of posts on reddit mentioned the swap file for VMs being in their VM directories, so if using a shared datastore they will reside there, and I shouldn't see issues around swap usage at all at the host level.

If you look in the vCenter Web UI on an ESXi host, there are two options available: VM – Swap, and System Swap.

The VMware docs don't seem to accurately describe the difference between these two options.

Looking up the error about not being able to stage the file, I found this one blog post, which of course mentioned changing the swap location to get past the error…

The main thing mentioned by the blogger is "The problem is caused by ESXi not having enough free space available to extract the installation packages.", but he failed to specify where exactly that is, and the event log didn't specify it either. Now, since his solution was to adjust the system swap location, it begs the question: is the package extraction location the system swap location?

Since the host settings seem to be only specified with the alternative option checkboxes as:

Can use host cache
Can use datastore specified by host for swap files

It's still not fully clear to me where the swap is actually located with these assumed default settings, or whether extraction of the image actually uses swap, or why the same image already on the ESXi host is being re-applied when you upgrade vCenter.

Resolution

So many questions, so few answers. So, unfortunately, I'm going to go on a bit of a whim and simply try exactly what I did before: clear the file from the /tmp location that was taking up a lot of its space, and install the HPE patch for the known bug, in hopes it resolves the issue….

Sure enough, the exact same thing happened as in my initial post; it just seems /tmp wasn't fully full, so the symptoms were a bit different.

  1. vMotion all VMs to another host in the cluster (amazingly, vMotion works without issue)
  2. Ignore the HA warning on the VMs migrated
  3. Place Host into Maintenance mode (This clears the HA warnings on the VMs and cluster)
  4. Verify /tmp has room. Update any ESXi packages from the hardware vendor if applicable.
  5. Reboot the host.
  6. Exit Maintenance mode.

Hope this helps someone who might see the same type of error events in their ESXi event logs.

ESXi Update Network Config Failed
Set ESXi IP via CLI

Real quick post here. I was moving my ESXi hosts and vCenter to a new dedicated subnet. I did the usual: had a temp Windows system in the new subnet, created a VMK with a temp IP in the new subnet, connected to the ESXi Web UI via the new temp IP from the temp Windows machine, reconfigured the default TCP/IP stack default gateway, changed the VMK0 IP address (and edited the management port group VLAN ID if applicable), and away I'd go.

However, on this one host, for some unknown stupid reason, it would simply fail: "Failed – An Error occurred during host configuration", and the detailed log was just as vague: "operation failed diagnostics report unable to set network unreachable". OK… whatever, that shouldn't matter, do as I tell you! Here's a snippet of the error, and the CLI command that simply worked without bitching.

I just figured let's try the CLI way and see if it worked, and it turns out it did. Here's the source I used to figure out the command syntax.

The commands I used:

Get IPs:

esxcli network ip interface ipv4 get

Set new IP:

esxcli network ip interface ipv4 set -i vmk1 -I 1.1.1.1 -N 255.255.255.0 -t static
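And for the default gateway piece I mentioned doing in the UI, the CLI equivalent should be (the gateway IP is a placeholder):

esxcli network ip route ipv4 add --network default --gateway 1.1.1.254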

Hope this helps someone.