vCenter 503 Service Unavailable

I was going to test a auditing script from a DefCon presenter on my AD server, when I was adding the USB controller and the USB stick I was passing thorugh to get the script in my VM was being weird.

First USB 3.0 connected just fine, and connected the USB device to the VM, but diskpart was not showing it. So I went to remove it and try a USB 2.0 controller, that failed to connect since the USB 3.0 was still showing there and I selected to remove it again, which it errored another concurrent task. Makes sense, till refreshing the page told me unprivileged account. I wasn’t sure what this was about, so I decided to open another window and navigate to my center web app… 503 service unavailable:

“503 Service Unavailable (Failed to connect to endpoint: [N7Vmacore4Http20NamedPipeServiceSpecE:0x000055aec30ef1d0] _serverNamespace = / action = Allow _pipeName =/var/run/vmware/vpxd-webserver-pipe)”

What the… rebooting the VCSA showed no success still same error even with an incognito window.. ughh.

I found this thread: https://communities.vmware.com/thread/588755

I was going through this, and decided to try to renew the certs, even though my internal PKI certs were still valide (AFAIK, and checking the cert provided when accessing the page). Now here’s the thing, while I ran the certificate-manager script and renewed all the certs, I noticed my AD server somehow was down. I booted it back up. I’m not exactly sure which fixed it. So I decided to take another snapshot while it was in this “fixed state” and revert to the  broken state. After restoring o the broken state nothing was responding at all on the https service from the VCSA, so I gave it a simple reboot (which I did initially before I noticed my AD server was down, for some reason). Sure enough after the reboot everything was working fine with my internal PKI certs.

I guess if you set vCenter to use MS AD as the primary login domain and that domain is not available the web management service becomes unavailable… that kind of sucks. I should have noticed my AD was not operational but I didn’t have monitoring on it 😉 or use my local workstation as a AD member. Mostly just random VMs I have for testing.

Like most people, should have looked at the logs for a better idea of what the root cause was. I threw 2 darts at a dart board and had to revert to find the true root cause. Not the best way to troubleshoot, but sometimes if logs are not available it is another method…

Installing PowerCLI 12.0 Offline

PowerCLI 12.0

Offline Install

Checking VMwares source wasn’t too insightful…

Just this with the “Download” button redirecting to an alternative site non-other than powershellgallery.com …clicking manual Download gives you the raw nuget package let’s try to install first normally.

Install-Module -Name VMware.PowerCLI

No way it failed, expected, and it even states a warning about the network.

Alright so using an online computer copy the nuget package to the offline (use USB sticks, Floppy drives, Zip Drives, serial modem if that’s what it takes…)

In my case I was testing this on a VM and simply used a USB stick to mount it to the VM from the VMRC console, and copied the nuget file to c:\temp\PowerCLI

This from this MS Doc page on the cmdlet, is for Visual Studio, we are using powershell only…

This topic describes the command within the Package Manager Console in Visual Studio on Windows. For the generic PowerShell Install-Package command, see the PowerShell PackageManagement reference.

Sure enough this is where I gave up on this path. All the new stuff is nice with it all being connected makes life super easy, but in those locked down situations this is a hassel. Since I wasn’t sure how to install the nuget package via a simple ID option like Install-Package for VS PS, there wasn’t one for the regular PS Install-Package cmdlet. Then I went to google how to accomplish this and was a bit annoyed at all the steps required to do it via the package manager… Read this by William on Stackoverflow for more details.

Lucky for me I found an alternative blog post, which does an alternative offline install and much, much simpler.

From the online system instead of saving the nuget package we save the modules files themselves directly.

 Save-Module -Name VMware.PowerCLI -Path C:\temp\PSModules

Copy the entire contents of the PSModules folder to a storage medium of your choice (e.g. USB flash drive) and transfer the files to the desired offline system where PowerCLI is needed.

If you have admin rights on the target system, you can copy files to the location below.

 C:\Program Files\WindowsPowerShell\Modules

At this point he goes on about some settings and stuff, I wasn’t exactly sure how to use PowerCLI, as usually it opens up in a custom PS window before. Now you simple import-module *modulename*

Import-Module VMware.PowerCLI

Now creating custom ESXi images should be a breeze!

Extra Bits

Customer Experience Improvement Program (CEIP)

The VMware Customer Experience Improvement Program collects data about the use of VMware products. You can either agree (true) or disagree (false). For offline systems, only the rejection (false) makes sense. The command shown below suppresses future notifications within PowerCLI.

Set-PowerCLIConfiguration -Scope AllUsers -ParticipateInCeip $false

Ignore invalid SSL certificates

When using self-signed certificates in vCenter, PowerCLI will deny the connection. This behavior can be suppressed with the command:

Set-PowerCLIConfiguration -Scope AllUsers -InvalidCertificateAction Warn

Found the types from this old 5.1 documentation you can also set it to ignore instead of warn. 🙂 Cheers!

ESXi Upgrade Failure

Upgrading one of my ESXi hosts in my lab failed on me, sure enough I figured this might happened and put a head on my usually headless server. This means I plugged in a monitor. at the screen I was this:

well that sucks, googling I found this thread from VMware.

looking closer at the boot error before this it stated:

system does not have secure boot enabled. This being an old mini desktop from the mid 2000’s it had uEFI but did not have the “feature” of secure boot. Clearly an after thought of the time. Now the odd part is when I hit the boot menu key “f12” in my case, I had the “legacy” BIOS style, list as P0: Hard Disk and EFI: Hard Disk. When I picked P0 one it booted just fine. So I figured just a simple boor order fix adjust some settings much like the thread (disable EFI boot and stick with legacy). I couldn’t see a way in my EFI/BIOS options to disable the alternative boot types, so I put the legacy type at the top of the list and the EFI one at the bottom, yet every time I booted it would boot the EFI one. When I check the vCentre system it wouldn’t remediate aka update to the new version, so I had to click remediate, run downstairs, and ensure I was there to pick the Legacy Disk boot, even after setting the boot order in the BIOS it wouldn’t stick to legacy and this was the only way I could get the upgrade to succeed.

Dang Computers…

Oh yeah.. this happened to me to, while I was trying to migrate some servers, I wanted to move some VM’s vNic into different VMPGs so I decided to rename the one they were currently using. I created the new VMPG in the alternative vSwitch, and i was a bit stumped to see them already there. I had presumed that once I renamed the VMPG it would reflect as the new name on the VM settings and still be on that old vSwitch (in secret it is). When I went to delete the vSwitch it told me error failed to delete “a specified parameter is not correct”. Googling I found this 10 year old blog that still relevant in ESXi 6.5.

Had to simply edit the VMs vNics and change them back. Dang Computers…

VMware ESXi boot and the Config

Sadly this post will be really short as again, lots going on. Recovering a host that failed after a regular reboot, which had a superblock corruption on it’s main OS drive. Also, the BELK series will be done, I just need a bit more time. Sorry for the delays.

“Failed to load /sb.v00” [Inconsistent Data]

Since this drive was not on the main datastore on the host all the VMs were unaffected.

Now loading linux showed the drive data was till accessible, but I also had a feeling this USB drive was on it’s way out. I created a copy using DD, *sadly I didn’t do it the smart way and place it on a drive big enough to save it as a image file, but instead directly to another drive of the same size.

I tried to install the same image of ESXi on top of the current one in hopes it would fix the boot partition files along the way. This only made the host get past /sb.v00 and vault randomly past it with “Fatal Error: 6 [Buffer Too Small]”

I was pretty tired at this point since the server boot times are rather long and attempts were becoming tedious. I did another DD operation of the drive, to the same drive (still not having learned my lesson) and when I awoke to my dismay, it failed only transferring 5 gigs with an I/O error. This really made me sure the drive was on the way out, but it was still mountable (the boot partitions 5, 6 and 8)

At this point you might be wondering, why doesn’t he just re-install and reload a backup config? Which is fair question, however one was not on hand, but surely it must be somewhere on the drive. I know how to create and recover on a working host but a one that can’t boot? Then I found this gem.

Now through out my attempts I did discover the boot partitions to be 5 and 6 and I did even copy them from a new install to my copied version I made about and it did boot but was a stock config. I was stumped till I read the section from the above blog post on “How to recover config from a system that doesn’t boot”. Line 7 was what nailed it on the head for me:

“mount /dev/sda5 /mnt/sda5

7. In the /mnt/sda5 directory, you can find the state.tgz file that contains ESXi configuration. This directory (in which state.tgz is stored) is called /bootblank/ when an ESXi host is booted.”

I was just like … wat? That’s it. Grabbed the bad main drive mounted on a linux system, saw the state.tgz file and made a copy of it, connected the new drive that had a base ESXi config, replaced the state.tgz file with the one I copied, booted it and there was the host in full working state with all network configs and registered VMs and everything.

Not sure why the config is stored in the boot partition, but there you go. Huge Shout out to Michael Bose for his write I suggest you check it out. I have saved it case it disappears from the internet and I can re-publish it. For now just visit the link. 🙂

Using A USB Device as a Datastore on VMware ESXi

USB datastore

Attach USB device in Windows -> DiskPart -> Select Disk -> Clean

To see USB device on host, stop arbitrator service, save to config, reboot

Now the hard parts, normally even following guide to see USB stick on host mount 4 gig volume, I had 16 gig cruiser to use.

Source

Perform a lspci -v to get all the USB UHCI and EHCI controllers to show up.
This shows up for example as:

DEVICE=/dev/disks/t10.JMicron_USB_to_ATA2FATAPI_Bridge
partedUtil mklabel ${DEVICE} msdos
END_SECTOR=$(eval expr $(partedUtil getptbl ${DEVICE} | tail -1 | awk '{print $1 " \\* " $2 " \\* " $3}') - 1)
/sbin/partedUtil "setptbl" "${DEVICE}" "gpt" "1 2048 ${END_SECTOR} AA31E02A400F11DB9590000C2911D1B8 0"
/sbin/vmkfstools -C vmfs5 -b 1m -S $(hostname -s)-local-datastore ${DEVICE}:1

That easy 😉

Requesting, Signing, and Applying internal PKI certificates on VCSA 6.7

The Story

Everyone loves a good story. Well today it begins with something I wanted to do for a while but haven’t got around to. I remember adjusting the certificates on 5.5 vCenter and it caused a lot of grief. Now it may have been my ignorance it also may have been due to poor documentation and guides, who knows. Now with VMware now going full linux (Photon OS) for the vCenter deployments (much more light weight) it’s still nice to see a green icon in your web browser when you navigate the nice new HTML5 based management interface. Funny that the guide I followed, even after applying their own certificate still had a “not secure” notification in their browser.

This might be because he didn’t install his Root CA certs into the computers trusted CA store on the machine he was navigating the web interface from. However I’m still going to thank RAJESH RADHAKRISHNAN for his post in VMArena. it helped. I will cover some alternatives however.

Not often I do this but I’m lazy and don’t feel like paraphrasing…

VCSA Certificate Overview

Before starting the procedure just a quick intro for managing vSphere Certificates, vSphere Certificates can manage in two different modes

VMCA Default Certificates

VMCA provides all the certificates for vCenter Server and ESXi hosts on the Virtual Infrastructure and it can manage the certificate lifecycle for vCenter Server and ESXi hosts. Using VMCA default the certificates is the simplest method and less overhead.

VMCA Default Certificates with External SSL Certificates (Hybrid Mode)
This method will replace the Platform Services Controller and vCenter Server Appliance SSL certificates, and allow VMCA to manage certificates for solution users and ESXi hosts. Also for high-security conscious deployments, you can replace the ESXi host SSL certificates as well. This method is Simple, VMCA manages the internal certificates and by using the method, you get the benefit of using your corporate-approved SSL certificates and these certificates trusted by your browsers.

Here we are discussing about the Hybrid mode, this the VMware’s recommended deployment model for certificates as it procures a good level of security. In this model only the Machine SSL certificate signed by the CA and replaced on the vCenter server and the solution user and ESXi host certificates are distributed by the VMCA.

I guess before I did the whole thing, were today I’m just going to be changing the cert that handles the web interface, which is all I really care about in this case.

Requirements

  • Working PKI based on Active directory Certificate Server.
  • Certificate Server should have a valid Template for vSphere environment
    Note : He uses a custom template he creates. I simply use the Web Server template built in to ADCS.
  • vCenter Server Appliance with root Access

Requesting the Certificate

Now requesting the certificate requires shell access, I recommend to enable SSH for ease of copying data to and from the VCSA as well as commands.

To do this log into the physical Console of the VCSA, in my case it’s a VM so I opened up the console from the VCSA web interface. Press F2 to login.

Enable both SSH and BASH Shell

OK, now we can SSH into the host to make life easier (I used putty):

Run

 /usr/lib/vmware-vmca/bin/certificate-manager

and select the operation option 1

Specify the following options:

  • Output directory path: path where will be generated the private key and the request
  • Country : your country in two letters
  • Name : The FQDN of your vCSA
  • Organization : an organization name
  • OrgUnit : type the name of your unit
  • State : country name
  • Locality : your city
  • IPAddess : provide the vCSA IP address
  • Email : provide your E-mail address
  • Hostname : the FQDN of your vCSA
  • VMCA Name: the FQDN where is located your VMCA. Usually the vCSA FQDN

Once the private key and the request is generated select Option 2 to exit

Next we have to export the Request and key from the location.

There are several options on how to compete this. Option 1 is how our source did it…

Option 1 (WinSCP)

using WinSCP for this operation .

To perform export we need additional permission on VCSA , type the following command for same

chsh -s /bin/bash root

Once connected to vCSA from winscp tool navigate the path you have mentioned on the request and download the vmca_issued_csr.csr file.

Option 2 (cat)

Simple Cat the CSR file, and use the mouse to highlight the contents. Then paste it into ADCS Request textbox field.

Signing The Request

Now you simply Navigate to your signing certificate authorizes web interface. usually you hope that the PKI admin has secured this with TLS and is not just using http like our source, but instead uses HTTPS://FQDN/certsrv or just HTTPS://hostname/certsrv.

Now we want to request a certificate, an advanced certificated…

Now simply, submit and from the next page select the Base 64 encoded option and Download the Certificate and Certificate Chain.

Note :- You have to export the Chain certificate to .cer extension , by default it will be PKCS#7

Open Chain file by right click or double click navigate the certificate -> right click -> All Tasks -> export and save it as filename.cer

Now that we have our signed certificate and chains lets get to importing them back into the VCSA.

Importing the Certificates

Again there are two options here:

Option 1 (WinSCP)

using WinSCP for this operation .

To perform export we need additional permission on VCSA , type the following command for same

chsh -s /bin/bash root

Once connected to vCSA from winscp tool navigate the path you have mentioned on the request and upload the certnew.cer file. Along with any chain CA certs.

Option 2 (cat)

Simply open the CER file in notepad, and use the mouse to highlight the contents. Then paste it into any file on the VCSA over the putty session.

E.G

vim /tmp/certnew.cer

Press I for insert mode. Right click to paste. ESC to change modes, :wq to save.

Run

 /usr/lib/vmware-vmca/bin/certificate-manager

and select the operation option 1

Enter administrator credentials and enter option number 2

Add the exported certificate and generated key path from previous steps and Press Y to confirm the change

Custom certificate for machine SSL: Path to the chain of certificate (srv.cer here)
Valid custom key for machine SSL: Path to the .key file generated earlier.
Signing certificate of the machine SSL certificate: Path to the certificate of the Root CA (root.cer , generated base64 encoded certificate).

Piss what did I miss…

That doesn’t mean shit to me.. “PC Load letter, wtf does that mean!?”

Googling, the answer was rather clear! Thanks Digicert!

Since I have an intermediate CA, and I was trying either the Intermediate or the offline it would fail.. I needed them both in one file. So opened each .cer and pasted them into one file “signedca.cer”

Now this did take a while, mostly around 70% and 85% but then it did complete!

Checking out the web interface…

Look at that green lock, seeing even IP listed in the SAN.. mhm does that mean…

Awwww yeah!!! even navigating the VCSA by IP and it still secure! Woop!

Conclusion

Changing the certificate in vCenter 6.7 is much more flexable and easier using the hybird approach and I say thumbs up. 😀 Thanks VMware.

Ohhh yea! Make sure you update your inventory hosts in your backup software with the new certificate else you may get error attempting backup and restore operations, as I did with Veeam. It was super easy to fix just validate the host under the inventory area, by going through the wizard for host configuration.

Rename a vSwitch in vSphere

I noticed I had named some vSwitches in the new hosts builds I had. This was nice. However I also noticed I couldn’t name a vSwitch when creating in vCenter. So how did I name them.

I quickly searched google, but the primary results were not what I was expecting….

1, 2, 3, 4

All of which either stated to edit the host config file, or use cli commands… well I know I did do the first thing, and I don’t remember using the CLI. Also I don’t remember having to reboot the host. The only diff I can think of is that I named them at creation, not after the fact, but the vCenter wizard has no option for that… but sure enough I checked my documentation.

If you login into a host directly, you can name a vSwitch right when creating it. This just requires to be done on each host in the cluster. It’s nice but is it worth it?

Once you have it setup it is really nice to have named vSwitches.

of course this doesn’t include dvSwitches, as those you can name and usually require uplinks to communicate between hosts. However you can still deploy a test dvSwitch to multiple hosts without an uplink though those VMs would only be useful on a single host… which defeats the purpose of it, but you can move the VMs as a whole group between VMs, and if that “Test switch” need any change it would be distributed between all hosts.

Mitigating from CVE-2018-3646 on ESXi 6.7

To keep this short, new VCSA 6.7 has VUM built in. No more Flash needed. Yay finally.

So I upload the latest 6.7u3 image, create my baseline, and test remedy one of my simple laptop hosts. After system reboots and comes back on VCSA dashbaord… uhhh what’s with this yellow warning icon…. Summary…

OK great, so after years of Intel being ahead of AMD, looks like at the cost of some pretty shitty shortcuts. and these shortcuts have caused Intel a huge problem, and pretty everyone else. Since it affected everyone, everyone has some form of write up on it. In this case VMware has coded the above warning, with a reference to this KB, so you know read that if you want a dry overview.

As you can see I have what shows as 4 logical processors, but after applying the mitigation (setting VMkernel.Boot.hyperthreadingMitigation to true on the host advanced settings) and rebooting…

Yay the warnings gone, but apparently so are half my logical processors?

If your wondering why they didn’t enable this by default is due system resource management, which of course, is exactly what vSphere is. Since it affects the available resource of the host it may not be able to accommodate the workload it was originally designed for. In my case it’s a lab and my work load is obviously very light, and this isn’t an issue for me.

Was it worth the mitigation? I don’t really know at this point as I’m unaware of any easy simple tactics any attacker could use to attack my footprint. At the same type CPU resource is not my major constraint, it’s usually memory.

For now better safe than sorry. In my next post I hope to cover the vCenter upgrade path and an error that happened to me along the way, luckily it wasn’t that hard to recover from. 🙂

Cheers!

ESXi 6.5 Stuck on Initializing Scheduler
PSOD PCPU1 could not start

I’m making this post short to note this odd experience with this host build.

First weird thing was when trying to install ESXi I couldn’t get past vmkusb not sure what it was about but only found this decent reddit post with the same problem.

In short he noticed it would only get past this if a second USB was plugged into the USB2 ports, and sure enough that worked for me too. strange….

Then a couple days later while doing some more test boots, I get a Purple screen of death, complaining about the PCPU1 not starting or some shit, ugh again lets see what others have to say… well I found this vmware thread on it, he basically stated that resetting the BIOS settings worked, after farting around with some bios settings, I had other failed boots and my CPU and system was rather hot. I let everything cool down, added more powerful fans and tried again after resetting the BIOS to factory and much like the post it worked.

After finalizing the build a little more, I switched to an old 2.5″ HDD.. same problem but I noticed it gets stuck on initializing scheduler before PSODing, while I searched for what might be up with that I found this

It did help shit, same problem next boot, instead of just resetting the BIOS I played around with a couple more settings like I enabled intel AES-NI which helps for CPU offloading of AES computations. and another one which I sadly forgot, and then my next boot was fine. saving this at this point in case it comes back again.

VMware Host Oddities

I’ll keep this post short as I find it rather interesting turn of events…

I was ensuring config files were being backed up as good Sysadmin Posture would suggest. I noticed an issue on one of the hosts when I went to run the configuration backup. When I was hit with a “general system error”. I Wish I would have saved the actual error message but when I went to research it, it gave me more people complaining about the error when trying to vMotion (In this case I grabbed the most specific line from the call back in hopes to get something) most results simply pointed to saying restarted the ESXi hosts management agents. I decided to see if I could vMotion or if any other symptoms from the host, but everything showed green in vCenter, and vMotions worked without an issue.

I left it for the night (before) and decided to update the host since it was running 6.5-u2 and needed to be updated to 6.5-u3. I was hoping it would resolve the config backup issue (I had an older copy on hand but wanted most up to date) So I decided to do the update via a ISO image and a host reboot, moved VMs, no problem, placed host into Maintenance mode, no issues, send host for shutdown (I wasn’t using iLO or iDRAC, so needed to manually mount the USB hosting the installation media), while I was at the console waiting for the host to show “shutting down” it simply …. didn’t… after 10 minutes which is far more then generous I decided to press F2 to get into the console to have to tell me … uggghhh I wish I would have taken a picture of the error, something along the lines of you can’t use the console cause you been locked out…. it was weird, after waiting another 10 minutes and nothing happening (I’m assuming there must have maybe been some actual underlying issue with the datastore holding the scratch?) I decided to just do a hard shutdown (hold power for 5 seconds). I could rebuild this server faster if need be. Powering it back up was perfectly fine through all POST checks, booted the ESXi 6.5u3 installer, booted fined, checked for logical drives, found all of them perfectly fine including the SD card holding the OS (scratch changed to a persistent location, another datastore on the local host with multiple disks used for a datastore) anyway I selected it and it saw there was an existing installation and selected to update it, successful, reboot. reboots fine, but host doesn’t come back to vCenter…..

Log into host directly, and notice it states it has 18 VMs, all with a status of error, now I had moved them and they are even still running on another host (thankfully it didn’t do anything stupid to try and hijack them) at this point I had called VMware support and placed an SR. Once I had a tech on the line, I discussed my symptoms and issues. At this point I asked what was the best course of action as, I didn’t want to rebuild the host (I could, but time). In this case I simply un-registered all the VMs that were in error as they were not associated with the host), then attempted to re-connect the host. It first failed complaining about bad username and password (I’m not sure if this was the root account used to add it, or the VPX user it uses to manage the host) but it prompted the wizard like adding a new host, since all other settings were still fine on the host (network and vswitches w/ VMPGs and VMKs, and all Datatstores) after this wizard the host was re-added back into the cluster with the latest updates, and running the backup config command:

esxcli storage vmfs extent list

worked without error. So yay! all fixed, but one thing still bothered me… what was the root cause…

I decided to check my scratch settings real quick, and sure enough it hold good and is on redundant datastore on the local ESXi host, so not sure what the root cause was, but it was fixed. I’m posting this for my own reference. Just incase this issue re-arises.