vCenter Network Partitioned

Have you ever experienced a Network Partitioned warning in vSphere 5? Hopefully not, but if you find yourself with this warning in vSphere, don't panic; it's not as bad as it could have been in 4.x.

This literally just means that the host cannot communicate with any of the designated VMKs checked off for management traffic. In my case it happened after making network changes to my infrastructure. I still had bonded links at my switches, but somehow the VMK load-balancing algorithm had switched to "Route based on originating virtual port ID". That algorithm doesn't work with bonded NICs, which need "Route based on IP hash". My end goal was to get off bonded links for my hosts and use the default load-balancing algorithm that VMware uses, as that works with non-stacked switches and requires minimal switching knowledge (in case others need to manage the system in the future).
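
You can check (and fix) the load-balancing policy right from the ESXi shell; a quick sketch, assuming a standard vSwitch named vSwitch0 (swap in your own vSwitch name):

# show the current load-balancing policy for the vSwitch
esxcli network vswitch standard policy failover get -v vSwitch0
# switch to IP hash (needed for bonded/EtherChannel uplinks)...
esxcli network vswitch standard policy failover set -v vSwitch0 -l iphash
# ...or back to VMware's default, route based on originating port ID
esxcli network vswitch standard policy failover set -v vSwitch0 -l portid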

It took me a little bit to catch the issue, because the symptoms were that each host could ping any device in its respective management subnet but NOT the other host, on a flat /24 subnet too, which really had me baffled. I couldn't vMotion in this state either, but luckily the VMs on each host remained active (as they have separate communication VM port groups on dedicated physical connections).

Once I caught and corrected the error, I was able to verify vMotion worked again. That's all there is to it!

To paraphrase the solution:

1) Check which VMKs have management checked off.
2) Check those vSwitches' physical connections.
3) If there are multiple uplinks, check the configs on the physical switch and the load-balancing algorithm.
4) Google any errors along the way.
5) Check host-to-host communication by consoling into each host and using vmkping (see the example below).
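
A minimal vmkping check from the ESXi shell; the IPs here are just placeholders for your hosts' management VMK addresses:

# from host A, ping host B's management VMK IP
vmkping 192.168.1.12
# and from host B, ping host A's management VMK IP
vmkping 192.168.1.11

If vmkping fails while regular pings from other devices on the subnet succeed, suspect the vSwitch uplink/load-balancing configuration, as above.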

Jan 2018 Update

I remember this…

Changing Network Location to Domain

Have you ever restored a VM? Have you done your DR testing by actually doing a full recovery with AD? Did you find a couple of odd things occurred after the restore, such as not being able to RDP into your recovered server? Chances are your network profile has changed to Public instead of Domain. This in turn causes certain firewall rules to trigger.

I remember coming across this issue multiple times, usually when people want Private instead of Public or vice versa. So chances are you've come across answers telling you to use a PowerShell cmdlet to change the setting, which at my guess makes a registry change. The other option they specify is to use the GUI.
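
For the Public/Private case, the cmdlets in question live in the NetConnectionProfile module on Windows 8/Server 2012 and later; a quick sketch, with "Ethernet0" standing in for whatever your adapter is called:

# see which profile each connection landed in
Get-NetConnectionProfile
# flip a connection from Public to Private
Set-NetConnectionProfile -InterfaceAlias "Ethernet0" -NetworkCategory Private

Note that you can't simply set DomainAuthenticated this way; NLA assigns that category itself once it can locate a domain controller, which is exactly why the DNS-suffix trick below works.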

Well, I find changing local security policies and all that other stuff rather annoying. So after a bit more googling I found a really nice answer, which worked and was very simple to implement, very nicely written and easy to follow, by Evan A. Barr. You can view his site here.

To paraphrase the solution, using Network Connection Properties:

0) The idea: add a DNS suffix so that NLA can properly locate the domain controller.
1) Go to Network Connections.
2) Go to the properties of the network adapter in the wrong location.
3) Go to the properties for IPv4.
4) Click the "Advanced..." button.
5) Select the DNS tab.
6) Enter your domain name into the text box for "DNS suffix for this connection:".
7) Disable and then re-enable the connection to get NLA to re-identify the location (a PowerShell equivalent is sketched below).
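
If you'd rather script it than click through the GUI, the same steps map onto the DnsClient and NetAdapter cmdlets; a sketch, assuming an adapter named "Ethernet0" and a domain of corp.example.com (both placeholders):

# set the connection-specific DNS suffix so NLA can find the DC
Set-DnsClient -InterfaceAlias "Ethernet0" -ConnectionSpecificSuffix "corp.example.com"
# bounce the adapter to force NLA to re-identify the network
Restart-NetAdapter -Name "Ethernet0"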

Renewing expired certificates on vCenter 5.5

Do you follow best practice? Have you set up a VMware HA cluster with vCenter? Do you have your own PKI and certificates? Did you not have active monitoring on said certs? Then chances are you are in the exact same boat as me! This blog post assumes you are well versed in using the SSL Certificate Automation Tool, as well as in creating certificates for use with the tool.

This one begins on a Monday after the weekend, when I was getting alerts of failed backup jobs. I had configured Veeam at my workplace and have been happy with the product and support from day one. I also configured a cold site for backup retention in the event our primary site, you know… implodes. Anyway, I was used to getting "failed" alerts when really there was simply a communication hiccup across my IPsec tunnel; usually the job would complete successfully and just report the error. This time, however, it was different: the errors were for normal backup jobs and reported "incorrect username and password." I knew the password of the service account used by Veeam never expired or changed, instantly telling me something else was wrong. I then attempted to log in to vSphere, connecting to my vCenter server, and sure enough it said the same thing, wrong username and password, at which point another notice popped up saying all communications were untrusted due to expired certs. Doh!

At this point you'll probably have done exactly what I did… check your installation documentation, right?!?! I mean, if you are running custom certs, I'm assuming you follow other best practices such as documenting. :P But after that you are probably googling, once you discover parts of the SSL tool are not working!

Chances are you came across VMware's KB on renewing certs on a 5.5 instance of vCenter, only to discover at step 5 a) that the tool reports the local machine doesn't have the SSO service installed. This really comes down to what the "tool" really is, and that's a batch script. Yeah, you read that right, a BATCH script, so you can imagine how ugly and painful that must have been to code. Like, seriously, 5.5 was released in Sept 2013 and they were coding in PowerShell by then… shame on you, VMware. Anyway, the most likely problem here is the way this batch script checks for the installed service (I looked at the source code of the "tool" but didn't actually locate the part that handles this, so I'm strictly making assumptions here): it probably looks for a fairly exact string, a reg key or something of that nature, and probably checks it against a version number, so if the version changes the script replies "can't find this", and thus you get the above error, which you know is wrong. So how do you fix this? Well, you grab the exact version of the tool for the updated instance of vCenter you are on (this requires a valid VMware subscription). I managed to update one forum post in hopes it helps others at this stage of the game.

At this point I kept following through the tutorial. Just an FYI, I was going through all this with VMware tech support, and they had to get another tech who specialized in these cases. I came across other issues as well, such as at Step 5 d), where I got an error similar to this. Sadly I'm writing this up several days after the event, so I can't remember exactly what we did to recover from that one.
At this point you've just gotta keep pushing through the KB, which has a total of 24 steps, so you can imagine how painful all this is to do. At the same time I wasn't sure HA was even available, all my backups couldn't run, and any management of VMs would have to be done manually until vCenter was back up and running. I've talked to others, and many people suggest sticking with self-signed certs even though we all know it's not best practice. Thanks, VMware, for making best practice really hard to implement and maintain.
Also, in the very end steps I didn't actually have a listed service ID for the Web Client, only the web logger; although you can have separate service ID instances for these, in my case I had to use the web logger service ID to complete the final step. Afterwards the Web Client wasn't working properly, which I fixed by reinstalling the service/feature via Add/Remove Programs. The fact that there is no repair option on this installer bugs me.

To paraphrase the solution:

1) Ensure you are using the latest and correct version of the SSL tool *cough BATCH script*.
2) Create all your new certificates and chains (see the OpenSSL sketch below).
3) Follow the KB article very carefully, especially when it says to do some steps manually vs. using the "tool".
4) Google any errors along the way.
5) Bash your head in for following best practices.
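
For step 2, the general shape of the certificate work is plain OpenSSL; this is a rough sketch from memory rather than the KB's exact commands, and all the file names (openssl.cfg, rui.*, the chain pieces) are placeholders:

# generate a key and CSR using a config file with the SANs the service needs
openssl req -new -nodes -out rui.csr -keyout rui-orig.key -config openssl.cfg
# strip the key down to the plain RSA form vCenter expects
openssl rsa -in rui-orig.key -out rui.key
# once your CA has signed rui.csr into rui.crt, build the chain, leaf first, root last
cat rui.crt intermediate.crt root.crt > chain.pem

Repeat per service (each vCenter 5.5 service gets its own cert), then feed the results to the tool.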

Jan 2018 Updates

This brings back bad memories. It'll soon be time to update to 6.5; we'll see how VMware has handled internal PKI this time.

Locked out of iLO, on ESXi hosts!

Scenario: You've taken over a system admin position with little documentation, you are locked out of a pre-configured iLO port, and the default username and password have been changed!

Prerequisites:
- iLO Resources
- iLO drivers and tools for server
- Info Source
- software install ESX host SSH

Let's say you have a handful of ESXi hypervisors. Sweet, but what are they running on? A couple of G7 servers… They may not be G8s, but hey, can't have everything.
The sweet part is most of these big servers have remote management capabilities. HP offers iLO, a sweet little piece of separately attached hardware that can do many remote tasks, such as, but not limited to: hard resets/power cycles, sending SNMP traps, and a couple of other things.
Now you've found the iLO host name, you can send and receive ICMP requests (pings), and sure enough you can access the web interface; fantastic!

Now you attempt to log on and realize every attempt fails. A couple of questions might start to arise: does this accept domain credentials (directory-based authentication)? Is there a default username and password?
Of course, by reading the user guide you discover it does, with one exception… directory-based authentication requires an iLO license…
Now, unless you're in a big corporate environment running a decent-sized datacenter, chances are you don't have directory-based access on your iLO port.

Now you take the default admin account, Administrator, and type in the password as indicated on the info tab on the server… no luck…
As the guide suggests, it has been changed… now comes all the fun stuff: bringing iLO up to date.

Now, if you're lucky, the iLO drivers and tools for the ESXi host are already installed, in which case you won't even need to reboot your hypervisor to get into iLO.
However, if they aren't, make sure you migrate any active production VMs to another host, or schedule a maintenance window.
At this point I'm assuming you have direct access to the host: admin access, if not by directory services then via a local root account.
SSH into the host directly; if you have an /opt/hp/tools directory, chances are you have the required iLO drivers and tools (see the quick check below).
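
A ten-second way to check from the ESXi shell; hponcfg is the piece we actually need later:

# if this lists hponcfg, the iLO tools are already installed
ls /opt/hp/tools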

Otherwise, follow these steps:
1) Log into vCenter (if managing a cluster) and migrate/shut down active VMs.
2) Right-click the host about to get the iLO drivers installed, and enter Maintenance Mode.
3) SCP hp-HPUtil-esxi5.0-bundle-1.4-15.zip to /tmp on the host.
4) In the SSH session (as admin/root), enter: esxcli software vib install -d "/tmp/hp-HPUtil-esxi5.0-bundle-1.4-15.zip"
5) Let the server reboot.
6) Wait for an ICMP response; after a couple of minutes of responses you can take the host out of Maintenance Mode in vCenter.
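
If you'd rather drive the whole thing from the host's own shell instead of vCenter, the equivalent looks roughly like this (a sketch; the vim-cmd calls have shipped with ESXi for ages, but double-check on a non-production host first):

# put the host into maintenance mode from the ESXi shell
vim-cmd hostsvc/maintenance_mode_enter
# install the HP bundle, then reboot
esxcli software vib install -d "/tmp/hp-HPUtil-esxi5.0-bundle-1.4-15.zip"
reboot
# once it's back up, take it out of maintenance mode
vim-cmd hostsvc/maintenance_mode_exit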

Congrats, you can now manage the iLO port without needing to reboot the server ever again! Let's get into that web interface!

You have a couple of options at this point: reset the settings to factory defaults, or reset the admin password. Since we'd rather not reconfigure everything, let's reset that password.
There's a good chance the previous admin changed the name and password of the default admin, so how do we figure out what that account is?
All one has to do is export the existing configuration and display it.
On the SSH session on the ESXi host, type: "/opt/hp/tools/hponcfg -w /tmp/ilo-config.txt && cat /tmp/ilo-config.txt"
This will spit out an XML-looking set of config options; near the bottom one should see an ADD_USER section with a USER_LOGIN field (an illustrative fragment follows).
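
From memory, the relevant part of the export looks something like this (the values here are illustrative, and iLO masks the actual password in the export):

<ADD_USER
 USER_NAME="Administrator"
 USER_LOGIN="Administrator"
 PASSWORD="%user_password%">

Whatever USER_LOGIN shows here is the account name you'll target below.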

Now you build an XML file; if you're PuTTYing into the ESXi host, just use vi to paste and edit this:

<RIBCL VERSION="2.0">
<LOGIN USER_LOGIN="USER_LOGIN" PASSWORD="NewPassword">
<USER_INFO MODE="write">
<MOD_USER USER_LOGIN="USER_LOGIN">
<PASSWORD value="NewPassword"/>
</MOD_USER>
</USER_INFO>
</LOGIN>
</RIBCL>


Save the file to /tmp/reset_admin_pw.xml (with USER_LOGIN replaced by the login name you found in the export) and run "/opt/hp/tools/hponcfg -f /tmp/reset_admin_pw.xml".
It should reply that the script completed; now just log into the web interface with the USER_LOGIN ID and NewPassword. While you're finally in there, create a new default admin username and password, and document it.
Create your own account with a private password, as the default account should only be used in an emergency.
Also update the firmware to the latest version and renew your certs!

Jan 2018 Update

Updated one source link for the drivers, as HP for some reason didn't set up a redirect to HPE after dividing their company. Again, well done.