How to remove a Datastore from a vSphere Cluster

How to Remove a Datastore

Intro

Hey everyone,

I figured I’d write up a quick little help guide on removing a Datastore. Now this isn’t new and likely to be buried on the internet because of it. However in my searches I have found the following sources to be great reads. I highly recommend you check them out.

1)  Official Source VMware KB2004605.

2) A Blog guide by Sam McGeown, here.

3) A post by Mike on cswitchzero.

Now let’s go through the checklist from the official source one by one.

Check List

  • If the LUN is being used as a VMFS datastore, all objects (for example, virtual machines, templates, and Snapshots) stored on the VMFS datastore must be unregistered or moved to another datastore.-This one is pretty easy navigate to the datastore files and check. You may find some remanence from the following though.
  • All CD/DVD images located on the VMFS datastore must also be unmounted/unregistered from the virtual machines.-This shouldn’t even be the case if you did check one.
  • The datastore is not used for vSphere HA heartbeat.-This setting will use a folder labeled “.vSphere-HA”
    For a Quick overview of Datastore Heart beating See here
    To “remove” aka change them See here
  • The datastore is not part of a datastore cluster.-You can find useless help on this process from VMware here. I’m assuming it’s an easy task via the WebUI
  • The datastore is not managed by Storage DRS.-If you removed it from the datastore cluster, how could this be an issue?
  • The datastore is not configured as a diagnostic coredump Partition/File and Scratch Partition. For more information, see the following:
  • Storage I/O Control is disabled for the datastore.-See here on how to enable (disabling is the exact reverse)
  • No third-party scripts or utilities running on the ESXi host can access the LUN that has issue.-Honestly I’m not sure how you could check this… even when doing some quick research, you can have scripts I guess that are not on the hosts, but run by alternative machines via PowerCLI. As described in this community post. I guess you’d have to know, either way the scripts would just fail, shouldn’t affect the vSphere cluster.
  • If the LUN is being used as an RDM, remove the RDM from the virtual machine. Click Edit Settings, highlight the RDM hard disk, and click Remove. Select Delete from disk if it is not selected and click OK.Note: This destroys the mapping file but not the LUN content.

    – This is more involving the removing of the backend physical device. Which in my case is the final goal. Though if yours was just to remove a datastore while keeping the physical storage in place this can be ignored.

  • As noted by Sam but not the official source or Mike is if you see a .dvsData folder. as stated by SAM “The .vdsData folder is created on any VMFS store that has a Virtual Machine on it that also participates in the VDS – so by migrating your VMs off the datastore you’ll be ensuring the configuration data is elsewhere.”
  • Check that there are no processes locking the VMFS with this command:
esxcli storage core device world list -d

Datastore Removal Steps

Step 1) Follow the Checklist above.

Make sure no files reside on the Datastore.

Step 2) Unmount Datastore from all ESXi hosts.

As noted by SAM blog post even in vSphere 5.x using the C# phat client, this was possible to do via a wizard against all hosts that have the datastore mounted. Even on the newer HTML5 WebUI this is still possible (I think everyone wants to fully forget that VMware chose flash for a short time).

At this point the Datastore will show up as inaccessible to vSphere. As noted by both Mike and Sam. This will be the same anywhere from 5.x-7.x (As noted by Mike it might be slightly more important to follow procedures with earlier versions of ESXi 3 or 4). If the Check list was followed, there should be no issues unmounting the datastore.

If you need to do this via esxcli (Source):

# esxcli storage filesystem list

Unmount the datastore by running the command:

# esxcli storage filesystem unmount [-u UUID | -l label | -p path ]

For example, use one of these commands to unmount the LUN01 datastore:

# esxcli storage filesystem unmount -l LUN01

# esxcli storage filesystem unmount -u 4e414917-a8d75514-6bae-0019b9f1ecf4

# esxcli storage filesystem unmount -p /vmfs/volumes/4e414917-a8d75514-6bae-0019b9f1ecf4

Step 3) Detach the LUN from all hosts.

As noted by Sam, if you are on 5.x you might want to automate this via PowerCLI. Then noted by Mike, newer 7.x can now do this in bulk via the Management WebUI.

6/7 WebUI -> Hosts n Clusters -> Hosts -> Cluster -> Host -> Configure Tab -> Storage Device (left side tree) -> Highlight Device -> Detach

for esxcli

Obtaining the NAA ID of the LUN to be removed

esxcli storage vmfs extent list

To detach the device/LUN, run the command:

# esxcli storage core device set --state=off -d NAA_ID

6. To verify that the device is offline, run the command:

# esxcli storage core device list -d NAA_ID

The output, which shows that the Status of the disk is off.

Step 4) Rescan HBAs

At this point, if you rescan all HBAs on all hosts the inaccessible datastore should be gone from the WebUI.

At this point you can remove the LUN from being seen (disc from showing up under devices) this will either be iSCSI based configurations (remove static and dynamic IPs from the iSCSI initiator settings on each host.) Mostly likely for a shared VMFS datastore.

It could be a local disc over a local storage controller (such as a logical drive created in RAID) such as behind a Pxxx storage controller.

Removing the source device will always be dependent on how it was configured in the first place.

Summary

So today we covered removing a Datastore. The important thing to remember is removing a Datastore takes a lot more steps than removing one, cause so many different VM’s and services can be applied to a datastore once it has started being used.

In many cases, the SysLog and Scratch partition are big hang ups, and should be looked at closely. Which, however, as stated if you are actually checking for files on the datastore this stuff will be pretty evident.

In most cases, ensure you follow the check list and the process should be pretty smooth. Hope this helps someone.

*Note* I often provide screen shots to provide some context, in this case I decided to leave it more generic to span multiple versions of vSphere.

WSUS Cleanup Unused Updates

How I got here

I needed to swap a disc, for a storage array to rebuild the logical volume.

Check, “disk is not authentic” **** off HPE. Workaround (disable sensors, no thanks). Fix 1, get authentic disk, not happening. Fix 2, move to alternative storage.

Alt storage available. Begin migration process (multiple ways to accomplish this, not in scope of this post). Good time to clean up source data, in this case WSUS update files. Lets clean them up…

Should be easy, eh? Open WSUS -> Options -> Server Cleanup Wizard -> Check  (Unused updates and update revisions)

Reality:

**** off Microsoft…. OK let’s see what Google has for me today….

Rabbit Hole Begins

Classic Adam with some suggestions, as mentioned here and here, same help suggestions as follows:

“* Make the following “Advanced Settings” for WSUS Application Pool in IIS:
– Queue Length: 25000 from 1000
– Limit Interval (minutes): 15 from 5
– “Service Unavailable” Response: TcpLevel from HttpLevel
* (Stop IIS first) Edit the web.config ( C:\Program Files\Update Services\WebServices\ClientWebService\web.config ) for WSUS:
– Replace <httpRuntime maxRequestLength=”4096″ /> with <httpRuntime maxRequestLength=”204800″ executionTimeout=”7200″/>
* Adjust the private memory limit.
– If you have WSUS Automated Maintenance (WAM), from the WAM Shell run:
.\Clean-WSUS.ps1 -SetApplicationPoolMemory 4096
– If you don’t have WAM, edit the pool’s configuration directly to change it to 4194304 (4GB)”

To stop IIS “issreset /stop”

Seems his copy n paste answer to this problem. Well I did all the above, and same results. Let’s try a reboot maybe that helps make these settings apply (doubt it). Nope same error. these changes did nothing to resolve the problem.

Same results. However as noted by the OP in the second link, in which Adam tell the OP to follow his guide on validating something in the SUSDB. However this simply links to his “Reinstall WSUS guide” in which he states you need SSMS “To tell if the WID carries more than the SUSDB database, you’ll need to install SQL Server Management Studio (SSMS) and connect to the WID instance to browse the databases.”

Installing MSsqlcmd

Nah SSMS is heavy you can also use “Microsoft® Command Line Utilities for SQL Server” for WSUS on 2016 I recommend version 14 along with (I bleieve is needed) ODBC Drivers (at time of this writing version 17, required Visual C++ 2017 redist)

*correction ODBC 17, did not work, installed wanted ODBC driver 11 for some reason.. this one. (FFS)

and…

are you shitting me.. what gives… Someone already blog posted about this..

Grab version ODBC version 13.1!

OMG it worked, it somehow hardcoded to check for only this particular version of ODBC, unreal… lets move on.

To help guide me in its use I followed this blog post. Thanks mavboss.

Install Visual C++ 2017 Redist.

Install ODBC drivers (AFAIK enable ODBC Driver for SQL Server SDK, during install wizard, MAKE SURE v13.1!!)

Install MSsqlcmd (v14 at the time of this writing, yes, even though the wizard picture states v13)

Holy Sheeeshh, k let’s see if we can connect to the WID…

Connecting to the WID with SQLCMD

cd "c:\Program Files\Microsoft SQL Server\Client SDK\ODBC\130\Tools\Binn"
SQLCMD -E -S np:\\.\pipe\MICROSOFT##WID\tsql\query

Ehhh look at that, ok next part the queries mentioned in the initial second link share…

Ehhhh, well its going, but its taking a long time, I can see why the timeouts were extended in the app pool section…

one thing I noticed was when you run the wizard CPU goes up but does not max out, maybe a few spikes here n there. Running this stored procedure pins the CPU at 100%. will report how long this takes…

hour n 30 minutes later the process is still going…. Oi… publishing for now will update this post when new info is discovered. For now this is no answer to the problem, just a hold up to the end of the rabbit hole.

Over 3 and a half hours later it completed :O. I was just about to figure out how to cut it off when right when I was thinking about it the process dropped in CPU usage and some disk usage went up :O

And amazingly got a result from the cmd prompt. Me being the lazy guy I am, had no interest in counting the number of results, so I took the results saved them in text file in a shared file folder. Then opened it on my main work station and pasted it into excel.

Jeeeeeeeee le weeez, over 8000 results, no wonder WSUS kept crashing, plus the 5 to 15 minute timeout wouldn’t help for shit with it having taken nearly 4 hours to complete the query. OK now…. how am I going to clean this up. I have a feeling it’ll be best to write a SP myself, or at least a generalized query to delete some of these in bulk, maybe start off with 10 items and work up to 100 items at a time, even at 100 it’ll take 80 runs to clear them all….

Nutty, I don’t think removing one item will make the front end work like it did for the OP, however I’ll try to manually delete some…

That took about a minute… that times 8000… uhhhh

That’s going to take way too long… researching the stored proecdure in question I found this Blog post.

I ran the indexes mentioned but found no improvement in running the SP.

little more looking into sqlcmd, was able to determine how I could run the SP per numbered line…

SQLCMD -S np:\\.\pipe\MICROSOFT##WID\tsql\query -Q "use SUSDB; exec spDeleteUpdate @localUpdateID=69691;"

Time to write a powershell script to help bulk run this task. The linked Blog shows how SQL script, but that script itself builds a table from the Stored procedure “getObsoleteUpdateToDelete” which took 4 hours so I don’t want to run that again, since I already saved the results in a txt file.

I should be able to use PowerShell to easily iterate each line of the text file (adjust the number of items within the source file) to do the bulk operation. 😀

Let’s do this…

PS C:\Program Files\Microsoft SQL Server\Client SDK\ODBC\130\Tools\Binn> foreach($line in Get-Content 'C:\temp\New folder\list.txt'){write-host "removing $line"; SQLCMD -S np:\\.\pipe\MICROSOFT##WID\tsql\query -Q "use SUSDB; exec spDeleteUpdate @localUpdateID=$line;"}

… This one liner script allows me to run the cleanup on as many or as little updates as needed, simply add each update ID per line within the line.txt file. Done. Simple!

It took literally day’s almost a week, of slowly updating my list file and running the for each command to remove all the records from the database. Then finally opened up that WSUS wizard ran the cleanup wizard and….

Ooo no way finally! what a Pain that was. But got it done. No SSMS required.

ESXi new install; failed to create new Datastore

Well I booted up a new server, created a new logical drive, bot ESXi and Failed to create datastore… what is this?

Google help? Yeah Forms help.

1. Show connected disks.

ls -lha /vmfs/devices/disks/

(Verify the disk is seen. You will probably see your disk ID then :1. This is a partition on the disk. We only need to work about the main disk ID.)

Neat. next

2. Show the error on disk.

partedUtil getptbl /vmfs/devices/disks/(disk ID)

(It will probably indicate that the GPT is located beyond the end of the disk.)

Ohhh yeah, huh… fix it

3. Wipe disk and rewrite with a basic MSDOS partion.

partedUtil setptbl /vmfs/devices/disks/(disk ID) msdos

(The output from this should be similar to msdos and the next line will be o o o o)

Go to create data store after this, yay it worked. Please note to use your own values, images are just for reference.

*UPDATE* I went to reuse some old drives from an old RAID controller. In this case I had removed the logical drive from the old RAID configuration, pulled the disks. Since they were same Caddy as an alternative server, and went on to create some new logical drives to use as an alternative datastore on this particular host.

In the examples above, it would fail at creation of the datastore. In this example it failed at the point in the wizard to define the partition to create. with an error as follows:

“Either the selected disk already has a VMFS datastore or the host cannot perform a partition table conversion. Select another disk” in a nice red banner.

Now attempting My usual fix as mentioned above resulted in…

… to be updated (i have such a headache right now from the endless issues)

Had to clear the drives to fix this problem (delete logical drive) rip Drives out of server, use a USB enclosure to use “diskpart” and the “clean command on windows to clean the drives.

Then after that the health light on the server went off, saying my one disk or caddy is “unauthentic” even though it was just working. Apparently terrible engineer caddy’s.

Which to find out this issue I had to get into iLO which the admin password was unknown so had to run up my old blog post to get into that. and now after all that.. I have a headache.

Good job computers, you managed to make my day fantastic… again.

Veeam: NFC storage connection is unavailable.
Failed to create NFC download stream.

I’ll keep this post short as well.

Run replication Job… ERROR. Check error, huh haven’t this one before…

4/16/2021 11:22:16 AM :: Processing VMName Error: NFC storage connection is unavailable. Storage: [stg:datastore-3,nfchost:host-2,conn:vcenter]. Storage display name: [ESXi-Local-Datastore].
Failed to create NFC upload stream. NFC path: [nfc://conn:vCenter,nfchost:host-23982,stg:datastore-23983@VMName_replica/VMName.vmx].

To note on this, I did some changes, I changed a route between sites (as I needed to reduce a entire network from being improperly routed, but some of services still required access from the main location, thus some dedicated /32 routes were put in their place).

I had also just patched the host in question and was testing jobs after the patch. Since I wasn’t exactly sure which was the cause. I decided to do regular troubleshooting to get more details to the root cause.

I love Veeam, they got a nice KB to help with this. So I followed along, checking the main Veeam server log areas didn’t have the log file in question, so was pretty confident it was still using the proxy at the alt site.

Checking the log as mentioned by the KB, sure enough the same error line showed up which indicated it was DNS related. Checking the proxie’s DNS settings…. DOH. It was using a server within the routes I had removed, and didn’t create a dedicated /32 for, as I wasn’t expecting any systems here to need to communicate to that subnet.

Now that I know what the issue is… this feels familiar… oh yeah the Veeam Soap Fault issue I had to deal with

The funny part about this is… 1) it’s the same server/proxy 2) Again DNS related 3) Again going to stick with host file to avoid dependencies on DNS servers

In my case the error showed the ESXi server by the hostname WAS fully qualified, but access to a DNS server to resolve it was unavailable. As soon as I saw this I had two options:

  1. Create a route to allow the Proxy to reach required DNS servers (which won’t be available in a DR case) OR
  2. Just add a static record in the Proxy host file. (DNS server not required, but if hostname or IP changes needs to be adjusted manually here)

As you can see I have the exactly 2 options similar to my first post, the difference is now it was fully dependent on DNS. Since this is a self hosted instance of a Veeam proxy, there’s a good chance DNS access might not be available when time comes, so this option was chosen.

It’s important to note that when these types of choices are made it is well documented WHY they were made.

So in this case… to resolve it I added a record in the Proxy’s local host file

172.x.x.x     ESXi.domain.postfix

You may notice that the ESXi hostname is not within the initial error, it only tells your the datastore, the Veeam logs will indicate which lookup failed. More than likely the hostname look up for the ESXi host in which the VM will be created on.

I really hope this post helps someone. Honestly I just followed the Veeam KB which was a great source reference to troubleshooting the issue. Your case maybe different, depending on the root cause your resolution maybe different then what was discussed here.

Cheers stay safe everyone.

ESXi VM disconnected after applying patch

Keep this short.

I had to update a ESXi host locally as it’s mgmt connection would drop with all VMs having to go down on it. As it’s a single host with virtualized network components.

On one of the VMs this was an opportunity to update it since it needed to be shutdown temperately anyway. I took a snapshot of the VM, updated it, validated updates were fine, removed the snapshot. Then proceeded to update the host:

vim-cmd hostsvc/Maintenance_mode_enter
esxcli software vib update -d "/path/to/file.zip"
vim-cmd hostsvc/Maintenance_mode_exit
reboot

Nothing special here. However once the host came back up and I was able to access it via vCenter, one of the VM’s was shown as “disconnected” I’ve seen this with ESXi hosts before, but not particularly with a VM.

Oddly enough there’s only one datastore on the host and all other VMs are fine, and checking the datastore, all files are where they should be.

I figured maybe remove the VM from inventory and just re-add it via the vmx file, however the option was greyed out.

It turned out there was apparently still a snapshot left on the VM (noticed via delta files existing within the VMs folder path).

Removing all the snapshots resolved the issue. Turns out the VM was also running, but didn’t show the green play icon, thus I wasn’t aware of it’s powered on state. Which also explains the greyed out context menu for removing from inventory.

Hope this helps someone.

Validating Windows Creds

I wanna make a really quick post here about this. Normally I generally right click a app on the taskbar, and then shift+right click the app icon, and in the context menu pick “run as a different user”. then I get a credentials box prompt asking me to enter the creds of the user and their password, and if successful open the app (generally cmd).

This time I was testing some old credentials used for a particular service, but I wasn’t sure of the password, I also wasn’t sure exactly where this account was all used, so was hesitant to just go and change the accounts password.

I did my usual trick as stated above and got the user was not allowed local logon for this machine, which was a good thing, some standard best practices for the account were implemented. This however still left me with the assumption the user/password was correct, but not 100% sure.

attempting the same thing with a random known bad password sure enough responded with wrong username/password. Giving me pretty confident results the username and password I entered were correct.

I found this serverfault post about the same question, and I attempted the simple “net use” trick. Sure enough they also do the run as trick I stated in the first paragraph.

net use \\%userdnsdomain% /user:%userdomain%\%username% *

on my main machine I got an error of multiple connections not allowed, I attempted the fix posted by themadmax

net use /delete \\unc\path

which didn’t work probably cause this path I was testing against was a mapped drive for my local logged in user. I followed up by running the commands from an alternative machine I knew had access to the share and DC’s.

Sure enough this worked, I am now confident the username and password are correct.

Hope this helps someone!

Auto Install Defender Updates, but Not All Updates

Issue

Fun times! Updates. Which are not separated in the defined GPOs available to Sys Admins.

Many sources of this issue:
[SOLVED] Server 2016 – Auto install definition updates but nothing else? – Windows Server – Spiceworks

Autoupdate Windows Defender (microsoft.com)

Windows Server 2016 auto install security updates – Microsoft Q&A

Issue: Defender Definition updates come ever day, no separate GPO to differentiate other Windows system updates from these. Other updates require manual install for service availability reasons.

*NOTE* This is how to do this while retaining the update option #3: Auto Download and notify for install. Incase you need to maintain guided (human controlled) updates, but not for the definition updates.

Solution:

Use either:

For 2008 R2 (Source)
A) C:\Program Files\Windows Defender\MpCmdRun.exe -Signature Update

For 2016 + 2019 (Source)
B) PowerShell Cmdlet: Update-MpSignature

Implementation:

Create a script, configure a GPO to deploy it to server as a scheduled task.
*This post to be update with better, step by step tasks. Just a place holder for now with references.

Step 1: Create a script

If you need help with this, you can use my script as a reference, or just use it, similar to this.

Step 2: Determine shared location

Save the script to a share available to domain system (I heard SYSVOL is accessible by all). If this is not acceptable you can follow this guys guide in which he creates a standard SMB share from an alternative server.

Step 3: Fall Down a Rabbit Hole

OK… this is where things got a bit tricky. There’s one slight issue if you want to run a task from a systems’ perspective when the source is on a SMB share that requires domain creds. In the guide I provided about the Op simply created a shortcut link to the network shared script, which will run under a users context.

In this guide, by SysOps, he mentions the use of SYSTEM and the escalated privilege’s it has, but later mentions that he’s sourcing the script local but you could use a network share, however, not mentioning the issue I just did here.

Of course I figured, ok what a good time learn using gMSA accounts to run the task. It should be able to read the script file, it should have the required rights. (expect this is super good to know Thanks Leon! – If you have “Run whether user is logged in or not” your gMSA must be member of the Log on as a batch job or the local Administrators group to be able to run.) Also don’t have to worry about managing a password for the account, it should be a win all around. Let’s do it.

Pick Your Poison

You can either A) copy the script to a local path on the server, and create a scheduled task to run the script, either as system, or any standard user.

or B) create a domain account, or gMSA, and place the script in a SMB location and use a GPO to create the scheduled task on all machines.

I choose B…. but….

This is a bit of a rabbit hole so feel free to avoid this tangent by skipping to part B.

Turns out there’s no governance around the ExecutionPolicy in windows.

Microsoft has changed how definition updates are seen in update history.

Note usually you should grant access to manage the machine password permission to a group, instead of machines directly, and if done so permission to the gMSA can be applied without reboot. (Though I’m sure the same might apply when applied directly to the system as well, but I have tested).

Now my mind started to wonder a bit, Is there a limitation to how many machines can have access to a gMSA? Even this more nitty gritty blog post on gMSAs doesn’t seem to state any limitations. This reddit post asking specifically around gMSA limitation.. nothing.

“unique_username065
3 years ago
You also need to give the gMSA permission to run scripts. There is a technet blog article that explains all the necessary steps to run scheduled tasks and scripts. I am on mobile, so I can’t look for it.

Just be very careful because everyone with access to the machine can potentially exploit gMSAs AFAIK.

Disclaimer: all I wrote is based in theories”

Well that TechNet blog would have been useful, I’ll keep sourcing my findings as we move along here.  So I’ll test it on a single machine, but I have multiple systems in an OU which ties GPOs, so how to push a GPO to just one machine in a OU?

Of course this has it’s own rabbit hole you have to considerSee here for all the details.

In short… ‘If adding “Authenticated Users” with just “Read” permissions is not an option in your environment, then you will need to add the “Domain Computers” group with “Read” Permissions. If you want to limit it beyond the Domain Computers group: Administrators can also create a new domain group and add the computer accounts to the group so you can limit the “Read Access” on a Group Policy Object (GPO).’

In my case the computer account it’s self should suffice, or as mentioned a group with computer accounts. This was the scope and the read permission will both be applied via the same group, and if needing to add more machine only need to add them to the appropriate groups not mess with GPO scopes or permissions. (AKA scalable design)

Then I had one final question pop in my head, “If you can define a GPO to copy a file from a shared network path to the local machine, how does it do that? If scheduled Tasks can only run via ‘SYSTEM’.”

My highly intelligent friend said something, and seems to be backed by this source as well.

“This can only be done during system startup – you’re copying to a system protected folder. During system startup you’ll need to grant the computer itself read access to the source directory share. There are two ways of doing this.

– Create a computer group and grant that group read access. You’ll then need to add every computer to it. You could use the built in Domain Computers group for this as well

or

– Put all the files you want copied into the GPO folder. This folder is read-only for computers as they start.”

Ohhh weird… but you can’t use the computer account to run scheduled tasks?

Apparently not well poop. So that explains all that….

I wanted to test my script as a scheduled task, and noticed a random change from the last time I test.

Old results: Click Check for updates, Definition Update was available, but had to click Install for them to be installed.,

New results (without deploying this script): Click check for update, definition update installs by itself after clicking check for update.

Oh well in that case, lets just up the amount of times it checks for updates.

Apply the GPO setting “Automatic Updates Detection Frequency”

Check the next morning….

As you can see, the detection frequency was applied, but I guess it’s not being adhered to. The last update is well beyond 2 hours.

Time to deploy the script.

K so to pull this off…

Step 3: Create a gMSA

  1. create a group for granting access to manage the MSA password
    New-ADGroup -Name "Update Defender Definitions" -SamAccountName UpdateDefenderDefinitions -GroupCategory Security -GroupScope DomainLocal -DisplayName "Update Defender Definitions" -Description "This group is granted ManagePassword rights on the gMSAtskUDDspt" -Path "CN=Managed Service Accounts,DC=zewwy,DC=ca"
  2. create the gmsa
    New-ADServiceAccount -Name gMSAtskUDDspt -DNSHostName gMSAtskUDDspt.zewwy.ca -PrincipalsAllowedToRetrieveManagedPassword UpdateDefenderDefinitions
  3. grant the group access to the GMSA, by adding computers into the group created in step 1.
    Add-ADGroupMember -Identity SvcAccPSOGroup -Members SQL01,SQL02

    Step 4: Create a GPO that creates a Scheduled Task to run the script

Right click in GPMC where you want your GPO to be linked, and select “create new GPO and link it here”

Remove Authenticated Users from the scope (if you need to test this one one machine, when multiple machines are in one OU, else skip this stuff). Then add the computer account in the scope area. (It appears the computer account is granted read rights on the GPO now.)

Edit the GPO -> Computer -> Preferences -> Scheduled Tasks (at least Windows 7)

It’s super important to know the differences between the actions types.

Action : Update Create

Name: InstallDefinitionUpdates

[Document remaining steps]

Caveats

Shit… I forgot, you have to add the gMSA to all computers that would need this applied too.. and you can’t automate that via GPO, like you can everything else.

Did a update force and saw the scheduled task, finally something, but…

I clearly forgot about Leon’s advise… and double checking that the option to run if the user is logged in or not.

gpupdate /force….. no change to task…. what… ok delete task…..

gpupdate /force… No new task… what?

Go to GPO, switch the option back to run when user is logged on…

gpupdate /force… new task is there… OK what gives?

Try to set the task to run if user is logged on or not manually by editing the task…. I get a cred box pop-up. As for most services using gMSAs, left the password field blank and clicked ok…

I love IT work….

OK… what did I miss this time?

OK, I’ve been digging in the PowerShell properties for scheduled tasks for a while now… How the heck do I set to run logged on or not via powershell?

Main answer, says to use a principal with type password, but it’s a service account? Second main answer says to use system, like no this is a gMSA and we need a domain account for the issues stated above. For shits I tried setting the principal logon type to S4U, as mentioned by one commenter, but it gave me access denied response, then I picked password type and it took it, somehow it is set now… what?! (See picture below)

I went to check the task history… It worked!

Holy Bloody Mary, it actually worked!

OK but it’s seem really stupid when you define the option to run when logged on or not it won’t deploy the task, but if you leave it as user as to logon it does, then you have to use powershell to set the proper logontype. So another powershell script… Ughhhh, there’s also the issue of installing the gMSA on the computer account, I wonder if I could have two additional tasks to run powershell commands to those needfuls.

Ahhh crap, if the GPO action is replace… and I just had to do manual steps I haven’t automated yet….

gpupdate /force… yup back to run when user is logged on crap! Normally the replace action is good if you want to make changes, in this case it’s not wanted, and would be kind of redic to have these multiple scripts to fix themselves go off every time there’s a gpupdate. In this case I changed the Action back to create. K that works, but how do I run these simple powershell commands right after that… automatically.

$principal = New-ScheduledTaskPrincipal -UserId domain\gMSA$ -LogonType password
Set-ScheduledTask -TaskName InstallDefinitionUpdates -Principal $principal

For the first issue, this was the closest I could find. The main answer of using LAPS is poop. The issue around credSSP could be the fact, but not sure if putting creds into a script is a great idea anyway if it is required. I wonder if the system account can run the command… or the “Computer account” maybe via a simplified startup script?

Sine the amount of systems I had to deploy on was small, I skipped this. But if this was wanting to be deployed on end machines, workstations, or laptops. This might be a required step in that case.

As for Issue number two. I ran the above commands manually after installing the gMSA manually. At this point it makes you wonder what was the point of automating the creation of the scheduled task, if I simply have to manually do the other steps. The only answer to that I have is, I didn’t know, I learnt as I go. However it only now required 2 more hurdles to resolve to actually fully automate the process.

Summary

This was another very painful learning experience, all cause definition updates were tied to MS updates, and couldn’t have their own install schedule or install action. I going to create a separate blog post to cover creating a Scheduled Task with a gMSA like this one did. but more specific to that task.

May I suggest you use a standard domain account and just deploy a script pointing to that, and store the creds somewhere if you really need to. This is a painful process.

 

Microsoft Exchange Vulns and Buggy Updates

I’ll keep this post short. If you are unaware, there’s been a big hack on exchange servers.

Microsoft Exchange hack, explained (cnbc.com)

I ran the IOC scripts from MS, was I affected, it appears I may have.

Initiated my own lab DRP/BCP. Informed myself that services would be down, and restored AD and Exchange from backups before the logged incidents. Took the OWA Rev proxy rule  down till the servers could be fully patched.

Booted restored VMs, patched, hopefully good to go.

Then doing patch Tuesday updates users laptops start failing to boot after installing KB5000802. All I could find was news of prints causing BSODs classic.. BSODs! In my case it was causing boot crashing, I did my usual trick, but I got a different error, then ran the Windows Start up repair process, which amazingly got it to boot but said it reverted an updated (the one above). i attempted a install again, but same problem. I didn’t want to re-image as it was an VIPs machine, and time was of the essence. I took a whim, and decided to install all the latest drivers from the laptop OEM vendor (In case some was using MS drivers instead), after that tried the update again, and got a successful install.  Phewwww!

VMware HA down after 6.5 patch

The Story

So the other day I tested the latest VMware patch that was released as blogged about here.

Then I ran the patch on a clients setup which was on 6.5 instead of 6.7. Didn’t think would be much different and in terms of steps to follow it wasn’t.

First thing to note though is validating the vCenter root password to ensure it isn’t expired. (On 6.7u1 a newer)Else the updater will tell you the upgrade can’t continue.

Logged into vCenter (SSH/Console) once in the shell:

passwd

To see the status of the account.

chage -l root

To set the root password to never expire (do so at your own risk, or if allowed by policies)

chage -I -1 -m 0 -M 99999 -E -1 root

Install patch update, and reboot vCenter.

All is good until…

ERROR: HA Down

So after I logged into the vCenter server, an older cluster was fine, but a newer cluster with newer hosts showed a couple errors.

For the cluster itself:

“cannot find vSphere HA master”

For the ESXi hosts

“Cannot install the vCenter Server agent service”

So off to the internet I go! I also ask people on  IRC if they have come across this, and crickets. I found this blog post, and all the troubleshooting steps lead to no real solution unfortunately. It was a bit annoying that “it could be due to many reason such as…” and list them off with vCenter update being one of them, but then goes throw common standard troubleshooting steps. Which is nice, but non of them are analytical to determine which of the root causes caused it, as to actual resolve it instead of “throwing darts at a dart board”.

Anyway I decided to create an SR with VMware, and uploaded the logs. While I kept looking for an answer, and found this VMware KB.

Which funny the resolution states… “This issue is resolved in vCenter Server 6.5.x, available at VMware Downloads.”

That’s ironic, I Just updated to cause this problem, hahaha.

Anyway, my Colleague notices the “work around”…

“To work around this issue in earlier versions, place the affected host(s) in maintenance mode and reboot them to clear the reboot request.”

I didn’t exactly check the logs and wasn’t sure if there actually was a pending reboot, but figured it was worth a shot.

The Reboot

So, vMotion all VMs off the host, no problem, put into maintenance mode, no problem, send host for reboot….

Watching screen, still at ESXi console login…. monitoring sensors indicate host is inaccessible, pings are still up and the Embedded Host Controller (EHC) is unresponsive…. ugghhhh ok…..

Press F2/F12 at console “direct management as been disabled” like uhhh ok…

I found this, a command to hard reboot, but I can’t SSH in, and I can’t access the Embedded Host Controller… so no way to enter it…

reboot -n -f

Then found this with the same problem… the solution… like computer in a stuck state, hard shutdown. So pressed the power button for 10-20 seconds, till the server was fully off. Then powered it back on.

The Unexpected

At this point I was figuring the usual, it comes back up, and shows up in vCenter. Nope, instead the server showed disconnected in vcenter, downed state. I managed to log into the Embedded Host Controller, but found the VMs I had vMotion still on it in a ghosted state. I figured this wouldn’t be a problem after reconnecting to vCenter it should pick up on the clean state of those VM’s being on the other hosts.

Click reconnect host…

Error: failed to login with the vim admin password

Not gonna lie, at  this point I got pretty upset. You know, HULK SMASH! Type deal. However instead of smashing my monitors, which wouldn’t have been helpful, I went back to Google.

I found this VMware KB, along with this thread post and pieced together a resolution from both. The main thing was the KB wanted to reinstall the agents, the thread post seemed most people just need the services restarted.

So I removed the host from vCenter (Remove from inventory), also removed the ghosted VM’s via the EHC, enabled SSH, restarted the VPXA and HOSTD services.

/etc/init.d/hostd restart

/etc/init.d/vpxa restart

Then re-added the host to vCenter and to the cluster, and it worked just fine.

The Next Server

Alright now so now vMotion all the VMs to this now rebooted host. So we can do the same thing on the alternative ESXi host to make sure they are all good.

Go to set the host into maintenance mode, and reboot, this server sure enough hangs at the reboot just like the other host. I figured the process was going to be the same here, however the results actually were not.

This time the host actually did reconnect to vCenter after the reboot but it was not in Maintenance mode…. wait what?

I figured that was weird and would give it another reboot, when I went to put it into Maintenance Mode, it got stuck at 2%… I was like ughhhh wat? weird part was they even stated orphaned ghosted VM’s so I thought maybe it had them at this point.

Googling this, I didn’t find of an answer, and just when I was about to hard reboot the host again (after 20 minutes) it succeeded. I was like wat?

Then sent a reboot which I think took like 5 minutes to apply, all kinds of weird were happening. While it was rebooting I disconnected the host from vCenter (not removed), and waited for the reboot, then accessed this hosts EHC.

It was at this point I got a bit curious about how you determine if a host needs a reboot, since the vCenter didn’t tell, and the EHC didn’t tell… How was I suppose to know considering I didn’t install any additional VIBs after deployment… I found this reddit post with the same question.

Some weird answers the best being:

vim-cmd hostsvc/hostsummary|grep -i reboot

The real thing that made me raise my brow was this convo bit:

Like Wat?!?!?! hahaha Anyway, by this time I got an answer from VMware support, and they simply asked when the error happened, and if I had a snippet of the error, and if I rebooted the vCenter server….

Like really…. ok don’t look at the logs I provided. So ignoring the email for now to actually fix the problem. At this point I looked at the logs my self for the host I was currently working on and noticed one entry which should be shown at the summary page of the host.

“Scratch location not set”… well poop… you can see this KB so after correcting that, and rebooting the server again, it seemed to be working perfectly fine.

So removed from the inventory, ensured no VPXuser existed on the host, restarted the services, and re-added the host.

Moment of Truth

So after ALL that! I got down on my knees, I put my head down on my chair, I locked my hands together, and I prayed to some higher power to let this work.

I proceeded to enable HA on the cluster. The process of configuring HA on both host lingered @ 8% for a while. I took a short walk, in preparation for the failure, to my amazement it worked!

WOOOOOOOOO!!!

Summary

After this I’d almost recommend validating rebooting hosts before doing a vCenter update, but that’s also a bit excessive. So maybe at least try the commands on ESXi servers to ensure there’s no pending reboot on ESXi hosts before initiating a vCenter update.

I hope this blog posts helps anyone experiencing the same type of issue.

 

Creating Custom ESXi Image

Follow these steps

  1. Download Offline Bundle of ESXi Image
  2. Download Drivers E.G The Native ESXi USB NIC drivers
  3. Install PowerCLI (Set-ExecutionPolicy Remotesigned; Import-Module PowershellGet; Install-Module -Name VMware.PowerCLI)
  4. In PowerCLI connect the standard SoftwareDepot by typing:

    Add-EsxSoftwareDepot -DepotUrl <Path to zip>

  5. Get the ImageProfile list:

    Get-EsxImageProfile

  6. Clone standard ImageProfile:

    New-EsxImageProfile -CloneProfile ESXi-6.7.0-8169922-standard -Name MyProfile -Vendor <vendor>

  7.  [Only If Required] If your vib file has Acceptance Level – CommunitySupported, we need to set this Acceptance Level for our ImageProfile:

    Set-EsxImageProfile -ImageProfile MyProfile -AcceptanceLevel CommunitySupported

  8. Add our vib to SoftwareDepot:

    Get-EsxSoftwarePackage -PackageUrl <path to vib>

  9. Add our vib to ImageProfile:

    Add-EsxSoftwarePackage -PackageUrl

Error:

Search result.

Answer driver for specfic version (7.1, need 6.5)

So I downloaded the proper driver but I couldn’t figure out how to pick the right software package since the “get” command was actually already loaded the other driver, so it kept trying to add the 7.1 driver. Only thing I could think of was to close the powershell windows and start fresh…

10. Export ImageProfile to ISO image:

Export-EsxImageProfile -ImageProfile MyProfile -ExportToIso -FilePath

That was it! Sadly the laptop I wanted to use this on was still boot looping, and sadly the USB NIC “Insagnia” didn’t seem to work and was getting NFS4 client failed to load, and not network adapters found on the machine. But was worth a shot.