Fixing WordPress Security Recommendations

So… I log into my WordPress server, and under Site Health on the Dashboard there's the yellow warning… basically sayin'… "Do better."

Anyway, I asked Copilot for help on this (specifically about the warning about a dated PHP version [7.4]), and it basically said:

🛠️ Update PHP for Your WordPress Site

1. Check Your Current PHP Version

  • Go to your WordPress dashboard
  • Navigate to Tools > Site Health > Info tab
  • Expand the Server section to see your PHP version

2. Back Up Your Site First

  • Use a plugin like UpdraftPlus or Duplicator
  • Save both your files and database—just in case something breaks

3. Update PHP via Hosting Control Panel

Depending on your host, the process varies slightly:

How to update PHP, by hosting provider:

  • Bluehost: Go to Websites > Settings > PHP Version, choose the latest version
  • SiteGround: Use Site Tools > Devs > PHP Manager, select manual update
  • Hostinger: Go to Websites > Manage > PHP Configuration, pick a new version
  • cPanel hosts: Use MultiPHP Manager under the Software section

If you’re unsure where to find this, your host’s support team can guide you.

4. Test Compatibility

  • Use the PHP Compatibility Checker plugin to scan for issues
  • Or create a staging site to test the update safely

5. Update WordPress, Plugins, and Themes

  • Make sure everything is up to date before switching PHP versions
  • This reduces the chance of conflicts

Great, except that I'm self-hosting… I have no interest in my access to a service being paywalled and managed via poorly coded websites. So I followed up with Copilot, which then wanted me to add a third-party repo, in this specific case the PPA "ondrej". I had a couple of issues with this.

  1. This is known as a Frankenstein build.
  2. Managing third-party repos can cause issues, and is usually frowned upon.

Both are mentioned in this Reddit post, where a sysadmin was confused and this type of setup was instantly caught by others with experience. Which is exactly what the AI was telling me to do; its reasoning was as follows:

“Ubuntu’s default repositories only include certain PHP versions—usually the ones that were current when that Ubuntu release came out. So if you’re on, say, Ubuntu 22.04 or 24.04, you might only get PHP 8.1 or 8.2 from the official repos. Anything newer (like PHP 8.3 or 8.4)? Not there yet.

That’s why developers lean on Ondřej Surý’s PPA (ppa:ondrej/php). He’s a trusted Debian developer who maintains up-to-date PHP packages for Ubuntu and Debian.”

When asked why someone would have to install the actual maintainer's repo, instead of those builds being pushed to the OS default repos, the AI stated this:

“Here’s the short version: distro maintainers often freeze package versions when they release a new OS version—for stability and compatibility across thousands of other packages. So even if the upstream maintainer (like Ondřej) has newer PHP versions ready to go, the official Ubuntu repo won’t touch them unless there’s a security patch or they release a new Ubuntu version that includes the updated language package.”

Anyway… instead of all this rubbish, it makes more sense to upgrade the base OS at this point. So let’s do that.

🛠️ Upgrading the Base OS: Debian 11 to 12

1. Back Up Your System

You’ve already got backups—perfect. Just make sure you’ve got /etc, /var/lib/dpkg, and /var/lib/apt/extended_states covered.

2. Update Current System

sudo apt update
sudo apt upgrade
sudo apt full-upgrade
sudo apt --purge autoremove

3. Edit Your APT Sources

Replace all instances of bullseye with bookworm in your sources list:

sudo sed -i 's/bullseye/bookworm/g' /etc/apt/sources.list

If you use additional repos in /etc/apt/sources.list.d/, update those too:

sudo sed -i 's/bullseye/bookworm/g' /etc/apt/sources.list.d/*

Optionally, add the new non-free-firmware section:

sudo sed -i 's/non-free/non-free non-free-firmware/g' /etc/apt/sources.list
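Since a botched sed here can brick your package sources, I like rehearsing the substitution on a throwaway file first. A minimal sketch (the sample contents below are made up, not my real sources.list):

```shell
# Scratch copy with bullseye-style entries
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
deb http://deb.debian.org/debian bullseye main non-free
deb http://security.debian.org/debian-security bullseye-security main
EOF

# Same substitution as above, against the scratch file
sed -i 's/bullseye/bookworm/g' "$tmp"

# Eyeball the result before touching the real file
cat "$tmp"
```

Only once that output looks right would I run the same sed against the real /etc/apt/sources.list.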

4. Refresh Package Index

sudo apt update

5. Minimal Upgrade First

sudo apt upgrade --without-new-pkgs

6. Full System Upgrade

sudo apt full-upgrade

Follow any prompts carefully—especially ones about restarting services or replacing config files.

7. Reboot

sudo reboot

8. Verify Upgrade

lsb_release -a
cat /etc/debian_version

After the reboot I was back into my WordPress admin portal, and the site warning was gone. Yay… but some still remain… let's see if we can clean these up, one at a time.

First up: inactive plugins. That should be easy enough. Deleted 2 unused plugins, and sure enough the warning is gone. Same with themes, let's do that next:

Appearance -> Themes -> Theme Details -> Delete.

Older version of PHP… well, here we go again. We went from a site warning to site good, but it still wants PHP 8.3… which was discussed above. Frankenstein the build, or wait for a new base OS? I'll await Debian 13 before introducing a Frankenstein Debian build. We'll ignore this warning.

Recommended Module: ZIP

This one is a bit annoying because it can go two ways:

1) Sure, I can install this module, but if I don't need it, why would I? Just to remove this warning, I suppose.

Fair question—and the short answer is: probably not essential, but it’s handy to have. 👍

The zip module isn’t required for WordPress to function at a basic level. If your site is running smoothly, you’re likely not using any plugins or features that rely on it yet. But here’s why WordPress flags it:

📦 What the zip Module Is Used For

  • Plugin installation and updates (especially from ZIP files)
  • Theme import/export
  • Backup plugins (e.g., Duplicator, UpdraftPlus)
  • Migration tools

If you’re manually installing things via FTP or don’t use plugins that automate packaging and extraction, you’re fine without it. WordPress just lists it because a lot of plugins assume it’s present.

🧠 TL;DR

  • ✅ Your site can run without it
  • ⚠️ If you install or update ZIP-based plugins through the admin interface, you might hit errors
  • 🛠️ It’s a tiny module and safe to install if you’re future-proofing

How to?

🛠️ Enable the zip PHP Module on Debian 12

1. Install the Extension

Make sure you install the correct version for your PHP setup. If you’re using PHP 8.2 (the default in Debian 12):

sudo apt update
sudo apt install php8.2-zip

If you’ve upgraded to PHP 8.3 via SURY:

sudo apt install php8.3-zip

2. Restart Your Web Server

Depending on your setup:

sudo systemctl restart apache2

3. Verify Installation

Run:

php -m | grep zip

If it returns zip, you’re good to go.

2) Don't install it, from a security standpoint:

Security Angle: Why Not Installing zip Might Be Safer

  • Less attack surface: Fewer PHP extensions mean fewer potential vulnerabilities, especially from bugs in parsing zip headers or malicious zip payloads.
  • Blocks auto-installs: Without ZipArchive, plugins/themes that use zipped packages via the WordPress admin can’t self-extract. That forces all installs to be manual—good for auditability.
  • Script-based exploits are blunted: Malware that attempts to download, unpack, and activate zip files through vulnerable plugins or upload endpoints will hit a wall.
  • Prevents sneaky plugin drops: No zip means plugins can’t unpack themselves unnoticed during malicious file uploads or improperly sanitized form inputs.

⚠️ Trade-Offs

  • Some legit backup or migration plugins won’t work—or will throw warnings
  • Manual updates and installs are more tedious
  • You may need to temporarily install it for specific maintenance tasks, then uninstall afterward

Object Cache

This one was a bit more annoying than I wanted it to be, simply because it's got many options beyond 1) do it vs. 2) don't do it.

Right now my site runs fine, but who doesn't want a faster site… right… Right?

🔴 Redis

Pros

  • Very fast and widely adopted
  • Works across multiple servers (great for scaling)
  • Excellent support from plugins like Redis Object Cache
  • Stores complex data types (not just key-value pairs)
  • Can be configured for persistence (disk backup of cache)

⚠️ Cons

  • Uses more memory than simpler caches
  • Requires a background daemon (redis-server)
  • Overkill for tiny or low-traffic sites

🔵 Memcached

Pros

  • Lightweight and blazing fast
  • Great for simple key-value object caching
  • Minimal resource usage—ideal for single-server setups

⚠️ Cons

  • Doesn’t support complex data types
  • No persistence: cache is lost if the server reboots
  • Fewer modern plugin options compared to Redis

🟣 APCu

Pros

  • Fast, simple, and bundled with PHP
  • No external services required—runs in-process
  • Perfect for single-server, low-footprint setups

⚠️ Cons

  • Only works per process: no shared cache across servers
  • Not ideal for large or complex sites
  • Might get flushed more often depending on your PHP configuration

In my case I'm going to try Memcached. Why? I dunno…

🧰 Install Memcached + WordPress Integration

1. Install Memcached Server + PHP Extension

sudo apt update
sudo apt install memcached php8.2-memcached
sudo systemctl enable memcached
sudo systemctl start memcached

Replace php8.2 with your actual PHP version if needed.

2. Verify Memcached Is Running

echo "stats settings" | nc localhost 11211

If nc isn't available, you can use Bash's built-in TCP support instead:

exec 3<>/dev/tcp/127.0.0.1/11211
echo -e "stats\r\nquit\r\n" >&3
cat <&3

This opens a raw TCP connection and sends the stats command directly.

You should see a list of stats—if not, Memcached isn’t active.

3. Install a WordPress Plugin

The most common plugin for Memcached integration is W3 Total Cache:

  • Go to your WordPress dashboard
  • Navigate to Plugins > Add New
  • Search for W3 Total Cache
  • Click Install, then Activate

4. Configure W3 Total Cache for Memcached

  • Go to Performance > General Settings
  • Under Object Cache, enable it and select Memcached as the method
    *If Memcached is not selectable, restart the web service (apache2)
  • Do the same for Database Cache if desired
  • Save settings and purge cache

✅ Optional: Use a Drop-In Instead of a Plugin

If you want to skip plugins entirely:

  • Download object-cache.php from Memcachy
  • Place it in /wp-content/
  • Add this to wp-config.php:
    define('WP_CACHE', true);
    $memcached_servers = array(
      array('127.0.0.1', 11211)
    );
    

This gives you Memcached support without a full plugin UI.

After installing the backend modules and the plugin, and configuring Memcached as the object cache, Site Health finally removed that warning too… now the final one: HTTPS.

HTTPS

Now it’s not possible to access my site via HTTP, not externally anyway. So, why the warning?

Ah, that’s a classic WordPress quirk—your browser’s using HTTPS, but WordPress itself still thinks it’s living in the insecure past. 😅

Here’s how to fix it since you’re self-hosting:

🔐 Update WordPress URLs to Use HTTPS

  1. Log into your WordPress Admin Dashboard
  2. Go to Settings > General
  3. Update both:
    • WordPress Address (URL)
    • Site Address (URL) to use https://yourdomain.com instead of http://
  4. Scroll down and click Save Changes

But it’s greyed out…

🛠️ How to Fix It

  1. Open your wp-config.php file. It's located in the root of your WordPress installation (same folder as wp-load.php and wp-settings.php).
  2. Look for these lines:
    define('WP_HOME','http://yourdomain.com');
    define('WP_SITEURL','http://yourdomain.com');
    

    Or they might be using https:// if you’ve already set up SSL.

  3. Delete or comment them out:
    // define('WP_HOME','https://yourdomain.com');
    // define('WP_SITEURL','https://yourdomain.com');
    
  4. Save the file and reload your WordPress dashboard. The fields under Settings > General should now be editable.

mhmm, uhhh… commenting them out caused my site to simply redirect to the local loopback IP, which of course the system I'm navigating my site on isn't hosting a web server, thus the site times out.

Yeah, that redirect to 127.0.0.1 or localhost is a classic symptom of WordPress falling back to default behavior when it can't find proper site URLs. Here's what likely happened:

🧠 Why It Redirected to Localhost

When you commented out WP_HOME and WP_SITEURL in wp-config.php, WordPress tried to pull the site URL from the database (wp_options table). If those values are still set to http://localhost or http://127.0.0.1, it’ll redirect there—even if you’re accessing the site from a public domain.

🛠️ How to Fix It

✅ Option 1: Restore the Correct URLs in wp-config.php

Uncomment and update the lines like this:

define('WP_HOME', 'https://yourdomain.com');
define('WP_SITEURL', 'https://yourdomain.com');

Replace yourdomain.com with your actual domain name.

✅ Option 2: Update URLs Directly in the Database

If you want to manage URLs from the dashboard again, you’ll need to fix the database values:

  1. Access phpMyAdmin or use the MySQL CLI
  2. Run:
    UPDATE wp_options SET option_value = 'https://yourdomain.com' WHERE option_name IN ('siteurl', 'home');
    
  3. Clear your browser cache and reload the site

You can then safely remove the hardcoded lines from wp-config.php and regain control via the WordPress admin panel.
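As an aside, if WP-CLI happens to be installed (not a given on every self-hosted box, so consider this an optional sketch with a placeholder path and domain), the same fix works without raw SQL:

```
# Run from (or point --path at) the WordPress root, as the web user
wp option update home 'https://yourdomain.com' --path=/var/www/wordpress
wp option update siteurl 'https://yourdomain.com' --path=/var/www/wordpress
```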

FINALLY What a PITA… Hope this helps someone…

Small Rabbit Hole

Alllllright, I wanted to check if I was even running phpMyAdmin (I vaguely remember it was part of the TurnKey Linux setup, which I THINK is what this was originally spun up from, and which had such services baked in for ease of use)… ANYWAY, I went to try and figure this out by simply checking my server's listening ports… when I found IPv6.

🚫 IPv6

I dislike it, others disagree. If it must be supported (by now it's pretty widely adopted), or if it's something you need… ughhh, then give 'er; let the world be your oyster or some dumb shit. I personally don't like the idea of everything having a fully publicly routable IP address… if it even works that way… dunno… I still stick to IPv4 where, yes, I use NAT… ooo nooooo…

Anyway long story short I wanted to disable IPv6 on my WordPress server…

🧱 Method 1: Disable via sysctl (Persistent)

Edit the system config file:

sudo nano /etc/sysctl.conf

Add these lines at the end:

net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1

Apply changes:

sudo sysctl -p

I did this but still found listening ports on IPv6 (specifically SSH and NTP). I could have reconfigured those services, but instead…

🧨 Method 2: Blacklist the IPv6 Kernel Module

Create a blacklist file:

sudo nano /etc/modprobe.d/blacklist-ipv6.conf

Add:

blacklist ipv6

Then update initramfs:

sudo update-initramfs -u
sudo reboot

This didn’t work for me.

🧪 Method 3: Disable via GRUB Boot Parameters

Edit GRUB config:

sudo nano /etc/default/grub

Find the line starting with GRUB_CMDLINE_LINUX_DEFAULT and add:

ipv6.disable=1

Example:

GRUB_CMDLINE_LINUX_DEFAULT="quiet ipv6.disable=1"

Update GRUB:

sudo update-grub
sudo reboot

This finally worked!
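To confirm it actually stuck after the reboot, two read-only checks (assuming the GRUB method above):

```
# The boot line should now contain ipv6.disable=1
cat /proc/cmdline

# With IPv6 off, this should list nothing
ss -6 -tlnp
```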

Rabbit Hole #2: stunnel4

What is stunnel4?

Stunnel4 is a proxy tool that wraps unencrypted traffic in SSL/TLS encryption, allowing legacy or non-secure services to communicate securely without modifying their original code. Think of it as a secure tunnel that sits between a client and a server, encrypting everything that passes through.

🔐 What Stunnel4 Does

  • Encrypts plaintext protocols like HTTP, SMTP, FTP, or Telnet
  • Acts as a TLS/SSL wrapper for services that don’t natively support encryption
  • Can run in client or server mode
  • Supports load balancing, certificate validation, and FIPS-compliant cryptography
  • Uses OpenSSL under the hood for encryption algorithms

🧠 How It Works

Imagine you have a mail server that only speaks plain SMTP on port 25. You can configure Stunnel to:

  • Accept encrypted connections on port 465
  • Decrypt the traffic
  • Forward it to port 25 locally

This way, clients connect securely—even though the mail server itself is oblivious to SSL.
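As a concrete sketch of that SMTP example, the stunnel.conf stanza would look something like this (the cert path is a placeholder; untested, server mode assumed):

```
; Terminate TLS on 465 and hand plaintext to the local MTA on 25
cert = /etc/stunnel/mail.pem

[smtps]
accept = 465
connect = 127.0.0.1:25
```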

🛠️ Typical Use Cases

  • Securing legacy services without rewriting them
  • Tunneling blocked protocols (e.g. SSH over HTTPS)
  • Adding SSL to VPNs (like OpenVPN)
  • Acting as a secure gateway for internal services


"Tunneling blocked protocols (e.g. SSH over HTTPS)": this is exactly what I see my system doing on port 12320, and it shows as service stunnel4. But I don't know if I want to stop the service as a whole…

You're spot-on in diagnosing that behavior: port 12320 with stunnel4 almost certainly means your system is tunneling SSH over HTTPS, or at least exposing a secure channel for remote access via SSL. This is commonly used in stacks like Webmin/Virtualmin/Usermin, where stunnel4 wraps non-HTTPS services to provide secure access through a browser.

🔍 What Port 12320 Usually Does with stunnel4

  • Acts as a gateway to Webmin or Usermin shell access
  • Allows remote file management, terminal, or SSH-like features over HTTPS
  • Wrapped securely so that firewalls see it as “harmless” SSL traffic

You can confirm this by checking:

sudo systemctl status stunnel4
cat /etc/stunnel/stunnel.conf

Or:

grep -r 12320 /etc/stunnel/

That config file should reveal exactly which service is being tunneled.

⚠️ Should You Stop It?

That depends on:

  • Do you use Webmin/Virtualmin remotely?
  • Are you managing your server through a web browser interface?
  • Is 12320 exposed to the public internet or locked behind a firewall?

If you don’t use it and it’s open to the world, you could:

  • Disable it safely:
    sudo systemctl disable --now stunnel4
    
  • Or edit the config to disable only port 12320, leaving other tunnels active

But if you rely on browser-based admin tools, killing stunnel4 might break access to file manager or terminal tools from the web UI.

🧠 Pro Tip

Rather than killing the whole service, you can:

  • Comment out the 12320 tunnel in stunnel.conf
  • Reload the config:
    sudo systemctl restart stunnel4
    

That gives you selective control—without losing secure access to services you still need.


stunnel4.service:

Loaded: masked (Reason: unit stunnel4.service is masked.)
Active: inactive (dead)

grep -r 12320 /etc/stunnel/
/etc/stunnel/shellinabox.conf:12320

So, there it is: the shell (SSH) service using stunnel, which is why I see a console when I access https://wordpress:12320. But why is the service saying dead when I clearly see it bound via the ss command?

Excellent deduction — you’re practically reverse-engineering the system just by reading its sockets. 🧠🔧 Let’s break down what’s happening:

🧩 Why stunnel4.service Says Dead but the Port Is Active

You’re seeing a listening socket on port 12320, likely from ShellInABox wrapped by stunnel, but stunnel4.service reports:

Loaded: masked
Active: inactive (dead)

Here’s the likely cause:

⚙️ Stunnel is running, but not via the stunnel4.service unit

  • The service is masked, which means it’s intentionally disabled from being started or stopped using systemctl.
  • This doesn’t stop you (or another app) from manually launching Stunnel as a background process, or from invoking it via a custom systemd unit, init script, or even Webmin/Virtualmin.

🧠 ss shows the port because:

  • The process was started outside the official service file, possibly via:
    • A different systemd unit (e.g., webmin-stunnel.service)
    • A direct call in /etc/rc.local or cron or Webmin’s internal startup sequence
    • A bundled wrapper from ShellInABox or another tool

🔍 How to Find the Real Parent Process

Let’s track down what’s owning port 12320:

sudo lsof -i :12320

Or:

sudo ss -tulpn | grep 12320

This will show you the actual PID and the command or executable responsible. Likely output will say something like:

LISTEN  ... 12320  ... /usr/bin/stunnel

Once you get the PID, trace it:

ps -fp <PID>

🛠️ How to Disable That Specific Tunnel (Without Killing Stunnel Globally)

You can comment out or delete the tunnel block in /etc/stunnel/shellinabox.conf. Try this:

  1. Edit the config:
    sudo nano /etc/stunnel/shellinabox.conf
    
  2. Comment out all lines by adding #:
    # accept = 12320
    # connect = 127.0.0.1:12319
    # TIMEOUTclose = 0
    
  3. Save and exit, then reboot.

Rabbit Hole #3: Master

Using ss -plunt, port 25 shows as service "master". I can't see that in Webmin… what's going on here?

Ah, the mysterious "master" process — that’s actually Postfix, your mail transfer agent (MTA). 📨

When you run:

sudo ss -plunt | grep :25

and see something like:

LISTEN 0 100 0.0.0.0:25 0.0.0.0:* users:(("master",pid=1234,fd=12))

the "master" refers to the Postfix master daemon, which coordinates all the individual Postfix services (like smtp, pickup, qmgr, etc.).

I disabled postfix and sure enough port 25 listener went away.

I also disabled NTP service, to drop listening on port 123.
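For reference, the disabling was just this (the NTP unit name varies by distro; mine was plain ntp on Debian, yours might be ntpsec or chrony):

```
sudo systemctl disable --now postfix
sudo systemctl disable --now ntp

# Confirm ports 25 and 123 are no longer listening
sudo ss -plunt | grep -E ':(25|123) '
```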

OK, final one: I saw the system listening on UDP port 10000. There was also a TCP-based one, but it was only bound to loopback so I didn't care much about that, and research showed it's used by Webmin. However, I didn't like the UDP one because it was listening on all interfaces of the system (0.0.0.0).

After enough searching, I found it's configured by Webmin itself.

Webmin -> Webmin Configuration -> Ports and Addresses -> "Listen for broadcasts on UDP port". But what is this used for?

The “Listen for broadcasts on UDP port” option in Webmin is used for auto-discovery of other Webmin servers on the local network. Here’s what it’s doing:

📡 What It’s For

  • Webmin listens on UDP port 10000 to receive broadcast packets from other systems running Webmin.
  • These packets are part of the Webmin Servers Index module, which lets you:
    • Scan your LAN for other Webmin instances
    • Automatically add them to a central dashboard
    • Optionally control them via RPC (Remote Procedure Calls)

🧠 Why It’s Optional

  • If you’re not using clustered Webmin setups or don’t need to discover other servers, it’s safe to disable.
  • Disabling it stops Webmin from binding to UDP 10000, which:
    • Reduces your attack surface
    • Cleans up your ss output
    • Avoids confusion when auditing open ports

✅ What Happens When You Turn It Off

  • Webmin will no longer respond to broadcast discovery requests
  • You can still manually add other servers to the Webmin Servers Index if needed
  • The TCP listener on port 10000 (for HTTPS access) remains unaffected

Hope all these tweaks help someone.

Upgrading Windows 10 2016 LTSB to 2019 LTSC

*Note 1* – This retains the Channel type.
*Note 2* – Requires a new Key.
*Note 3* – You can go from LTSB to SA, keeping your files, if you specify a new key.
*Note 4* – LTSC versions.
*Note 5* – Access to ISOs. This is hard, and most places state to use the MS download tool. I, however, managed to get the image and key thanks to having an MSDN (aka Visual Studio) subscription.

I attempted to grab the 2021 Eval copy and ran the setup exe. When it got to the point of wanting to keep existing files (aka upgrading), it would grey them all out… 🙁

So I said no to that, and grabbed the 2019 copy, which when running the setup exe directly asks for the key before moving on in the install wizard… and which seems to let me keep existing files (upgrade) 🙂

My enjoyment was short-lived when I was presented with a nice Windows-update-failed window.

Classic. So the usual, “sfc /scannow”

Classic. So fix it: "dism /online /cleanup-image /restorehealth"

Stop and disable the Update service, then clear the cache:
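For the record, the cache clear itself went roughly like this from an elevated command prompt (the standard SoftwareDistribution reset; treat it as a sketch and make sure the services actually stop before deleting):

```
net stop wuauserv
net stop bits
rd /s /q %windir%\SoftwareDistribution
net start bits
net start wuauserv
```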

Scan system files again, “sfc /scannow”

Reboot, make sure the system still boots fine; check. Do another sfc /scannow; returns 100% clean. Run Windows Update (after enabling the service); it comes back saying 100% up to date. Run the installer…

For… Fuck… Sakes… what logs are there for this dumb shit? Log files created when you upgrade Windows 11/10 to a newer version (thewindowsclub.com)

setuperr.log (same location as setupact.log): Data about setup errors during the installation. Review all errors encountered during the installation phase.

Coool… where is this dumb shit?

Log files created when an upgrade fails during installation before the computer restarts for the second time.

  • C:\$Windows.~BT\Sources\panther\setupact.log
  • C:\$Windows.~BT\Sources\panther\miglog.xml
  • C:\Windows\setupapi.log
  • [Windows 10:] C:\Windows\Logs\MoSetup\BlueBox.log

OK checking the log…..

Lucky me, something exists as documented. Count my graces; what's this file got for me?

PC Load Letter? WTF does that mean?! While it's not listed in this image, it must have been resolved, but I had a line that stated "required profile hive does not exist". I managed to find this MS thread on the same problem, and thankfully someone in the community came back with an answer: create a new local temp account, and remove all old profiles and accounts on the system (this might be hard for some; it was not an issue for me). Sadly, I still got "Windows 10 install failed".

For some reason, the next one that stuck out like a sore thumb for me was "PidGenX function failed on this product key", which led me to this thread all the way back from 2015.

While there’s a useless comment by “SaktinathMukherjee”, don’t be this dink saying they downloaded some third party software to fix their problem, gross negligent bullshit. The real hero is a comment by a guy named “Nathan Earnest” – “I had this same problem for a couple weeks. Background: I had a brand new Dell Optiplex 9020M running Windows 8.1 Pro. We unboxed it and connected it to the domain. I received the same errors above when attempting to do the Windows 10 upgrade. I spent about two weeks parsing through the setup error logs seeing the same errors as you. I started searching for each error (0x800xxxxxx) + Windows 8.1. Eventually I found one suggesting that there is a problem that occurs during the update from Windows 8 to Windows 8.1 in domain-connected machines. It doesn’t appear to cause any issues in Windows 8.1, but when you try to upgrade to Windows 10… “something happened.”

In my case, the solution: Remove the Windows 8.1 machine from the domain, retry the Windows 10 upgrade, and it just worked. Afterwards, re-join the machine to the domain and go about your business.

Totally **** dumb… but it worked. I hope it helps someone else.”

Again, I'm free to try stuff, so since I was testing, I cloned the machine and left it disconnected from the network, then under computer properties changed it from domain to workgroup (which means it doesn't remove the computer object from AD; it just removes itself from being part of the domain). After this I ran another sfc /scannow just to make sure no issues arose from the VM cloning. With 100% green, I ran the installer yet again, and guess what… Nathan was right. The update finally succeeded. I can now choose to rename the PC and rejoin the domain, or whatever, but the software on the machine shouldn't need to be reinstalled.

Another fun dumb day in paradise, I hope this blog post ends up helping someone.

 

Updating Power CLI 12

If you did an offline install, you may need to grab the package files from an online machine. Otherwise, you may have come across a warning about an existing instance of PowerCLI when you go to run the main install cmdlet.

When I first went to run this, it told me the version would be installed “side-by-side” with my old version. Oh yeah, I forgot I did that…

Alright, so I use the force toggle, and it fails again… Oi…

Lucky for me, the world is full of bloggers these days, and someone else had also come across this problem for the exact same reason.

VMware.PowerCLI install update error – Install-Package: Authenticode issuer | vGeek – Tales from real IT system Administration environment (vcloud-lab.com)

If you want all the nitty details check out their post, the main part I need was this one line, “This issue can be resolved deleting modules from the PowerShell modules folder inside Program Files. Once the modules folder for VMware are deleted try installing modules again, you can also mention the modules installation scope.”

AKA, delete the old one, or point the install to another location. He states he needed the old version but doesn't specify for what. Anyway, I'll just delete the old files.

So, at this point I figured I was going to have a snippet of a 100% clean install, but no, again something happened, and it is discussed here.

If I’m lucky I will not need to use any of the conflicting cmdlets and if I do; I’ll follow the suggestions in that thread.

OK, let's move on. Well, the commands were still not there; looks like this has to succeed, and there's no prefix option during install, only import, which you can only do after install. The other option was to clobber the install. Not interested, so I went into Windows add/remove features and removed the PowerShell module for Hyper-V. No reboot required, and the install worked.

The Hyper-V MMC snap-in still works for most of my needs. Now I finally have the 2 required pre-reqs in place.

Step 2a) Connect to the server via PowerCLI

Why did this happen?

A: Because of the self-signed certificate on vCenter; the system accessing it doesn't have the vCenter's CA certificate in its own trusted CA store.

How can it be resolved?

A: Option 1) Have a proper PKI deployed, get a properly signed cert for this service from the CA admin, and assign the cert to the vCenter mgmt services. This option is outside the scope of this post.

Option 2) Install the self-signed CA cert into the trusted CA folder of the machine store on the machine running PowerCLI.

Option 3) Set the PowerCLI parameter settings to prompt to accept untrusted certificates.

I chose option 3:

Make sure when you set your variable to use single quotes and not double quotes (why this parameter takes System.String instead of SecureString is beyond me).
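For completeness, option 3 boils down to one cmdlet; a sketch (the server name is a placeholder, and Prompt can be swapped for Warn or Ignore depending on how loose you want to be):

```
Set-PowerCLIConfiguration -InvalidCertificateAction Prompt -Scope User
Connect-VIServer -Server 'vcenter.example.local'
```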

While I understand the importance of PowerShell for scripting, automation, and mass-deployment situations, requiring it to apply a single toggle setting is a bit ridiculous. Take note, VMware; do better.

vSphere HA Agent cannot be correctly installed or configured… again

Story

Another vCenter Patch, Another problem 😀

This seems to be a reoccurring story these last couple posts…

Error on Host

This time, after updating, a host in the cluster again had the error message.

Troubleshooting

Unlike the last time this happened, the event log wasn't as blatantly flooded with complaints about /tmp being full. Checking the host with

vdf -h

showed it only 90% full. Still pretty high, which might explain the one log event I did see about it:

The ramdisk 'tmp' is full. As a result, the file /tmp/img-stg/data/vmware_f.v00 could not be written

This was in the log right after the following event, about attempting to install a base ESXi image:

Installing image profile '(Updated) HPE-ESXi-Image' with acceptance level checking disabled

This seemed a bit weird, but I couldn't find any info other than what's usually a very Microsoft-type answer of "you can just ignore it", or "usually this is not an issue; it's just vCenter saying it is connecting to the ESXi host and installing its agent".

OK I guess… moving on… the very next error event was:

Could not stage image profile '(Updated) HPE-ESXi-Image': ('VMware_bootbank_vmware-fdm_7.0.2-18455184', '[Errno 28] No space left on device')

Huh. Now, note this host was installed running the official VMware image provided by HPE for this exact hardware, supported by the VMware HCL. So there should be no funny business. However, I feel maybe there's a bit of the known HPE bug, as mentioned the last time this happened; it just hasn't fully flooded /tmp yet.

Lil Side Trail

So, a couple of things to note here. First, the ESXi image is installed on a USB/SD-card style setup, so as is well known you should define the persistent log location as well as the scratch location. However, not many sources specify changing the system swap location.

  1. Persistent Log: VMware KB; Tech Blogger
    (Most standard ESXi log info)
  2. Scratch Location: VMware KB; Tech Blogger 1; Tech Blogger 2
    (Crash logs, support log creation)
  3. Swap Location: VMware Doc 1 (Configure); VMware Doc 2 (About); a tech blogger who seems to regurgitate the exact About page from VMware.
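For reference, all three locations above can be pointed at persistent storage from the ESXi shell. This is only a sketch: the datastore name and paths are placeholders, and you should confirm the option names against your ESXi version before running anything.

```shell
# Sketch (ESXi shell): point the three locations at a persistent datastore.
# "datastore1" and the paths are placeholders, not values from the post.

# 1. Persistent log location
esxcli system syslog config set --logdir=/vmfs/volumes/datastore1/logs/host1
esxcli system syslog reload

# 2. Scratch location (takes effect after a reboot)
vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string /vmfs/volumes/datastore1/.locker-host1

# 3. System swap location
esxcli sched swap system set --datastore-enabled true --datastore-name datastore1
```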

However, researching this further, lots of posts on Reddit mentioned that each VM’s swap file lives in its VM directory, so if you use a shared datastore they will reside there, and I shouldn’t see swap-usage issues at the host level at all.

If you look in the vCenter web UI on an ESXi host, there are two options available: VM Swap and System Swap.

The VMware docs don’t seem to accurately describe the difference between these two options.

Looking up the error about not being able to stage the file, I found this one blog post, which of course mentioned changing the swap location to get past the error…

The main thing mentioned by the blogger is “The problem is caused by ESXi not having enough free space available to extract the installation packages.”, but he failed to specify where exactly that is, and the event log didn’t specify it either. Since his solution was to adjust the system swap location, it raises the question: is the package extraction location the system swap location?

Since the host settings seem to be specified only via the alternative option checkboxes:

Can use host cache
Can use datastore specified by host for swap files

It’s still not fully clear to me where the swap is actually located with these assumed default settings, whether extraction of the image actually uses swap, or why the same image already on the ESXi host gets re-applied when you upgrade vCenter.

Resolution

So many questions, so few answers. Unfortunately, I’m going to go on a bit of a whim and simply try exactly what I did before: clear the file from /tmp that was taking up a lot of its space, and install the HPE patch for the known bug, in hopes it resolves the issue…

Sure enough, the exact same thing happened as in my initial post; it just seems /tmp wasn’t completely full yet, so the symptoms were a bit different.

  1. vMotion all VMs to another host in the cluster (amazingly, vMotion works without issue)
  2. Ignore the HA warning on the VMs migrated
  3. Place Host into Maintenance mode (This clears the HA warnings on the VMs and cluster)
  4. Verify /tmp has room. Update any ESXi packages from the hardware vendor if applicable.
  5. Reboot the host.
  6. Exit Maintenance mode.
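Steps 3 through 6 can also be driven from the ESXi shell once the VMs are off the host. A rough sketch, with the depot path as a placeholder for whatever patch bundle your hardware vendor provides:

```shell
# Sketch (ESXi shell) of steps 3-6; paths are placeholders, not from the post.
esxcli system maintenanceMode set --enable true
vdf -h                                   # verify the tmp ramdisk has room
# rm /tmp/<large-leftover-file>          # clear space first if it doesn't
esxcli software vib update -d /vmfs/volumes/datastore1/vendor-patch-depot.zip
esxcli system shutdown reboot --reason "clearing pending reboot after patching"
# once the host is back up:
esxcli system maintenanceMode set --enable false
```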

Hope this helps someone who might see the same type of error events in their ESXi event logs.

Microsoft Exchange Vulns and Buggy Updates

I’ll keep this post short. If you are unaware, there’s been a big hack on exchange servers.

Microsoft Exchange hack, explained (cnbc.com)

I ran the IOC scripts from MS to check whether I was affected, and it appears I may have been.

Initiated my own lab DRP/BCP. Informed myself that services would be down, and restored AD and Exchange from backups taken before the logged incidents. Took the OWA reverse proxy rule down till the servers could be fully patched.

Booted restored VMs, patched, hopefully good to go.

Then, while doing Patch Tuesday updates, users’ laptops started failing to boot after installing KB5000802. All I could find was news of printing causing BSODs. Classic… BSODs! In my case it was causing boot crashes. I did my usual trick but got a different error, then ran the Windows Startup Repair process, which amazingly got it to boot, but it said it had reverted an update (the one above). I attempted the install again: same problem. I didn’t want to re-image, as it was a VIP’s machine and time was of the essence. On a whim, I decided to install all the latest drivers from the laptop OEM vendor (in case something was using MS drivers instead); after that I tried the update again and got a successful install. Phewwww!

VMware HA down after 6.5 patch

The Story

So the other day I tested the latest VMware patch that was released as blogged about here.

Then I ran the patch on a client’s setup, which was on 6.5 instead of 6.7. I didn’t think it would be much different, and in terms of steps to follow, it wasn’t.

The first thing to note, though, is to validate the vCenter root password to ensure it isn’t expired (on 6.7u1 and newer); otherwise the updater will tell you the upgrade can’t continue.

Log into vCenter (SSH/console). Once in the shell:

passwd

To see the status of the account:

chage -l root

To set the root password to never expire (do so at your own risk, or only if allowed by your policies):

chage -I -1 -m 0 -M 99999 -E -1 root

Install patch update, and reboot vCenter.
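If you want to catch the expired-password case before the updater complains, the `chage -l` output is easy to check. Here a sample line stands in for a real vCenter shell, and the date is made up for illustration.

```shell
#!/bin/sh
# Sketch: spot an expiring root password before launching the updater.
# Sample `chage -l root` output stands in for a real vCenter shell here.
sample='Password expires					: Mar 01, 2021'

# Grab everything after the ": " separator on the "Password expires" line.
expires=$(printf '%s\n' "$sample" | awk -F': ' '/Password expires/ { print $2 }')

if [ "$expires" != "never" ]; then
  echo "root password expires: $expires - fix this before patching"
fi
```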

All is good until…

ERROR: HA Down

So after I logged into the vCenter server, an older cluster was fine, but a newer cluster with newer hosts showed a couple of errors.

For the cluster itself:

“cannot find vSphere HA master”

For the ESXi hosts

“Cannot install the vCenter Server agent service”

So off to the internet I go! I also asked people on IRC if they had come across this: crickets. I found this blog post, and all the troubleshooting steps unfortunately led to no real solution. It was a bit annoying that it said “it could be due to many reasons such as…” and listed them off, with a vCenter update being one of them, but then just went through common, standard troubleshooting steps. Which is nice, but none of them are analytical enough to determine which root cause applies, so you can actually resolve it instead of throwing darts at a dartboard.

Anyway, I decided to create an SR with VMware and uploaded the logs. Meanwhile I kept looking for an answer, and found this VMware KB.

Funnily enough, the resolution states… “This issue is resolved in vCenter Server 6.5.x, available at VMware Downloads.”

That’s ironic; I just updated, which is what caused this problem, hahaha.

Anyway, my colleague noticed the “workaround”…

“To work around this issue in earlier versions, place the affected host(s) in maintenance mode and reboot them to clear the reboot request.”

I didn’t exactly check the logs and wasn’t sure if there actually was a pending reboot, but figured it was worth a shot.

The Reboot

So, vMotion all VMs off the host, no problem, put into maintenance mode, no problem, send host for reboot….

Watching the screen: still at the ESXi console login… monitoring sensors indicate the host is inaccessible, pings are still up, and the Embedded Host Controller (EHC) is unresponsive… ugghhhh, ok…

Pressing F2/F12 at the console gives “direct management has been disabled”… like, uhhh, ok…

I found this, a command to hard reboot, but I couldn’t SSH in and I couldn’t access the Embedded Host Controller… so there was no way to enter it…

reboot -n -f

Then I found this, with the same problem… the solution: like any computer in a stuck state, hard shutdown. So I pressed the power button for 10–20 seconds till the server was fully off, then powered it back on.

The Unexpected

At this point I was figuring the usual: it comes back up and shows up in vCenter. Nope. Instead the server showed as disconnected in vCenter, in a downed state. I managed to log into the Embedded Host Controller, but found the VMs I had vMotioned off still on it, in a ghosted state. I figured this wouldn’t be a problem; after reconnecting to vCenter it should pick up on the clean state of those VMs being on the other hosts.

Click reconnect host…

Error: failed to login with the vim admin password

Not gonna lie, at this point I got pretty upset. You know, HULK SMASH! type deal. However, instead of smashing my monitors, which wouldn’t have been helpful, I went back to Google.

I found this VMware KB, along with this thread post, and pieced together a resolution from both. The main thing was that the KB wanted to reinstall the agents, while the thread post suggested most people just needed the services restarted.

So I removed the host from vCenter (Remove from Inventory), removed the ghosted VMs via the EHC, enabled SSH, and restarted the hostd and vpxa services:

/etc/init.d/hostd restart

/etc/init.d/vpxa restart

Then re-added the host to vCenter and to the cluster, and it worked just fine.

The Next Server

Alright, so now I vMotion all the VMs to this freshly rebooted host, so we can do the same thing on the other ESXi host to make sure they are both good.

I went to set the host into maintenance mode and reboot; sure enough, this server hung at the reboot just like the other host. I figured the process was going to be the same here; however, the results actually were not.

This time the host actually did reconnect to vCenter after the reboot, but it was not in maintenance mode… wait, what?

I figured that was weird and would give it another reboot. When I went to put it into maintenance mode, it got stuck at 2%… I was like, ughhhh, wat? The weird part was it even stated there were orphaned/ghosted VMs, so I thought maybe it had some at this point.

Googling this, I didn’t find an answer, and just when I was about to hard reboot the host again (after 20 minutes), it succeeded. I was like, wat?

Then I sent a reboot, which I think took about 5 minutes to apply; all kinds of weirdness was happening. While it was rebooting I disconnected the host from vCenter (not removed), waited for the reboot, then accessed this host’s EHC.

It was at this point I got a bit curious about how you determine whether a host needs a reboot, since vCenter didn’t tell me and the EHC didn’t either… How was I supposed to know, considering I didn’t install any additional VIBs after deployment? I found this Reddit post with the same question.

Some weird answers, with the best being:

vim-cmd hostsvc/hostsummary|grep -i reboot
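That one-liner can be wrapped into a small check. Since `vim-cmd` only exists on ESXi, a sample snippet of hostsummary output stands in for the real command here; `rebootRequired` is the same field the grep above matches.

```shell
#!/bin/sh
# Sketch: decide whether a host has a pending reboot from
# `vim-cmd hostsvc/hostsummary` output (sample text stands in for it).
summary='   rebootRequired = true,
   inMaintenanceMode = false,'

if printf '%s\n' "$summary" | grep -q 'rebootRequired = true'; then
  echo "host has a pending reboot"
else
  echo "no reboot pending"
fi
```

On a live host, pipe the real `vim-cmd hostsvc/hostsummary` output in instead of the sample.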

The real thing that made me raise my brow was this convo bit:

Like, wat?!?!?! hahaha. Anyway, by this time I got an answer from VMware support, and they simply asked when the error happened, whether I had a snippet of the error, and whether I had rebooted the vCenter server…

Like, really… ok, don’t look at the logs I provided. So, ignoring the email for now to actually fix the problem, I looked at the logs myself for the host I was currently working on and noticed one entry that should have been shown on the host’s summary page.

“Scratch location not set”… well, poop… see this KB. So after correcting that and rebooting the server again, it seemed to be working perfectly fine.

So I removed it from the inventory, ensured no vpxuser existed on the host, restarted the services, and re-added the host.

Moment of Truth

So after ALL that! I got down on my knees, I put my head down on my chair, I locked my hands together, and I prayed to some higher power to let this work.

I proceeded to enable HA on the cluster. The process of configuring HA on both hosts lingered at 8% for a while. I took a short walk in preparation for the failure; to my amazement, it worked!

WOOOOOOOOO!!!

Summary

After this, I’d almost recommend validating that hosts reboot cleanly before doing a vCenter update, but that’s also a bit excessive. So maybe at least run the commands on your ESXi hosts to ensure there’s no pending reboot before initiating a vCenter update.

I hope this blog post helps anyone experiencing the same type of issue.