Rename a vSwitch in vSphere

I noticed I had named some vSwitches in the new hosts builds I had. This was nice. However I also noticed I couldn’t name a vSwitch when creating in vCenter. So how did I name them.

I quickly searched google, but the primary results were not what I was expecting….

1, 2, 3, 4

All of which either stated to edit the host config file, or use cli commands… well I know I did do the first thing, and I don’t remember using the CLI. Also I don’t remember having to reboot the host. The only diff I can think of is that I named them at creation, not after the fact, but the vCenter wizard has no option for that… but sure enough I checked my documentation.

If you login into a host directly, you can name a vSwitch right when creating it. This just requires to be done on each host in the cluster. It’s nice but is it worth it?

Once you have it setup it is really nice to have named vSwitches.

of course this doesn’t include dvSwitches, as those you can name and usually require uplinks to communicate between hosts. However you can still deploy a test dvSwitch to multiple hosts without an uplink though those VMs would only be useful on a single host… which defeats the purpose of it, but you can move the VMs as a whole group between VMs, and if that “Test switch” need any change it would be distributed between all hosts.

HPE SSD Firmware Bug (Critical)

I’m just gonne leave this right here…..

https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00092491en_us

I wonder who they outsourced the firmware code out to back in 2015….

IMPORTANT: This HPD8 firmware is considered a critical fix and is required to address the issue detailed below. HPE strongly recommends immediate application of this critical fix. Neglecting to update to SSD Firmware Version HPD8 will result in drive failure and data loss at 32,768 hours of operation and require restoration of data from backup in non-fault tolerance, such as RAID 0 and in fault tolerance RAID mode if more drives fail than what is supported by the fault tolerance RAID mode logical drive. By disregarding this notification and not performing the recommended resolution, the customer accepts the risk of incurring future related errors.

HPE was notified by a Solid State Drive (SSD) manufacturer of a firmware defect affecting certain SAS SSD models (reference the table below) used in a number of HPE server and storage products (i.e., HPE ProLiant, Synergy, Apollo, JBOD D3xxx, D6xxx, D8xxx, MSA, StoreVirtual 4335 and StoreVirtual 3200 are affected).

The issue affects SSDs with an HPE firmware version prior to HPD8 that results in SSD failure at 32,768 hours of operation (i.e., 3 years, 270 days 8 hours). After the SSD failure occurs, neither the SSD nor the data can be recovered. In addition, SSDs which were put into service at the same time will likely fail nearly simultaneously.

To determine total Power-on Hours via Smart Storage Administrator, refer to the link below:

Smart Storage Administrator (SSA) – Quick Guide to Determine SSD Uptime

Yeah you read that right, drive failure after a specific number of run hours. Yeah total drive failure, if anyone running a storage unit with these disks, it can all implode at once with full data loss. Everyone has backups on alternative disks right?

Lesson and review of today is. Double check your disks and any storage units you are using for age, and accept risks accordingly. Also ensure you have backups, as well as TEST them.

Another lesson I discovered is depending on the VM version created will depend which ESXi host it can technically be created on. While this is a “DUH” thing to say, it’s not so obvious when you restore a VM using Veeam and Veeam doesn’t code to tell you the most duh thing ever. Instead the recovery wizard will walk right through to the end and then give you a generic error message “Processing configuration error: The operation is not allowed in the current state.” which didn’t help much until I stumbled across this veeam form post

and the great Gostev himself finishes the post with…

by Gostev » Aug 23, 2019 5:52 pm

“According to the last post, the solution seems to be to ensure that the target ESXi host version supports virtual hardware version of the source VM.”

That’s kool…. or about… why doesn’t Veeam check this for you?!?!?!
Once I realized what the problem was, I simply restored the VM with a new name on the same host it was backed up from (Which was on a 6.5 ESXi host) and I was attempting to restore the VM on a 5.5 ESXi host. Again, after I realized I had created the VM under the options that I picked a higher VM level allowing it only to be used with higher versions of ESXi it was like again… “DUHHH” but then it made me think, why isn’t the software coded to check for such an obvious pre-requisite?
Whatever nothings perfect

Getting A+ Qualys Report

As some of you may know you can validate the security strength of your HTTPS secured website using https://www.ssllabs.com/ssltest/index.html

A good read on Perfect Forward secrecy

I use HA Proxy with Lets Encrypt for my sites security. While setting up those to plugins to work together apparently by default it’s not using the most secure suites ok the dev shows how you can adjust accordingly… but which ones? This what I get by default:

Phhh only a B, lets get secure here.

Little more searching I find the base ssl suites from mozilla config generator

which gave me this for the string of suites

ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384

But then ssllab report still complained about weak DH… so had to remove the final two options in the list leaving me with this

ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305

Now after applying the setting on the listener I get this!

Mhmmm yeah! A+ baby but looks like some poor saps may not be able to see my site:

Too bad so sad for IE on older OS’s, same with iOS (Macs) running older Safari.

Now let’s tackle DNS CAA well I was going to discuss how to set this up, but the linked site covers it well. Since my external DNS provider was listed in the supported providers, I logged into my providers portal to manage my DNS, and sure enough the wizard was straight forward to grant Lets Encrypt authority to sign my certificates! Finally one that was actually really easy! Wooo!

Now I suppose I can eventually play with experimental TLS1.3 but I’ll save that for another post! Cheers!

 

Mitigating from CVE-2018-3646 on ESXi 6.7

To keep this short, new VCSA 6.7 has VUM built in. No more Flash needed. Yay finally.

So I upload the latest 6.7u3 image, create my baseline, and test remedy one of my simple laptop hosts. After system reboots and comes back on VCSA dashbaord… uhhh what’s with this yellow warning icon…. Summary…

OK great, so after years of Intel being ahead of AMD, looks like at the cost of some pretty shitty shortcuts. and these shortcuts have caused Intel a huge problem, and pretty everyone else. Since it affected everyone, everyone has some form of write up on it. In this case VMware has coded the above warning, with a reference to this KB, so you know read that if you want a dry overview.

As you can see I have what shows as 4 logical processors, but after applying the mitigation (setting VMkernel.Boot.hyperthreadingMitigation to true on the host advanced settings) and rebooting…

Yay the warnings gone, but apparently so are half my logical processors?

If your wondering why they didn’t enable this by default is due system resource management, which of course, is exactly what vSphere is. Since it affects the available resource of the host it may not be able to accommodate the workload it was originally designed for. In my case it’s a lab and my work load is obviously very light, and this isn’t an issue for me.

Was it worth the mitigation? I don’t really know at this point as I’m unaware of any easy simple tactics any attacker could use to attack my footprint. At the same type CPU resource is not my major constraint, it’s usually memory.

For now better safe than sorry. In my next post I hope to cover the vCenter upgrade path and an error that happened to me along the way, luckily it wasn’t that hard to recover from. 🙂

Cheers!

ESXi 6.5 Stuck on Initializing Scheduler
PSOD PCPU1 could not start

I’m making this post short to note this odd experience with this host build.

First weird thing was when trying to install ESXi I couldn’t get past vmkusb not sure what it was about but only found this decent reddit post with the same problem.

In short he noticed it would only get past this if a second USB was plugged into the USB2 ports, and sure enough that worked for me too. strange….

Then a couple days later while doing some more test boots, I get a Purple screen of death, complaining about the PCPU1 not starting or some shit, ugh again lets see what others have to say… well I found this vmware thread on it, he basically stated that resetting the BIOS settings worked, after farting around with some bios settings, I had other failed boots and my CPU and system was rather hot. I let everything cool down, added more powerful fans and tried again after resetting the BIOS to factory and much like the post it worked.

After finalizing the build a little more, I switched to an old 2.5″ HDD.. same problem but I noticed it gets stuck on initializing scheduler before PSODing, while I searched for what might be up with that I found this

It did help shit, same problem next boot, instead of just resetting the BIOS I played around with a couple more settings like I enabled intel AES-NI which helps for CPU offloading of AES computations. and another one which I sadly forgot, and then my next boot was fine. saving this at this point in case it comes back again.

Upgrading my ASUS RT-N16

The ASUS RT-N16

I love this thing, I remember when I first read my first blog posts about the specs, and what it could all do…

Wireless

Wireless Frequency Bands 2.4 GHz
Number of Antennas 3
WLAN Mode 802.11n
Transmit Power 15.5 to 19.5 dBm
Antenna Placement External

Interface

Ports 1 x Ethernet (RJ45) (Uplink)
4 x 10/100/1000 Mb/s Gigabit Ethernet (RJ45)
2 x 480 Mb/s USB Type-A

Performance

Throughput 300 Mb/s
CPU 480 MHz Broadcom SoC
128 MB RAM
32 MB Flash

Security

Wireless Security WEP, WPA, WPA2

Those are some good specs for 2010, pretty much a decade ago. and most of the blogs touted DD-WRT, which I joined the form site way back in 2012… looking back at my old posts didn’t seem to get much of any help… but sure had oddities I was recently running a KongMod of DDWRT (build 22000M) Circa 2014, looking it up found out he stopped to made modded firmware for OpenWRT. I grabbed the latest DDWRT for my router using the DDWRT database factory reset settings, cleared NVRam and used IE with a windows laptop with static IP bound to port 1 on the router….

Soft Brick?

Gave the system enough time to boot, but noticed the pings were not coming back up, the Power light would flicker during boot and then stay off, while the wireless LED said lit.

I thought I may have soft bricked it, so I grabbed the stock firmware and flashing tool from ASUS website to my dismay even though I could press the restore button and have the power LED blink slowly indicating it ready for TFTP file, even the flashing tool would fail either that its not in flashing mode, or faiiled to flash. I thought I was hooped in this case and was in a soft lock loop, and thought I would have to JTAG flash it…

Then for shits (since the WiFi LED was on) I wondered if it was broadcasting… and when I checked for a available WiFi on my phone I was shocked to see it was, I connected, shocked again, and could ping the router… wait what??

Solution?

Sure enough I could see the DDWRT web interface.. I was stumped and started to Google, but only found one post that was dead on… but as I figured the solution provided did not work for me, well the vlan1 check setting BS.

There was another bunch of posts stating to add commands “swconfig dev eth0 set enable_vlan 1” or some crap, yeah that didn’t work either. Even though people said don’t do it, I decided to use the DDWRT web interface Firmware Update section over WiFi (either was would have to JTAG flash if it failed) So at first I simply used the K2.6 Mini build instead of the mega, after the flash the exact same shit, but the power light at least stayed on. but again could only connect via WiFi. Since the only other answer was “I flashed a newer firmware” which is a timeless statement lol which exact version who knows, and I sadly didn’t have the old Kong build if I simply wanted to go back.

AdvancedTomato FTW

I was about to try OpenWRT when I decided to look at Tomato again… so flashed it via the DDWRT firmware update section (Fuck you DDWRT) and to my amazement it came up perfectly, Wifi was fine, and I could ping it on a physical LAN port again. Woooo!

Since I wasn’t used to the interface I did need a bit of a hand getting it setup as s simple AP again, guess it makes sense DHCP is set at the bridge so if you want to setup different NICs for different subnets and still have their own DHCP, but in my case I wanted none.

Then I read this nice post by How-To-Geek on configuring traffic monitoring, something I never had on DDWRT, so not only is the new UI a fresh change, so are some of the features. I really hope also a lot less bugs. Cause DDWRT with OTRW was buggy and a HUGE PITA.

Optware?

Well googling did show there was the possibility… and installation seemed straight forwarded enough… of course both guides being 8+ years old, wasn’t too compelling, so I checked if the source referenced script was still accessible… and it was! Nice, checking the script out I see another external reference source and check it out too, amazingly it’s still reference-able too.

So I followed along, starting by first attempting to create a partition (512 MB, labeled “optware” as ext2) I did this by USB pass-through of my USB stick to a Mint Linux VM. Then simply using gparted created my partition, I also created a 1 Gig, and 2.5 Gig ext 2 partition labeled whatever with the spare space. (I tried a 4 Gig partition, but… it failed to mount so stuck with the recommendations).

Ran the installation as suggested…

wget http://tomatousb.org/local--files/tut:optware-installation/optware-install.sh -O - | tr -d '\r' > /tmp/optware-install.sh
chmod +x /tmp/optware-install.sh
sh /tmp/optware-install.sh

I did this of course after verifying that indeed my partition was mounted as /opt, and the script ran without issue.. .amazing…

after that, I first installed htop, cause lets face it, normal top sucks…

ipkg install htop

I followed this up with the main packages I actually used, screen and irssi. This allows me to have a persistent IRC chat client (given the AP/Router doesn’t reboot)

ipkg install screen
ipkg install irssi

Add User?

Now I remember specifically having issues with DDWRT, and adding standard users with limited permissions. Specifically with the name showing up weird

So I searched quickly to see if it was possible, and if so how people were doing it

much like the guy in the first link, I didn’t quite follow what was going on and then after checking each line, eventually it made sense. (Basically defining specific environment variables, and special actual files with embedded lines that are all saved to NVRAM, then a script (3 lines) is run to populate the linux user list)

I did add “adduser” but much like mentioned elsewhere it would complain about not having “passwd”, there was no packages for “mkpasswd”, or “makepasswd”. I wasn’t in the mood to change my root system password and run a single stupid line to set one users password … :S (

sed -n -e "s,^root:,$UNAM:,p" < /etc/shadow >> /etc/shadow.custom

)

Instead much like the alternative suggestion on the page itself “You can also cut & paste passwd and shadow entries from another linux box.” which is exactly what I did, using my Linux Mint VM, I used openssl passwd with a salt to generate a MD5 hashed password.

Now I was able to SSH in with my new non root account, YAY!

Now according to the source “These commands need only be done once for each custom username. Thereafter, the user will always be created every time the router boots up. To delete a user, edit /etc/passwd.custom and /etc/group.custom and delete the line with that username, then save them to nvram.”

OK…. I’m going to reboot now…

mhmmm is it going to work….? Oooo… account is there after reboot in passwd file… and line exists for account in shadow file, and home dir exists… lets log in… well shit… the password didn’t save… still same as root even though they differ in the shadow file… k… let’s make a new one… save in shadow file, relogin, yup password changed. and now change in shadow.custom and save, and reboot…. arrrrggggg C’MON

Maybe you have to run that setfile commands when changing a file set to nvram? second try… there we go! success.

Screen, Irssi and the fun Stuff

It’s been a while since I had to reconfigure this stuff so someone’s blog to help me along the way, and it’s rather old now… but still good stuff and this simple one

and then run irssi 😀 (by typing irssi and hitting enter)

Silly Rabbit Trix are for… I mean Irssi I’m following a guide already…

/network list

to make adding to Freenode easier instead of having to type /connect irc.freenode.net I’m going to setup a reference much like the existing reference from the above command:

The above shows names, but not there DNS lookups which is the server list

To add our reference:

/server add -auto -network Freenode irc.freenode.net

Now that we have an auto connecting server, we’d like to specify the user and login details:

/network add -nick Zew -autosendcmd "/msg nickserv IDENTIFY *******" Freenode

Now that my usual helpful sources are added (you can always catch me in one of these places) let’s test it all out… run /quit and then irssi again. Which worked! I was joined to my server, authed and joined to my channels 😀

Use Ctrl + X to switch connected servers, Esc + left or Right to move channels.

and finally “Ctrl + A then Ctrl + D . Doing this will detach you from the screen session which you can later resume by doing screen -r .”

See you on IRC!

 

Updates to the Zewwy PiCade

Zewwy PiCade Updates

It’s been a while since I completed my PiCade now a lot can happen in a year, and sadly I have not done more guides around the OS lakka I really hope to provide more of these for anyone who is simply running Lakka on whatever hardware they choose. Since the OS is compiled for many SOCs as well as the plentiful x86/64 architecture the hardware choices are fruitful.

Anyway, so let me cover the first couple things… Hardware Updates….

Hardware Updates

1) I had recently brought my arcade to a local Anti-social hosted by non other than the amazing local Skullspace. (If you are in Winnipeg check these guys out, they are amazing. And funny in the main slides all three people are friends I know well.) While there wasn’t many places to put it I decided to test the build by placing it right ontop of a subwoofer/amplifer (yeah test it hardcore), while it did last the majority of the night, by the end it had started to flicker the screen on n off. a good wack would usually bring it back (some poor connection somewhere). I had intially thought it maybe self built power extender for the screen board, but after bypassing it I was still experiencing the issue. Having everything on one board, made pulling it out for diagnostics a breeze. and sure enough… simply checking all the connectors, a loose piece was found, I’m not sure exactly what this piece was (googling the numbers on seem it maybe an inductor) either way it had one leg that was clearly broken from its mating surface. So I soldered it back, but wanted to ensure it hold this time, I didn’t have a shrink tube big enough so I simply grabbed some plastic wrap (food wrap) and gave it a good wrap up….

and sure enough it’s been great since. Not sure I’ll put it on a subwoofer again, but it did otherwise old up great! 😀 That’s a pretty good build I’d say…

2) as you can see from the above picture I also moved the screen input selector (still annoyed this board has this considering the other input types are not even on the board (but I believe they use the same logic IC regardless to save costs) to the top, so it’s clickable using a toothpick (I hope to 3D print a button).

3) I had my buddy at Skullspace help me create an acrylic front protector since I had to pull the LCD from the glass cover and digitizer that normally comes with the screen (this can be seen as a plus or minus, I think in this case it was a plus cause the glass was cracked, but the LCD screen was fine) which worked out for my build.

4) I wanted a way to keep the “lid” shut since it was designed to lift up to take an iPad in and out, which is nice for my all on one board design as well, but only when I need to do work, or switch SD cards for trying different OS’s. Since there was no build in latching mechanism I decided to use a but of trick magic I learnt from watching a bunch of Chris Ramsay on Youtube, you know the guy who opens all the puzzles. In this case I first thought similar to the same technique I used to hold the speakers in… but then that would be ugly, it worked great for the speakers since it’s from behind and you can’t see it… so instead puzzle magic I drilled 4 tiny holes in the lid and carefully on each side making sure not to go through the side vinyl.

Then using a piece from a paper clip… and a magic from a fish tank cleaning kit…

I placed the clip in the Lid hole till it’s completely hidden, then shut the lid, and use the magnet to pull the pin out into the side piece, thus locking the lid on both sides…

as you can see I’m lifting it right up the front edge which normally would open and pivot on the rear pins but with my secret hidden pins its locked and besides the tiny bit of play as you can see, it works great. (I could have attempted a bit thicker of a paper clip, but then I risk it rubbing on the walls of the holes I drilled and the magnet may not have the power to beat the friction)

As you can also tell the new acrylic is reflective but protects the LCD screen, I also attached speaker covers, and the pot handle for the volume control. I hope to get edges 3D printed as well as the front bezel which is just a grey cardboard right now.

Now on the back is nothing more than the removable Battery which operates the unit.

Software Updates

Now Lakkas made some updates since my build which is running Rpi-2.1.x The current release is on 2.4.x so i got hopefully we get some nice updates, first thing I did was test the new build on a standalone MicroSD to avoid buggering up my current build, now my main SD card is 16 Gigs vs 4, which is ironically 4 times the size. Now I have to figure out how to install the newer version while migrating most of my settings (Joy stick bindings, and game listings and Images) with the least amount of work… mhnmmm Neat!

I’m going to follow option B!

“Manual updates

  1. Download the latest img.gz file corresponding to your hardware from here.
  2. Place this file in your Storage partition, in the .update folder, either using SAMBA or by mounting it
  3. Boot or reboot your hardware.

You should see some messages about the upgrade process. After some seconds, your system will reboot.”

First thing I thought I’ll use my MicroSD to USB adapter and just use file explorer, hahah you silly goose Windows isn’t for user friendly, and of course what I mean by this is lakka will use linux partitions (like ext2,3 maybe 4) and clearly not MS based NTFS, and MS doesn’t naively support such FS. This requires additional software, which I don’t want to install. I also left all IODD device at work so there are all my bootable images and I sadly don’t have my own PXE (Yet)…. great OK so there goes almost all my direct mounting options, so i guess network based transfer it will be…. I reallllly wanted to DD the card as is Just in case the upgrade shits the bed, and I can simply DD my back img back onto the MicroSD, but since I can’t do that now… alright fine….

Downloaded linux mint, created VM, booted VM, installed Linux mint, atatched USB controller, passed through USB device (MicroSD to USB), auto mounted, and used a terminal to make my backup like a movie tech nerd…

I can also use this to inject the Lakka img, I do have it on my machine but don’t have file passthrough on the vmware console, simplest thing would be another USB passthough but I’m out of USB devices right now. So I used an network share, I wasn’t sure at first which the “storage partition” was but made a fair assumption:

K that was quick, unmount from VM, remove from host USB port, and place SD card in PiCade… and lets see what happens (at least I have a backup now) 😀

Now when I booted it, it did say unpacking, and did update, and did retain my playlists, nice new icons for each item, very nice.

But i lost all my core key map bindings, and rom remap bindings… that’s annoying. My Arcade/MAME games did run with the normal playlist, core changes? haven’t exactly figured that out yet. Since I recently picked up the Genesis Mini, I put that whole playlist on here :D.

VMware Host Oddities

I’ll keep this post short as I find it rather interesting turn of events…

I was ensuring config files were being backed up as good Sysadmin Posture would suggest. I noticed an issue on one of the hosts when I went to run the configuration backup. When I was hit with a “general system error”. I Wish I would have saved the actual error message but when I went to research it, it gave me more people complaining about the error when trying to vMotion (In this case I grabbed the most specific line from the call back in hopes to get something) most results simply pointed to saying restarted the ESXi hosts management agents. I decided to see if I could vMotion or if any other symptoms from the host, but everything showed green in vCenter, and vMotions worked without an issue.

I left it for the night (before) and decided to update the host since it was running 6.5-u2 and needed to be updated to 6.5-u3. I was hoping it would resolve the config backup issue (I had an older copy on hand but wanted most up to date) So I decided to do the update via a ISO image and a host reboot, moved VMs, no problem, placed host into Maintenance mode, no issues, send host for shutdown (I wasn’t using iLO or iDRAC, so needed to manually mount the USB hosting the installation media), while I was at the console waiting for the host to show “shutting down” it simply …. didn’t… after 10 minutes which is far more then generous I decided to press F2 to get into the console to have to tell me … uggghhh I wish I would have taken a picture of the error, something along the lines of you can’t use the console cause you been locked out…. it was weird, after waiting another 10 minutes and nothing happening (I’m assuming there must have maybe been some actual underlying issue with the datastore holding the scratch?) I decided to just do a hard shutdown (hold power for 5 seconds). I could rebuild this server faster if need be. Powering it back up was perfectly fine through all POST checks, booted the ESXi 6.5u3 installer, booted fined, checked for logical drives, found all of them perfectly fine including the SD card holding the OS (scratch changed to a persistent location, another datastore on the local host with multiple disks used for a datastore) anyway I selected it and it saw there was an existing installation and selected to update it, successful, reboot. reboots fine, but host doesn’t come back to vCenter…..

Log into host directly, and notice it states it has 18 VMs, all with a status of error, now I had moved them and they are even still running on another host (thankfully it didn’t do anything stupid to try and hijack them) at this point I had called VMware support and placed an SR. Once I had a tech on the line, I discussed my symptoms and issues. At this point I asked what was the best course of action as, I didn’t want to rebuild the host (I could, but time). In this case I simply un-registered all the VMs that were in error as they were not associated with the host), then attempted to re-connect the host. It first failed complaining about bad username and password (I’m not sure if this was the root account used to add it, or the VPX user it uses to manage the host) but it prompted the wizard like adding a new host, since all other settings were still fine on the host (network and vswitches w/ VMPGs and VMKs, and all Datatstores) after this wizard the host was re-added back into the cluster with the latest updates, and running the backup config command:

esxcli storage vmfs extent list

worked without error. So yay! all fixed, but one thing still bothered me… what was the root cause…

I decided to check my scratch settings real quick, and sure enough it hold good and is on redundant datastore on the local ESXi host, so not sure what the root cause was, but it was fixed. I’m posting this for my own reference. Just incase this issue re-arises.

 

DCs Show CPU Spikes by svchost

I’ll try to keep this one short. The other night I was installing updates on some computers, I like to see system resources when doing this, as Windows is a heavy, HEAVY bitch. As I was scorlling through my VMs I noticed my DC’s hovering around 18%. While that may not sound like much, I know it was high lol for what they do. So I went to check the vSphere logs to see how long has this been going on… 1 day… mhmm. 1 week…. mgmmm 1 month… could this be normal? I don’t think so I just have caught it…. and looking at the Year view showed the increase a couple months back… but what could it be….

I noticed it on my other DC’s as well… all the same time frame.

After a while I noticed it was the svchost process. Using Mark’s ProcExp.exe I narrowed it down to svchost (DHCP/ nethost / eventlog)… I decided after many other failed searches to view this particular process having CPU issues. Then I found this, exactly what I was experiencing…. Funky CPU in Taskmgr, that process. all of it. and his answer:

TL;DR: EventLog file was full. Overwriting entries is expensive and/or not implemented very well in Windows Server 2008.

just as he mentions in detail in his answer, the security log was at max and being overwritten. Now I know there isn’t much happening at these times of the day so how did the log get filled so fast and being overwritten to cause CPU spikes. looking at the Log (Palo Alto User Agent Log on, and Logoff events) lots of them. I haven’t blogged yet in my series with Palo Alto about User mappings when it comes to the monitor area of the Palo ALto Firewalls, but you can configure Palo Alto to use Server monitoring directly instead of a user-ID agent server, which you can install on a dedicated windows server which will use SMI to query client devices on behalf of the Palo Alto firewall to determine what IP address is being used by whom…

In most cases, the majority of your network users will have logins to your monitored domain services. For these users, the Palo Alto Networks User-ID agent monitors the servers for login events and performs the IP address to username mapping.”

Now I can’t find a good Palo Alto Networks source on it, but when you configure the Monitoring Servers which “enable the User-ID agent to map IP addresses to usernames by searching for logon events in the security event logs of servers, configure the settings described in the following table.

which is all good and great however, the default for this is:

Server Log Monitor Frequency (sec)
Specify the frequency in seconds at which the firewall will query Windows server security logs for user mapping information (range is 1-3600; default is 2)

and apparently this process is not session based itself, so every 2 seconds the firewalls were hitting the DC’s looking to see who’s got what IP based on the logon events, and this in itself was creating a logon, and logoff event every 2 seconds. That apparently not only filled the log, but is enough garbage to flood the security log and cause the overwrite function on eventviewer to cause CPU “spikes”.

The solution was to increase the frequency of this lookup. This obviously reduces the accuracy of the mapping, but when you have long lease times on your DHCP settings, and users don’t change networks (like almost ever) this is a low risk, while still retaining user field information the Palo Alto Monitoring section. This along with a backup and clearing of the security event and the systems all went back to low CPU usage.

Happy happy joy joy