A Productive Nightmare

The Story

Lack of Space

It all begins with a new infrastructure design, it’s brilliant. All the technical stuff a side, the system is built and ready for use, one problem the new datastore is slightly overused (many plans for service migrations and old bloated servers to be removed but have not yet been completed). I had one datastore that was used for a test environment, with the whole test environment down and removed this datastore would be perfect temp location till the appropriate datastore could be acquired.

The Next Day

I was chatting with our in house developer when a user walks in asking why they couldn’t complete a task on the system, figuring a work flow server issue simply rebooting it often fixed any issues with it, however this time I also received an email from the DBA stating reports of a DB issue due to bad blocks on the storage level.

At this point my heart sank, I quickly logged into the storage unit and was shocked to not see any notification of issues, deciding right then and there to move to back to reliable storage I made the svMotion, while it was in progress the storage unit I was logged into finally showed errors of disk failure, one disk had failed while the other had become degraded (In a RAID 1+0 this can be bad news bears) after the svMotion completed there was still a corrupted DB (we all have backups right?) lucky it was just a configuration DB for the workflow server and not any actual data, so I provided the DBA with a backup of the database files, didn’t take long and everything was back to green.

That Weekend

I decided to play catch up on the weekend due to the disruptive nature of the disk failure that week, to my dismay and only by chance the new host in the new cluster was showing disconnected from vCenter… What the…

Since I wasn’t sure what was going on here at first I chatted with the usual’s on IRC, I was informed instantly “RAMdisk is full”. After some lengthy recovery work (shutting down VMs and manually migrating them to an active host in vCenter) I discovered it was cause the ESXi host did lose connectivity to its OS storage (in this case was installed on an SD card)

So I updated the firmware on the host server. This so far (after a couple weeks now) has resolved this issue.

Then while I was working on the above host lost of connectivity, the other host lost connection to vCenter! However this one had much different signs and symptoms, after doing the exact same process of moving VMs off this host, it was determined by VMware support that it was “possibly” due to the loss of the one datastore. Remember the datastore I discussed above, although I had moved any VM usage of it from the hosts I did not remove it as an active datastore, so although the storage unit was accessible while the disks had failed, for some reason the whole storage unit had failed (UI was now unresponsive). So I had to remove this datastore and all associated paths. After all this everything was again green for this cluster.

So much for that weekend…

That Storage Unit

Yeah alright so that storage unit… it was a custom built FreeNAS box that was spliced together from a HP DL385p Gen8 server. I got this thing for dirt cheap and was working as a datastore perfectly fine before the disc failure so I don’t blame the hardware or even FreeNAS or all the crap that happened. It was just a perfect storm.

So I decided to try something different with this unit first… since I had been using an LSI 9211-8i flashed in IT mode (JBOD) for the SAS expanders in the front (25 disk sff). I decided I would try to build my first hyper-converged setup. That meant creating a FreeNAS VM, hardware passthough the storage controller (LSI 9211-8i) and then created datastores using the discs in the front.

Sooo

The Paradox

The first issue I had was the fact you need a datastore to host the FreeNAS VMs config and hard drive files… but if we are going to do hardware pass-through of the entire SAS exapnders via the LSI card, that means it’s not accessible or usable for the host OS. Uggghhhh, now we could use NFS or iSCSI but the goal for me was to have a full self contained system not relying on another host system, now I can easily install ESXi on a USB or SD card, but it won’t allow me to use these as datastores. At least not on there own…

Come here USB datastore… I mostly followed this blog post on it by Virten however I personally love this old one by non other than my favorite VMware blogger William Lam of VirtuallyGhetto.com

*My Findings* Much like the comments on here and many other blog and form posts about doing this is I could not get this to work on 5.1 or 5.5 those builds are too finiky and I’d always get the same error about no logical partition defined or something, yet worked perfectly fine in 6.5 or 6.7 (I personally don’t use 6.0)

OK, so I decided to use ESXi 6.7, installed on a SD card, and setup a 8 gig USB based Datastore. Next Issue is you have to reserve the memory else you’d be limited to even less than 4 gigs as ESXi will complain there is not enough from on the datastore for the swap file. Not a big deal here as we have plenty of RAM to use (100 Gigs HP genuine ECC memory).

I did manage to get FreeNAS installed on said datastore and as you’d expect it was slowwwwww. My mind started to run wild and though about RAMDisc and if it was possible to use that as a datastore… in theory.. it is! William is still around! 😀

Couple notes on this

1) you need a actual Datastore as it seems like ESXI just creates system links to the PMem Datastore. (I noticed this by attempting to ssh into the host and simple copy the VM’s files over, it failed stating out of space, even though there was enough defined for the PMem Datastore).

2) You create the VM and defined the HDD to be on PMem Datastore and will warn you of non persistence.

Sure enough I created a FreeNAS VM on the PMem and it was fast install, but as soon as the host needed a reboot, attempt to power on that VM and it says the HDD is gonezo. So this was cool, but without persistence it sort of sucks.

Anyway I didn’t need the FreeNAS OS to have fast I/O anyway, so stuck with the USB based datastore. Then I went to pass-through the controller, now enabling pass-though on the controller worked fine, but the VM wouldn’t start.

Checking the logs and googling revealed only ONE finding!

No matter what I tried the LSI card or the built in HBA same error as the post above:
“WARNING: AMDIOMMU: 309: Mapping for iopn 0x100 to mpn 0x134bb00 on domain 1 with attr 0x3 failed; iopn is already mapped to mpn 0x100 with attr 0x1
WARNING: VMKPCIPassthru: 4054: Failed to setup IOMMU mapping for 1 pages starting at BPN 0x100000100”

Yay, another idea gone to shit and time wasted, I learned some things but I wanted to learn something and bring some use back to this system… ugh fine! I’ll just put it back to normal connecting the SAS expanders to the P420i HBA and use the 2GB battery backed cache to define a speedy datastore and just keep it simple…

The Terrible HBA

I don’t wish this HBA on anyone seriously, so after I put it all back to normal, the first thing I find is:

  1. When I booted the server and let the system post, when it got up to the storage controller part (Past the bottom indication to press F9 for setup, F10 for Smart Provision, and F11 for Boot Menu) it will list the storage controller and it’s running firmware in this case v8.00.
    Half the time if I pressed F5, if there was no previous error codes and no disks or logical units defined I someones got into the ACU (Array Configuration Utility) the other half the fans would kick up to 100% and stay there while the ACU booted (showing nothing but an HP logo and a slow progress bar) and when ACU finally did load I’d be presented with “No Storage Controller found”
    (Trust me I got a 40 min video of me yelling at the server for being stupid haha)
  2. This issue would become 100% apparent as soon as I plugged in a drive with a logical unit defined from another (updated) version of Smart Array.
    To get around this issue I ended up grabbing the “latest” HP SSA (Smart Storage Administrator) tool from, HPs site. Now I quote latest due to the fact is it’s from 2013… No this allowed me to finally build some arrays for me to use with the planned ESXi build.

I noticed that at first I wasn’t seeing the new logical drive I defined in the HP SSA in ESXi itself, I totally forgot to grab HPE custom build as it includes all required drivers for these pieces of hardware.

First thing i notice after grabbing HPE’s custom ESXi build… in this case 6.7 (requires VMware login) is that the keyboard is buggering out on me when attempting to configure the management NIC.

At first I thought maybe the USB stick was crapping out due to the many OS installs I’ve been doing on it. So I decide to move to using the logical array I built, the custom installer does see the new array and away I go, still buggy, so I thought maybe it’s the storage controller firmware? Looking up the firmware for P420i or equivalent appears there are numerous post of issues and firmware updates.. turns out there’s even a 8.32(c) Nov 2017 update, since I was too lazy to build a custom offline installer for this firmware flash I used an install of Windows Server 2016 and ran the live updater, to my amazement it worked flawless… yet also to my amazement Windows worked perfectly fine on the same logical array regardless of the firmware it was running (Is this a VMware issue…??)

So after re-installing the custom ESXi 6.7 from HPE, the host was still being buggy… and now started to PSOD (Purple Screen of Death)… are you kidding me, after everything that’s already happened… ughhhhh…

Googling this I found either

A) Old posts of Vendor finger Pointing (Around ESXi 3-4)

B) Newer Posts (ESXi 6.7~) this lead me to the only guy who claimed to have fixed his PSOD and how he did it here

Which I found I was not having the same errors showing which lead me to my first link due to the logs. Having updated all the firmware, and running HPEs builds I could only think to try the ESXi 6.5-U2 build as the firmware was supposedly supported for that build.

Now running ESXi 6.5-U2 without any issues, and no PSOD! Unfortunately without warranty on this hardware I have no way to get HP to investigate this newer 6.7 build to run on this particular hardware.

Icing on the Cake

Alright so now I should finally be good to go to use this hypervisor for testing purposes right? Well I had a bunch of spare discs and slots to create a separate datastore for more VMs yay…

Until I went to boot that latest HP SSA offline I listed above that fixed the fan speed and no controller found for the ACU, well now this latest HP SSA was getting stuck at a white screen! AHHHHHHHHHHHHHHHHH how do I create of manage the logical unit and build arrays if the offline software is stuck, well i could have installed and learned how to use the hpssacli and their associated commands but since I was already kind of stressed and bummed out at this point installed Windows Server 2016 and ran the HP SSA for that which looks exactly like the offline version.

Finally created all my arrays, installed the only stable version of ESXi with associated drivers, have all my datastores on the host showing green, created a dedicated restore proxy and am finally getting some use back from this thing….

Conclusion

What… a …. freaking… NIGHTMARE!

 

A certificate chain could not be built to a trusted root authority

Today I tried installing .NET 4.6.2 onto an offline WIndows 7 to setup playnite.

Every since I built my PiCade I’ve been huge into loaders, almost all of them seem to rely on RetroArch for an old game emulation which is fine by me. Check out those sites for their respective offerings. My PiCade uses Lakka which is just a different loader that I find is rather well compiled for ARM based devices such as the Pi. They do offer a x86-64 build however I found that since it’s a linux derivative you have to be pretty Linux swanky to do good with it, by that I mean supporting hardware can be a bit more difficult as you have to be able to get the drivers for certain hardware yourself, in my case using an old HP laptop with an ATI card inside it was CPU rendering everything so N64 and Dreamcase would CHUGGGG, even though the system is way more than capable.

Anyway… Playnite only has one dependency for Windows, thats .NET 4.6.2, so when I downloaded the offline installer I didn’t expect issues, until…

Error! A certificate chain could not be built to a trusted root authority.

Ughhh ok… Google… wtf does that mean? MS says

Grab their certificate

install using elevated cmd with…

certutil -addstore root X:\Where\you\saved\the\Cert.crt

Sure enough after this I was successfully able to install .NET 4.6.2

Fixing WindowsRE

To make a long story short, in my previous post I covered some issues I had dealing with MBR2GPT. In which case I fried the Recovery Partition, and thus ruining my Advanced Startup abilities.

So here’s what happens… As soon as you either A) move the Recovery partition (via GParted) or B) Delete it you’ll have a disabled WinRE.

Checking BCDedit will show Recovery: Yes, and a recovery sequence but agentc will state otherwise:

Heck you may as well even wipe the useless BCD settings at this point!

Sooo how do you fix it?

If you did A) and simply moved the Recovery Partition, the fix is pretty easy.

If you did B) then First you’ll need to shrink a bit of space on your primary disk that hosts the Windows OS files (or another whole different disk, doesn’t really matter to me mon) either way…

I did recall someone recreating the files that are within the actual recovery folder, but I sadly can’t find it now, also this is a good post thread on it

In my case I had another laptop with the same Windows version deployed that still had the recovery partition, so I simply used a linux live to DD it onto a USB drive, and then do the same Linux Live on the laptop with the deleted partition and created another partition with the exact same sector count, and dd the image back into place.

This however still doesn’t fix the WinRE, and simple doing “reagentc /enable” fails stating the WinRE location is null, which from the picture above remains the fact.

I was stuck on this for a while until I stumbled upon this Technet post (not my own… haha woah!)

After reading this, I followed along by mounting my Recovery Partition:

Cleaned my Reagent.xml From this…

to this…

Set the WinRE location with the /setreimage option on reagentc, and enabled that puppy for the win!

This was good enough for me! After this all the advanced recovery options were available again, so I could do things like MBR2GPT without using the /allowfullos switch. 😀

Sorry about the crappy pictures and no headers… I clearly was super lazy on this one.

My Delightful Challenges with MBR2GPT

The Story

I’m not gonna cover in this blog what MBR2GPT is… honestly there’s more than enough coverage on this. Instead mines gonna cover a little rabbit hole I went down. Then how I managed to move on, this might even be a multi part series since I did end up removing my recovery partition in the whole mix up.

How did I get here?

In this case the main reason I got here was due to how I was migrating a particular machine, normally I wouldn’t do it this way but since the old machine was not being reused, I ended up DDing (making a direct copy) of an SSD from old laptop to M2 SSD on new laptop. New laptop has more storage capacity and since the recovery partition was now smack dab in the center of the disk instead of at the end, I wasn’t able to simply extend the partition within Windows Disk Management Utility (diskmgmt).

Instead I opted to use GParted Live, to move the recovery partition to the end of the drive and extend the usual user data partition containing Windows and all that other fun stuff. Nothing really crazy or exciting here.

After the move and partition expansion ran Windows Check Disk to ensure everything was good (chkdsk /F) and sure enough everything was good. Since this newer laptop uses UEFI and all the fun secureboot jazz along with the Windows 10 OS that was on the SSD already, but the SSD partition scheme was MBR… whommp whomp, but of course I knew about MBR2GPT to make this final transition the easiest thing on earth!

Until….

Disk layout validation failed for disk 0

Ughhhh… ok well I’m sure there’s some simple validation requirements for this script…

  • The disk is currently using MBR
  • There is enough space not occupied by partitions to store the primary and secondary GPTs:
    • 16KB + 2 sectors at the front of the disk
    • 16KB + 1 sector at the end of the disk
  • There are at most 3 primary partitions in the MBR partition table
  • One of the partitions is set as active and is the system partition
  • The disk does not have any extended/logical partition
  • The BCD store on the system partition contains a default OS entry pointing to an OS partition
  • The volume IDs can be retrieved for each volume which has a drive letter assigned
  • All partitions on the disk are of MBR types recognized by Windows or has a mapping specified using the /map command-line option

Ughhh, again, ok that’s a check across the board… wonder what happened…

Well I quickly looked back at my open pages, but couldn’t find it (I must have closed it) but someone on a thread mentioned all they did was delete their recovery partition. I ended up doing this, but it still failed on me…

checking the logs (%windir%/setupact.log)

I had errors, such as partition too close to end, and I extended and shrunk the partition. After going through most random errors in the log I was stuck on one major error I could not find a good solution to…

Cannot find OS partition(s) for disk 0

so I started looking over all the helpful posts on the internet, like this, and this

The first one I came across for general help, and it was a nice post, but it actually didn’t help me solve the problem. The second one I actually hit cause it was literally dead on with my issue…

“After checking logs (%windir%/setuperr.log), it was clear it had a problem with my recovery boot option — it was going through all GUIDs in my BCD (Boot Configuration Data), and failed on the one assigned to recovery entry. This entry was disabled (yet still had GUID assigned to it, which apparently led nowhere and that seems to have been the root cause). I searched some discussion forums, and most of them said that I would have somehow create recovery partition.

Fortunately, it turned out to be a lot easier 🙂 Windows has another command line tool, called REAgentC, which can be used to manage its recovery environment. I ran reagentc /info, which showed that recovery is disabled and its location is not set, but it had assigned the same GUID that was failing when running MBR2GPT. So, I run reagentc /enable, which set recovery location and voila, this time MBR2GPT finished its job successfully.”

That’s cool and all but for me…

  1. I removed my recovery partition, so any GUID pointing to it is irrelevant and its gone
  2. reagentc was not showing as an available command, either on my main Windows installation, or the windows PE that comes with the 1809 installer ISO I was using.

After farting around for a while, finding out I wasted my time with ADK and other annoyance around Windows Deployment methods and solutions when all I wanted was this simple problem solved, I moved on to another solution, a lil more hands on…

If he fixed his with reagentc… and it was due to a problem in the BCD… let’s see what we can do with the BCD directly without third-party tools. Bring on the….. BCDedit!

 

Microsoft Windows [Version 10.0.17763.437]
(c) 2018 Microsoft Corporation. All rights reserved.

C:\WINDOWS\system32>diskmgmt

C:\WINDOWS\system32>mbr2gpt /validate /allowfullos
MBR2GPT: Attempting to validate disk 0
MBR2GPT: Retrieving layout of disk
MBR2GPT: Validating layout, disk sector size is: 512 bytes
Cannot find OS partition(s) for disk 0

C:\WINDOWS\system32>c:\Windows\setuperr.log

C:\WINDOWS\system32>diskmgmt

C:\WINDOWS\system32>mbr2gpt /validate /allowfullos
MBR2GPT: Attempting to validate disk 0
MBR2GPT: Retrieving layout of disk
MBR2GPT: Validating layout, disk sector size is: 512 bytes
Cannot find OS partition(s) for disk 0

C:\WINDOWS\system32>bcdedit

Windows Boot Manager
--------------------
identifier              {bootmgr}
device                  partition=\Device\HarxxiskVolume1
description             Windows Boot Manager
locale                  en-US
inherit                 {globalsettings}
default                 {current}
resumeobject            {c982d23f-8xx8-11e8-b5xx-ed8385ce2xx2}
displayorder            {current}
toolsdisplayorder       {memdiag}
timeout                 30

Windows Boot Loader
-------------------
identifier              {current}
device                  partition=C:
path                    \WINDOWS\system32\winload.exe
description             Windows 10
locale                  en-US
inherit                 {bootloadersettings}
recoverysequence        {b09bfaxx-6xx4-11e9-98ce-ead29f9a0a02}
displaymessageoverride  Recovery
recoveryenabled         No
allowedinmemorysettings 0x15000075
osdevice                partition=C:
systemroot              \WINDOWS
resumeobject            {c982d23f-8xx8-11e8-b5xx-ed8385ce2xx2}
nx                      OptIn
bootmenupolicy          Standard

Even though I disabled the recovery partition via the “recoveryenabled” no setting the issue kept happening, I could only guess it was this setting, but how can I delete it… ohhhhhh

C:\WINDOWS\system32>bcdedit /deletevalue {current} recoverysequence
The operation completed successfully.

C:\WINDOWS\system32>bcdedit

Windows Boot Manager
--------------------
identifier              {bootmgr}
device                  partition=\Device\HarxxiskVolume1
description             Windows Boot Manager
locale                  en-US
inherit                 {globalsettings}
default                 {current}
resumeobject            {c982d23f-8xx8-11e8-b5xx-ed8385ce2xx2}
displayorder            {current}
toolsdisplayorder       {memdiag}
timeout                 30

Windows Boot Loader
-------------------
identifier              {current}
device                  partition=C:
path                    \WINDOWS\system32\winload.exe
description             Windows 10
locale                  en-US
inherit                 {bootloadersettings}
displaymessageoverride  Recovery
recoveryenabled         No
allowedinmemorysettings 0x15000075
osdevice                partition=C:
systemroot              \WINDOWS
resumeobject            {c982d23f-8xx8-11e8-b5xx-ed8385ce2xx2}
nx                      OptIn
bootmenupolicy          Standard

C:\WINDOWS\system32>mbr2gpt /validate /allowfullos
MBR2GPT: Attempting to validate disk 0
MBR2GPT: Retrieving layout of disk
MBR2GPT: Validating layout, disk sector size is: 512 bytes
MBR2GPT: Validation completed successfully

No way! it worked! Secure boot here I come!

I hope this blog posts helps someone else out there!

Summary

Check the logs: C:\Windows\setuperr.log

There might have been a way I could have moved forward without even have deleted my recovery partition, I just happening to know it was not needed for this system, so it was something that was worth a try. However if I probably had looked in the log before even having done that I may have avoided this rabbit hole.

But…. I learnt something, and that was cool.

Note don’t delete your recovery partition. Bringing it back is an extremely painful process without re-install. I will cover how to accomplish this in my next post. I was actually going to this yesterday, however other work has come up. I will however complete the blog post this week… I promise! (It’s like I’m talking to myself in these posts… like no one actually reads these… do they?)

A general system error occurred: Launch failure

Failure to Launch

Sound the Alarm! Sound the Alarm!

*ARRRRRRREEEEEEEEERRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR
ARRRRRRREEEEEEEEERRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRRR

and one heck of a day sweating bullets, am-I-right?!

Also sorry about the lack of updates, with spring and work, it’s been hard to find time to blog. Deepest apologies.

The Story

Well it’s Tuesday so it’s clearly a day for a story, and boy do I have a story for you. I was going about my usual way of working.. endlessly to meet my goal… that never able to acquire goal of perfection…. anyway… I had completed a couple major tasks of going from vCenter 5.5 to 6.5u2, and while I have yet to blog about that fun bag (cause I had used my own PKI and certificates so the migration scripts VMware provides pooped the bed till I replaced them with self signed) but I digress this has nothing to do with that… well sort of.

Where was I… oh right… I was vMotioning a couple VMs to some new 6.5U2 hosts I had setup on new hardware (yeah buddy this was a whole new world (Why did I just say that in the voice of Princess Jasmine?)) but alas the final VM was not meant to move for it had stalled @ 20 percent (I had this happen once before during my deployment and discovered some interesting things about multi-NIC vmotions) I probably should blog about that toooooo but alas my time is running short. (I still have many other things to tackle and blog about). and then……..

“A general system error occurred: Launch failure”

ugh…. I could have jumped right to the logs but I jumped on google instead…

CloudSpark apparently recently came across this issue and posted on all the many places he apparently could think… reddit, and VMware forums

the VMware post got a tad long, but it was more the post on reddit that got me wondering….

“Haven’t seen this personally, but there’s a few things via google on the “Failed to connect to peer process” error that suggest it could be due to running out of storage. Specifically in one scenario, the /tmp/vmware-root folder on the ESXi host fills up with logs.” – Astat1ne

When I SSH into the host and ran “df -h” I was surprised to be reported back with an error:

esxcli returned an error: 1

Something to that extent anyway. I ended up Moving all the VMs back off the host, and swapped out the SD card the ESXi software was installed on (I had a copy of the SD card with a clone of the ESXi software and host configure that was created right after the host was installed and configured).

Sure enough after reboot powering it back on, “df -h” returned clean. I was then able to vMotion all the VMs back onto the set of new hosts.

Def a generic error message I’ve never seen, and thinking about it now is rather comical. (I know when your in the middle of the problem it’s not so funny)…

BUT it sure is now! :D…. this is so going to come back to bite me….

(await updates here post April 30th)

Exporting OPNsense HAProxy Let’s Encrypt Certificates

You know… in case you need it for the backend service… or a front end IDS inspection… whatever suits your needs for the export.

Step 1) Locate the Key and certificate, use the ACME logs!

cat /var/log/acme.sh.log | grep “Your cert”

*No that is not a variable for your cert, actually use the line as is

Step 2) Identify your Certificate and Key

Step 3) run the openssl command to create your file:

openssl pkcs12 export out certificate.pfx inkey privateKey.key in certificate.crt

Step 4) use WinSCP to copy your files to your workstation

*Note use SFTP when connecting to OPNsense, for some reason SCP just no worky

Testing Active Sync

I’ll keep this post short.. I swear this time haha

The Story

I recently blogged about using OPNsense as a reverse proxy for Exchange. This was really nice, and I was able to access Outlook Anywhere, literally from anywhere with owa. However…

The Problem

For some reason I could not connect with Active Sync… Even when I had my phone in the same network pretty much. I thought it might had to do with the self signed certificate (I should have known it was not when selecting “Accept all certificates” under the connections options still didn’t work), so I wastefully exported the certificate and key from my OPNsense server and imported it into the Exchange ECP, and assigned it to the IIS and SMTP services. (I’m probably going to change these back to the self signed as I don’t really have intentions on completing these steps every 60 days.

Since I didn’t want to use my account to test on Microsoft test site (this is a life saver), I used a account and email I was setting up for my colleague… and it passed… I was shocked. (At first I thought it was cause of the certificate changes… I soon found out.. it was not).

So I tested the same connection settings on my phone and it worked!

Woo Hoo!

The Real Problem and Solution

My happiness was sort lived once I attempted to add my account…. which still failed…. what the heck? I had all the settings the same exc…ep…t… my account…. wait a minute…. Google… WTF! Admins can’t use active Sync?!?!? Why isn’t that more specified in the documentation! I was aggravated about this for days… cause of something that was sooo simple!

I’ll cover exporting and importing certificates for other uses in a nother blog post. I just wanted to get this one out cause even though I had configured everything in my previous blog post about using OPNsense as a reverse proxy correctly, I wanted to follow up on why Active Sync was working for me, cause everything else on online guides made it sound so simple, and it rather is… until you find out a little secret GEM MS didn’t tell you about…

Zewwy has not one but two Epiphanies

The Story

Nothing goes better together than a couple moments of realization, and fine blog story. It was a fine brisk morning, on the shallow tides of the Canadian West… as the sun light gazed upon his glorious cheek… wait wait wait… wrong story telling.

The First Epiphany

First to get some reference see my blog post here on setting up OPNsense as a reverse proxy, in this case I had no authentication and my backend pool was a single server so nothing oo-lala going on here. I did however re-design my network to encompass my old dynamic IP for my static one. One itsy bitsy problem I’m restricted on physical adapters, which isn’t a big deal, with trunking and VLAN tagging and all that stuff… however, I am limited on public IP addresses, and the amount of ports that can listen on the standard ports… which is well one for one… If it wasn’t for security, host headers would solve this issue with ease at the application layer (the web server or load balancer) with the requirement of HTTPS there’s just one more hurdle to overcome… but with the introduction of TLS 1.2 (over ten years now, man time flys) we can use Server Name Indication (SNI) to provide individual certs for each host header being served. Mhmmm yeah.

This of course is not the epiphany… no no, it was simply how to get HAproxy plugin on OPNsense configured to use SNI. All the research I did, which wasn’t too much just some quick Googling… revealed that most configurations were manual via a conf file. Not that I have anything against that *cough Human error due to specialized syntax requirements*… it’s just that UIs are sort of good for these sort of things….

The light bulb on what to do didn’t click (my epiphany) till I read this blog post… from stuff-things.net … how original haha

It was this line when the light-bulb went off…

“All you need to do to enable SNI is to be give HAProxy multiple SSL certificates” also note the following he states… “In pass-through mode SSL, HAProxy doesn’t have a certificate because it’s not going to decrypt the traffic and that means it’s never going to see the Host header. Instead it needs to be told to wait for the SSL hello so it can sniff the SNI request and switch on that” this is a lil hint of the SSL inspection can of worms I’ll be touching on later. Also I was not able to specifically figure out how to configure pass-though SSL using SNI… Might be another post however at this time I don’t have a need for that type of configuration.

Sure enough, since I had multiple Certificates already created via the Let’s encrypt plugin… All I had to do was specify multiple certificates… then based on my “Rules/Actions/Conditions” (I used host based rules to trigger different backend pools) zewwy.ca -> WordPress and owa.zewwy.ca -> exchange server

and just like that I was getting proper certificates for each service, using unique certs… on OPNsense 19.1 and HAProxy Plugin, with alternative back-end services… now that’s some oo-lala.

My happiness was sort lived when a new issue presented it’self when I went to check my site via HTTPS:

The Second Epiphany

I let this go the first night as I accepted my SNI results as a victory. Even the next day this issue was already starting to bother me… and I wanted to know what the root of the issue was.

At first I started looking at the Chrome debug console… notice it complaining about some of the plugins I was using and that they were seem as unsafe

but the point is it was not the droids I was actually after… but it was the line (blocked:mixed-content) that set off the light bulb…

So since I was doing SNI on the SSL listener, but I I was specifying my “Rule/Action” that was pointing to my Backend Server that was using the normal HTTP real server. I however wanted to keep regular HTTP access open to my site not just for a HTTP->HTTPS redirect. I had however another listener available for exactly just that. At this point it was all just assumptions, even though from some post I read you can have a HTTPS load balancer hosting a web page over HTTPS while the back-end server is just HTTP. So Not sure on that one, but I figured I’d give it a shot.

So first I went back to my old blog post on getting HTTPS setup on my WordPress website but without the load balancer… turns out it was still working just fine!

Then I simply created a new physical server in HAProxy plugin,

created a new back-end Pool for my secure WordPress connection

created a new “Rule/Action” using my existing host header based condition

and applied it to my listener instead of the standard HTTP rule (Rules on the SSL listener shown in the first snippet):

Now when we access our site via HTTPS this time…

Clean baby clean! Next up some IDS rules and inspection to prevent brute force attempts, SQL injections… Cross site scripting.. yada yada, all the other dirty stuff hackers do. Also those 6 cookies, where did those come from? Maybe I’ll also be a cookie monster next post… who knows!

I hope you enjoyed my stories of “ah-ha moments”. Please share your stories in the comments. 😀

PAN ACC and WildFire

The ACC

For more in depth detail check Palo Alto Networks Page on the topic. Since the Palo Alto are very good Layer 7 based firewalls which allow for amazing granular controls as well as the use of objects and profiles to proliferate amazing scale-ability.

However, if you been following along with this series all I did was setup a basic test network with a single VM, going to a couple simple websites. Yet when I checked my ACC section I had a rating of 3.5…. why would my rating be so high, well according to the charts it was the riskiest thing of all the Internets…. DNS. While there have been DNS tunneling techniques discussed, one would hope PAN has cataloged most DNS sources attempting to utilize this. Guess I can test another time…

You may notice the user is undefined and that’s because we have no User ID servers specified, or User ID agents created. Until then that’s one area in the granular control we won’t be able to utilize till that’s done, which will also be covered under yet another post.

I did some quick search to see why DNS was marked so high, but the main thing I found was this reddit post.

akrob – Partner · 5 months ago – Drop the risk of applications like DNS ;)”

Hardy har har, well can’t find much for that, but I guess the stuff I was talking about above would be the main reasons I can think of at this time.

The better answer came slightly further down which I will share cause I find it will be more of value…

so we got the power, it just takes a lot of time to tweak and adjust for personal needs. For now I’ll simply monitor my active risk with normal use and see how it adjusts.

For now I just want to enable WildFire on the XP VMs internet rule to enable the default protection.

The WildFire

Has such a nice ring to it… even though wild fires are destructive in nature… anyway… this feature requires yet another dedicated license, so ensure you have all your auth codes in place and enabled under Device -> Licenses before moving on.

Now this is similar to the PAN URL categories I covered in my last post. Yes, these are coming out at a rather quicker than normal pace, as I wish to get to some more detailed stuff, but need these baselines again for reference sake. 😀

Go under Objects -> Security Profiles -> WildFire Analsysis

You will again see a default rule you can use:

Names self explanatory, the location I’m not sure what that exactly is about, the apps and file types are covered under more details here.

to use it again you simply have to select which profile to use under whatever rules you choose under the security rules section. Policies -> Security

Now you can see that lil shield under the profile column thats the PAN URL filter we applied. now after we apply the wild fire…

we get a new icon 😀

Don’t forget to commit…. and now we have the default protection of wild fire. Now this won’t help when users browse websites and download content when those sites are secured with HTTPS. The Palo Alto is unable to determine what content is being generated or passed over those connections, all the PAN FW knows are the URLs being used.

Testing

Following this site, which has links to download test file which are generate uniquely each time to provide a new signature as to trigger the submission. It’s the collaborative work through these submissions that make this system good.

Checking the Wildfire Submissions section under the Monitor Tab.

There they are they have been submitted to Palo Alto WildFire for analysis, which I’m sure they probably have some algo to ignore these test files in some way, or maybe they use to analyze to see how many people test, who knows what things can all be done with all that meta data…. mhmmm

Anyway, you may have noticed that the test VM is now Windows 7, and that the user is till not defined, as there’s no user agent, or LDAP servers since this machine is not domain joined that wouldn’t help anyway and an agent would be required AFAIK to get the user details. I may have a couple features to cover before I get to that fun stuff.

Summary

As you may have noticed the file was still downloaded on the client machine, so even though it was submitted there was nothing stopping the user from executing the download file, well at least trying to. It would all come down to the possibility of the executable and what version of Windows is being used when it was clicked, etc, etc. Which at that point you’d have to rely on another layer of security, Anti Virus software for example. Oh yeah, we all love A/V right? 😛

You may have also noticed that there was 3 downloads but only 2 submissions, in this case since there is no SSL decryption rules (another whole can of worms I will also eventually cover in this series… there’s a lot to cover haha) when the test file was downloaded via HTTPS, again the firewall could not see that traffic and inspect the downloaded contents for any validity for signatures (cause privacy). Another reason you’d have to again rely on another layer of security here, again A/V or Updates if a certain Vulnerability is attempted to be exploited.

So for now no wild fire submissions will take place until I can snoop on that secure traffic (which I think you can already see why there’s a controversy around this).

Till my next post! Stay Secure!

PAN URL Categories

PAN URL Categories

Heyo! So today I’m gonna cover URL category’s. Obviously Uniform Resource Locations are nothing new and even more so categories hahah. So when you know existing ones and have classified them, you can do some amazing things, what’s the hardest part…. Yes… proper classification of every possible URL, near impossible, but with collaboration feasible. In this post I’m going to cover how to set this up on a Palo Alto Networks firewall, cover some benefits, a couple annoyances, and ways to resolve them when possible…. Let’s get started!

License Stuff

Now when I first started with Palo Alto Networks Firewalls, they were using Brightcloud… here’s a bit of details from here

Palo Alto Networks firewalls support two URL filtering vendors:
PAN-DB—A Palo Alto Networks developed URL filtering database that is tightly integrated into PAN-OS and the Palo Alto Networks threat intelligence cloud. PAN-DB provides high-performance local caching for maximum inline performance on URL lookups, and offers coverage against malicious URLs and IP addresses. As WildFire, which is a part of the Palo Alto Networks threat intelligence cloud, identifies unknown malware, zero-day exploits, and advanced persistent threats (APTs), the PAN-DB database is updated with information on malicious URLs so that you can block malware downloads, and disable Command and Control (C2) communications to protect your network from cyber threats.
BrightCloud—A third-party URL database that is owned by Webroot, Inc. that is integrated into PAN-OS firewalls. For information on the BrightCloud URL database, visit http://brightcloud.com.
I’m not exactly sure if Brightcloud is going to continued to be supported or not and they have instead stuck more with their own in house URL DB, which of course requires a license so under Device -> Licenses ensure you have an active PAN URL-DB license.
For a list of all the class types you can use see here. (PAN login required)
Once you get this out of the way lets get into the good stuff.
Still under the Licenses area, Click the Download Now link under the area.
Considering I have nothing… Yes…
Not sure why they have a region selection… but alright…
Yay!
Now we are ready to start using them!

Objective Profiles… I mean Object Profiles

Yeah… click on the Objects tab… look under Security Profiles… URL Filtering.

There lies a default profile, which allows 57 categories while blocking only 9. For a simple test I’ll use this, the blocked categories are:

  1. abused-drugs (LOL, cause other poisons like Tobacco and alcohol are allowed, cause laws)
  2. adult (I’m assuming this is a business friendly term for porn)
  3. command-and-control (duh)
  4. gambling (duh)
  5. hacking (interesting class definition)
  6. malware (duh)
  7. phishing (duh)
  8. questionable (duh)
  9. weapons (awwwww)

Well that seems like a fairly reasonable list. Creating your own allow and block listing is just as easy as creating a new profile and defining each class accordingly, and yes you can easily clone an existing profile and change one or two categories as required.

The Allow and Block lists are specified under the overrides areas if you happen to need to allow or block a URL before it can be officially re-classed by PAN DB. As quoted by the wizard, “For the block list and allow list enter one entry per row, separating the rows with a newline. Each entry should be in the form of “www.example.com” and without quotes or an IP address (http:// or https:// should not be included). Use separators to specify match criteria – for example, “www.example.com/” will match “www.example.com/test” but not match “www.example.com.hk”” Which makes sense it’s will determine what is allowed as for proctols under the security rules area, this simply states which addresses (DNS or IP based) to allow or block. In the case of DNS till proper classification.

Checking a URL for a Category

To check a address class, check PANs site for it here. If you find a site is mis-classed you can send an email to Palo Alto Networks team and they will test the verification of the re-class and re-class the PAN DB accordingly. As far as I can tell I don’t think this one actually requires a login.

Using IT!

Alright, alright, lets actually get to some uses. Now if you were following my series see my last two posts here, and here for reference material. Under the Security Rule Test Internet, the final tab, actions, we did not define any profile settings, this is where the rubber hits the road for the first time.

Pick Profiles, We’ll cover groups a bit later (its just a group of profiles, who’d of thought).

As you can see this expands the window to show all the profiles you saw under the Objects -> Security Profiles area, in this case we are just going to play with the URL filtering.

Now once I apply this on the internet rule.. productive for my Test XP machine should go up… muahahah and…

HAHAHAHA you lazy mid 2000’s virtual worker… you can’t go gambling get back to work!

Summary

As you can see how useful URL categories can be, unfortunately I did want to cover more granular examples; such as only allowing a server to access it’s known update server URL’s. Hopefully I can update this post to cover that as well.

For now I hope you enjoyed this quick blog post. In my next post I hope to cover how this isn’t an IDS of any kind at this point, but a single layer of the multi-layer security onion. Stay tuned for more. 🙂