HPE SSD Firmware Bug (Critical)

I’m just gonna leave this right here…..

https://support.hpe.com/hpsc/doc/public/display?docId=emr_na-a00092491en_us

I wonder who they outsourced the firmware code to back in 2015….

IMPORTANT: This HPD8 firmware is considered a critical fix and is required to address the issue detailed below. HPE strongly recommends immediate application of this critical fix. Neglecting to update to SSD Firmware Version HPD8 will result in drive failure and data loss at 32,768 hours of operation and require restoration of data from backup in non-fault tolerance, such as RAID 0 and in fault tolerance RAID mode if more drives fail than what is supported by the fault tolerance RAID mode logical drive. By disregarding this notification and not performing the recommended resolution, the customer accepts the risk of incurring future related errors.

HPE was notified by a Solid State Drive (SSD) manufacturer of a firmware defect affecting certain SAS SSD models (reference the table below) used in a number of HPE server and storage products (i.e., HPE ProLiant, Synergy, Apollo, JBOD D3xxx, D6xxx, D8xxx, MSA, StoreVirtual 4335 and StoreVirtual 3200 are affected).

The issue affects SSDs with an HPE firmware version prior to HPD8 that results in SSD failure at 32,768 hours of operation (i.e., 3 years, 270 days 8 hours). After the SSD failure occurs, neither the SSD nor the data can be recovered. In addition, SSDs which were put into service at the same time will likely fail nearly simultaneously.
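Out of curiosity, the math in the advisory checks out, and the number itself is suspicious: 32,768 is exactly 2^15. Here’s a quick Python sanity check; the 16-bit-counter angle is purely my own speculation, not anything HPE has confirmed.

```python
# Sanity-check HPE's "3 years, 270 days 8 hours" figure, using 365-day years.
# Note that 32,768 is exactly 2**15 -- which looks like a 16-bit counter
# limit, but that's my speculation, not something the advisory states.
hours = 32768

days, rem_hours = divmod(hours, 24)   # -> 1365 days, 8 hours
years, rem_days = divmod(days, 365)   # -> 3 years, 270 days

print(f"{hours} h = {years} years, {rem_days} days, {rem_hours} hours")
print(f"2**15   = {2 ** 15}")
```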

To determine total Power-on Hours via Smart Storage Administrator, refer to the link below:

Smart Storage Administrator (SSA) – Quick Guide to Determine SSD Uptime

Yeah, you read that right: drive failure after a specific number of run hours. Yeah, total drive failure. If you’re running a storage unit with these disks, it can all implode at once with full data loss. Everyone has backups on alternative disks, right?

Today’s lesson and review: double-check your disks and any storage units you are using for age, and accept the risks accordingly. Also ensure you have backups, and TEST them.
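If you’d rather script the power-on-hours check than click through SSA on every box, here’s a rough Python sketch. It uses smartctl rather than HPE’s Smart Storage Administrator (which is what HPE actually points you at), the device paths are placeholders, and the regexes are best-effort guesses at the SATA and SAS output formats, so treat it as a starting point only.

```python
#!/usr/bin/env python3
"""Rough power-on-hours check via smartctl -- a sketch, not HPE's supported
method (that's Smart Storage Administrator / ssacli). Assumes smartctl is
installed and the device paths you pass in are correct; drives sitting behind
a Smart Array controller may need something like `smartctl -d cciss,N` instead.
SAS and SATA drives report power-on time differently, so the regexes below are
best-effort and may need tweaking for your exact drives."""

import re
import subprocess
import sys

THRESHOLD = 32768  # the failure point from the HPE advisory

# Matches either the SATA attribute table ("Power_On_Hours ... 12345")
# or the SAS form ("Accumulated power on time, hours:minutes 12345:30").
PATTERNS = [
    re.compile(r"Power_On_Hours.*?(\d+)\s*$"),
    re.compile(r"Accumulated power on time.*?(\d+):\d+"),
]

def power_on_hours(device):
    """Return the drive's power-on hours, or None if we couldn't parse them."""
    out = subprocess.run(
        ["smartctl", "-a", device],
        capture_output=True, text=True, check=False,
    ).stdout
    for line in out.splitlines():
        for pattern in PATTERNS:
            match = pattern.search(line)
            if match:
                return int(match.group(1))
    return None

if __name__ == "__main__":
    for dev in sys.argv[1:] or ["/dev/sda"]:   # placeholder default device
        hours = power_on_hours(dev)
        if hours is None:
            print(f"{dev}: could not parse power-on hours")
        else:
            print(f"{dev}: {hours} h ({THRESHOLD - hours} h until 32,768)")
```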

Another lesson I learned: the virtual hardware version a VM is created with determines which ESXi hosts it can actually be created on. While this is a “DUH” thing to say, it’s not so obvious when you restore a VM using Veeam, and Veeam isn’t coded to tell you the most duh thing ever. Instead the recovery wizard walks right through to the end and then gives you a generic error message, “Processing configuration error: The operation is not allowed in the current state.”, which didn’t help much until I stumbled across this Veeam forum post

and the great Gostev himself finishes the post with…

by Gostev » Aug 23, 2019 5:52 pm

“According to the last post, the solution seems to be to ensure that the target ESXi host version supports virtual hardware version of the source VM.”

That’s kool…. or how about… why doesn’t Veeam check this for you?!?!?!
Once I realized what the problem was, I simply restored the VM with a new name to the same host it was backed up from (a 6.5 ESXi host); I had been attempting to restore the VM to a 5.5 ESXi host. Again, after I realized that in the options I had picked a higher virtual hardware version, allowing the VM to be used only with newer versions of ESXi, it was like again… “DUHHH”. But then it made me think: why isn’t the software coded to check for such an obvious pre-requisite?
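For what it’s worth, the check I wish the restore wizard did is trivial to sketch: read the virtual hardware version out of the VM’s .vmx and compare it to what the target host supports. The version table below is abbreviated to the releases I care about here and the .vmx path is a made-up example; this is my own sketch, not anything Veeam or VMware ship.

```python
"""Sketch of a pre-flight check: will this VM's virtual hardware version run
on the target ESXi host? The mapping below is abbreviated (see VMware's
compatibility documentation for the full list) and the .vmx path is a
hypothetical example."""

# Highest virtual hardware version each ESXi release supports.
MAX_HW_VERSION = {
    "5.5": 10,
    "6.0": 11,
    "6.5": 13,
    "6.7": 14,
}

def vmx_hw_version(vmx_path):
    """Pull the number out of a line like: virtualHW.version = "13" """
    with open(vmx_path) as vmx:
        for line in vmx:
            if line.strip().startswith("virtualHW.version"):
                return int(line.split("=")[1].strip().strip('"'))
    raise ValueError(f"no virtualHW.version found in {vmx_path}")

def can_restore_to(vmx_path, esxi_release):
    return vmx_hw_version(vmx_path) <= MAX_HW_VERSION[esxi_release]

if __name__ == "__main__":
    # My exact situation: a VM created on a 6.5 host (hardware version 13)
    # cannot be restored to a 5.5 host, which tops out at version 10.
    print(can_restore_to("restored-vm.vmx", "5.5"))  # hypothetical path
```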
Whatever, nothing’s perfect.

ESXi 6.5 on ProLiant Gen9 Hardware Status Unknown

I’ll keep this post short.

If you have a ProLiant Gen9 server and are running ESXi 6.5 U2 along with VCSA 6.5 U2, none of your hosts will display any hardware status.

This should be fixed immediately, as you won’t get alerts on any hardware faults via IPMI. This includes status from hosts running ESXi 5.5 or 6.5.

The first fix is to upgrade the VCSA to 6.5u3.

After upgrading the VCSA to 6.5u3, hardware status will come back for each host… however, if you are running ESXi 6.5u2 on the Gen9 servers you’ll see something like this:

as you can see some sensors are a lil wonky…

The fix here is to upgrade the host to 6.5u3 via the HPE build.
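If you want a quick way to see which hosts still need the update, a small pyVmomi sketch like the one below will list each host’s version and build. The vCenter address and credentials are placeholders, and the 6.5 U3 build number is from memory, so verify it against VMware’s build-number KB before trusting the output.

```python
#!/usr/bin/env python3
"""List each ESXi host's version/build via pyVmomi and flag anything that
looks older than 6.5 U3. Hostname, credentials, and the build number are
assumptions/placeholders for my environment -- adjust before use."""

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ESXI_65U3_BUILD = 13932383  # assumed 6.5 U3 GA build -- verify against VMware's KB

si = SmartConnect(
    host="vcsa.example.com",                      # placeholder VCSA address
    user="administrator@vsphere.local",
    pwd="changeme",
    sslContext=ssl._create_unverified_context(),  # lab-only: skips cert checks
)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        product = host.summary.config.product
        status = "OK" if int(product.build) >= ESXI_65U3_BUILD else "UPGRADE ME"
        print(f"{host.name}: {product.fullName} [{status}]")
finally:
    Disconnect(si)
```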

After the hosts and the VCSA are on 6.5u3, all is good and hardware faults will again trigger critical alarms in vSphere.