Checking SSD health with ESXi 5.1 – Switched On Tech Design

A new feature in ESXi 5.1 is the ability to check SSD health from the command line. Once you have SSH’d into the ESXi host, you can check a drive’s health with the following command:

esxcli storage core device smart get -d [drive]

…where [drive] takes the form t10.ATA?????????. You can find the right drive name by listing the device directory:

ls -l /dev/disks/

This returns output similar to the following:

mpx.vmhba32:C0:T0:L0
mpx.vmhba32:C0:T0:L0:1
mpx.vmhba32:C0:T0:L0:5
mpx.vmhba32:C0:T0:L0:6
mpx.vmhba32:C0:T0:L0:7
mpx.vmhba32:C0:T0:L0:8
t10.ATA_____M42DCT064M4SSD2__________________________000000001147032121AB
t10.ATA_____M42DCT064M4SSD2__________________________000000001147032121AB:1
t10.ATA_____M42DCT064M4SSD2__________________________0000000011470321ADA4
t10.ATA_____M42DCT064M4SSD2__________________________0000000011470321ADA4:1

Here the t10.xxx names without the :1 at the end are the two SSDs available; the entries with a :1 suffix are partitions on those drives. Copy and paste the entire name as the [drive]. The command output should look like:

~ # esxcli storage core device smart get -d t10.ATA_____M42DCT064M4SSD2__________________________000000001147032121AB
Parameter                     Value  Threshold  Worst
----------------------------  -----  ---------  -----
Health Status                 OK     N/A        N/A
Media Wearout Indicator       N/A    N/A        N/A
Write Error Count             N/A    N/A        N/A
Read Error Count              100    50         100
Power-on Hours                100    1          100
Power Cycle Count             100    1          100
Reallocated Sector Count      100    10         100
Raw Read Error Rate           100    50         100
Drive Temperature             100    0          100
Driver Rated Max Temperature  N/A    N/A        N/A
Write Sectors TOT Count       100    1          100
Read Sectors TOT Count        N/A    N/A        N/A
Initial Bad Block Count       100    50         100

One figure to keep an eye on is the Reallocated Sector Count: this should be around 100, and it diminishes as the SSD replaces bad sectors with spares from its reserve. The statistics above are updated every 30 minutes. As a point of interest, in this case ESXi doesn’t appear to be picking up all of the data correctly: the SSD doesn’t actually have exactly 100 power-on hours and a power cycle count of exactly 100.
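Rather than reading the table by eye, the comparison between the Value and Threshold columns can be scripted. The following is only a rough sketch: it assumes awk is available in the shell, that the output has the same four-column layout as above, and that a Value at or below its Threshold is worth flagging. The function name smart_at_risk is my own, not part of esxcli.

```shell
# Print SMART parameters whose Value has fallen to or below the Threshold.
# Reads the output of "esxcli storage core device smart get" on stdin.
# smart_at_risk is a hypothetical helper name, not an esxcli feature.
smart_at_risk() {
    awk '
        NF < 4 { next }    # blank lines, or lines without all four columns
        {
            # Parameter names contain spaces, so count columns from the right.
            value = $(NF - 2)
            threshold = $(NF - 1)

            # Header, separator and N/A rows are not numeric: skip them.
            if (value !~ /^[0-9]+$/ || threshold !~ /^[0-9]+$/) next

            if (value + 0 <= threshold + 0) {
                name = $1
                for (i = 2; i <= NF - 3; i++) name = name " " $i
                print name ": value " value ", threshold " threshold
            }
        }
    '
}
```

It can be fed straight from the command, for example: esxcli storage core device smart get -d [drive] | smart_at_risk. If nothing is printed, none of the attributes the drive reports has reached its threshold.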

Assuming it works for your SSDs, this is quite a useful tool – knowing when a drive is likely to fail can give you the opportunity for early replacement and less downtime due to unexpected failures.
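If you have several drives, copying and pasting each name gets tedious. Here is a minimal sketch of a loop that picks out the whole-disk t10 names and runs the SMART query on each. It assumes the devices appear under /dev/disks/ with names like those shown earlier and that partition entries always carry a :N suffix. The function names list_t10_disks and check_all_disks are made up for illustration, and not every t10 device is necessarily an SSD.

```shell
# List whole-disk t10 device names, skipping partition entries (names
# containing a colon, such as ...21AB:1). The directory defaults to
# /dev/disks. list_t10_disks is a hypothetical helper name.
list_t10_disks() {
    dir="${1:-/dev/disks}"
    for path in "$dir"/t10.*; do
        [ -e "$path" ] || continue     # nothing matched: the glob stays literal
        name="${path##*/}"             # strip the directory part
        case "$name" in
            *:*) continue ;;           # partition entry, not a whole disk
        esac
        printf '%s\n' "$name"
    done
}

# Run the SMART query from this article against every disk found.
# check_all_disks is also a hypothetical name.
check_all_disks() {
    list_t10_disks "$@" | while read -r drive; do
        echo "== $drive"
        esxcli storage core device smart get -d "$drive"
    done
}
```

Running check_all_disks with no arguments prints one SMART table per drive, each preceded by the drive name.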