1. Recovered error predictive failure alert
The following error is reported in the NetApp ONTAP cluster node event log:
MM/DD/YYYY HH:MM:SS Cluster-02 ERROR disk.ioRecoveredError.pfa:
Recovered error predictive failure alert on disk 1c.xx.xx: op
0x2a:b1576e00:0200 sector 0 SCSI: recovered error - Disk reports
predicted failure event (1 5d 0 32)
Enclosure # : 1
SMART ASC # : 5D
Connector ID # : 0
SMART ASCQ # : 32
Event log identifier: disk.ioRecoveredError.pfa
Severity: ERROR
Description: This event is emitted when a disk determines that it will fail shortly. This occurs when a
threshold internal to the disk indicates that a failure is imminent.
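To confirm the alert and see how often it is firing, you can filter the cluster event log for this message name. A minimal sketch (add -node to scope it to a single node):
cluster::> event log show -message-name disk.ioRecoveredError.pfa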
Next step:
As the node name is clearly indicated in the error, along with the physical disk details, you can simply run either:
cluster::> storage aggregate show-status -node <node-name>
or,
cluster::> node run -node cluster-01
cluster-01> aggr status -r
2. Either command will show the status 'prefail' against the disk reporting the disk.ioRecoveredError SCSI
errors. If it does, one of two things can happen:
1) If there is a matching spare disk assigned on that node:
It will be automatically selected for Rapid RAID Recovery. In this process, the contents of the prefailed
disk are copied to the spare. At the end of the copy process, the prefailed disk is removed from the RAID
configuration; the node spins that disk down and marks it as 'broken' so that it can be removed from
the shelf.
As shown in the following output (Rapid RAID Recovery has begun):
data 1.xx.xx 0 SAS 10000 1.63TB 1.64TB (prefail, copy in progress) = disk reporting the error
data 1.xx.xx 0 SAS 10000 1.63TB 1.64TB (2% copied) = new spare
2) If there are no spares on that node:
The disk will eventually fail and the RAID group will go into a degraded state. Once a suitable spare disk
becomes available, the contents of the removed disk (broken state) will be reconstructed onto that
spare disk. Until that happens, the RAID group remains degraded, and its performance may see latency
issues depending on the RAID-group disk utilization percentage.
Not an ideal situation; a quick way to check for spares is sketched below.
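To check whether a matching spare exists on the node, and to list any disks already marked broken, a minimal sketch (the node name is a placeholder; exact output columns vary by ONTAP release):
cluster::> storage disk show -container-type spare -owner <node-name>
cluster::> storage disk show -container-type broken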
Proactive action: Check whether you have a matching spare on the partner node; if so, assign it to the
node that owns the failing disk, and the data should be copied to the new disk before it actually fails.
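A minimal sketch of reassigning a partner-owned spare (the disk name and node name are placeholders; removeowner must be run while the partner still owns the disk):
cluster::> storage disk removeowner -disk 1.xx.xx
cluster::> storage disk assign -disk 1.xx.xx -owner <node-name>
Once the spare is owned by the affected node, Rapid RAID Recovery should select it automatically.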
ashwinwriter@gmail.com
May, 2019