What are Storage Controller Health Status Messages?
The DS8K Storage Controller Health Message function sends alerts at the Logical Control Unit (LCU) level when resources are unavailable or under service. This process runs on both the primary and secondary storage systems in a mirrored environment, so an alert can relate to either system.
To understand more about the message codes themselves, start with these IBM Knowledge Center links:
IEA077A CRITICAL CONTROLLER HEALTH,MC=cc,TOKEN=dddd,SSID=xxxx,DEVICE NED=tttt.mmm.ggg.pp.ssssssssssss.uuuu,RANK|DA|INTF=iiii,text
IEA076E SERIOUS CONTROLLER HEALTH,MC=cc,TOKEN=dddd,SSID=xxxx,DEVICE NED=tttt.mmm.ggg.pp.ssssssssssss.uuuu,RANK|DA|INTF=iiii,text
IEA074I MODERATE CONTROLLER HEALTH,MC=cc,TOKEN=dddd,SSID=xxxx,DEVICE NED=tttt.mmm.ggg.pp.ssssssssssss.uuuu,RANK|DA|INTF=iiii,text
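The three layouts above share one structure, so a monitoring script can split them into fields with a single pattern. The following is a minimal sketch, not an IBM-provided parser: the function name `parse_health_message`, the field names, and the sample message values are all invented for illustration, and the pattern assumes the fields appear exactly in the order shown above.

```python
import re
from typing import Optional

# Pattern built from the message layout shown above. The allowed characters in
# each field are assumptions; adjust them to what your console actually logs.
HEALTH_MESSAGE = re.compile(
    r"^(?P<msgid>IEA077A|IEA076E|IEA074I)\s+"
    r"(?P<severity>CRITICAL|SERIOUS|MODERATE) CONTROLLER HEALTH,"
    r"MC=(?P<mc>[0-9A-Fa-f]{2}),"
    r"TOKEN=(?P<token>[^,]+),"
    r"SSID=(?P<ssid>[^,]+),"
    r"DEVICE NED=(?P<ned>[^,]+),"
    r"(?P<resource_type>RANK|DA|INTF)=(?P<resource>[^,]*),"
    r"(?P<text>.*)$"
)


def parse_health_message(line: str) -> Optional[dict]:
    """Return the message fields as a dict, or None if the line does not match."""
    match = HEALTH_MESSAGE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["mc"] = fields["mc"].upper()  # normalize the message code
    return fields


# Made-up sample values, for illustration only.
sample = (
    "IEA074I MODERATE CONTROLLER HEALTH,MC=02,TOKEN=1A2B,SSID=2100,"
    "DEVICE NED=2107.951.IBM.75.0000000AB123.0100,RANK=0005,"
    "RAID ARRAY REBUILD IN PROGRESS"
)
print(parse_health_message(sample))
```

Returning `None` for non-matching lines lets the same function be used as a filter over a mixed console log.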
When notifying Hosts of these events, the Storage Controller presents Attention status for every Logical Subsystem (LSS) on every path group associated with that LSS. A Host accessing multiple LSSs gets one notification per LSS. For an array rebuild affecting 16 LSSs in a 16-way Sysplex, 256 attentions would be raised for MC x’02’ (RAID ARRAY REBUILD IN PROGRESS) and then again for MC x’03’ (RAID ARRAY REBUILD COMPLETE). To allow duplicate messages to be ignored, each message carries a Token. The Token is unique to each event, and the message issued for each LSS contains the same Token. The Token value can therefore be used to identify whether a message has already been seen for another LSS.
These messages are monitored by applications such as GDPS and can also be automated with System Automation/NetView. Host actions can be triggered by these messages based on user-specified policies. One such action might be to trigger a HyperSwap in a replicated environment, from a primary system that is seeing impacting conditions to a secondary system that is not.
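The duplicate handling described above can be sketched as a small filter that acts only on the first message seen for a given Token. This is an illustrative sketch, not how GDPS or System Automation implement it: the `TokenFilter` class and the sample Token value are hypothetical, keying on the message code together with the Token is a conservative assumption, and a real filter would also expire old entries.

```python
from typing import Set, Tuple


def expected_attentions(lss_count: int, sysplex_systems: int) -> int:
    """One attention per LSS per system (assumes one path group per system per LSS)."""
    return lss_count * sysplex_systems


class TokenFilter:
    """Remembers (MC, Token) pairs so repeated notifications can be ignored."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, str]] = set()

    def is_first_sighting(self, mc: str, token: str) -> bool:
        key = (mc.upper(), token.upper())
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


token_filter = TokenFilter()
# Simulate MC x'02' arriving for 16 LSSs on each of 16 systems with one shared Token.
arrivals = [("02", "1A2B")] * expected_attentions(16, 16)
acted_on = sum(token_filter.is_first_sighting(mc, token) for mc, token in arrivals)
print(len(arrivals), acted_on)  # 256 1
```

With this filter, the 256 attentions from the rebuild example collapse into a single event to act on.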
Categories of Events
There are several categories of events. The categories are ordered in decreasing severity.
1. These critical/acute messages indicate an unplanned condition and can indicate data loss or a loss of access. These errors cause IEA077A messages on z/OS and may trigger an unplanned HyperSwap. These conditions also trigger a Call Home to alert IBM Support of the issue. It is important to monitor both the IEA077A alerts and the Call Home alerts to ensure that the alert is observed and addressed. Call Home allows IBM Service to address an issue promptly; if Call Home is not functioning correctly, the IBM response can be delayed.
|MC||Severity||Description|
|C0||Acute||Pinned Non-retryable Error in device|
|C1||Acute||Data loss occurred (FC-08 state)|
|C2||Acute||Data availability lost (FC-06 state)|
|C3||Acute||RAID Rank not available (FC-01 state)|
|C4||Acute||Device Adapter Pair Reset started, access lost. This event is normally reported as IEA076E MC x’42’ unless a product switch is set to produce this acute alert and allow for an elevated response to the event.|
2. The following IEA076E messages indicate that an unplanned condition has occurred. They require human intervention to determine whether action should be taken. For instance, if the primary volume has a secondary volume that is itself reporting alerts, errors, or warnings, it may not be advisable to HyperSwap. In addition, these conditions trigger the Call Home mechanism.
|MC||Severity||Description|
|41||Serious||Data Loss Error occurred during background media scrub|
|42||Serious||Device Adapter Pair Reset started, access lost|
3. The following IEA076E messages also indicate that an unplanned condition has occurred. They can be monitored by the operator or an application. These conditions can impact performance, so the user might want to determine whether a planned HyperSwap should be performed or, if the events are on a Metro Mirror secondary, whether to suspend mirroring. These conditions do not generate a Call Home.
|MC||Severity||Description|
|40||Serious||PPRC device I/O operations from primary to secondary are timing out. These operations are retried on different paths for up to 30 seconds.|
|80||Serious||Storage Controller experiencing repetitive warmstarts. For example, any 10 warmstarts within a 1-hour window.|
4. This set of IEA074I message codes indicates an unplanned condition that requires human intervention for action to be taken. These conditions also trigger the Call Home mechanism. The user might also want to determine whether a planned HyperSwap should be performed. Some of these conditions create a single point of failure, so the storage system has reduced redundancy and a second issue could be problematic.
|MC||Severity||Description|
|04||Moderate||Single cluster mode due to error. Call Home will occur only if the server is fenced.|
|07||Moderate||Device Adapter Fenced or Quiesced. This condition may degrade performance of the storage system. Device Adapter redundancy has been lost.|
|22||Moderate||Secondary Storage Controller failover|
5. The next set of IEA074I message codes indicates that an unplanned condition has occurred. An operator or application may monitor these messages, but doing so is not required. There is no Call Home except potentially for MC x’01’ and x’02’.
|MC||Severity||Description|
|01||Moderate||Device in Preemptive Reconstruct (PER) mode. This mode may last up to 2 minutes with the frequency of offload governed by a threshold. Note: Call Home may occur for PER Mode if enabled by a product switch.|
|02||Moderate||Device RAID Array is rebuilding. The rebuild may last a number of hours depending on the size of the drive. Call Home will be performed if required for drive replacement.|
|0D||Moderate||Host Adapter Recovery has started. The channel connections will be reset.|
|10||Moderate||PPRC path degraded due to high failure rate|
|20||Moderate||Secondary Storage Controller experienced recovery action. This legacy message is no longer used for warmstart, failover, or failback.|
|21||Moderate||Secondary Storage Controller warmstart|
6. The next set of IEA074I message codes indicates that an unplanned condition has been resolved. No Call Home is performed. These messages can be used by monitoring applications to clear an alert generated by the corresponding error event.
|MC||Severity||Description|
|03||Moderate||Device RAID Array finished rebuilding. x’02’ marked the start of the event.|
|06||Moderate||Back to Dual cluster mode. x’04’ or x’05’ marked the start of the event.|
|0E||Moderate||Host Adapter Recovery has ended. x’0D’ marked the start of the event.|
|0F||Moderate||Device Adapter Pair Reset has completed. x’42’ or x’C4’ marked the start of the event.|
|11||Moderate||PPRC path no longer degraded due to high failure rate. x’10’ marked the start of the event.|
|23||Moderate||Secondary Storage Controller failback. x’22’ marked the start of the event.|
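The start/end pairs in the table above lend themselves to a simple alert-clearing routine. The sketch below is one possible approach under stated assumptions: the `OpenAlerts` class is hypothetical, alerts are tracked per SSID (an assumption, since the messages are issued per LSS), and only codes that have a documented completion code in this category are tracked at all.

```python
from typing import Dict, List, Set, Tuple

# Completion code -> start codes it resolves, taken from the table above.
RESOLVED_BY: Dict[str, Set[str]] = {
    "03": {"02"},
    "06": {"04", "05"},
    "0E": {"0D"},
    "0F": {"42", "C4"},
    "11": {"10"},
    "23": {"22"},
}
START_CODES: Set[str] = set().union(*RESOLVED_BY.values())


class OpenAlerts:
    """Tracks start events per SSID and clears them when the completion arrives."""

    def __init__(self) -> None:
        self._open: Set[Tuple[str, str]] = set()  # (ssid, start MC)

    def record(self, ssid: str, mc: str) -> List[Tuple[str, str]]:
        """Record one message; return the alerts it cleared, if any."""
        mc = mc.upper()
        if mc in RESOLVED_BY:
            cleared = sorted(
                key for key in self._open
                if key[0] == ssid and key[1] in RESOLVED_BY[mc]
            )
            self._open.difference_update(cleared)
            return cleared
        if mc in START_CODES:
            self._open.add((ssid, mc))
        return []  # codes without a documented completion are not tracked here

    def open_alerts(self) -> List[Tuple[str, str]]:
        return sorted(self._open)


alerts = OpenAlerts()
alerts.record("2100", "02")          # rebuild starts on SSID 2100
alerts.record("2200", "42")          # adapter pair reset on SSID 2200
print(alerts.record("2100", "03"))   # [('2100', '02')]
print(alerts.open_alerts())          # [('2200', '42')]
```

Codes with no completion message, such as the acute data-loss codes, would need a separate manual-acknowledgement path.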
7. The next set of IEA074I messages indicates a planned condition and is provided so that the user is aware of the time period during which an action on the control unit is occurring. These conditions do not invoke the Call Home mechanism.
|MC||Severity||Description|
|05||Moderate||Single cluster mode due to Code load or Service mode.|
|09||Moderate||SFI Code Activation has started.|
|0A||Moderate||SFI Code Activation has completed.|
|0B||Moderate||HA Code Activation has started.|
|0C||Moderate||HA Code Activation has completed.|
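For automation policies, the seven categories above can be captured in a lookup table keyed by message code. The following sketch only restates what this article says. The `Category` type and the function names are invented for illustration, and MC x’C4’ is placed in category 1 on the assumption that the product switch for the acute alert is set.

```python
from typing import Dict, List, NamedTuple, Optional


class Category(NamedTuple):
    number: int       # position in the severity-ordered list above
    message_id: str   # z/OS message that carries the code
    planned: bool     # True only for planned activity
    call_home: bool   # category-level Call Home behavior


CATEGORIES: Dict[int, Category] = {
    1: Category(1, "IEA077A", False, True),
    2: Category(2, "IEA076E", False, True),
    3: Category(3, "IEA076E", False, False),
    4: Category(4, "IEA074I", False, True),
    5: Category(5, "IEA074I", False, False),
    6: Category(6, "IEA074I", False, False),
    7: Category(7, "IEA074I", True, False),
}

CODES_BY_CATEGORY: Dict[int, List[str]] = {
    1: ["C0", "C1", "C2", "C3", "C4"],
    2: ["41", "42"],
    3: ["40", "80"],
    4: ["04", "07", "22"],
    5: ["01", "02", "0D", "10", "20", "21"],
    6: ["03", "06", "0E", "0F", "11", "23"],
    7: ["05", "09", "0A", "0B", "0C"],
}

MC_TO_CATEGORY: Dict[str, Category] = {
    mc: CATEGORIES[number]
    for number, codes in CODES_BY_CATEGORY.items()
    for mc in codes
}

# Codes whose Call Home behavior the article describes as conditional.
CONDITIONAL_CALL_HOME = {"01", "02", "04"}


def classify(mc: str) -> Optional[Category]:
    """Return the category for a message code, or None if it is not listed."""
    return MC_TO_CATEGORY.get(mc.upper())


def call_home_expected(mc: str) -> Optional[str]:
    """Return 'yes', 'no', or 'conditional'; None for unlisted codes."""
    category = classify(mc)
    if category is None:
        return None
    if mc.upper() in CONDITIONAL_CALL_HOME:
        return "conditional"
    return "yes" if category.call_home else "no"


print(classify("C1").message_id, call_home_expected("C1"))  # IEA077A yes
print(classify("0a").number, call_home_expected("0a"))      # 7 no
```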
Thanks to Alan McClure, IBM GDPS Development and Level 3 support, Todd Sorenson, DS8K Platform and Error Recovery Team Lead, and Stephen Spor, zSeries Channel Verification Systems Test Engineering, for their expertise.