What are Storage Controller Health Status Messages?

The DS8K Storage Controller Health Message function sends alerts at the Logical Control Unit (LCU) level when resources are unavailable or under service. The function runs on both the primary and secondary storage systems in a mirrored environment, so an alert can relate to either one.

To learn more about the message codes themselves, the first place to start is the IBM Knowledge Center documentation for each message:

Critical Codes:
IEA077A CRITICAL CONTROLLER HEALTH,MC=cc,TOKEN=dddd,SSID=xxxx,DEVICE NED=tttt.mmm.ggg.pp.ssssssssssss.uuuu,RANK|DA|INTF=iiii,text

Serious Codes:
IEA076E SERIOUS CONTROLLER HEALTH,MC=cc,TOKEN=dddd,SSID=xxxx,DEVICE NED=tttt.mmm.ggg.pp.ssssssssssss.uuuu,RANK|DA|INTF=iiii,text

Moderate Codes:
IEA074I MODERATE CONTROLLER HEALTH,MC=cc,TOKEN=dddd,SSID=xxxx,DEVICE NED=tttt.mmm.ggg.pp.ssssssssssss.uuuu,RANK|DA|INTF=iiii,text
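As a rough illustration of how automation might pick these fields apart, the following Python sketch parses the three message layouts shown above. The regular expression, the function name parse_health_message, and the sample values are assumptions made for illustration only. In particular, the sketch assumes that RANK|DA|INTF means exactly one of those keywords appears, and that the MC, TOKEN, and SSID values are hexadecimal.

```python
import re
from typing import Optional

# Layout taken from the three message formats above. Field widths and
# character sets are assumptions; the source only shows placeholders.
HEALTH_MSG = re.compile(
    r"^(?P<msgid>IEA077A|IEA076E|IEA074I)\s+"
    r"(?P<severity>CRITICAL|SERIOUS|MODERATE) CONTROLLER HEALTH,"
    r"MC=(?P<mc>[0-9A-F]{2}),"
    r"TOKEN=(?P<token>[0-9A-F]+),"
    r"SSID=(?P<ssid>[0-9A-F]+),"
    r"DEVICE NED=(?P<ned>[^,]+),"
    r"(?P<resource>RANK|DA|INTF)=(?P<resource_id>[^,]+),"
    r"(?P<text>.*)$"
)


def parse_health_message(line: str) -> Optional[dict]:
    """Return the message fields as a dict, or None if the line does not match."""
    match = HEALTH_MSG.match(line.strip())
    return match.groupdict() if match else None


# Hypothetical sample; every value here is made up for the example.
SAMPLE = (
    "IEA074I MODERATE CONTROLLER HEALTH,MC=02,TOKEN=1004,SSID=2200,"
    "DEVICE NED=2107.996.IBM.75.0000000XYZ01.0000,RANK=0003,"
    "RAID ARRAY REBUILD IN PROGRESS"
)

fields = parse_health_message(SAMPLE)
print(fields)
```

Returning None for non-matching lines lets the same function be used as a filter over a general console log.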


When notifying Hosts of these events, the Storage Controller presents Attention status for every Logical Subsystem (LSS) on every path group associated with that LSS. A Host accessing multiple Logical Subsystems gets one notification per LSS. For an array rebuild affecting 16 Logical Subsystems in a 16-way Sysplex, 16 x 16 = 256 attentions would be raised for MC x’02’ (RAID ARRAY REBUILD IN PROGRESS) and then again for MC x’03’ (RAID ARRAY REBUILD COMPLETE). To allow duplicate messages to be ignored, each message carries a Token. The Token has a unique value for each event, and the message issued for each LSS affected by that event contains the same Token. The Token value can therefore be used to identify whether a message has already been seen for another LSS.
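The Token-based filtering described here can be sketched in Python as follows. This is a minimal illustration, not how GDPS or any IBM product implements it. The tuple layout, the function name first_per_event, and the Token and SSID values are all hypothetical, and the sketch assumes that every message reaches a single consumer and that keying on the MC and Token pair is enough to identify an event.

```python
from typing import Iterable, Iterator, Tuple

# (mc, token, ssid) as plain strings. This tuple layout is a hypothetical
# choice for the sketch, not something defined by the messages themselves.
Alert = Tuple[str, str, str]


def first_per_event(alerts: Iterable[Alert]) -> Iterator[Alert]:
    """Yield only the first alert seen for each (MC, Token) pair."""
    seen = set()
    for mc, token, ssid in alerts:
        key = (mc, token)
        if key in seen:
            continue  # same event already reported for another LSS or path group
        seen.add(key)
        yield (mc, token, ssid)


# Simulate the rebuild example: 16 LSSs, each reported on 16 systems,
# all carrying the same made-up Token value.
alerts = [
    ("02", "1004", format(0x2200 + lss, "04X"))
    for lss in range(16)
    for _system in range(16)
]
unique = list(first_per_event(alerts))
print(len(alerts), "attentions reduced to", len(unique), "event")
```

Including the MC in the key is a conservative choice: it keeps the start and completion messages of one incident distinct even if they were to share a Token, which the source does not specify.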

These messages are monitored by applications such as GDPS and can also be automated with System Automation/NetView. Host actions can be triggered by these messages based on user-specified policies. One such action might be a HyperSwap in a replicated configuration, from a primary system that is seeing impacting conditions to a secondary system that has no impacting issue.

Categories of Events

There are several categories of events. The categories are ordered in decreasing severity.

1. These critical/acute messages indicate an unplanned condition and can indicate data loss or a loss of access. These errors cause IEA077A messages on z/OS and may trigger an unplanned HyperSwap. These conditions also trigger a Call Home to alert IBM Support of the issue. It is important to monitor both the IEA077A alerts and the Call Home alerts to ensure that the alert is observed and addressed. Call Home allows IBM Service to address an issue promptly; if it is not functioning correctly, the IBM response can be delayed.

MC Severity Description
C0 Acute Pinned Non-retryable Error in device
C1 Acute Data loss occurred (FC-08 state)
C2 Acute Data availability lost (FC-06 state)
C3 Acute Raid Rank not available (FC-01 state)
C4 Acute Device Adapter Pair Reset started, access lost
The message is normally IEA076E MC x’42’ unless a product switch is set to produce this acute alert and allow for an elevated response to the event.

2. The following IEA076E messages indicate that an unplanned condition has occurred. They also require human intervention to determine whether action should be taken. For instance, if the primary volume has a secondary volume that is itself reporting alerts, errors, or warnings, it may not be advisable to HyperSwap. In addition, these conditions trigger the Call Home mechanism.

MC Severity Description
41 Serious Data Loss Error occurred during background media scrub
42 Serious Device Adapter Pair Reset started, access lost

3. The following IEA076E messages also indicate that an unplanned condition has occurred. They can be monitored by the operator or by an application. These conditions can impact performance, so the user might want to determine whether a planned HyperSwap should be performed or, if the events are on a Metro Mirror secondary, whether to suspend mirroring. These conditions do not generate a Call Home.

MC Severity Description
40 Serious PPRC device I/O operations from primary to secondary are timing out.
These operations are retried on different paths for up to 30 seconds.
80 Serious Storage Controller experiencing repetitive warmstarts.
For example, any 10 warmstarts within a 1-hour window.

4. This set of IEA074I message codes indicates an unplanned condition that requires human intervention before action is taken. These codes also trigger the Call Home mechanism. The user might also want to determine whether a planned HyperSwap should be performed. Some of these conditions create a single point of failure, so the storage system has reduced redundancy and a second issue could be problematic.

MC Severity Description
04 Moderate Single cluster mode due to error
Call home will occur only if the server is fenced.
07 Moderate Device Adapter Fenced or Quiesced.
This condition may degrade performance of the storage system. Device Adapter redundancy has been lost.
22 Moderate Secondary Storage Controller failover

5. The next set of IEA074I message codes indicate an unplanned condition has occurred. These messages may be ones that an operator or application would monitor, but it is not required. There is no Call Home except potentially for MC x’01’ and x’02’.

MC Severity Description
01 Moderate Device in Preemptive Reconstruct (PER) mode.
This mode may last up to 2 minutes with the frequency of offload governed by a threshold.
Note: Call Home may occur for PER Mode if enabled by a product switch.
02 Moderate Device RAID Array is rebuilding.
The rebuild may last a number of hours depending on the size of drive.
Call Home will be performed if required for drive replacement.
0D Moderate Host Adapter Recovery has started
The channel connections will be reset.
10 Moderate PPRC path degraded due to high failure rate
20 Moderate Secondary Storage Controller experienced recovery action.
This legacy message is no longer used for warmstart, failover, or failback.
21 Moderate Secondary Storage Controller warmstart

6. The next set of IEA074I message codes indicate an unplanned condition has been resolved. No Call Home is performed. These messages can be used by monitoring applications to clear an alert generated by the corresponding error event.

MC Severity Description
03 Moderate Device RAID Array finished rebuilding
x’02’ marked the start of the event.
06 Moderate Back to Dual cluster mode
x’04’ or x’05’ marked the start of the event.
0E Moderate Host Adapter Recovery has ended.
x’0D’ marked the start of the event.
0F Moderate Device Adapter Pair Reset has completed
x’42’ or x’C4’ marked the start of the event.
11 Moderate PPRC path no longer degraded due to high failure rate
x’10’ marked the start of the event.
23 Moderate Secondary Storage Controller failback
x’22’ marked the start of the event.
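The start and resolution pairs listed above could be used by a monitor roughly as sketched below in Python. The mapping is copied from the table; the names RESOLVES and apply_event are hypothetical, and the sketch deliberately tracks open alerts by MC alone. A real monitor would presumably also key on the SSID or Token, which is left out here to keep the example short.

```python
from typing import Set

# Resolution code -> start codes it clears, as listed in the table above.
RESOLVES = {
    "03": {"02"},        # RAID array rebuild finished
    "06": {"04", "05"},  # back to dual cluster mode
    "0E": {"0D"},        # Host Adapter recovery ended
    "0F": {"42", "C4"},  # Device Adapter Pair Reset completed
    "11": {"10"},        # PPRC path no longer degraded
    "23": {"22"},        # secondary Storage Controller failback
}


def apply_event(open_alerts: Set[str], mc: str) -> Set[str]:
    """Return the set of open alert codes after message code mc is seen.

    A resolution code removes its matching start codes. Any other code is
    recorded as open (a simplification: codes with no listed resolution
    stay open until cleared by some other means).
    """
    mc = mc.upper()
    if mc in RESOLVES:
        return open_alerts - RESOLVES[mc]
    return open_alerts | {mc}


# Example: a rebuild starts, a PPRC path degrades, then the rebuild completes.
state: Set[str] = set()
for code in ["02", "10", "03"]:
    state = apply_event(state, code)
print(sorted(state))
```

Returning a new set rather than mutating the input keeps each step easy to test in isolation.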

7. The next set of IEA074I messages indicates a planned condition. They are provided so the user is aware of the time period during which an action on the control unit is occurring. These conditions do not invoke the Call Home mechanism.

MC Severity Description
05 Moderate Single cluster mode due to Code load or Service mode.
09 Moderate SFI Code Activation has started.
0A Moderate SFI Code Activation has completed.
0B Moderate HA Code Activation has started.
0C Moderate HA Code Activation has completed.
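The seven categories can be condensed into a single lookup, shown here as a Python sketch for use in monitoring scripts. The category numbers, message IDs, and Call Home behavior are taken from the lists above; the names CallHome, MC_TABLE, and lookup are hypothetical. "Conditional" covers the cases the notes qualify (x’04’ only if the server is fenced, x’01’ if enabled by a product switch, x’02’ if a drive replacement is required). The sketch treats x’C4’ as IEA077A, which per the note applies only when the product switch is set.

```python
from enum import Enum
from typing import Dict, List, Tuple


class CallHome(Enum):
    YES = "yes"
    CONDITIONAL = "conditional"
    NO = "no"


# MC -> (category number, message ID, Call Home behavior)
MC_TABLE: Dict[str, Tuple[int, str, CallHome]] = {}


def _add(category: int, msgid: str, call_home: CallHome, codes: List[str]) -> None:
    """Register every code in one category with the same attributes."""
    for mc in codes:
        MC_TABLE[mc] = (category, msgid, call_home)


_add(1, "IEA077A", CallHome.YES, ["C0", "C1", "C2", "C3", "C4"])
_add(2, "IEA076E", CallHome.YES, ["41", "42"])
_add(3, "IEA076E", CallHome.NO, ["40", "80"])
_add(4, "IEA074I", CallHome.YES, ["07", "22"])
_add(4, "IEA074I", CallHome.CONDITIONAL, ["04"])       # only if the server is fenced
_add(5, "IEA074I", CallHome.CONDITIONAL, ["01", "02"])  # switch or drive replacement
_add(5, "IEA074I", CallHome.NO, ["0D", "10", "20", "21"])
_add(6, "IEA074I", CallHome.NO, ["03", "06", "0E", "0F", "11", "23"])
_add(7, "IEA074I", CallHome.NO, ["05", "09", "0A", "0B", "0C"])


def lookup(mc: str) -> Tuple[int, str, CallHome]:
    """Return (category, message ID, Call Home) for a two-character MC value."""
    return MC_TABLE[mc.upper()]


category, msgid, call_home = lookup("c2")
print(category, msgid, call_home.value)
```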

Thanks to Alan McClure, IBM GDPS Development and Level 3 support, Todd Sorenson, DS8K Platform and Error Recovery Team Lead, and Stephen Spor, zSeries Channel Verification Systems Test Engineering, for their expertise.

