The I/O Operations (I/O Ops) component of System Automation for z/OS (SAFOS) has been used by z/OS clients to manage the FICON I/O switch configuration, make configuration changes and display status information. In July of 2014 the IBM System Automation for z/OS product team announced a statement of direction to remove the I/O Operations component. SAFOS 3.5 is the last version to support I/O Operations and will be out of service in September 2019.
This article describes functions that have been added to the z/OS product so that clients can continue to manage their FICON SAN infrastructure from z/OS. Clients are encouraged to provide feedback on any critical functions that may be missing.
Over the last several years z/OS has taken many steps to offer alternatives to the critical functions provided by SAFOS I/O Operations. These functions include the ‘safe switching’ capability, which allows clients to ensure that ports being blocked are not in use by any sharing systems, and the ability to display how the SAN is connected from switch to switch, from host to switch and from switches to devices. The state of all the ports in the switch, including the inter-switch links, also needs to be accessible to the operations staff. Because these functions are integrated into z/OS, they can work with logical resource names (e.g. device numbers, control unit numbers and channel path identifiers).
‘Safe switching’ is one of the most important features that SAFOS I/O Operations provided.
When a repair service needs to be performed on a device in the SAN, or even on a single port to a device, clients are encouraged to block all the switch ports that connect to the device that will be affected by the service procedure. This is done because FICON channels register for state-change-notifications (SCN) so that hosts accessing a port are notified if a disruptive event, such as loss of light, occurs. By blocking the switch port(s), a potential firestorm of activity from these SCNs, and the resulting disruption to production workloads, is avoided. The ‘safe switching’ function provided by SAFOS I/O Operations provides multi-system coordination to ensure that ports that are blocked are not in use by any of the systems accessing the SAN. Blocking a port that is in use could cause I/O errors to occur (interface control checks, missing interrupts and not operational conditions).
With the removal of SAFOS I/O Operations, a new way to provide the safe switching function was needed. System z and Brocade developed a new function called Port Decommissioning, built upon industry standards. Instead of SAFOS communicating the status from all the sharing systems, the FICON switch does it. When a port needs to be blocked, the operations team initiates the request from the switch. The CIM Control Application running in the FICON director signals each host system that has access to the switch that the port(s) are to be blocked. A z/OS CIMOM running in each host takes the affected paths to the devices offline and acknowledges to the Control Application that it is safe to block the switch port. If a host system is unable to take an affected path offline, as might be the case if it represents the last path to a device, that host fails the request and the port decommissioning request is aborted.
Additional information on Port Decommissioning can be found at IBM Redbooks and Brocade.
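The coordination described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `Host` class, its `online_paths` table and the `decommission_port` function are hypothetical names invented for illustration and are not part of z/OS, the CIMOM or any switch API. The sketch also checks every host before changing anything, which is a simplification; the text above does not say how a real implementation handles hosts that have already released their paths when another host fails the request.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Host:
    """Hypothetical model of one z/OS system sharing the switch."""
    name: str
    # device number -> switch ports that currently carry online paths to it
    online_paths: Dict[str, Set[str]] = field(default_factory=dict)

    def can_release(self, port: str) -> bool:
        # Refuse when the port carries the last online path to any device.
        return all(paths != {port} for paths in self.online_paths.values())

    def vary_offline(self, port: str) -> None:
        # Take every path that runs through the port offline.
        for paths in self.online_paths.values():
            paths.discard(port)


def decommission_port(port: str, hosts: List[Host]) -> bool:
    """Return True only if every host released the port (safe to block)."""
    # A single host that would lose its last path aborts the whole request.
    if not all(host.can_release(port) for host in hosts):
        return False
    for host in hosts:
        host.vary_offline(port)
    return True
```

For example, if SYSB reaches device 9000 only through port 1A, a request to decommission 1A is aborted and no host changes its configuration, while a request for port 2B succeeds.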
Displaying Connectivity Information
The z/OS operator command used to display information about the I/O configuration is Display Matrix. Options are provided that allow operators to obtain information about the switches, switch ports, inter-switch links (ISLs), control units and devices.
Fibre Channel switches include a control unit port (CUP) device that allows the fabric to communicate with z/OS via device-specific support code. Link failures that occur in the SAN are reported to z/OS via a link incident report, which is presented through an attention message together with the link incident information.
Bottleneck detection and fenced ports are also surfaced with unsolicited messages from the CUP device.
A FICON Director reports a Health Summary Check condition when it detects conditions within the fabric that indicate that one or more ports, or routing between ports, may be operating at less than optimal capability. The condition is reported asynchronously using unsolicited alert status along with sense data that provides additional diagnostic information called Health Summary Diagnostic Parameters.
The Health Summary Diagnostic Parameters are used in the execution of further diagnostic commands to give the installation detailed information on the disturbance.
From time to time abnormal conditions can occur in the SAN, causing performance problems and disruptions to service level agreements (SLAs). Examples of abnormal conditions include the following:
- Multi-system work load spikes
- Multi-system resource contention in the SAN or at the CU ports
- SAN congestion
- Destination port congestion
- Firmware failures in the SAN, channel extenders, WDMs, control units
- Hardware failures (link speeds did not initialize correctly)
- Cabling errors
- Dynamic changes in fabric routes (possible multi-hop cascading)
- Firmware bugs
The z/OS operating system has added numerous health checks to help clients identify problems in the SAN.
CMR Health Check
When abnormal conditions occur, some paths to a storage system may perform poorly relative to other paths. The IBM Z channel subsystem is designed to dynamically adjust its I/O path selection algorithm to prefer sending requests to the better performing paths, based on the initial command response (CMR) time component of the I/O service time. When CMR time is not well balanced across paths to a storage system, it is a symptom of an abnormal condition, and the health check infrastructure warns the client when this occurs so that the root cause can be determined and fixed.
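The idea of a CMR imbalance can be sketched as a comparison of each path's average CMR time against the best performing path. The function below is an illustration only and is not the logic of the z/OS health check: the function name, the `threshold_ms` and `ratio` parameters and their default values are assumptions chosen for the example.

```python
from statistics import mean
from typing import Dict, List


def cmr_outliers(cmr_ms: Dict[str, List[float]],
                 threshold_ms: float = 3.0,
                 ratio: float = 5.0) -> List[str]:
    """Return the paths whose average CMR time looks out of balance.

    cmr_ms maps a channel path identifier to its CMR time samples in
    milliseconds. A path is flagged when its average is at least
    threshold_ms AND at least ratio times the best path's average.
    Both limits are hypothetical tuning values.
    """
    averages = {chpid: mean(s) for chpid, s in cmr_ms.items() if s}
    if not averages:
        return []
    best = min(averages.values())
    return sorted(
        chpid for chpid, avg in averages.items()
        if avg >= threshold_ms and avg >= ratio * best
    )
```

Requiring both an absolute floor and a relative ratio keeps the sketch from flagging paths that differ only by measurement noise when all CMR times are small.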
I/O Rate Health Checks
When the channel subsystem’s path selection algorithms succeed in routing work to the better performing paths, the CMR time imbalance can be averted. In that case, however, an I/O start rate imbalance appears at the device instead. The IOS health check infrastructure also alerts clients when this is detected.
On occasion, clients have experienced cases, after power on reset, where a subset of links fail to initialize at the highest expected link speed. When this occurs, the resulting asymmetry in performance can lead to unexpected bottlenecks and missed service level agreements. With the creation of the Read Diagnostic Parameters capability described below (see Diagnostic Commands), z/OS can now recognize when inconsistencies in link speeds occur across paths to a storage system and end to end on a single path. When these inconsistencies occur, the IOS component of z/OS issues health check messages warning clients that performance issues may occur.
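The two kinds of inconsistency mentioned above (end to end on a single path, and across paths to the same storage system) can be sketched as follows. This is a simplified illustration, not the IOS implementation: the function name and input layout are hypothetical, and the sketch assumes that every link on a path is expected to run at the same speed, which need not hold in a real fabric where inter-switch links may be faster by design.

```python
from typing import Dict, List


def link_speed_findings(paths: Dict[str, List[int]]) -> List[str]:
    """Report link speed inconsistencies for paths to one storage system.

    paths maps a channel path identifier to the negotiated speed (Gbps)
    of each link on that path, ordered from channel to control unit.
    """
    findings = []
    # End to end on a single path: flag any path with mixed link speeds.
    for chpid, hops in sorted(paths.items()):
        if len(set(hops)) > 1:
            findings.append(f"path {chpid}: mixed link speeds {hops}")
    # Across paths: compare the slowest link of each path.
    slowest = {chpid: min(hops) for chpid, hops in paths.items() if hops}
    if slowest:
        fastest = max(slowest.values())
        for chpid in sorted(slowest):
            if slowest[chpid] < fastest:
                findings.append(
                    f"path {chpid}: limited to {slowest[chpid]} Gbps, "
                    f"other paths reach {fastest} Gbps"
                )
    return findings
```

A path whose middle link came up at 8 Gbps while its other links and all other paths run at 16 Gbps would be reported twice, once for each kind of inconsistency.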
Displaying Switch State Information
The Display Matrix command for switches was created to display state information for each port of a switch.
Displaying Inter-Switch Link Information
Clients can display information about the Inter-Switch Links (ISLs) by using the Display Matrix command for devices with the optional keyword that requests the route information.
Read Diagnostic Parameters (RDP) is a new T11.org standard Extended Link Service (ELS) that provides the instrumentation needed for z Systems to provide enhanced problem determination and fault isolation.
Instrumentation data kept at every link in the SAN can be obtained using this in-band ELS command. IBM Z and z/OS retrieve the data and display it on the operator console via the Display Matrix operator commands.
IBM System z Discovery and Auto-configuration and Dynamic CHPID Management
IBM Z and z/OS provide tooling, System z Discovery and Auto-Configuration, which can significantly simplify I/O configuration planning and definition. Clients can simply plug new storage systems into the SAN and have z/OS automatically discover the topology and propose a configuration definition with the optimal availability characteristics. Clients only need to ensure that the physical connections are made with sufficient redundancy that no single points of failure exist.
Further, Dynamic CHPID Management lets clients choose to have a subset of the available channels dynamically assigned to storage systems on the fly, when the workload requires additional I/O bandwidth for I/O requests to meet the WLM-specified goals. This allows management at a coarser level, simplifying the human task and possibly reducing hardware costs while improving the system’s ability to meet the demands of a very dynamic workload mix.