Introduction

Unexpected disasters that require you to move your business-critical resources to a remote backup site can occur at any time. If your environment has IBM® Geographically Dispersed Resiliency for Power Systems™ (GDR) deployed, its main component, KSYS (the controller node), which controls the entire data center configuration, notifies your system administrator of a disaster event. The admin can then manually invoke the disaster recovery (DR) commands to complete an unplanned move. Figure 1 illustrates the steps involved in manually triggering an unplanned DR process.

Figure 1. Trigger an unplanned DR manually

What happens when the unplanned DR process cannot be invoked manually?

In a situation where your admin cannot attend to or take the necessary action on time, the unplanned disaster recovery process can be automated. With GDR, this can be achieved by registering the user-defined scripts for critical events. This tutorial describes how to configure GDR to automatically move your business-critical virtual machines (VMs) to a remote backup site during a disaster.

Figure 2 illustrates the steps required to trigger an unplanned DR process automatically.

Figure 2. Trigger an unplanned DR process automatically

Learning objectives

In this tutorial you will learn how to:

  • Configure the KSYS cluster and make sure that both the discovery and verification processes for active site resources are successful.
  • Register the user-defined scripts for critical events such as HMC_UNREACHABLE, STG_UNREACHABLE, and HOST_FAILURE.
  • Precisely identify the time at which a disaster occurred (this tutorial illustrates the HMC_UNREACHABLE disaster).
  • Verify that KSYS automatically invokes an unplanned move as soon as it receives a critical event.

Scenario details

This section describes the terminology used by IBM GDR, the hardware used, and the prerequisites to configure GDR to automatically move your business-critical VMs to a remote backup site during a disaster.

Terminology

You should be familiar with the following key terminology used by IBM GDR.

Term Description
KSYS KSYS is the controller node which monitors the production site resources.
In case of a disaster, it notifies the admin, so that the admin can invoke the move from KSYS.
Production site A production site is where active VMs are located and where the applications are running.
Backup site A backup site is where the Hardware Management Console (HMC), the Central Electronics Complex (CEC, also referred to as the host), and storage are available, and to which the active site VMs are moved in case of a disaster.
Disaster recovery (DR) Disaster recovery is the process of moving the VMs from the active site to the backup site in case of a disaster (for example, HMC down, CEC down, storage/disk is down).
Planned DR In case maintenance of servers is needed, the admin can invoke a planned move.
Here, the cleanup of primary site will be done automatically during the move.
Unplanned DR In case of a disaster, the admin can invoke an unplanned move.
Here, the cleanup of the primary site should be done manually by the admin.
Critical events In an enterprise infrastructure, HMC, CEC and storage are the critical resources, which should be up and running 24×7, for continuity of your business.
If any one of the critical resources goes down, the respective event is generated and the KSYS node is notified.

List of events:
HMC_UNREACHABLE – HMC is down or not reachable
STG_UNREACHABLE – Storage subsystem is down or not reachable
HMC_REACHABLE – HMC has recovered and is now reachable
VIOS_RMC_STATE_DOWN – HMC to VIOS Resource Monitoring and Control (RMC) connectivity/network seems to have problems
INSUFFICIENT_HOST_CAPACITY – Backup host does not have sufficient capacity to do a successful DR failover
VIOS_FAILURE – VIOS seems to have failed
VM_CONFIG_COLLECTION_FAILURE – Configuration data collection failed for the VM
DAILY_VERIFY_FAILED – Daily verification checks have failed
REPLICATION_FAILURE – Storage reports replication problem
MIRROR_RELATIONSHIP_MISSING – Disk has no mirror pair
HOST_FAILURE – Host failure has occurred
FILESYSTEM_SPACE_WARNING – File system is occupied more than 90%
VM_MOVE – VM has moved from one host to another
DAILY_VERIFY_COMPLETE – Daily verification checks have completed successfully
HOST_IN_INVALID_STATE – Host is in invalid state
VM_STORAGE_COLLECTION_FAILURE – Storage information collection has failed for the VM
HMC_LOGIN_FAILURE – HMC login failed
DISK_VALIDATION_FAILURE – Disk group validation failure
VIOS_DELETED – VIOS deletion has been detected
VM_NOT_ACTIVE – VM does not seem to be active
DUPLICATE_VMs – VM exists on multiple hosts
VM_DISCOVERED_ON_HOST – VM has been detected on host
VM_DELETED_FROM_HOST – VM has been deleted from host
VM_NOT_FOUND – VM is not found
Administrator An administrator invokes the DR commands to initiate planned or unplanned moves in case of maintenance or disaster, from the KSYS node.
Register Users can register for the above critical events to get notified in case of a configuration change.
User scripts Users can develop scripts that run the necessary action for the respective critical event.
Automatic DR An automatic DR process can be invoked by user-defined scripts to move the VMs automatically by registering for critical events such as HMC_UNREACHABLE.

Note: Effective December 2018, IBM Geographically Dispersed Resiliency for Power Systems is renamed to VM Recovery Manager DR for Power Systems. In this tutorial, we still use IBM Geographically Dispersed Resiliency for Power Systems, assuming that most users might not be familiar with the new name. However, we will follow the new naming convention in future tutorials.

Hardware environment

Figure 3 illustrates the following hardware environment we used for this tutorial.

  • Site 1 is the active production site located in Austin.
  • Host1_1 is a host (managed system) in the active site. For example, bacon_8202-E4C-10F477R.
  • VM is the managed virtual machine or logical partition (LPAR) which is running with the production workload on the active site. In this example, vmcmc1 to vmcmc5.
  • KSYS node is the control system located in the active site. For example, r7r3m107.
  • Site 2 is the remote backup site located in India.
  • Host2_1 is a host (managed system) in the backup site. For example, sausage_8202-E4C-10F478R.

Figure 3. Hardware environment

Prerequisites

To complete the steps in this tutorial, make sure that the following prerequisites are met:

  • Storage disk mirroring is set up manually by the admin.
  • A separate storage area network (SAN) is configured for each site.
  • RMC is active on both the source VIOS and the target VIOS.

Steps

This section explains the administration of the KSYS cluster and the procedure to register event-based, user-defined scripts.

Configure the KSYS cluster

To configure the KSYS cluster, you need to add the HMCs, hosts, and storage agents, and pair each source site host with a target site host. This can be done using the command-line interface provided by the KSYS file sets, which is referred to as the ksysmgr utility. For more information, read the ksysmgr command article in the IBM Knowledge Center.

Query the KSYS cluster resources

  1. Run the following command to check the status of ksyscluster. The output is shown in Figure 4.
    #ksysmgr query ksyscluster

    Figure 4. ksysmgr query cluster output

    For more information about configuring the KSYS cluster, refer to the Configuring GDR article in the IBM Knowledge Center.

  2. Run the following commands to check the site and host of the cluster configuration:

    #ksysmgr query site
    #ksysmgr query cec
    

    Figure 5 shows the output of the cluster configuration commands.

    Figure 5. Current cluster configuration

  3. From the source HMC, verify that the VMs are active and running on the host, bacon_8202-E4C-10F477R, as shown in Figure 6.

    Figure 6. Active VMs on source HMC

  4. Make sure that the discovery and verify operations run successfully. Figure 7 and Figure 8 show the output of the command.

    The discovery operation gets all the required details of the active site virtual machines. The details include profile information and storage information, along with the replicated disks at the remote site. KSYS stores all the information gathered during the discovery operation in a registry. This information is used to activate the virtual machines on the target site during disaster recovery.

    The verification operation helps the admin verify that all the virtual machines and resources are in a good state and can become active at the target site whenever a DR event is triggered. This operation also helps the admin keep a regular check on failures by continuously monitoring each resource. Run both operations using the following command.

    #ksysmgr -t discover site Austin verify=true

    Figure 7. Command to invoke discovery and verify

    Figure 8. Verify output
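
Because the automatic move depends on a successful discovery and verification, it can be useful to wrap this step in a small helper that later automation can check. The following is a minimal sketch, not part of the GDR product: the function name discover_and_verify and the KSYSMGR variable are hypothetical, and the sketch assumes that ksysmgr returns a non-zero exit status when discovery or verification fails, which you should confirm in your environment.

```shell
#!/bin/sh
# Hypothetical helper: run discovery plus verification for a site and
# report the result. Assumption: ksysmgr exits non-zero on failure.
KSYSMGR=${KSYSMGR:-/opt/IBM/ksys/ksysmgr}

discover_and_verify() {
    site=$1
    # Same command as in step 4, with the site name passed as an argument.
    if "$KSYSMGR" -t discover site "$site" verify=true; then
        echo "discovery and verify succeeded for site $site"
        return 0
    fi
    echo "discovery or verify failed for site $site" >&2
    return 1
}

# Usage (on the KSYS node): discover_and_verify Austin
```

A non-zero return value from the helper can be used to stop any further setup, for example before registering the auto-move script.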

Configure the automatic move

IBM GDR provides an option for registering events and indicating a follow-up action by configuring the user-defined scripts. You can register any of the events based on your specific requirements.

  1. Run the following command for a list of available events:

    #ksysmgr query events

    Figure 9. List of available events

  2. Create a sample user-defined script to recover the VMs automatically from the production site to the backup site.

    Note: In this tutorial an unplanned move is demonstrated with the HMC_UNREACHABLE event.

    Listing 1 shows a sample user-defined script that invokes the auto-move (unplanned DR) as soon as it receives the HMC_UNREACHABLE event, in case of an HMC disaster.

    Listing 1. Sample user-defined script

    # cat dr_script.sh
    #!/bin/ksh
    # Trace each command to help with debugging.
    set -x
    invoke_dr=NO

    # Determine the names of the active and backup sites.
    ACTIVE=`/opt/IBM/ksys/ksysmgr q site | grep -e "Name:" -e "Sitetype:"  | awk '{print $2}' | while read -r name ; do read -r sitetype ;echo "$name:$sitetype";done | grep -e "ACTIVE" | awk -F ":" ' { print $1 } '`
    BACKUP=`/opt/IBM/ksys/ksysmgr q site | grep -e "Name:" -e "Sitetype:"  | awk '{print $2}' | while read -r name ; do read -r sitetype ;echo "$name:$sitetype";done | grep -e "BACKUP" | awk -F ":" ' { print $1 } '`

    # Authenticate the failure event by pinging the HMCs of the active site

    SOURCE_HMCS=`/opt/IBM/ksys/ksysmgr q hmc | grep -e "Site:" -e "Ip:"  | awk '{print $2}' | while read -r Site ; do read -r IP ;echo "$Site:$IP";done  | grep "$ACTIVE" | awk -F ":" ' { print $2 } ' `

    for IP in $SOURCE_HMCS
    do
        ping -c 3 "$IP" >/dev/null 2>&1
        # An exit status of 1 from ping is treated as the HMC being unreachable.
        if [ $? -eq 1 ]
        then
            invoke_dr="YES"
            break
        fi
    done

    # Perform the move

    if [ "$invoke_dr" = "YES" ]
    then
        echo "Got HMC_UNREACHABLE event from KSYS" >/tmp/"migrate_"$$
        sleep 120    # this can be set based on configuration
        /opt/IBM/ksys/ksysmgr -t move site from=$ACTIVE to=$BACKUP force=true dr_type=unplanned >>/tmp/"migrate_"$$
    fi
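
The site lookup in Listing 1 chains grep, awk, and a read loop. As an illustration only, the same lookup can be sketched as a single awk filter. The function name site_by_type is hypothetical, and the sketch assumes that the output of ksysmgr q site contains lines whose first field is "Name:" or "Sitetype:" followed by the value, which is what the grep patterns in Listing 1 imply. Unlike Listing 1, it matches the site type exactly rather than as a substring.

```shell
#!/bin/sh
# Hypothetical helper: read "ksysmgr q site"-style text on standard input and
# print the name of each site whose Sitetype equals the requested type.
# Assumption: each site is reported as a "Name:" line followed by a
# "Sitetype:" line, with the value in the second field.
site_by_type() {
    awk -v want="$1" '
        $1 == "Name:"                   { name = $2 }   # remember the latest site name
        $1 == "Sitetype:" && $2 == want { print name }  # print it when the type matches
    '
}

# Usage (on the KSYS node):
#   ACTIVE=$(/opt/IBM/ksys/ksysmgr q site | site_by_type ACTIVE)
#   BACKUP=$(/opt/IBM/ksys/ksysmgr q site | site_by_type BACKUP)
```

Keeping the parsing in one function makes it easier to test against captured command output before the script is registered for a critical event.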
    

Register the user-defined scripts for the critical events

Next, you must register the user-defined script to recognize the critical events, such as HMC_UNREACHABLE, STG_UNREACHABLE and HOST_FAILURE. In this scenario, we’ll register HMC_UNREACHABLE as shown in Figure 10.

  1. Run the following command to register the events.

    #ksysmgr add notify script="<path of the script>" event=<event type>

    Figure 10. Command to register the events

  2. Run the following command to check if the events are registered with the user-defined scripts, as shown in Figure 11.

    #ksysmgr query notify script

    Figure 11. Command to pair the events and user-defined script
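
If the same script should handle several critical events, the registration command can be repeated once per event. The following sketch shows one way to do that. The function name register_events, the DRY_RUN switch, and the example script path are hypothetical. Only the ksysmgr add notify syntax is taken from step 1.

```shell
#!/bin/sh
# Hypothetical helper: register one script for several critical events.
# With DRY_RUN=1 the ksysmgr commands are only printed, not executed.
KSYSMGR=${KSYSMGR:-/opt/IBM/ksys/ksysmgr}

register_events() {
    script_path=$1
    shift
    for event in "$@"; do
        if [ "${DRY_RUN:-0}" = 1 ]; then
            echo "$KSYSMGR add notify script=$script_path event=$event"
        else
            # Stop at the first registration that fails.
            "$KSYSMGR" add notify script="$script_path" event="$event" || return 1
        fi
    done
}

# Usage (the script path is only an example):
#   DRY_RUN=1
#   register_events /usr/local/bin/dr_script.sh HMC_UNREACHABLE STG_UNREACHABLE HOST_FAILURE
```

Running the helper in dry-run mode first lets you review the exact commands before any event is registered.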

Simulating a disaster situation

This section describes a disaster situation and the process to recover automatically. For the purpose of this tutorial, an unplanned disaster event has been simulated.

Shut down the HMC

To simulate the unplanned disaster event, you need to shut down the HMC so that it no longer responds and thus triggers the automatic move.

  1. Log in to the HMC as the required user and run the shutdown command. After the HMC shuts down, it loses communication with the KSYS node. KSYS continues to perform the quick verify operation to check whether all the configured critical resources are running on the source site. Because the HMC no longer communicates with KSYS after the shutdown, the quick verify process logs a critical event indicating that the HMC is down.

  2. Run the following command to check if the HMC is down.

    #ping -c 3 <HMC IP or name>

    Figure 12 shows that the HMC is not responding.

    Figure 12. Command to ping the HMC
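
A single failed ping can also be caused by a short network interruption, so an automated check may want to repeat the probe before treating the HMC as down. The following is a minimal sketch of that idea, not something the tutorial's script does. The function name host_is_down, the PROBE variable, and the default attempt count and pause are all hypothetical values to adjust for your environment.

```shell
#!/bin/sh
# Hypothetical helper: treat a host as down only if every probe fails.
# PROBE is the command used to test reachability; it defaults to ping.
PROBE=${PROBE:-"ping -c 3"}

host_is_down() {
    host=$1
    attempts=${2:-3}    # number of probes before giving up
    pause=${3:-10}      # seconds to wait between probes
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Any successful probe means the host is reachable.
        # $PROBE is left unquoted on purpose so that its options are split.
        if $PROBE "$host" >/dev/null 2>&1; then
            return 1
        fi
        i=$((i + 1))
        [ "$i" -le "$attempts" ] && sleep "$pause"
    done
    return 0
}

# Usage: if host_is_down <HMC IP or name>; then echo "HMC is not responding"; fi
```

The helper returns success only when all probes fail, which makes it suitable as the condition that guards an unplanned move.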

Verify if the KSYS controller is tracking critical events

The KSYS controller node tracks various events that occur in the environment and saves the information in log files. Any critical failure event will be logged during the quick verify process run by KSYS automatically. The quick verify operation is triggered automatically every 60 minutes.

  1. Run the following command to check if KSYS has received the event.

    #cat /var/ksys/events.log

  2. Note that the HMC_UNREACHABLE event was received at 1:42:57 as shown in Figure 13.

    Figure 13. Event notification to KSYS node
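
Instead of reading the whole log, you can filter it for the event of interest. The following sketch prints the most recent log lines that mention a given event name. The function name last_event_lines and the EVENTS_LOG variable are hypothetical, and the sketch assumes only that each log entry mentions the event name on one line. It makes no assumption about the rest of the log format.

```shell
#!/bin/sh
# Hypothetical helper: print the most recent log lines that mention an event.
EVENTS_LOG=${EVENTS_LOG:-/var/ksys/events.log}

last_event_lines() {
    event=$1
    count=${2:-1}       # how many of the newest matching lines to print
    # grep -F matches the event name literally; tail keeps the newest lines.
    grep -F -- "$event" "$EVENTS_LOG" | tail -n "$count"
}

# Usage (on the KSYS node): last_event_lines HMC_UNREACHABLE 5
```

This makes it easier to identify the time at which the event was received when the log contains many unrelated entries.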

Verify if the automatic move is successful

As soon as KSYS receives the HMC_UNREACHABLE event, the auto-move script invokes the move, as shown in Figure 14. The command triggered internally by the auto-move script is ksysmgr move site from=<source_site> to=<backup_site> dr_type=unplanned.

Note that the user-defined sample script is invoked immediately after KSYS receives the event at 1:42:57. The corresponding move is invoked in less than a minute (after approximately 30 seconds), as shown in Figure 14.

Figure 14. Script invoked

As you can see in Figure 15, the move was invoked and the VMs were restarted on the backup site in India.

Figure 15. Auto-move invoked

The primary site VMs that were running in the Austin site have successfully moved to the backup site in India, as shown in Figure 16.

Figure 16. Active site after auto-move

Finally, from the target HMC, verify that the VMs are running on the backup site, as shown in Figure 17.

Figure 17. VMs are active on target server after DR

Summary

This tutorial explained how to configure KSYS to automatically move your mission-critical VMs from an active site to a backup site when a disaster event occurs.

Note that in a typical enterprise environment, authenticating a failure event (confirming that the reported failure is genuine) is complex. After carefully considering the risks involved, use the steps outlined in this tutorial to configure an automated disaster recovery process in your environment.