Introduction

Many customers run Power servers in their data centers, and many of them use IBM AIX® high availability software to keep critical applications highly available and to manage disaster recovery. This article explains how to use an IBM PowerHA SystemMirror feature to bring a cluster back to a stable state, without stopping cluster services, when an application or resource fails because of wrong input. With wrong input to a resource or application, the cluster might not reach a stable state and the resource group might go into the ERROR state. PowerHA provides a feature that, after the wrong input is corrected, brings the cluster back to a stable state and activates the resources according to the configured policies without stopping cluster services. This article demonstrates the scenario using an application failure under a resource group.

Configuring PowerHA cluster

For detailed information about IBM PowerHA and how to configure PowerHA on an AIX system, refer to the Introduction to PowerHA article, which explains how to configure a base PowerHA cluster with two nodes. In the same way, a cluster with more than two nodes can be created, along with site separation based on a repository disk. A site-based cluster can be a stretched cluster or a linked cluster: a stretched cluster keeps all nodes at the same location, whereas a linked cluster is configured when the data centers are at different geographical locations.
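
As a hedged sketch (assuming PowerHA 7.x, where the clmgr command-line interface is available), a stretched cluster similar to the one used in this article could be defined from the command line as shown below. The TYPE value and the exact attribute names are assumptions and should be verified against the clmgr man page for your release.

# Sketch only: define the cluster with its four nodes and the repository disk
# (names taken from Figure 1; TYPE=SC for "stretched" is an assumption)
clmgr add cluster test_cluster TYPE=SC \
        NODES=SiteANode1,SiteANode2,SiteBNode1,SiteBNode2 \
        REPOSITORY=hdisk11

# Group the nodes into the two sites shown in Figure 1
clmgr add site siteA NODES=SiteANode1,SiteANode2
clmgr add site siteB NODES=SiteBNode1,SiteBNode2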

This article describes how to recover from an application failure using a four-node stretched cluster. Figure 1 shows the four-node stretched cluster setup.

Figure 1. PowerHA cluster configuration for a stretched cluster

In this cluster, one network and three resource groups are created. RG1 is a non-concurrent resource group and the other two resource groups are concurrent. The cluster is created from SiteANode1, and changes are propagated to all nodes in the cluster through the verify and synchronize feature.
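
Verification and synchronization can be run from SMIT (Cluster Applications and Resources → Verify and Synchronize Cluster Configuration) or, as a hedged sketch assuming the clmgr command of PowerHA 7.x, from the command line:

clmgr verify cluster      # check the configuration for errors
clmgr sync cluster        # push the verified configuration to every node in the cluster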

The following cltopinfo utility output shows the basic cluster information.

(0) root @ SiteANode1: /usr/es/sbin/cluster/utilities
#cltopinfo
Cluster Name:    test_cluster
Cluster Type:    Stretched
Heartbeat Type:  Multicast
Repository Disk: hdisk11 (00f601bbb83e6a40)
Cluster IP Address: 228.40.0.25
Cluster Nodes:
        Site A (siteA):
                SiteANode1
                SiteANode2
        Site B (siteB):
                SiteBNode1
                SiteBNode2

There are four node(s) and one network(s) defined.

NODE SiteANode1:
        Network net_ether_01
                service_ip1     1.1.1.10
                SiteANode1        10.40.0.25

NODE SiteANode2:
        Network net_ether_01
                service_ip1     1.1.1.10
                SiteANode2        10.40.0.26

NODE SiteBNode1:
        Network net_ether_01
                service_ip1     1.1.1.10
                SiteBNode1        10.40.0.43

NODE SiteBNode2:
        Network net_ether_01
                service_ip1     1.1.1.10
                SiteBNode2        10.40.0.44

Resource Group RG1_conc
        Startup Policy   Online On All Available Nodes
        Fallover Policy  Bring Offline (On Error Node Only)
        Fallback Policy  Never Fallback
        Participating Nodes      SiteANode1 SiteANode2 SiteBNode1 SiteBNode2

Resource Group RG2_conc
        Startup Policy   Online On All Available Nodes
        Fallover Policy  Bring Offline (On Error Node Only)
        Fallback Policy  Never Fallback
        Participating Nodes      SiteANode1 SiteANode2 SiteBNode1 SiteBNode2

Resource Group RG1
        Startup Policy   Online On Home Node Only
        Fallover Policy  Fallover To Next Priority Node In The List
        Fallback Policy  Fallback To Higher Priority Node In The List
        Participating Nodes      SiteANode1 SiteANode2 SiteBNode1 SiteBNode2
        Service IP Label                 service_ip1
 

Adding application controller scripts to PowerHA

Assume that there is an application in the home directory on the server that needs to be made highly available with PowerHA. The application therefore needs application controller start and stop scripts, and also a monitor script so that the cluster can monitor it. The application is added to a resource group together with a volume group and a file system created on that volume group.

For the explanation in this article, an application named app1 is created with a start script, a stop script, a monitor script, and an application controller named app1_test. The application is accessed through the /fs1 file system on the VG1 volume group, and it is a resource under the resource group RG1.

The following are the application start, stop, and monitor scripts, which perform basic read/write operations after the cluster is active and the file system is mounted.


(0) root @ SiteANode1: /home/scripts  ‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑ Start script
#cat app1_start
#!/usr/bin/ksh
/home/scripts/app1 /fs1/a /fs1/b /fs1/c /fs1/d /fs1/e > /dev/null &

(0) root @ SiteANode1: /home/scripts ‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑ Stop script
#cat  app1_stop
#!/usr/bin/ksh
# Find the PID of the running app1 process
ps -ef | grep -w /home/scripts/app1 | grep -v grep | awk '{print $2}' | read pid
if [ -n "$pid" ]; then
        echo "Stopping app1 (PID $pid)"
        kill -9 $pid
fi

(0) root @ SiteANode1: /home/scripts ‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑ Monitor scripts 
#cat app_mon1
#!/usr/bin/ksh
# Report the application status to PowerHA through the exit code
ps -ef | grep -w /home/scripts/app1 | grep -v grep | awk '{print $2}' | read pid
if [ -n "$pid" ]; then
        exit 0          # process found: application is running
fi
exit 1                  # process not found: report failure
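
The app1 program itself is not shown in the article. For completeness, the following is a hypothetical stand-in (not part of the original setup) that simply keeps appending data to the files passed to it, so that the /fs1 file system is exercised:

(0) root @ SiteANode1: /home/scripts ----------------- Hypothetical app1
#cat app1
#!/usr/bin/ksh
# Placeholder application: append a timestamp to each file argument forever
while true
do
        for f in "$@"
        do
                date >> "$f"
        done
        sleep 5
done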
 

This section demonstrates how the application controller and monitor scripts are added to the PowerHA cluster.

  1. Adding the application controller

Open the System Management Interface Tool (SMIT) for PowerHA by running the smit hacmp command at the command prompt. In the SMIT screen that is displayed, select:

Cluster Applications and Resources → Resources → Configure User Applications (Scripts and Monitors) → Application Controller Scripts → Add Application Controller Scripts.

Figure 2 shows the SMIT screen for adding application controller scripts. The first field, the application controller name, is mandatory and is user defined. The second field specifies the start script and the third field specifies the stop script. The monitor script has not been added yet, so it is not displayed; a monitor script can be added either before or after the application controller is created.

Figure 2. SMIT screen to add application controller scripts
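
As a hedged command-line alternative to the SMIT panel in Figure 2 (assuming the clmgr command of PowerHA 7.x), the application controller could be created as follows; verify the attribute names for your release:

# Sketch only: create the application controller with its start and stop scripts
clmgr add application_controller app1_test \
        STARTSCRIPT=/home/scripts/app1_start \
        STOPSCRIPT=/home/scripts/app1_stop
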
  2. Adding a monitor script for the application controller

Open the SMIT interface using the smit hacmp command and select the following options to open the Add a Custom Application Monitor screen.

Cluster Applications and Resources → Resources → Configure User Applications (Scripts and Monitors) → Application Monitors → Configure Custom Application Monitors → Add a Custom Application Monitor.

Figure 3. SMIT screen to add application monitor

The following list describes the parameters shown in Figure 3 that are entered while creating an application monitor.

  • Monitor Name – A user-defined name for the monitor that monitors the application controller.
  • Application Controller to Monitor – The application controller created in Figure 2 (app1_test in this example).
  • Monitor Mode – The mode in which the application controller is to be monitored.
  • Monitor Method – The complete path of the monitor script (/home/scripts/app_mon1 in this example).
  • Stabilization Interval – A user-defined time interval, set here to 30 seconds. During this interval the cluster waits for the application to start properly; if it does, the cluster reaches the stable state and the resource comes online. If the application does not start because of a failure, it is restarted.
  • Restart Count – The number of times the application controller is restarted before the resource group falls over to the next node. The restart count here is set to 3, so if the application fails to stabilize within three restarts, the resource group falls over to the next priority node. In that case, the resource group goes to the ERROR state on that node and the cluster goes to the UNSTABLE state; with the resource group in the ERROR state, the cluster can then go to the RP_FAILED state. (A command-line sketch for creating this monitor follows this list.)
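
As with the application controller, the monitor could also be created from the command line. The following is only a hedged sketch assuming the clmgr command of PowerHA 7.x; the attribute names and the 60-second monitor interval are assumptions (the interval is not given in the article) and should be checked against the clmgr documentation for your release.

# Sketch only: custom monitor matching the values used in this article
clmgr add application_monitor app1_mon \
        TYPE=Custom \
        APPLICATIONS=app1_test \
        MONITORMETHOD=/home/scripts/app_mon1 \
        MONITORINTERVAL=60 \
        STABILIZATION=30 \
        RESTARTCOUNT=3 \
        FAILUREACTION=fallover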

  3. Adding the application and file system to the resource group

Open the SMIT interface using the smit hacmp command and select the following options to open the Change/Show Resources and Attributes for a Resource Group screen.

Cluster Applications and Resources → Resource Groups → Change/Show Resources and Attributes for a Resource Group.

SMIT then prompts for the resource group to which the application and file system are to be added.

Figure 4. Selecting the resource group to which the application and file system are added

Here, the application and file system are added to resource group RG1.

Figure 5. SMIT screen to add resources and attributes to RG1

To demonstrate the cluster recovery feature with an application, the application controller app1_test is added to resource group RG1, together with the /fs1 file system and the VG1 volume group.
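
The same resource group change could also be made from the command line. The following is a hedged sketch assuming the clmgr command of PowerHA 7.x; the attribute names are assumptions and should be verified for your release.

# Sketch only: add the application, volume group, file system, and service IP to RG1
clmgr modify resource_group RG1 \
        APPLICATIONS=app1_test \
        VOLUME_GROUP=VG1 \
        FILESYSTEM=/fs1 \
        SERVICE_LABEL=service_ip1

# Propagate the change to all nodes
clmgr sync cluster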

  4. Recovering the cluster from failure

To demonstrate this with an application failure, the application path in the start script is changed on all the nodes. The start script remains at the /home/scripts location, but the application path inside it is changed so that the application fails to start.

Actual start script


(0) root @ SiteANode1: /home/scripts  ‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑ Start script
#cat app1_start
#!/usr/bin/ksh
/home/scripts/app1 /fs1/a /fs1/b /fs1/c /fs1/d /fs1/e > /dev/null &

Start script with the changed (incorrect) application path


(0) root @ SiteANode1: /home/scripts  ----------------- Start script
#cat app1_start
#!/usr/bin/ksh
/home/test/app1 /fs1/a /fs1/b /fs1/c /fs1/d /fs1/e > /dev/null &

After the cluster services are started, the wrong input in the script causes the application controller to be restarted repeatedly, up to the restart count. This can be seen in the /var/hacmp/log/hacmp.out file. Because of this wrong input to the application controller, the cluster does not come to a stable state: the application cannot stabilize with the wrong path in the start script.
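
The restart attempts can be followed in the event log, for example:

# Watch the Cluster Manager retry the failing application controller
tail -f /var/hacmp/log/hacmp.out | grep -i app1_test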

If the path is wrong in the script on only one node, the resource group tries to come ONLINE on that node, but after the three restart attempts (as per the restart count set in Figure 3) it falls over to the next node. If the application is configured correctly on the next node, the cluster becomes stable and the resource group comes online there.


(0) root @ SiteANode1: /usr/es/sbin/cluster/utilities
#clRGinfo
‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑
Group Name                   State            Node
‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑‑
RG1_conc            ONLINE           SiteANode1
                    ONLINE           SiteANode2
                    ONLINE           SiteBNode1
                    ONLINE           SiteBNode2

RG2_conc            ONLINE           SiteANode1
                    ONLINE           SiteANode2
                    ONLINE           SiteBNode1
                    ONLINE           SiteBNode2

RG1                 ERROR           SiteANode1@siteA
                    OFFLINE         SiteANode2@siteA
                    OFFLINE         SiteBNode1@siteB
                    OFFLINE         SiteBNode2@siteB


(0) root @ SiteANode1: /usr/es/sbin/cluster/utilities
# clcmd lssrc -ls clstrmgrES | grep state
Current state: ST_RP_RUNNING
Current state: ST_RP_RUNNING
Current state: ST_RP_RUNNING
Current state: ST_RP_RUNNING

(0) root @ SiteANode1: /usr/es/sbin/cluster/utilities
# clcmd lssrc -ls clstrmgrES | grep state
Current state: ST_BARRIER
Current state: ST_BARRIER
Current state: ST_BARRIER
Current state: ST_BARRIER

(0) root @ SiteANode1: /usr/es/sbin/cluster/utilities
# clcmd lssrc -ls clstrmgrES | grep state
Current state: ST_BARRIER
Current state: ST_BARRIER
Current state: ST_BARRIER
Current state: ST_BARRIER

(0) root @ SiteANode1: /usr/es/sbin/cluster/utilities
# clcmd lssrc -ls clstrmgrES | grep state
Current state: ST_CBARRIER
Current state: ST_CBARRIER
Current state: ST_CBARRIER
Current state: ST_CBARRIER

(0) root @ SiteANode1: /usr/es/sbin/cluster/utilities
# clcmd lssrc -ls clstrmgrES | grep state
Current state: ST_UNSTABLE
Current state: ST_UNSTABLE
Current state: ST_UNSTABLE
Current state: ST_UNSTABLE
 

The application under that resource group is also not started:

(0) root @ SiteANode1: /usr/es/sbin/cluster/utilities
#ps -ef | grep app1
    root 16187668 15598018   0 10:13:36  pts/0  0:00 grep app1

In this case, the app1 application failed because of the wrong path specified for the application in the start script, and the Cluster Manager reports the event failure in the /var/hacmp/log/hacmp.out file as shown in Figure 6. Manual intervention is then required, as stated in the hacmp.out message. The path in the start script is corrected manually so that the resource group can be recovered from the ERROR state and the cluster from the UNSTABLE or RP_FAILED state.

(0) root @ SiteANode1: /home/scripts  ----------------- Start script
#cat app1_start
#!/usr/bin/ksh
/home/scripts/app1 /fs1/a /fs1/b /fs1/c /fs1/d /fs1/e > /dev/null &     # corrected path

Figure 6. hacmp.out log showing the application controller failure message

After the script is corrected manually, PowerHA provides a feature to recover from the failure, which runs the clruncmd command under the cluster utilities directory. This recovery option is available through SMIT: run the smit hacmp command at the command prompt and select the following options.

Problem Determination Tools → Recover From PowerHA SystemMirror Script Failure.

Here, users can select the required nodes one after the other.

Figure 7. SMIT screen to recover from PowerHA script failure

The recover from script failure option runs the /usr/es/sbin/cluster/utilities/clruncmd command, which sends a message to the Cluster Manager daemon on the selected node. This allows the cluster to become stable on that node, and if the script is now correct, the resource group comes online there. After the issue is resolved, an event-completed message appears in the /var/hacmp/log/hacmp.out file. If the resource group is in the ERROR state on all the nodes, the wrong input exists on all of them; in that case, the resource group has to be brought ONLINE manually after rectifying the wrong input on every node. If the resource group is in the ERROR state on only one node, recovering from the cluster script failure brings it back to the proper state after the wrong input is corrected.
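
After running the recovery option, the state can be confirmed with the commands already used earlier in this article; if RG1 is still in the ERROR state on every node, it can be brought online manually (the clmgr command below is a hedged sketch assuming PowerHA 7.x):

# Confirm that every node is back in the stable state
clcmd lssrc -ls clstrmgrES | grep state      # expect ST_STABLE on all nodes
clRGinfo                                     # RG1 should show ONLINE again

# Sketch only: bring RG1 online manually if it is still in ERROR state everywhere
clmgr online resource_group RG1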

To summarize: if a script fails because of wrong input to PowerHA, the Cluster Manager reports the failure in hacmp.out. After the wrong input is corrected, the Cluster Manager can be resumed to bring the cluster into the STABLE state. The recover from script failure option runs the clruncmd utility of PowerHA, which sends a message to the Cluster Manager daemon on the selected node; this brings the cluster to a stable state and also returns the resource group to its proper state if it was in the ERROR state because of wrong input to its resources. The user must correct the wrong input manually after reviewing the error in the hacmp.out file. The recovery from script failure then brings the cluster to the proper state, and the event is completed in hacmp.out.

Summary

This article helps users recover from a cluster failure caused by wrong input to a resource or application. After rectifying the wrong input to their resources, customers can bring the cluster back to a stable state and the resource group back to an accessible state without stopping cluster services.