Overview

Skill Level: Any Skill Level

Intended audience: DB2 administrators working with high availability solutions.

A Netcool Operations Insight (NOI) on Bluemix deployment is typically done across Primary and Secondary data centers. This recipe describes the high availability setup for DB2 as the underlying database. The approach is valid for any DB2 setup.

Ingredients

Network setup

  • The hostname command returns the server's short name:

# hostname
netcool-server1

  • Entries in /etc/hosts are correct, in particular the short names
  • Network interface IP, netmask, and broadcast addresses are properly configured
  • Both servers are pingable by short name and by IP address (a quick check is sketched below)
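
A minimal sketch to verify these prerequisites, run on each server (the hostnames are this recipe's examples; adjust them to your environment):

for h in netcool-server1 netcool-server2; do
    # confirm name resolution via /etc/hosts and basic reachability
    getent hosts "$h" || echo "$h: missing from /etc/hosts"
    ping -c 1 -W 2 "$h" >/dev/null 2>&1 && echo "$h: reachable" || echo "$h: NOT reachable"
done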

 

Root access (e.g. via sudo)

Step-by-step

  1. Check DB2 HADR configuration

    Verify the DB2 HADR setup.
    On the Standby server, as the db2inst1 user:

    $ db2 get db cfg for reporter | grep -i hadr

    HADR database role = STANDBY
    HADR local host name (HADR_LOCAL_HOST) = netcool-server2
    HADR local service name (HADR_LOCAL_SVC) = 55555
    HADR remote host name (HADR_REMOTE_HOST) = netcool-server1
    HADR remote service name (HADR_REMOTE_SVC) = 55555
    HADR instance name of remote server (HADR_REMOTE_INST) = db2inst1
    HADR timeout value (HADR_TIMEOUT) = 120
    HADR target list (HADR_TARGET_LIST) = netcool-server1:55555
    HADR log write synchronization mode (HADR_SYNCMODE) = NEARSYNC
    HADR spool log data limit (4KB) (HADR_SPOOL_LIMIT) = 0
    HADR log replay delay (seconds) (HADR_REPLAY_DELAY) = 0
    HADR peer window duration (seconds) (HADR_PEER_WINDOW) = 300

    On the Primary server, as the db2inst1 user:

    $ db2 get db cfg for reporter | grep -i hadr

    HADR database role = PRIMARY
    HADR local host name (HADR_LOCAL_HOST) = netcool-server1
    HADR local service name (HADR_LOCAL_SVC) = 55555
    HADR remote host name (HADR_REMOTE_HOST) = netcool-server2
    HADR remote service name (HADR_REMOTE_SVC) = 55555
    HADR instance name of remote server (HADR_REMOTE_INST) = db2inst1
    HADR timeout value (HADR_TIMEOUT) = 120
    HADR target list (HADR_TARGET_LIST) = netcool-server2:55555
    HADR log write synchronization mode (HADR_SYNCMODE) = NEARSYNC
    HADR spool log data limit (4KB) (HADR_SPOOL_LIMIT) = 0
    HADR log replay delay (seconds) (HADR_REPLAY_DELAY) = 0
    HADR peer window duration (seconds) (HADR_PEER_WINDOW) = 300

    Verify that the HADR database is in Peer/Connected state:

    # db2pd -db reporter -hadr

    Database Member 0 -- Database reporter -- Active -- Up 2 days 18:08:28 -- Date 2017-06-22-04.31.25.090257
    HADR_ROLE = PRIMARY
    REPLAY_TYPE = PHYSICAL
    HADR_SYNCMODE = NEARSYNC
    STANDBY_ID = 1
    LOG_STREAM_ID = 0
    HADR_STATE = PEER
    HADR_FLAGS =
    PRIMARY_MEMBER_HOST = netcool-server1
    PRIMARY_INSTANCE = db2inst1
    PRIMARY_MEMBER = 0
    STANDBY_MEMBER_HOST = netcool-server2
    STANDBY_INSTANCE = db2inst1
    STANDBY_MEMBER = 0
    HADR_CONNECT_STATUS = CONNECTED
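
    When scripting this check, the relevant fields can be filtered from the db2pd output; a minimal sketch using the fields shown above:

    $ db2pd -db reporter -hadr | grep -E 'HADR_ROLE|HADR_STATE|HADR_CONNECT_STATUS'

    The key fields should show HADR_ROLE = PRIMARY, HADR_STATE = PEER and HADR_CONNECT_STATUS = CONNECTED.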

     

    For more information on how to set up the DB2 HADR database, refer here (Chapter 3).
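
    For reference, HADR parameters like those above are set with UPDATE DB CFG. A sketch for the Primary, reusing this recipe's values (run as the instance owner; the Standby mirrors them with the hosts swapped):

    $ db2 update db cfg for reporter using HADR_LOCAL_HOST netcool-server1 HADR_LOCAL_SVC 55555
    $ db2 update db cfg for reporter using HADR_REMOTE_HOST netcool-server2 HADR_REMOTE_SVC 55555 HADR_REMOTE_INST db2inst1
    $ db2 update db cfg for reporter using HADR_TARGET_LIST netcool-server2:55555 HADR_SYNCMODE NEARSYNC HADR_PEER_WINDOW 300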

  2. Create a TSA cluster: problem description

    Usually, when the HADR servers are located in the same network segment, the TSA cluster is created by the db2haicu utility. However, in an HADR setup where the servers are located in different network segments, the TSA cluster nodes created by db2haicu may not be able to communicate with each other.

    [root@netcool-server1 ~]# lsrpdomain
    Name OpState RSCTActiveVersion MixedVersions TSPort GSPort
    hadr_domain_reporter Online 3.2.1.2 No 12347 12348

    [root@netcool-server1 ~]# lsrpnode
    Name OpState RSCTVersion
    netcool-server1 Online 3.2.1.2
    netcool-server2 Offline 3.2.1.2

    [root@netcool-server2 ~]# lsrpnode
    Name OpState RSCTVersion
    netcool-server1 Offline 3.2.1.2
    netcool-server2 Online 3.2.1.2

    This can happen when the servers have more than one network interface (e.g. public and private) and each interface is in a different IP subnet:

    [root@netcool-server1 ~]# ifconfig
    bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 1500
    inet 10.107.54.8 netmask 255.255.255.224 broadcast 10.107.54.31

    bond1: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 1500
    inet 158.85.149.57 netmask 255.255.255.224 broadcast 158.85.149.63

    [root@netcool-server2 ~]# ifconfig
    bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 1500
    inet 10.109.1.12 netmask 255.255.255.0 broadcast 10.109.1.255

    bond1: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 1500
    inet 158.85.7.111 netmask 255.255.255.224 broadcast 158.85.7.127

     

    In that case, RSCT may build the communication groups the wrong way, pairing interfaces from different networks:

    [root@netcool-server1 ~]# lscomg
    Name Sensitivity Period Priority Broadcast SourceRouting NIMPathName NIMParameters Grace MediaType UseForNodeMembership
    CG1 4 1 1 Yes Yes -1 (Default) 1 (IP) 1
    CG2 4 1 1 Yes Yes -1 (Default) 1 (IP) 1

    [root@netcool-server1 ~]# lscomg -il CG1

    Name = bond1
    NodeName = netcool-server2.bluemix.net
    IPAddress = 158.85.7.111
    Subnet = 158.85.7.96
    SubnetMask = 255.255.255.224

    Name = bond0
    NodeName = netcool-server1.bluemix.net
    IPAddress = 10.107.54.8
    Subnet = 10.107.54.0
    SubnetMask = 255.255.255.224 

     

    To make communication between the TSA cluster nodes possible, we have to change the Communication Group settings so that network interfaces from the same network (e.g. both 10.x private interfaces) are placed in the same Communication Group, i.e. bond0 on server1 with bond0 on server2.

    Although RSCT provides commands to modify Communication Groups, our TSA cluster does not have configuration quorum (the nodes do not see each other), so no cluster changes are allowed. Therefore, we have to force cluster creation with one network interface only.

    The next steps describe the procedure to build such a TSA cluster.

  3. Clean up

    If you have already run db2haicu and it resulted in the wrong cluster configuration (see above), you have to clean up first.

    • Stop the cluster on both nodes
      # stoprpdomain hadr_domain_reporter

      Check that the cluster domain is stopped:

      # lsrpdomain
      Name OpState RSCTActiveVersion MixedVersions TSPort GSPort
      hadr_domain_reporter Offline 3.2.1.2 No 12347 12348

    • Remove cluster on both nodes

      # rmrpdomain -f hadr_domain_reporter

      Check that the cluster domain is removed:
      # lsrpdomain
      #

    • Reconfigure the cluster infrastructure on both nodes
      # /usr/sbin/rsct/install/bin/recfgct
    • Delete everything created by db2haicu. Run on both nodes:
      # db2haicu -delete

      Make sure this command completes successfully.
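
    A condensed sketch of the same cleanup sequence (the domain name is this recipe's; adjust as needed):

    # stoprpdomain hadr_domain_reporter
    # until lsrpdomain | grep -q Offline; do sleep 5; done    # wait for the domain to stop
    # rmrpdomain -f hadr_domain_reporter
    # /usr/sbin/rsct/install/bin/recfgct
    # db2haicu -delete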

  4. Disable iTCO watchdog module

    Run

    # lsmod | grep -i itco

    If the iTCO watchdog timer driver and vendor support modules are loaded, you have to add the following two lines to /etc/modprobe.conf on both nodes:

    blacklist iTCO_wdt
    blacklist iTCO_vendor_support

    On RHEL 7, create /etc/modprobe.d/blacklist.conf with the same content.

    Then reboot the nodes to unload the iTCO watchdog modules, so that RSCT can load its own watchdog.
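
    As a sketch, the RHEL 7 variant can be created in one step (run on both nodes):

    # printf 'blacklist iTCO_wdt\nblacklist iTCO_vendor_support\n' > /etc/modprobe.d/blacklist.conf

    After the reboot, lsmod | grep -i itco should return no output.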

  5. Shut down network interfaces

    Shut down all network interfaces but one on both nodes, e.g.:

    # ifconfig bond1 down

    Make sure only bond0 is active:

    # ifconfig

    bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST> mtu 1500
    inet 10.107.54.8 netmask 255.255.255.224 broadcast 10.107.54.31
    ………………………………….

    Attention: shutting down a network interface may interrupt the applications using it.
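
    On newer systems the same can be done with iproute2 (a sketch, assuming bond1 is the interface to disable):

    # ip link set bond1 down
    # ip -brief addr show    # confirm only bond0 carries an address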

  6. Create a TSA cluster

    Run on both nodes:

    # preprpnode netcool-server1 netcool-server2

     

    Run on either node:

    # mkrpdomain hadr_domain_reporter netcool-server1 netcool-server2

     

    Check that the cluster domain is created and Offline:

    # lsrpdomain
    Name OpState RSCTActiveVersion MixedVersions TSPort GSPort
    hadr_domain_reporter Offline 3.2.1.2 No 12347 12348

     

    Start the cluster domain:

    # startrpdomain hadr_domain_reporter

     

    Check that the cluster domain is Online:

    # lsrpdomain
    Name OpState RSCTActiveVersion MixedVersions TSPort GSPort
    hadr_domain_reporter Online 3.2.1.2 No 12347 12348

     

    Make sure both nodes are Online:

    [root@netcool-server1 ~]# lsrpnode
    Name OpState RSCTVersion
    netcool-server1 Online 3.2.1.2
    netcool-server2 Online 3.2.1.2

     

    Note: if the cluster nodes still cannot see each other, try to clean up this cluster (see above) and create a new one from the other node.

    Make sure one of the Communication Groups contains bond0 from both nodes:

    [root@netcool-server1 ~]# lscomg

    Name Sensitivity Period Priority Broadcast SourceRouting NIMPathName NIMParameters Grace MediaType UseForNodeMembership
    CG1 4 1 1 Yes Yes -1 (Default) 1 (IP) 1
    CG2 4 1 1 Yes Yes -1 (Default) 1 (IP) 1

     

    [root@netcool-server1 ~]# lscomg -il CG1

    Name = bond0
    NodeName = netcool-server1.bluemix.net
    IPAddress = 10.107.54.8
    Subnet = 10.107.54.0
    SubnetMask = 255.255.255.224

    Name = bond0
    NodeName = netcool-server2.bluemix.net
    IPAddress = 10.109.1.12
    Subnet = 10.109.1.0
    SubnetMask = 255.255.255.0

  7. Restore the network bond1

    To restore the previously shut down interface bond1, run on both nodes:

    # ifconfig bond1 up

    To make sure the network communication is active, restart the network service:

    # service network restart
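
    A quick cross-site reachability check over the restored interfaces (IP addresses from this recipe):

    [root@netcool-server1 ~]# ping -c 3 158.85.7.111
    [root@netcool-server2 ~]# ping -c 3 158.85.149.57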

  8. Configure the restored Communication Group

    Now that both nodes can see each other, the TSA cluster has configuration quorum, so the Communication Groups can be modified so that each group spans the same interface on both nodes.

    [root@netcool-server1 ~]# chcomg -i n:bond0:netcool-server1,bond0:netcool-server2 CG1
    [root@netcool-server1 ~]# chcomg -i n:bond1:netcool-server1,bond1:netcool-server2 CG2

     

    [root@netcool-server1 ~]# lscomg -il CG2

    Name = bond1
    NodeName = netcool-server2.bluemix.net
    IPAddress = 158.85.7.111
    Subnet = 158.85.7.96
    SubnetMask = 255.255.255.224

    Name = bond1
    NodeName = netcool-server1.bluemix.net
    IPAddress = 158.85.149.57
    Subnet = 158.85.149.32
    SubnetMask = 255.255.255.224

  9. Rerun preprpnode now that both Communication Groups are properly configured

    Rerun on both nodes:

    # preprpnode netcool-server1 netcool-server2

    Now the TSA cluster domain is prepared for resource definition.

  10. Create DB2 HADR high availability policy with db2haicu

    Run on DB2 HADR Standby node:

    [root@netcool-server2 ~]# db2haicu

    and follow the instructions. After db2haicu has finished successfully on the Standby, run it on the Primary node:

    [root@netcool-server1 ~]# db2haicu

    Make sure db2haicu finishes successfully on the Primary node.

    At the end, you should have a DB2 HADR high availability policy that looks like the following. Run on either node:

    # samcc -test

    HADR-policy-without-ServiceIP
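
    As an alternative text view, the standard TSAMP status command lists the resource groups and their operational states; after db2haicu, the HADR resource group should show Online on the Primary:

    # lssam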

  11. Define a Tie Breaker

    Usually db2haicu prompts for a network tie breaker. In our case the cluster domain has been created manually, which means db2haicu has not created the tie breaker automatically.

    It is recommended to use the network gateway as the tie breaker:

    [root@netcool-server2 ~]# route -n
    Kernel IP routing table
    Destination Gateway Genmask Flags Metric Ref Use Iface
    ………………………………
    10.0.0.0 10.109.1.1 255.0.0.0 UG 0 0 0 bond0
    ………………………………

    Define a tie breaker:

    # mkrsrc IBM.TieBreaker Name="NetworkTB" Type="EXEC" DeviceInfo='PATHNAME=/usr/sbin/rsct/bin/samtb_net Address=10.109.1.1 Log=1' PostReserveWaitTime=30

    Check the tie breaker definition:

    [root@netcool-server1 ~]# lsrsrc IBM.TieBreaker

    Resource Persistent Attributes for IBM.TieBreaker
    …………………………….
    resource 3:
    Name = "NetworkTB"
    Type = "EXEC"
    DeviceInfo = "PATHNAME=/usr/sbin/rsct/bin/samtb_net Address=10.109.1.1 Log=1"
    ReprobeData = ""
    ReleaseRetryPeriod = 0
    HeartbeatPeriod = 0
    PreReserveWaitTime = 0
    PostReserveWaitTime = 30
    NodeInfo = {}
    ActivePeerDomain = "hadr_domain_reporter"
    …………………………….

    Activate NetworkTB

    # chrsrc -c IBM.PeerNode OpQuorumTieBreaker=NetworkTB

    Check the setting:

    # lsrsrc -c IBM.PeerNode
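
    The check can be narrowed to the tie breaker attribute (lsrsrc accepts attribute names after the class):

    # lsrsrc -c IBM.PeerNode OpQuorumTieBreaker

    The output should report OpQuorumTieBreaker = "NetworkTB".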

  12. Save the policy

    # sampolicy -s /usr/sbin/rsct/sapolicies/db2/savedPolicies/reporter_hadr.xml

  13. Define Automatic Client Reroute (ACR)

    Since the HADR servers are located in different networks, the ServiceIP for communication with the HADR Primary cannot be modeled within the TSA cluster. Instead, Automatic Client Reroute (ACR) can be configured to transfer client application requests from a failed database server to the standby database server.

    Note: if automatic client reroute is not enabled, client applications receive error message SQL30081N, and no further attempts are made to establish a connection with the server.

    To enable automatic client reroute, update the alternate server for database reporter:

    on netcool-server1:

    db2 UPDATE ALTERNATE SERVER FOR DATABASE reporter USING HOSTNAME netcool-server2 PORT 50000

    on netcool-server2:

    db2 UPDATE ALTERNATE SERVER FOR DATABASE reporter USING HOSTNAME netcool-server1 PORT 50000

    Note: the value of the PORT parameter is defined in /etc/services:

    [root@netcool-server1 ~]# cat /etc/services | grep -i db2c
    db2c_db2inst1 50000/tcp

     

    After ACR is configured, you will probably need to restart DB2 on both servers.
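
    The alternate server setting can be confirmed in the database directory (a sketch; output abridged, field names as in recent DB2 versions):

    $ db2 list db directory

    Database alias = REPORTER
    ...
    Alternate server hostname = netcool-server2
    Alternate server port number = 50000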

  14. Tests

    The full test scenarios for the DB2 HADR high availability policy are described in:

    Automated cluster controlled HADR configuration setup using the IBM DB2 high availability instance configuration utility

    For maintenance scenarios, e.g. upgrading DB2 or TSA or applying system patches, refer to:

    Performing rolling updates in an automated high availability disaster recovery (HADR) environment

     

    Here we describe connection tests with the JDBC Gateway.

    1. Normal Operations / Maintenance

    Perform a graceful HADR takeover.

    Setup:

    – the reporter database is in Peer/Connected state
    – make sure the JDBC Gateway is connected to the database on the Primary
     

    Scenario:

    On the Standby server as db2inst1:

    $ db2 takeover hadr on db reporter

    Expected results:

    – Takeover command finished successfully:
    DB20000I The TAKEOVER HADR ON DATABASE command completed successfully.

    – TSA indicates the HADR database Online on the primary server

    – JDBC Gateway has restarted on the new Primary and reconnected to the database:

    On the new Primary server:

    in /home/netcool/IBM/tivoli/netcool/omnibus/log/G_JDBC.log

    2017-06-19T07:48:36: Error: E-GJA-000-000: [ngjava]: G_JDBC: pool-1-thread-6: SQL message: Error for batch element #1: [jcc][t4][2027][11212][3.69.66] A connection failed but has been re-established. Special register settings have been replayed if necessary. Host name or IP address of the connection: netcool-server2.bluemix.net. Service name or port number of the connection: 50,000.

    The same scenario should also work in the opposite direction.
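
    After a takeover, the role swap can be confirmed with the same db2pd check as in step 1; on the new Primary:

    $ db2pd -db reporter -hadr | grep -E 'HADR_ROLE|HADR_STATE'

    The key fields should show HADR_ROLE = PRIMARY and HADR_STATE = PEER.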

     

    2. Unplanned outage

    HADR takeover by force (within the peer window), performed by TSA.

    Setup:

    – the reporter database is in Peer/Connected state
    – make sure the JDBC Gateway is connected to the database on the Primary
     

    Scenario:

    Reboot (Power Off) the HADR Primary server

    Expected results:

    – TSA has taken the HADR reporter database over to the standby server

    – JDBC Gateway has restarted on the new Primary and reconnected to the database

     

    After the crash of the Primary server and failover of the reporter database to the Standby by TSA:

    # samcc -test

    Reporter is available on new Primary after initial Primary crash

    On the new Primary server:

    In /var/log/messages

    Jun 19 08:12:57 netcool-server2 FAILOVERMONITOR: G_JDBC failed to respond. Count is 1.
    Jun 19 08:13:06 netcool-server2 ConfigRM[3076]: (Recorded using libct_ffdc.a cv 2):::Error ID: :::Reference ID: :::Template ID: 0:::Details File: :::Location: RSCT,PeerDomain.C,1.99.31.3,19954 :::CONFIGRM_PENDINGQUORUM_ER#012The operational quorum state of the active peer domain has changed to PENDING_QUORUM. #012This state usually indicates that exactly half of the nodes that are defined in the #012peer domain are online. In this state cluster resources cannot be recovered although #012none will be stopped explicitly.
    Jun 19 08:13:06 netcool-server2 RecoveryRM[4732]: (Recorded using libct_ffdc.a cv 2):::Error ID: 825….WrwFN/HYx/8oK.A………………..:::Reference ID: :::Template ID: 0:::Details File: :::Location: RSCT,Protocol.C,1.54.3.18,2769 :::RECOVERYRM_INFO_4_ST#012A member has left.#012Node number = #0121
    Jun 19 08:13:06 netcool-server2 RecoveryRM[4732]: (Recorded using libct_ffdc.a cv 2):::Error ID: 825….WrwFN/UZx/8oK.A………………..:::Reference ID: :::Template ID: 0:::Details File: :::Location: RSCT,Protocol.C,1.54.3.18,2797 :::RECOVERYRM_INFO_5_ST#012Master has left, this node is now the master.

    Jun 19 08:13:36 netcool-server2 ConfigRM[3076]: (Recorded using libct_ffdc.a cv 2):::Error ID: :::Reference ID: :::Template ID: 0:::Details File: :::Location: RSCT,PeerDomain.C,1.99.31.3,19950 :::CONFIGRM_HASQUORUM_ST#012The operational quorum state of the active peer domain has changed to HAS_QUORUM. #012In this state, cluster resources may be recovered and controlled as needed by #012management applications.
    Jun 19 08:13:36 netcool-server2 /usr/sbin/rsct/sapolicies/db2/hadrV105_start.ksh: Entering : db2inst1 db2inst1 REPORTER

    Jun 19 08:13:55 netcool-server2 NCO_PA[3422]: [RESTORE_MSG]: G_JDBC_B running as 1000 has been restored on netcool-server2.bluemix.net.
    Jun 19 08:13:55 netcool-server2 FAILOVERMONITOR: Backup Gateway started.
    Jun 19 08:13:55 netcool-server2 FAILOVERMONITOR: Entering failure-mode polling cycle.

    Jun 19 10:20:45 netcool-server2 FAILOVERMONITOR: G_JDBC failed to respond. Count is 5.
    Jun 19 10:20:47 netcool-server2 hadrV105_monitor.ksh[7176]: Returning 2 : db2inst1 db2inst1 REPORTER
    Jun 19 10:20:50 netcool-server2 FAILOVERMONITOR: G_JDBC failed to repond 5 times.
    Jun 19 10:20:50 netcool-server2 FAILOVERMONITOR: Bringing up backup Gateway.
    Jun 19 10:20:50 netcool-server2 FAILOVERMONITOR: Starting backup Gateway…
    Jun 19 10:20:53 netcool-server2 NCO_PA[3570]: [RESTORE_MSG]: G_JDBC_B running as 1000 has been restored on netcool-server2.bluemix.net.
    Jun 19 10:20:53 netcool-server2 FAILOVERMONITOR: Backup Gateway started.
    Jun 19 10:20:53 netcool-server2 FAILOVERMONITOR: Entering failure-mode polling cycle.

    After the initial Primary has been restored:

    # samcc -test

    After initial Primary has restored (samcc -test)

  15. For more information

    For more information, refer to:

    Education: Integrating TSAMP v3.2 with DB2 HADR v10.1
    http://www.ibm.com/support/docview.wss?uid=swg21679824

     

    DB2 HA across shared instance using TSA
    https://www.ibm.com/developerworks/community/blogs/nsharma/entry/db2_ha_across_shared_instance_using_tsa?lang=en

     

    Rolling Upgrade Procedure for a TSA managed DB2 Shared Disk Environment
    http://www.ibm.com/support/docview.wss?uid=swg21645808

     

    TSA Setup on two Data Centers – in an Environment having DB2 HADR with Multiple Standbys
    https://developer.ibm.com/recipes/tutorials/tsa-setup-on-two-data-centers-in-an-environment-having-db2-hadr-with-multiple-standbys/

     
