
IBM Support

Recovering from a database disk failure in Big SQL worker node (4.1fp2+) - Hadoop Dev

Technical Blog Post


Abstract

Recovering from a database disk failure in Big SQL worker node (4.1fp2+) - Hadoop Dev

Body

This article describes a method to recover from a disk failure on a Big SQL worker. The method explained here applies to Big SQL version 4.1.0.2 and onward.

In general, whenever a Big SQL worker node experiences an unrecoverable error, the most immediate action should be to isolate that node from the cluster. You do this by simply shutting the node down (either via Ambari or the command line). At that point Big SQL will no longer use that node for operations.

There are two distinct scenarios. First, the Big SQL worker node is not reachable: the host is no longer operational, or it cannot be reached over the network to operate on it remotely. In that case you may follow these instructions to remove it permanently from the Big SQL cluster: https://developer.ibm.com/hadoop/2016/10/20/big-sql-4-2-how-to-remove-a-dead-worker-node. This allows you to repair the host in isolation without impacting the rest of the Big SQL cluster; once the host is healthy, it can easily be added back in. Second, and this is the case we will cover in this article, the host is fully operational, so we can work on it locally.

Keep in mind that as long as the Big SQL worker node in question is down, users should be able to run any Big SQL workload in parallel, unaffected.



Figure 1: This figure shows the topology of the Big SQL service. In this case, there are 2 head nodes because HA is enabled and 2 workers.



Here is the configuration showing how the Big SQL database and data directories are laid out:

Figure 2: This image displays the configuration used by the Big SQL database and data directories.



In this example, the Big SQL cluster has 2 head nodes and 2 worker nodes:
Big SQL head nodes : bihead1 and bihead2
Big SQL worker nodes : biworker1 and biworker2
Big SQL user id : bigsql

There could be cases where an error condition occurs and one or more worker nodes go down during a user workload. Big SQL worker nodes will automatically shut down when they encounter pervasive errors that would prevent the user workload from executing successfully on that node (e.g. a severe I/O error). An equivalent scenario is when individual queries fail with a "database not found" SQL1031 error code on one or more worker nodes.

To illustrate the recovery steps needed we will “simulate” a disk failure that impacts the correct operation of the Big SQL database.

(i) Go to the Big SQL head node (if there are two head nodes, use the primary).

  
[bigsql@bihead1 ~]$ cat ~/sqllib/db2nodes.cfg
0 bihead1.mydomain.mycompany.com 0
1 biworker1.mydomain.mycompany.com 0
2 biworker2.mydomain.mycompany.com 0   << this will be our victim host, where we are going to "damage" the database intentionally

(ii) Now we are going to remove all files for the database to simulate a partial disk failure.

  
# ssh biworker2
[root@biworker2 ~]# echo $DATA_DIRS
/var/ibm/bigsql/database,/hadoop/bigsql
[root@biworker2 ~]# rm -rf /var/ibm/bigsql/database
[root@biworker2 ~]# rm -rf /hadoop/bigsql
[root@biworker2 ~]#

The database is now gone on node 2.

(iii) Show the simulated problem

Now let's see what happens when we try to connect to the database on all nodes.

Go to the Big SQL head node:

  
su - bigsql
db2_all "hostname;db2 connect to bigsql"

Make sure you check all hostnames and the status of each connection:

  
[bigsql@bihead1 ~]$ db2_all "db2 connect to bigsql" 2>/dev/null

   Database Connection Information

 Database server        = DB2/LINUXX8664 11.1.0
 SQL authorization ID   = BIGSQL
 Local database alias   = BIGSQL

bihead1.mydomain.mycompany.com: db2 connect to bigsql completed ok

   Database Connection Information

 Database server        = DB2/LINUXX8664 11.1.0
 SQL authorization ID   = BIGSQL
 Local database alias   = BIGSQL

biworker1.mydomain.mycompany.com: db2 connect to bigsql completed ok

SQL1031N The database directory cannot be found on the indicated file system. SQLSTATE=58031
biworker2.mydomain.mycompany.com: db2 connect to bigsql completed rc=4
[bigsql@bihead1 ~]$

The host biworker2 is the one in trouble (we simulated the disk error above by running rm -rf on the whole Big SQL database on this host).
We can see from the output received from biworker2.mydomain.mycompany.com that we were not able to connect to the database on that node.
So how do we now resolve this issue?
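On a large cluster, eyeballing the db2_all output is error-prone. As a minimal sketch, a grep filter can surface only the hosts whose connect attempt did not complete ok; the here-document below simply replays the sample output shown above, standing in for the live db2_all pipeline:

```shell
# Keep only the lines reporting a non-zero return code.
# Against a live cluster this would be (a sketch, not verified here):
#   db2_all "db2 connect to bigsql" 2>/dev/null | grep 'completed rc='
failed=$(grep 'completed rc=' <<'EOF'
bihead1.mydomain.mycompany.com: db2 connect to bigsql completed ok
biworker1.mydomain.mycompany.com: db2 connect to bigsql completed ok
biworker2.mydomain.mycompany.com: db2 connect to bigsql completed rc=4
EOF
)
echo "$failed"
```

Only the biworker2 line is printed, matching the SQL1031N failure seen above.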

How to fix the database on the identified worker node :

Step 1 : Go to head node

  
ssh bihead1
su - bigsql

Step 2 : Identify node number of bad host

  
[bigsql@bihead1 ~]$ cat sqllib/db2nodes.cfg | grep biworker2   << this is the host where the connect failed
2 biworker2.mydomain.mycompany.com 0

Here, the first column (2) is the node number. The host biworker2, and the node number associated with it, is where the connect attempt fails.
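The node number can also be pulled out directly with awk instead of reading it off by eye. This is a sketch: the here-document stands in for ~/sqllib/db2nodes.cfg, whose sample content was shown earlier:

```shell
# Print the node number (field 1) for every entry whose hostname (field 2)
# matches the failing host. In practice, point awk at ~/sqllib/db2nodes.cfg.
badnode=$(awk -v h="biworker2" '$2 ~ h {print $1}' <<'EOF'
0 bihead1.mydomain.mycompany.com 0
1 biworker1.mydomain.mycompany.com 0
2 biworker2.mydomain.mycompany.com 0
EOF
)
echo "$badnode"
```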

Step 3 : Bring down node 2

  
[bigsql@bihead1 ~]$ db2stop force 2
10/14/2016 09:56:07     2   0   SQL1064N  DB2STOP processing was successful.
SQL1064N  DB2STOP processing was successful.

Step 4 : Now go to node 2 as root. Prepare the data directories (the assumption here is that the disks used by the database have been replaced with good ones, but they are empty).

  
# ssh biworker2
[root@biworker2 ~]# echo $DATA_DIRS
/var/ibm/bigsql/database,/hadoop/bigsql
[root@biworker2 ~]#

Parse each path separated by a comma; here we have 2 paths defined:
First token (very important, as explained below): /var/ibm/bigsql/database
Second token : /hadoop/bigsql

For the first token only, append "/bigdb" to the end and run the following:

  
mkdir -p /var/ibm/bigsql/database/bigdb
chown bigsql:hadoop /var/ibm/bigsql/database
chown bigsql:hadoop /var/ibm/bigsql/database/bigdb
chmod 775 /var/ibm/bigsql/database
chmod 775 /var/ibm/bigsql/database/bigdb

i.e.:

[root@biworker2 ~]# mkdir -p /var/ibm/bigsql/database/bigdb
[root@biworker2 ~]# chown bigsql:hadoop /var/ibm/bigsql/database
[root@biworker2 ~]# chown bigsql:hadoop /var/ibm/bigsql/database/bigdb
[root@biworker2 ~]# chmod 775 /var/ibm/bigsql/database
[root@biworker2 ~]# chmod 775 /var/ibm/bigsql/database/bigdb
[root@biworker2 ~]#

For the rest of the tokens (here we have only /hadoop/bigsql), repeat the same 3 steps:

  
mkdir -p /hadoop/bigsql
chown bigsql:hadoop /hadoop/bigsql
chmod 775 /hadoop/bigsql

i.e.:

[root@biworker2 ~]# mkdir -p /hadoop/bigsql
[root@biworker2 ~]# chown bigsql:hadoop /hadoop/bigsql
[root@biworker2 ~]# chmod 775 /hadoop/bigsql
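The directory preparation above can be sketched as a single loop over $DATA_DIRS. This is a dry run that only prints the commands (the run helper echoes instead of executing), assuming the bigsql:hadoop owner and the /bigdb suffix on the first token, as described above:

```shell
# Dry-run sketch of Step 4: print the directory-preparation commands.
DATA_DIRS=/var/ibm/bigsql/database,/hadoop/bigsql

run() { echo "+ $*"; }   # print only; change the body to "$@" to really execute

# Split the comma-separated list into an array of paths.
IFS=',' read -r -a dirs <<< "$DATA_DIRS"

# The first token is the database directory; it also needs its bigdb subdirectory.
targets=("${dirs[0]}" "${dirs[0]}/bigdb" "${dirs[@]:1}")

run mkdir -p "${dirs[0]}/bigdb"
for d in "${dirs[@]:1}"; do
  run mkdir -p "$d"
done
for t in "${targets[@]}"; do
  run chown bigsql:hadoop "$t"
  run chmod 775 "$t"
done
```

Because mkdir -p creates parent directories, one call per path is enough; the chown/chmod pair still has to touch both the database directory and its bigdb subdirectory, which is why targets lists them separately.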

Step 5 : Final step: recreate the database on node 2, which now has its disks replaced with good ones.

Go to the Big SQL head node and run "db2start restart nodenum 2" as the bigsql user:

  
[root@bihead1 ~]# su - bigsql
[bigsql@bihead1 ~]$ db2start restart nodenum 2
10/14/2016 10:11:41     2   0   SQL1489I  The add database partition server operation was successful. The new database partition server "2" is active.
SQL6032W  Start command processing was attempted on "1" node(s). "0" node(s) were successfully started. "0" node(s) were already started. "4" node(s) could not be started.
[bigsql@bihead1 ~]$

This should be it. Now let's validate…

How to validate on all Big SQL nodes that the Big SQL database is OK:
As a simple test, repeat the same command used in the previous steps to identify the problem:

  
[bigsql@bihead1 ~]$ db2_all "hostname; db2 connect to bigsql" 2>/dev/null

bihead1

   Database Connection Information

 Database server        = DB2/LINUXX8664 11.1.0
 SQL authorization ID   = BIGSQL
 Local database alias   = BIGSQL

bihead1.mydomain.mycompany.com: hostname completed ok

biworker1

   Database Connection Information

 Database server        = DB2/LINUXX8664 11.1.0
 SQL authorization ID   = BIGSQL
 Local database alias   = BIGSQL

biworker1.mydomain.mycompany.com: hostname completed ok

biworker2

   Database Connection Information

 Database server        = DB2/LINUXX8664 11.1.0
 SQL authorization ID   = BIGSQL
 Local database alias   = BIGSQL

biworker2.mydomain.mycompany.com: hostname completed ok

Step 6 : Getting the tablespaces right

On biworker2, as the bigsql user, run the following:

ssh biworker2
su - bigsql
db2 terminate
export DB2NODE=2
db2 "alter nodegroup ibmdefaultgroup drop dbpartitionnum(2)"
db2 "alter nodegroup ibmdefaultgroup add dbpartitionnum(2)"

Step 7 : If the cluster has Big SQL logical workers, there are multiple entries in ~/sqllib/db2nodes.cfg for the same host (second field in the file) but with different node numbers (first field in the file). If the database is damaged on the logical nodes, repeat Step 1 through Step 6 for each node number that appears with that hostname (field 2) in ~/sqllib/db2nodes.cfg.

Sample Big SQL logical workers on the same host:

2 biworker2.mydomain.mycompany.com 0
3 biworker2.mydomain.mycompany.com 1
4 biworker2.mydomain.mycompany.com 2

2, 3, 4 : logical Big SQL nodes on the biworker2 host, with their relative FCM ports (0/1/2).

db2nodes.cfg may have extra fields, but we are only interested in the first two fields for the recovery.
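Given that layout, the logical node numbers to recover for one host can be listed with awk on those first two fields. A sketch, with the sample entries above fed via a here-document in place of ~/sqllib/db2nodes.cfg:

```shell
# List every node number (field 1) registered for one hostname (field 2).
# In practice, point awk at ~/sqllib/db2nodes.cfg instead of the here-document.
nodes=$(awk -v h="biworker2.mydomain.mycompany.com" '$2 == h {print $1}' <<'EOF'
0 bihead1.mydomain.mycompany.com 0
1 biworker1.mydomain.mycompany.com 0
2 biworker2.mydomain.mycompany.com 0
3 biworker2.mydomain.mycompany.com 1
4 biworker2.mydomain.mycompany.com 2
EOF
)
echo $nodes   # unquoted echo joins the numbers onto one line
```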

Conclusion:
After the simulated faulty disk has been replaced and the steps above completed, the database is in good condition on all the nodes in the cluster.

