IBM Support

Big SQL 4.2 How to remove a "dead" worker node - Hadoop Dev

Technical Blog Post


Body

The following article explains how to remove a worker node that is “down” or otherwise unreachable from a Big SQL cluster. Normally, a Big SQL worker node can be decommissioned from the cluster if the host is alive:




Figure 1: Showing the decommission option of a worker node.



However, when a host can no longer be accessed, this option is not available, as shown below:


Figure 2: The host worker2Host is dead. The service state shows “unknown”, and in the Ambari UI the host heartbeats are lost.



Figure 3: The host worker2Host is reported as ‘unknown-state’ in the host list; it is down.



How to remove a dead node from a Big SQL cluster from the command line

There are three phases in the Big SQL worker cleanup:
Phase 1: Deregister the host from Big SQL (whether or not the targeted Big SQL worker is accessible)
– Removal from the Big SQL database
– Removal from the Big SQL cluster
Phase 2: Remove the Big SQL binaries/packages from the worker host (only if the targeted Big SQL worker is accessible)
Phase 3: Remove the Big SQL worker service entry from Ambari (whether or not the targeted Big SQL worker is accessible)

All three phases are performed by the utility fullBigSqlCleanup.sh when it is run with the -w option.


WARNING: Running this script without the -w option will WIPE the whole Big SQL cluster.
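Because a missing -w is so destructive, it can help to build the command through a small guard. The following is only a sketch: worker_cleanup_command is a hypothetical helper, not part of Big SQL, and it prints the command for review rather than executing it.

```shell
# worker_cleanup_command is a hypothetical helper (not shipped with Big SQL).
# It refuses to produce a cleanup command unless a worker hostname is given,
# so the script is never invoked without -w by accident.
worker_cleanup_command() {
    local ambari_user="$1" ambari_password="$2" worker_host="$3"
    if [ -z "$worker_host" ]; then
        echo "Refusing: no worker hostname given, -w would be missing." >&2
        return 1
    fi
    # Print the command instead of running it, so it can be reviewed first.
    printf './fullBigSqlCleanup.sh -u %s -p %s -w %s\n' \
        "$ambari_user" "$ambari_password" "$worker_host"
}
```

Calling it with an empty hostname returns a non-zero status and prints nothing to standard output, so a caller that pipes the result into a shell has nothing to run.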

The procedure is as follows:

Step 1: ssh to the Big SQL Head node as a sudo user.
Step 2: Switch to the bigsql user and determine the dead node’s node number in the Big SQL cluster:

  
# su - bigsql
$ cat ~/sqllib/db2nodes.cfg | grep worker2Host
2 worker2Host 0

The host in question has Big SQL node number “2”. The node number is the first space-separated field on the worker2Host line in db2nodes.cfg.
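The same lookup can be sketched with awk instead of grep, so that a name such as worker2 does not also match worker20. The helper name node_numbers_for_host and the optional second argument are assumptions made for illustration; only the db2nodes.cfg layout (node number, hostname, logical port) comes from the listing above.

```shell
# node_numbers_for_host is a hypothetical helper, not part of Big SQL.
# Usage: node_numbers_for_host <hostname> [path to db2nodes.cfg]
node_numbers_for_host() {
    local host="$1"
    local cfg="${2:-$HOME/sqllib/db2nodes.cfg}"
    # Field 1 is the node number, field 2 the hostname.
    # Accept either the exact name or the short name before the first dot.
    awk -v h="$host" '{
        short = $2
        sub(/\..*/, "", short)
        if ($2 == h || short == h) print $1
    }' "$cfg"
}
```

An empty result means the host is not registered in db2nodes.cfg at all, which is worth checking before going any further.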

Step 3: Validate the host’s node number once more by attempting to start and/or stop the Big SQL service on that node from the command line.


Figure 4: Failure to ping and to start/stop Big SQL Worker worker2Host

The user has now validated that node number 2 is dead: it is completely off the network, and there is no intention of ever bringing it back.
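The reachability part of this check can be sketched as a small shell function. This is an illustration only: host_is_reachable, report_host_state and the PING_CMD override are hypothetical names, and the ping options assume the Linux ping utility. It does not attempt the Big SQL start/stop test shown in Figure 4.

```shell
# Hypothetical helpers, not part of Big SQL.
# PING_CMD is an assumed override hook so the logic can be exercised
# without touching the network; by default the system ping is used.
host_is_reachable() {
    local host="$1"
    # One echo request, wait at most 2 seconds for a reply (Linux ping options).
    "${PING_CMD:-ping}" -c 1 -W 2 "$host" >/dev/null 2>&1
}

report_host_state() {
    if host_is_reachable "$1"; then
        echo "$1 is reachable: use the normal decommission option"
    else
        echo "$1 is unreachable: candidate for fullBigSqlCleanup.sh -w"
    fi
}
```

A single failed ping is not proof that a host is gone for good, so this is best treated as one input alongside the Ambari heartbeat state.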

Step 4: On the Big SQL Head node, review the current node list as the bigsql user, then run the cleanup as root.

  
$ su - bigsql
$ cat ~/sqllib/db2nodes.cfg
0 head1.mydomain.mycompany.com 0
1 worker1.mydomain.mycompany.com 0
2 worker2.mydomain.mycompany.com 0    >> This node will be removed

Switch to root on the Big SQL Head node and locate the cleanup script:

  
[root@head1 ~]# cd /var/lib/ambari-agent
[root@head1 ambari-agent]# find . -name fullBigSqlCleanup.sh
./cache/stacks/HDP/2.4/services/BIGSQL/package/scripts/fullBigSqlCleanup.sh
[root@head1 ambari-agent]# cd ./cache/stacks/HDP/2.4/services/BIGSQL/package/scripts

Let’s run the script with no arguments to get the usage help:

  
[root@headHost1 scripts]# ./fullBigSqlCleanup.sh
Usage: ./fullBigSqlCleanup.sh -u -p [-s ] [-w ]
Required parameters:
  -u: Ambari admin username
  -p: Ambari admin password
Worker node cleanup:
  -w: Worker node hostname
      Using this option will remove the specified worker from the existing Big SQL cluster.
Optional:
  -Z: sudo_ssh_user specify the sudo/ssh user if it is other than root.
WARNING: THIS SCRIPT SHOULD BE INVOKED FROM BIGSQL_HEAD_NODE

Here is the command to clean up worker2.mydomain.mycompany.com from the Big SQL cluster:

  
[root@head1 scripts]# ./fullBigSqlCleanup.sh -u admin -p admin -w worker2.mydomain.mycompany.com

The output of the command will look like the following. The script waits for user confirmation; enter “Y” to continue with the removal:

  
Single host cleanup is requested on worker2.mydomain.mycompany.com
SSL is NOT set
Please confirm the following cluster info:
Ambari server = worker1.mydomain.mycompany.com
Ambari port = 8081
Ambari cluster = TESTHDP24
Would you like to continue? (Y/n): Y
Exporting environment variables for bigsql service
Successfully exported environment variables
Cleanup parameters:
BIGSQL_USER = bigsql
DATA_DIRS = /var/ibm/bigsql/database,/hadoop/bigsql
AMBARI_SERVER = worker1.mydomain.mycompany.com
AMBARI_CLUSTER = TESTHDP24
AMBARI_PORT = 8081
AMBARI_USER = admin
BIGSQL_USER_HOME = /home/bigsql
TARGET_HOSTLIST = /tmp/bigsqlSSHHostList
**************************
Existing Big SQL host list:
**************************
worker1.mydomain.mycompany.com
worker2.mydomain.mycompany.com
head1.mydomain.mycompany.com
head2.mydomain.mycompany.com
**************************
2 worker2.mydomain.mycompany.com 0
Target host: worker2.mydomain.mycompany.com
Target nodes: 2
Current host: head1.mydomain.mycompany.com
Big SQL Head Host: head1.mydomain.mycompany.com
Processing worker removal for node 2 (forced: 0)
Given node array to remove
0 head1.mydomain.mycompany.com 0
1 worker1.mydomain.mycompany.com 0
2 worker2.mydomain.mycompany.com 0
Processing removal of 2, worker2.mydomain.mycompany.com, 0
Log file of this shell is: /tmp/bigsql/logs/bigsql-fixtopology-2016-10-15_03.56.23.3491.log
...
...
Timed out in first attempt. Retrying ambari-server restart
Using python /usr/bin/python
Restarting ambari-server
Using python /usr/bin/python
Stopping ambari-server
Ambari Server stopped
Using python /usr/bin/python
Starting ambari-server
Ambari Server running with administrator privileges.
Organizing resource files at /var/lib/ambari-server/resources...
Server PID at: /var/run/ambari-server/ambari-server.pid
Server out at: /var/log/ambari-server/ambari-server.out
Server log at: /var/log/ambari-server/ambari-server.log
Waiting for server start....................
Ambari Server 'start' completed successfully.

Now let’s validate the outcome:

  
[root@head1 scripts]# su - bigsql
[bigsql@head1 sqllib]$ cat ~/sqllib/db2nodes.cfg
0 head1.mydomain.mycompany.com 0
1 worker1.mydomain.mycompany.com 0

The dead host (worker2), with associated Big SQL node number “2”, is now removed from the cluster. As seen in the Ambari UI, the Big SQL service now has only one worker.
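The db2nodes.cfg part of this validation can also be scripted. The sketch below assumes the same three-column layout shown in the listings; host_in_nodes_cfg is a hypothetical helper name, and it only inspects the file, not Ambari.

```shell
# host_in_nodes_cfg is a hypothetical helper, not part of Big SQL.
# Usage: host_in_nodes_cfg <fully qualified hostname> [path to db2nodes.cfg]
# Exit status is 0 if the host is still listed, 1 if it is not.
host_in_nodes_cfg() {
    local host="$1"
    local cfg="${2:-$HOME/sqllib/db2nodes.cfg}"
    # Field 2 of each line is the hostname; compare it exactly.
    awk -v h="$host" '$2 == h { found = 1 } END { exit (found ? 0 : 1) }' "$cfg"
}
```

For example, after the cleanup, `host_in_nodes_cfg worker2.mydomain.mycompany.com || echo "worker2 is gone"` would print the message when the host is no longer listed.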





UID

ibm16260043