The IBM Big SQL Service relies on a few core components. Any outages in these components could hinder the availability of the Big SQL service. There are different recovery strategies that can be chosen, each has a set of requirements and offers a different level of protection. These strategies and requirements will be discussed in this blog. The Big SQL recovery options available include:
- 1.1 Big SQL/Hive Disaster Recovery
- 1.2 Big SQL/Hive High Availability (HA)
- 1.3 Regular Backup of Hive/Big SQL database
1.1 Big SQL Disaster Recovery
This solution can offer fast recovery of the Big SQL service in case of disaster where there is loss of service of the entire cluster or site. This solution requires the usage of an active and a DR cluster. In the case of disaster, whereby the active cluster is unusable, the DR cluster can take the role of the primary cluster. The DR cluster can be in a different location than the primary cluster to cater for a site outage. Note that HA can also be configured on the primary cluster. The options for solution include:
- a. Replication of distributed file system data and meta-data from the primary to the DR cluster via DR solution with IBM Big Replicate and IBM Big SQL
- b. Meta-store replication and home-grown data replication tools via Big SQL Standalone Disaster Recovery
The diagrams below the sequence of events that can occur during disaster recovery. The requirements for these solutions involve a DR cluster that is configured identically as the primary cluster.
When there is an outage of the primary cluster, the DR cluster can be used as the new primary and client requests can be sent to the new primary cluster.
When the cluster comes online again, this cluster can be used as the new standby cluster.
Note: When storage is separate from compute e.g. when EMC Isilon is used for storage, IBM Big Replicate does not need to be configured to replicate HDFS data to the DR cluster. The Big SQL Standalone DR strategy can be used to backup and restore the database from the primary to the standby cluster. After the database is restored on the standby cluster, the HDFS service on the standby cluster can be pointed to appropriate HDFS path on Isilon where the data resides.
1.2 Big SQL High Availability
Big SQL consists of management and compute nodes. If a compute node is stopped or removed, there is no loss of the Big SQL service. If the management node which stores the Big SQL metadata, Hive meta-store, HDFS Name node and the Big SQL scheduler is not accessible, there can be loss of Big SQL service. The Big SQL High Availability (HA) feature replicates transaction logs to a standby management node frequently. These logs are replayed on the standby node so that the primary and standby management nodes are kept in sync. If there is an outage affecting the primary management node, the standby management node can assume the role of the primary management node.
The diagrams below show the sequence of steps that can be taken when Big SQL HA is enabled. The logs are frequently shipped from the primary to the standby management node so that the standby management node is kept in sync with the primary management node.
When there is an outage on the primary management node, the standby management node can assume the role of the primary:
When the original primary management node is serviced, it can be added back as the new standby management node.
Big SQL HA also uses the Tivoli System Automation for multi-platforms (TSAMP) to create a complete resource model for the Big SQL resources on the primary and standby management nodes. This resource model contains resource groups for the Big SQL processes related to the Big SQL database and for the Big SQL Scheduler. More details in setting up Big SQL HA can be found in Big SQL HA introduction in Big SQL 184.108.40.206, Enabling HA from Ambari in Big SQL 4.2 and FAQs for Big SQL HA. Since Big SQL 4.2.5, automatic pruning of the Big SQL log files is done after the logs are replayed on the standby management node. This removes the hassle from the database administrator of managing these log files manually. More information can be found in Automatic Big SQL transaction log pruning for high availability.
The Big SQL catalog stores meta-data of Big SQL and DB2 tables. The Hive meta-store stores meta-data of Hive tables. High availability of the Hive meta-store can be enabled to provide protection from loss of the Hive meta-store. Note that High Availability of HiveServer2 can also be configured along with HA for other Ambari components.
The Big SQL Scheduler retrieves meta-data information from the HDFS NameNode such as location of the HDFS files on the HDFS Data Nodes. If there is loss of access to the HDFS NameNode this can hinder the ability for the Big SQL Service to access tables on HDFS. High availability of the HDFS NameNode can be configured to allow protection for loss of the NameNode. If Big SQL HBase tables are used, HA of HBase is also available. More information on HA for Ambari components can be found in the IBM Knowledge Center.
1.3 Regular Big SQL DB Backup
This strategy does not require any standby cluster or extra management node, the backups are taken on the existing cluster and restored only when necessary. Big SQL syncs with the Hive meta-store regularly. If there are any unforeseen issues whereby the Big SQL database is no longer accessible, as long as the data is accessible on the distributed file system and the Hive meta-store is intact, Big SQL can be re-installed and the tables can be re-synced from Hive via the auto-hcat sync utility. Any statistics that were gathered will need to be re-collected. Since Big SQL 4.2, if any table is synced from Hive, statistics will be automatically collected after the sync. Big SQL also provides as a feature, regular DB2 tables and views that reside only on the head node. Since these tables are not created in Hive, if the strategy discussed above is used to restore the tables, these tables will not be recovered.
When a Big SQL backup is taken, these DB2 tables and views are saved in the backup image as well as the statistics of the table. In cases, whereby the size of the tables is quite large, time can be saved from re-gathering the statistics if a backup of the Big SQL database is taken. It is up to the Big SQL administrator to determine the frequency of backup operations and the number of backups that should be stored.
Two options can be chosen for taking these backups. Offline backup of the Big SQL database takes a backup of all the nodes in the cluster. This backup is taken when the database is not active. All of the data even the data on the worker nodes are stored when this backup strategy is taken. Note that the only data that is currently allowed on the worker nodes is performance data such as those captured by event monitors. The Big SQL database can also be cloned with this solution to another cluster. Read more in Clone a Big SQL DB from one cluster to another.
The second approach is Online backups of the head node, these backups can be taken when the database is active. When the backup of the Big SQL database is taken, backups of the hive meta-store is recommended as discussed in those blogs. Do not store backups in the /home/bigsql directory as this directory is removed during uninstall of bigsql. Choose an alternate path for the backups.
This blog presented the various strategies that can be taken when implementing a recovery solution for Big SQL.
The DR solution requires an additional DR cluster. Either the IBM Big SQL and Big Replicate solution or the Big SQL standalone DR solution can be used with this approach. HA can also be combined with this solution on the active cluster. IBM Big SQL and IBM Big Replicate automates the replication process of the meta-data and the distributed file system meta-data. Automation scripts to replicate the distributed file system data is needed when the Big SQL standalone DR solution is used. These solutions offer protection in case an entire site or cluster goes down.
The head node can be single point of failure on the active cluster. HA can be used to configure a standby management node. In case of failure of the primary management node, the standby can take the place of the primary management node. An extra Big SQL management node is needed with this solution. HA can also be implemented on other components such as Hive and HDFS NameNode. This solution offers protection in case the management node goes down.
Regular backups of the Big SQL catalog and hive meta-store can be taken. There is no extra hardware needed for this solution but protection is at the database level. Offline backup takes a snapshot of the Big SQL database on all the nodes of the cluster. If there is data on the worker nodes this data is backed up. Online backups can be taken on the Big SQL head node. Online backups do not require a database deactivation however a deactivation is needed during restore. Restoration of these backups can be done in case of emergency.