An Active / Standby Disaster Recovery Solution for Big SQL Using IBM Big Replicate

Introduction
The capability to recover from potential disaster scenarios is a priority for many Big SQL users. It is now possible to use Big SQL with IBM Big Replicate to achieve an active/standby Disaster Recovery (DR) solution. This blog provides an overview of such a DR solution and explains how it can be used to recover Hadoop and Big SQL data in the event of a disaster. Combined with a well-thought-out DR strategy, the Big SQL DR solution outlined here minimizes the risk of data loss and downtime for Big SQL users.

Solution Architecture
In order to take advantage of a Big SQL DR solution, a user should have two Big SQL clusters configured:

    1. The Big SQL Active Cluster – a Big SQL cluster with the Big SQL service up and running.
    2. The Big SQL Standby Cluster – a Big SQL cluster with the Big SQL service stopped during normal operation.

Important Note: Big SQL must be installed on both clusters with a consistent configuration, including the same bigsql_db_path, the same bigsql_data_directories, the same number of Big SQL worker nodes, and consistent node numbering in the db2nodes.cfg file.
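
As a quick sanity check, the node layout of the two clusters can be compared programmatically. The following is a minimal sketch, assuming copies of each cluster’s db2nodes.cfg have been collected locally under illustrative file names; it checks only that the node count and numbering match (host names will naturally differ between clusters):

    def node_numbers(path):
        """Return the ordered list of node numbers from a db2nodes.cfg file."""
        numbers = []
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith('#'):
                    # First column of each entry is the node number.
                    numbers.append(int(line.split()[0]))
        return numbers

    active = node_numbers('active_db2nodes.cfg')    # copy taken from the active cluster
    standby = node_numbers('standby_db2nodes.cfg')  # copy taken from the standby cluster

    if active != standby:
        raise SystemExit('Mismatch in node count/numbering: %s vs %s' % (active, standby))
    print('db2nodes.cfg numbering is consistent across %d nodes' % len(active))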

The architecture of a Big SQL active/standby DR solution will then consist of two key components used across these Big SQL clusters:

    1. IBM Big Replicate – Hadoop Data Replication
    2. Big SQL – Database Replication (via Backup and Restore)

[Figure: Big SQL active/standby DR architecture using IBM Big Replicate]

IBM Big Replicate – Hadoop Data Replication
IBM Big Replicate provides the ability to replicate HDFS data between Hadoop clusters. This is the component that enables a user to recover their Hadoop data in the event of a disaster. For more information on IBM Big Replicate, including how to install and configure it, refer to the IBM Big Replicate User Guide.

Big SQL – Database Replication (via Backup and Restore)
The process of Big SQL database replication makes use of database backup and restore operations. Here, Big SQL backup and restore is used to keep the Big SQL data safe and readily available on both active and standby Big SQL clusters. Backups are taken on the Big SQL Active Cluster and then transferred to and restored on the Big SQL Standby Cluster.

Note: Transferring backup images to the Standby Cluster is a critical step, as it ensures that the Big SQL Standby Cluster is ready to be restored and used in the event of a critical failure impacting the Big SQL Active Cluster.
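
For illustration, the sketch below shows a simplified version of the backup-and-transfer cycle that bigsql_bar.py automates. It assumes standard Db2 backup syntax, a passwordless scp path between head nodes, and illustrative host names and directories; it is not a substitute for the utility itself:

    import subprocess

    BACKUP_DIR = '/backups/bigsql'             # assumed backup directory
    STANDBY_HEAD = 'standby-head.example.com'  # hypothetical standby head node

    # Take an online backup of the BIGSQL database across all partitions.
    # Run as the Big SQL instance owner with the Db2 environment sourced.
    subprocess.check_call([
        'db2',
        'BACKUP DATABASE BIGSQL ON ALL DBPARTITIONNUMS ONLINE TO %s' % BACKUP_DIR])

    # Transfer the generated image(s) to the standby head node so that a
    # restore can be performed there whenever needed.
    subprocess.check_call(
        ['scp', '-rp', BACKUP_DIR, '%s:/backups' % STANDBY_HEAD])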

The bigsql_bar.py utility is provided to assist with this process, automating all necessary backup, image-transfer and restore operations. For more information on the bigsql_bar.py utility, including all available options and some examples, refer to the related tech note.

Big SQL Disaster Recovery Strategy
The Big SQL DR solution outlined above provides the infrastructure and tools required to recover from a disaster. However, it’s also important for Big SQL users to consider how they will use these to build the DR strategy that best fits their needs. Because IBM Big Replicate ensures that all Hadoop data is transparently replicated between the Big SQL Active Cluster and the Big SQL Standby Cluster, the key considerations relate to the backup and restore component of the solution. And since the bigsql_bar.py utility automates most backup and restore operations and can easily be scheduled as a cron job, the main decisions to be made are:

    Backup: How frequently should a backup be taken on the Big SQL Active Cluster and transferred to the Big SQL Standby Cluster, and at what time of day should these backups be scheduled?
    Restore: How frequently should restore operations be performed on the Big SQL Standby Cluster? That is, should regular restores of the Big SQL Active Cluster’s backup images be performed, or a single restore of the most recent backup image only in the event of a disaster?

These decisions depend on each user’s individual requirements, such as database usage (e.g., DDL change frequency, regular off-peak times), available resources (e.g., bandwidth, disk space), the required speed of recovery, and the required availability of and access to data.

Note: The user will also need to decide how long to keep aging backup images on disk and maintain the backup directory accordingly.
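
Such housekeeping can be handled by a small script run from cron. The sketch below is a hypothetical example that deletes backup images older than a chosen retention window; the directory, retention period and cron schedule are all assumptions to be adapted:

    import os
    import time

    BACKUP_DIR = '/backups/bigsql'   # assumed backup directory
    RETENTION_DAYS = 14              # assumed retention window

    # Example crontab entry to run this nightly at 03:00 (illustrative):
    #   0 3 * * * /usr/bin/python /home/bigsql/prune_backups.py

    cutoff = time.time() - RETENTION_DAYS * 86400
    for name in os.listdir(BACKUP_DIR):
        path = os.path.join(BACKUP_DIR, name)
        if os.path.isfile(path) and os.path.getmtime(path) < cutoff:
            os.remove(path)   # delete backup images older than the cutoff
            print('Pruned old backup image: %s' % path)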

Backup
Once Big SQL and Big Replicate are installed and configured on both clusters, the Hadoop data is already being replicated between them. With a DR strategy in place, it’s then necessary to schedule regular backups of the BIGSQL database on the Big SQL Active Cluster’s head node and transfer the generated backup images to the Big SQL Standby Cluster’s head node. If Big SQL HA is enabled on the Big SQL Active Cluster, the Big SQL primary head node should be used.

Note: If Big SQL HA is being used, it should be disabled on the Big SQL Standby Cluster during normal operation, until a disaster scenario occurs and failover takes place. After Big SQL has been started and the database restored, Big SQL HA can then be enabled on what was originally the Big SQL Standby Cluster.

Restore
Depending on the user’s DR Strategy, the backup images can be restored on the Big SQL Standby Cluster in one of two ways:

    1. Perform regularly scheduled restores of the BIGSQL database, so that the Big SQL Standby Cluster is consistently up to date and ready for failover.
    2. Perform a single restore of the latest Big SQL backup image only when necessary, as part of a user-defined failover strategy.

Note: An initial offline backup must be taken on the Big SQL Active Cluster and restored on the Big SQL Standby Cluster before proceeding with Big SQL online backup and restore operations.
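
As an illustration of the second approach, the following sketch restores the most recent image found in the backup directory. It assumes the standard 14-digit timestamp embedded in Db2 backup image names; note that on a multi-partition database the restore must be performed for each database partition (catalog partition first), which is one more reason to let bigsql_bar.py drive these steps:

    import os
    import re
    import subprocess

    BACKUP_DIR = '/backups/bigsql'   # assumed directory holding transferred images

    # Db2 backup image names embed a 14-digit timestamp, for example
    # BIGSQL.0.bigsql.DBPART000.20200101120000.001 (naming assumed).
    stamps = set()
    for name in os.listdir(BACKUP_DIR):
        match = re.search(r'\.(\d{14})\.', name)
        if match:
            stamps.add(match.group(1))
    if not stamps:
        raise SystemExit('No backup images found in %s' % BACKUP_DIR)

    # Restore the BIGSQL database from the most recent image.
    subprocess.check_call([
        'db2',
        'RESTORE DATABASE BIGSQL FROM %s TAKEN AT %s' % (BACKUP_DIR, max(stamps))])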

High Level Example of a Disaster Recovery Strategy
Once the necessary initial offline backup has been taken on the Big SQL Active Cluster and restored on the Big SQL Standby Cluster, then, depending on individual user requirements, a high-level disaster recovery strategy might look something like the following:

    Big SQL Active Cluster
    Every day at a given time:

    • Perform an online backup (using bigsql_bar.py – possibly set up as a cron job)
    • Transfer the generated backup image to the Big SQL Standby Cluster (using bigsql_bar.py)

    Big SQL Standby Cluster
    Every day at a given time:

    • Start Big SQL service (as Big SQL is stopped on the Standby Cluster during normal operation)
    • Perform a restore of the latest available online backup image (using bigsql_bar.py – possibly as a cron job)
    • Execute HCAT_SYNC_OBJECTS as necessary (using bigsql_bar.py, or manually)
    • Stop Big SQL service OR start using cluster as Big SQL Active Cluster (failover only)
    • Enable HA if required (failover only)
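
The standby-side routine above can be tied together with a small orchestration script. The sketch below is purely illustrative: the service start/stop commands and the restore script path are placeholders (assumptions, not documented syntax) for however these operations are performed in a given environment:

    import subprocess

    def run(cmd):
        """Run a shell command, echoing it first."""
        print('+ %s' % cmd)
        subprocess.check_call(cmd, shell=True)

    # Placeholder commands: adapt to however the Big SQL service and the
    # bigsql_bar.py utility are invoked in your environment.
    START_BIGSQL = 'su - bigsql -c "bigsql start"'
    STOP_BIGSQL = 'su - bigsql -c "bigsql stop"'
    RESTORE_LATEST = 'su - bigsql -c "python /home/bigsql/restore_latest.py"'

    failover = False        # set to True only when actually failing over

    run(START_BIGSQL)       # standby is normally stopped, so start it first
    run(RESTORE_LATEST)     # restore the latest transferred backup image
    # Run HCAT_SYNC_OBJECTS here if DDL has changed since the backup (see below).
    if not failover:
        run(STOP_BIGSQL)    # normal operation: return the standby to its stopped state
    # On failover, leave Big SQL running and re-enable HA if required.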

HCAT_SYNC_OBJECTS
If there have been any DDL changes since the most recent available backup was taken, HCAT_SYNC_OBJECTS should be executed to ensure a fully up-to-date cluster (necessary as part of a failover). This can be performed as part of bigsql_bar.py execution or by running HCAT_SYNC_OBJECTS manually after a restore has completed. For more details, see the HCAT_SYNC_OBJECTS stored procedure documentation.
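
For example, the following minimal sketch (run as a user with the necessary privileges) synchronizes all object types in a single schema, replacing existing definitions and continuing past individual errors; the schema name MYSCHEMA is illustrative:

    import subprocess

    # Connect to the database, then synchronize all object types ('a') in the
    # assumed schema MYSCHEMA, replacing existing definitions ('REPLACE') and
    # continuing past individual errors ('CONTINUE').
    subprocess.check_call(
        'db2 CONNECT TO BIGSQL && '
        'db2 "CALL SYSHADOOP.HCAT_SYNC_OBJECTS('
        "'MYSCHEMA', '.*', 'a', 'REPLACE', 'CONTINUE')\"",
        shell=True)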

Testing
It’s a good idea for Big SQL users to perform an end-to-end verification of their defined Big SQL disaster recovery strategy. This ensures that, in the event of an actual disaster, the defined DR strategy can successfully recover all Hadoop and Big SQL data and minimize downtime.
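
One simple check, sketched below with illustrative table names, is to record row counts on the Big SQL Active Cluster at backup time and compare them on the Big SQL Standby Cluster after a test restore:

    import subprocess

    TABLES = ['MYSCHEMA.SALES', 'MYSCHEMA.CUSTOMERS']   # illustrative table names

    def row_count(table):
        """Return a table's row count via the db2 command line processor."""
        out = subprocess.check_output(
            'db2 -x CONNECT TO BIGSQL > /dev/null && '
            'db2 -x "SELECT COUNT(*) FROM %s"' % table, shell=True)
        return int(out.decode().split()[0])

    # Run on the active cluster at backup time, then again on the standby after
    # a test restore; matching counts give basic confidence in the procedure.
    for table in TABLES:
        print('%s: %d rows' % (table, row_count(table)))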

Summary
An active/standby disaster recovery solution is now possible for Big SQL. By using IBM Big Replicate together with Big SQL database backup and restore operations, made simple via the bigsql_bar.py utility, users can greatly reduce the risk of data loss and downtime resulting from a critical failure of the Big SQL Active Cluster.
