The capability to recover from potential disaster scenarios is a priority for many Big SQL users. It is now possible to use Big SQL with IBM Big Replicate to achieve an active/standby Disaster Recovery (DR) solution. This blog provides an overview of that solution and explains how it can be used to recover Hadoop and Big SQL data in the event of a disaster. Combined with a well-thought-out DR strategy, the Big SQL DR solution outlined here minimizes the risk of data loss and potential downtime for Big SQL users.
To take advantage of a Big SQL DR solution, a user should have two Big SQL clusters configured:
- The Big SQL Active Cluster – A Big SQL cluster with Big SQL service up and running.
- The Big SQL Standby Cluster – A Big SQL cluster with Big SQL service stopped during normal operation.
Important Note: Big SQL must be installed on both clusters with a consistent configuration, including the same bigsql_db_path, the same bigsql_data_directories, the same number of Big SQL worker nodes, and consistent numbering of the Big SQL worker nodes in the db2nodes.cfg file.
The architecture of a Big SQL active/standby DR solution will then consist of two key components used across these Big SQL clusters:
- IBM Big Replicate – Hadoop Data Replication
- Big SQL – Database Replication (via Backup and Restore)
IBM Big Replicate – Hadoop Data Replication
IBM Big Replicate provides the ability to replicate HDFS data between Hadoop clusters. This is the component that enables users to recover their Hadoop data in the event of a disaster. For more information on IBM Big Replicate, including how to install and configure it, refer to the IBM Big Replicate User Guide.
Big SQL – Database Replication (via Backup and Restore)
Big SQL database replication makes use of database backup and restore operations: backup and restore keep the Big SQL data safe and readily available on both the active and standby clusters. Backups are taken on the Big SQL Active Cluster and then transferred to, and restored on, the Big SQL Standby Cluster.
Note: Transferring backup images to the Standby Cluster is a critical step, as it ensures that the Big SQL Standby Cluster is ready to be restored and used in the event of a critical failure on the Big SQL Active Cluster.
The bigsql_bar.py utility is provided to assist with this process, automating all necessary backup, image transfer and restore operations. For more information on the bigsql_bar.py utility, including all available options and some examples, refer to the related tech-note here.
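Conceptually, the daily backup-and-transfer step on the active head node amounts to something like the following sketch. The paths, hostname, and direct use of the db2 and scp commands are illustrative assumptions; in practice bigsql_bar.py automates these operations for you.

```shell
#!/bin/sh
# Sketch of a backup-and-transfer step on the active head node.
# The directory, hostname, and image pattern below are placeholders,
# not defaults of bigsql_bar.py.
command -v db2 >/dev/null 2>&1 || { echo "db2 CLI not found; run on a Big SQL head node" >&2; exit 0; }

BACKUP_DIR=/bigsql/backups              # assumed backup directory
STANDBY_HEAD=standby-head.example.com   # assumed standby head node

# Take an online backup of the BIGSQL database, including logs
db2 "BACKUP DATABASE BIGSQL ONLINE TO $BACKUP_DIR INCLUDE LOGS"

# Ship the newest generated image to the standby head node's backup directory
latest=$(ls -t "$BACKUP_DIR"/BIGSQL.* | head -n 1)
scp "$latest" "$STANDBY_HEAD:$BACKUP_DIR/"
```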
The Big SQL DR solution outlined here provides the infrastructure and tools required to recover from a disaster. However, it is also important for Big SQL users to consider how they will use these tools to build the DR strategy that best fits their needs. Because IBM Big Replicate transparently replicates all Hadoop data between the Big SQL Active Cluster and the Big SQL Standby Cluster, the key considerations relate to the backup and restore component of the solution. And since the bigsql_bar.py utility automates much of the backup and restore work and can easily be scheduled to run as a cron job, the main decisions to be made are:
- Backup: How frequently should a backup be taken on the Big SQL Active Cluster and transferred to the Big SQL Standby Cluster? And at what time of day should these backups run?
- Restore: How frequently should restore operations be performed on the Big SQL Standby Cluster? That is, do you regularly restore the Big SQL Active Cluster's backup images, or perform a single restore of the most recent backup image only in the event of a disaster?
These decisions depend on a user's individual requirements, such as database usage (e.g. DDL change frequency, regular off-peak windows), available resources (e.g. bandwidth, disk space), the required speed of recovery, and the required availability of and access to data.
Note: The user will also need to decide how long to keep ageing backup images on disk and maintain the backup directory manually.
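One way to maintain the backup directory is a small retention helper such as the hypothetical sketch below; the image name pattern, directory, and age threshold are assumptions to tune against your own backup frequency and available disk space.

```shell
# Hypothetical retention helper: delete Big SQL backup images in a
# directory that are older than a given number of days.
prune_backups() {
  dir=$1
  days=$2
  # Backup images are assumed to start with the database name "BIGSQL."
  find "$dir" -name 'BIGSQL.*' -type f -mtime +"$days" -print -delete
}

# Example: keep roughly two weeks of images in the backup directory
# prune_backups /bigsql/backups 14
```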
Once Big SQL and Big Replicate are installed and configured on both clusters, the Hadoop data is already being replicated between the two Big SQL clusters. With a DR strategy in place, it is then necessary to schedule regular backups of the BIGSQL database on the Big SQL Active Cluster's head node and transfer the generated backup images to the Big SQL Standby Cluster's head node. If Big SQL HA is enabled on the Big SQL Active Cluster, the Big SQL primary head node should be used.
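Scheduling this via cron could be as simple as the crontab entry sketched below. The script path, log path, and 01:00 run time are assumptions, and nightly_bigsql_backup.sh stands in for a hypothetical wrapper script that invokes bigsql_bar.py with your chosen options.

```shell
# Hypothetical crontab entry on the active head node: run a wrapper
# script (which invokes bigsql_bar.py) every night at 01:00, off-peak.
0 1 * * * /opt/bigsql/scripts/nightly_bigsql_backup.sh >> /var/log/bigsql_bar.log 2>&1
```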
Note: If Big SQL HA is being used, it should be disabled on the Big SQL Standby Cluster during normal operation, until a disaster scenario triggers failover. After Big SQL has been started and restored, Big SQL HA can then be enabled on what was originally the Big SQL Standby Cluster.
Depending on the user’s DR Strategy, the backup images can be restored on the Big SQL Standby Cluster in one of two ways:
- Regularly scheduled restores of the BIGSQL database, so that the Big SQL Standby Cluster is consistently up to date and ready for failover.
- A single restore of the latest Big SQL backup image, performed only when necessary as part of a user-defined failover strategy.
Note : An initial offline backup taken on the Big SQL Active Cluster and a restore of the offline image on the Big SQL Standby Cluster must be performed before proceeding with Big SQL online backup and restore operations.
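The initial offline round trip can be sketched as follows, using standard backup and restore command syntax. The directory and timestamp shown are placeholders (the timestamp comes from the generated image name), and bigsql_bar.py can drive these steps for you.

```shell
#!/bin/sh
# Sketch of the initial offline backup (active) and restore (standby).
# Directory and timestamp are placeholders; Big SQL must be quiesced
# for an offline backup.
command -v db2 >/dev/null 2>&1 || { echo "db2 CLI not found; run on a Big SQL head node" >&2; exit 0; }

# On the Big SQL Active Cluster's head node: offline (default) backup
db2 "BACKUP DATABASE BIGSQL TO /bigsql/backups"

# After transferring the image, on the Big SQL Standby Cluster's head node
# (20240101120000 is a placeholder for the image's actual timestamp):
db2 "RESTORE DATABASE BIGSQL FROM /bigsql/backups TAKEN AT 20240101120000 WITHOUT PROMPTING"
```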
Once the necessary initial offline backup has been taken on the Big SQL Active Cluster and restored on the Big SQL Standby Cluster, then depending on the individual user requirements, a high-level disaster recovery strategy might look something like the following:
Big SQL Active Cluster

Every day at a given time:

- Perform an online backup (using bigsql_bar.py – possibly set up as a cron job)
- Transfer the generated backup image to the Big SQL Standby Cluster (using bigsql_bar.py)

Big SQL Standby Cluster

Every day at a given time:

- Start Big SQL service (as Big SQL is stopped on the Standby Cluster during normal operation)
- Perform a restore of the latest available online backup image (using bigsql_bar.py – possibly as a cron job)
- Execute HCAT_SYNC_OBJECTS as necessary (using bigsql_bar.py, or manually)
- Stop Big SQL service OR start using the cluster as the Big SQL Active Cluster (failover only)
- Enable HA if required (failover only)
If there have been any DDL changes since the most recent available backup was taken, HCAT_SYNC_OBJECTS should be executed to ensure a fully up-to-date cluster (necessary as part of a failover). This can be performed as part of a bigsql_bar.py execution or by running HCAT_SYNC_OBJECTS manually after a restore has completed. For more details on HCAT_SYNC_OBJECTS, see the HCAT_SYNC_OBJECTS stored procedure documentation.
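As an illustration, a manual sync of every object in a single schema after a restore might look like the sketch below. MYSCHEMA is a placeholder schema name, and the 'a', 'REPLACE', 'CONTINUE' arguments request all object types, replace mode, and continue-on-error respectively; check the HCAT_SYNC_OBJECTS documentation for the full argument list before adapting this.

```shell
#!/bin/sh
# Run on the standby head node once Big SQL has been started after the
# restore. MYSCHEMA is a placeholder schema name.
command -v db2 >/dev/null 2>&1 || { echo "db2 CLI not found; run on a Big SQL head node" >&2; exit 0; }

db2 "CONNECT TO BIGSQL"
db2 "CALL SYSHADOOP.HCAT_SYNC_OBJECTS('MYSCHEMA', '.*', 'a', 'REPLACE', 'CONTINUE')"
```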
It’s a good idea for Big SQL users to perform an end-to-end verification of their defined Big SQL Disaster Recovery Strategy. This will serve to ensure that in the event of an actual disaster, the user’s defined DR Strategy can successfully recover all Hadoop and Big SQL data and minimize downtime.
An active/standby disaster recovery solution is now possible for Big SQL. By using IBM Big Replicate and Big SQL database backup and restore operations, made simple via the bigsql_bar.py utility, it is possible to greatly minimize the risk of data loss and potential downtime as a result of any critical failures of the Big SQL Active Cluster.
- For more information on the Big SQL active/standby DR solution as well as the bigsql_bar.py utility, refer to this technote.
- For more information on IBM Big Replicate, including Big Replicate installation and user guides, refer to the IBM Big Replicate User Guide.
- For more background information on backup and restore refer to the existing documentation on BACKUP DATABASE command and RESTORE DATABASE command.