IBM Cloud Automation Manager (CAM) is designed for production orchestration of cloud workloads and builds on the enterprise capabilities of IBM Cloud Private and its Kubernetes environment. In this article we describe the capabilities of CAM and the Backup/Restore, High Availability (HA) and Disaster Recovery (DR) strategies that should be applied when running CAM in a production environment.

These instructions cover IBM Cloud Automation Manager 3.1.2.0 and newer.

For version 3.1.0.0 see: Backup_Restore, HA and DR for IBM Cloud Automation Manager – Cloud Automation 3.1.0.0

For versions prior to 3.1.0.0 see: Backup_Restore, HA and DR for IBM Cloud Automation Manager – Cloud Automation092418

Backup and Restore

Let's start with the basic backup and restore of a CAM environment. CAM deploys and manages workloads to and across multiple cloud environments. The data used to create and manage those deployments across the clouds is stored inside CAM's databases. It is important to have a complete and recurring backup strategy for all CAM data so that, in the event of data loss or data corruption, the CAM data can be restored.

Backing up CAM data:

CAM data resides in four Persistent Volumes (PVs) that are created before CAM is installed.
More on PVs from Kubernetes: https://kubernetes.io/docs/concepts/storage/persistent-volumes/

The default names of the four CAM PVs and their usages are:

Volume | Persistent Volume (PV) name | Default Size | Usage
Mongo DB | cam-mongo-pv | 15 GB | MongoDB volume for the main CAM database
Logs | cam-logs-pv | 10 GB | Logs volume for CAM pod logs
Terraform Plugins | cam-terraform-pv | 15 GB | Terraform volume for default and user-added plugins
Template Designer DB | cam-bpd-appdata-pv | 15 GB | Template Designer MariaDB

For more information on CAM PVs see:
https://www.ibm.com/support/knowledgecenter/SS2L37_2.1.0.2/cam_create_pv.html

As a best practice these four CAM volumes should reside in a storage location where consistent snapshots can be taken so that data is consistently backed up from all four volumes simultaneously. If the underlying storage system does not support consistent snapshots, manual backup actions can still be performed using the procedure below.

Finally, an optional custom encryption password can be configured at CAM chart deploy time to encrypt the database contents and Terraform container contents. It is important that this custom encryption password also be backed up separately, in addition to the backups described in this document, since it will be needed to decrypt the data during a restore. Without this password, a new install of CAM (when restoring from a backup) will not be able to read the existing encrypted CAM data.

Manual backup procedure

Stop CAM
First, stop CAM to ensure there are no modifications during the backup.
stopCAM.sh
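
If the stopCAM.sh helper script is not available in your environment, a rough equivalent (a sketch only, assuming all CAM deployments are named cam-* and run in the services namespace) is to scale the CAM deployments down to zero replicas:

# Rough equivalent of stopCAM.sh (assumption: all CAM deployments are named cam-* in the services namespace)
kubectl get deploy -n services -o name | grep cam- | \
  xargs -I{} kubectl scale -n services {} --replicas=0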

Logs
To backup the CAM logs, simply make a copy of the logs volume mount using a shell command like cp or rsync.

Terraform Plugins
To backup the Terraform Plugins, simply make a copy of the Terraform plugins volume mount using a shell command like cp or rsync.
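
As an illustration, assuming the logs and Terraform plugins volumes are mounted on the backup host at the hypothetical paths shown below, the copies could be taken as follows:

# Hypothetical mount points for the cam-logs-pv and cam-terraform-pv volumes
rsync -a /mnt/cam-logs-pv/      /backup/cam-logs-pv/
rsync -a /mnt/cam-terraform-pv/ /backup/cam-terraform-pv/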

Mongo DB
If CAM was deployed with an external MongoDB, best practices for MongoDB backup strategies should be utilized, particularly if the MongoDB instance is replicated or sharded. Find backup and restore best practices in the MongoDB documentation: https://docs.mongodb.com/manual/core/backups/

Backup bundled CAM MongoDB database

  1. Start just the cam-mongo database.
    kubectl scale -n services deploy cam-mongo --replicas=1
  2. It is a good idea to separate the system being backed up from the system capturing the backup. Start a temporary standalone MongoDB container in the same ICP cluster where CAM is deployed. This container will be used to perform the backup. Note, the port numbers for this command may need to be changed to avoid conflicts.
    docker run -d --name cam-mongo-backup mongo:3.4
  3. Next, execute the mongodump command, pointing to the cam-mongo service.
    MONGO_IP=$(kubectl get -n services svc/cam-mongo --no-headers | awk '{print $3}')
    MONGO_PASSWORD=$(kubectl get -n services secret cam-secure-values-secret -o yaml | grep mongoDbPassword: | awk '{print $2}' | base64 --decode)
    Note: Replace “cam-secure-values-secret” in the above command with your CAM secret if you use a different name
    docker exec -it cam-mongo-backup mongodump --ssl --sslAllowInvalidCertificates --uri "mongodb://root:$MONGO_PASSWORD@$MONGO_IP:27017/cam?ssl=true" --archive=/tmp/mongo_backup.gz --gzip
  4. Once the dump is complete, copy the backup archive off of the standalone MongoDB backup container. Make sure to store this someplace safe.
    docker cp cam-mongo-backup:/tmp/mongo_backup.gz .
  5. Finally, cleanup the standalone MongoDB backup container:
    docker stop cam-mongo-backup
    docker rm cam-mongo-backup

Template Designer
There are two pieces of CAM Template Designer data to back up: the MariaDB database and the repositories on the cam-bpd-ui container.

  • Backup bundled CAM MariaDB database
    If CAM was deployed with an external MariaDB, best practices for MariaDB backup strategies should be utilized, particularly if the MariaDB instance is replicated or sharded. Find backup and restore best practices in the MariaDB documentation: https://mariadb.com/kb/en/library/backup-and-restore-overview/

    If CAM was deployed with the bundled MariaDB, follow this procedure:

    1. Start just the cam-bpd-mariadb database.
      kubectl scale -n services deploy cam-bpd-mariadb --replicas=1
    2. It is a good idea to separate the system being backed up from the system capturing the backup. Start a temporary standalone MariaDB container in the same ICP cluster where CAM is deployed. This container will be used to perform the backup. Note, the port numbers for this command may need to be changed to avoid conflicts.
      docker run -d --name cam-maria-backup -e MYSQL_RANDOM_ROOT_PASSWORD=yes mariadb:10.1.16
    3. Next, execute the mysqldump command
      MARIA_IP=$(kubectl get -n services svc/cam-bpd-mariadb --no-headers | awk '{print $3}')
      MARIA_USERNAME=$(kubectl -n services get secret cam-secure-values-secret -o yaml | grep mariaDbUsername: | awk '{print $2}' | base64 --decode)
      MARIA_PASSWORD=$(kubectl -n services get secret cam-secure-values-secret -o yaml | grep mariaDbPassword: | awk '{print $2}' | base64 --decode)
      Note: Replace “cam-secure-values-secret” in the above commands with your CAM secret if you use a different name
      docker exec -it cam-maria-backup mysqldump --host=$MARIA_IP --port=3306 --all-databases --user=$MARIA_USERNAME --password=$MARIA_PASSWORD --result-file=/tmp/mariadb_backup.sql
    4. Copy the backup archive off of the standalone MariaDB backup container. Make sure to store this someplace safe.
      docker cp cam-maria-backup:/tmp/mariadb_backup.sql .
    5. Finally, cleanup the standalone MariaDB backup container:
      docker stop cam-maria-backup
      docker rm cam-maria-backup
  • Backup CAM BPD UI repositories
    1. Start just the cam-bpd-ui service.
      kubectl scale -n services deploy cam-bpd-ui --replicas=1
    2. Create an archive of the important files
      export BPD_UI=$(kubectl get -n services pods | grep cam-bpd-ui | awk '{print $1}')
      kubectl exec -it -n services $BPD_UI -- tar -cvzf /tmp/cam-bpd-ui-backup.tgz -C /opt/ibm-ucd-patterns/ workspace repositories
    3. Then, copy the archive off the container. Make sure to save this someplace safe.
      kubectl cp -n services $BPD_UI:/tmp/cam-bpd-ui-backup.tgz .
    4. Finally, cleanup the created archive on the container.
      kubectl exec -it -n services $BPD_UI -- rm -f /tmp/cam-bpd-ui-backup.tgz

Content Runtime
If you have deployed a Content Runtime, follow the steps here to backup the Chef Server, Software Repository Binaries and Pattern Manager: https://www.ibm.com/support/knowledgecenter/SS2L37_2.1.0.2/content/cam_content_runtime_backup.html

Start CAM
Once all manual backup steps are complete, start CAM.
startCAM.sh

Manual restore procedure

Prerequisite: These manual restore steps assume there is already a deployed instance of CAM. This can either be the same instance the backup was created from (which is being repaired), or a new instance of CAM on a separate installation of IBM Cloud Private as part of a larger disaster recovery scenario. If a new instance of CAM is being deployed and a custom encryption password was specified during the initial deployment, provide that same password when deploying the CAM chart.

Stop CAM
First, stop CAM to ensure there are no modifications during the restore.
stopCAM.sh

Logs
To restore the CAM logs, simply make a copy of the backed up log files onto the logs volume mount using a shell command like cp or rsync.

Terraform Plugins
To restore the Terraform Plugins, simply make a copy of the backed up Terraform plugins onto the Terraform plugins volume mount using a shell command like cp or rsync.
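
For example, assuming the same hypothetical mount points used in the backup section, the restore is simply the copy in reverse:

# Copy the backed-up data back onto the volume mounts (paths are assumptions)
rsync -a /backup/cam-logs-pv/      /mnt/cam-logs-pv/
rsync -a /backup/cam-terraform-pv/ /mnt/cam-terraform-pv/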

Mongo DB
Restoring bundled CAM MongoDB database

  1. Start just the cam-mongo database
    kubectl scale -n services deploy cam-mongo --replicas=1
  2. It is a good idea to separate the system being restored from the system performing the restore. Start a temporary standalone MongoDB container in the same ICP cluster where CAM is deployed. This container will be used to perform the restore. Note, the port numbers for this command may need to be changed to avoid conflicts.
    docker run -d --name cam-mongo-backup mongo:3.4
  3. Copy the desired backup archive to be restored into the standalone container
    docker cp mongo_backup.gz cam-mongo-backup:/tmp/mongo_backup.gz
  4. Next, execute the mongorestore command, pointing to the cam-mongo service.
    MONGO_IP=$(kubectl get -n services svc/cam-mongo --no-headers | awk '{print $3}')
    MONGO_PASSWORD=$(kubectl get -n services secret cam-secure-values-secret -o yaml | grep mongoDbPassword: | awk '{print $2}' | base64 --decode)
    Note: Replace “cam-secure-values-secret” in the above command with your CAM secret if you use a different name
    docker exec -it cam-mongo-backup mongorestore --ssl --sslAllowInvalidCertificates --uri "mongodb://root:$MONGO_PASSWORD@$MONGO_IP:27017/cam?ssl=true" --archive=/tmp/mongo_backup.gz --gzip --drop
  5. Finally, cleanup the standalone MongoDB container
    docker stop cam-mongo-backup
    docker rm cam-mongo-backup

Template Designer

  • Restoring bundled CAM MariaDB database
    1. Start just the cam-bpd-mariadb database
      kubectl scale -n services deploy cam-bpd-mariadb --replicas=1
    2. It is a good idea to separate the system being restored from the system performing the restore. Start a temporary standalone MariaDB container in the same ICP cluster where CAM is deployed. This container will be used to perform the restore. Note, the port numbers for this command may need to be changed to avoid conflicts.
      docker run -d --name cam-maria-backup -e MYSQL_RANDOM_ROOT_PASSWORD=yes mariadb:10.1.16
    3. Copy the desired backup archive to be restored into the standalone container
      docker cp mariadb_backup.sql cam-maria-backup:/tmp/mariadb_backup.sql
    4. Next, execute the restore command
      MARIA_IP=$(kubectl get -n services svc/cam-bpd-mariadb --no-headers | awk '{print $3}')
      MARIA_USERNAME=$(kubectl -n services get secret cam-secure-values-secret -o yaml | grep mariaDbUsername: | awk '{print $2}' | base64 --decode)
      MARIA_PASSWORD=$(kubectl -n services get secret cam-secure-values-secret -o yaml | grep mariaDbPassword: | awk '{print $2}' | base64 --decode)
      Note: Replace “cam-secure-values-secret” in the above commands with your CAM secret if you use a different name
      docker exec -it cam-maria-backup sh -c "mysql -v --host=$MARIA_IP --port=3306 --user=$MARIA_USERNAME --password=$MARIA_PASSWORD < /tmp/mariadb_backup.sql"
    5. Finally, cleanup the standalone MariaDB container
      docker stop cam-maria-backup
      docker rm cam-maria-backup
  • Restore CAM BPD UI repositories
    1. Start just the cam-bpd-ui service.
      kubectl scale -n services deploy cam-bpd-ui --replicas=1
    2. Copy the desired backup archive to be restored into the BPD UI container
      export BPD_UI=$(kubectl get -n services pods | grep cam-bpd-ui | awk '{print $1}')
      kubectl cp -n services cam-bpd-ui-backup.tgz $BPD_UI:/tmp/
    3. Expand the archive on the container
      kubectl exec -it -n services $BPD_UI -- tar -xvf /tmp/cam-bpd-ui-backup.tgz -C /opt/ibm-ucd-patterns/ workspace repositories
    4. Finally, cleanup the archive on the container.
      kubectl exec -it -n services $BPD_UI -- rm -f /tmp/cam-bpd-ui-backup.tgz

Content Runtime
If you have deployed a Content Runtime, follow the steps here to restore the Chef Server, Software Repository Binaries and Pattern Manager: https://www.ibm.com/support/knowledgecenter/SS2L37_2.1.0.2/content/cam_content_recovery.html

Start CAM
Once all manual restore steps are complete, start CAM.
startCAM.sh

High Availability

Let's discuss building a CAM environment with high availability to address single points of failure in a basic deployment of CAM with a single set of pods. First, it is important that the ICP cluster be created with multiple worker nodes so that multiple instances of the CAM pods can run, with an instance of each CAM pod on each node. For more information about deploying an ICP HA environment see the following ICP knowledge center links:
https://www.ibm.com/support/knowledgecenter/SSBS6K_2.1.0.2/supported_system_config/hardware_reqs.html
https://www.ibm.com/support/knowledgecenter/SSBS6K_2.1.0.2/installing/high_availability.html

CAM is composed of a set of pods or microservices. A CAM pod of each type can run on each worker node. For example, one of the CAM pods is called cam-iaas, which serves the core APIs in CAM. If you deploy ICP with three worker nodes, a cam-iaas pod can run on each of the three worker nodes. Running multiple cam-iaas pods increases the capacity for handling CAM API calls as well as providing high availability of the CAM APIs themselves. If a worker node were to go down, the surviving two worker nodes and their cam-iaas pods could continue to handle the API requests for CAM.

One note here: technically a single worker node can provide pod HA by running more than one instance of a specific CAM pod on that node. This allows CAM to survive the loss of a specific pod, but it does not protect against losing the worker node itself, along with all pods running on it (e.g. the VM or physical hardware failing). It is recommended that CAM pods run on multiple worker nodes to provide high availability that covers the most failure scenarios.

In CAM there are multiple types of pods that handle the front end proxy, API calls, UI calls, service broker, databases and more. Each type of pod can be configured to run on more than one worker node via a "replica count" configuration option at CAM install time. This picture shows the CAM chart being deployed with a replica count of 3 configured for the CAM microservices:

Each type of pod translates to multiple underlying microservices. For example there are multiple UI microservices and multiple API microservices. Configuring a replica count of 3 for APIs sets the number of pods to 3 for all microservices that are used to handle APIs. The types of pods to configure include:

  • Broker
  • Proxy
  • APIs
  • UI
  • Template Designer CDS
  • Template Designer MDS

Today, when a replica count greater than 1 is configured and there are multiple worker nodes, the pods are spread across all available worker nodes. For example, if a "Replica Count" of 3 is configured and there are 3 worker nodes, one pod of that type will be started on each worker node for high availability. For more information on how Kubernetes scheduling works see http://blog.kubernetes.io/2017/03/advanced-scheduling-in-kubernetes.html
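
To verify how the replicas of a given pod type are spread across worker nodes, you can list the pods along with their node assignments (using cam-iaas as an example):

# Show which worker node each cam-iaas pod is scheduled on
kubectl get pods -n services -o wide | grep cam-iaas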

Pods can also be scaled up or down during runtime to address high availability concerns or address performance/load characteristics of CAM. To scale a specific CAM pod during runtime, navigate to the ICP UI under Workloads > Deployments. Find the specific deployment (for example cam-iaas in the screenshot below) and under the action menu select "Scale".

Then enter the number of instances to scale to (3 in the example below) and click "Scale Deployment".
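
The same scaling operation can also be performed from the command line; for example, to scale the cam-iaas deployment to three replicas:

kubectl scale -n services deploy cam-iaas --replicas=3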

By default CAM ships with a single MongoDB database for storing CAM data, a single Redis instance for session management and a single MariaDB for storing template designer data. To support a highly available CAM environment, external clusters of MongoDB, Redis and MariaDB can be set up outside of CAM and then configured for use by the CAM deployment at install time.

Cloud Automation Manager can be configured to use an external MongoDB and an external database for Cloud Automation Manager Template Designer by following the instructions here: https://www.ibm.com/support/knowledgecenter/SS2L37_2.1.0.2/cam_using_external_mongodb.html

Similarly, an external Redis cluster that provides high availability can also be configured for use with CAM by following the instructions here: https://www.ibm.com/support/knowledgecenter/SS2L37_2.1.0.2/cam_using_external_redis.html

Note: Since these external databases and Redis clusters are created and managed outside of CAM, ensure you have an appropriate backup and disaster recovery strategy for these external components in addition to the practices outlined here.

Disaster Recovery

Disaster recovery (DR) typically covers scenarios where a datacenter is no longer available in either a planned or unplanned event, and services must be restored at another site that is geographically separated. Implementations of disaster recovery typically vary based on corporate requirements, locations of data centers, available technology and more.

Common technologies used involve site to site continuous replication of databases or underlying data volumes at either a software or hardware level, and either synchronously (typically shorter distances, no data loss) or asynchronously (typically longer distances, but with some data loss of transactions in flight that were not fully replicated). A more basic approach may involve taking regular backups and transferring copies of those backups to a remote site for recovery.

More information on disaster recovery and general practices can be found here: https://en.wikipedia.org/wiki/Disaster_recovery

Replicating CAM data to a backup site is one part of the DR story, but in the event of a disaster (or a disaster recovery test) a new CAM environment must also be installed that uses the replicated data. There are two key considerations with regard to a backup CAM environment:

  1. The workloads deployed and managed by CAM in target cloud environments need to be replicated and/or backed up alongside CAM to the backup site. For example, if you have a target VMware environment, that environment and its workloads must be replicated or backed up to the backup site outside of CAM. CAM itself does not contain the deployed workloads or the data inside them; only the data about how to manage the deployed workloads resides within CAM.
  2. The backup CAM instance must have equivalent network access to target cloud environments. For example if you have a VMware vSphere target at the primary datacenter that has deployments, the VMware vSphere environment and deployments must be network accessible in the backup datacenter as they were in the primary datacenter. In this way existing cloud connections in CAM will still be valid and deployments from CAM can be managed by the backup site's CAM. As another example if a public cloud environment like IBM Cloud is used, equivalent network access to IBM Cloud must also be in place for the backup CAM instance to continue to deploy and manage existing IBM Cloud deployments.

As was discussed in the earlier backup/restore section, CAM stores its data in four persistent volumes. These volumes hold the CAM data and logs that need to be replicated to a backup site. These four volumes should also be placed in a consistency group so that data is transferred across all four volumes in the order it is being modified.

Replicated data can then be used with a new install of CAM at the backup site during an unplanned event (e.g. a natural disaster), a planned event (e.g. a planned shutdown of a datacenter) or for a test of disaster recovery practices.

In any of these DR scenarios, the restore procedure documented above can be used to restore the replicated data into a newly deployed instance of CAM.

For an unplanned event (or test of an unplanned event):

  1. Stop replication and take a consistent snapshot of all four replicated CAM volumes at the backup site
  2. Create four new volumes and populate them with the data from the consistent snapshot
  3. Create four PVs that reference the volumes with restored CAM data
  4. Install CAM and reference these four PVs during the CAM chart install

For a planned event (or test of a planned event):

  1. Stop CAM first to prevent any transactions in flight from being lost.
  2. Take a consistent snapshot of all four replicated CAM volumes at the backup site
  3. Create four new volumes and populate them with the data from the consistent snapshot
  4. Create four PVs that reference the volumes with restored CAM data (see the example PV definition after this list)
  5. Install CAM and reference these four PVs during the CAM chart install
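
As a sketch of the "create four PVs" step in the lists above, the following shows one possible NFS-backed PersistentVolume definition for the restored MongoDB volume. The NFS server address, export path and access mode here are assumptions and must be adjusted to your storage environment; the same pattern is repeated for the other three volumes.

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: cam-mongo-pv
spec:
  capacity:
    storage: 15Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    # Assumption: the restored MongoDB data was placed on this NFS export
    server: 192.0.2.10
    path: /export/CAM_db
EOF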

Final Thoughts

This article provides strategies and procedures for implementing Backup/Restore, High Availability and Disaster Recovery solutions for CAM instances. Implementations may vary depending on your specific requirements and available technologies, but many of the underlying principles around CAM availability and data protection remain the same. Understanding these concepts will allow you to build a robust instance of CAM for deploying and managing production cloud applications.

Special thanks to Jeffrey Luo and Partha Kaushik for their assistance verifying the content of this blog.
