Securely access your data in Apache Spark with Keep Your Own Key

A cloud hardware security module (HSM) is a machine with specialized cryptographic processors or accelerators that makes cryptography both fast and secure. Cloud HSMs are widely available, so how can we leverage them in our applications?

The IBM Cloud Hyper Protect Crypto Services are built on the industry’s first and only FIPS 140-2 Level 4 certified HSM available in the public cloud.

By using a cloud HSM in our enterprise applications, the dedicated cryptographic co-processor handles all cryptographic operations, which makes tampering and cryptanalysis attacks less likely. Additionally, cloud HSMs make cryptography considerably faster by offloading ciphering and deciphering from the CPU.

IBM Cloud has two unique cloud HSM offerings:

  • IBM Key Protect for IBM Cloud is a key management service (KMS) backed by a cloud HSM instance to securely manage keys. Key Protect is a pay-as-you-go service that charges per API call to the KMS.

  • IBM Cloud Hyper Protect Crypto Services is also a key management service, but it lets you provision a dedicated instance of a cloud HSM on the cloud. The Keep Your Own Key (KYOK) capability provides exclusive access to cloud-based cryptographic HSMs.

Both offerings integrate well with the other IBM Cloud services, such as Red Hat OpenShift on IBM Cloud, IBM Cloud Object Storage, and so on.

Another benefit of having dedicated hardware for cryptography is that the keys are generated and stored on the dedicated crypto device itself, which keeps them very safe. The device can detect any attempt to tamper with it, including physical damage. If tampering is detected, the device automatically erases the keys, making any data that was encrypted with them inaccessible.

Because a cloud HSM stores keys securely and reliably, we can protect big data processing systems such as Apache Spark with a single key. All data encryption and all authentication can be handled by a single key, or a group of keys, managed by a cloud HSM instance.

Spark takes advantage of the IBM KYOK support for Red Hat OpenShift. The following diagram shows an overview of data encryption in OpenShift using a customer root key (CRK) that you provide and that is managed by a KMS instance. Whether you use IBM Cloud Block Storage or IBM Cloud Object Storage, the data is encrypted by the CRKs and the wrapped data encryption keys (DEKs) derived from them. (You can read more about envelope encryption in the Key Protect documentation.)
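To make envelope encryption concrete, here is a minimal local sketch using OpenSSL. The file names are illustrative, and a second local key stands in for the HSM-held root key; in KYOK, the wrap step happens inside the HSM, and the root key never leaves it:

```shell
# Envelope encryption, sketched locally (illustration only).
# 1. Generate a data encryption key (DEK) and encrypt the data with it.
echo "sensitive records" > data.txt
openssl rand -hex 32 > dek.key
openssl enc -aes-256-cbc -pbkdf2 -in data.txt -out data.enc -pass file:dek.key

# 2. "Wrap" the DEK with a root key. In KYOK, this step runs inside the
#    HSM with your customer root key (CRK); a local key stands in here.
openssl rand -hex 32 > root.key
openssl enc -aes-256-cbc -pbkdf2 -in dek.key -out dek.wrapped -pass file:root.key

# Only the wrapped DEK and the ciphertext need to be stored; without the
# root key, the DEK (and therefore the data) cannot be recovered.
rm dek.key
```

Unwrapping reverses the second step: decrypt the wrapped DEK with the root key, then decrypt the data with the recovered DEK.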

Architectural overview of data encryption of Apache Spark data in a Red Hat OpenShift cluster on IBM Cloud

You can see a more detailed architecture diagram showing cluster encryption for Red Hat OpenShift on IBM Cloud in the documentation.

Because Spark takes advantage of the KYOK setup in Red Hat OpenShift on IBM Cloud, you do not need to provide additional JARs or libraries to make it work, and users do not need to do any additional setup or configuration. However, it is possible to use the API provided by the KMS service to create applications that use KYOK with more control over the data and who can access it. Both data at rest and data in use can be encrypted with the user-provided keys. Learn more about securing cluster workloads in the Red Hat OpenShift on IBM Cloud docs. For encrypting data in use, review the IBM Cloud Data Shield documentation.
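As a sketch of that API route: a Key Protect-style KMS exposes wrap and unwrap actions on a root key, so an application can ask the HSM to wrap its data encryption keys without the root key ever leaving the HSM. The exact endpoint shape, region host, and IDs below are assumptions for illustration; check the service's API reference for the precise contract:

```shell
# Assumed Key Protect-style wrap call (region, key ID, instance ID, and
# IAM token are placeholders you would supply from your own account).
# The plaintext DEK is sent base64-encoded; the response carries the wrapped DEK.
curl -X POST \
  "https://us-south.kms.cloud.ibm.com/api/v2/keys/$KEY_ID/actions/wrap" \
  -H "Authorization: Bearer $IAM_TOKEN" \
  -H "Bluemix-Instance: $INSTANCE_ID" \
  -H "Content-Type: application/json" \
  -d '{"plaintext": "'"$(openssl rand 32 | base64)"'"}'
```

Storing only the wrapped DEK alongside your data is what lets you later revoke access by deleting or rotating the root key.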

It is a common practice to use an object store for accessing and storing data to be processed. The entire data processing pipeline should be KYOK enabled, not just one component; otherwise, the purpose of KYOK-enabled encryption is defeated. IBM Cloud Object Storage integrates with Key Protect and Hyper Protect Crypto Services using a KMS interface, which means we can use KYOK encryption for accessing and storing data in IBM Cloud Object Storage.
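When a Spark job reads a KYOK-protected bucket, the connector configuration might look like the following. This sketch assumes the Stocator COS connector and its `fs.cos.<service>.*` configuration keys; the endpoint, bucket, class, JAR path, and service name `myservice` are placeholders, not verbatim settings:

```shell
# Hypothetical submission reading from IBM Cloud Object Storage via Stocator.
# Replace the endpoint, API key, class, and JAR with your own values.
bin/spark-submit \
 --master k8s://<master-URL> \
 --deploy-mode cluster \
 --name cos-job \
 --class com.example.MyCosJob \
 -c spark.hadoop.fs.cos.myservice.endpoint=https://s3.us-south.cloud-object-storage.appdomain.cloud \
 -c spark.hadoop.fs.cos.myservice.iam.api.key=$COS_API_KEY \
 local:///opt/spark/jars/my-cos-job.jar cos://my-bucket.myservice/input/
```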

In this tutorial, I’ll show you how to use the KYOK encryption capabilities for large scale data processing engines such as Spark.



Step 1: Enabling the KMS service for your Red Hat OpenShift cluster on IBM Cloud

  1. List the KMS instances that you are running currently: ibmcloud oc kms instance ls. You’ll see output similar to this (the IDs listed here are not real).

    Output of running step 1-1 command

  2. Get the root key from your KMS instance: ibmcloud oc kms crk ls --instance-id 21fe0624-UUID-UUID-UUID-4c9c5404f4c8.

    Output of running step 1-2 command

  3. Enable your KMS provider on your cluster. First you need to locate the cluster ID of your already deployed OpenShift cluster instance: ibmcloud oc cluster ls

    Output of running step 1-3.1 command

    In my sample output, abcd123456789 is my cluster ID. Next, you need to enable the KMS on your OpenShift cluster, specifying the cluster ID, KMS instance ID, and the root key that we gathered earlier: ibmcloud oc kms enable -c abcd123456789 --instance-id 21fe0624-UUID-UUID-UUID-4c9c5404f4c8 --crk f1328360-UUID-UUID-UUID-ed0f3526a8a4

    It will take a while for this update to take effect.

    Output of running step 1-3.2 command

  4. Verify that the update was successful:

    ibmcloud oc cluster get -c abcd123456789 | grep -e "Master Status:"

    You should see a master status such as Ready (1 week ago).

    Output of running step 1-3.3 command

    After the master status appears as Ready, you can verify that your cluster secrets are encrypted by querying information that is in etcd in the master.

Step 2: Deploying an Apache Spark application to Red Hat OpenShift cluster on IBM Cloud

  1. Log in to OpenShift: oc login.

  2. Create a service account with cluster roles to be used with Spark jobs or deployments:

    Output of running step 2-2 command
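    If you want to run that step from the command line, the Spark-on-Kubernetes documentation suggests commands along these lines; the account name `spark` and the `default` namespace are assumptions you can adjust:

```shell
# Create a service account for the Spark driver
kubectl create serviceaccount spark
# Grant it the edit cluster role so it can create executor pods
kubectl create clusterrolebinding spark-role \
  --clusterrole=edit \
  --serviceaccount=default:spark \
  --namespace=default
```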

  3. Locate the master URL: kubectl cluster-info. You’ll see output similar to the following:

    Output of running step 2-3 command

  4. Go to the Spark release directory. We need a Spark container image to deploy the Spark job through the Kubernetes interface. To build your own Spark image, issue these commands:

     cd spark-release
     bin/docker-image-tool.sh -r my-repo -t v3.0.0 build

    Here, my-repo is a Docker public repository, but you can set up your own private or public repo using OpenShift; you can learn how to do that in the OpenShift documentation.

    Now, issue this command:

    bin/docker-image-tool.sh -r my-repo -t v3.0.0 push

    Once the image is ready, submit the Spark job as follows.

            bin/spark-submit \
             --master k8s://<master-URL> \
             --deploy-mode cluster \
             --name spark-pi \
             --class org.apache.spark.examples.SparkPi \
             -c spark.kubernetes.authenticate.driver.serviceAccountName=spark \
             -c spark.kubernetes.container.image=my-repo/spark:v3.0.0 \
             local:///opt/spark/examples/jars/spark-examples_2.12-3.0.0.jar

    Here, `<master-URL>` is the master URL that you located in the previous step, and the final argument is the SparkPi example JAR that ships inside the Spark image.

    Watch for the deployed driver:

    Output of running step 2-4 command
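    One way to watch the driver from the command line, assuming Spark's standard `spark-role` pod label:

```shell
# List Spark driver pods (Spark labels driver pods with spark-role=driver)
kubectl get pods -l spark-role=driver
# Follow the logs of the most recently created driver pod
kubectl logs -f "$(kubectl get pods -l spark-role=driver \
  -o name --sort-by=.metadata.creationTimestamp | tail -1)"
```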

The example that I’ve used in this tutorial ships with the Spark distribution, but you can write your own Spark application and submit it by following the Spark documentation. You can use the provided Java sample for accessing IBM Cloud Object Storage with Spark. Download the file, unzip it, and follow the instructions in the file to deploy your own code using IBM Cloud Object Storage. More samples are available in my Spark-Templates GitHub repo.

Summary and next steps

In this tutorial, you learned about the importance of using the KYOK capability to securely access data and how you can use KYOK to run a big data workload using Spark.

Learn how to create a root key and enable KMS encryption in Kubernetes in this tutorial.