Download data sets from the Red Hat Marketplace and mount them on OpenShift

In this tutorial, learn how to find data sets on the Red Hat Marketplace, then set up Jupyter Notebooks on Red Hat OpenShift. In this example, you use the IBM Debater Sentiment Composition Lexicons data set.

The following image shows the OpenShift architecture diagram. The data set is stored in a Persistent Volume Claim (PVC), and the Jupyter Notebook image is in a pod on OpenShift. Notice how the PVC and pod are not connected. In this tutorial, you learn how to mount the PVC to the pod to access the data.

OpenShift architecture

The following definitions help you understand the diagram better.

  • Pod: One or more containers deployed together on a single host, roughly what a machine instance is to a container. The Jupyter Notebook image runs in a pod.
  • Deployment configuration: Acts like a pod template and describes how the pod should be deployed. Essentially, it configures how to start the Jupyter Notebook image.
  • Persistent Volume Claim (PVC): A request for storage that holds the data set that you download from Red Hat Marketplace. The PVC requests persistent volume (PV) resources without specific knowledge of the underlying storage infrastructure, so it is essentially claiming storage space from a persistent volume.

Prerequisites

To follow this tutorial, you need:

  • A Red Hat Marketplace account.
  • An OpenShift cluster.
  • The OpenShift CLI. Follow the Configure your OpenShift cluster with Red Hat Marketplace tutorial to complete the prerequisites.

    • At the end of step 1, make sure to add the oc binary file to your PATH. For example, for Mac users:

      mv /<filepath>/oc /usr/local/bin/oc
      

      If additional help is needed to set up the OpenShift CLI, look at this documentation.

    • You can skip step 2.
    • In step 3, name your project whatever you like.
  • The Helm package manager, which is needed to mount the data set to OpenShift.
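
Before you continue, it can help to confirm that both command-line tools are reachable from your terminal. The following is a minimal sketch, not part of the original tutorial: check_tool is a hypothetical helper name, and it assumes the binaries are installed under their usual names, oc and helm.

```shell
# Report whether a command-line tool can be found on the PATH.
# check_tool is a hypothetical helper name used only for this sketch.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}
```

For example, running check_tool oc && check_tool helm stops at the first tool that is not on your PATH.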

Estimated time

It should take you approximately 45 minutes to complete this tutorial.

Steps

The prerequisite tutorial explained the OpenShift web console and the OpenShift CLI. You’ll use both methods to access your OpenShift cluster, so keep the following differences in mind:

  • OpenShift command line interface (CLI): This is accessed with the oc command on your terminal.
  • OpenShift web console: This is accessed on a web browser. To access the console:

    1. On the upper-left corner of IBM Cloud, click the hamburger icon (navigation menu), then click OpenShift.
    2. Click on your cluster, then go to the Overview page of the cluster. It should look similar to the following image after it loads.
    3. Click OpenShift web console.

      Overview console

    4. You are directed to the Red Hat OpenShift web console.

      Web console
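
Because the remaining steps switch between the web console and the CLI, it is worth checking which user, server, and project your terminal session points at. The sketch below is an illustration under the assumption that you have already logged in with oc login; show_oc_session is a hypothetical helper name.

```shell
# Print the user, API server, and project of the current oc session.
# Returns non-zero if no session is active.
# show_oc_session is a hypothetical helper name used only for this sketch.
show_oc_session() {
  if ! oc whoami >/dev/null 2>&1; then
    echo "not logged in; run oc login first" >&2
    return 1
  fi
  echo "user:    $(oc whoami)"
  echo "server:  $(oc whoami --show-server)"
  echo "project: $(oc project -q)"
}
```

If the project that is printed is not the one you created in the prerequisite tutorial, switch to it with oc project before you continue.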

Step 1. Download the data set from Red Hat Marketplace

  1. Go to the Red Hat Marketplace and log in.
  2. Search for and click IBM Debater Sentiment Composition Lexicons.
  3. Select Get it free, then choose OpenShift as the download location. You should see something similar to the following image.

    Download image

  4. Recall that you already have the OpenShift CLI and an OpenShift cluster. You should already be logged in to your OpenShift CLI. If not, use something like the following command (see step 4 in the prerequisite tutorial):

     oc login --token=<TOKEN> --server=<URL>
    
  5. Switch to the project you created in step 3 of the prerequisite tutorial.

     oc project <project_name>
    
  6. Follow the Steps to mount a storage object. When you “Mount to OpenShift” and are asked to choose a namespace, use the project you just switched to. You can skip step 4: Connect to application.

    Mounting a storage object

    If everything worked, you should see:

    Success

  7. You have mounted the data set to OpenShift, and it will be stored in a PVC, which is storage on OpenShift. Save the PVC name that is returned because you will use it later. It should be similar to the following name.

     rhm-dl-rhmccp-4e7ceec1-7a48-492c-9639-7ffb2d4f6f6e-pvc
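
If you lose track of the PVC name, you can usually recover it from the CLI. The following sketch lists the PVCs in the current project and keeps those that look like the example name. The rhm-dl-...-pvc pattern is an assumption inferred from that single example, and both function names are hypothetical.

```shell
# Succeed if the name has the rhm-dl-...-pvc shape of the example name.
# Assumption: the pattern is inferred from one example and might not hold for every data set.
looks_like_marketplace_pvc() {
  case "$1" in
    rhm-dl-?*-pvc) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the names of PVCs in the current project that match the pattern.
find_marketplace_pvcs() {
  oc get pvc -o name | while IFS= read -r resource; do
    name=${resource#*/}   # strip the "persistentvolumeclaim/" prefix
    if looks_like_marketplace_pvc "$name"; then
      echo "$name"
    fi
  done
}
```

You could then keep the name for later steps with something like PVC_NAME=$(find_marketplace_pvcs | head -n 1).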
    

Step 2. Create Jupyter Notebook image

  1. In the OpenShift CLI, make sure that you are in the correct project.

     oc project <project_name>
    
  2. To run a Jupyter Notebook with OpenShift, you must build a template image. In this tutorial, you use the Source-to-Image (S2I) build process to create a minimal Jupyter Notebook image. Using this S2I, you can create other Jupyter Notebooks. First, using the OpenShift CLI, create the minimal notebook.

     oc create -f https://raw.githubusercontent.com/jupyter-on-openshift/jupyter-notebooks/master/build-configs/s2i-minimal-notebook.json
    
  3. You can follow the progress of creating the notebook (this might take a few minutes).

     oc logs --follow bc/s2i-minimal-notebook-py36
    
  4. When complete, check that the minimal notebook was created.

     oc describe imagestream s2i-minimal-notebook
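
If you prefer a scripted check over reading the describe output, the sketch below polls until an image stream tag exists. It is an illustration only: the function names, the default of 30 attempts, and the 10-second delay are assumptions, and the tag s2i-minimal-notebook:3.6 is the one that the next step refers to.

```shell
# Succeed if the given image stream tag exists in the current project.
image_tag_ready() {
  oc get istag "$1" >/dev/null 2>&1
}

# Poll until the tag exists or the attempts run out.
# $1 = image stream tag, $2 = attempts (default 30), $3 = seconds between attempts (default 10)
wait_for_image_tag() {
  tag=$1
  attempts=${2:-30}
  delay=${3:-10}
  while [ "$attempts" -gt 0 ]; do
    if image_tag_ready "$tag"; then
      echo "ready: $tag"
      return 0
    fi
    attempts=$((attempts - 1))
    sleep "$delay"
  done
  echo "timed out waiting for $tag" >&2
  return 1
}
```

For example, wait_for_image_tag s2i-minimal-notebook:3.6 returns as soon as the build has pushed that tag.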
    

Step 3. Create Jupyter Notebook template

  1. Download a Jupyter Notebook template to more easily deploy notebooks. The template automatically sets deployment configurations and uses the s2i-minimal-notebook:3.6 image that you just created. Use the notebook-deployer template.

     oc create -f https://raw.githubusercontent.com/jupyter-on-openshift/jupyter-notebooks/master/templates/notebook-deployer.json
    

    If you want all templates, use the following commands.

     oc create -f https://raw.githubusercontent.com/jupyter-on-openshift/jupyter-notebooks/master/templates/notebook-deployer.json
     oc create -f https://raw.githubusercontent.com/jupyter-on-openshift/jupyter-notebooks/master/templates/notebook-builder.json
     oc create -f https://raw.githubusercontent.com/jupyter-on-openshift/jupyter-notebooks/master/templates/notebook-quickstart.json
     oc create -f https://raw.githubusercontent.com/jupyter-on-openshift/jupyter-notebooks/master/templates/notebook-workspace.json
    
  2. On the OpenShift web console, refresh the page. Make sure that you are in the Developer perspective by checking the drop-down menu at the upper left. Click +Add, then click From Catalog, clear the filters, and search for Jupyter Notebook.

    Template catalog

  3. Create the Jupyter Notebook by selecting the deployer notebook.

    Jupyter notebook template

  4. Instantiate the template. Everything should be the default. In this example, you use dax-sentiment-notebook as the APPLICATION_NAME. The NOTEBOOK_PASSWORD is used to access the Jupyter Notebook.

    Template

  5. After you click Create, a new pod and deployment configuration (DC) should be created. If you click the circle, you can also see that 1 pod (dax-sentiment-notebook-[ID]) is running.

    DC and pod

  6. (OPTIONAL) Test that you can now launch the Jupyter Notebook. After your pod is created (it might take a few minutes), you can check whether the Jupyter Notebook instance was launched correctly. In the OpenShift web console, select Developer in the drop-down menu at the upper left. Select Topology, and click the dax-sentiment-notebook that you just created.

    Launch Jupyter

  7. Under Routes, click the location link to launch the Jupyter Notebook. A new tab opens. You might need to enter the NOTEBOOK_PASSWORD that you set previously.

    Password window

The Notebook list should be empty.

Empty Notebook list
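
The web console steps above can also be approximated from the CLI. The following is a minimal sketch rather than the tutorial's own method: deploy_notebook is a hypothetical helper, and it sets only the two template parameters that this tutorial names (APPLICATION_NAME and NOTEBOOK_PASSWORD), leaving everything else at the template defaults.

```shell
# Instantiate the notebook-deployer template from the CLI.
# $1 = application name, $2 = notebook password
# deploy_notebook is a hypothetical helper name used only for this sketch.
deploy_notebook() {
  oc new-app --template notebook-deployer \
    --param APPLICATION_NAME="$1" \
    --param NOTEBOOK_PASSWORD="$2"
}
```

For example, deploy_notebook dax-sentiment-notebook <password> creates the same application name that this tutorial uses. Because the deployment configuration takes the application name, oc rollout status dc/dax-sentiment-notebook should then report when the pod is running.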

Step 4. Connect PVC to pod

Remember the OpenShift architecture diagram at the beginning of this tutorial where the PVC and the pod are not connected? In other words, you set up the Jupyter Notebook (pod), but you cannot access any of the data (PVC) from it. This step allows the pod to access the PVC.

  1. The deployment configuration object for the pod is named dax-sentiment-notebook, and the data is stored in the PVC named rhm-dl-rhmccp-4e7ceec1-7a48-492c-9639-7ffb2d4f6f6e-pvc (from step 1). Run the following command to connect the pod and PVC. You might have to update the claim-name with your PVC name:

    oc set volume dc/dax-sentiment-notebook --add -t='persistentVolumeClaim' --mount-path=/data --claim-name=rhm-dl-rhmccp-4e7ceec1-7a48-492c-9639-7ffb2d4f6f6e-pvc
    

    To understand the command in more detail:

    • oc set volume: The object you are adding a volume to is the DeploymentConfig (dc) object, dax-sentiment-notebook.
    • --add -t: To this object, you are adding a persistentVolumeClaim (PVC).
    • --mount-path: The location you are mounting the PVC to is /data. You can choose any path, but this tutorial uses /data.
    • --claim-name: The name of the PVC you are mounting is rhm-dl-rhmccp-[UNIQUE_ID]-pvc.

      For more information about these commands see the OpenShift documentation.

  2. Verify that the PVC is mounted to the pod. From the output, you can confirm that the data is available in your pod under the /data directory. The command:

     oc set volume dc --all
    

    should return:

     dax-sentiment-notebook pvc/rhm-dl-rhmccp-4e7ceec1-7a48-492c-9639-7ffb2d4f6f6e-pvc (allocated 8GiB) as volume-kv6dh mounted at /data
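
Because the PVC name differs for every download, it can be convenient to wrap the mount command in a small function and to read the mount path back from the verification output. The sketch below assumes the same flags as the command above. Both function names are hypothetical, and the parsing relies on the "mounted at" wording of the sample output, which might differ between oc versions.

```shell
# Mount a PVC into the pods of a DeploymentConfig.
# $1 = DeploymentConfig name, $2 = PVC name, $3 = mount path (default /data)
mount_pvc() {
  oc set volume "dc/$1" --add -t=persistentVolumeClaim \
    --mount-path="${3:-/data}" --claim-name="$2"
}

# Extract the mount path from one line of "oc set volume dc --all" output.
# Assumption: the line ends with "mounted at <path>", as in the sample output.
mount_path_of() {
  case "$1" in
    *" mounted at "*) echo "${1##* mounted at }" ;;
    *) return 1 ;;
  esac
}
```

For example, mount_pvc dax-sentiment-notebook "$PVC_NAME" mounts the claim at /data, where PVC_NAME is a variable that you set to the name you saved in step 1.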
    

Step 5. Putting it all together

  1. In the OpenShift web console, click Pod dax-sentiment-notebook. Under Pods, click dax-sentiment-notebook-[ID].

    Launch Jupyter terminal

  2. Select the Terminal tab, and type ls /data. You should see the sentiment-composition-lexicons.tar file listed.

    Verifying data

  3. Extract sentiment-composition-lexicons.tar into the current directory.

     tar -xvf /data/sentiment-composition-lexicons.tar
    

    If you look at the directory now, there will be several files. The command:

     ls
    

    returns:

     ADJECTIVES.xlsx  LEXICON_UG.txt  ReleaseNotes.txt       
     LEXICON_BG.txt   LICENSE         SEMANTIC_CLASSES.xlsx
    
  4. Launch the dax-sentiment-notebook on the OpenShift web console.

    Launch Jupyter

You should see all of the files. Now, you can create a new notebook and start using the data set.

Jupyter with data
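
If some files do not show up, you can check the extraction from the pod terminal. The sketch below compares a directory against the six file names from the ls output in step 5. The function name is hypothetical, and the file list is taken from this tutorial's output, so it might not match other versions of the data set.

```shell
# Print each expected data set file that is missing from a directory.
# $1 = directory to check (default: current directory)
# Returns non-zero if anything is missing.
missing_lexicon_files() {
  dir=${1:-.}
  status=0
  for f in ADJECTIVES.xlsx LEXICON_UG.txt LEXICON_BG.txt \
           SEMANTIC_CLASSES.xlsx ReleaseNotes.txt LICENSE; do
    if [ ! -f "$dir/$f" ]; then
      echo "$f"
      status=1
    fi
  done
  return "$status"
}
```

Running missing_lexicon_files with no arguments in the directory where you extracted the archive prints nothing when all six files are present.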

Summary

This tutorial explained how to launch an OpenShift cluster from IBM Cloud, how to download a data set from Red Hat Marketplace, and how to mount it to your OpenShift cluster as a PVC. You also learned how to create a pod that runs the Jupyter Notebook image on OpenShift and how to connect that pod to the PVC.