
Notebook Experience on IBM Open Platform - Hadoop Dev

Technical Blog Post


Abstract

Notebook Experience on IBM Open Platform - Hadoop Dev

Body

We are providing a Notebook service to data scientists on the IBM Open Platform (IOP) for Hadoop. The Jupyter Notebook has gained a lot of popularity since it evolved from the earlier IPython Notebook project; the IPython project dates back to 2001, making it one of the longest-lived efforts in this space. The original Jupyter Notebook is intended for a single user and offers very limited security. In order to gain scalability, let multiple end users in a large organization leverage and share the computation cycles of a cluster, and secure the Notebook service for enterprise usage, IBM Open Platform (IOP) provides a solution that powers up the original Jupyter Notebook with our IOP Hadoop/Spark cluster. Below is the architectural diagram that explains the solution:

The end users use their browsers to connect to their Jupyter Notebook servers, which reside outside of the IOP Hadoop/Spark cluster. Inside the IOP cluster, we offer a gateway, the Jupyter Kernel Gateway (JKG), to handle requests from multiple Jupyter servers. The Jupyter Kernel Gateway spawns Jupyter kernels to launch Spark jobs. Currently, we offer two Jupyter kernels, supporting the Python and Scala languages. The Python kernel is based on the default IPython kernel that comes with the Jupyter Kernel Gateway, and the Scala kernel is an open-source Jupyter kernel developed by IBM named Toree. The Jupyter kernel, either IPython or Toree, serves as the Spark driver that launches Spark jobs on the IOP cluster. The Spark jobs run in YARN containers, which is the Spark execution model on the IOP cluster.
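To make the flow concrete, here is a minimal sketch of what a notebook cell might look like once a Python kernel has started behind the Jupyter Kernel Gateway and is acting as the Spark driver. It assumes the kernel already exposes a SparkContext named sc; the variable name and the commented HDFS path are illustrative assumptions, not values defined by IOP:

    # Minimal sketch of a notebook cell: the Python kernel is the Spark driver,
    # and the work below is executed by Spark executors in YARN containers.
    # Assumes the kernel's Spark configuration already provides a SparkContext `sc`.

    # Distribute a small dataset and run a simple transformation/action pair
    # to confirm that executors on the cluster respond.
    rdd = sc.parallelize(range(1000), numSlices=8)
    squared_sum = rdd.map(lambda x: x * x).sum()
    print("Sum of squares computed on YARN executors:", squared_sum)

    # Hypothetical read of a file already stored on HDFS; the path is a placeholder.
    # lines = sc.textFile("hdfs:///user/notebook/sample.txt")
    # print(lines.count())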

With our Jupyter Notebook solution, end users install their own Jupyter server outside of the IOP cluster and connect to the Jupyter Kernel Gateway, which sits inside the IOP cluster behind the Knox security gateway. The Jupyter server can be obtained for free from the open-source Jupyter project, and the IBM DSX team offers an enterprise Jupyter server based on the open-source version. Multiple end users can each run their own Jupyter server, all connected to the same Jupyter Kernel Gateway on IOP. User authentication is granted by the IOP Knox security gateway, and each end user's notebooks are kept on their own Jupyter Notebook server.
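As a quick illustration of the client side, the Jupyter Kernel Gateway exposes the standard Jupyter REST endpoints (for example /api/kernelspecs), so an end user can verify that their credentials and the Knox route work before wiring up a full Jupyter server. The host name, port, Knox topology path, and credentials below are placeholders for your own deployment, not fixed IOP values:

    # Sketch: check connectivity to the Jupyter Kernel Gateway through Knox.
    import requests

    KNOX_GATEWAY = "https://knox-host.example.com:8443/gateway/default"  # placeholder URL
    JKG_BASE = KNOX_GATEWAY + "/jkg"                                     # placeholder topology path

    # Knox validates these end-user credentials before forwarding the request.
    auth = ("enduser", "password")

    # List the kernel specs the gateway offers (for example Python and Scala/Toree).
    # verify=False only makes sense if the Knox certificate is self-signed in a test setup.
    resp = requests.get(JKG_BASE + "/api/kernelspecs", auth=auth, verify=False)
    resp.raise_for_status()
    for name in resp.json().get("kernelspecs", {}):
        print("Available kernel:", name)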

In IOP 4.3, we offer the first release of this solution. There are still some limitations that we plan to address in phases in future releases:
Multiple end users can use their own credentials when connecting to the Jupyter Notebook service on the IOP cluster, and those credentials are checked by Knox. However, inside the cluster, a single shared cluster account runs and manages all the Spark jobs created by each notebook. From the IOP cluster's perspective, one Notebook user runs everyone's notebook jobs, which means that all Spark jobs launched by the Notebook service run under this single cluster account. If end users need to check the logs of a Spark job, they all need to share the same Notebook cluster account. In a future release, we will implement impersonation for the Notebook service account on the IOP cluster in order to provide user account isolation.

The shared cluster Notebook user account is an ordinary user account on the cluster, not a service account or super user. This also means that the user will need to renew the Kerberos ticket manually, or using a piece of code, if a notebook runs longer than the default Kerberos ticket lifetime. Let's call this Time Window A. On the other hand, we also expose a tuning parameter so that any notebook that has not been active for Time Window B is killed and releases its cluster resources. In most cases, Time Window A should be larger than Time Window B, so the Kerberos ticket should not need to be renewed often by the user.
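As a rough sketch of the "renew using a piece of code" approach, a long-running notebook can periodically shell out to kinit with a keytab well before Time Window A expires. The keytab path, principal, and interval below are illustrative assumptions, not values shipped with IOP:

    # Sketch: periodically renew the Kerberos ticket for the shared Notebook account
    # from inside a long-running notebook.
    import subprocess
    import threading

    KEYTAB = "/home/notebook/notebook.keytab"   # placeholder keytab path
    PRINCIPAL = "notebook@EXAMPLE.COM"          # placeholder principal
    RENEW_EVERY_SECONDS = 6 * 60 * 60           # renew well inside Time Window A

    def renew_ticket():
        # kinit with a keytab obtains a fresh ticket without prompting for a password.
        subprocess.check_call(["kinit", "-kt", KEYTAB, PRINCIPAL])
        # Schedule the next renewal.
        threading.Timer(RENEW_EVERY_SECONDS, renew_ticket).start()

    renew_ticket()

Note that this only keeps the shared account's ticket fresh; it does not change the single-account limitation described above.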

In summary, we let multiple users connect to the IOP cluster and power up their Jupyter Notebooks with large-scale computation. We provide user authentication via our Knox security gateway, and inside the cluster we provide Kerberos authentication so that Spark jobs can access the Kerberized cluster.

[{"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSCRJT","label":"IBM Db2 Big SQL"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

UID

ibm16260021