We’ve all seen the classic Seinfeld episode in which Kramer lands the line “These pretzels are making me thirsty!” in an upcoming Woody Allen movie. But, in the realm of big data analytics and Jupyter Notebooks, that’s exactly what a server might say relative to its resources (or lack thereof). You see, Notebook kernel processes are the heart and soul of the Notebook application and are responsible for the submission of potentially extreme and resource-intensive operations against vast amounts of data, particularly in big data analytics. Running these resource-thirsty processes on the same node results in a parched server struggling to meet the needs of subsequent kernels. This is hardly a recipe for success within an enterprise servicing dozens or even hundreds of data scientists simultaneously striving to unlock the secrets within their data sets.
Quench your thirst with Jupyter Enterprise Gateway
Analyzing big data demands significant resources, primarily processors and memory. Often, these resources are spread across units of execution, or workers, which produce results that are then accumulated by a driver, or master, process to accomplish the task in an optimized manner. These processes are coordinated by resource managers that monitor the task, adding and subtracting workers along the way to ensure the task’s optimal completion. Examples of resource managers used in big data analytics include Apache Hadoop YARN, Kubernetes, and IBM Spectrum Conductor, to name a few.
The driver process typically consumes the most resources (primarily memory) because it’s responsible for accumulating the results from each of the workers. As a result, when all driver processes reside on the same node, that node quickly becomes a bottleneck for the entire cluster. With respect to Jupyter Notebooks, the driver process, in this case, is the Notebook kernel. Therefore, the number of kernels that can be supported to perform analytics is throttled by the capacity of a single node. However, with Jupyter Enterprise Gateway, the true power of your compute cluster can be utilized because it uses the optimized scheduling of the corresponding resource manager to determine on which node within the compute cluster the kernel process should reside. This eliminates the bottleneck inherent when all kernels are co-located on the same node, leaving the gateway node unhindered. Also, by relying on the resource manager’s algorithm, the kernel (i.e., the driver process) is positioned to succeed in an optimized fashion.
Sounds great, but how? Pluggable process proxies
Jupyter Enterprise Gateway abstracts the notion of the kernel process by way of a process proxy class hierarchy. It is through this abstraction that resource manager-specific process proxies can be implemented and plugged into the Enterprise Gateway run-time. Out of the box, three concrete process proxy classes are provided: LocalProcessProxy (the default), YarnClusterProcessProxy, and DistributedProcessProxy, the last two of which derive from the abstract base class RemoteProcessProxy.
The process proxy is configured in the kernel’s kernel.json file (also known as the kernelspec), which is traditionally used to specify the kernel’s run-time environment and launch argument vector. By extending the class that reads the kernelspec to understand a process_proxy stanza, Enterprise Gateway can associate the target kernel with a resource manager-specific class.
In addition, the process_proxy stanza can be used to specify various parameters that override the globally-scoped values on a per-kernel basis.
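As a sketch, a kernelspec carrying a process proxy stanza might look like the following. The class name reflects Enterprise Gateway’s documented package layout, but the display name, paths, and environment values shown here are purely illustrative:

```json
{
  "language": "python",
  "display_name": "Spark - Python (YARN Cluster Mode)",
  "process_proxy": {
    "class_name": "enterprise_gateway.services.processproxies.yarn.YarnClusterProcessProxy"
  },
  "env": {
    "SPARK_HOME": "/usr/hdp/current/spark2-client"
  },
  "argv": [
    "/usr/local/share/jupyter/kernels/spark_python_yarn_cluster/bin/run.sh",
    "--RemoteProcessProxy.kernel-id",
    "{kernel_id}",
    "--RemoteProcessProxy.response-address",
    "{response_address}"
  ]
}
```

When Enterprise Gateway reads this kernelspec, the process_proxy stanza tells it which class to instantiate for this kernel, while the argv stanza drives the actual launch.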
How the startup takes place relative to the respective resource manager is a function of the argv stanza – typically through a run.sh script. The function of the process proxy is to locate the kernel within the compute cluster, usually via the resource manager’s API, and manage the rest of its lifecycle. Lifecycle management consists of implementing the following methods:
poll() – to determine if the kernel is alive,
send_signal() – to interrupt the kernel, and
kill() – to terminate the kernel process should message-based termination not succeed. Each of these methods is called from the Jupyter kernel management layers and overridden by the process proxy implementation. General kernel communication still occurs over the ZeroMQ ports – although, with Enterprise Gateway (or Jupyter Kernel Gateway), kernel communication is piped through a single websocket to the gateway, which then routes each message to the appropriate ZeroMQ port of the kernel.
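To make the lifecycle contract concrete, here is a minimal, self-contained Python sketch of the idea. The base class, the FakeClusterProcessProxy subclass, and the in-memory stand-in for a resource manager are all invented for illustration – they are not Enterprise Gateway’s actual classes:

```python
import signal

# Invented stand-in for a resource manager's task-state API; a real
# process proxy would query YARN, Kubernetes, etc. instead.
FAKE_RM_TASKS = {"kernel-123": {"state": "RUNNING", "host": "worker-7"}}


class BaseProcessProxy:
    """Simplified stand-in for the abstract process proxy contract."""

    def poll(self):
        raise NotImplementedError

    def send_signal(self, signum):
        raise NotImplementedError

    def kill(self):
        raise NotImplementedError


class FakeClusterProcessProxy(BaseProcessProxy):
    """Locates the kernel via the (fake) resource manager's API and
    manages the remainder of its lifecycle."""

    def __init__(self, kernel_id):
        self.kernel_id = kernel_id

    def poll(self):
        # None means "still alive", mirroring subprocess.Popen.poll().
        task = FAKE_RM_TASKS.get(self.kernel_id)
        return None if task and task["state"] == "RUNNING" else 0

    def send_signal(self, signum):
        # A real proxy would relay the signal to the remote host (for
        # example, over the launcher's communication port).
        FAKE_RM_TASKS[self.kernel_id]["last_signal"] = signum

    def kill(self):
        # A real proxy would invoke the resource manager's kill API if
        # message-based termination did not succeed.
        FAKE_RM_TASKS[self.kernel_id]["state"] = "KILLED"
```

The key design point is that the kernel management layers above never need to know where the kernel actually lives – they call poll(), send_signal(), and kill(), and the proxy translates those calls into resource manager operations.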
Interrupting a remote kernel
As you’ve probably realized by now, Jupyter Enterprise Gateway essentially provides a general framework for distributing kernels across a compute cluster. While resource manager APIs typically provide calls for determining whether a task is running (i.e., poll()) as well as for its termination (i.e., kill()), they don’t provide methods for interrupting the task. In addition, the Jupyter framework implements interrupts via Unix-style signals, which a) don’t span hosts and b) don’t span process owners. To address these issues, Enterprise Gateway embeds the target kernel in a language-specific kernel launcher. The kernel launcher creates a sixth port (along with the five traditional ZMQ ports) on the destination host, which it communicates back to the gateway through the response-address parameter (as specified in the argv stanza above). This communication port is used by the gateway server to send message-based signals to the kernel launcher. The launcher then uses the traditional signaling mechanism to convey the desired action to the kernel. Although message-based interrupts have recently been added to the message protocol, their support requires kernel modifications. By embedding the kernel in its corresponding wrapper, kernel updates can be avoided – which is one of the design goals of Enterprise Gateway.
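The essence of the launcher trick – an extra listening port that turns message-based signal requests into local Unix-style signals – can be sketched in a few lines of Python. This is a simplified illustration, not Enterprise Gateway’s actual launcher; the function name and the one-line JSON message format are invented:

```python
import json
import os
import signal
import socket
import subprocess
import threading


def launch_with_signal_port(argv):
    """Start a child process plus an extra "communication port" that
    converts message-based signal requests into real OS signals --
    an illustrative sketch of the kernel-launcher idea."""
    proc = subprocess.Popen(argv)

    srv = socket.socket()
    srv.bind(("127.0.0.1", 0))  # the extra (sixth) port, ephemeral
    srv.listen(1)

    def serve():
        conn, _ = srv.accept()
        with conn:
            msg = json.loads(conn.recv(1024).decode())
            # Translate the message into a local Unix-style signal.
            os.kill(proc.pid, msg.get("signum", signal.SIGINT))
        srv.close()

    threading.Thread(target=serve, daemon=True).start()
    return proc, srv.getsockname()[1]
```

A gateway-side caller would connect to the returned port and send, for example, {"signum": 2} to interrupt the kernel; because the launcher runs on the same host (and as the same user) as the kernel, the signal limitations above no longer apply.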
Wrapping things up
Jupyter kernels provide fantastic functionality and have unleashed tremendous insight into data. When used in a large-scale analytics capacity, however, resource bottlenecks can ensue if not properly addressed. One way to tackle these issues is to leverage the power of Jupyter Enterprise Gateway by distributing your kernels across the compute cluster. This eliminates the primary bottleneck on the gateway server and takes advantage of the configured resource manager’s scheduling algorithm – letting it do what it does best – manage resources. With Jupyter Enterprise Gateway, kernels will no longer make your server thirsty, leaving you to worry only about those dry pretzels.