There’s no denying that data analytics is the next frontier on the computational landscape. Companies are scrambling to establish teams of data scientists to better understand their clientele and to adapt their products to the ebb and flow of today’s business ecosystem. With Apache Hadoop and Apache Spark entrenched as the analytic engines, and with analytic applications built iteratively by trial and error, the Jupyter Notebook has proven to be the tool of choice for data scientists.
Inherent in the Jupyter Notebook is the ability to dynamically adjust and tune analytic functions, enabling data scientists to accomplish their tasks quickly. Eventually, the notebook content evolves into a series of calls against the back-end cluster, each of which can take a significant amount of time while consuming valuable resources across the cluster. These calls are made to the notebook’s kernel (a library or application that implements the Jupyter Message Protocol), which is essentially the programmatic interface exposed in a notebook’s cell. This behavior, coupled with the fact that data scientists are human and often fail to explicitly shut down the kernel, can lead to a seemingly unbounded consumption of resources that, until now, could not be reclaimed without administrative intervention.
To address this human behavior programmatically, the Jupyter Notebook has been extended so that it can now cull idle kernels. A notebook kernel is considered idle if the notebook performs no activity against it. Because data analytic method calls tend to be long-running, an idle timeout period after which it is safe to assume culling can take place is typically 12 or 24 hours.
Idle kernel culling is set to “off” by default. It’s enabled by setting
--MappingKernelManager.cull_idle_timeout to a positive value representing the number of seconds a kernel must remain idle before it is culled (default: 0; recommended: 43200, which is 12 hours). Positive values less than 300 (5 minutes) are adjusted to 300.
You can configure the interval at which kernels are checked against their idle timeout by setting
--MappingKernelManager.cull_interval to a positive value, in seconds. If the interval is not set, or is set to a non-positive value, the system uses the default of 300 seconds.
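As a minimal sketch of how these two settings might be passed when starting the notebook server, the following Python snippet builds the command line with a 12-hour idle timeout and a 5-minute check interval. The helper name build_cull_args is hypothetical and exists only for illustration; the option names are the ones described above, written in the usual --Class.option=value form.

```python
import shlex


def build_cull_args(idle_timeout=43200, interval=300):
    """Return a 'jupyter notebook' argument list that enables idle kernel culling.

    build_cull_args is a hypothetical helper used only for illustration.
    """
    return [
        "jupyter",
        "notebook",
        # Seconds a kernel must stay idle before it is culled (0 disables culling).
        "--MappingKernelManager.cull_idle_timeout={}".format(idle_timeout),
        # Seconds between checks for idle kernels.
        "--MappingKernelManager.cull_interval={}".format(interval),
    ]


args = build_cull_args()
# Print the command as it would be typed in a shell.
print(" ".join(shlex.quote(a) for a in args))
```

The resulting list can be handed to subprocess.run(args) to launch the server, or the printed command can be copied into a shell.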
Given the long-running nature of data analytic models, what if a given cell’s execution takes longer than the cull idle timeout period? That is, what if the kernel is in a busy state for the duration of the culling period? This case is addressed by the
--MappingKernelManager.cull_busy parameter (default: False). With the default, a kernel that is busy executing a long-running cell is not culled; setting the parameter to True makes busy kernels eligible for culling as well.
Another use case for the Jupyter Notebook is to configure it as a sort of “kiosk”, where users want to leave their notebooks in a connected state. In this configuration, a notebook has an associated browser tab that is open and, although its kernel might be idle, that kernel shouldn’t necessarily be culled, while other idle kernels not currently connected to a browser will be culled. This behavior is obtained by leaving the following parameter set to False
--MappingKernelManager.cull_connected (default: False).
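To summarize how the idle timeout, cull_busy and cull_connected interact, here is a simplified sketch of the decision in Python. It is not the Notebook’s actual implementation: the KernelState class, the should_cull function and the exact boundary comparison are assumptions made for illustration, based only on the behavior described above.

```python
from dataclasses import dataclass

# Positive timeouts below this value are raised to it, as described above.
MIN_IDLE_TIMEOUT = 300  # seconds


@dataclass
class KernelState:
    """Hypothetical snapshot of a kernel, for illustration only."""
    idle_seconds: float  # time since the kernel's last activity
    busy: bool           # True while a cell is executing
    connections: int     # open connections, such as browser tabs


def should_cull(kernel, cull_idle_timeout=0, cull_busy=False, cull_connected=False):
    """Return True if the kernel would be culled under the rules described above."""
    if cull_idle_timeout <= 0:
        return False  # culling is off by default
    timeout = max(cull_idle_timeout, MIN_IDLE_TIMEOUT)
    if kernel.idle_seconds <= timeout:
        return False  # not idle long enough yet
    if kernel.busy and not cull_busy:
        return False  # long-running cells are protected by default
    if kernel.connections > 0 and not cull_connected:
        return False  # connected ("kiosk") notebooks are protected by default
    return True


# A kernel idle for 13 hours with an open browser tab survives a 12-hour timeout.
kiosk = KernelState(idle_seconds=13 * 3600, busy=False, connections=1)
print(should_cull(kiosk, cull_idle_timeout=43200))
```

With the defaults shown, only kernels that are past the timeout, not executing a cell, and not connected to a browser are candidates for culling.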
This functionality is included in the 5.1.0 release of the Jupyter Notebook and can be installed using:
pip install notebook
For existing installations:
pip install notebook --upgrade