Kubernetes includes a default scheduler that works well for most use cases. It also lets you deploy your own custom scheduler if the default one does not suit your needs. You can run multiple schedulers alongside the default scheduler and tell Kubernetes which scheduler to use for each of your pods.
Recently, my team has been experimenting with Kubernetes scheduling and policies as part of our IBM Cloud Private (ICP) work. In the process, we developed an example scheduler for a specific cloud-bursting scenario: a job requiring GPUs gets provisioned in the remote Nimbix cloud when the local cluster cannot satisfy the GPU resource requirements. Special thanks to Abhishek for making this possible. (If you have not tried IBM Cloud Private, please go ahead and give it a spin. It is built on Kubernetes and you'll like it.)
Nimbix is a cloud platform for running high-performance computing jobs. Nimbix does not support the Kubernetes APIs, so federation was not an option for this experiment. However, Nimbix provides a rich set of APIs of its own, which we used to try out the cloud-bursting scenario.
Note: This is just a technology demonstration that showcases the possibilities. Feel free to use it for learning and experimentation. You can also use this as an example for creating a custom Kubernetes scheduler based on your requirements.
Prerequisites
- Kubernetes 1.8.3 cluster. You can use IBM Cloud Private (ICP) – 126.96.36.199 Community Edition.
- Network connectivity between the local cluster and the remote Nimbix cloud, so that a pod running on the local cluster can call the Nimbix API.
- Nimbix username and API key.
The following is the overall flow:
- You provision a job that requires GPUs and specify the custom scheduler to be used.
- The custom scheduler checks whether the local cluster can satisfy the resource requirements.
- If the local cluster can satisfy the GPU requirements, the job is provisioned on the local cluster.
- If the local cluster cannot satisfy the GPU requirements, the custom scheduler modifies the pod specification to remove the GPU resource requirements and injects a new environment variable (REMOTE=1).
- The modified pod gets provisioned on the local cluster and calls the Nimbix API to provision the GPU job in the Nimbix cloud.
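The placement decision in the flow above can be sketched in a few lines of Python. This is only an illustration, not the scheduler's actual code: the function name choose_placement, the node names, and the idea of passing in a ready-made map of free GPUs per node are all assumptions made for the example. It relies on the fact that a pod runs on a single node, so one node must have enough free GPUs on its own.

```python
def choose_placement(requested_gpus, free_gpus_by_node):
    """Return the name of a local node that can host the job, or None.

    A pod runs on exactly one node, so a single node must have enough
    free GPUs. None means "burst to the remote cloud".
    """
    for node, free in sorted(free_gpus_by_node.items()):
        if free >= requested_gpus:
            return node
    return None


# Hypothetical cluster state: node name -> number of free GPUs.
cluster = {"worker-1": 1, "worker-2": 0}

print(choose_placement(1, cluster))  # worker-1
print(choose_placement(2, cluster))  # None
```

A return value of None corresponds to the last two steps of the flow: the pod is rewritten and the real work is handed to Nimbix.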
For the demonstration, we use an application that is already available on Nimbix: power8-ubuntu-mldl. This is the PowerAI environment on Nimbix, which lets you use ML/DL frameworks on a Power8 server.
We use the same base image (which is used for power8-ubuntu-mldl), but add a custom entrypoint script. You can use your own custom application images as well.
The following is an example Job YAML that uses the custom scheduler.
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: nimbix-job
    spec:
      template:
        metadata:
          name: nimbix-gpu
          labels:
            task-type: Nimbix
        spec:
          schedulerName: k8s-custom-sched
          restartPolicy: Never
          containers:
          - name: nimbix-job
            image: poweraijob
            imagePullPolicy: "Never"
            env:
            - name: "APP_NAME"
              value: "power8-ubuntu-mldl"
            - name: "APP_COMMAND"
              value: "run"
            - name: "APP_COMMAND_ARGS"
              value: "source /opt/DL/bazel/bin/bazel-activate && source /opt/DL/tensorflow/bin/tensorflow-activate && tensorflow-test"
            - name: "ARCH"
              value: "POWER"
            - name: "NUM_CPUS"
              value: "60"
            - name: "NUM_GPUS"
              value: "2"
            - name: "USERNAME"
              value: ""   # set to your Nimbix username
            - name: "APIKEY"
              value: ""   # set to your Nimbix API key
            resources:
              limits:
                alpha.kubernetes.io/nvidia-gpu: 2
            command: ["python", "/jarvice_submit.py"]
Ensure that the number of GPUs requested through alpha.kubernetes.io/nvidia-gpu matches the value of the NUM_GPUS environment variable.
If the pod cannot be scheduled on the local cluster, the scheduler looks for the label task-type: Nimbix in the pod request.
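Since the two GPU values are maintained by hand, a small check before submitting the job can catch a mismatch. The following is a minimal sketch that works on a container definition already loaded into a Python dictionary; the helper name gpu_settings_match is hypothetical and is not part of the scheduler or of Kubernetes.

```python
GPU_RESOURCE = "alpha.kubernetes.io/nvidia-gpu"


def gpu_settings_match(container):
    """Return True if NUM_GPUS equals the GPU limit of the container."""
    env = {e["name"]: e["value"] for e in container.get("env", [])}
    limit = container.get("resources", {}).get("limits", {}).get(GPU_RESOURCE)
    if limit is None or "NUM_GPUS" not in env:
        return False
    # Environment values are strings in a pod spec; the limit may be an int.
    return int(env["NUM_GPUS"]) == int(limit)


# The relevant parts of the example job, as a plain dictionary.
container = {
    "env": [{"name": "NUM_GPUS", "value": "2"}],
    "resources": {"limits": {GPU_RESOURCE: 2}},
}

print(gpu_settings_match(container))  # True
```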
The scheduler checks whether the local cluster has 2 GPUs (alpha.kubernetes.io/nvidia-gpu) available. If it does, the job gets provisioned in the local cluster. If it does not, the scheduler modifies the pod specification as follows and provisions the pod in the local cluster:
- Adds the REMOTE=1 environment variable.
- Removes alpha.kubernetes.io/nvidia-gpu from the resource request.
- The modified pod gets provisioned in the local cluster.
- The new pod runs with the default minimum resources (CPU and memory only) on the local cluster and calls the Nimbix API to run the actual task on the remote Nimbix cloud, which has the required GPUs.
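The rewrite described above can be illustrated on a pod represented as a plain Python dictionary. This is a sketch under assumptions, not the scheduler's real implementation: the function name rewrite_for_remote is made up for the example, and returning None for pods without the task-type: Nimbix label is one possible way to express that such pods are not rewritten.

```python
import copy

GPU_RESOURCE = "alpha.kubernetes.io/nvidia-gpu"


def rewrite_for_remote(pod):
    """Return a modified copy of pod for bursting, or None if not labelled."""
    labels = pod.get("metadata", {}).get("labels", {})
    if labels.get("task-type") != "Nimbix":
        return None
    modified = copy.deepcopy(pod)  # leave the caller's pod untouched
    for container in modified["spec"]["containers"]:
        resources = container.get("resources", {})
        for section in ("limits", "requests"):
            # Drop the GPU entry so the pod fits on a node without GPUs.
            resources.get(section, {}).pop(GPU_RESOURCE, None)
        # Replace any existing REMOTE entry, then inject REMOTE=1.
        env = [e for e in container.get("env", []) if e["name"] != "REMOTE"]
        env.append({"name": "REMOTE", "value": "1"})
        container["env"] = env
    return modified


pod = {
    "metadata": {"labels": {"task-type": "Nimbix"}},
    "spec": {"containers": [{
        "name": "nimbix-job",
        "env": [{"name": "NUM_GPUS", "value": "2"}],
        "resources": {"limits": {GPU_RESOURCE: 2}},
    }]},
}

remote_pod = rewrite_for_remote(pod)
print(remote_pod["spec"]["containers"][0]["resources"])  # {'limits': {}}
```

Note that NUM_GPUS is left in place, since the entrypoint still needs it to describe the remote job.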
The entrypoint script checks for the presence of the REMOTE variable. If it is set, the script connects to the Nimbix cloud and uses the Nimbix API to provision the job there, based on the APP_NAME, APP_COMMAND, and APP_COMMAND_ARGS environment variables.
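Before calling a remote API, it helps to fail early if a required setting is missing. The sketch below gathers the environment variables from the example job into one dictionary. It is an assumption about how such a step could look: the helper name collect_job_params is hypothetical, the resulting dictionary is not the Nimbix API request format, and no Nimbix call is made.

```python
import os

# Settings taken from the example job definition.
REQUIRED = ("APP_NAME", "APP_COMMAND", "APP_COMMAND_ARGS", "USERNAME", "APIKEY")


def collect_job_params(env=os.environ):
    """Collect job settings from env, raising ValueError if any is missing."""
    missing = [name for name in REQUIRED if not env.get(name, "").strip()]
    if missing:
        raise ValueError("missing settings: " + ", ".join(missing))
    params = {name: env[name] for name in REQUIRED}
    # NUM_GPUS arrives as a string; default to 0 when it is not set.
    params["NUM_GPUS"] = int(env.get("NUM_GPUS", "0"))
    return params


example_env = {
    "APP_NAME": "power8-ubuntu-mldl",
    "APP_COMMAND": "run",
    "APP_COMMAND_ARGS": "tensorflow-test",
    "USERNAME": "demo-user",   # placeholder value
    "APIKEY": "demo-key",      # placeholder value
    "NUM_GPUS": "2",
}

print(collect_job_params(example_env)["NUM_GPUS"])  # 2
```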
The following is the entrypoint code that handles the execution of the job locally or by calling the Nimbix API:
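    import os
    import sys
    from subprocess import CalledProcessError, check_call

    def main():
        if os.environ.get("REMOTE"):
            # REMOTE was injected by the custom scheduler: run the job on Nimbix.
            remote_exec()  # defined in the complete script linked below
        else:
            # Execute the command as-is on the local cluster.
            app_command_args = os.environ.get("APP_COMMAND_ARGS")
            try:
                check_call(["/bin/bash", "-c", app_command_args])
            except CalledProcessError as err:
                # Assumed error handling; the excerpt ends at the try block.
                sys.exit(err.returncode)
The complete code is available at https://github.com/IBM/k8s-custom-scheduler/blob/master/nimbix-app/jarvice_submit.py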
Base Docker image
To make the same pod capable of running tasks both on the local (on-premises) cluster and on the Nimbix cluster, the container must be created from a specific base image.
This is because Nimbix expects a specific layout of the application environment in the Docker image.
See the nimbix-app link for the example base image code.
Deploy Nimbix scheduler on IBM Cloud Private
While these instructions are specific to ICP, they apply to any Kubernetes setup with minor modifications.
- Create a secret for the certificate files used to access the API server over HTTPS. In ICP, the certificate files reside in /etc/cfc/conf on the master node. Use the following command to create the secret:
kubectl create secret generic certs --from-file=kube-scheduler-config.yaml --from-file=kube-scheduler.crt --from-file=kube-scheduler.key
- Build the images for the custom scheduler and the sample Nimbix job. Use the Dockerfiles in the scheduler and nimbix-app directories.
- Deploy the scheduler by running the following command. An example deployment YAML is available at deploy/k8s-custom-sched.yaml. Update the YAML file with the MASTER_IP.
kubectl create -f k8s-custom-sched.yaml
- Create an appropriate role binding so that the custom scheduler, running as system:kube-scheduler, can modify pods in the default namespace.
kubectl create rolebinding someRole --clusterrole=admin --user=system:kube-scheduler --namespace=default
- Deploy a sample GPU job by using the custom scheduler. An example YAML is available at deploy/sample-job.yaml. Update the YAML with your Nimbix USERNAME and APIKEY. The job is provisioned to the Nimbix cloud if the resource requirement is not met in the local cluster.
kubectl create -f sample-job.yaml
Here is a demo video showing the custom scheduler in action