
Create a custom Kubernetes scheduler

Kubernetes is not only known for its powerful ability to orchestrate and manage workloads, it also provides a series of extension mechanisms that let developers customize it for their business needs. Extensibility is extremely important for open source software: developers can fulfill their needs without modifying the upstream source code, and can contribute back upstream if appropriate. This article discusses various ways of extending the Kubernetes scheduler, and then uses a simple example to demonstrate how to write a “scheduler extender”.

Overview

Generally speaking, there are four ways to extend the Kubernetes scheduler.

One way is to clone the upstream source code, modify the code in place, and then re-compile and run the “hacked” scheduler. This is neither recommended nor practical, because a lot of extra effort has to be spent keeping up with upstream scheduler changes.

The second way is to run a separate scheduler along with the default scheduler. The default scheduler and your custom scheduler each cover their own pods exclusively (selected by spec.schedulerName). However, the co-existence of multiple schedulers can cause more problems than it sounds, such as distributed locking and cache synchronization, and you may see tricky issues when multiple schedulers place pods onto the same node. Additionally, maintaining a high-quality custom scheduler isn’t trivial, as it requires a comprehensive understanding of the default scheduler, overall Kubernetes architecture knowledge, and the relationships and restrictions among various kinds of Kubernetes API objects. A sketch of how a pod selects its scheduler follows this paragraph.
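
To make the spec.schedulerName mechanism concrete, here is a minimal sketch in Go using the core v1 API types; the scheduler name “my-custom-scheduler” and the pod itself are hypothetical and purely for illustration:

package main

import (
    "fmt"

    v1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func main() {
    pod := v1.Pod{
        ObjectMeta: metav1.ObjectMeta{Name: "demo"},
        Spec: v1.PodSpec{
            // Only a scheduler running under this (hypothetical) name will
            // schedule the pod; the default scheduler leaves it alone.
            SchedulerName: "my-custom-scheduler",
            Containers: []v1.Container{
                {Name: "pause", Image: "k8s.gcr.io/pause:3.1"},
            },
        },
    }
    fmt.Println("pod", pod.Name, "is handled by", pod.Spec.SchedulerName)
}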

The third way is called a scheduler extender. It’s the most viable solution at this moment for extending the scheduler with minimal effort, and it remains compatible with the upstream scheduler. A “scheduler extender” is simply a set of configurable webhooks, also known as “filter” and “prioritize”, which correspond to the two major phases (“Predicates” and “Priorities”) in a scheduling cycle. We will do a deep dive in the next section.

The fourth way is through a scheduler framework. We’ll skip this for now and illustrate more in the last section of this article.

How the scheduler extender works

Before jumping into the scheduler extender, let’s understand the basics of how the Kubernetes scheduler works. In a nutshell, it works in the following sequence (a toy sketch of the flow follows the list):

  1. The default scheduler starts up according to the parameters given.
  2. It watches the apiserver, and puts pods whose spec.nodeName is empty into its internal scheduling queue.
  3. It pops out a pod from the scheduling queue and starts a standard scheduling cycle.
  4. It retrieves the “hard requirements” (like CPU/memory requests and nodeSelector/nodeAffinity) from the pod’s API spec. The Predicates phase then computes a list of candidate nodes that satisfy those requirements.
  5. It retrieves the “soft requirements” from the pod’s API spec and also applies some default soft “policies” (like preferring pods to be more packed or more scattered across the nodes). The Priorities phase then scores each candidate node, and the node with the highest score wins.
  6. It talks to the apiserver (by issuing a bind call) and sets spec.nodeName to indicate the node that this pod should be scheduled to.
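
As a toy illustration only (not the real scheduler code), the cycle above boils down to filtering nodes against the hard requirements, scoring the survivors against the soft requirements, and binding the pod to the highest-scoring node. Here is a self-contained sketch with made-up types and placeholder helper functions:

package main

import "fmt"

type Pod struct{ Name string }
type Node struct{ Name string }

// fitsHardRequirements and score stand in for the real Predicates and
// Priorities logic; they are placeholders for illustration only.
func fitsHardRequirements(p Pod, n Node) bool { return true }
func score(p Pod, n Node) int                 { return len(n.Name) }

func scheduleOne(p Pod, nodes []Node) (Node, error) {
    // Predicates phase: keep only the nodes that satisfy the hard requirements.
    var candidates []Node
    for _, n := range nodes {
        if fitsHardRequirements(p, n) {
            candidates = append(candidates, n)
        }
    }
    if len(candidates) == 0 {
        return Node{}, fmt.Errorf("no node fits pod %s", p.Name)
    }
    // Priorities phase: score the candidates and pick the highest one.
    best := candidates[0]
    for _, n := range candidates[1:] {
        if score(p, n) > score(p, best) {
            best = n
        }
    }
    // The real scheduler would now issue a bind call to set spec.nodeName.
    return best, nil
}

func main() {
    node, _ := scheduleOne(Pod{Name: "demo"}, []Node{{Name: "node-1"}, {Name: "node-2"}})
    fmt.Println("pod scheduled to", node.Name)
}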

In the above scheduling cycle, there are several extension points that relate to the scheduler extender.

Startup parameters

The official documentation points out that --config is where you tell the scheduler which configuration file to load. From the API’s perspective, that config file is expected to contain a KubeSchedulerConfiguration object:

 # content of the file passed to "--config"
 apiVersion: kubescheduler.config.k8s.io/v1alpha1
 kind: KubeSchedulerConfiguration
 clientConnection:
   kubeconfig: "/var/run/kubernetes/scheduler.kubeconfig"
 algorithmSource:
   policy:
     file:
       path: "/root/config/scheduler-extender-policy.json"

The key parameter here is algorithmSource.policy, which can point to a local file or a ConfigMap, depending on how your scheduler is deployed. For the sake of simplicity, I’m using a local file here.

The file /root/config/scheduler-extender-policy.json must conform to the kubernetes/pkg/scheduler/api/v1#Policy API, and for now it can only be in JSON format (see k/k#75852). Here is a minimized example. (See the full ExtenderConfig specs here.)

{
    "kind" : "Policy",
    "apiVersion" : "v1",
    "extenders" : [{
        "urlPrefix": "http://localhost:8888/",
        "filterVerb": "filter",
        "prioritizeVerb": "prioritize",
        "weight": 1,
        "enableHttps": false
    }]
}

This policy file defines an HTTP extender service running at localhost:8888 and registers it with the default scheduler, so that at the end of the Predicates and Priorities phases the intermediate results are passed on to this extender service at <urlPrefix>/<filterVerb> and <urlPrefix>/<prioritizeVerb> respectively (in this example, http://localhost:8888/filter and http://localhost:8888/prioritize). In the extender, we can then filter and prioritize further to adapt to our particular business needs.

In the next section, I will walk you through the basics of running the http service and how to deal with incoming data.

A simple example

First, run your extender as an http(s) service. You can write the program in any language. Here is a Golang snippet for reference:

func main() {
    router := httprouter.New()
    router.GET("/", Index)
    router.POST("/filter", Filter)
    router.POST("/prioritize", Prioritize)

    log.Fatal(http.ListenAndServe(":8888", router))
}

Next, you need to implement the handlers to respond to /filter and /prioritize.
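
Each handler is plain HTTP glue: it decodes the JSON payload that the default scheduler POSTs, calls the extender logic, and writes the result back as JSON. Here is a minimal sketch of the Filter handler, assuming the imports encoding/json, log, net/http, and github.com/julienschmidt/httprouter, and assuming schedulerapi refers to the scheduler API package used in the snippets below; the Prioritize handler follows the same pattern with prioritize():

// Filter decodes the scheduler's request, runs the extender logic (the
// filter() function shown below), and writes the result back as JSON.
func Filter(w http.ResponseWriter, r *http.Request, _ httprouter.Params) {
    var args schedulerapi.ExtenderArgs
    if err := json.NewDecoder(r.Body).Decode(&args); err != nil {
        http.Error(w, err.Error(), http.StatusBadRequest)
        return
    }
    result := filter(args)
    w.Header().Set("Content-Type", "application/json")
    if err := json.NewEncoder(w).Encode(result); err != nil {
        log.Printf("failed to encode filter result: %v", err)
    }
}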

The filter extension function accepts an input of type schedulerapi.ExtenderArgs and returns a *schedulerapi.ExtenderFilterResult. Within the function, we can do further filtering on the incoming nodes, like this:

// filter filters nodes according to predicates defined in this extender
// it's webhooked to pkg/scheduler/core/generic_scheduler.go#findNodesThatFit()
func filter(args schedulerapi.ExtenderArgs) *schedulerapi.ExtenderFilterResult {
    var filteredNodes []v1.Node
    failedNodes := make(schedulerapi.FailedNodesMap)
    pod := args.Pod

    for _, node := range args.Nodes.Items {
        fits, failReasons, _ := podFitsOnNode(pod, node)
        if fits {
            filteredNodes = append(filteredNodes, node)
        } else {
            failedNodes[node.Name] = strings.Join(failReasons, ",")
        }
    }

    result := schedulerapi.ExtenderFilterResult{
        Nodes: &v1.NodeList{
            Items: filteredNodes,
        },
        FailedNodes: failedNodes,
        Error:       "",
    }

    return &result
}

Inside the function, we iterate over each node and use our business logic to judge whether to approve it. In podFitsOnNode(), we simply check whether a random number is even; if it is, we consider it a lucky node and approve it.

// The predicate name and its failure message; the exact string values here
// are illustrative (see the complete source code linked below).
const (
    LuckyPred        = "Lucky"
    LuckyPredFailMsg = "pod is unlucky to fit on this node"
)

var predicatesSorted = []string{LuckyPred}

var predicatesFuncs = map[string]FitPredicate{
    LuckyPred: LuckyPredicate,
}

type FitPredicate func(pod *v1.Pod, node v1.Node) (bool, []string, error)

func podFitsOnNode(pod *v1.Pod, node v1.Node) (bool, []string, error) {
    fits := true
    failReasons := []string{}
    for _, predicateKey := range predicatesSorted {
        fit, failures, err := predicatesFuncs[predicateKey](pod, node)
        if err != nil {
            return false, nil, err
        }
        fits = fits && fit
        failReasons = append(failReasons, failures...)
    }
    return fits, failReasons, nil
}

func LuckyPredicate(pod *v1.Pod, node v1.Node) (bool, []string, error) {
    lucky := rand.Intn(2) == 0
    if lucky {
        log.Printf("pod %v/%v is lucky to fit on node %v\n", pod.Name, pod.Namespace, node.Name)
        return true, nil, nil
    }
    log.Printf("pod %v/%v is unlucky to fit on node %v\n", pod.Name, pod.Namespace, node.Name)
    return false, []string{LuckyPredFailMsg}, nil
}

The prioritize function is implemented in a similar fashion. Let’s randomly assign a score to each node:

// Format string for the log line below; the exact wording is illustrative.
const luckyPrioMsg = "pod %v/%v is lucky to get score %v\n"

// prioritize scores the nodes; it's webhooked to
// pkg/scheduler/core/generic_scheduler.go#PrioritizeNodes().
// The scores calculated so far by the default scheduler are not visible here;
// the scores this function returns are added back into the default scheduler's.
func prioritize(args schedulerapi.ExtenderArgs) *schedulerapi.HostPriorityList {
    pod := args.Pod
    nodes := args.Nodes.Items

    hostPriorityList := make(schedulerapi.HostPriorityList, len(nodes))
    for i, node := range nodes {
        score := rand.Intn(schedulerapi.MaxPriority + 1)
        log.Printf(luckyPrioMsg, pod.Name, pod.Namespace, score)
        hostPriorityList[i] = schedulerapi.HostPriority{
            Host:  node.Name,
            Score: score,
        }
    }

    return &hostPriorityList
}

Note: The complete source code can be found here.

The above two functions are the most important ones for extending the behavior of the default scheduler. As mentioned earlier, you can explore the full ExtenderConfig spec to implement your own “preempt” and “bind” functions as well.

So far, we have created a very simple scheduler extender. Now let’s run a Deployment to see how it works. Here is a deployment yaml with 20 replicas:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: pause
spec:
  replicas: 20
  selector:
    matchLabels:
      app: pause
  template:
    metadata:
      labels:
        app: pause
    spec:
      containers:
      - name: pause
        image: k8s.gcr.io/pause:3.1

We’re running the whole example in a one-node cluster, so it’s expected that in the first scheduling attempt roughly half of the pods will end up in a running state:

2019/03/29 19:41:24 pod pause-5f8c98c84f-6dd6z/default is unlucky to fit on node 127.0.0.1
2019/03/29 19:41:24 pod pause-5f8c98c84f-kjqtq/default is lucky to fit on node 127.0.0.1
2019/03/29 19:41:24 pod pause-5f8c98c84f-nngm7/default is unlucky to fit on node 127.0.0.1
2019/03/29 19:41:25 pod pause-5f8c98c84f-x5w7z/default is lucky to fit on node 127.0.0.1
......

It’s worth pointing out that the default scheduler retries failed pods periodically, so they are passed to our scheduler extender again and again. Since our logic simply checks whether a random number is even, eventually all pods will reach a running state.

Limitations

The scheduler extender can be a good option to use in many scenarios, but note that it does have some limitations:

  • Communication cost: Data is transferred over http(s) between the default scheduler and the scheduler extender, so be aware of the costs of serialization and deserialization.
  • Limited extension points: As you might have noticed, extenders can only be involved at the end of certain phases, such as “Filter” and “Prioritize”; they can’t be called at the beginning or in the middle of a phase.
  • Subtraction over addition: You might have a good reason to add a node to the candidate list passed in by the default scheduler, but doing so is risky because the new node isn’t guaranteed to pass the other predicates. A scheduler extender is therefore better suited to performing “subtraction” (further filtering) than “addition” (adding nodes).
  • Cache sharing: The example above is just a toy that connects the dots of developing a scheduler extender. In practice, you need to make scheduling decisions based on the state of the whole cluster. The default scheduler makes its decisions well, but its cache can’t be shared, which means you have to build and maintain your own.

What’s next

Due to these limitations, the Kubernetes scheduling group recently proposed the aforementioned fourth way for better extensibility, called the Scheduler Framework. It aims to resolve these pain points and become the officially recommended way to extend the scheduler, so stay tuned for its beta/GA release.

If you feel comfortable with creating your own Kubernetes scheduler, check out the other items in the Kubernetes: Enterprise Container Orchestration module in our Kubernetes Learning Path.

Wei Huang