YARN Node Labels: Label-based scheduling and resource isolation - Hadoop Dev

Technical Blog Post

Apache Hadoop is best known for distributed parallel processing across an entire cluster, with MapReduce jobs as the classic example. However, as more and more kinds of applications run on Hadoop clusters, new requirements emerge. For example, because some Spark applications require a lot of memory, you might want to run them on memory-rich nodes to accelerate processing and to avoid taking memory away from other applications. Some machine learning jobs might benefit from running on nodes with powerful CPUs. With YARN Node Labels, you can mark nodes with labels such as “memory” (for nodes with more RAM) or “high_cpu” (for nodes with powerful CPUs) or any other meaningful label so that applications can choose the nodes on which to run their containers. The YARN ResourceManager schedules jobs based on those node labels.

Node Labels can also help you to manage different workloads and organizations in the same cluster as your business grows. You can use them to help provide good throughput and access control. By assigning a label for each node, you can group nodes with the same label together and separate the cluster into several node partitions. These partitions let you isolate resources among workloads or organizations, as well as share data in the same cluster.

Currently, a node can have only one label assigned to it. Nodes that do not have a label belong to the “Default” partition.
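
Before node labels can be used for scheduling, the feature must be enabled on the ResourceManager, the labels must be added to the cluster, and nodes must be mapped to labels. The following is a minimal sketch; the label-store directory and host names are placeholders rather than values from this article:

<!-- yarn-site.xml -->
<property>
  <name>yarn.node-labels.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.node-labels.fs-store.root-dir</name>
  <!-- placeholder path; any directory the ResourceManager can write to -->
  <value>hdfs:///yarn/node-labels</value>
</property>

# Add the labels to the cluster, then map nodes to labels (host names are examples)
yarn rmadmin -addToClusterNodeLabels "memory,high_cpu"
yarn rmadmin -replaceLabelsOnNode "node1.example.com=memory node2.example.com=high_cpu"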

How do YARN Node Labels work?

When you submit an application, you can specify a node label expression to tell YARN where it should run. Containers are then allocated only on nodes that have the specified node label. A node label expression is a phrase containing node labels that can be specified for an application or for a single ResourceRequest. In principle, an expression could be a single label or a logical combination of labels, such as “x&&y” or “x||y”; currently, only the single-label form is supported.

YARN manages resources through a hierarchy of queues. Each queue’s capacity specifies how much of the cluster’s resources it can consume, and resources are shared among queues according to those capacities. You can associate node labels with queues. Each queue can have a list of accessible node labels and a capacity for every label to which it has access. A queue can also have its own default node label expression; applications submitted to the queue use this default if they don’t specify a label of their own. If neither is specified, the application is treated as requesting the Default partition.

A queue’s accessible node label list determines the nodes on which applications that are submitted to this queue can run. That is, an application can specify only node labels to which the target queue has access; otherwise, it is rejected. All queues have access to the Default partition.

In the following example, Queue A has access to both partition X (nodes with label X) and partition Y (nodes with label Y). Queue B has access to only partition Y, and Queue C has access to only the Default partition (nodes with no label). Partition X is accessible only by Queue A with a capacity of 100%, whereas Partition Y is shared between Queue A and Queue B with a capacity of 50% each.

Figure 1. Accessible node labels with capacities for queues

When you submit an application, it is routed to the target queue according to queue mapping rules, and containers are allocated on the matching nodes if a node label has been specified. In the following example, User_1 has submitted App_1 and App_2 to Queue A with node label expressions “X” and “Y”, respectively. Containers for App_1 have been allocated on Partition X, and containers for App_2 have been allocated on Partition Y.


Figure 2. Submitting applications to queues

Exclusive and non-exclusive node labels

There are two kinds of node labels:

  • Resources on nodes with exclusive node labels can be allocated only to applications that request them.
  • Resources on nodes with non-exclusive node labels can be shared by applications that request the Default partition.

The exclusivity attribute is set when you add a node label; if it is not specified, the label is exclusive by default. In the following example, an exclusive label “X” and a non-exclusive label “Y” are added:

yarn rmadmin -addToClusterNodeLabels "X,Y(exclusive=false)"
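
In Hadoop 2.8 and later, you can verify the labels and their exclusivity with the cluster CLI:

yarn cluster --list-node-labels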

Figure 3. Exclusive and non-exclusive node labels

When a queue is associated with one or more exclusive node labels, applications submitted to the queue have exclusive access to nodes with those labels. When a queue is associated with one or more non-exclusive node labels, applications submitted to the queue get first priority on nodes with those labels. If idle capacity is available on those nodes, it is shared with applications that are requesting resources on the Default partition. In that case, if preemption is enabled, the shared resources are preempted whenever applications ask for resources on the non-exclusive partition again, so that applications requesting the label keep the highest priority there.
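
Preemption for the CapacityScheduler is driven by the scheduling monitor. The following yarn-site.xml sketch shows one common way to turn it on; real deployments usually also tune the ProportionalCapacityPreemptionPolicy monitoring intervals and limits, which are omitted here:

<property>
  <name>yarn.resourcemanager.scheduler.monitor.enable</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.monitor.policies</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy</value>
</property>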

In the example shown in Figure 2, User_1 has submitted App_3 to Queue A without specifying a node label expression. Assume that Queue A doesn’t have a default node label expression configured. YARN assumes that App_3 is asking for resources on the Default partition, as described earlier. User_2 has submitted App_4 to Queue C, which only has access to the Default partition. Partition Y is non-exclusive and has available resources to share. Containers for App_3 and App_4 have been allocated on both the Default partition and Partition Y.

Associating node labels with queues

The following properties in the capacity-scheduler.xml file are used to associate node labels with queues for the CapacityScheduler:

  • yarn.scheduler.capacity.<queue-path>.capacity defines the queue capacity for resources on nodes that belong to the Default partition.
  • yarn.scheduler.capacity.<queue-path>.accessible-node-labels defines the node labels that the queue can access. “*” means that the queue can access all the node labels; ” ” (a blank space) means that the queue can access only the Default partition.
  • yarn.scheduler.capacity.<queue-path>.accessible-node-labels.<label>.capacity defines the queue capacity for accessing nodes that belong to partition “label”. The default is 0.
  • yarn.scheduler.capacity.<queue-path>.accessible-node-labels.<label>.maximum-capacity defines the maximum queue capacity for accessing nodes that belong to partition “label”. The default is 100.
  • yarn.scheduler.capacity.<queue-path>.default-node-label-expression defines the queue’s default node label expression. If applications that are submitted to the queue don’t have their own node label expression, the queue’s default node label expression is used.

You can set these properties for the root queue or for any child queue as long as the following items are true:

  • Capacity is specified for each node label to which the queue has access.
  • For each node label, the sum of the capacities of the direct children of a parent queue at every level is 100%.
  • Node labels that a child queue can access are the same as (or a subset of) the accessible node labels of its parent queue.

The following listing shows the content of the capacity-scheduler.xml file for the previous example:

<!-- configuration of queue root -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>A,B,C</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.maximum-capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.accessible-node-labels</name>
  <value>*</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.accessible-node-labels.X.capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.accessible-node-labels.X.maximum-capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.accessible-node-labels.Y.capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.accessible-node-labels.Y.maximum-capacity</name>
  <value>100</value>
</property>

<!-- configuration of queue root.A -->
<property>
  <name>yarn.scheduler.capacity.root.A.capacity</name>
  <value>40</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.A.maximum-capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.A.accessible-node-labels</name>
  <value>X,Y</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.A.default-node-label-expression</name>
  <value>X</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.A.accessible-node-labels.X.capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.A.accessible-node-labels.X.maximum-capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.A.accessible-node-labels.Y.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.A.accessible-node-labels.Y.maximum-capacity</name>
  <value>100</value>
</property>

<!-- configuration of queue root.B -->
<property>
  <name>yarn.scheduler.capacity.root.B.capacity</name>
  <value>30</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.B.maximum-capacity</name>
  <value>100</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.B.accessible-node-labels</name>
  <value>Y</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.B.accessible-node-labels.Y.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.B.accessible-node-labels.Y.maximum-capacity</name>
  <value>100</value>
</property>

<!-- configuration of queue root.C -->
<property>
  <name>yarn.scheduler.capacity.root.C.capacity</name>
  <value>30</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.C.maximum-capacity</name>
  <value>100</value>
</property>

The Ambari Queue Manager View provides a great visual way to configure the capacity scheduler and to associate node labels with queues.

Figure 4. Accessible node labels and capacities for the root queue

Figure 5. Accessible node labels and capacities for Queue A

Figure 6. Accessible node labels and capacities for Queue B

Figure 7. Accessible node labels and capacities for Queue C

Label-based resource scheduling

As mentioned, the ResourceManager allocates containers for each application based on node label expressions. Containers are only allocated on nodes with an exactly matching node label. However, those containers that request the Default partition might be allocated on non-exclusive partitions for better resource utilization. Moreover, during scheduling, the ResourceManager also calculates a queue’s available resources based on labels. For the example shown in Figure 1, let’s see how many resources each queue can acquire. Table 1 shows the queue capacities:

Table 1: Queue capacities

Queue   Capacity
A       40% of resources on nodes without any label
        100% of resources on nodes with label X
        50% of resources on nodes with label Y
B       30% of resources on nodes without any label
        50% of resources on nodes with label Y
C       30% of resources on nodes without any label

Suppose that a cluster has 6 nodes and that each node can run 10 containers. Nodes n1 and n2 have node label “X”; n3 and n4 have node label “Y”; and nodes n5 and n6 don’t have a node label assigned.

Resources in Partition X = 20 (all containers can be allocated on nodes n1 and n2)
Resources in Partition Y = 20 (all containers can be allocated on nodes n3 and n4)
Resources in the Default partition = 20 (all containers can be allocated on nodes n5 and n6)
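
This layout could be produced with the commands below (a sketch; it assumes n1 through n6 are the nodes’ host names and keeps label “Y” non-exclusive, as in Figure 3; n5 and n6 simply keep no label):

yarn rmadmin -addToClusterNodeLabels "X,Y(exclusive=false)"
yarn rmadmin -replaceLabelsOnNode "n1=X n2=X n3=Y n4=Y"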

Queue A can access the following resources, based on its capacity for each node label:

Available resources in Partition X = Resources in Partition X * 100% = 20
Available resources in Partition Y = Resources in Partition Y * 50% = 10
Available resources in the Default partition = Resources in the Default partition * 40% = 8

Queue B can access the following resources, based on its capacity for each node label:

Available resources in Partition Y = Resources in Partition Y * 50% = 10
Available resources in the Default partition = Resources in the Default partition * 30% = 6

Queue C can access the following resources, based on its capacity for each node label:

Available resources in the Default partition = Resources in the Default partition * 30% = 6
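
If you want to check a queue’s configured capacities from the command line, the queue CLI can help; the fields shown vary by Hadoop version, and per-label capacities may only be visible in the ResourceManager web UI on older releases:

yarn queue -status A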

During scheduling, the ResourceManager ensures that a queue on a certain partition can get its fair share of resources according to the capacity. For example:

  • A single application submitted to Queue A with node label expression “X” can get a maximum of 20 containers because Queue A has 100% capacity for label “X”.
  • A single application submitted to Queue A with node label expression “Y” can get a maximum of 10 containers.
  • If there are several applications from different users submitted to Queue A with node label expression “Y”, the total number of containers that they can get could reach the maximum capacity of Queue A for label “Y”, which is 100%, meaning 20 containers in all.
  • If there are resource requests from Queue B for label “Y” after Queue A has consumed more than 50% of the resources on label “Y”, Queue B gets its fair share for label “Y” only gradually, as containers are released from Queue A and Queue A returns to its configured capacity of 50%. If preemption is enabled, Queue B gets its share more quickly by preempting containers from Queue A.

Specifying a node label in your application

You can specify a node label in one of several ways.

  • By using provided Java APIs (a minimal sketch follows this list):
    • ApplicationSubmissionContext.setNodeLabelExpression(..) sets the node label expression for all containers of the application.
    • ResourceRequest.setNodeLabelExpression(..) sets the node label expression for an individual resource request. This overrides the node label expression set in ApplicationSubmissionContext.
    • ApplicationSubmissionContext.setAMContainerResourceRequest(..) takes a ResourceRequest whose node label expression indicates the expected node label for the ApplicationMaster container. This also overrides the application-level node label expression for that container.
  • By specifying a node label for a MapReduce job. You can use the following properties:
    • mapreduce.job.node-label-expression
      All the containers of the MapReduce job will run with this node label expression.
    • mapreduce.job.am.node-label-expression
      This is the node label configuration for the MapReduce ApplicationMaster container. If it is not configured, mapreduce.job.node-label-expression is used instead.
    • mapreduce.map.node-label-expression
      This is the node label configuration for map task containers. If it is not configured, mapreduce.job.node-label-expression is used instead.
    • mapreduce.reduce.node-label-expression
      This is the node label configuration for reduce task containers. If it is not configured, mapreduce.job.node-label-expression is used instead.

    If the ApplicationMaster, map, or reduce container’s node label expression hasn’t been set, the job-level setting of mapreduce.job.node-label-expression is used instead. If that property is not set either, the queue’s default node label expression is used; if the queue doesn’t define one, the Default partition is used.

    Here is an example that uses the node label expression “X” for map tasks:

    yarn jar /usr/iop/<iopversion>/hadoop-mapreduce/hadoop-mapreduce-examples.jar    wordcount -D mapreduce.map.node-label-expression="X" /tmp/source /tmp/result
  • By specifying a node label for jobs that are submitted through the distributed shell. You can set this node label through -node_label_expression. For example:
    yarn org.apache.hadoop.yarn.applications.distributedshell.Client    -jar /usr/iop/<iopversion>/hadoop-yarn/hadoop-yarn-applications-distributedshell.jar    -num_containers 10 --container_memory 1536    -node_label_expression "X" -shell_command "sleep 60"
  • By specifying a node label for Spark jobs. Spark enables you to set a node label expression for ApplicationMaster containers and task containers separately through --conf spark.yarn.am.nodeLabelExpression and --conf spark.yarn.executor.nodeLabelExpression. For example:
    ./bin/spark-submit --class org.apache.spark.examples.SparkPi    --master yarn --deploy-mode cluster --driver-memory 3g --executor-memory 2g    --conf spark.yarn.am.nodeLabelExpression=Y    --conf spark.yarn.executor.nodeLabelExpression=Y jars/spark-examples.jar 10
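
The following minimal Java sketch pulls together the API calls from the first bullet above. It shows only the label-related calls; a real client must also fill in the rest of the ApplicationSubmissionContext (ContainerLaunchContext, queue, application name, and so on) before submitting, and the resource sizes here are arbitrary example values:

import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NodeLabelSubmissionSketch {
  public static void main(String[] args) throws Exception {
    // Create and start a YARN client
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Ask the ResourceManager for a new application and its submission context
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext appContext = app.getApplicationSubmissionContext();

    // Application-level expression: applies to all containers unless overridden
    appContext.setNodeLabelExpression("X");

    // ApplicationMaster container: a ResourceRequest whose label expression
    // overrides the application-level one for the AM container only
    ResourceRequest amRequest = ResourceRequest.newInstance(
        Priority.newInstance(0),        // request priority
        ResourceRequest.ANY,            // no specific host or rack
        Resource.newInstance(1024, 1),  // 1 GB, 1 vcore (example values)
        1);                             // one AM container
    amRequest.setNodeLabelExpression("X");
    appContext.setAMContainerResourceRequest(amRequest);

    // ... set the ContainerLaunchContext, queue, and other fields, then:
    // yarnClient.submitApplication(appContext);
  }
}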

Recommended versions

The YARN Node Labels feature was introduced in Apache Hadoop 2.6, but it was not mature in that first official release. The recommended versions are 2.8 and later, which include many fixes and improvements. For IOP, support begins with IOP 4.2.5, which is based on Apache Hadoop 2.7.3; it includes all the important node label fixes and improvements and has been thoroughly tested by us.

[{"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSCRJT","label":"IBM Db2 Big SQL"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

UID

ibm16260093