A deep dive into YARN CGroups - Hadoop Dev

The term cgroups refers to “control groups”, a Linux kernel feature that limits the resource usage (CPU, I/O, memory, and network) of a group of processes. Apache YARN leverages this feature to provide CPU isolation for Hadoop workloads. Currently, YARN supports only the limiting of CPU usage with cgroups.

The cgroups feature is useful when you are managing multiple workloads running concurrently on a Hadoop cluster. For example, if Spark runs on YARN and aggressively consumes CPU resources, it competes with other services and applications for those resources. Cgroups gives you the option to set a limit on CPU usage for all YARN containers on a node manager, which guarantees that YARN applications can run with a predictable amount of CPU.

How it works in YARN

The fundamental concept behind YARN resource allocation is the container. A container is the minimum resource allocation unit, with fixed limits for memory and vcores. This makes it possible to limit the memory and vcore resources for a task (process) running within a container. When a node manager cannot provide enough resources for a container, the task cannot launch. Similarly, cgroups monitors the CPU usage of all containers on a node manager and stops allocating CPU time to containers when the limit is reached. At that point, the task process enters a waiting state until adequate resources become available. The following list outlines the process:

  • A user defines how much CPU time (in percent) the node manager is allowed to use.
  • When the user submits a job, the application master calculates the number of processors that each task needs (the number of vcores that a container requests), and then submits a request to the resource manager.
  • The resource manager allocates containers to the appropriate nodes, which are able to provide the required resources.
  • The node manager calculates a CPU usage limit (a percentage value) for each container and writes that value to the cgroups configuration file.
  • The node manager launches containers to complete tasks.
  • The Linux kernel guarantees that each container runs within the CPU limit.

Internally, YARN only writes and updates the cgroups configuration files on each node manager; the kernel's cgroups subsystem then enforces the CPU usage of containers. For information on how to configure YARN cgroups, see Using cgroups with YARN. You can configure cgroups with either soft or hard limits.
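
For reference, a minimal yarn-site.xml sketch that turns on cgroups-based CPU isolation for the node manager might look like the following. The hierarchy and mount settings shown here are the common defaults, but this is only an illustration; confirm the exact properties and values against the Using cgroups with YARN documentation for your distribution.

<!-- Illustrative sketch; verify values for your environment -->
<property>
  <name>yarn.nodemanager.container-executor.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.resources-handler.class</name>
  <value>org.apache.hadoop.yarn.server.nodemanager.util.CgroupsLCEResourcesHandler</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.hierarchy</name>
  <value>/hadoop-yarn</value>
</property>
<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.mount</name>
  <value>false</value>
</property>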

Soft and hard limits

The total CPU usage for all YARN containers on a node manager will not exceed the value set by the yarn.nodemanager.resource.percentage-physical-cpu-limit property, regardless of whether soft or hard limits are used.

<property>
  <name>yarn.nodemanager.resource.percentage-physical-cpu-limit</name>
  <value>60</value>
</property>

This property sets the CPU limit for all containers on a node manager, which means that the total CPU usage of all containers at any given time will not exceed 60% of the node's total CPU capacity.

By default, YARN cgroups uses soft limits, which allow containers to run beyond the defined CPU limit if there are spare resources on the system. With a hard limit, containers cannot use more CPU time than they were allocated, even if spare resources are available. The switch between soft and hard limits is controlled by the following property:

<property>
  <name>yarn.nodemanager.linux-container-executor.cgroups.strict-resource-usage</name>
  <value>true</value>
</property>

A soft limit is often more suitable for long-running services. Although a container might persist for a long time, it might not always require the CPU share that it originally claimed. When there are idle CPU resources, other containers running on the same node manager can use the spare CPU time to run their tasks. Hard limits, on the other hand, guarantee that each container has a fixed CPU share and never has to wait for CPU resources, which makes response times more predictable.
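
Under the covers, a soft limit is expressed through the cpu.shares file of each container's cgroup (a relative weight, typically scaled by the number of requested vcores), while a hard limit additionally writes the cpu.cfs_period_us and cpu.cfs_quota_us files. Assuming cgroups v1 with the CPU controller mounted at /sys/fs/cgroup/cpu and the default /hadoop-yarn hierarchy (both of which can differ on your nodes), you can inspect these files on a node manager:

# List the per-container cgroups created by the node manager
ls /sys/fs/cgroup/cpu/hadoop-yarn/

# Relative CPU weight used for soft limits
cat /sys/fs/cgroup/cpu/hadoop-yarn/container_*/cpu.shares

# Quota/period pair used when strict-resource-usage (hard limits) is enabled;
# a quota of -1 means no hard cap is applied
cat /sys/fs/cgroup/cpu/hadoop-yarn/container_*/cpu.cfs_period_us
cat /sys/fs/cgroup/cpu/hadoop-yarn/container_*/cpu.cfs_quota_us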

An example

To show how the cgroups feature works on YARN, the following job launches a specific number of containers by using the distributed shell. Each container then uses a shell script that runs multiple infinite loops concurrently.

yarn jar hadoop-yarn-applications-distributedshell.jar org.apache.hadoop.yarn.applications.distributedshell.Client -shell_script multi_thread_loop.sh -shell_args 4  -num_containers 16 -container_memory 1024 -container_vcores 3 -master_memory 1024 -master_vcores 1 -priority 10  
  • shell_args: The number of threads that a container launches.
  • num_containers: The number of containers that the job launches.
  • container_vcores: The number of vcores that a container requests.

Tip: multi_thread_loop.sh is a shell script that calls a Java executable, which in turn launches a number of threads, each of which runs an infinite loop.
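
The script itself is not included in this post. As a rough illustration only, a pure-shell stand-in that produces the same load pattern (N CPU-bound busy loops, where N comes from -shell_args) could look like this:

#!/bin/bash
# Hypothetical stand-in for multi_thread_loop.sh: spin up N busy loops.
# $1 is the thread count passed via -shell_args.
NUM_THREADS=${1:-1}
for ((i = 0; i < NUM_THREADS; i++)); do
  while :; do :; done &   # each background loop drives one CPU core toward 100%
done
wait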

Consider a single node pseudo cluster with 8 physical CPU cores. In the YARN configuration, the maximum number of vcores for each node manager is set to 6:

<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>6</value>
</property>

The CPU limit is set to 50%:

<property>
  <name>yarn.nodemanager.resource.percentage-physical-cpu-limit</name>
  <value>50</value>
</property>

Note: YARN detects the number of physical cores by parsing the /proc/cpuinfo file.
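
To see what that file contains on a given node, you can count its processor entries. Note that this counts logical processors, which on hyper-threaded machines is higher than the physical core count:

grep -c '^processor' /proc/cpuinfo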

In soft limit mode, YARN containers on a node manager simply share CPU resources, and each container is allowed to run within a relative share of those resources. The size of the share depends on the number of vcores that a container requests. This means that when a container requests 2 vcores, it will receive about twice as much CPU time as a container that requests 1 vcore. And, as stated earlier, the total CPU usage will not exceed the physical limit (defined in yarn.nodemanager.resource.percentage-physical-cpu-limit).

In hard limit mode, YARN uses the following formula to calculate the CPU limit for each container.

C: number of physical CPU cores
P: strict limit percent
V: number of vcores on a single node manager
A: number of vcores for a single container

CPU_LIMIT = C * P * A / V

In our example, a single container requests one vcore. The CPU limit for this container will be

CPU_LIMIT = C * P * A / V = 8 * 50% * 1 / 6 = 2/3 = 66.7%

And if there is another container requesting 2 vcores, the CPU limit for that container will be

CPU_LIMIT = C * P * A / V = 8 * 50% * 2 / 6 = 4/3 = 133%
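
You can sanity-check these two numbers from any shell, for example with awk:

awk 'BEGIN { C = 8; P = 0.5; V = 6; for (A = 1; A <= 2; A++) printf "A=%d vcore(s): CPU_LIMIT = %.1f%%\n", A, C * P * A / V * 100 }'

This prints 66.7% and 133.3%, matching the calculations above.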

Note: Unlike soft limit mode, hard limit mode guarantees that each container always runs within its CPU limit.

Scenarios

Case 1: Launch a job with identical vcore requests for each container

Job1  NumOfContainers=4  NumOfVcorePerContainer=1  NumOfThreadsPerContainer=1

The following image shows example output from the Linux top command (press 1 to toggle the per-core view), which you can use to review the usage of each physical CPU core.

[Image 1: per-core CPU usage from top for Case 1 with soft limits]

In this soft limit scenario, the node has 8 physical CPU cores and 6 vcores, so it can launch all 4 containers. Each container receives the same CPU share because they all requested 1 vcore. Because each container runs one thread in an infinite loop, its usage climbs toward 100% of a core. However, total usage does not exceed half of the physical CPU capacity (4 * (~100%) <= 8 * 100% * 50%).

[Image 2: per-core CPU usage from top for Case 1 with hard limits]

In this hard limit scenario, each container’s CPU usage is capped at two-thirds (66.7%) of a physical CPU core (8 * 50% * 1 / 6). No container exceeds this limit, even though roughly one-third of each core remains idle.

Case 2: Launch two jobs, each job requesting a different number of vcores

Job1  NumOfContainers=1  NumOfVcorePerContainer=1  NumOfThreadsPerContainer=1
Job2  NumOfContainers=1  NumOfVcorePerContainer=2  NumOfThreadsPerContainer=2

[Image 3: per-core CPU usage from top for Case 2 with soft limits]

In this soft limit scenario, Job2’s container consumes 200% of the CPU resources (2 physical cores), while Job1’s container consumes 100%. The container for Job2 uses twice as much CPU resource because Job2 requests 2 vcores per container and Job1 requests 1 vcore.

[Image 4: per-core CPU usage from top for Case 2 with hard limits]

In this hard limit scenario, the CPU limit for Job1’s container is (8 * 50% * 1 / 6 = 66.7%), and the CPU limit for Job2’s container is (8 * 50% * 2 / 6 = 133%). Each container’s CPU limit is honored and the CPU limit is strictly controlled, even if spare CPU resource is available.

Case 3: Launch two jobs and use up all vcores on a node manager

Job1  NumOfContainers=2  NumOfVcorePerContainer=1  NumOfThreadsPerContainer=1
Job2  NumOfContainers=2  NumOfVcorePerContainer=2  NumOfThreadsPerContainer=2

[Image 5: per-core CPU usage from top for Case 3, soft and hard limits]

In this case, soft and hard limit scenarios produce similar results. Job1 and Job2 request a total of 4 containers with 6 vcores (2 * 1 + 2 * 2), which uses all of the predefined vcores on the node manager. Soft limit mode ensures that each container gets its share based on its vcore request. Hard limit mode caps Job1's containers at 66.7% and Job2's containers at 133%, as calculated earlier.

Enabling CPU scheduling

When using the Linux cgroups feature, we recommend that you enable CPU scheduling by setting the following property:

<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>

Using DominantResourceCalculator enables YARN to consider both memory and vcores when assigning containers, whereas the default DefaultResourceCalculator considers only memory and ignores vcores. You also need to set an appropriate number of vcores for each node manager. For example:

<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>16</value>
</property>

You can then let each container request a reasonable number of vcores. The following example applies to a MapReduce task:

<property>
  <name>mapreduce.map.cpu.vcores</name>
  <value>2</value>
</property>

<property>
  <name>mapreduce.reduce.cpu.vcores</name>
  <value>4</value>
</property>

Tuning these parameters for different types of jobs until there is no obvious resource bottleneck will help to constrain resource usage without compromising performance.
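
These values can also be overridden per job on the command line, assuming the job parses generic options through ToolRunner as the bundled examples do (the jar name and program arguments below are illustrative):

yarn jar hadoop-mapreduce-examples.jar pi -Dmapreduce.map.cpu.vcores=2 -Dmapreduce.reduce.cpu.vcores=4 16 1000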

Configuring the Linux container executor

If you are using cgroups in a non-Kerberos environment, the Linux container executor runs all tasks as the user “nobody” by default, which causes permission issues when tasks access local directories. It is recommended that you let tasks run as the submitting user by setting the following property:

<property>
  <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users</name>
  <value>false</value>
</property>

Known issues

The YARN cgroups implementation depends heavily on the kernel's cgroups support, and cgroups was not stable in some kernel versions. Be sure to use at least kernel-3.10.0-327.44.2 on RHEL 7.2, RHEL 7.3, or later.
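
Before enabling cgroups, you can quickly confirm the kernel version and that the cgroup hierarchies are mounted on each node:

uname -r              # kernel version on this node
mount | grep cgroup   # verify that cgroup controllers are mounted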

[{"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSCRJT","label":"IBM Db2 Big SQL"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

UID

ibm16260091