YARN Slider Integration in Big SQL Overview

Technical Blog Post



YARN Introduction

YARN is the central resource management component that manages and allocates resources to applications running on Hadoop. Multiple applications can run on Hadoop and share resources through this common resource manager. Apache Slider allows non-YARN services to be managed by YARN. Big SQL has been integrated with YARN through Apache Slider since Big SQL 4.2.5, and the 5.0.1 release brought further improvements in this area. When a component requests resources from YARN, the request is placed on a YARN priority queue and YARN schedules the allocation based on priority. When the request is serviced, YARN determines the number of containers to allocate to that component.

Big SQL integration with YARN Overview

When YARN is enabled for Big SQL, YARN controls how much memory and CPU is assigned to Big SQL at any given point in time. Other applications such as Spark or Hive can continue to use YARN resources when YARN is enabled for Big SQL. YARN container settings include memory and CPU configuration parameters. Adjusting these parameters requires a restart of YARN (as well as of any other component that Ambari recommends restarting). It is recommended that each compute node have the same amount of memory and the same number of CPU cores.
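The YARN container settings referred to above live in yarn-site.xml. The fragment below shows the standard Hadoop properties involved; the values are purely illustrative examples for a node with 96 GB and 24 vcores available to YARN, not recommendations for any particular cluster:

```xml
<!-- yarn-site.xml: example values only -->
<property>
  <!-- Memory (MB) on each node that YARN may hand out to containers -->
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>98304</value>
</property>
<property>
  <!-- Virtual cores on each node that YARN may hand out to containers -->
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>24</value>
</property>
<property>
  <!-- Largest single container allocation YARN will grant -->
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>98304</value>
</property>
```

As noted above, changing any of these requires a YARN restart before the new values take effect.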

Logical Big SQL workers were introduced in Big SQL 4.2.5 to boost Big SQL performance. Increasing the number of logical Big SQL workers on each compute node allows for better query parallelism. Big SQL integration with YARN configures logical Big SQL workers under the covers. It is highly recommended to read the blog on logical Big SQL workers before continuing.

The Big SQL YARN integration architecture stipulates that the resources of one Big SQL YARN container are always assigned to one logical Big SQL worker. In the diagram below, each red box represents a Big SQL YARN container. Note the 1:1 relationship between the number of logical workers and the number of YARN containers.
Big SQL YARN Integration Architecture

When YARN is enabled for Big SQL, the maximum number of Big SQL YARN containers that can be used for the logical Big SQL workers is first determined. This calculation assumes that 100% of the YARN resources will be allocated to Big SQL. In practice, Big SQL may not use all available YARN resources; for example, if Big SQL and Spark jobs run side by side, Big SQL might use 50% of the YARN resources and Spark the other 50%. Therefore, not every logical Big SQL worker is necessarily assigned the resources of a YARN container. When a logical Big SQL worker is assigned the resources of a YARN container, we refer to this as the activation of a Big SQL YARN container in this blog and other Big SQL and YARN blogs.
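The exact sizing formula is internal to Big SQL, but the shape of the calculation can be sketched as below. The function name, parameters, and per-container sizes are illustrative assumptions; only the per-node cap of 7 containers and the cluster-wide cap of 999 come from the limitations documented later in this post:

```python
# Hypothetical sketch of the maximum Big SQL YARN container count,
# assuming 100% of YARN resources go to Big SQL. Names and container
# sizes are illustrative, not the actual Big SQL internals.

def max_bigsql_containers(node_mem_mb, node_vcores,
                          container_mem_mb, container_vcores,
                          num_nodes):
    # Containers that fit on one node, bounded by whichever resource
    # (memory or vcores) runs out first
    per_node = min(node_mem_mb // container_mem_mb,
                   node_vcores // container_vcores)
    per_node = min(per_node, 7)            # documented per-node cap
    return min(per_node * num_nodes, 999)  # documented cluster-wide cap

# Example: 96 GB / 24 vcores per node, 12 GB / 3 vcores per container,
# 4 compute nodes -> 8 containers fit per node, capped at 7, so 28 total
print(max_bigsql_containers(98304, 24, 12288, 3, 4))  # -> 28
```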

The blog Big SQL Integration with YARN Configuration Options gives more details on the bigsql_capacity setting, which controls the percentage of resources allocated to Big SQL versus other applications managed by YARN. For example, when this setting is set to 50%, half of the Big SQL YARN containers are activated. If the setting is later changed to 100%, the maximum number of Big SQL YARN containers is activated.
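The mapping from bigsql_capacity to activated containers can be sketched as simple arithmetic. The rounding behavior below (rounding down to a whole container) is an assumption for illustration, not Big SQL's documented behavior:

```python
# Illustrative sketch: how bigsql_capacity could map to the number of
# activated Big SQL YARN containers. Rounding down is an assumption.

def activated_containers(max_containers, bigsql_capacity_pct):
    return (max_containers * bigsql_capacity_pct) // 100

# With a maximum of 28 containers:
print(activated_containers(28, 50))   # half activated  -> 14
print(activated_containers(28, 100))  # all activated   -> 28
```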

Activating Big SQL YARN containers by flexing up the bigsql_capacity setting does not require a restart of YARN or Big SQL, because the maximum number of logical workers is pre-determined and configured when YARN is enabled for Big SQL. Deactivating Big SQL YARN containers by flexing down this setting likewise requires no restart of Big SQL or YARN. The major advantage of enabling YARN for Big SQL is this dynamic flex-up and flex-down capability.

The Big SQL YARN integration feature is not a performance feature. For details on an internal study we conducted, read Performance Implications when YARN is enabled. The major advantage of Big SQL and YARN integration is the flexibility and dynamic nature of the activation and deactivation of Big SQL YARN containers.

Limitations with Big SQL and YARN Integration

  • There are some limitations when High Availability (HA) is used with YARN in Big SQL 5.0.1 and prior releases. These have been addressed in Big SQL 5.0.2 where the only limitation is that HA must be enabled before enabling YARN
  • Kerberos support when YARN is enabled for Big SQL is available in the Big SQL 5.0.1 release
  • The maximum number of YARN containers that can be allocated on each compute node is 7
  • The maximum number of YARN containers that can be allocated across all compute nodes is 999
  • YARN can at times distribute Big SQL YARN containers unevenly across the compute nodes, which can be sub-optimal for performance

Thanks to the following major contributors to this work: Hebert Pereyra, Metin Kalayci, Diego Santesteban, Armando Paniagua, Xiao Wei Zhang, Abhayan Sundararajan

UID

ibm16259843