IBM Support

Big SQL Integration with YARN Configuration Options - Hadoop Dev

Technical Blog Post


When YARN is enabled for Big SQL, YARN controls how much memory and CPU is assigned to Big SQL at any given point in time. Other applications such as Spark or Hive can continue to use YARN resources when YARN is enabled for Big SQL. Read more about Big SQL and YARN integration. Several YARN and Big SQL configuration parameters come into play when YARN is enabled for Big SQL. This blog discusses these parameters and the best practices for setting them.

YARN Container Memory Settings

The total amount of memory as well as the minimum container and maximum container memory sizes can be specified in the YARN settings from Ambari. YARN has a range of container sizes because each application requesting memory from YARN can request different amounts of memory. The YARN memory settings can be configured from the YARN->Configs->Settings->Memory section. In the picture below, the cluster is configured such that 80% of the memory resources are used for YARN. The parameter names have been added in blue as they will be referred to in the sections below.

YARN Memory Settings

YARN Container CPU Settings

The number of virtual cores (vcores) used for each container can also be specified from Ambari, as well as the overall percentage of physical CPU given to YARN. The YARN CPU settings can be configured from the YARN->Configs->Settings->CPU section. In the picture below, the cluster is configured such that 80% of the CPU resources are used for YARN. The parameter names have been added in blue as they will be referred to in the sections below.

YARN Container CPU Settings

Big SQL YARN Configuration and Recommendations

To allow YARN to manage the resources assigned to Big SQL, check the enable_yarn setting in the ‘Advanced bigsql-env’ configuration section, then restart the Big SQL service (as well as any other component that Ambari recommends restarting). The bigsql_resource_allocation configuration parameter, as described in increasing Big SQL memory, is ignored when YARN is enabled for Big SQL. Only the Big SQL worker nodes are managed by the YARN resource and scheduling algorithms; the management nodes are not affected.

Enable YARN and Slider for Big SQL

There are several configuration options under the Advanced bigsql-slider-env tab. The bigsql_container_mem setting determines the size of each Big SQL YARN container in MB. The bigsql_container_vcore setting specifies how many virtual cores are assigned to each container. The default Big SQL YARN container has 28672MB (28GB) of memory and 4 virtual cores. Changing any of these ‘bigsql-slider-env’ settings requires a Big SQL and YARN service restart (as well as a restart of any other component that Ambari recommends restarting).

Slider Flex Big SQL Configuration Options

It is recommended that bigsql_container_mem be a multiple of yarn.scheduler.minimum-allocation-mb. In this example, 28GB is a multiple of the 2GB YARN minimum container size. This is important because YARN allocates resources in units of this minimum allocation. If yarn.scheduler.minimum-allocation-mb is not a factor of bigsql_container_mem, increase bigsql_container_mem to the next multiple.
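As a minimal sketch of the multiple-of-minimum check, the helper below rounds a requested container size up to the next multiple of the YARN minimum allocation. The function name round_up_to_multiple is hypothetical and is not part of Big SQL or YARN; the 2048MB minimum comes from the example above.

```python
def round_up_to_multiple(container_mem_mb: int, min_alloc_mb: int) -> int:
    """Return the smallest multiple of min_alloc_mb that is >= container_mem_mb.

    Hypothetical helper for illustration; both arguments are in MB.
    """
    if container_mem_mb <= 0 or min_alloc_mb <= 0:
        raise ValueError("sizes must be positive")
    # Ceiling division using integers only, to avoid float rounding.
    units = -(-container_mem_mb // min_alloc_mb)
    return units * min_alloc_mb


# The default 28672MB is already 14 x 2048MB, so it is returned unchanged.
print(round_up_to_multiple(28672, 2048))
# A size that is not a multiple is increased to the next multiple.
print(round_up_to_multiple(30000, 2048))
```

The result is the value to enter for bigsql_container_mem when the originally intended size is not a multiple of yarn.scheduler.minimum-allocation-mb.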

Advanced Slider Flex Big SQL Configuration Options

The default bigsql_container_mem was chosen based on internal testing. It is recommended that bigsql_container_mem be set to at least 28GB.

It is recommended that bigsql_container_vcore be a multiple of yarn.scheduler.minimum-allocation-vcores. For performance reasons, it may also be desirable to increase the bigsql_container_vcore setting based on the following formula:
bigsql_container_vcore = max(2, (bigsql_container_mem / yarn.nodemanager.resource.memory-mb) x yarn.nodemanager.resource.cpu-vcores)
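The formula can be evaluated as in the sketch below. The source does not say how a fractional result should be treated, so rounding down to a whole vcore is an assumption here, and the 16-vcore node is a hypothetical value chosen only for illustration. The function name is also hypothetical.

```python
def suggested_container_vcores(container_mem_mb: int,
                               node_mem_mb: int,
                               node_vcores: int) -> int:
    """Evaluate max(2, container_mem / node_mem x node_vcores).

    Assumption: a fractional share of vcores is rounded down.
    """
    if min(container_mem_mb, node_mem_mb, node_vcores) <= 0:
        raise ValueError("all inputs must be positive")
    # Integer arithmetic: multiply first, then floor-divide.
    proportional = (container_mem_mb * node_vcores) // node_mem_mb
    return max(2, proportional)


# 28GB container on a node with 100GB for YARN and (hypothetically) 16 vcores:
# 28672 * 16 / 102400 = 4.48, rounded down to 4.
print(suggested_container_vcores(28672, 102400, 16))
# On a small node the floor of 2 vcores applies.
print(suggested_container_vcores(28672, 102400, 4))
```

If the result is not a multiple of yarn.scheduler.minimum-allocation-vcores, it would still need to be adjusted to satisfy the first recommendation.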

Since resources need to be shared among Big SQL and other YARN applications, Big SQL by default uses 50% of the YARN resources. If no other services are running on the system, or if Big SQL should intentionally be given more resources, then Big SQL YARN containers can be activated (flexed up) with the ‘Advanced bigsql-slider-flex’ -> bigsql_capacity parameter. Later, if fewer resources are needed for Big SQL, Big SQL YARN containers can be deactivated (flexed down) by adjusting this parameter. Configuring this parameter does not require a Big SQL or YARN service restart.

Consider carefully the percentage chosen for the bigsql_capacity setting. This percentage determines the number of Big SQL YARN containers that will be activated at any point in time. Since the recommendations above involve adjusting bigsql_container_vcore based on the bigsql_container_mem setting, the calculations below only need to consider memory resources. First calculate the maximum number of Big SQL containers that can be activated.

    max_num_bigsql_container = INT(yarn.nodemanager.resource.memory-mb / bigsql_container_mem)
    bigsql_capacity = num_bigsql_container x bigsql_container_mem / yarn.nodemanager.resource.memory-mb x 100

    In our example, yarn.nodemanager.resource.memory-mb is set to 100GB and bigsql_container_mem is set to 28GB.

    max_num_bigsql_container = INT(100 / 28) = 3
    num_bigsql_container can be 3, 2 or 1

    Therefore if 3, 2 or 1 Big SQL YARN containers are activated:
    bigsql_capacity = 3 x 28 / 100 x 100 = 84
    bigsql_capacity = 2 x 28 / 100 x 100 = 56
    bigsql_capacity = 1 x 28 / 100 x 100 = 28

    This means that if bigsql_capacity is set to 84-100%, 3 Big SQL YARN containers will be activated.
    If bigsql_capacity is set between 56-83%, 2 YARN containers will be activated.
    If bigsql_capacity is set between 28-55%, 1 YARN container will be activated.
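The calculation above can be reproduced for other cluster sizes with a short script. This is only a sketch of the arithmetic in this section; the function name capacity_thresholds is hypothetical, and both memory values must be given in the same unit.

```python
def capacity_thresholds(node_mem: int, container_mem: int) -> dict:
    """Map each possible container count to the bigsql_capacity (%) it needs.

    node_mem stands for yarn.nodemanager.resource.memory-mb and
    container_mem for bigsql_container_mem, in the same unit.
    """
    if node_mem <= 0 or container_mem <= 0:
        raise ValueError("memory sizes must be positive")
    max_containers = node_mem // container_mem  # INT(node_mem / container_mem)
    return {
        n: n * container_mem * 100 / node_mem
        for n in range(1, max_containers + 1)
    }


# Example from this section: 100GB for YARN per node, 28GB per container.
for count, percent in capacity_thresholds(100, 28).items():
    print(f"{count} container(s) need bigsql_capacity >= {percent:g}%")
```

Each printed percentage is the lowest bigsql_capacity value at which that many containers fit.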

Let us say you have Spark and Big SQL jobs running simultaneously on your system. 84% of the resources is probably too much for Big SQL; closer to 50% makes more sense. But because Big SQL rounds the number of containers down rather than up, you may need to set bigsql_capacity slightly higher than 50%.
In our example, 56% will perform much better than 50% because 2 Big SQL YARN containers will be activated instead of 1.
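The effect of rounding down can be checked with the sketch below. It models only the behavior described in this section (the number of containers is the capacity share of YARN memory divided by the container size, rounded down) and is not taken from the Big SQL implementation; the function name is hypothetical.

```python
def containers_activated(capacity_percent: int,
                         node_mem: int,
                         container_mem: int) -> int:
    """Estimate how many Big SQL YARN containers a bigsql_capacity activates.

    Assumption: whole containers only, rounded down, as described in the text.
    node_mem and container_mem must use the same unit.
    """
    if not 0 <= capacity_percent <= 100:
        raise ValueError("capacity_percent must be between 0 and 100")
    if node_mem <= 0 or container_mem <= 0:
        raise ValueError("memory sizes must be positive")
    # Integer form of floor((capacity / 100 * node_mem) / container_mem).
    return (capacity_percent * node_mem) // (100 * container_mem)


# With 100GB for YARN and 28GB containers, 50% activates one container
# while 56% activates two.
print(containers_activated(50, 100, 28))
print(containers_activated(56, 100, 28))
```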

For added performance, the intra-partition parallelism recommendations for logical Big SQL workers can also be applied when YARN is enabled for Big SQL. However, the DFT_DEGREE parameter is not dynamic and changing it requires a Big SQL service restart. If the bigsql_capacity setting is flexed up or down frequently, it may not be desirable to reconfigure this parameter each time. If bigsql_capacity stays unchanged most of the time, you can tune the DFT_DEGREE parameter according to this formula:

    num_bigsql_container * DFT_DEGREE(y) <= num_cores * bigsql_capacity / 100
    db2 update db cfg for bigsql using DFT_DEGREE y;
    restart the Big SQL service
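A small sketch for finding the largest y that satisfies the inequality above follows. The function name is hypothetical, the 32-core node is an assumed value for illustration, and clamping the result to at least 1 is also an assumption rather than something stated in this post.

```python
def max_dft_degree(num_containers: int,
                   num_cores: int,
                   capacity_percent: int) -> int:
    """Largest integer y with num_containers * y <= num_cores * capacity / 100.

    Assumption: the result is clamped to a minimum of 1.
    """
    if num_containers <= 0 or num_cores <= 0:
        raise ValueError("num_containers and num_cores must be positive")
    if not 0 < capacity_percent <= 100:
        raise ValueError("capacity_percent must be in (0, 100]")
    # Integer form of floor(num_cores * capacity / 100 / num_containers).
    degree = (num_cores * capacity_percent) // (100 * num_containers)
    return max(1, degree)


# Hypothetical node with 32 cores, bigsql_capacity of 56% and 2 containers:
# 2 * y <= 32 * 56 / 100 = 17.92, so the largest y is 8.
print(max_dft_degree(2, 32, 56))
```

The resulting value would replace y in the db2 update db cfg command shown above.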

For more details on an internal study with YARN enabled for Big SQL, read Performance Impact of Big SQL with YARN integration.

Summary

There are several YARN and Big SQL configuration parameters that need to be configured when YARN is enabled for Big SQL. Adjusting the bigsql_capacity setting is dynamic and does not require a Big SQL or YARN restart.

Thanks to the following major contributors to this work: Hebert Pereyra, Metin Kalayci, Diego Santesteban, Armando Paniagua, Xiao Wei Zhang, Abhayan Sundararajan
