When YARN is enabled for Big SQL, YARN controls how much memory and CPU is assigned to Big SQL at any given point in time. Other applications such as Spark or Hive can continue to use YARN resources when YARN is enabled for Big SQL. Read more about Big SQL and YARN integration. Several YARN and Big SQL configuration parameters come into play when YARN is enabled for Big SQL. This blog discusses these configuration parameters and the best practices for setting them.
YARN Container Memory Settings
The total amount of memory as well as the minimum container and maximum container memory sizes can be specified in the YARN settings from Ambari. YARN has a range of container sizes because each application requesting memory from YARN can request different amounts of memory. The YARN memory settings can be configured from the YARN->Configs->Settings->Memory section. In the picture below, the cluster is configured such that 80% of the memory resources are used for YARN. The parameter names have been added in blue as they will be referred to in the sections below.
YARN Container CPU Settings
The number of virtual cores (vcores) used for each container can also be specified from Ambari, as can the overall percentage of physical CPU given to YARN. The YARN CPU settings can be configured from the YARN->Configs->Settings->CPU section. In the picture below, the cluster is configured such that 80% of the CPU resources are used for YARN. The parameter names have been added in blue as they will be referred to in the sections below.
Big SQL YARN Configuration and Recommendations
To allow YARN to manage the resources assigned to Big SQL, check the enable_yarn setting in the ‘Advanced bigsql-env’ configuration section, then restart the Big SQL service (as well as any other component for which Ambari recommends a restart). The bigsql_resource_allocation configuration parameter, as described in increasing Big SQL memory, is ignored when YARN is enabled for Big SQL. Only the Big SQL worker nodes are managed by the YARN resource and scheduling algorithms. The management nodes are not affected.
There are several configuration options under the Advanced bigsql-slider-env tab. The bigsql_container_mem setting determines the size of each Big SQL YARN container in MB. The bigsql_container_vcore setting specifies how many virtual cores are assigned to each container. By default, a Big SQL YARN container has 28672MB (28GB) of memory and 4 virtual cores. Changing any of these ‘bigsql-slider-env’ settings requires a Big SQL and YARN service restart (as well as a restart of any other component for which Ambari recommends one).
It is recommended that bigsql_container_mem be a multiple of yarn.scheduler.minimum-allocation-mb. In this example, 28GB is a multiple of the 2GB YARN minimum container size. This is important because YARN allocates resources in units of this minimum allocation. If bigsql_container_mem is not a multiple of yarn.scheduler.minimum-allocation-mb, increase bigsql_container_mem to the next multiple.
The default bigsql_container_mem was chosen based on internal testing. It is recommended that bigsql_container_mem be set to at least 28GB.
It is recommended that bigsql_container_vcore be a multiple of yarn.scheduler.minimum-allocation-vcores. For performance reasons, it may also be desirable to increase the bigsql_container_vcore setting based on the following formula:
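The rounding described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not part of Big SQL or Ambari; the function name round_up_to_multiple is hypothetical, and the 3072MB minimum in the second call is an assumed value chosen to show the rounding.

```python
def round_up_to_multiple(container_mem_mb: int, min_alloc_mb: int) -> int:
    """Return container_mem_mb rounded up to the next multiple of min_alloc_mb.

    Hypothetical helper: container_mem_mb stands for bigsql_container_mem and
    min_alloc_mb for yarn.scheduler.minimum-allocation-mb, both in MB.
    """
    if container_mem_mb <= 0 or min_alloc_mb <= 0:
        raise ValueError("both values must be positive")
    # Integer ceiling division: number of minimum allocation units needed.
    units = -(-container_mem_mb // min_alloc_mb)
    return units * min_alloc_mb


if __name__ == "__main__":
    # 28672MB is already a multiple of a 2048MB minimum, so it is unchanged.
    print(round_up_to_multiple(28672, 2048))
    # With an assumed 3072MB minimum, the value is raised to the next multiple.
    print(round_up_to_multiple(28672, 3072))
```

The value returned is the smallest bigsql_container_mem that is at least the requested size and still a whole number of YARN minimum allocation units.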
bigsql_container_vcore = max(2, (bigsql_container_mem / yarn.nodemanager.resource.memory-mb) x yarn.nodemanager.resource.cpu-vcores)
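As a rough sketch, the formula can be evaluated as follows. The function name is hypothetical, the 16 vcores used in the example is an assumed value (the original does not state one), and rounding the proportional share down to a whole number is also an assumption, since the formula itself does not say how fractions are handled.

```python
def recommended_container_vcores(container_mem_mb: int,
                                 nm_memory_mb: int,
                                 nm_vcores: int) -> int:
    """Evaluate max(2, container_mem / nm_memory x nm_vcores).

    container_mem_mb -> bigsql_container_mem
    nm_memory_mb     -> yarn.nodemanager.resource.memory-mb
    nm_vcores        -> yarn.nodemanager.resource.cpu-vcores
    Assumption: the proportional share is rounded down to an integer.
    """
    if nm_memory_mb <= 0:
        raise ValueError("nm_memory_mb must be positive")
    # Share of the node's vcores proportional to the container's memory share.
    proportional = (container_mem_mb * nm_vcores) // nm_memory_mb
    return max(2, proportional)


if __name__ == "__main__":
    # 28GB container on a node with 100GB for YARN and an assumed 16 vcores.
    print(recommended_container_vcores(28672, 102400, 16))
```

If yarn.scheduler.minimum-allocation-vcores is larger than 1, the result may still need to be adjusted to a multiple of that value, as recommended above.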
Since resources need to be shared among Big SQL and other YARN applications, Big SQL utilizes 50% of the YARN resources by default. If no other services are running on the system, or if Big SQL should intentionally be given more resources, then Big SQL YARN containers can be activated, or flexed up, with the ‘Advanced bigsql-slider-flex’ -> bigsql_capacity parameter. If Big SQL later needs fewer resources, Big SQL YARN containers can be deactivated, or flexed down, by adjusting this parameter. Configuring this parameter does not require a Big SQL or YARN service restart.
Consider carefully the percentage chosen for the bigsql_capacity setting. This percentage determines the number of Big SQL YARN containers that will be activated at any point in time. Since the recommendations above involve adjusting bigsql_container_vcore based on the bigsql_container_mem setting, the calculations below only need to consider memory resources. First calculate the maximum number of Big SQL containers that can be activated.
max_num_bigsql_container = INT(yarn.nodemanager.resource.memory-mb / bigsql_container_mem)
bigsql_capacity = num_bigsql_container x bigsql_container_mem / yarn.nodemanager.resource.memory-mb x 100
In our example, yarn.nodemanager.resource.memory-mb is set to 100GB and bigsql_container_mem is set to 28GB.
max_num_bigsql_container = INT(100/28) = 3
num_bigsql_container can be 3, 2 or 1. Therefore, if 3, 2 or 1 Big SQL YARN containers are activated:
bigsql_capacity = 3 x 28/100 x 100 = 84
bigsql_capacity = 2 x 28/100 x 100 = 56
bigsql_capacity = 1 x 28/100 x 100 = 28
This means that if bigsql_capacity is set to 84-100%, 3 Big SQL YARN containers will be activated. If bigsql_capacity is set to 56-83%, 2 YARN containers will be activated. If bigsql_capacity is set to 28-55%, 1 YARN container will be activated.
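The calculation above can be modelled with a short Python sketch. It only reproduces the arithmetic in this blog, assuming the container count is rounded down for a given bigsql_capacity; it is not the actual Big SQL implementation, and all function names are hypothetical. Memory values are in MB (100GB = 102400MB, 28GB = 28672MB).

```python
def max_containers(nm_memory_mb: int, container_mem_mb: int) -> int:
    """INT(yarn.nodemanager.resource.memory-mb / bigsql_container_mem)."""
    return nm_memory_mb // container_mem_mb


def active_containers(capacity_pct: int, nm_memory_mb: int,
                      container_mem_mb: int) -> int:
    """Containers that fit in capacity_pct percent of YARN memory.

    Assumption: the count is rounded down and capped at max_containers.
    """
    fit = (capacity_pct * nm_memory_mb) // (100 * container_mem_mb)
    return min(fit, max_containers(nm_memory_mb, container_mem_mb))


def min_capacity_for(num_containers: int, nm_memory_mb: int,
                     container_mem_mb: int) -> int:
    """Smallest whole-number bigsql_capacity that activates num_containers."""
    # Integer ceiling of num_containers x container_mem / nm_memory x 100.
    return -(-(num_containers * container_mem_mb * 100) // nm_memory_mb)


if __name__ == "__main__":
    nm_mb, container_mb = 102400, 28672
    for pct in (28, 50, 56, 84, 100):
        print(pct, active_containers(pct, nm_mb, container_mb))
    for n in (1, 2, 3):
        print(n, min_capacity_for(n, nm_mb, container_mb))
```

A helper like min_capacity_for makes it easy to pick a bigsql_capacity value that sits exactly on a container boundary instead of just below one.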
Let us say you have Spark and Big SQL jobs running simultaneously on your system. 84% of resources is probably too much for Big SQL. Closer to 50% makes more sense. But since Big SQL tries to round down instead of round up, you may need to adjust bigsql_capacity slightly higher than 50%.
In our example, 56% will perform much better than 50% because 2 Big SQL YARN containers will be activated instead of 1.
For added performance, the intra-partition parallelism recommendations for logical Big SQL workers can also be applied when YARN is enabled for Big SQL. However, the DFT_DEGREE parameter is not dynamic and changing it requires a Big SQL service restart. If the bigsql_capacity setting is flexed up or down frequently, it may not be desirable to reconfigure DFT_DEGREE each time. If bigsql_capacity stays unchanged most of the time, you can tune the DFT_DEGREE parameter according to this formula:
num_bigsql_container * DFT_DEGREE(y) <= num_cores * bigsql_capacity / 100
db2 update db cfg for bigsql using DFT_DEGREE y;
Then restart the Big SQL service.
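As an illustration, the largest y that satisfies the inequality can be computed as below. The function name is hypothetical, num_cores is assumed here to mean the number of cores on a worker node, the 16 cores in the example is an assumed value, and clamping the result to at least 1 is also an assumption.

```python
def max_dft_degree(num_cores: int, capacity_pct: int,
                   num_containers: int) -> int:
    """Largest y with num_containers * y <= num_cores * capacity_pct / 100.

    Assumption: the result is never allowed to fall below 1.
    """
    if num_containers <= 0:
        raise ValueError("num_containers must be positive")
    # Integer floor of (num_cores * capacity_pct / 100) / num_containers.
    degree = (num_cores * capacity_pct) // (100 * num_containers)
    return max(1, degree)


if __name__ == "__main__":
    # Assumed 16 cores, bigsql_capacity of 56% and 2 active containers.
    print(max_dft_degree(16, 56, 2))
```

The value printed is the y to substitute into the db2 update db cfg command above.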
For more details on an internal study with YARN enabled for Big SQL, read Performance Impact of Big SQL with YARN integration.
Summary
There are several YARN and Big SQL configuration parameters that need to be configured when YARN is enabled for Big SQL. Adjusting the bigsql_capacity setting is dynamic and does not require a Big SQL or YARN restart.
Thanks to the following major contributors to this work: Hebert Pereyra, Metin Kalayci, Diego Santesteban, Armando Paniagua, Xiao Wei Zhang, Abhayan Sundararajan