by Nailah Bissoon
Senior Big SQL Technical Architect

The first stage of implementing a customized workload management (WLM) configuration is to monitor the resources on the cluster when queries are executing. In this blog, we will demonstrate some techniques that can be used to monitor the system and point out relevant Big SQL metrics that can be used to derive a customized WLM configuration. Consult an overview of WLM for Big SQL for more information on the capabilities of workload management.

Monitoring System Resources

At a system level, CPU and memory utilization should be studied when the workload is executing. In most Big Data applications, queries are more complex in nature, and these workloads tend to run low on memory. Utilities such as nmon, vmstat or Ambari cluster monitoring can be used to monitor system resources for each of the Big SQL worker nodes. Here is an example of a system that is configured so that Big SQL utilizes 90% of the system resources. Note the very high run queue(r) which is an indication that the system is over-saturated. This is one indication that a customized workload management configuration maybe needed.

Monitoring Big SQL Memory Utilization

Big SQL tends to reserve its memory upfront. Therefore, the low free memory (free) shown in the output above could be normal. There are several major components within Big SQL such as the DB2 and the DFSIO components. It is useful to look at the memory utilized for each of these components because WLM only has control of the DB2 component. The amount of memory available to Big SQL can be determined by running this command:


db2 get dbm cfg  | grep INSTANCE_MEMORY
 Global instance memory (% or 4KB)     (INSTANCE_MEMORY) = 90

The result is the percentage of cluster memory that is available to Big SQL. This command can be used to monitor the memory utilization for the major Big SQL components across all the nodes in the cluster. The output of just one of the Big SQL workers is shown below.


db2_all “db2pd -dbptnmem” > dbptnmem.out 
Database Member 1 -- Active -- Up 17 days 03:48:39 -- Date 2018-02-09-19.59.50.671246
Database Member Memory Controller Statistics

Controller Automatic: N
Controller License Limit: N
Controller Limit Enforced: Y

Memory Limit:         29633468 KB
Current usage:        29631120 KB
HWM usage:            29633424 KB
Cached memory:        4082176 KB

Individual Memory Consumers:

Name             Mem Used (KB) HWM Used (KB) Cached (KB)
========================================================
APPL-BIGSQL              26304        160000        1024
DBMS-bigsql             134784        134784           0
FMP_RESOURCES          2324416       5243520     2319424
PRIVATE                 132416        138560           0
FCM_RESOURCES          1370832       2411152           0
DFSRW_PRIVATE          9521600       9610240           0
DB-BIGSQL             16120768      16189184     1761728

The component utilizing the most memory should be DB-BIGSQL which should utilize ~60% of the available Big SQL instance memory. The other component which should utilize the remaining memory is the DFSIO component (reported as DFSRW_PRIVATE above). Also, more commonly known as the readers/ writers or IO interfaces. Big SQL WLM only governs the resources assigned to the DB2 component.
If it is observed that the DB2-BIGSQL memory is much lower than 60% of the Big SQL instance memory and the DFSIO memory is much higher than 40%, it is recommended to contact IBM support before trying to customize the WLM configuration on your system.

DB2 Big SQL Major Memory Consumers

Big SQL queries typically consist of operations such as aggregation, groupby, sort and hash joins. These operators utilize memory from the DB2 shared sort heap (SHEAPTHRES_SHR). The available DB2 shared sort and database memory can be queried by running the following command:


db2 get db cfg for bigsql | grep SHEAPTHRES_SHR
Sort heap thres for shared sorts (4KB) (SHEAPTHRES_SHR) = AUTOMATIC(12914436)
db2 get db cfg for bigsql | grep DATABASE_MEMORY
Size of database shared memory (4KB)  (DATABASE_MEMORY) = AUTOMATIC(16146097)

Note in both cases AUTOMATIC is set, this means that Big SQL’s self tuning memory manager is enabled which is the default behavior. Also note that most of the memory available to the Big SQL database is utilized by shared sort, this is because Big SQL correctly recognized the need for additional shared sort memory and automatically increased this heap.
Big SQL will try to fit as many 4K pages as it can in this shared sort memory. When shared sort memory gets exceeded, pages spill to temporary disks. If these pages are required again, they will be read into our DB2 bufferpool which is another major memory consumer in Big SQL. The DB2 bufferpool utilizes most of the remaining DATABASE_MEMORY. If we want to determine which queries are utilizing the most memory one angle is to look at the shared sort memory utilization across the queries.

Monitoring the DB2 component at the Query Level

Big SQL Monitoring can be categorized either as adhoc or historical. Adhoc monitoring involves invoking monitoring table functions while queries are running i.e. this is more of a hands-on approach to monitoring. Historical monitoring involves using event monitor tables to record query metrics in monitoring tables. The advantage of using historical monitoring is that when the event monitors are active they can capture all the metrics for all queries running on the system. It is more of a set it and come back later to analyze the data approach to monitoring. Big SQL Job History Analysis using Event Monitors shows how to create the event monitors for Big SQL and gives some examples of how to query the data captured from event monitors. When customizing the WLM configuration on your cluster event monitors created on all the worker nodes will be used.

Analysis of the Collected Data

If queries are failing due to DB2 out of memory errors then it is worthwhile to identify the queries that are utilizing the most memory. As explained in the section above, we can monitor the shared sort memory usage looking for queries that utilize the most shared sort memory. The statement below can be used to query the event monitor tables to report the maximum shared sort memory used by all the query classes. Queries are grouped into light weight and heavy weight classes because the default WLM configuration uses this classification. Queries with timeron cost of less than 150000 are grouped into the light weight category and queries with greater than 150000 timerons are grouped into the heavy-weight class.


WITH V1 AS (SELECT CASE WHEN QUERY_COST_ESTIMATE < 150000 THEN 'LIGHT-WEIGHT' WHEN QUERY_COST_ESTIMATE > 150000 THEN 'HEAVY-WEIGHT' END AS TARGET_WORKCLASS, QUERY_COST_ESTIMATE, APPL_NAME, SORT_SHRHEAP_TOP, timestampdiff(4,TIME_COMPLETED - TIME_STARTED) AS LIFETIME_DURATION FROM ACTIVITY_DB2ACTIVITIES), V2 AS (SELECT TARGET_WORKCLASS, QUERY_COST_ESTIMATE, APPL_NAME, SORT_SHRHEAP_TOP, LIFETIME_DURATION FROM V1) SELECT TARGET_WORKCLASS, AVG(QUERY_COST_ESTIMATE) AS AVG_COST, COUNT(*) AS COUNT, SUM(LIFETIME_DURATION) AS TOT_LIFETIME_MINS, AVG(SORT_SHRHEAP_TOP)  AS AVG_SORT_MEM4K, MAX(SORT_SHRHEAP_TOP) AS MAX_SORT_MEM4K, MAX(LIFETIME_DURATION) AS MAX_LIFETIME_MINS, INT(SUM(LIFETIME_DURATION)/COUNT(*)) AS AVG_LIFETIME_MINS FROM V2 WHERE APPL_NAME LIKE 'db2jcc_application' GROUP BY TARGET_WORKCLASS,APPL_NAME ORDER BY TARGET_WORKCLASS;

TARGET_WORKCLASS AVG_COST             COUNT                             TOT_LIFETIME_MINS AVG_SORT_MEM4K       
MAX_SORT_MEM4K       MAX_LIFETIME_MINS AVG_LIFETIME_MINS
---------------- -------------------- --------------------------------- ----------------- -------------------- -------------------- ----------------- -----------------
HEAVY-WEIGHT                  8803799                             4848.             32471               403278              9723719               446                 6
LIGHT-WEIGHT                     3221                             2110.                 0                  102                11814                 0                 0

In this example, there is at least one query in this application that is utilizing ~9M of the 12M 4K shared sort pages. Also, there is at least one query that is taking 446mins to execute. We can drill down into the queries that are taking long to execute by executing a SQL query such as:


select timestampdiff(4,TIME_COMPLETED - TIME_STARTED) AS ACT_LIFETIME, SORT_SHRHEAP_TOP, SUBSTR(STMT_TEXT, 1, 200) AS STMT_TEXT from ACTIVITYSTMT_DB2ACTIVITIES as ast, ACTIVITY_DB2ACTIVITIES as av where ast.appl_id=av.appl_id and ast.uow_id=av.uow_id and ast.activity_id=av.activity_id order by ACT_LIFETIME desc fetch first 40 rows only;

ACT_LIFETIME SORT_SHRHEAP_TOP     STMT_TEXT 
------------ -----------------------------------------------------------------------------------
446         847281 -- start query 67 in stream 7 using template query78.tpl and seed 2000679862
with ws as
  (select d_year AS ws_sold_year, ws_item_sk,
    ws_bill_customer_sk ws_customer_sk
…

We can also drill down into the queries that is utilizing the most DB2 shared sort memory by running a query such as:


select timestampdiff(4,TIME_COMPLETED - TIME_STARTED) AS ACT_LIFETIME, QUERY_TIMERON_COST, SORT_SHRHEAP_TOP, SUBSTR(STMT_TEXT, 1, 200) AS STMT_TEXT from ACTIVITYSTMT_DB2ACTIVITIES as ast, ACTIVITY_DB2ACTIVITIES as av where ast.appl_id=av.appl_id and ast.uow_id=av.uow_id and ast.activity_id=av.activity_id order by SORT_SHRHEAP_TOP desc fetch first 40 rows only;

ACT_LIFETIME QUERY_TIMERON_COST SORT_SHRHEAP_TOP STMT_TEXT

------------ -------------------------------------------------------------------------------------------------------
7            25890940              9723719 -- start query 59 in stream 2 using template query4.tpl and seed 668399603
with year_total as (
 select c_customer_id customer_id
       ,c_first_name customer_first_name
       ,c_last_name customer_            
…

Here we can also see that this query has a timeron cost of ~25M. Now we have identified the queries that are taking a long time to execute and are taking the most resources, it is recommended to consult the performance best practices at Big SQL v5 performance tips to determine whether any of these tips could be applied to those queries. At some point you may still end up with queries that are resource intensive and the workload management configuration on the cluster may need to be customized to help control how many long running or resource intensive queries execute concurrently on the system. Consult Stage 2 of customized WLM configuration – establishing gatekeeper concurrency controls for details on customizing the WLM configuartion on the cluster.

Conclusion

In this blog we studied how to monitor the system in preparation for customizing WLM configuration. In subsequent blogs we will show how to use this data to customize the WLM configuration on your cluster.

2 comments on"Big SQL Workload Management Stage 1 – Monitoring"

  1. […] Workload management is needed when resource intensive queries consume significant resources on the cluster which slow down or inhibit other queries from executing. The most effective way to control work is to control when it can start. Once a piece of work starts executing, it consumes CPU, I/O and memory resources. The recommended approach to workload management is to use gatekeeper concurrency thresholds to control when work can start execution and then, supplement that control with CPU prioritization techniques if applicable.Workload management is enabled by default for Big SQL. There is one gatekeeper concurrency control which allows a certain number of heavy weight statements to execute on the system at a time. The default WLM configuration can be adapted to introduce more work classes. Adding additional work classes allows for finer grained control. The intention is to restrict the number of resource intensive queries but to not have the same level of restriction for simpler or medium classed queries. Once these additional work classes are created, the gatekeeper concurrency controls define the level of restriction that need to be imposed. In this blog, we will focus on creating additional work classes and establishing gatekeeper concurrency controls for workloads running on the system. In a subsequent blog we will discuss how to assign CPU resources percentages for each workload. It is recommended to implement the technique described in this blog if queries are returning DB2 OOM (out of memory) errors and resource intensive queries have been identified in Stage 1 of WLM custom configuration. […]

  2. […] be applied. The steps to monitor and analyze the work executing on the cluster can be found in Big SQL Workload Management – Stage 1 Monitoring. It is recommended to keep monitoring the cluster to keep an eye out for new […]

Leave a Reply to Big SQL Workload Management for Improved System Stability and Performance - biva Cancel reply

Your email address will not be published. Required fields are marked *