Big SQL Workload Management Stage 1 - Monitoring

by Nailah Bissoon
Senior Big SQL Technical Architect

The first stage of implementing a customized workload management (WLM) configuration is to monitor the resources on the cluster when queries are executing. In this blog, we will demonstrate some techniques that can be used to monitor the system and point out relevant Big SQL metrics that can be used to derive a customized WLM configuration. Consult an overview of WLM for Big SQL for more information on the capabilities of workload management.

Monitoring System Resources

At a system level, CPU and memory utilization should be studied when the workload is executing. In most Big Data applications, queries are more complex in nature, and these workloads tend to run low on memory. Utilities such as nmon, vmstat or Ambari cluster monitoring can be used to monitor system resources for each of the Big SQL worker nodes. Here is an example of a system that is configured so that Big SQL utilizes 90% of the system resources. Note the very high run queue(r) which is an indication that the system is over-saturated. This is one indication that a customized workload management configuration maybe needed.

Monitoring Big SQL Memory Utilization

Big SQL tends to reserve its memory upfront. Therefore, the low free memory (free) shown in the output above could be normal. There are several major components within Big SQL such as the DB2 and the DFSIO components. It is useful to look at the memory utilized for each of these components because WLM only has control of the DB2 component. The amount of memory available to Big SQL can be determined by running this command:

    db2 get dbm cfg  | grep INSTANCE_MEMORY   Global instance memory (% or 4KB)     (INSTANCE_MEMORY) = 90

The result is the percentage of cluster memory that is available to Big SQL. This command can be used to monitor the memory utilization for the major Big SQL components across all the nodes in the cluster. The output of just one of the Big SQL workers is shown below.

    db2_all “db2pd -dbptnmem” > dbptnmem.out   Database Member 1 -- Active -- Up 17 days 03:48:39 -- Date 2018-02-09-19.59.50.671246  Database Member Memory Controller Statistics    Controller Automatic: N  Controller License Limit: N  Controller Limit Enforced: Y    Memory Limit:         29633468 KB  Current usage:        29631120 KB  HWM usage:            29633424 KB  Cached memory:        4082176 KB    Individual Memory Consumers:    Name             Mem Used (KB) HWM Used (KB) Cached (KB)  ========================================================  APPL-BIGSQL              26304        160000        1024  DBMS-bigsql             134784        134784           0  FMP_RESOURCES          2324416       5243520     2319424  PRIVATE                 132416        138560           0  FCM_RESOURCES          1370832       2411152           0  DFSRW_PRIVATE          9521600       9610240           0  DB-BIGSQL             16120768      16189184     1761728

The component utilizing the most memory should be DB-BIGSQL which should utilize ~60% of the available Big SQL instance memory. The other component which should utilize the remaining memory is the DFSIO component (reported as DFSRW_PRIVATE above). Also, more commonly known as the readers/ writers or IO interfaces. Big SQL WLM only governs the resources assigned to the DB2 component.
If it is observed that the DB2-BIGSQL memory is much lower than 60% of the Big SQL instance memory and the DFSIO memory is much higher than 40%, it is recommended to contact IBM support before trying to customize the WLM configuration on your system.

DB2 Big SQL Major Memory Consumers

Big SQL queries typically consist of operations such as aggregation, groupby, sort and hash joins. These operators utilize memory from the DB2 shared sort heap (SHEAPTHRES_SHR). The available DB2 shared sort and database memory can be queried by running the following command:

    db2 get db cfg for bigsql | grep SHEAPTHRES_SHR  Sort heap thres for shared sorts (4KB) (SHEAPTHRES_SHR) = AUTOMATIC(12914436)  db2 get db cfg for bigsql | grep DATABASE_MEMORY  Size of database shared memory (4KB)  (DATABASE_MEMORY) = AUTOMATIC(16146097)

Note in both cases AUTOMATIC is set, this means that Big SQL’s self tuning memory manager is enabled which is the default behavior. Also note that most of the memory available to the Big SQL database is utilized by shared sort, this is because Big SQL correctly recognized the need for additional shared sort memory and automatically increased this heap.
Big SQL will try to fit as many 4K pages as it can in this shared sort memory. When shared sort memory gets exceeded, pages spill to temporary disks. If these pages are required again, they will be read into our DB2 bufferpool which is another major memory consumer in Big SQL. The DB2 bufferpool utilizes most of the remaining DATABASE_MEMORY. If we want to determine which queries are utilizing the most memory one angle is to look at the shared sort memory utilization across the queries.

Monitoring the DB2 component at the Query Level

Big SQL Monitoring can be categorized either as adhoc or historical. Adhoc monitoring involves invoking monitoring table functions while queries are running i.e. this is more of a hands-on approach to monitoring. Historical monitoring involves using event monitor tables to record query metrics in monitoring tables. The advantage of using historical monitoring is that when the event monitors are active they can capture all the metrics for all queries running on the system. It is more of a set it and come back later to analyze the data approach to monitoring. Big SQL Job History Analysis using Event Monitors shows how to create the event monitors for Big SQL and gives some examples of how to query the data captured from event monitors. When customizing the WLM configuration on your cluster event monitors created on all the worker nodes will be used.

Analysis of the Collected Data

If queries are failing due to DB2 out of memory errors then it is worthwhile to identify the queries that are utilizing the most memory. As explained in the section above, we can monitor the shared sort memory usage looking for queries that utilize the most shared sort memory. The statement below can be used to query the event monitor tables to report the maximum shared sort memory used by all the query classes. Queries are grouped into light weight and heavy weight classes because the default WLM configuration uses this classification. Queries with timeron cost of less than 150000 are grouped into the light weight category and queries with greater than 150000 timerons are grouped into the heavy-weight class.

    WITH V1 AS (SELECT CASE WHEN QUERY_COST_ESTIMATE < 150000 THEN 'LIGHT-WEIGHT' WHEN QUERY_COST_ESTIMATE > 150000 THEN 'HEAVY-WEIGHT' END AS TARGET_WORKCLASS, QUERY_COST_ESTIMATE, APPL_NAME, SORT_SHRHEAP_TOP, timestampdiff(4,TIME_COMPLETED - TIME_STARTED) AS LIFETIME_DURATION FROM ACTIVITY_DB2ACTIVITIES), V2 AS (SELECT TARGET_WORKCLASS, QUERY_COST_ESTIMATE, APPL_NAME, SORT_SHRHEAP_TOP, LIFETIME_DURATION FROM V1) SELECT TARGET_WORKCLASS, AVG(QUERY_COST_ESTIMATE) AS AVG_COST, COUNT(*) AS COUNT, SUM(LIFETIME_DURATION) AS TOT_LIFETIME_MINS, AVG(SORT_SHRHEAP_TOP)  AS AVG_SORT_MEM4K, MAX(SORT_SHRHEAP_TOP) AS MAX_SORT_MEM4K, MAX(LIFETIME_DURATION) AS MAX_LIFETIME_MINS, INT(SUM(LIFETIME_DURATION)/COUNT(*)) AS AVG_LIFETIME_MINS FROM V2 WHERE APPL_NAME LIKE 'db2jcc_application' GROUP BY TARGET_WORKCLASS,APPL_NAME ORDER BY TARGET_WORKCLASS;    TARGET_WORKCLASS AVG_COST             COUNT                             TOT_LIFETIME_MINS AVG_SORT_MEM4K         MAX_SORT_MEM4K       MAX_LIFETIME_MINS AVG_LIFETIME_MINS  ---------------- -------------------- --------------------------------- ----------------- -------------------- -------------------- ----------------- -----------------  HEAVY-WEIGHT                  8803799                             4848.             32471               403278              9723719               446                 6  LIGHT-WEIGHT                     3221                             2110.                 0                  102                11814                 0                 0

In this example, there is at least one query in this application that is utilizing ~9M of the 12M 4K shared sort pages. Also, there is at least one query that is taking 446mins to execute. We can drill down into the queries that are taking long to execute by executing a SQL query such as:

    select timestampdiff(4,TIME_COMPLETED - TIME_STARTED) AS ACT_LIFETIME, SORT_SHRHEAP_TOP, SUBSTR(STMT_TEXT, 1, 200) AS STMT_TEXT from ACTIVITYSTMT_DB2ACTIVITIES as ast, ACTIVITY_DB2ACTIVITIES as av where ast.appl_id=av.appl_id and ast.uow_id=av.uow_id and ast.activity_id=av.activity_id order by ACT_LIFETIME desc fetch first 40 rows only;    ACT_LIFETIME SORT_SHRHEAP_TOP     STMT_TEXT   ------------ -----------------------------------------------------------------------------------  446         847281 -- start query 67 in stream 7 using template query78.tpl and seed 2000679862  with ws as    (select d_year AS ws_sold_year, ws_item_sk,      ws_bill_customer_sk ws_customer_sk  …

We can also drill down into the queries that is utilizing the most DB2 shared sort memory by running a query such as:

    select timestampdiff(4,TIME_COMPLETED - TIME_STARTED) AS ACT_LIFETIME, QUERY_TIMERON_COST, SORT_SHRHEAP_TOP, SUBSTR(STMT_TEXT, 1, 200) AS STMT_TEXT from ACTIVITYSTMT_DB2ACTIVITIES as ast, ACTIVITY_DB2ACTIVITIES as av where ast.appl_id=av.appl_id and ast.uow_id=av.uow_id and ast.activity_id=av.activity_id order by SORT_SHRHEAP_TOP desc fetch first 40 rows only;    ACT_LIFETIME QUERY_TIMERON_COST SORT_SHRHEAP_TOP STMT_TEXT    ------------ -------------------------------------------------------------------------------------------------------  7            25890940              9723719 -- start query 59 in stream 2 using template query4.tpl and seed 668399603  with year_total as (   select c_customer_id customer_id         ,c_first_name customer_first_name         ,c_last_name customer_              …

Here we can also see that this query has a timeron cost of ~25M. Now we have identified the queries that are taking a long time to execute and are taking the most resources, it is recommended to consult the performance best practices at Big SQL v5 performance tips to determine whether any of these tips could be applied to those queries. At some point you may still end up with queries that are resource intensive and the workload management configuration on the cluster may need to be customized to help control how many long running or resource intensive queries execute concurrently on the system. Consult Stage 2 of customized WLM configuration – establishing gatekeeper concurrency controls for details on customizing the WLM configuartion on the cluster.

Conclusion

In this blog we studied how to monitor the system in preparation for customizing WLM configuration. In subsequent blogs we will show how to use this data to customize the WLM configuration on your cluster.

Tips

Big SQL Workload Management Stage 1 - Monitoring - Hadoop Dev

Technical Blog Post

Abstract

Body