
Big SQL v5 Best Practices and Guidelines - Performance - Hadoop Dev

Technical Blog Post


Abstract

Big SQL v5 Best Practices and Guidelines - Performance - Hadoop Dev

Body

This blog outlines some best practices for improving Big SQL performance in Big SQL v5. The tips are ordered according to when we believe they should be implemented: installation first, then schema design, then runtime tuning.

1. Increase resource allocation

A Hadoop cluster with Big SQL installed and YARN not enabled has its resources split between Big SQL and all other Hadoop components. The ‘Big SQL resource allocation’ install property specifies how much of the cluster’s resources (CPU and memory) are assigned to Big SQL. One way to improve Big SQL query performance is to allocate more cluster resources to Big SQL. If Big SQL is already installed, changing the Big SQL resource allocation setting from Ambari has no effect, but other steps to increase the resource allocation are provided in Increasing Big SQL Memory. When YARN is enabled for Big SQL, there is no need to adjust this parameter.

2. Update data directories

The Big SQL default temporary tablespace holds temporary data required by Big SQL when performing operations such as sorts. If these sorts cannot fit in memory, they spill to the disks associated with the temporary tablespace. When adding the Big SQL service in Ambari, use the ‘Data directories’ field to specify the paths to the temporary tablespace. These directories should be accessible from all the nodes where Big SQL is installed on the cluster. As a best practice, the path or paths specified should be backed by multiple spindles for good performance. Paths that reside on the root disk (for example, /hadoop/bigsql) should NOT be used. Ensure that the Hadoop group has write permission on the path(s) specified.
[Figure: Big SQL Data Directories]
For improved Big SQL performance, the root disk should also not be used for the HDFS “DataNode directories”. For example, if /hadoop/hdfs/data is listed as one of the paths along with others such as /data1/hadoop/hdfs/data, remove /hadoop/hdfs/data if that path uses the root disk.
If Big SQL has already been installed, the temporary tablespace can be altered using the procedure outlined in How to redistribute Big SQL storage paths.

3. Choose appropriate data types

The STRING data type is commonly used in Hive. In Big SQL, the STRING type is mapped internally to VARCHAR(32K). The Big SQL engine internally works with data in units of 32K pages and works most efficiently when the definition of a table allows a row to fit within 32K of memory. Once the calculated row size exceeds 32K, performance degradation can be seen for certain queries. Therefore, the use of STRING when defining columns of a table is strongly discouraged. Instead, replace references to STRING with an explicit VARCHAR(n) that most appropriately fits the data size, or use the bigsql.string.size property to lower the default size of the VARCHAR to which STRING is mapped when creating new tables.
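For illustration, here is a minimal sketch (the table and column names are hypothetical) of a table defined with explicit VARCHAR(n) columns, plus one way to lower the default STRING mapping for tables created in the current session:

    CREATE HADOOP TABLE customer_reviews (
      review_id   BIGINT,
      reviewer    VARCHAR(100),    -- instead of STRING, which maps to VARCHAR(32K)
      review_text VARCHAR(2000)
    );

    -- Lower the default size that STRING maps to for new tables
    SET HADOOP PROPERTY 'bigsql.string.size' = '256';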
The use of the TIMESTAMP and DATE data types with the PARQUET storage format is discouraged when LOAD HADOOP or INSERT…SELECT from Hive is used to generate the data files. More information can be found in Performance impact of accessing TIMESTAMP fields from Big SQL with Parquet MR files.
When tables created from Hive with the DATE type are synchronized with Big SQL, the Java I/O interface is used. For cases where the Java I/O engine is used, the recommended file format is ORC. Other data types that are supported only by the Java I/O interface include complex types and the binary type; for these cases, ORC should be used instead of Parquet. Read more about the data types supported in Big SQL v5.

4. Choose an optimized storage format

The storage format used to store your Big SQL data is a critical contributor to the performance and efficiency of your cluster. Not only does the storage format have a direct influence on query times, it also impacts the volume of data stored (due to compression), the CPU consumed to read and process that data, and data ingest times. Big SQL 5.0.1 brought significant performance improvements for the ORC file format. In our internal testing using the TPC-DS benchmark, performance with the ORC and Parquet file formats is on par for most queries. Although Big SQL supports many of the popular Hadoop storage formats, the Parquet or ORC storage format is recommended for analytical workloads on Big SQL v5.0.1 and later. For cases where the Java I/O engine is used, the recommended file format is ORC, as described in the previous section.
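As a minimal sketch (hypothetical tables and columns), the storage format is chosen with the STORED AS clause on the CREATE HADOOP TABLE statement:

    -- Parquet-backed table
    CREATE HADOOP TABLE sales_parquet (
      order_id BIGINT,
      amount   DECIMAL(10,2)
    ) STORED AS PARQUET;

    -- ORC-backed table (recommended when the Java I/O interface is used)
    CREATE HADOOP TABLE sales_orc (
      order_id BIGINT,
      amount   DECIMAL(10,2)
    ) STORED AS ORC;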

5. PARTITION the data

Partitioning is a common practice in SQL-over-Hadoop solutions today. The aim of partitioning a table is to reduce the amount of data a query must process to arrive at the final answer. By reducing the data processed, the query completes faster and consumes fewer resources. Data partitions are stored as separate directories and files on HDFS, and Big SQL eliminates unneeded data directories and files during query processing. The more files that are eliminated during query processing, the faster queries can be executed. Specify the “PARTITIONED BY” clause on the “CREATE HADOOP TABLE” statement to take advantage of partitioning.
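As a minimal sketch (hypothetical table and columns), a partitioned table and a query that benefits from partition elimination might look like this:

    CREATE HADOOP TABLE web_events (
      event_id   BIGINT,
      user_id    BIGINT,
      event_type VARCHAR(50)
    )
    PARTITIONED BY (event_date VARCHAR(10))
    STORED AS PARQUET;

    -- Only the directories for the matching event_date partition are scanned
    SELECT COUNT(*) FROM web_events WHERE event_date = '2017-10-01';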
Big SQL v4.2 introduced several new and unique techniques to better exploit partitioned data sets, which expands the use cases for partitioning in Big SQL beyond those offered by other SQL-over-Hadoop solutions. One of these features is join range predicate filtering with deferred partition elimination, in which smart predicates are injected automatically and used to eliminate additional partition directories during query processing. Partitioning on a column expression is another feature that can be immensely powerful because it allows users to partition on high-cardinality columns, which previously was not practical. See it in action in the linked video.

6. Data Ingestion

Regardless of the storage format chosen, it is important to understand that HDFS is optimized for storing and processing large files. If your data ingest techniques create thousands of small files, then this can severely impact the performance and scalability of the Hadoop cluster. Read more on Best practices for Big SQL Data Ingestion.
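As one hedged illustration (table and column names are hypothetical), a table that has accumulated many small files can be rewritten through Big SQL with a single INSERT…SELECT, which typically produces far fewer, larger files:

    -- Create a new table with the same columns, then rewrite the data into it
    CREATE HADOOP TABLE web_logs_compacted (
      log_ts TIMESTAMP,
      url    VARCHAR(500),
      status INT
    ) STORED AS PARQUET;

    INSERT INTO web_logs_compacted
      SELECT log_ts, url, status FROM web_logs;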

7. Use ANALYZE to gather statistics

This is an important step in ensuring your Big SQL queries are executed efficiently and expeditiously. The ANALYZE command gathers statistics about the data within a table, which are used by the Big SQL cost-based optimizer to plan efficient queries. The Auto-Analyze feature, enabled by default since Big SQL 4.2, automatically collects statistics for all tables and all columns immediately following a LOAD or HCAT_SYNC_OBJECTS statement. If tables are ingested via INSERT…SELECT/CTAS or by adding files directly to HDFS, an auto-analyze task is scheduled. If a table changes by more than 50%, auto-analyze is also triggered automatically on that table. ANALYZE uses sampling to speed up statistics collection significantly. Since ANALYZE is now automatic, there is normally no need to collect statistics manually. We also provide mechanisms whereby auto-analyze can be scheduled to execute at different times of the day, for example at night. For partitioned tables, ANALYZE collects incremental statistics: if a new partition is added to the table, only that new partition is analyzed.
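If statistics do need to be collected manually (for example, right after a large manual ingest), a typical invocation looks like the following sketch (the schema and table names are hypothetical):

    -- Gather table and column statistics
    ANALYZE TABLE myschema.sales COMPUTE STATISTICS FOR ALL COLUMNS;

    -- Sampling (here 10 percent) can reduce ANALYZE time on very large tables
    ANALYZE TABLE myschema.sales COMPUTE STATISTICS FOR ALL COLUMNS
      TABLESAMPLE SYSTEM (10);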
Since by default the analyze tasks are run one at a time in the background (so as not to impact workload performance), there can be circumstances where ANALYZE has not yet run before a query is executed. For such cases, we have introduced optimizer heuristics, enabled by default, that avoid poor plan operators such as nested loop joins (NLJNs) during query planning. Read more about other Big SQL runtime and optimizer improvements.

8. Define Informational Constraints

Informational constraints are much like regular primary and foreign key constraints (available in most RDBMSs today), except that in Big SQL these constraints are not enforced. One might ask what use a constraint is if it is not enforced: even though these constraints are not enforced, once defined they provide useful information that the Big SQL optimizer can use when planning a query. It is recommended to create primary and foreign key informational constraints to define the relationships that exist in your schema. In an internal study, performance improved by 2x when informational constraints were defined. Read more about Informational Constraints in Big SQL.
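A minimal sketch of defining informational constraints on hypothetical dimension and fact tables (the NOT ENFORCED clause is what makes them informational):

    -- Primary key on the dimension table
    ALTER TABLE customers
      ADD CONSTRAINT pk_customers PRIMARY KEY (customer_id) NOT ENFORCED;

    -- Foreign key relating the fact table to the dimension table
    ALTER TABLE orders
      ADD CONSTRAINT fk_orders_customers FOREIGN KEY (customer_id)
      REFERENCES customers (customer_id) NOT ENFORCED;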

9. Add Logical Big SQL Workers

Logical Big SQL workers, first introduced in Big SQL 4.2.5, can significantly improve Big SQL performance. There are guidelines for enabling logical Big SQL workers which should be followed closely to get the maximum benefit from them. Enabling too many logical Big SQL workers can have an impact on the memory utilization of the cluster, whereas adding just enough logical workers can greatly improve the parallelism and performance of Big SQL for INSERT and SELECT queries.

When YARN is enabled with Big SQL there can be performance benefits compared to when YARN is disabled, because logical Big SQL workers are activated in the background when YARN is enabled. However, keep in mind that when YARN is enabled for Big SQL, the YARN scheduler is given total control of the resources; if the scheduler decides to give resources to other YARN containers that are not housing Big SQL workers, there can be performance implications. Read more about Performance Implications of Enabling YARN for Big SQL.

10. Configuration parameters

There are other knobs within Big SQL and the larger Hadoop ecosystem that can be tweaked to improve performance. Big SQL automatically calculates the memory required for sorting, caching and various heaps based on the memory available on the cluster and the Big SQL percent usage specified. As workloads are executed, Big SQL’s Self Tuning Memory Manager (STMM) monitors memory consumption across these consumers and automatically adjusts the allocations to tune for the workload being executed.
The JAVA_HEAP_SZ configuration parameter is outside the control of STMM. Prior to Big SQL 5.0.2 the default was 300MB; in Big SQL 5.0.2 the default has been increased to 2GB. This parameter may need to be increased if there are errors during execution of ANALYZE or LOAD. For example, the following command sets the Db2 Java heap to 2GB (the value is specified in 4KB pages):
    db2 update dbm cfg using JAVA_HEAP_SZ 524288
There are new compiler and runtime configuration parameters that are enabled by default in cases where we felt there would be widespread positive impact. On a case-by-case basis, some of these options can be used to improve individual query performance.

11. Workload Management

Big SQL has a sophisticated built-in workload management technology that can be used to stabilize the cluster and prioritize some workloads ahead of others. The goal of workload management is to neither oversaturate nor underutilize the cluster; it can also help in cases where Db2 out-of-memory errors are encountered. More details can be found in Big SQL Workload Management for Improved Stability and Performance.

12. Materialized Query Tables (MQTs)

Hadoop MQTs are new in Big SQL 5.0.2. MQTs can simplify query processing and improve performance because expensive join or aggregation operations can be calculated and cached in MQTs.
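As a hedged sketch only (the tables are hypothetical, and the exact clauses supported may vary by release; check the Big SQL documentation), a user-maintained Hadoop MQT that caches an expensive aggregation might be defined and populated like this:

    -- Define the MQT over an aggregation query
    CREATE HADOOP TABLE daily_sales_mqt AS
      ( SELECT sale_date, SUM(amount) AS total_amount
        FROM sales
        GROUP BY sale_date )
      DATA INITIALLY DEFERRED REFRESH DEFERRED
      MAINTAINED BY USER;

    -- Take the user-maintained MQT out of check-pending state, then populate it
    SET INTEGRITY FOR daily_sales_mqt ALL IMMEDIATE UNCHECKED;
    INSERT INTO daily_sales_mqt
      SELECT sale_date, SUM(amount) FROM sales GROUP BY sale_date;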

13. Miscellaneous

Last, but by no means least, if you want the maximum performance out of your Big SQL cluster, be sure to check that CPU scaling is disabled.
On one of my clusters I ran into severe performance issues because the resource manager memory was too low and the cluster kept running out of memory, crippling my system. The dashboard has an indicator showing when the resource manager memory may be getting too low; in that case you can increase it. HDP best practice recommends increasing it to 4GB from the default 1GB, as described here.

14. Performance Debugging

There are lots of places where things can go wrong, leading to performance issues. There could be issues with the operating system, the cluster configuration, or memory or CPU contention; the list is long. We have provided some scripts to help gather information on the cluster when queries are running slower than usual to debug such issues. Information has been provided on this in Big SQL problem determination and performance debugging.

15. Performance Monitoring

Ad hoc performance monitoring can be done with Data Server Manager or with the monitoring table functions. Information on the monitoring table functions and the metrics they report can be found in the Monitoring section of the Big SQL Knowledge Center. Historical monitoring can also be done to answer questions such as which is the longest-running or most resource-intensive query on the system; it stores the metrics in tables instead of the in-memory caching that ad hoc monitoring uses. More information can be found in Historical Monitoring with Event Monitors.
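For example, one of the Db2 monitoring table functions can be queried directly to see which connections are consuming the most CPU (the function and columns shown are a small, illustrative choice):

    SELECT application_handle,
           application_name,
           total_cpu_time,
           rows_read
    FROM TABLE(MON_GET_CONNECTION(NULL, -2)) AS t
    ORDER BY total_cpu_time DESC
    FETCH FIRST 10 ROWS ONLY;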

Summary

This blog presented some tips to improve performance in Big SQL v5. The tips with the most impact include ensuring that ANALYZE has been executed on the system, partitioning the tables, and using the ORC or Parquet file format when running on Big SQL v5.0.1 or later. In Big SQL v5.0.1 there have been major performance improvements for the ORC file format. Adding logical Big SQL workers can also improve Big SQL performance, but use the guidelines to determine the number of logical workers that should be configured.

[{"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSCRJT","label":"IBM Db2 Big SQL"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]
