Support My IBM Log in

IBM Support

Big SQL Analyze - Improve Analyze performance with Sampling - Hadoop Dev

Technical Blog Post


Abstract

Big SQL Analyze - Improve Analyze performance with Sampling - Hadoop Dev

Body

Sampling Overview

Invoking the BigInsights Big SQL ANALYZE command typically scans the entire table and gathers statistics to feed to the Big SQL Optimizer. A lack of statistics is one major reason for inefficient access plans. But scanning the entire table, especially when the table is very large, can take some time. With this Analyze Sampling feature in Big SQL 4.2, a sample of the table can be scanned, instead of scanning the entire table. This can reduce the time taken to run Analyze tremendously. You can specify the percentage of the table that should be scanned, using the TABLESAMPLE SYSTEM option of the ANALYZE command, as shown in green in the example below:

    ANALYZE TABLE schema.table COMPUTE STATISTICS FOR COLUMNS col1, col2, col3 TABLESAMPLE SYSTEM(10);    

How Analyze Sampling Works

The Big SQL Scheduler, by design, divides the table into “splits”. When Analyze is executed, if 10% TABLESAMPLE is chosen and the Big SQL Scheduler determines that there are 500 splits for that table, then 50 splits will be used as the sample during ANALYZE. Analyze gathers statistics on these sampled splits, from which full table statistics are extrapolated (e.g. table cardinality). For some statistics, the sample statistics are not relevant, for example, number of files and partitions for the table. In such cases, these statistics are always gathered on the entire table.
When new partitions are added to a table and ANALYZE was executed once before, ANALYZE only scans partitions that were never scanned before, this is referred to as incremental analyze. Note that when the TABLESAMPLE option is chosen, ANALYZE will scans partitions from the previous data as well as the newly added partitions when performing an incremental analyze. Because a sample of the data is being scanned this should not have any impact on analyze performance.

Performance Implications of Analyze Sampling

The overall objective is to reduce the time it takes to run Analyze on tables, without regressing queries. If you choose a percentage that is too small, Analyze may run fast. However, since smaller samples increase the level of inaccuracy in the statistics, some queries could perform better if a larger percentage is chosen. From our internal studies using the TPC-DS benchmark workload, a sample percentage of 10 gave us almost the same level of query performance as not using sampling at all, while a 5x improvement in Analyze processing time was observed. Therefore all automatic ANALYZE features (Auto Analyze, Analyze after Load and Analyze after HCAT_SYNC_OBJECTS) use a sample percentage of 10.

Performance Results

For TPC-DS, sampling at 10% reduces runtime of Analyze v2 by 4.6x and had no impact on query runtime.
For TPC-H, sampling at 10% reduces Analyze v2 time by 3.4x with no impact on query runtime.

Some Gotchas

  • The percentage specified when using TABLESAMPLE SYSTEM does not represent the percentage of rows that are scanned, but rather the percentage of the “splits”.
  • For very small tables, where the number of splits is less than 100, a percentage of the table that is actually scanned can be greater than the percentage specified. For example, if the Big SQL Scheduler determines that there there are only 5 splits for that table and the TABLESAMPLE SYSTEM(10) option is chosen, then 1 split is used as the sample. This means that instead of 10%, 20% of the table is actually scanned.
  • Subsequent ANALYZE commands could go against different “splits” – which means that the statistics gathered each time are not guaranteed to be exactly the same. Our internal testing found that this did not impact query performance, but this is something to consider if you are comparing statistics gathered for multiple ANALYZE…TABLESAMPLE SYSTEM commands.

[{"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSCRJT","label":"IBM Db2 Big SQL"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

UID

ibm16259933

Overview Annual report Corporate social responsibility Inclusion@IBM Financing Investor Newsroom Security, privacy & trust Senior leadership Careers with IBM Website Blog Publications Automotive Banking Consumer Good Energy Government Healthcare Insurance Life Sciences Manufacturing Retail Telecommunications Travel Our strategic partners Find a partner Become a partner - Partner Plus Partner Plus log in IBM TechXChange Community LinkedIn X Instagram YouTube Subscription Center Participate in user experience research Podcasts Contact IBM Privacy Terms of use Accessibility United States — English