Sampling Overview
Invoking the BigInsights Big SQL ANALYZE command typically scans the entire table and gathers statistics to feed to the Big SQL Optimizer. A lack of statistics is one major reason for inefficient access plans, but scanning an entire table, especially a very large one, can take a long time. With the Analyze Sampling feature introduced in Big SQL 4.2, a sample of the table can be scanned instead, which can reduce the time taken to run Analyze tremendously. You specify the percentage of the table that should be scanned using the TABLESAMPLE SYSTEM option of the ANALYZE command, as shown in the example below:
ANALYZE TABLE schema.table COMPUTE STATISTICS FOR COLUMNS col1, col2, col3 TABLESAMPLE SYSTEM(10);
How Analyze Sampling Works
The Big SQL Scheduler, by design, divides a table into “splits”. If TABLESAMPLE SYSTEM(10) is specified and the Big SQL Scheduler determines that the table has 500 splits, then 50 splits are used as the sample during ANALYZE. Analyze gathers statistics on these sampled splits and extrapolates full-table statistics (e.g. table cardinality) from them. Some statistics, such as the number of files and partitions in the table, cannot be meaningfully extrapolated from a sample; these are always gathered on the entire table.
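As a rough sketch of the arithmetic involved (the exact rounding the Big SQL Scheduler applies is not documented here; this assumes the sample size is rounded up to at least one whole split):

```python
import math

def sampled_split_count(total_splits, sample_percent):
    """Estimate how many splits ANALYZE samples.

    Assumption: the sample is the requested percentage of the
    split count, rounded up, with a minimum of one split. The
    rounding Big SQL actually uses may differ.
    """
    return max(1, math.ceil(total_splits * sample_percent / 100))

# A 500-split table with TABLESAMPLE SYSTEM(10)
print(sampled_split_count(500, 10))  # 50
```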
When new partitions are added to a table that has already been analyzed, ANALYZE scans only the partitions that were never scanned before; this is referred to as incremental analyze. Note that when the TABLESAMPLE option is specified, an incremental analyze scans splits from the previously analyzed partitions as well as from the newly added ones. Because only a sample of the data is scanned, this should not have any noticeable impact on analyze performance.
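For example, re-running the same command after new partitions have been added triggers an incremental analyze that still samples across the whole table (schema, table, and column names below are placeholders):

```sql
-- Initial statistics collection with a 10% sample
ANALYZE TABLE sales.transactions COMPUTE STATISTICS
    FOR COLUMNS trans_id, trans_date, amount TABLESAMPLE SYSTEM(10);

-- After new partitions are added, the same command samples splits
-- from both the previously analyzed and the newly added partitions
ANALYZE TABLE sales.transactions COMPUTE STATISTICS
    FOR COLUMNS trans_id, trans_date, amount TABLESAMPLE SYSTEM(10);
```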
Performance Implications of Analyze Sampling
The overall objective is to reduce the time it takes to run Analyze on tables without regressing query performance. If you choose a percentage that is too small, Analyze runs quickly, but smaller samples increase the inaccuracy of the statistics, so some queries may perform better with a larger percentage. In our internal studies using the TPC-DS benchmark workload, a sample percentage of 10 gave almost the same query performance as not sampling at all, while improving Analyze processing time by 5x. For this reason, all automatic ANALYZE features (Auto Analyze, Analyze after Load, and Analyze after HCAT_SYNC_OBJECTS) use a sample percentage of 10.
Performance Results
For TPC-DS, sampling at 10% reduced the runtime of Analyze v2 by 4.6x and had no impact on query runtime.
For TPC-H, sampling at 10% reduced Analyze v2 time by 3.4x, with no impact on query runtime.
Some Gotchas
- The percentage specified with TABLESAMPLE SYSTEM does not represent the percentage of rows that are scanned, but rather the percentage of the table's “splits”.
- For very small tables, where the number of splits is less than 100, the percentage of the table that is actually scanned can be greater than the percentage specified. For example, if the Big SQL Scheduler determines that there are only 5 splits for a table and the TABLESAMPLE SYSTEM(10) option is chosen, then 1 split is used as the sample. This means that 20% of the table is actually scanned instead of 10%.
- Subsequent ANALYZE commands may sample different “splits”, which means the statistics gathered each time are not guaranteed to be exactly the same. Our internal testing found that this did not impact query performance, but it is something to consider if you are comparing statistics gathered by multiple ANALYZE…TABLESAMPLE SYSTEM commands.
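The small-table gotcha can be made concrete with a short sketch. It assumes the scheduler rounds the sample up to at least one whole split (Big SQL's actual rounding may differ), and computes the effective percentage of the table scanned:

```python
import math

def effective_sample_percent(total_splits, requested_percent):
    """Effective percentage of the table scanned, assuming the
    sample is rounded up to at least one whole split (the exact
    rounding Big SQL uses may differ).
    """
    sampled = max(1, math.ceil(total_splits * requested_percent / 100))
    return 100 * sampled / total_splits

# A 5-split table with TABLESAMPLE SYSTEM(10): 1 split is sampled,
# so 20% of the table is actually scanned, not 10%
print(effective_sample_percent(5, 10))  # 20.0
```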