A Tour of ANALYZE in Big SQL v4.2

Introduction

Accurate and up-to-date statistics are the key ingredient when it comes to Big SQL query performance. Good statistics can improve query performance manyfold, whereas old, inaccurate or non-existent statistics can have the opposite effect. The following article takes a tour of the many enhancements to Analyze in gathering statistics in Big SQL v4.2.

The ANALYZE command is used to gather both table and column level statistics on Big SQL tables. Prior to Big SQL v4.2, the ANALYZE command spawned a MapReduce job, which read the table being analyzed, gathered the appropriate statistics and stored the information in the Big SQL catalog. The ANALYZE command has been completely overhauled in Big SQL v4.2; it no longer creates a MapReduce job, and all processing to gather and update statistics is done within the Big SQL engine itself. This greatly simplifies statistics gathering, since there is now no need to tune and monitor the MapReduce jobs created by ANALYZE.

Tip: If you wish to return to the MapReduce method, use the following command before invoking Analyze – but be warned, the MapReduce method is deprecated and likely to be removed in future releases of Big SQL:

SET HADOOP PROPERTY biginsights.stats.use.v2=false

Big SQL v4.2 also introduces autonomic smarts, which automatically gather statistics when:

a LOAD HADOOP statement has completed, or
an HCAT_SYNC_OBJECTS procedure has completed, or
Big SQL detects a table’s statistics are out of date.

The remainder of the article will explore these new features in more detail.

Gathering statistics explicitly using ANALYZE

Use the ANALYZE statement to gather table and column level statistics for a Big SQL table. For example, to gather statistics for the WEATHER table, along with columns date, location, temp, rain_mm and windspeed_kmh, use:

ANALYZE TABLE SIMON.WEATHER COMPUTE STATISTICS FOR COLUMNS DATE, LOCATION, TEMP, RAIN_MM, WINDSPEED_KMH;

Depending on the properties and number of columns analyzed, this can be a resource intensive operation – often scanning large portions of the table in order to gather the statistics it needs. By default, ANALYZE collects table level statistics, such as cardinality, number and size of files, and column level statistics such as min and max values, column cardinality, distribution and histogram statistics, number of nulls, etc.

Using Sampling to improve efficiency

Sampling has been introduced in Big SQL v4.2 to improve the efficiency of the ANALYZE statement. The TABLESAMPLE clause of the ANALYZE statement can be used to specify the percentage of table data to sample when gathering statistics. For example, the following statement will use a 10% sample of data in the WEATHER table:

ANALYZE TABLE SIMON.WEATHER COMPUTE STATISTICS FOR COLUMNS DATE, LOCATION, TEMP, RAIN_MM, WINDSPEED_KMH TABLESAMPLE SYSTEM (10);

Performance tests have shown sampling 10% of the data can improve performance of ANALYZE by as much as 5x compared to no sampling. In lab tests, a 10% sample still maintained sufficient accuracy of the statistics and did not adversely impact query performance. The Big SQL Analyze – Improve Analyze performance with Sampling article has more details on Sampling.

The `ALL COLUMNS` clause

Another new feature of ANALYZE in v4.2 is support for the ALL COLUMNS clause. For example, to gather statistics for all columns in the WEATHER table using a 10% sampling rate:

ANALYZE TABLE SIMON.WEATHER COMPUTE STATISTICS FOR ALL COLUMNS TABLESAMPLE SYSTEM (10);

When the ALL COLUMNS clause is used on a partitioned table, column group statistics are automatically collected on the partitioning columns. Although this clause provides convenience, if a table has many columns and only a small set actually require statistics, then specifying ALL COLUMNS will result in unnecessary statistics being gathered, and increased resource consumption during their collection.

Cumulative Statistics

Also new in Big SQL v4.2 is the cumulative collection of statistics. For example, if columns {A,B,C} of table T are analyzed:

ANALYZE TABLE T COMPUTE STATISTICS FOR COLUMNS A, B, C TABLESAMPLE SYSTEM (10);

Then sometime later it is deemed necessary to collect statistics for columns {D,E}:

ANALYZE TABLE T COMPUTE STATISTICS FOR COLUMNS D, E TABLESAMPLE SYSTEM (10);

Following these two ANALYZE statements, table T will have statistics for columns {A,B,C,D,E}. Of course, if there have been significant changes in the data in table T since the first ANALYZE, then it would be wise to re-collect statistics for all columns. The Big SQL Analyze in Stages: Cumulative Statistics article has more information on Cumulative Statistics.

Auto-Analyze after LOAD

Following a LOAD HADOOP command, Auto-Analyze will automatically kick-off an Analyze statement for the table just loaded. All columns for the table just loaded will be analyzed, with a sampling rate of 10%. So a Load in Big SQL v4.2 will take longer than previous versions, because it is also gathering statistics for the table.

The behavior of Auto-Analyze following a Load can be customized using the biginsights.stats.auto.analyze.post.load configuration property. Valid values for this property are:

ONCE – only execute Auto-Analyze on this table once. This is the default.
DEFERRED – instead of immediately executing ANALYZE following the Load, add the table to the Auto-Analyze queue (similar in behaviour to HCAT_SYNC_OBJECTS).
ALWAYS – always follow the Load command with an Analyze
NEVER – effectively disables Auto-Analyze after Load

biginsights.stats.auto.analyze.post.load can be set at the session level using the SET HADOOP PROPERTY command, or globally by editing the bigsql-conf.xml file.

Auto-Analyze after HCAT_SYNC_OBJECTS

Each table synchronized via the HCAT_SYNC_OBJECT command will be automatically added to the Auto-Analyze queue. At pre-defined intervals, Auto-Analyze will wake up and take a table from the queue to Analyze. All columns of the synchronized table will be analyzed, with a sampling rate of 10%.

The behavior of Auto-Analyze following object synchronization can be controlled using the biginsights.stats.auto.analyze.post.syncobj configuration property. Valid values for this property are:

DEFERRED – add the table to the Auto-Analyze queue. This is the default.
NEVER – effectively disables Auto-Analyze after synchronization
COPYHIVE – the statistics are copied into Big SQL from the Hive catalog. This is a low cost alternative to computing the statistics within Big SQL, but is likely to provide less accurate statistics. Hive by default will only collects table statistics; the table must be explicitly analyzed in Hive for column level statistics to be collected. In addition, Hive does not collect all the detailed stats regularly used by Big SQL.

The biginsights.stats.auto.analyze.post.syncobj property can be set in a similar fashion to biginsights.stats.auto.analyze.post.load.

Auto-Analyze

Auto-Analyze is another new feature in Big SQL v4.2, designed to greatly simplify the process of collecting statistics. If a significant amount of data is added to a table, or the table does not have any statistics, then Auto-Analyze will schedule execution of an ANALYZE task for that table by placing the table onto the Auto-Analyze queue.

The queue is serviced by the Administrative Task Scheduler (ATS), which by default causes Auto-Analyze to wake up every 10 minutes, select a table from the queue, and Analyze that table. Once this is complete, Auto-Analyze goes back to sleep until it is reawakened by the ATS.

Auto-Analyze schedules can be customized. For example a daytime schedule can be created to execute a single Analyze statement per cycle during the hours of 8AM to 6PM, and a nighttime schedule to execute 5 Analyze statements per cycle from 6PM to 8AM.

Auto-Analyze will collect statistics for the table and all columns, using a sampling rate of 10%. For example, if Auto-Analyze was analyzing the WEATHER table, it would execute the following:

ANALYZE TABLE SIMON.WEATHER COMPUTE STATISTICS FOR ALL COLUMNS TABLESAMPLE SYSTEM (10);

Auto-Analyze will also internally collect column group statistics for all the partitioning columns of a partitioned table. So if the WEATHER table is partitioned on (DATE, LOCATION), then Auto-Analyze will issue the same statement as above, but also collect column group statistics for (DATE, LOCATION).

Use the BIGSQL_AUTO_ANALYZE_STATUS stored procedure to check the status of running and completed Auto-Analyze tasks:

CALL SYSHADOOP.BIGSQL_AUTO_ANALYZE_STATUS(NULL, NULL);

A failed task is indicated by a non-zero SQLCODE and a running task is indicated by a NULL in STATUS and END_TIME.

The Faster BigInsights Big SQL queries with Auto-Analyze article has more details on Auto-Analyze.

Additional information

Visit the Knowledge Center for more details on the ANALYZE command.
To get started, download the IBM Open Platform with Apache Hadoop.
Visit the Knowledge Center for installation and configuration information.

Tips

A Tour of ANALYZE in Big SQL v4.2 - Hadoop Dev

Technical Blog Post

Abstract

Body

Introduction

Gathering statistics explicitly using ANALYZE

Using Sampling to improve efficiency

The `ALL COLUMNS` clause

Cumulative Statistics

Auto-Analyze after LOAD

Auto-Analyze after HCAT_SYNC_OBJECTS

Auto-Analyze

Additional information

UID

Share your feedback

Need support?

Tips

A Tour of ANALYZE in Big SQL v4.2 - Hadoop Dev

Technical Blog Post

Abstract

Body

Introduction

Gathering statistics explicitly using ANALYZE

Using Sampling to improve efficiency

The ALL COLUMNS clause

Cumulative Statistics

Auto-Analyze after LOAD

Auto-Analyze after HCAT_SYNC_OBJECTS

Auto-Analyze

Additional information

UID

Share your feedback

Need support?

The `ALL COLUMNS` clause