Big SQL Analyze in Stages: Cumulative Statistics

Cumulative Statistics Overview

The BigInsights Big SQL ANALYZE command collects statistics, such as cardinality, number of distinct values, minimum value, maximum value, most frequent values, column value histograms, etc. The Big SQL query optimizer uses these statistics to produce plans that will efficiently process queries. Collecting more statistics on a table generally means that the optimizer can make better decisions about processing the query in the most efficient way.

The Cumulative Statistics feature of ANALYZE enables merging the statistics collected by the current ANALYZE command with any previously collected statistics on the table. This is especially useful when we want to collect statistics for a table in stages. For example, if we have a table with three columns and we want to collect statistics for all three columns, we can run three separate ANALYZE commands with each column individually, instead of running a single ANALYZE command with all three columns. Collecting statistics in this manner has benefits, such as:

reusing previously collected statistics for better query processing performance, and
breaking down the task of collecting statistics for many columns in one ANALYZE command into less resource intensive multiple ANALYZE commands.

This blog describes how to use the Cumulative Statistics feature of ANALYZE, along with the pros and cons of using it.

Cumulatively Collecting Statistics

First, Cumulative Statistics is a new feature added to Big SQL v4.2 and is only supported in ANALYZE v2. In Big SQL v4.2, ANALYZE v2 is enabled by default. If ANALYZE v2 is not enabled for any reason, we have to enable it by setting the biginsights.stats.use.v2 property to true. We can set the value of this property either at the session level or system-wide. See Enabling or disabling ANALYZE v2 for how to do this.

Second, the default behavior of ANALYZE v2 is to keep the previously collected statistics – i.e. Cumulative Statistics is enabled by default. It is possible to change this default behavior — so that the user can choose to remove the previously collected statistics — by setting the biginsights.stats.cumulative property to false, either at the session level or system-wide. See Enabling or disabling Cumulative Statistics for how to do this.

How It Works

Now let’s consider that we have the following table storing information about orders:

CREATE HADOOP TABLE orders  (      orderkey       int             not null,      custkey        int             not null,      orderstatus    char(1)         not null,      totalprice     double          not null,      orderdate      date            not null,      orderpriority  char(1)         not null,      state          varchar(20)             ,      country        varchar(20)             ,      comment        varchar(100)  )  STORED AS PARQUETFILE;

Furthermore, let’s consider that we are in a situation where we have collected statistics for the ORDERKEY, ORDERSTATUS, TOTALPRICE, ORDERDATE and ORDERPRIORITY columns, using the following ANALYZE command:

ANALYZE TABLE BIGSQL.ORDERS COMPUTE STATISTICS FOR COLUMNS ORDERKEY, ORDERSTATUS, TOTALPRICE, ORDERDATE, ORDERPRIORITY;

At this point, let’s consider we have some new queries, where we reference the CUSTKEY, STATE and COUNTRY columns. We would like to collect statistics for CUSTKEY and also column group statistics for STATE and COUNTRY. We need these to help the Big SQL optimizer infer the relationship between the STATE and COUNTRY columns. This will aid in generating more efficient query plans for queries involving the STATE and COUNTRY columns. We would run the following ANALYZE command:

ANALYZE TABLE BIGSQL.ORDERS COMPUTE STATISTICS FOR COLUMNS CUSTKEY, (STATE, COUNTRY);

The Cumulative Statistics feature of ANALYZE will retain the statistics for the ORDERKEY, ORDERSTATUS, TOTALPRICE, ORDERDATE and ORDERPRIORITY columns, which were collected using the first ANALYZE command, after running the second ANALYZE command. If we check which columns do not have statistics at this point, we will only see the COMMENT column. We can use the following query to check this:

SELECT COLCARD, COLNAME FROM SYSSTAT.COLUMNS WHERE TABSCHEMA='BIGSQL' AND TABNAME='ORDERS' AND COLCARD=-1;    COLCARD              COLNAME               -------------------- --------------------                    -1 COMMENT      1 record(s) selected.

The second ANALYZE command has collected the statistics for the CUSTKEY, STATE and COUNTRY columns (STATE and COUNTRY as part of the column group statistics collection) and merged the previously collected statistics for the ORDERKEY, ORDERSTATUS, TOTALPRICE, ORDERDATE and ORDERPRIORITY columns.

Pros and Cons of Cumulative Statistics

As we have seen in the example above, Cumulative Statistics enables collecting statistics in stages. The main advantages of this include:

Preserving previously collected statistics
Breaking down resource intensive ANALYZE commands into multiple, less resource intensive, ANALYZE commands

Collecting more statistics will generally lead to better query processing performance. Therefore, if we have previously collected some statistics, keeping them will generally lead to better query processing performance. Furthermore, it may get difficult to manage a resource intensive ANALYZE command, with many columns for example, in terms of ANALYZE execution time. In this scenario, Cumulative Statistics enables breaking down the ANALYZE command with many columns into multiple less-resource-intensive and more manageable ANALYZE commands. As Cumulative Statistics will merge the previously collected statistics with the statistics collected by the current ANALYZE, the overall result of these multiple ANALYZEs and the single ANALYZE with many columns is the collection of the same statistics.

If the table data has significantly changed between ANALYZE commands, then previously collected statistics might become inconsistent with respect to the table data. If this happens, the previously collected statistics will be discarded. Therefore, we should consider not changing the table data significantly between ANALYZE commands when we are using the Cumulative Statistics feature.

Additional information

Visit the Knowledge Center for more details on the ANALYZE command.
To get started, download the IBM Open Platform with Apache Hadoop.
Visit the Knowledge Center for installation and configuration information.

IBM Support

Tips

Big SQL Analyze in Stages: Cumulative Statistics - Hadoop Dev

Technical Blog Post

Abstract

Body

Cumulative Statistics Overview

Cumulatively Collecting Statistics

How It Works

Pros and Cons of Cumulative Statistics

Additional information

UID

Share your feedback

Need support?