Cumulative Statistics Overview
The BigInsights Big SQL ANALYZE command collects statistics, such as cardinality, number of distinct values, minimum value, maximum value, most frequent values, column value histograms, etc. The Big SQL query optimizer uses these statistics to produce plans that will efficiently process queries. Collecting more statistics on a table generally means that the optimizer can make better decisions about processing the query in the most efficient way.
The Cumulative Statistics feature of ANALYZE enables merging the statistics collected by the current ANALYZE command with any previously collected statistics on the table. This is especially useful when we want to collect statistics for a table in stages. For example, if we have a table with three columns and we want to collect statistics for all three columns, we can run three separate ANALYZE commands with each column individually, instead of running a single ANALYZE command with all three columns. Collecting statistics in this manner has benefits, such as:
- reusing previously collected statistics for better query processing performance, and
- breaking down the task of collecting statistics for many columns in one ANALYZE command into less resource intensive multiple ANALYZE commands.
This blog describes how to use the Cumulative Statistics feature of ANALYZE, along with the pros and cons of using it.
Cumulatively Collecting Statistics
First, Cumulative Statistics is a new feature added to Big SQL v4.2 and is only supported in ANALYZE v2. In Big SQL v4.2, ANALYZE v2 is enabled by default. If ANALYZE v2 is not enabled for any reason, we have to enable it by setting the biginsights.stats.use.v2
property to true
. We can set the value of this property either at the session level or system-wide. See Enabling or disabling ANALYZE v2 for how to do this.
Second, the default behavior of ANALYZE v2 is to keep the previously collected statistics – i.e. Cumulative Statistics is enabled by default. It is possible to change this default behavior — so that the user can choose to remove the previously collected statistics — by setting the biginsights.stats.cumulative
property to false,
either at the session level or system-wide. See Enabling or disabling Cumulative Statistics for how to do this.
How It Works
Now let’s consider that we have the following table storing information about orders:
CREATE HADOOP TABLE orders ( orderkey int not null, custkey int not null, orderstatus char(1) not null, totalprice double not null, orderdate date not null, orderpriority char(1) not null, state varchar(20) , country varchar(20) , comment varchar(100) ) STORED AS PARQUETFILE;
Furthermore, let’s consider that we are in a situation where we have collected statistics for the ORDERKEY, ORDERSTATUS, TOTALPRICE, ORDERDATE
and ORDERPRIORITY
columns, using the following ANALYZE command:
ANALYZE TABLE BIGSQL.ORDERS COMPUTE STATISTICS FOR COLUMNS ORDERKEY, ORDERSTATUS, TOTALPRICE, ORDERDATE, ORDERPRIORITY;
At this point, let’s consider we have some new queries, where we reference the CUSTKEY, STATE
and COUNTRY
columns. We would like to collect statistics for CUSTKEY
and also column group statistics for STATE
and COUNTRY
. We need these to help the Big SQL optimizer infer the relationship between the STATE
and COUNTRY
columns. This will aid in generating more efficient query plans for queries involving the STATE
and COUNTRY
columns. We would run the following ANALYZE command:
ANALYZE TABLE BIGSQL.ORDERS COMPUTE STATISTICS FOR COLUMNS CUSTKEY, (STATE, COUNTRY);
The Cumulative Statistics feature of ANALYZE will retain the statistics for the ORDERKEY, ORDERSTATUS, TOTALPRICE, ORDERDATE
and ORDERPRIORITY
columns, which were collected using the first ANALYZE command, after running the second ANALYZE command. If we check which columns do not have statistics at this point, we will only see the COMMENT
column. We can use the following query to check this:
SELECT COLCARD, COLNAME FROM SYSSTAT.COLUMNS WHERE TABSCHEMA='BIGSQL' AND TABNAME='ORDERS' AND COLCARD=-1; COLCARD COLNAME -------------------- -------------------- -1 COMMENT 1 record(s) selected.
The second ANALYZE command has collected the statistics for the CUSTKEY
, STATE
and COUNTRY
columns (STATE
and COUNTRY
as part of the column group statistics collection) and merged the previously collected statistics for the ORDERKEY, ORDERSTATUS, TOTALPRICE, ORDERDATE
and ORDERPRIORITY
columns.
Pros and Cons of Cumulative Statistics
As we have seen in the example above, Cumulative Statistics enables collecting statistics in stages. The main advantages of this include:
- Preserving previously collected statistics
- Breaking down resource intensive ANALYZE commands into multiple, less resource intensive, ANALYZE commands
Collecting more statistics will generally lead to better query processing performance. Therefore, if we have previously collected some statistics, keeping them will generally lead to better query processing performance. Furthermore, it may get difficult to manage a resource intensive ANALYZE command, with many columns for example, in terms of ANALYZE execution time. In this scenario, Cumulative Statistics enables breaking down the ANALYZE command with many columns into multiple less-resource-intensive and more manageable ANALYZE commands. As Cumulative Statistics will merge the previously collected statistics with the statistics collected by the current ANALYZE, the overall result of these multiple ANALYZEs and the single ANALYZE with many columns is the collection of the same statistics.
If the table data has significantly changed between ANALYZE commands, then previously collected statistics might become inconsistent with respect to the table data. If this happens, the previously collected statistics will be discarded. Therefore, we should consider not changing the table data significantly between ANALYZE commands when we are using the Cumulative Statistics feature.
Additional information
Visit the Knowledge Center for more details on the ANALYZE command.
To get started, download the IBM Open Platform with Apache Hadoop.
Visit the Knowledge Center for installation and configuration information.