When to collect statistics:

The decisions made by the cost-based optimizer are crucial to the performance of a Big SQL query, and the optimizer bases those decisions on the statistical information it has about the data being queried. Accurate and up-to-date statistics are therefore the single most important factor in achieving optimal Big SQL performance. As such, statistics should be collected whenever:

  1. a new table is populated with data,
  2. an existing table’s data undergoes significant changes:
    • new data added,
    • old data removed,
    • existing data is updated
  3. a table is synchronized from Hive using the HCAT_SYNC_OBJECTS stored procedure
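
For example, after synchronizing a Hive table into Big SQL, statistics can be collected straight away. The following sketch assumes the SIMON.WEATHER table used throughout this article; the HCAT_SYNC_OBJECTS arguments shown ('a' for all object types, 'REPLACE', 'CONTINUE') are illustrative, so check the procedure's documentation for the options appropriate to your environment:

CALL SYSHADOOP.HCAT_SYNC_OBJECTS('SIMON', 'WEATHER', 'a', 'REPLACE', 'CONTINUE');
ANALYZE TABLE SIMON.WEATHER COMPUTE STATISTICS FOR ALL COLUMNS;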

Which statistics to collect:

In order for the optimizer to have sufficient information to build efficient execution plans, statistics should be gathered for:

  1. Any table referenced in a query, and
  2. All columns referenced in predicates (including join predicates) or in aggregation functions within queries, and
  3. All columns referenced in informational constraints, and
  4. Column group statistics for columns with implied relationships, and
  5. For partitioned tables, column group statistics on the partitioning columns.

There is no need to gather statistics for columns which only appear in the SELECT list of a query.
The Analyze statement is used to collect statistics in Big SQL. By default, Analyze collects table-level statistics, such as cardinality and the number and size of files, and column-level statistics, such as minimum and maximum values, column cardinality, distribution and histogram statistics, and the number of nulls.

How to collect statistics:

1. Table and column statistics:

According to the rules above, the following Analyze statement will collect statistics for the WEATHER table, including columns STATION_ID, DATE, COUNTRY, CITY, TEMP, RAIN_MM, WINDSPEED_KMH:

ANALYZE TABLE SIMON.WEATHER COMPUTE STATISTICS FOR COLUMNS STATION_ID, DATE, COUNTRY, CITY, TEMP, RAIN_MM, WINDSPEED_KMH;

2. Use sampling to reduce Analyze time:

Use a 10% sampling rate whenever possible to improve the efficiency of the Analyze statement. Internal lab tests have shown that in most cases accuracy of the statistics is not impacted, but performance of Analyze can improve by as much as 5x compared to no sampling.

ANALYZE TABLE SIMON.WEATHER COMPUTE STATISTICS FOR COLUMNS STATION_ID, DATE, COUNTRY, CITY, TEMP, RAIN_MM, WINDSPEED_KMH 
TABLESAMPLE SYSTEM (10);

3. Collect statistics for columns referenced in informational constraints:

Assuming the WEATHER table has the following informational constraint defined as a primary key:
ALTER TABLE SIMON.WEATHER ADD PRIMARY KEY (STATION_ID, DATE) NOT ENFORCED;

Then column group statistics should be collected on STATION_ID and DATE:

ANALYZE TABLE SIMON.WEATHER COMPUTE STATISTICS FOR COLUMNS STATION_ID, DATE, COUNTRY, CITY, TEMP, RAIN_MM, WINDSPEED_KMH, 
(STATION_ID, DATE)
TABLESAMPLE SYSTEM (10);

4. Collect column group statistics for partitioning columns:

Collecting column group statistics on the partitioning columns of a table will improve the accuracy of the optimizer’s filter factors for predicates involving these columns. If the WEATHER table is partitioned on DATE, COUNTRY and CITY, then to collect column group statistics on these columns:

ANALYZE TABLE SIMON.WEATHER COMPUTE STATISTICS FOR COLUMNS
STATION_ID, DATE, COUNTRY, CITY, TEMP, RAIN_MM, WINDSPEED_KMH,
(STATION_ID, DATE), (DATE, COUNTRY, CITY) 
TABLESAMPLE SYSTEM (10);

When the ALL COLUMNS clause of the Analyze statement is used, or the table is analyzed by Auto-Analyze, column group statistics are automatically collected for the partitioning columns of the table. However, when specifying a column list in the Analyze statement, the partitioning columns must be explicitly included.

5. Collect column group statistics for columns with implied relationships:

If there is an implied relationship between columns in a table, and these columns are often used in query predicates, then collecting column group statistics on the related columns will provide the optimizer with valuable information about that relationship.
For example, in the WEATHER table there is an implied relationship between the CITY and COUNTRY columns. If CITY is ‘San Francisco’, then COUNTRY will be ‘USA’; if CITY is ‘Melbourne’, then COUNTRY will be either ‘USA’ or ‘AUSTRALIA’. Column group statistics on CITY and COUNTRY help the Big SQL optimizer infer this relationship and use it to plan the execution of queries more efficiently.
To collect column group statistics on COUNTRY and CITY:

ANALYZE TABLE SIMON.WEATHER COMPUTE STATISTICS FOR COLUMNS
STATION_ID, DATE, COUNTRY, CITY, TEMP, RAIN_MM, WINDSPEED_KMH,
(STATION_ID, DATE), (DATE, COUNTRY, CITY), (COUNTRY, CITY)
TABLESAMPLE SYSTEM (10);

6. Customize Auto-Analyze schedules to suit your environment:

Create multiple Auto-Analyze schedules to suit your work environment. The default schedule analyzes a single table every 10-minute cycle, throughout a 24-hour period. This schedule is designed to have minimal impact on Big SQL workloads. However, multiple schedules can be defined in which the number of tables to analyze per cycle and the duration of the cycles are customized.
For example, if the majority of queries are executed during an 8AM to 6PM working day, and the cluster is relatively idle the rest of the time, then creating the following two schedules maximizes the throughput of Auto-Analyze during the quiet period:

  1. Day schedule, which analyzes a single table every 10 mins between 08:00 and 17:59:
    CALL SYSPROC.ADMIN_TASK_ADD('BIGSQL_CHECK_ANALYZE DAY',NULL, NULL, NULL, '0,10,20,30,40,50 8-17 * * *',  'SYSHADOOP','BIGSQL_CHECK_ANALYZE','VALUES(1)',NULL,NULL);
  2. Night schedule, which analyzes 5 tables every 15-minute cycle during the hours of 18:00 to 07:59 the next day:
    CALL SYSPROC.ADMIN_TASK_ADD('BIGSQL_CHECK_ANALYZE NIGHT',NULL, NULL, NULL, '0,15,30,45 0-7,18-23 * * *',  'SYSHADOOP','BIGSQL_CHECK_ANALYZE','VALUES(5)',NULL,NULL);

7. Do not use the ALL COLUMNS clause on wide tables:

The ALL COLUMNS clause is a convenient way to collect statistics on all columns in a table without listing each column individually. For the WEATHER table the Analyze statement specified in Section 5 can be abbreviated to:

ANALYZE TABLE SIMON.WEATHER COMPUTE STATISTICS FOR ALL COLUMNS
(DATE, COUNTRY, CITY), (COUNTRY, CITY) 
TABLESAMPLE SYSTEM (10);

However, if the WEATHER table is very wide (say, with more than 100 columns), this abbreviated form of the Analyze statement will gather statistics on many more columns than is actually necessary to achieve optimal query performance, and the Analyze statement itself will consume more resources and take longer to execute. Therefore, for wide tables (those with many columns), avoid the ALL COLUMNS clause and issue the Analyze statement with an explicit list of the columns to collect statistics on.

8. Make use of Cumulative Statistics:

Cumulative Statistics allow the user to collect statistics on additional columns without having to re-collect statistics on columns that already have them. For example, suppose we now want to collect statistics on columns TEMP_MIN and TEMP_MAX of the WEATHER table. Assuming there have been no significant changes in the data since the last Analyze:

ANALYZE TABLE SIMON.WEATHER COMPUTE STATISTICS FOR COLUMNS
TEMP_MIN, TEMP_MAX
TABLESAMPLE SYSTEM (10);

If there have been significant changes to the data since the original stats were collected, then it is recommended to re-collect stats on all appropriate columns:

ANALYZE TABLE SIMON.WEATHER COMPUTE STATISTICS FOR COLUMNS
STATION_ID, DATE, COUNTRY, CITY, TEMP, RAIN_MM, WINDSPEED_KMH, TEMP_MIN, TEMP_MAX,
(STATION_ID, DATE), (DATE, COUNTRY, CITY), (COUNTRY, CITY)
TABLESAMPLE SYSTEM (10);

9. Analyzing partitioned tables:

The first time a partitioned table is analyzed, all the partitions in the table will be read and statistics gathered for the specified columns. Subsequent Analyze statements will only collect statistics for partitions that have been added or changed since the last Analyze statement was run – and may therefore be substantially faster. This incremental feature can be specified using the INCREMENTAL keyword of the Analyze statement and is also the default for partitioned tables.
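
As a sketch, the incremental behavior can be written out explicitly. Note that the placement of the INCREMENTAL keyword shown here is an assumption (and, being the default for partitioned tables, the keyword is redundant), so verify the exact syntax against the ANALYZE reference before relying on it:

ANALYZE TABLE SIMON.WEATHER COMPUTE STATISTICS INCREMENTAL FOR COLUMNS
STATION_ID, DATE, COUNTRY, CITY, TEMP, RAIN_MM, WINDSPEED_KMH
TABLESAMPLE SYSTEM (10);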

10. Keep track of tables which do not have statistics:

The following SQL identifies all tables in schema SIMON which do not have statistics:

SELECT  SUBSTR(tabname,1,20) AS table_name, stats_time, card 
  FROM  syscat.tables 
  WHERE tabschema='SIMON' 
  AND   card=-1;

To identify the columns in the SIMON.WEATHER table that have no statistics (colcard=-1):

SELECT  SUBSTR(colname,1,20) AS column_name, colcard 
  FROM  syscat.columns 
  WHERE tabschema='SIMON' 
  AND   tabname='WEATHER' 
  AND   colcard=-1;
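
Since the STATS_TIME column of SYSCAT.TABLES records when statistics were last collected, a similar query can flag tables whose statistics may be stale. The 30-day threshold below is arbitrary and should be tuned to how frequently your data changes:

SELECT  SUBSTR(tabname,1,20) AS table_name, stats_time 
  FROM  syscat.tables 
  WHERE tabschema='SIMON' 
  AND   stats_time < CURRENT TIMESTAMP - 30 DAYS;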

Additional information:

For a tour of all the great new features of Analyze in Big SQL v4.2, see A Tour of ANALYZE in Big SQL v4.2.

Visit the Knowledge Center for more details on the ANALYZE command.
To get started, download the IBM Open Platform with Apache Hadoop.
Visit the Knowledge Center for installation and configuration information.
