Gathering statistics for Big SQL v4.1 using the ANALYZE command - Hadoop Dev

Technical Blog Post



Note: This article applies to Big SQL v4.1 and all prior versions. In Big SQL v4.2, significant improvements were made to the ANALYZE command.

Accurate and up-to-date statistics are the key ingredient when it comes to Big SQL query performance. Good statistics can improve query performance many-fold, whereas old, inaccurate, or non-existent statistics can have the opposite effect. This blog describes best practices for collecting these statistics, along with hints on how to fine-tune the collection process itself.

So why are statistics so important to Big SQL query performance? Because the decisions made by the cost-based optimizer are crucial in determining the performance of a query, and the optimizer makes those decisions based on the statistical information it has about the data being queried. Accurate and up-to-date statistics are therefore the single most important factor in ensuring optimal Big SQL performance. As such, statistics should be updated whenever:

  • a new table is populated with data, or
  • an existing table’s data undergoes significant changes:
    • new data is added,
    • old data is removed,
    • existing data is updated.

Gathering statistics using the ANALYZE command:
The ANALYZE command is used to gather both table and column level statistics on Big SQL tables. Unlike most other Big SQL commands, the ANALYZE command spawns a Map-Reduce job to read the table being analyzed, gather the appropriate statistics, and store this information in both the Big SQL and Hive catalogs. During query optimization, the optimizer reads the statistics from the Big SQL catalog and uses the information to generate an optimal execution plan.
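In its simplest form, the command names the table to analyze and the columns for which column level statistics should be gathered. For example (using the ORDERS table introduced later in this article):

ANALYZE TABLE SIMON.ORDERS
    COMPUTE STATISTICS
    FOR COLUMNS ORDERKEY, CUSTKEY, ORDERDATE;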

In order for the optimizer to have sufficient information to build efficient execution plans, statistics should be gathered for:

  1. Any table referenced in a query, and
  2. All columns which are referenced in predicates (including join predicates) and aggregation functions referenced in queries, and
  3. All columns referenced in informational constraints.

There is no need to gather statistics for columns which only appear in the SELECT list of a query.

Restricting ANALYZE to the columns that meet the criteria above is the most effective way to keep the cost of the command down.
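For example, consider the following query (the CUSTOMER table here is purely illustrative). Only the join and predicate columns need statistics, so the ANALYZE column lists can be trimmed accordingly:

-- ORDERKEY and TOTALPRICE appear only in the SELECT list: no stats needed
SELECT o.orderkey, o.totalprice
FROM simon.orders o
JOIN simon.customer c ON o.custkey = c.custkey   -- join predicate: CUSTKEY
WHERE o.orderdate >= '2015-01-01';               -- local predicate: ORDERDATE

ANALYZE TABLE SIMON.ORDERS COMPUTE STATISTICS FOR COLUMNS CUSTKEY, ORDERDATE;
ANALYZE TABLE SIMON.CUSTOMER COMPUTE STATISTICS FOR COLUMNS CUSTKEY;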

In addition, because Big SQL’s ANALYZE command also stores the statistics in the Hive catalog, there is no need to run Hive’s own ANALYZE command when you intend to query the same table from both Big SQL and Hive; the statistics are already in place.
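If you want to confirm that the statistics have indeed landed in the Hive catalog, a quick check from the Hive shell is DESCRIBE FORMATTED, which lists table parameters such as numRows and totalSize once statistics are present:

DESCRIBE FORMATTED simon.orders;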

Depending on the properties and number of columns analyzed, ANALYZE can be a relatively costly operation, often scanning large portions of the table in order to gather the statistics it needs. By default, ANALYZE collects table level statistics such as cardinality and the number and size of files, and column level statistics such as min and max values, column cardinality, distribution and histogram statistics, the number of nulls, and so on.

ANALYZE also collects frequency and histogram statistics on all specified columns. The number of buckets collected and stored for these distribution statistics can be controlled by the following properties:

  SET HADOOP PROPERTY biginsights.stats.hist.num=80;
  SET HADOOP PROPERTY biginsights.stats.mfv.num=80;

The example above sets the number of buckets to 80; the default is 100. If more detailed distribution statistics would benefit your queries, increase the number of buckets. If distribution statistics are not benefiting your queries, the number of buckets can be lowered, or set to 0 to disable collection of distribution statistics altogether. Collecting distribution statistics can be a costly and memory-intensive process, so lowering the number of buckets can reduce the elapsed time of the ANALYZE command, but it may also have a negative impact on query performance. In the majority of situations, the default of 100 is recommended.
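For example, the following sketch disables collection of distribution statistics altogether for subsequent ANALYZE commands in the session:

SET HADOOP PROPERTY biginsights.stats.hist.num=0;
SET HADOOP PROPERTY biginsights.stats.mfv.num=0;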

Column Group Statistics:
Data stored in different columns within your Big SQL tables may have implicit relationships, and if these columns are often referenced in query predicates then it would benefit the optimizer to be aware of such relationships.

For example, consider a table storing order information which contains the state and country of origin of each order:

CREATE HADOOP TABLE orders
(
    orderkey       int             not null,
    custkey        int             not null,
    orderstatus    char(1)                 ,
    totalprice     double          not null,
    orderdate      date            not null,
    orderpriority  char(1)         not null,
    state          varchar(20)             ,
    country        varchar(20)             ,
    comment        varchar(100)
)
STORED AS PARQUETFILE;

There is an implied relationship between the STATE and COUNTRY columns – for example, all orders which originated in the STATE of ‘California’ will have ‘USA’ for COUNTRY, and those orders originating in the STATE of ‘Victoria’ will have ‘CANADA’ or ‘AUSTRALIA’ for COUNTRY. By gathering column group statistics on STATE and COUNTRY, the Big SQL optimizer is able to infer this relationship and use the information to plan query execution more efficiently.

The following example collects table and column level statistics on the ORDERS table, along with column group statistics on STATE and COUNTRY (note that the COMMENT column has been omitted from the ANALYZE command since it is not referenced in query predicates):

ANALYZE TABLE SIMON.ORDERS
    COMPUTE STATISTICS
    FOR COLUMNS ORDERKEY, CUSTKEY, ORDERSTATUS, TOTALPRICE, ORDERDATE,
                ORDERPRIORITY, STATE, COUNTRY, (STATE, COUNTRY);

It is most important to collect column group statistics when such a relationship exists between multiple columns, and those columns are often used together as predicates in queries.
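To see why this matters, consider a query that filters on both columns. Without column group statistics, the optimizer multiplies the individual selectivities of the two predicates and underestimates the number of qualifying rows; with them, it recognizes the correlation:

SELECT orderkey, totalprice
FROM simon.orders
WHERE state = 'California'
  AND country = 'USA';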

Analyzing partitioned tables:
The first time a partitioned table is analyzed, all the partitions in the table will be read and statistics gathered for the specified columns. Subsequent ANALYZE statements will only collect statistics for partitions that have been added or changed since the last ANALYZE command was run – and may therefore be substantially quicker. This incremental feature can be specified using the “INCREMENTAL” keyword of the ANALYZE command and is also the default for partitioned tables.
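As a sketch, the keyword can also be named explicitly, even though it is the default for partitioned tables (the SALES table is hypothetical, and the placement of INCREMENTAL shown here is an assumption based on the description above):

ANALYZE TABLE SIMON.SALES
    COMPUTE STATISTICS INCREMENTAL
    FOR COLUMNS SALESDATE, STOREKEY, AMOUNT;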

When analyzing partitioned tables, Big SQL will spawn a number of Map-Reduce jobs. By default, each Map-Reduce job will analyze 100 partitions of the table. For example, if a table has 261 partitions, then ANALYZE will create 3 Map-Reduce jobs: the first two jobs will have 100 mappers (each mapper processing one partition), and the third will have 61 mappers. This batching is done because, depending on your system configuration and resources, trying to analyze too many partitions at once can cause Hive to fail (Big SQL uses Hive under the covers to process the ANALYZE statement).

The Map-Reduce jobs created by ANALYZE are executed sequentially. In our example above, the third MR job (with 61 mappers) will not be started until the second MR job has completed, and the second job will not be started until the first job has completed. This sequential mode of executing Map-Reduce jobs can slow down performance when analyzing partitioned tables. However, if there are sufficient free resources on your cluster, then the number of partitions each job processes can be increased using the biginsights.stats.partition.batch property. By increasing the value of this property, you reduce the number of serial Map-Reduce jobs ANALYZE creates and increase the number of mappers in each job (and hence increase the parallelism). The net effect is to reduce the elapsed time of the ANALYZE command. In our example above, if biginsights.stats.partition.batch=200, then only 2 jobs will be created: the first with 200 mappers, and the second with 61 mappers. In some cases, tuning this property has more than halved the execution time of ANALYZE against partitioned tables. The property can be set with the Big SQL SET HADOOP PROPERTY command as follows:

SET HADOOP PROPERTY biginsights.stats.partition.batch=200;

Other ANALYZE tuning guidelines:
Since the ANALYZE command spawns Map-Reduce jobs to collect the statistics, these jobs can be influenced and tuned using regular Map-Reduce techniques:

  • For large tables with many columns to analyze, the bigsql.analyze.java.opts property can be used to increase the JVM heap size of the Map-Reduce tasks created for the ANALYZE command.
    There may be situations in which the default 3 GB heap size is insufficient to collect all the required statistics; in these cases, the Map-Reduce job spawned by the ANALYZE command will fail with “Out of Memory” errors. The JVM heap size of the map and reduce tasks can then be increased, for example:

    SET HADOOP PROPERTY bigsql.analyze.java.opts="-Xmx4g"
  • Use mapreduce.input.fileinputformat.split.maxsize (mapred.max.split.size in Apache Hadoop 2.6 and below) and mapreduce.input.fileinputformat.split.minsize (mapred.min.split.size in Apache Hadoop 2.6 and below) to control the number of map & reduce tasks created to analyze the table. In particular, decreasing the value of mapreduce.input.fileinputformat.split.minsize can increase the number of mappers assigned to the Map-Reduce job spawned by ANALYZE – which may have a positive impact on the elapsed time of your ANALYZE command.
  • The hive.input.format property can also be used to control the number of mappers assigned to the ANALYZE Map-Reduce job. By default the property is set to “CombineHiveInputFormat”, which allows a single mapper to process several small input files; this is usually more efficient when dealing with smaller files. By setting this property to “org.apache.hadoop.hive.ql.io.HiveInputFormat”, the Map-Reduce job will create a single mapper for each individual file. In cases where the processing for each file is intensive (as can be the case with ANALYZE) this can be more efficient, particularly if your cluster also has many idle Map-Reduce slots. A combined sketch of these settings follows this list.
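Putting these guidelines together, a tuning session might look like the following sketch; the property values are illustrative and should be adjusted to the resources of your cluster:

-- Larger heap for the ANALYZE Map-Reduce tasks (illustrative value):
SET HADOOP PROPERTY bigsql.analyze.java.opts="-Xmx4g";
-- Smaller minimum split size to allow more mappers (32 MB here):
SET HADOOP PROPERTY mapreduce.input.fileinputformat.split.minsize=33554432;
-- One mapper per input file rather than combining small files:
SET HADOOP PROPERTY hive.input.format="org.apache.hadoop.hive.ql.io.HiveInputFormat";

ANALYZE TABLE SIMON.ORDERS COMPUTE STATISTICS FOR COLUMNS ORDERKEY, CUSTKEY, ORDERDATE;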

Queries to work out what stats you have:
The following queries can be used to find the tables and columns within the Big SQL database which have no statistics.
To identify those tables in a schema that have no statistics (card=-1):

select substr(tabname,1,20) as table_name, stats_time, card from syscat.tables where tabschema='SIMON' and card=-1;

And to identify the columns within a table that have no statistics (colcard=-1):

select substr(colname,1,20) as column_name, colcard from syscat.columns where tabschema='SIMON' and tabname='ORDERS' and colcard=-1;
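Since SYSCAT.TABLES also records when statistics were last collected, a variation on the first query can surface stale (rather than missing) statistics; tables that have never been analyzed sort first:

select substr(tabname,1,20) as table_name, stats_time, card from syscat.tables where tabschema='SIMON' order by stats_time asc nulls first;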

And finally!
One of the most common questions I’m asked regarding ANALYZE is how often should the statistics be updated – once a day, once a week, once a month? The answer is, “it depends”. There is no single rule that can be applied to all situations. But what I can say is:

  1. ANALYZE should be run at least once against all Big SQL tables.
  2. ANALYZE can be run more frequently against rapidly changing partitioned fact tables than against non-partitioned fact tables, because ANALYZE will only gather statistics on the new and changed partitions.
  3. For slowly changing dimension tables, once a month or once a year may be sufficient.

The frequency of running ANALYZE is something you will hone over time. Gathering too frequently wastes system resources with no benefit to query performance; gathering too infrequently may cause the optimizer to make poor planning decisions that degrade query performance.

Scheduling regular executions of ANALYZE during off-peak times is also a good way to limit the impact ANALYZE has on Big SQL transactions and other Map-Reduce workloads.

Additional information:
Visit the Knowledge Center for more details on the ANALYZE command.

To get started, download the IBM Open Platform with Apache Hadoop.

Visit the Knowledge Center for installation and configuration information.

[{"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSCRJT","label":"IBM Db2 Big SQL"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]
