Support My IBM Log in

IBM Support

Big SQL 4.2 Performance - Overview of Partition on Column Expression Feature - Hadoop Dev

Technical Blog Post


Abstract

Big SQL 4.2 Performance - Overview of Partition on Column Expression Feature - Hadoop Dev

Body

Feature Overview

New in Big SQL 4.2, is the capability to allow an expression to be specified in the CREATE HADOOP TABLE statement with the PARTITIONED BY clause. This is a unique feature which allows more flexibility to table partitioning.

Partitioned tables can improve performance, however, it is recommended to avoid partitioning on columns that have many distinct values. Partitioning on such columns would produce lots of HDFS directories and the cost of scanning the directories can eventually become more expensive than a full table scan. But what if you have queries on these columns? If you can not partition on this column, each scan on HDFS could result in a full table scan, which is not very efficient. This is one of the use cases of the partition on column expression feature, you can partition on an expression on a column with highly distinct values.

Another use case of this feature is for time based queries .e.g. you can partition on an expression on a timestamp column, for more information on such transformations look at this post.

Consider the following example:

    Example 1:  CREATE HADOOP TABLE INVENTORY (  trans_id int, product varchar(50), trans_ts timestamp  )    

This table has highly distinct and in some cases unique values so it is not recommended to partition on any of these columns. However, with Big SQL 4.2, you can partition on a column expression on any one of these columns. Consider the alternative in the example below, the new syntax is shown in green.

    Example 1-2:  CREATE HADOOP TABLE INVENTORY_A (  trans_id int, product varchar(50), trans_ts timestamp  )   PARTITIONED BY (   INT(trans_id/10000) AS trans_part,  YEAR(trans_ts) AS year_part   )    

The equivalent table in Hive would look like:

    hive> describe table bigsql.inventory_a  trans_id                int  product                 varchar(50)  trans_ts                timestamp    # Partition Information  # col_name              data_type               trans_part              int  year_part               int    

Data Ingest

You can use the INSERT..SELECT statement to automatically generate values for the new partitioning column if data is already in an existing table. For example, if the data is in the INVENTORY table you can use the INSERT…SELECT statement to automatically generate the values and insert the data into the INVENTORY_A table. The LOAD HADOOP statement does not generate the column values automatically therefore do not use LOAD HADOOP statement if the data is not generated ahead of time. For more information on data ingest using this feature have a look at this post.

Query Execution and Automatic Generation of Predicates

The coolest thing about this feature is automatic predicate generation. Big SQL 4.2 will automatically generate a new predicate on the partition expression column if possible. This means that there will be no need to manually change the existing queries to see added performance benefit. So for the queries illustrated above, Big SQL will generate extra predicates on the new trans_part and year_part columns.

There are cases that Big SQL can not generate the new predicate based on the partition expression, for example only local predicates are automatically generated, also >,< or between predicates can not be generated for non-monotonic functions and local predicates are not automatically generated for non-deterministic functions. Take a look at the query execution plan to determine if the predicates are automatically added. For more information on this topic, look at this post.

Removal of external applications

Another reason why this feature is appealing is due to the removal of dependency on external applications that are used to generate the partitioning keys. Prior to 4.2 you could partition the table such as this:

    Example 1-3  CREATE HADOOP TABLE INVENTORY_B (  trans_id int, product varchar(50), trans_ts timestamp  )   PARTITIONED BY (   year_part int  )    

And some other application had to be used to extract the year portion from the timestamp values prior to data ingest. Now in Big SQL 4.2, the column partitioning values will automatically be generated when the INSERT statement is used, so dependency on these applications can be removed.

Performance

Queries which have the following operators in the local predicate on the trans_id and/or trans_ts columns will see performance benefit with the addition of the new partitioning columns (provided that the expression used in the partitioning clause is deterministic and monotonic):

  • Logical operators: AND, OR, IN, NOT IN, NOT
  • Comparison operators: (equal-to, less-than, greater-than; and their negations)

For example these queries will be more efficient compared to the case where there are no partitions:

    select count(*) from INVENTORY_A where trans_id = 10  ;  select count(*) from INVENTORY_A where trans_ts > '2015-06-14' ;  select count(*) from INVENTORY_A where trans_id = 20 and trans_ts > '2015-06-15' ;    

Big SQL will only scan those directories that qualify instead of scanning all the directories on HDFS for that table. However, you still do not want to create too many partitions. Therefore, where possible take a look at the number of partitions that will be created when you partition the tables. For our example above, for the INVENTORY_A table, suppose you have 10,000,000 rows in this table and trans_id is a unique column, if we partition on an expression which does a division by 100,000, it would mean that you would have 10,000,000/10,000=1000 partition trans_part directories. And suppose there is 10 years worth of data, then in total there would be there are (1000*10) 10000 partitions for the INVENTORY_A table. But how many partitions would be too much? This is a good question, it is good practice to keep the number of partitions in the tens of thousands range. On an internal workload we measured an average performance improvement of 30% when we partitioned on an expression of a column with unique keys. But it really depends on your workload, how much performance improvements you could see.

Thanks to the folks from the Big SQL development team who developed and tested this feature: Bing Xin Cao, Nailah Bissoon, Qi Cheng, Rakebul Hasan, Tony Lai, Ying Cao

See Paul Yip’s Utube video on this feature.


[{"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSCRJT","label":"IBM Db2 Big SQL"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

UID

ibm16259937

Overview Annual report Corporate social responsibility Inclusion@IBM Financing Investor Newsroom Security, privacy & trust Senior leadership Careers with IBM Website Blog Publications Automotive Banking Consumer Good Energy Government Healthcare Insurance Life Sciences Manufacturing Retail Telecommunications Travel Our strategic partners Find a partner Become a partner - Partner Plus Partner Plus log in IBM TechXChange Community LinkedIn X Instagram YouTube Subscription Center Participate in user experience research Podcasts Contact IBM Privacy Terms of use Accessibility United States — English