Big SQL Performance - Partition on Column Expression Feature for Improvements to Time Based Analysis

New in Big SQL 4.2 is the capability to allow an expression to be specified in the CREATE HADOOP TABLE statement in the PARTITIONED BY clause. This feature is very flexible, you can choose to partition based on any allowed expression. For an overview of this feature, please look at the following post.

This blog will focus on transformations for time and date analysis, as it is a common use case, but you can tailor this feature to your needs. This feature is especially attractive for time and date analysis because:

Partitioning on timestamps is not allowed but with this feature you can specify an expression to transform timestamp values into a supported partitioning data type e.g. a date column
Some other external application is usually used to extract the year, month and/or day portions of timestamp values prior to data ingest. Now in Big SQL 4.2, these column partitioning values will automatically be generated when the INSERT statement is used, so dependency on these applications can be removed, review this post for more details on data ingest using this feature
Queries will not need to be rewritten because local query predicates will automatically be generated where possible on the new partitioning columns
Performance can improve dramatically because the additional predicates will allow more partition elimination to occur and less of the data needs to be scanned during query execution

The following tables will be used for demonstration purposes in this blog, the correct syntax to use the new feature is highlighted in green:

    CREATE HADOOP TABLE INVENTORY_A (  trans_id int, product varchar(50), trans_ts timestamp  )   PARTITIONED BY (   INT(cast(trans_ts as date))/10000 AS year_part,  INT(cast(trans_ts as date))/100 AS month_part,  INT(cast(trans_ts as date)) AS day_part  )    CREATE HADOOP TABLE INVENTORY_B (  trans_id int, product varchar(50), trans_ts timestamp  )   PARTITIONED BY ( ;   YEAR(trans_ts) AS year_part,  MONTH(trans_ts) AS month_part,  DAY(trans_ts) AS day_part  )    CREATE HADOOP TABLE INVENTORY_C (  trans_id int, product varchar(50), trans_ts timestamp  )   PARTITIONED BY (   year_part int,  month_part int,  day_part int  )

Big SQL will create the following Hive table for all these tables:

    trans_id                int  product                 varchar(50)  trans_ts               timestamp    # Partition Information  # col_name              data_type               year_part               int  month_part              int  day_part                int

A sample of the data stored in these tables is shown below:

    select * from INVENTORY_A limit 6;  +----------+--------------+-------------------------+-----------+------------+----------+  | TRANS_ID | PRODUCT      | TRANS_TS                | YEAR_PART | MONTH_PART | DAY_PART |  +----------+--------------+-------------------------+-----------+------------+----------+  |     1005 | Beach Towel  | 2016-06-15 15:30:00.000 |      2016 |     201606 | 20160615 |  |     1000 | Hat          | 2016-06-13 15:30:00.000 |      2016 |     201606 | 20160613 |  |     1001 | Sun Screen   | 2016-06-13 16:30:00.000 |      2016 |     201606 | 20160613 |  |     1002 | Water Bottle | 2016-06-13 17:30:00.000 |      2016 |     201606 | 20160613 |  |     1003 | Sandals      | 2016-06-13 15:30:00.000 |      2016 |     201606 | 20160613 |  |     1004 | Lip Balm     | 2016-06-13 15:31:00.000 |      2016 |     201606 | 20160613 |  +----------+--------------+-------------------------+-----------+------------+----------+  select * from INVENTORY_B limit 6;  +----------+--------------+-------------------------+-----------+------------+----------+  | TRANS_ID | PRODUCT      | TRANS_TS                | YEAR_PART | MONTH_PART | DAY_PART |  +----------+--------------+-------------------------+-----------+------------+----------+  |     1000 | Hat          | 2016-06-13 15:30:00.000 |      2016 |          6 |       13 |  |     1001 | Sun Screen   | 2016-06-13 16:30:00.000 |      2016 |          6 |       13 |  |     1002 | Water Bottle | 2016-06-13 17:30:00.000 |      2016 |          6 |       13 |  |     1003 | Sandals      | 2016-06-13 15:30:00.000 |      2016 |          6 |       13 |  |     1004 | Lip Balm     | 2016-06-13 15:31:00.000 |      2016 |          6 |       13 |  |     1005 | Beach Towel  | 2016-06-15 15:30:00.000 |      2016 |          6 |       15 |  +----------+--------------+-------------------------+-----------+------------+----------+  select * from INVENTORY_C limit 6;  +----------+--------------+-------------------------+-----------+------------+----------+  | TRANS_ID | PRODUCT      | TRANS_TS                | YEAR_PART | MONTH_PART | DAY_PART |  +----------+--------------+-------------------------+-----------+------------+----------+  |     1000 | Hat          | 2016-06-13 15:30:00.000 |      2016 |          6 |       13 |  |     1001 | Sun Screen   | 2016-06-13 16:30:00.000 |      2016 |          6 |       13 |  |     1002 | Water Bottle | 2016-06-13 17:30:00.000 |      2016 |          6 |       13 |  |     1003 | Sandals      | 2016-06-13 15:30:00.000 |      2016 |          6 |       13 |  |     1004 | Lip Balm     | 2016-06-13 15:31:00.000 |      2016 |          6 |       13 |  |     1005 | Beach Towel  | 2016-06-15 15:30:00.000 |      2016 |          6 |       15 |  +----------+--------------+-------------------------+-----------+------------+----------+

Note that the data stored in INVENTORY_B and INVENTORY_C is exactly the same. However the data stored in INVENTORY_A is slightly different. This is because MONTH and DAY are non-monotonic scalar functions and for these types of functions, Big SQL will not be able to generate > and < predicates. There can be performance advantages of using a table as defined as in INVENTORY_A because local predicates can be automatically generated. For more details of when Big SQL can and can not generate additional predicates look at this post.

Here is another way to create the table this time using the DAYS scalar function. This function returns an integer representation of a date. If you wanted to partition the table based on a weeks worth of data you can use this syntax:

    CREATE HADOOP TABLE INVENTORY_D (  trans_id int, product varchar(50), trans_ts timestamp  )  PARTITIONED BY (  INT(days(trans_ts)/7) AS week_part  )    select * from INVENTORY_D limit 6;  +----------+-------------+-------------------------+-----------+  | TRANS_ID | PRODUCT     | TRANS_TS                | WEEK_PART |  +----------+-------------+-------------------------+-----------+  |        1 | Shoes       | 2008-01-01 00:00:00.000 |    104720 |  |        2 | Dress       | 2008-01-09 00:00:00.000 |    104721 |  |        3 | Hat         | 2008-01-12 00:00:00.000 |    104721 |  |        4 | Gloves      | 2008-01-06 00:00:00.000 |    104721 |  |        5 | Socks       | 2008-01-07 00:00:00.000 |    104721 |  |        6 | Sandals     | 2008-01-10 00:00:00.000 |    104721 |  |        7 | Nail Polish | 2008-01-08 00:00:00.000 |    104721 |  |        8 | Slippers    | 2008-01-11 00:00:00.000 |    104721 |  |        9 | Lotion      | 2008-01-13 00:00:00.000 |    104722 |  |       10 | Necklace    | 2009-01-08 00:00:00.000 |    104773 |  |       11 | Watch       | 2009-01-04 00:00:00.000 |    104773 |  +----------+---------------+-------------------------+-----------+

Try out other fun transformations to partition your data on time series data using this partition on column expression feature.

Thanks to the following contributors to this blog: Paul Yip, Qi Cheng
Thanks to the folks from the Big SQL development team: Bing Xin Cao, Nailah Bissoon, Qi Cheng, Rakebul Hasan, Tony Lai, Ying Cao

See the following video for a demonstration of this capability:
Paul Yip’s u-tube video.

Tips

Big SQL Performance - Partition on Column Expression Feature for Improvements to Time Based Analysis - Hadoop Dev

Technical Blog Post

Abstract

Body

UID

Share your feedback

Need support?