Useful tips on ETL processing in Big SQL - Hadoop Dev


Extract-Transform-Load (ETL) is a process in a data warehouse system that improves the quality of the raw data feeds before any meaningful analysis is done on them. An ETL process typically massages the raw data in multiple steps, and each step usually creates intermediary results.

This blog presents some approaches to handling these intermediary results in Big SQL, along with recommendations on when to choose which approach.

The diagram below shows an example of a typical (and somewhat simplified) ETL process, where the IM_TAB* objects represent intermediary results of some transformation on the original data. The intermediary results and the paths to the “Scrubbed inventory data” object are highlighted by the red dashed boxes and lines in Figure 1 below.

Figure 1: Example of a typical ETL process

Approach 1: External Hadoop tables

Replace the CREATE TABLE statements with CREATE EXTERNAL HADOOP TABLE statements. This migration takes only a small amount of effort, but writing into external Hadoop tables is an expensive operation. Since an ETL process deals with massive amounts of data (hundreds of GBs, if not TBs, each day), this can be a very time-consuming approach.

Here is the skeleton of a stored procedure after migration:

    CREATE OR REPLACE PROCEDURE ETL1(IN YYYYMMDD VARCHAR(8))
    LANGUAGE SQL
    BEGIN
      DECLARE SQLSTMT VARCHAR(8192);

      /* Step 1A: Create intermediary table IM_TAB_1A */
      …
      /* Step 1B: Load the data into IM_TAB_1A */
      …
      /* Step 2A: Create intermediary table IM_TAB_1B */
      …
      /* Step 2B: Load the data into IM_TAB_1B */
      …

      /* Load the data into final tables */
      …

      /* Drop IM_TAB_1A */
      /* Drop IM_TAB_1B */
    END;

Listing 1: Main ETL procedure
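Once migrated, the procedure is invoked once per daily feed; for example (the date value here is purely illustrative):

    CALL ETL1('20211015');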

The first ETL step looks like this:

    /* Step 1A: Create intermediary table IM_TAB_1A */
    SET SQLSTMT = 'DROP TABLE IF EXISTS IM_TAB_1A_' || YYYYMMDD;
    EXECUTE IMMEDIATE SQLSTMT;

    SET SQLSTMT = '
      CREATE EXTERNAL HADOOP TABLE IM_TAB_1A_' || YYYYMMDD || '
      STORED AS PARQUET
      LOCATION ''/tmp/etl/im_tab_1a_' || YYYYMMDD || '/''
      AS
      SELECT
        …
      FROM
        SALES_DATA_FEED
      WHERE
        …
    ';

    /* Generate the data for table IM_TAB_1A */
    EXECUTE IMMEDIATE SQLSTMT;

Listing 2: Intermediary table creation

At the end of the stored procedure, all the intermediary tables will be dropped.

    /* Drop IM_TAB_1A */
    SET SQLSTMT = 'DROP TABLE IF EXISTS IM_TAB_1A_' || YYYYMMDD;
    EXECUTE IMMEDIATE SQLSTMT;

    /* Drop IM_TAB_1B */
    SET SQLSTMT = 'DROP TABLE IF EXISTS IM_TAB_1B_' || YYYYMMDD;
    EXECUTE IMMEDIATE SQLSTMT;

Listing 3: Dropping of intermediary tables

Approach 2: Views

Big SQL, like most systems that support SQL, supports the concept of VIEW objects. A view is a definition-only object, and its syntax is similar to the CREATE TABLE … AS SELECT statement shown in Listing 2. A view is evaluated lazily: its defining query is processed only when its result is needed.

The benefits of using a view instead of an intermediary table include:

  • Reduction in the amount of data written to HDFS
  • Reduction in the time taken to run the ETL process due to better SQL optimization

Changing the external table for an intermediary result in Listing 2 into a view looks like the following. Note that the STORED AS and LOCATION clauses are gone: a view is a definition-only object and has no storage of its own.

    SET SQLSTMT = '
      CREATE VIEW IM_TAB_1A_' || YYYYMMDD || '
      AS
      SELECT
        …';

Listing 4: View creation for intermediary results
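As with the tables in Listing 3, the view definitions can be cleaned up at the end of the stored procedure. A minimal sketch, assuming the IF EXISTS clause is accepted on DROP VIEW at your Big SQL level (otherwise use a plain DROP VIEW):

    /* Drop the view for IM_TAB_1A */
    SET SQLSTMT = 'DROP VIEW IF EXISTS IM_TAB_1A_' || YYYYMMDD;
    EXECUTE IMMEDIATE SQLSTMT;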

The benefits of converting the intermediary objects IM_TAB_1B, IM_TAB_1C, IM_TAB_2B, and IM_TAB_3B in Figure 1 from external tables to views include:

  • Reduction in the amount of data written to HDFS – if each of these objects produces 100 GBs of data, we reduce the data written to HDFS by 400 GBs.
  • Reduction in the time taken to run the ETL process – the processing of IM_TAB_1B and IM_TAB_1C from the three data feeds, IM_TAB_2B from IM_TAB_1B and IM_TAB_1C, and IM_TAB_3B from IM_TAB_2B to produce the end result “Scrubbed inventory data” is done in a single SQL processing step (see the sketch below). This allows Big SQL to apply more filtering and restrict the number of rows read from the data feeds. In addition, Big SQL optimization techniques such as join reordering, which picks the multiple-table join order with the lowest access plan cost, and redundant join elimination via primary key-foreign key relationships, which removes joins that cannot affect the result, can greatly reduce the processing time.
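To make the single-step processing concrete, here is a rough sketch of the view chain with the column lists and join predicates elided; the final table name SCRUBBED_INVENTORY and the literal date suffix are hypothetical, not from the original procedure:

    CREATE VIEW IM_TAB_2B_20211015 AS
      SELECT … FROM IM_TAB_1B_20211015 B, IM_TAB_1C_20211015 C WHERE …;

    CREATE VIEW IM_TAB_3B_20211015 AS
      SELECT … FROM IM_TAB_2B_20211015 WHERE …;

    /* Only this statement touches data: the four view definitions are
       expanded and optimized together as one access plan. */
    INSERT INTO SCRUBBED_INVENTORY
      SELECT … FROM IM_TAB_3B_20211015;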

Approach 3: Local and in-memory tables

Big SQL also supports local, non-HDFS tables. These are persistent tables that reside only on the head node. Unlike HDFS tables, whose data is accessible via the HDFS layer and written as N copies for redundancy, local tables avoid that overhead, which makes them suitable for intermediary results. They are particularly useful when an intermediary result is used more than once in the ETL process, such as the object IM_TAB_2A, which is referenced by both IM_TAB_3A and IM_TAB_3C in Figure 1. Making IM_TAB_2A a view would cost evaluating it twice; defining it as a local table avoids the re-evaluation, at the cost of writing its data once. The syntax for creating a local table is simply CREATE TABLE instead of CREATE EXTERNAL HADOOP TABLE.

Big SQL also supports the concept of an in-memory table via the DECLARE GLOBAL TEMPORARY TABLE statement. These can likewise be used to house the intermediary results, as shown in the sketch below.
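A minimal sketch of both options, reusing the IM_TAB_2A object from Figure 1 (the column definitions here are placeholders, not from the original procedure):

    /* Local head-node table: plain CREATE TABLE, no EXTERNAL HADOOP keywords */
    CREATE TABLE IM_TAB_2A (ITEM_ID BIGINT, QTY INTEGER);
    INSERT INTO IM_TAB_2A SELECT … FROM …;

    /* In-memory temporary table; it is referenced through the SESSION schema */
    DECLARE GLOBAL TEMPORARY TABLE IM_TAB_2A_T (ITEM_ID BIGINT, QTY INTEGER)
      ON COMMIT PRESERVE ROWS NOT LOGGED;
    INSERT INTO SESSION.IM_TAB_2A_T SELECT … FROM …;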

The limitation of local tables and in-memory tables is that the data resides only on the head node of the Big SQL system. If the intermediary tables are very large, they may not fit into memory or into the temporary disk space on the head node. We can overcome this limitation by dividing the incoming data into smaller batches, at the cost of increased complexity in the stored procedure logic, as sketched below.
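For illustration only, one way to batch is to hash a key column of the feed into a fixed number of buckets and loop over them inside the procedure body; FEED_KEY and the bucket count of 4 are assumptions, not part of the original procedure:

    DECLARE BATCH INTEGER DEFAULT 0;
    WHILE BATCH < 4 DO
      SET SQLSTMT = 'INSERT INTO SESSION.IM_TAB_2A_T
                     SELECT … FROM SALES_DATA_FEED
                     WHERE MOD(FEED_KEY, 4) = ' || BATCH;
      EXECUTE IMMEDIATE SQLSTMT;
      /* … transform this batch, append the results to the final tables,
         then empty SESSION.IM_TAB_2A_T before the next pass … */
      SET BATCH = BATCH + 1;
    END WHILE;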

Conclusion

Here are some general guidelines for deciding which approach to deploy in an ETL process.

  1. Use views where the objects are referenced only once and will be dropped at the end of the stored procedure
  2. Use local tables where the objects are referenced more than once across the ETL process, to avoid re-processing the data in the objects
  3. Use in-memory tables under the same conditions as in 2), but when the data in the intermediary objects is relatively small
  4. Use HDFS tables only where the objects must persist at the end of the ETL process so that they can be referenced afterwards

[{"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSCRJT","label":"IBM Db2 Big SQL"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]
