
Big SQL Ingest - Adding files directly to HDFS - Hadoop Dev

Technical Blog Post



There are several ingestion techniques that can be used to add data to Big SQL and Hive. This blog gives an overview of the various ways of adding files to HDFS, or appending to files already in HDFS, and the commands to execute from either Hive or Big SQL. Adding files directly to HDFS can be one of the fastest ways to ingest data into Hadoop.

Create External Hadoop Tables

If files already reside on HDFS, Big SQL tables can be created with the location of these files specified. These tables are referred to as 'external' tables. Several ETL tools external to Big SQL, such as Apache Storm, Ab Initio, and DataStage, can be used to generate these types of files. The CREATE HADOOP TABLE statement with the LOCATION clause specified informs Big SQL of the location of the data files on HDFS. In this example, hadoop commands are used to add the files to HDFS, and Big SQL commands are shown to create and query the external table:

    $hadoop fs -mkdir '/user/hadoop/t1'
    $hadoop fs -put t1_1.txt '/user/hadoop/t1'
    $hadoop fs -ls '/user/hadoop/t1'
    Found 1 items
    -rw-r--r--   3 nbissoon hdfs         87 2016-12-16 13:41 /user/hadoop/t1/t1_1.txt

    $hadoop fs -cat /user/hadoop/t1/t1_1.txt
    1,1
    2,1
    3,1
    4,1

    jsqsh> CREATE HADOOP TABLE t1
        (c1 int, c2 int)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION '/user/hadoop/t1';

    jsqsh> select * from t1;
    +----+----+
    | C1 | C2 |
    +----+----+
    |  1 |  1 |
    |  2 |  1 |
    |  3 |  1 |
    |  4 |  1 |
    +----+----+

Add, remove, or append to files on HDFS for existing external tables

If Big SQL external tables already exist and new data files are added or removed, or data is appended to the original HDFS files, Big SQL can recognize that these new or updated files belong to the existing table. The Big SQL scheduler cache is refreshed by default every 20 minutes, or after any DDL operation such as CREATE/DROP/ALTER is done from Big SQL (since Big SQL 4.2). Big SQL is not aware of new or changed files on HDFS until this refresh occurs. There is a stored procedure that can be executed manually to instruct Big SQL to refresh its cache. For example, using the same table t1 created in the example above, the commands below add a new file (t1_2.txt) to the location specified in the CREATE HADOOP TABLE statement, and the HCAT_CACHE_SYNC stored procedure is called to refresh the scheduler cache:

    $hadoop fs -put t1_2.txt '/user/hadoop/t1'
    $hadoop fs -cat /user/hadoop/t1/t1_2.txt
    1,2
    2,2
    3,2
    4,2

    jsqsh> select * from t1;
    +----+----+
    | C1 | C2 |
    +----+----+
    |  1 |  1 |
    |  2 |  1 |
    |  3 |  1 |
    |  4 |  1 |
    +----+----+

    --Tell the Big SQL Scheduler to flush its cache for table t1
    jsqsh> CALL SYSHADOOP.HCAT_CACHE_SYNC('bigsql', 't1');

    jsqsh> select * from t1;
    +----+----+
    | C1 | C2 |
    +----+----+
    |  1 |  1 |
    |  2 |  1 |
    |  3 |  1 |
    |  4 |  1 |
    |  1 |  2 |
    |  2 |  2 |
    |  3 |  2 |
    |  4 |  2 |
    +----+----+

The scheduler.tableMetaDataCache.timeToLive property in bigsql-conf.xml can be used to change the cache refresh interval from the default of 20 minutes. However, there is a performance penalty for refreshing the cache too often.
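As a rough sketch, the refresh interval might be set with an entry like the following in bigsql-conf.xml. The Hadoop-style property layout and the unit of the value are assumptions here rather than something stated in this blog; confirm both against your cluster's bigsql-conf.xml and the Big SQL documentation before changing anything.

    <!-- Hypothetical bigsql-conf.xml entry; the <property> layout and the unit of
         the value are assumptions. 1200 would correspond to the 20-minute default
         only if the value is expressed in seconds - verify the expected unit. -->
    <property>
      <name>scheduler.tableMetaDataCache.timeToLive</name>
      <value>1200</value>
    </property>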

Add partitions into partitioned tables

Partitioned tables are recommended for performance reasons when tables are large. The data for each partition resides in a separate directory on HDFS. If the data is already on HDFS, you can use the Hive or Big SQL ALTER TABLE ... ADD PARTITION command to add entire partition directories; alternatively, the MSCK REPAIR TABLE command can be used from Hive instead of ALTER TABLE ... ADD PARTITION. For example, the following Big SQL commands create a partitioned table t1_part and add a new partition whose data resides in the 2017_part directory:

    jsqsh> CREATE EXTERNAL HADOOP TABLE t1_part
        (c1 int, c2 int)
        PARTITIONED BY (year_part int)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION '/user/hadoop/t1_part';

    $hadoop fs -mkdir '/user/hadoop/t1_part/2017_part'
    $hadoop fs -put 2017.txt '/user/hadoop/t1_part/2017_part'

    jsqsh> ALTER TABLE t1_part
        ADD PARTITION (year_part='2017') LOCATION '/user/hadoop/t1_part/2017_part';

The ALTER TABLE statement can also be issued from Hive using the same syntax. In that case, the HCAT_CACHE_SYNC stored procedure needs to be executed from Big SQL so that the partition is immediately accessible. If the table was created and the partition was added from Hive, the following commands would be executed:

    hive> CREATE TABLE t1_part
        (c1 int, c2 int)
        PARTITIONED BY (year_part int)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION '/user/hadoop/t1_part';

    $hadoop fs -mkdir '/user/hadoop/t1_part/2017_part'
    $hadoop fs -put 2017.txt '/user/hadoop/t1_part/2017_part'

    hive> ALTER TABLE t1_part
        ADD PARTITION (year_part='2017') LOCATION '/user/hadoop/t1_part/2017_part';

    --Tell the Big SQL Scheduler to refresh its cache for table t1_part
    jsqsh> CALL SYSHADOOP.HCAT_CACHE_SYNC('bigsql', 't1_part');

An alternative to the ALTER TABLE ... ADD PARTITION command is to create the partition directories on HDFS and then issue the MSCK REPAIR TABLE command from Hive, as in the following example:

    hive> CREATE TABLE t1_part
        (c1 int, c2 int)
        PARTITIONED BY (year_part int)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION '/user/hadoop/t1_part';

    $hadoop fs -mkdir '/user/hadoop/t1_part/2017_part'
    $hadoop fs -put 2017.txt '/user/hadoop/t1_part/2017_part'

    hive> MSCK REPAIR TABLE t1_part;

After issuing the command above, Hive is aware of the added partition directories. Big SQL must still be made aware of the new partition by issuing an additional stored procedure call:

    --Tell the Big SQL Scheduler to refresh its cache for table t1_part
    jsqsh> CALL SYSHADOOP.HCAT_CACHE_SYNC('bigsql', 't1_part');
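After the cache sync, the new partition should be visible to Big SQL. As a quick check, a hypothetical verification query is sketched below; the rows returned depend on the contents of 2017.txt, which is assumed to hold comma-delimited c1,c2 values like the earlier sample files.

    -- Hypothetical check that the new partition is readable from Big SQL
    jsqsh> SELECT * FROM t1_part WHERE year_part = 2017;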

Summary

This blog outlined the ingestion technique of adding files directly to HDFS and informing Big SQL of these files. This technique can offer very fast ingestion into Hadoop.

