Partitioning a table on one or more columns organizes the data so that queries with predicates on the partitioning columns read only the matching partitions, which improves performance. Let's take a look at how Big SQL's LOAD HADOOP statement can be used to load data into a partitioned table.

Consider a partitioned Hadoop table defined as follows:

CREATE HADOOP TABLE employees
(
   empno INTEGER,
   lastname VARCHAR(15),
   dept INTEGER
)
PARTITIONED BY (province VARCHAR(20), city VARCHAR(15))
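
With this layout, a query whose predicate references the partitioning columns can skip the partitions that do not match. For example, the following hypothetical query (the specific values are illustrative) would touch only the province=Ontario/city=Toronto partition:

```sql
-- Hypothetical query: because both partitioning columns appear in
-- the predicate, only the province=Ontario/city=Toronto partition
-- (a single HDFS directory) needs to be scanned.
SELECT empno, lastname, dept
FROM employees
WHERE province = 'Ontario' AND city = 'Toronto'
```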

Data can be loaded into this table using either dynamic or static partitioning.

Dynamic Partitioning

With this method, the values of the partition columns do not need to be known before loading data into the table with the LOAD HADOOP statement; they are obtained from the input data itself. This can be very useful when dealing with large amounts of data, where the full set of partition column values is not known and would be time-consuming to determine.

For example, let’s say we have input file /my_hdfs/data/employee_data2.csv which contains the following data:

(*** empno, lastname, province, city, dept ***)
100,Plotz,Quebec,Montreal,850
110,Ngan,British Columbia,Vancouver,750
120,Naughton,Ontario,Toronto,650
130,Yamaguchi,Ontario,Toronto,750
140,Fraye,Ontario,Toronto,650
150,Williams,Ontario,Ottawa,750
160,Molinare,British Columbia,Vancouver,550
170,Kermisch,Quebec,Montreal,650
180,Abrahams,British Columbia,Vancouver,850
190,Sneider,Ontario,Ottawa,650
200,Sanders,Ontario,Toronto,550

As each row of the input file is read, it is loaded into a 2-part partition corresponding to the values for the partition columns, province and city:

LOAD HADOOP USING FILE URL '/my_hdfs/data/employee_data2.csv'
WITH SOURCE PROPERTIES ('field.indexes'='1,2,5,3,4')
INTO TABLE employees

The partition column values must be the last values specified for each record in the source file. In this case, the source fields arrive in the order empno, lastname, province, city, dept, so 'field.indexes'='1,2,5,3,4' reorders them to match the target column order (empno, lastname, dept, province, city), with the partition columns specified last.
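
As a hypothetical contrast (the file employee_data3.csv and its layout are assumptions, not part of the examples above): if a source file already listed its fields in the target column order with the partition columns last, the SOURCE PROPERTIES clause could simply be omitted:

```sql
-- Assumed source field order: empno, lastname, dept, province, city
-- (partition columns last), so no 'field.indexes' reordering is needed.
LOAD HADOOP USING FILE URL '/my_hdfs/data/employee_data3.csv'
INTO TABLE employees
```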

To show all of the partitions defined for the table after the LOAD HADOOP statement has been executed, the SYSHADOOP.HCAT_TABLEPARTS view can be queried:

SELECT SUBSTRB(location, POSSTR(location, 'province'), 50) as partition
FROM SYSHADOOP.HCAT_TABLEPARTS
WHERE tabname='EMPLOYEES'

PARTITION                                         
--------------------------------------------------
province=British Columbia/city=Vancouver          
province=Ontario/city=Ottawa                      
province=Ontario/city=Toronto                     
province=Quebec/city=Montreal                     

  4 record(s) selected.

The table can also be queried to verify that the values for province and city remain as specified in the input file:

SELECT empno, lastname, dept, province, city FROM employees ORDER BY empno

EMPNO       LASTNAME        DEPT        PROVINCE             CITY
----------- --------------- ----------- -------------------- ---------------
        100 Plotz                   850 Quebec               Montreal
        110 Ngan                    750 British Columbia     Vancouver
        120 Naughton                650 Ontario              Toronto
        130 Yamaguchi               750 Ontario              Toronto
        140 Fraye                   650 Ontario              Toronto
        150 Williams                750 Ontario              Ottawa
        160 Molinare                550 British Columbia     Vancouver
        170 Kermisch                650 Quebec               Montreal
        180 Abrahams                850 British Columbia     Vancouver
        190 Sneider                 650 Ontario              Ottawa
        200 Sanders                 550 Ontario              Toronto

  11 record(s) selected.

The HDFS directories created as a result of partitioning are shown below (the city subdirectories are nested within each province directory):

hdfs dfs -ls /apps/hive/warehouse/bigsql.db/employees

Found 4 items
drwxrwxrwx   - bigsql hadoop          0 2016-07-25 07:34 /apps/hive/warehouse/bigsql.db/employees/._biginsights_stats
drwxrwx---   - bigsql hadoop          0 2016-07-25 07:33 /apps/hive/warehouse/bigsql.db/employees/province=British Columbia
drwxrwx---   - bigsql hadoop          0 2016-07-25 07:33 /apps/hive/warehouse/bigsql.db/employees/province=Ontario
drwxrwx---   - bigsql hadoop          0 2016-07-25 07:33 /apps/hive/warehouse/bigsql.db/employees/province=Quebec

Static Partitioning

This method requires that the values of the partition columns be known ahead of time; they are then specified in the PARTITION clause of the LOAD HADOOP statement.

For example, let’s say we have input file /my_hdfs/data/employee_data1.csv and it contains the data below:

(*** empno, lastname, dept ***)
100,Plotz,850
110,Ngan,750
120,Naughton,650
130,Yamaguchi,750
140,Fraye,650
150,Williams,750
160,Molinare,550
170,Kermisch,650
180,Abrahams,850
190,Sneider,650
200,Sanders,550

Note that values for the partition columns are not included above.

Executing the statement below loads all of the data into a single 2-part partition (province='Ontario', city='Toronto'). Assume that no partitions exist yet for the table.

LOAD HADOOP USING FILE URL '/my_hdfs/data/employee_data1.csv'
INTO TABLE employees
PARTITION (province = 'Ontario', city = 'Toronto')

This can be verified by querying the SYSHADOOP.HCAT_TABLEPARTS view to show all of the partitions defined for the table:

SELECT SUBSTRB(location, POSSTR(location, 'province'), 50) as partition
FROM SYSHADOOP.HCAT_TABLEPARTS
WHERE tabname='EMPLOYEES'

PARTITION
--------------------------------------------------
province=Ontario/city=Toronto

1 record(s) selected.

The table can also be queried to verify that the values for province and city are set to the values specified in the PARTITION clause of the LOAD HADOOP statement:

SELECT empno, lastname, dept, province, city FROM employees ORDER BY empno
   
EMPNO       LASTNAME        DEPT        PROVINCE             CITY       
----------- --------------- ----------- -------------------- ---------------
        100 Plotz                   850 Ontario              Toronto    
        110 Ngan                    750 Ontario              Toronto
        120 Naughton                650 Ontario              Toronto
        130 Yamaguchi               750 Ontario              Toronto        
        140 Fraye                   650 Ontario              Toronto        
        150 Williams                750 Ontario              Toronto
        160 Molinare                550 Ontario              Toronto
        170 Kermisch                650 Ontario              Toronto
        180 Abrahams                850 Ontario              Toronto
        190 Sneider                 650 Ontario              Toronto
        200 Sanders                 550 Ontario              Toronto

  11 record(s) selected.

The HDFS directories created as a result of partitioning are as follows:

hdfs dfs -ls /apps/hive/warehouse/bigsql.db/employees

Found 2 items
drwxrwxrwx   - bigsql hadoop          0 2016-07-25 08:29 /apps/hive/warehouse/bigsql.db/employees/._biginsights_stats
drwx------   - bigsql hadoop          0 2016-07-25 08:29 /apps/hive/warehouse/bigsql.db/employees/province=Ontario

In practice, when using static partitioning, the partition column values should not be included in the source file. If the data does include values for the partition columns, province and city, those values are overridden by the values specified in the PARTITION clause of the LOAD HADOOP statement.

For example, let’s say input file /my_hdfs/data/employee_data2.csv contains the following data:

(*** empno, lastname, province, city, dept ***)
100,Plotz,Quebec,Montreal,850
110,Ngan,British Columbia,Vancouver,750
120,Naughton,Ontario,Toronto,650
130,Yamaguchi,Ontario,Toronto,750
140,Fraye,Ontario,Toronto,650
150,Williams,Ontario,Ottawa,750
160,Molinare,British Columbia,Vancouver,550
170,Kermisch,Quebec,Montreal,650
180,Abrahams,British Columbia,Vancouver,850
190,Sneider,Ontario,Ottawa,650
200,Sanders,Ontario,Toronto,550

Regardless of the partition column values shown above, the statement below loads all of the data into the same 2-part partition (province='Ontario', city='Toronto'). Again, assume that no partitions exist yet for the table.

LOAD HADOOP USING FILE URL '/my_hdfs/data/employee_data2.csv'
WITH SOURCE PROPERTIES ('field.indexes'='1,2,5,3,4')
INTO TABLE employees
PARTITION (province = 'Ontario', city = 'Toronto')

The SYSHADOOP.HCAT_TABLEPARTS view can be queried to verify that only one 2-part partition is defined for the table:

SELECT SUBSTRB(location, POSSTR(location, 'province'), 50) as partition
FROM SYSHADOOP.HCAT_TABLEPARTS
WHERE tabname='EMPLOYEES'

PARTITION
--------------------------------------------------
province=Ontario/city=Toronto

1 record(s) selected.

The table can also be queried to verify that the values for province and city are set to the values specified in the PARTITION clause of the LOAD HADOOP statement:

SELECT empno, lastname, dept, province, city FROM employees ORDER BY empno

EMPNO       LASTNAME        DEPT        PROVINCE             CITY
----------- --------------- ----------- -------------------- ---------------
        100 Plotz                   850 Ontario              Toronto
        110 Ngan                    750 Ontario              Toronto
        120 Naughton                650 Ontario              Toronto
        130 Yamaguchi               750 Ontario              Toronto
        140 Fraye                   650 Ontario              Toronto
        150 Williams                750 Ontario              Toronto
        160 Molinare                550 Ontario              Toronto
        170 Kermisch                650 Ontario              Toronto
        180 Abrahams                850 Ontario              Toronto
        190 Sneider                 650 Ontario              Toronto
        200 Sanders                 550 Ontario              Toronto

  11 record(s) selected.

The HDFS directories created as a result of partitioning are as follows:

hdfs dfs -ls /apps/hive/warehouse/bigsql.db/employees

Found 2 items
drwxrwxrwx   - bigsql hadoop          0 2016-07-25 08:30 /apps/hive/warehouse/bigsql.db/employees/._biginsights_stats
drwx------   - bigsql hadoop          0 2016-07-25 08:29 /apps/hive/warehouse/bigsql.db/employees/province=Ontario
