Partitioning a table on one or more columns organizes the data so that queries whose predicates reference the partitioning columns can skip irrelevant partitions and run faster. Let’s take a look at how Big SQL’s LOAD HADOOP statement can be used to load data into a partitioned table.
Consider a partitioned Hadoop table defined as follows:
CREATE HADOOP TABLE employees (
  empno INTEGER,
  lastname VARCHAR(15),
  dept INTEGER
)
PARTITIONED BY (province VARCHAR(20), city VARCHAR(15))
Data can be loaded into this table using either dynamic or static partitioning.
Dynamic Partitioning
This method does not require the values for partition columns to be known prior to loading data into a table using the LOAD HADOOP statement. The values are obtained from the input data. This can be very useful when dealing with large amounts of data where the full set of partition column values is not known and would be time-consuming to determine.
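How the partition is derived can be simulated outside Big SQL. The Python sketch below is an illustration of the idea, not Big SQL internals: it groups comma-separated records by their trailing partition-column values, the same way dynamic partitioning derives a partition (and hence an HDFS subdirectory) for each row.

```python
from collections import defaultdict

def dynamic_partition(records, num_partition_cols=2):
    """Group records by the values of their trailing partition columns,
    mimicking how dynamic partitioning derives partitions from the data."""
    partitions = defaultdict(list)
    for rec in records:
        fields = rec.split(",")
        data = fields[:-num_partition_cols]
        part_values = fields[-num_partition_cols:]
        # Partition key mirrors the HDFS layout: province=<value>/city=<value>
        key = "province=%s/city=%s" % tuple(part_values)
        partitions[key].append(data)
    return partitions

# Records in target column order (empno, lastname, dept, province, city)
rows = [
    "100,Plotz,850,Quebec,Montreal",
    "120,Naughton,650,Ontario,Toronto",
    "130,Yamaguchi,750,Ontario,Toronto",
]
parts = dynamic_partition(rows)
```

Each distinct combination of trailing values produces one partition; rows with the same combination land together.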
For example, let’s say we have input file /my_hdfs/data/employee_data2.csv which contains the following data:
(*** empno, lastname, province, city, dept ***)
100,Plotz,Quebec,Montreal,850
110,Ngan,British Columbia,Vancouver,750
120,Naughton,Ontario,Toronto,650
130,Yamaguchi,Ontario,Toronto,750
140,Fraye,Ontario,Toronto,650
150,Williams,Ontario,Ottawa,750
160,Molinare,British Columbia,Vancouver,550
170,Kermisch,Quebec,Montreal,650
180,Abrahams,British Columbia,Vancouver,850
190,Sneider,Ontario,Ottawa,650
200,Sanders,Ontario,Toronto,550
As each row of the input file is read, it is loaded into a 2-part partition corresponding to the values for the partition columns, province and city:
LOAD HADOOP USING FILE URL '/my_hdfs/data/employee_data2.csv'
  WITH SOURCE PROPERTIES ('field.indexes'='1,2,5,3,4')
  INTO TABLE employees
The partition column values must be the last values specified for each record in the source file. In this case, we have reordered the source fields using ‘field.indexes’ so that they match the column order of the target table with the partition columns specified last.
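The effect of ‘field.indexes’ can be illustrated with a small Python sketch (a simulation, not the loader’s actual code): given a 1-based list of source field positions, each record is rewritten into the target column order.

```python
def reorder_fields(record, field_indexes):
    """Reorder a delimited source record according to a 1-based index list,
    mimicking the effect of the 'field.indexes' source property."""
    fields = record.split(",")
    return [fields[i - 1] for i in field_indexes]

# Source order: empno, lastname, province, city, dept
# 'field.indexes'='1,2,5,3,4' yields empno, lastname, dept, province, city,
# matching the target column order with the partition columns last.
row = "100,Plotz,Quebec,Montreal,850"
reordered = reorder_fields(row, [1, 2, 5, 3, 4])
# reordered -> ['100', 'Plotz', '850', 'Quebec', 'Montreal']
```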
To show all of the partitions defined for the table after the LOAD HADOOP statement has been executed, the SYSHADOOP.HCAT_TABLEPARTS view can be queried:
SELECT SUBSTRB(location, POSSTR(location, 'province'), 50) AS partition
FROM SYSHADOOP.HCAT_TABLEPARTS
WHERE tabname = 'EMPLOYEES'

PARTITION
--------------------------------------------------
province=British Columbia/city=Vancouver
province=Ontario/city=Ottawa
province=Ontario/city=Toronto
province=Quebec/city=Montreal

  4 record(s) selected.
As well, the table can be queried to verify that the values for province and city remain as specified in the input file:
SELECT empno, lastname, dept, province, city
FROM employees
ORDER BY empno

EMPNO       LASTNAME        DEPT        PROVINCE             CITY
----------- --------------- ----------- -------------------- ---------------
        100 Plotz                   850 Quebec               Montreal
        110 Ngan                    750 British Columbia     Vancouver
        120 Naughton                650 Ontario              Toronto
        130 Yamaguchi               750 Ontario              Toronto
        140 Fraye                   650 Ontario              Toronto
        150 Williams                750 Ontario              Ottawa
        160 Molinare                550 British Columbia     Vancouver
        170 Kermisch                650 Quebec               Montreal
        180 Abrahams                850 British Columbia     Vancouver
        190 Sneider                 650 Ontario              Ottawa
        200 Sanders                 550 Ontario              Toronto

  11 record(s) selected.
The HDFS directories that are created as the result of partitioning are as follows:
hdfs dfs -ls /apps/hive/warehouse/bigsql.db/employees
Found 4 items
drwxrwxrwx   - bigsql hadoop 0 2016-07-25 07:34 /apps/hive/warehouse/bigsql.db/employees/._biginsights_stats
drwxrwx---   - bigsql hadoop 0 2016-07-25 07:33 /apps/hive/warehouse/bigsql.db/employees/province=British Columbia
drwxrwx---   - bigsql hadoop 0 2016-07-25 07:33 /apps/hive/warehouse/bigsql.db/employees/province=Ontario
drwxrwx---   - bigsql hadoop 0 2016-07-25 07:33 /apps/hive/warehouse/bigsql.db/employees/province=Quebec
Static Partitioning
This method requires that the partition column values be known ahead of time; they are then specified in the PARTITION clause of the LOAD HADOOP statement.
For example, let’s say we have input file /my_hdfs/data/employee_data1.csv and it contains the data below:
(*** empno, lastname, dept ***)
100,Plotz,850
110,Ngan,750
120,Naughton,650
130,Yamaguchi,750
140,Fraye,650
150,Williams,750
160,Molinare,550
170,Kermisch,650
180,Abrahams,850
190,Sneider,650
200,Sanders,550
Note that values for the partition columns are not included above.
By executing the statement below, all of the data will be loaded into a 2-part partition (province = 'Ontario', city = 'Toronto'). Assume that no partitions exist yet for the table.
LOAD HADOOP USING FILE URL '/my_hdfs/data/employee_data1.csv'
  INTO TABLE employees
  PARTITION (province = 'Ontario', city = 'Toronto')
This can be verified by querying the SYSHADOOP.HCAT_TABLEPARTS view to show all of the partitions defined for the table:
SELECT SUBSTRB(location, POSSTR(location, 'province'), 50) AS partition
FROM SYSHADOOP.HCAT_TABLEPARTS
WHERE tabname = 'EMPLOYEES'

PARTITION
--------------------------------------------------
province=Ontario/city=Toronto

  1 record(s) selected.
As well, the table can be queried to verify that the values for province and city are set to the values specified in the PARTITION clause of the LOAD HADOOP statement:
SELECT empno, lastname, dept, province, city
FROM employees
ORDER BY empno

EMPNO       LASTNAME        DEPT        PROVINCE             CITY
----------- --------------- ----------- -------------------- ---------------
        100 Plotz                   850 Ontario              Toronto
        110 Ngan                    750 Ontario              Toronto
        120 Naughton                650 Ontario              Toronto
        130 Yamaguchi               750 Ontario              Toronto
        140 Fraye                   650 Ontario              Toronto
        150 Williams                750 Ontario              Toronto
        160 Molinare                550 Ontario              Toronto
        170 Kermisch                650 Ontario              Toronto
        180 Abrahams                850 Ontario              Toronto
        190 Sneider                 650 Ontario              Toronto
        200 Sanders                 550 Ontario              Toronto

  11 record(s) selected.
The HDFS directories that are created as the result of partitioning are as follows:
hdfs dfs -ls /apps/hive/warehouse/bigsql.db/employees
Found 2 items
drwxrwxrwx   - bigsql hadoop 0 2016-07-25 08:29 /apps/hive/warehouse/bigsql.db/employees/._biginsights_stats
drwx------   - bigsql hadoop 0 2016-07-25 08:29 /apps/hive/warehouse/bigsql.db/employees/province=Ontario
In practice, when using static partitioning, the partition column values should not be contained in the source file.
If the source data does include values for the partition columns province and city, however, those values are overridden by the values specified in the PARTITION clause of the LOAD HADOOP statement.
For example, let’s say input file /my_hdfs/data/employee_data2.csv contains the following data:
(*** empno, lastname, province, city, dept ***)
100,Plotz,Quebec,Montreal,850
110,Ngan,British Columbia,Vancouver,750
120,Naughton,Ontario,Toronto,650
130,Yamaguchi,Ontario,Toronto,750
140,Fraye,Ontario,Toronto,650
150,Williams,Ontario,Ottawa,750
160,Molinare,British Columbia,Vancouver,550
170,Kermisch,Quebec,Montreal,650
180,Abrahams,British Columbia,Vancouver,850
190,Sneider,Ontario,Ottawa,650
200,Sanders,Ontario,Toronto,550
Regardless of the partition column values shown above, the statement below loads all of the data into the same 2-part partition (province = 'Ontario', city = 'Toronto'). Again, assume that no partitions exist yet for the table.
LOAD HADOOP USING FILE URL '/my_hdfs/data/employee_data2.csv'
  WITH SOURCE PROPERTIES ('field.indexes'='1,2,5,3,4')
  INTO TABLE employees
  PARTITION (province = 'Ontario', city = 'Toronto')
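The override behavior can be simulated with a short Python sketch (an illustration, not Big SQL internals): every record is assigned to the fixed partition from the PARTITION clause, and any partition-column values carried in the record itself are discarded.

```python
def static_partition(records, partition_values, num_partition_cols=2):
    """Assign every record to the fixed partition given in partition_values,
    dropping any partition-column values carried in the data itself --
    analogous to the PARTITION clause of LOAD HADOOP."""
    key = "province=%s/city=%s" % partition_values
    loaded = []
    for rec in records:
        fields = rec.split(",")
        # The trailing partition columns in the record are ignored;
        # the fixed values from the PARTITION clause win.
        loaded.append((key, fields[:-num_partition_cols]))
    return loaded

# Records in target column order (empno, lastname, dept, province, city);
# note the source says Quebec/Montreal and British Columbia/Vancouver.
rows = [
    "100,Plotz,850,Quebec,Montreal",
    "110,Ngan,750,British Columbia,Vancouver",
]
out = static_partition(rows, ("Ontario", "Toronto"))
```

Every row ends up under province=Ontario/city=Toronto, whatever its own trailing values were.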
The SYSHADOOP.HCAT_TABLEPARTS view can be queried to verify that only one 2-part partition is defined for the table:
SELECT SUBSTRB(location, POSSTR(location, 'province'), 50) AS partition
FROM SYSHADOOP.HCAT_TABLEPARTS
WHERE tabname = 'EMPLOYEES'

PARTITION
--------------------------------------------------
province=Ontario/city=Toronto

  1 record(s) selected.
As well, the table can be queried to verify that the values for province and city are set to the values specified in the PARTITION clause of the LOAD HADOOP statement:
SELECT empno, lastname, dept, province, city
FROM employees
ORDER BY empno

EMPNO       LASTNAME        DEPT        PROVINCE             CITY
----------- --------------- ----------- -------------------- ---------------
        100 Plotz                   850 Ontario              Toronto
        110 Ngan                    750 Ontario              Toronto
        120 Naughton                650 Ontario              Toronto
        130 Yamaguchi               750 Ontario              Toronto
        140 Fraye                   650 Ontario              Toronto
        150 Williams                750 Ontario              Toronto
        160 Molinare                550 Ontario              Toronto
        170 Kermisch                650 Ontario              Toronto
        180 Abrahams                850 Ontario              Toronto
        190 Sneider                 650 Ontario              Toronto
        200 Sanders                 550 Ontario              Toronto

  11 record(s) selected.
The HDFS directories that are created as the result of partitioning are as follows:
hdfs dfs -ls /apps/hive/warehouse/bigsql.db/employees
Found 2 items
drwxrwxrwx   - bigsql hadoop 0 2016-07-25 08:30 /apps/hive/warehouse/bigsql.db/employees/._biginsights_stats
drwx------   - bigsql hadoop 0 2016-07-25 08:29 /apps/hive/warehouse/bigsql.db/employees/province=Ontario