Best Practice and Guidelines - Data Ingestion LOAD - Hadoop Dev

Technical Blog Post


Abstract

Best Practice and Guidelines - Data Ingestion LOAD - Hadoop Dev

Body

In some cases, data arrives in a format that needs to be converted. For example, data coming from the warehouse in text format may need to be converted to ORC or Parquet format in the data lake, and the target table may need to be partitioned. Transforming the data in this way can be done in Big SQL via the LOAD HADOOP or INSERT…SELECT statements. Loading data into partitioned tables via static and dynamic partitioning is also supported; examples can be found in loading data into partitioned tables.
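As a minimal sketch of such a conversion (the table and column names here are illustrative, not taken from this article), a text-format source table can be copied into a partitioned Parquet table using dynamic partitioning:

    -- Hypothetical target table, stored as Parquet and partitioned by year
    CREATE HADOOP TABLE BIGSQL.sales_parquet (
        id INT,
        amount DECIMAL(10,2)
    )
    PARTITIONED BY (sale_year INT)
    STORED AS PARQUETFILE;

    -- Dynamic partitioning: the partition column is the last column selected
    INSERT INTO BIGSQL.sales_parquet
    SELECT id, amount, YEAR(sale_date)
    FROM BIGSQL.sales_text;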
This blog gives some best practices for data ingestion with LOAD HADOOP. The Big SQL LOAD HADOOP and Hive INSERT statements both internally use the MapReduce framework to perform parallel processing. We will also compare the performance of the LOAD HADOOP and Hive INSERT statements.

Best practices of LOAD HADOOP statement

1. Data Source

The LOAD HADOOP statement can fetch data from an external data source and load it into a Big SQL table. However, the best practice recommendation is to copy these files to HDFS before executing the LOAD HADOOP statement. Even though this technique temporarily requires more space on HDFS, the LOAD can be parallelized more effectively.
Files on a remote data source are not splittable, so a map task processes one or more whole files, which can take longer. For example, if you have n terabytes of data to load that is contained in one file, the data is loaded by a single map task, which takes some time. If that same data is split across m files, then m map tasks load the data in parallel, which is more efficient. Moving large files to the distributed file system first and then running LOAD allows the files to be split, taking advantage of parallel map tasks. Loading directly from a remote file system might conserve space in the distributed file system, but you lose the benefits of parallel LOAD processing. Load from the remote file system only if space is a key consideration.
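For example (the paths and table name below are hypothetical), the files can first be staged on HDFS with hdfs dfs -put and then loaded from there:

    -- Shell step, shown as a comment: stage the source files on HDFS first
    --   hdfs dfs -put /local/export/sales_*.txt /tmp/staging/sales/
    LOAD HADOOP USING FILE URL '/tmp/staging/sales'
        WITH SOURCE PROPERTIES ('field.delimiter' = '|')
        INTO TABLE BIGSQL.sales APPEND;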

2. LOAD HADOOP Options

There are LOAD HADOOP options that can be tuned to improve performance. During LOAD processing, the Hadoop MapReduce framework is used, which runs jobs using multiple map and reduce tasks. You can control the number of map tasks that are created during the LOAD with the num.map.tasks property.
Increase num.map.tasks, but keep it less than the MaxStartups property in the /etc/ssh/sshd_config file on your machine. Note that if you increase num.map.tasks to be equal to or greater than MaxStartups, the map tasks will probably fail due to too many concurrent SFTP server connections. In addition, increasing num.map.tasks beyond the number of files being loaded will not help, because Big SQL uses only one map task per file.
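As a sketch (the file URL and table name are illustrative), the property is passed in the WITH LOAD PROPERTIES clause:

    -- Request 20 map tasks for this load; keep the value below MaxStartups
    LOAD HADOOP USING FILE URL '/tmp/staging/sales'
        INTO TABLE BIGSQL.sales APPEND
        WITH LOAD PROPERTIES ('num.map.tasks' = '20');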

3. Number of files and partitions

Before loading data, consider the number of files and partitions that will be created. Too many small files or too many partitions can have an adverse impact on load and query performance. As a best practice, try to generate files of approximately 256 MB. For Parquet, Big SQL always generates 256 MB files.
For partitioned tables, the number of files depends on the number of partitions. For example, if a table is 20 GB in size and has 10 partitions, then a 2 GB file is loaded into each partition directory. This is because the reduce tasks aggregate all the data for each partition into one file. It is important not to generate too many partitions when loading data; our best practice recommendation is not to exceed 100K partitions per table.
For unpartitioned tables, the number of files generated depends on the number of map tasks. For example, for the same 20 GB table, if 10 map tasks are used, then each file is about 2 GB in size. For unpartitioned tables, more map tasks can be used to parallelize the load and get closer to the recommended file size, as the worked example below shows.
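As a worked example (the numbers and names are illustrative): to approach the 256 MB target for a 20 GB unpartitioned table, aim for roughly 20 GB / 256 MB = 80 files, which suggests about 80 map tasks:

    -- About 80 map tasks should yield files near the 256 MB recommendation
    LOAD HADOOP USING FILE URL '/tmp/staging/sales'
        INTO TABLE BIGSQL.sales_unpart APPEND
        WITH LOAD PROPERTIES ('num.map.tasks' = '80');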

4. YARN Memory

The LOAD HADOOP statement uses memory from YARN. One strategy when loading lots of data via the LOAD HADOOP statement is to dedicate more resources to YARN than to Big SQL. By default, Big SQL uses 25% of the memory on the cluster. If the Big SQL memory allocation was changed during or after install, restore the defaults using the commands in section 4 of adjusting Big SQL memory allocation. If YARN is enabled for Big SQL, there is no need to restore the defaults. How to: Tune YARN and MapReduce Memory to Speed Up Big SQL LOAD gives recommendations on the memory settings to increase from their defaults to improve LOAD performance.
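The exact values depend on the cluster, but the settings typically reviewed for this kind of tuning (these are standard YARN and MapReduce properties, not values taken from this article) include:

    yarn.nodemanager.resource.memory-mb   - total memory YARN can allocate per node
    yarn.scheduler.maximum-allocation-mb  - largest single container YARN will grant
    mapreduce.map.memory.mb               - container size for each LOAD map task
    mapreduce.reduce.memory.mb            - container size for each LOAD reduce task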

5. LOAD from file vs LOAD from SQL

The recommended best practice is to load data from files; however, data can also be loaded via an SQL query. The query must have a WHERE clause that contains a $CONDITIONS marker, and a SPLIT column must be specified. Choose a SPLIT column that has an even distribution of values. By default, 4 map tasks are used unless the SPLIT column has fewer than 4 distinct values. Increasing num.map.tasks can improve the performance of a LOAD using an SQL query. The same caution from section 2 applies here: keep num.map.tasks less than the MaxStartups property in the /etc/ssh/sshd_config file, or the map tasks will probably fail due to too many concurrent connections.
The following is an example of a LOAD from an SQL query:

LOAD HADOOP USING JDBC CONNECTION URL 'jdbc:db2://xhadoop.ibmcom:32051/BIGSQL'
    WITH PARAMETERS (user='bigsql', password='bigsql')
    FROM SQL QUERY
        'SELECT col1, col2, (col3/1000) as pcol … FROM BIGSQL.tab1
         WHERE $CONDITIONS'
    SPLIT COLUMN col1
    INTO TABLE BIGSQL.tab2 APPEND
    WITH LOAD PROPERTIES ('rejected.records.dir' = '/tmp/rejected', 'num.map.tasks' = 16)

Experimental Results

Load time depends on factors such as cluster topology, configurations, and resource availability.

1. Load data that is present in the Hadoop distributed file system (HDFS)

If the source files are on the distributed file system (for example, HDFS) local to the cluster, then the default value for num.map.tasks is 80% of the cluster map task capacity. The maximum cluster map task capacity value can be found at http://hostname:50030/jobtracker.jsp
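For illustration (the capacity figure is assumed, not measured on our cluster): with a cluster map task capacity of 100, LOAD would default to 0.8 × 100 = 80 map tasks. The default can be overridden with the num.map.tasks load property shown in section 2.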

[Figure: LOAD elapsed time versus number of map tasks]

The graph above shows that having too few or too many map tasks can make LOAD slower. There is a sweet spot that depends on many environmental factors; in our environment, the sweet spot was 182 mappers.

2. LOAD into partitioned vs unpartitioned tables

From our tests, we found that loading a partitioned table takes more time than loading a non-partitioned table because the data must be sorted to group it by partition. As the number of partitions increases, the number of generated files increases for the same data set.

3. LOAD vs. Hive INSERT performance

The following graph shows that for small tables, LOAD is slower because MapReduce startup time heavily impacts LOAD performance for small tables. For large tables, LOAD is much faster.

[Figure: LOAD HADOOP vs. Hive INSERT elapsed times by table size]

We ran standard benchmark queries against both tables, where one table was populated with Hive INSERT and the other with the LOAD HADOOP statement, to compare query times.

From the tests, we noticed that tables populated by either ingestion statement had similar query times.

[Figure: Query times for tables populated via Hive INSERT vs. LOAD HADOOP]

Conclusion

This blog discussed best practice recommendations for loading data using the Big SQL LOAD HADOOP statement. The number of files and partitions are important factors for both load performance and query performance. Try to generate files close to 256 MB when possible, which is the size Big SQL always uses for Parquet files. To speed up load performance, allocate more memory to YARN than to Big SQL and increase the number of map tasks. LOAD HADOOP is slower than Hive INSERT for small tables but faster for large tables.

Acknowledgments

Many thanks to Simon Harris and Rama chandra raju Alluri for their contributions to this article.
Last refreshed by Nailah Bissoon, Oct 26th 2017.

[{"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSCRJT","label":"IBM Db2 Big SQL"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]
