In some cases, data is in a certain format which needs to be converted. If the data is coming from the warehouse in text format and must be changed to a different format, LOAD HADOOP or INSERT…SELECT can be used to do this transformation. For example, if the data in the data lake and needs to be converted to ORC or Parquet format or the table needs to be partitioned. Loading data with LOAD HADOOP into partitioned tables via static and dynamic partitioning is also supported. Examples can be found in loading data into partitioned tables. This blog will give some best practice for data ingestion with LOAD HADOOP. The Big SQL LOAD HADOOP and the Hive INSERT statements internally use the MapReduce framework to perform parallel processing. We will compare performance of the LOAD HADOOP and Hive INSERT statements in this blog.
Best practices of LOAD HADOOP statement
1. Data Source
The LOAD HADOOP statement can fetch data from an external data source and load into a Big SQL table. However, the best practice recommendation is that these files should first be copied to the distributed file system before executing the LOAD HADOOP statement. Even though this technique temporarily requires more space on HDFS, the LOAD can be better parallelized by using this technique.
Files on a remote data source are not splittable and a single map task is used to process one or more whole files. For example, if you have n terabytes of data to load that is contained in one file, the data is loaded by a single map task, which takes some time to load. If that same data is split between m tasks, then m map tasks are used to load the data in parallel, which is more efficient. Moving the large files to HDFS first and then running LOAD will allow the files to be split, and takes advantage of the ability to run parallel map tasks. Loading directly from a remote file system might conserve space in the distributed file system, but you lose the benefits of LOAD processing in parallel. If space is a key consideration than try to load it from remote file system.
2. LOAD HADOOP Options
There are LOAD Hadoop options that can be tuned to improve performance. The num.map.tasks property can be increased to improve the performance of load. During LOAD processing, the Hadoop MapReduce framework is used, which runs jobs by using multiple mappers and reducer tasks. You can control the number of map tasks that are created during the LOAD with the num.map.tasks parameter. Increase num.map.tasks but keep it less than the MaxStartups property that is present in the /etc/sshd-config file on your machine. Note that if you try to increase num.map.tasks to be equal to or greater than MaxStartups, the map task jobs will probably fail due to too many concurrent SFTP server connections. Apart from this, increasing num.map.tasks to be greater than the number of files that are being loaded will not help, because Big SQL will only use 1 map task per file.
3. Number of files and partitions
Before loading data consider the number of files and partitions that will be created. Too many small files or too many partitions can have an adverse impact on load and query performance.
For partitioned tables, the number of files depend on the number of partitions. For example, if a table is 20GB is size and the number of partitions is 10 then 2GB files will be loaded in each partition directory. This is because the reduce tasks aggregate all the data for each partition into one file. It is important to not generate too many partitions when loading data. Our best practice recommendation is not to exceed 100K partitions per table. For un-partitioned tables, the number of files generated depend on the number of map tasks. For example, for the same 20GB table, if 10 map tasks are used, then each file can be 2GB is size.
4. YARN Memory
The LOAD HADOOP statement uses memory from YARN. One strategy that can be used when loading lots of data via the LOAD HADOOP statement is to dedicate more resources to YARN than Big SQL. By default, Big SQL uses 25% of the memory on the cluster. If the bigsql memory allocation was changed during or after install then restore the defaults using the commands in section 4 of adjusting Big SQL memory allocation. If YARN is enabled for Big SQL then there is no need to restore defaults. How to: Tune YARN and MapReduce Memory to Speed Up Big SQL LOAD gives some recommendations on the memory settings that need to be increased from the default to improve LOAD performance.
5. LOAD from file vs LOAD from SQL
The recommended best practice is to load data from file when possible, however data can also be loaded via a SQL query. The query must have a WHERE clause that contains a $CONDITIONS marker and a SPLIT column. The recommended best practice is to choose a SPLIT column that has an even distribution of values. By default, the number of map tasks used is 4 unless the SPLIT column has less than 4 distinct values. Increasing the num.map.tasks can improve performance of the LOAD using SQL statement. Try to increase num.map.tasks, but keep it less than the MaxStartups property that is present in the /etc/sshd-config file on your machine.
The following is an example of a load from SQL query: e.g. LOAD HADOOP USING JDBC CONNECTION URL 'jdbc:db2://xhadoop.ibmcom:32051/BIGSQL' WITH PARAMETERS (user='bigsqlâ€™ password='bigsql') FROM SQL QUERY 'SELECT col1, col2, (col3/1000) as pcol â€¦ FROM BIGSQL.tab1 WHERE $CONDITIONS' SPLIT COLUMN col1 INTO TABLE BIGSQL.tab2 APPEND WITH LOAD PROPERTIES ('rejected.records.dir' = '/tmp/rejected', 'num.map.tasks' = 16)
Load time depends on factors such as cluster topology, configurations, and resource availability.
1. Load data that is present in the Hadoop distributed file system (HDFS)
If the source files are on the distributed file system (for example, HDFS) to the cluster, then the default value for num.map.task is 80% of the cluster map task capacity.
From the graph above, the conclusion is that having too few or too many map tasks can make LOAD slower. There is sweet spot that depends on many environmental factors. In this case, based on our environment, our sweet spot is 182 mappers.
2. LOAD into partitioned vs un-partitioned tables
From our tests, we found that loading a partitioned table takes more time than loading a non-partitioned table because it has to do a sort to group the data. As the number of partition increase the number of generated files increase for the same data set.
3. LOAD vs. Hive INSERT performance
The following graph shows that for small tables, LOAD is slower because MR startup times heavily impact the performance of LOAD for small tables. For large tables, LOAD is much faster.
We performed the standard benchmark tests to highlight the query time difference on both the tables, where each table is loaded with Hive INSERT and LOAD HADOOP statement. From the tests, we noticed that both the data ingestion statements had similar query times.
Since LOAD HADOOP is map reduce based, study the map reduce logs when analyzing LOAD HADOOP errors.
This blog discusses the best practice recommendation when loading data using the Big SQL LOAD HADOOP statement. Number of files and partitions are important factors for load performance and for query performance. Try to generate files close to 256MB when possible which is the default for parquet files in Big SQL. To speed up load performance, allocate more memory to YARN than Big SQL and increase the number of map tasks. Performance of Hive vs Load is slower is some cases but faster in others.
Many thanks to Simon Harris and Rama chandra raju Alluri for their initial contributions to this article.