There are various methods to ingest data into Big SQL. This blog gives an overview of each of these options and provide some best practices for data ingestion in Big SQL. The method used to ingest the data, the size of the data files and the file format do have an impact on ingestion and query performance. This blog provides some best practices for data ingestion with query performance in mind.
Big SQL Data Ingestion Techniques
Some of the data ingestion techniques include:
- CREATE EXTERNAL HADOOP TABLE, add files directly to HDFS and/or ALTER TABLEâ€¦ ADD PARTITION
- CREATE HADOOP TABLE from Big SQL or CREATE TABLE from Hive
- INSERTâ€¦SELECT/ CTAS from Big SQL or Hive
- Big SQL LOAD HADOOP
Big SQL EXTERNAL HADOOP tables
Tables can be created with the EXTERNAL or LOCATION clause in Big SQL and files can be directly added to HDFS, this is one of the fastest methods to ingest data into Big SQL. This method of ingestion described more in Creating external hadoop tables and adding directly to HDFS. Most of the time is spent copying files to HDFS in this method of ingest.
Big SQL CREATE HADOOP TABLE/ CREATE TABLE from Hive
When tables are created from Big SQL, they are also accessible from Hive. When tables are created and populated from Hive, they are almost immediately accessible from Big SQL. Read more about automatic syncing of the Big SQL and Hive catalogs. When the syncing of the catalogs is done, the data that is accessible from Hive is also accessible by Big SQL.
When tables are created from Hive with the DATE type and syncâ€™ed into Big SQL the Java I/O interface is used. For cases where the Java I/O engine is used the recommended file format is ORC. Other data types that are only supported in the Java I/O interface include complex types and binary type. For these cases ORC should be used instead of parquet. Read more about data types supported in Big SQL v5.
In some cases, tables need to be transformed. For example, tables need to be partitioned or the file format of tables need to be changed. For example, the desired file format needs to be converted from text to ORC or Parquet. One way to transform the data in this way is to use the INSERT…SELECT statement. The CREATE HADOOP TABLEâ€¦ AS/ CREATE HBASE TABLE â€¦ AS (CTAS) statement, creates the table and issues a INSERTâ€¦SELECT statement in the background.
Big SQL INSERTâ€¦SELECT/CTAS statement does not use memory from YARN unless YARN is enabled for Big SQL. To speed up Big SQL INSERTâ€¦SELECT performance increase the resources allocated for Big SQL and add logical Big SQL workers.
Hive INSERTâ€¦SELECT/CTAS is map reduced based and does use memory from YARN. When using Hive INSERTâ€¦SELECT/CTAS ensure that there is sufficient resources allocated to YARN. For parquet files, due to a kernel issue, if timestamp or date stored as timestamp columns are ingested there can be runtime performance penalties with parquet files having timestamp data so consider using Big SQL INSERTâ€¦SELECT/CTAS in this case. When using INSERT..SELECT/CTAS from Hive, snappy compression is not enabled by default so enable snappy compression before ingesting parquet files using Hive INSERT. Using the SORT-BY clause of the Hive INSERT statement can improve query performance when accessing ORC and Parquet files. Meta-data information in the files can be used to prevent scanning of unnecessary rows of the data set when the data is sorted.
The use of INSERT with the VALUES clause is not recommended in Big SQL because each insert statement will produce 1 file. As a result, there can be a lot of small files which can adversely impact performance.
Big SQL LOAD HADOOP
Another way to transform the data in this way is to use the LOAD HADOOP command. In most cases Big SQL LOAD HADOOP is the fastest way to transform and ingest data into Big SQL. Read more about Best Practices using LOAD HADOOP. There can be runtime performance penalties when accessing timestamp fields of parquet-mr files. For this case, it is better to use the Big SQL INSERTâ€¦SELECT statement to get better query performance, however the performance of the ingest may not be as fast as compared to LOAD HADOOP or Hive INSERT…SELECT.
Guidelines when Ingesting data
Number of files/partitions
When ingesting data into HDFS, consider the number of files and file sizes. Too many small files can have negative impact on ingest and query performance on HDFS. Partitioned tables can improve performance but consider the number of partitions that will be created when partitioning tables are used.
If the chosen partitioning column will create more than 10,000 partitions then consider using another partitioning column or consider using the partition on column expression feature. For example, partitioned tables cannot be created on timestamp columns due to the number of partitions that can be generated. Consider partitioning columns that do not have a lot of distinct values. The number of partitions should also not be too small. For queries with predicates on the partitioning column, entire partitions can be eliminated during query processing. Partition elimination can greatly improve performance of these queries.
Whichever ingest technique is used, the data is distributed across the HDFS data nodes. If nodes are added the data may not be evenly distributed across the nodes. If a node goes bad, HDFS will try to move data away from this bad node. Use â€˜hdfs fsckâ€™ command to get info about the files and blocks. Aim for an even distribution of DFS blocks across all data nodes. Some small/medium size tables may not have files spread across all data nodes in the cluster as the files are not large enough. You can use the hdfs rebalancer utility to redistribute the data. Note that re-balancing can take a few hours to run but it is an online operation however ‘hdfs fsck’ is a relatively quick operation.
Best practices on ETL can be found in useful tips on ETL processing in Big SQL.
This blog outlines the various ways to ingest data into Big SQL which include adding files directly to HDFS, Big SQL LOAD HADOOP and INSERTâ€¦SELECT/CTAS from Big SQL and Hive. If the data needs to be transformed in any way such as changing the file format or partitioning the table, the fastest way to achieve this is by using the Big SQL LOAD HADOOP command. For parquet files that generate timestamp data, there can be performance penalties during query processing. Consider using INSERT…SELECT/CTAS from Big SQL in this case. If there are no timestamp data columns in the table, Hive INSERTâ€¦SELECT/CTAS can be used for generating parquet files but ensure that compression is enabled before generating these files. The number of files, number of partitions and file placement is important for ingest and query performance.