Big SQL supports different file formats. Read this paper for more information on the different file formats supported by Big SQL. The distinction of what type of file format is to be used is done during table creation. Big SQL supports table creation and population from Big SQL as well as from Hive. One of the biggest advantages of Big SQL is that Big SQL syncs with the Hive Metastore. This means that Big SQL tables can be created and populated in Big SQL or created in Big SQL and populated from Hive. Hive tables can also be populated from Hive and then accessed from Big SQL after the catalogs are synced.
When loading data into Parquet tables Big SQL will use SNAPPY compression by default. For Hive, by default compression is not enabled, as a result the table could be significantly larger if created and/or populated in Hive. The next sections will describe how to enable SNAPPY compression for tables populated in Hive on IBM Open Platform (prior to Big SQL v5) and HortonWorks Data Platform (from Big SQL v5 and going forward).
Creating Big SQL Table using Parquet Format
When tables are created in Big SQL, the Parquet format can be chosen by using the STORED AS PARQUET clause in the CREATE HADOOP TABLE statement as in this example:
jsqsh>CREATE HADOOP TABLE inventory ( trans_id int, product varchar(50), trans_dt date ) PARTITIONED BY ( year int) STORED AS PARQUET
By default Big SQL will use SNAPPY compression when writing into Parquet tables. This means that if data is loaded into Big SQL using either the LOAD HADOOP or INSERT…SELECT commands, then SNAPPY compression is enabled by default.
Creating Hive Table using Parquet Format
If Parquet tables are created using Hive then the default is not to have any compression enabled. The following example shows the syntax for a Parquet table created in Hive:
hive> CREATE TABLE inv_hive ( trans_id int, product varchar(50), trans_dt date ) PARTITIONED BY ( year int) STORED AS PARQUET
Note that the syntax is the same yet the behavior is different. By default Hive will not use any compression when writing into Parquet tables.
Comparing Big SQL and Hive Parquet Table Sizes
The following table shows the table size for one table using the Parquet file format when the table is populated using Big SQL LOAD HADOOP and Big SQL INSERT…SELECT vs Hive INSERT…SELECT:
|Big SQL LOAD||Big SQL INSERT..SELECT||Hive INSERT..SELECT|
|164 GB||165 GB||280 GB|
Since the Parquet files created with Big SQL are compressed the overall table size is much smaller. The Big SQL table created and populated in Big SQL is almost half the size of the table created in Big SQL and then populated from Hive.
Enabling SNAPPY compression in Hive
Starting with Hive 0.13, the ‘PARQUET.COMPRESS’=’SNAPPY’ table property can be set to enable SNAPPY compression. You can alternatively set parquet.compression=SNAPPY in the “Custom hive-site settings” section in Ambari for either IOP or HDP which will ensure that Hive always compresses any Parquet file it produces. Here is an example of using the table property during a table creation statement in Hive:
hive> CREATE TABLE inv_hive_parquet( trans_id int, product varchar(50), trans_dt date ) PARTITIONED BY ( year int) STORED AS PARQUET TBLPROPERTIES ('PARQUET.COMPRESS'='SNAPPY');
Note that if the table is created in Big SQL and then populated in Hive, then this table property can also be used to enable SNAPPY compression. For example this is the syntax to create a Big SQL table with SNAPPY compression enabled. This can be useful if INSERT…SELECT statements are to be driven from Hive.
jsqsh> CREATE HADOOP TABLE inv_bigsql_parquet( trans_id int, product varchar(50), trans_dt date ) PARTITIONED BY ( year int) STORED AS PARQUET TBLPROPERTIES ('PARQUET.COMPRESS'='SNAPPY');
With snappy compression enabled in Hive we observed the following table sizes:
|Big SQL LOAD||Big SQL INSERT..SELECT||Hive INSERT..SELECT||164 GB||165 GB||163 GB|
With this property table size dropped from 280GB to 163GB, this is an approximate compression of almost two times. Not only will the table take up less space on HDFS but there can also be significant performance gain when accessing the data for either Big SQL or Hive. The recommendation is to either set ‘parquet.compress=SNAPPY’ in the TBLPROPERTIES when creating a Parquet table or set ‘parquet.compression.SNAPPY’ in hive-site through Ambari. This ensures that all Parquet files produced through Hive related to this table will be compressed.
Many thanks to Abhayan Sundararajan from Big SQL Performance team for discovery and contributions towards this paper.