Data files in Big SQL

Big SQL in BigInsights 3.0 supports multiple file formats, ranging from the simplest, plain text files, to more powerful row-oriented and column-oriented binary formats (Sequence file, RC file, ORC file, Parquet, and Avro).

Table 1: File formats supported in Big SQL

Text file: A human-readable format that is very widely used. The most common variations are pipe-delimited, comma-delimited, or tab-delimited data. Records are stored one row per line and are terminated by the newline character (\n).
Sequence file: A row-oriented layout in which the fields of each row are stored together as the contents of a single sequence-file record. Big SQL supports two variations of this format: the text sequence file format and the binary sequence file format.
RC file: The Record Columnar (RC) file format follows a column-oriented layout in which each logical row of the table is split and the column values of multiple rows are stored together as a record.
ORC file: The Optimized Row Columnar (ORC) file format is a columnar format like the RC file format. It reads selected columns efficiently and supports efficient compression schemes.
Parquet file: Another columnar file format. Like the ORC file format, it supports efficient compression schemes. It is designed to handle complex nested data structures and is based on Google's paper "Dremel: Interactive Analysis of Web-Scale Datasets".
Avro file: Similar to the sequence file format in that it follows a row-oriented layout and supports compression and splitting. Avro files have the added advantage of schema resolution and can be processed in many languages.




Text data files in Big SQL

Text files are a very popular format. They are used extensively as a handshake mechanism between integrated systems because the data is human readable and is stored as Unicode text.

This article focuses on the text file format and explains how to use it in the Big SQL environment.

 

Create text tables

This article uses the following SUPPLIER table from the TPCH schema to showcase the DDL, LOAD, INSERT, and other examples.

[code language="sql"]CREATE HADOOP TABLE SUPPLIER (
  S_SUPPKEY   INTEGER,
  S_NAME      VARCHAR(25),
  S_ADDRESS   VARCHAR(40),
  S_NATIONKEY INTEGER,
  S_PHONE     VARCHAR(15),
  S_ACCTBAL   FLOAT,
  S_COMMENT   VARCHAR(101)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
ESCAPED BY '\\'
NULL DEFINED AS '\N'
STORED AS TEXTFILE;[/code]

The following clauses of the CREATE HADOOP TABLE statement are the ones that are relevant to text tables:

  • ROW FORMAT DELIMITED FIELDS TERMINATED BY  The TERMINATED BY value indicates the column delimiter. The default value is '\001' (the Ctrl-A character). You can specify delimiters in the following ways:
    • Characters (such as '|' or ',')
    • Common escape sequences (such as '\t')
    • Octal values (such as '\001')
  • ESCAPED BY  This clause indicates the escape character for characters that have special meaning in the data files. Use it in your source data to make the field delimiter character part of the data itself.
    Requirement: The field delimiter character must not appear in the data unless it is escaped and the ESCAPED BY clause is provided. (A sketch that demonstrates this follows the note below.)
  • NULL DEFINED AS  This clause defines the string that denotes NULL in the data files. Big SQL in InfoSphere BigInsights 3.0 uses the Hive default of '\N'.
  • STORED AS  This clause specifies the storage format, which is one of the file formats that were introduced in Table 1.

Note: The newline character (\n) is the only supported line terminator in Big SQL.
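For example, the following sketch (the table name SUPPLIER_ESC and the sample row are hypothetical, not part of the TPCH data) shows how an embedded pipe is escaped so that it is read as data rather than as a field delimiter:

[code language="sql"]-- Hypothetical table that escapes embedded '|' characters with a backslash
CREATE HADOOP TABLE SUPPLIER_ESC (
  S_SUPPKEY INTEGER,
  S_NAME    VARCHAR(25),
  S_COMMENT VARCHAR(101)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
ESCAPED BY '\\'
STORED AS TEXTFILE;

-- A data-file line such as the following (made-up content):
--   1|Supplier#000000001|ships parts \| returns crates
-- is read as three fields; the escaped pipe stays inside S_COMMENT.[/code]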

Example 1

The examples in this section use the BigInsights JSqsh client to illustrate how to work with text files.

The BigInsights default installation has a preconfigured connection to the Big SQL database. By using this configuration, you can connect to the default bigsql database.
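To confirm that the connection works before you create any tables, you can run a trivial query. The following statement is only an illustrative check; it relies on the DB2-compatible SYSIBM catalog that the Big SQL 3.0 server exposes:

[code language="sql"]-- Illustrative connectivity check against the standard DB2 one-row dummy table
SELECT CURRENT SERVER FROM SYSIBM.SYSDUMMY1;[/code]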

The following DDL examples use different delimiters: comma, tab, and pipe. (A variant that uses an octal value appears after these examples.)

[code language="sql"]CREATE HADOOP TABLE SUPPLIER_CSV ( S_SUPPKEY INTEGER, S_NAME VARCHAR(25), S_ADDRESS VARCHAR(40), S_NATIONKEY INTEGER, S_PHONE VARCHAR(15), S_ACCTBAL FLOAT, S_COMMENT VARCHAR(101) )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

CREATE HADOOP TABLE SUPPLIER_TSV ( S_SUPPKEY INTEGER, S_NAME VARCHAR(25), S_ADDRESS VARCHAR(40), S_NATIONKEY INTEGER, S_PHONE VARCHAR(15), S_ACCTBAL FLOAT, S_COMMENT VARCHAR(101) )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t';[/code]

The next example uses the NULL DEFINED AS clause, which stores NULL values as \N in the data files.

[code language="sql"]CREATE HADOOP TABLE SUPPLIER_PIPE ( S_SUPPKEY INTEGER, S_NAME VARCHAR(25), S_ADDRESS VARCHAR(40), S_NATIONKEY INTEGER, S_PHONE VARCHAR(15), S_ACCTBAL FLOAT, S_COMMENT VARCHAR(101) )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
NULL DEFINED AS '\N'
STORED AS TEXTFILE;[/code]
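The octal form mentioned earlier follows the same pattern. The next sketch (the table name SUPPLIER_CTRLA is illustrative) spells out the default Ctrl-A delimiter as an octal escape:

[code language="sql"]-- Illustrative table that states the default '\001' (Ctrl-A) field delimiter explicitly
CREATE HADOOP TABLE SUPPLIER_CTRLA ( S_SUPPKEY INTEGER, S_NAME VARCHAR(25), S_ADDRESS VARCHAR(40), S_NATIONKEY INTEGER, S_PHONE VARCHAR(15), S_ACCTBAL FLOAT, S_COMMENT VARCHAR(101) )
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
STORED AS TEXTFILE;[/code]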

 

Data files for text tables

The SUPPLIER table will be populated from the SUPPLIER.TXT data file. The first five records of the example data file contain the following content:

[code language="bash"]hadoop fs -cat /tmp/load/SUPPLIER.TXT | head -5

1|Supplier#000000001| N kD4on9OM Ipw3,gf0JBoQDd7tgrzrddZ|17|27-918-335-1736|5755.94|requests haggle carefully. accounts sublate finally. carefully ironic pa

2|Supplier#000000002|89eJ5ksX3ImxJQBvxObC,|5|15-679-861-2259|4032.68|furiously stealthy frays thrash alongside of the slyly express deposits. blithely regular req

3|Supplier#000000003|q1,G3Pj6OjIuUYfUoH18BFTKP5aU9bEV3|1|11-383-516-1199|4192.40|furiously regular instructions impress slyly! carefu

4|Supplier#000000004|Bk7ah4CK8SYQTepEmvMkkgMwg|15|25-843-787-7479|4641.08|final ideas cajole. furiously close dep

5|Supplier#000000005|Gcdm2rJRzl5qlTVzc|11|21-151-690-3663|-283.84|carefully silent instructions are slyly according t[/code]

 

Loading data into text tables

Now that you have created the tables and can access the data file, the next step is to populate the tables so that they store the data. You can populate the tables with either the
LOAD HADOOP command or the INSERT statement.

Note: Before you run the LOAD command, make sure that the data file is available in the specified location on the Hadoop Distributed File System (HDFS).

[code language="sql"]load hadoop using file url '/tmp/load/SUPPLIER.TXT'
with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true')
into table SUPPLIER;[/code]
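Alternatively, you can add individual rows with the INSERT statement that was mentioned above. The values in this sketch are made up and are not part of the TPCH data set:

[code language="sql"]-- Illustrative single-row INSERT; the column values are invented
INSERT INTO SUPPLIER
VALUES (10001, 'Supplier#000010001', '123 Example Street', 17,
        '27-555-010-0001', 1234.56, 'sample comment text');[/code]

As the tip later in this article notes, each small INSERT creates a separate file on HDFS, so LOAD is usually the better choice for bulk data.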

 

Enabling compression for text files

In a big data environment, data volumes often reach terabytes, so processing and managing such large data sets can be challenging. You can mitigate these issues by enabling compression for your text files.

Compression reduces the space that is needed to store files. Because the files are smaller, they are also faster to transfer across the network. As a result, compression offers two benefits:

  • Less space used for storage.
  • Faster network processing and improved disk I/O.

However, compression comes at a price. Some compression formats cannot be split for parallel processing, and others are slow to decompress, which increases CPU utilization and offsets the benefit of reduced I/O.

DEFLATE, BZIP2, GZIP, LZO, LZ4, CMX, and SNAPPY are compression formats that are common in the Hadoop world. The following table summarizes the compression formats that are supported for text tables.

Tip: UPDATE and DELETE operations are not supported in Big SQL, so they are not covered as part of the supported DML operations in this table.

Table 2: Summary of compression formats for text tables

+-----------------------+----------------+------------+----------+----------+------------+--------------------------------------------+
| Compression algorithm | File extension | Splittable | DML LOAD | DML CTAS | DML INSERT | Compression codec                          |
+-----------------------+----------------+------------+----------+----------+------------+--------------------------------------------+
| BZIP2                 | .bz2           | YES        | YES      | YES      | YES        | org.apache.hadoop.io.compress.BZip2Codec   |
| SNAPPY                | .snappy        | NO         | YES      | YES      | YES        | org.apache.hadoop.io.compress.SnappyCodec  |
+-----------------------+----------------+------------+----------+----------+------------+--------------------------------------------+

For the text file formats, BZIP2 and SNAPPY are the supported compression schemes.

The BZIP2 compression scheme is slower at compression than at decompression, but compared with the other formats it is considered slow overall.

The SNAPPY compression scheme is faster in compression and decompression, but it is less efficient in terms of the compression ratio.

Splittability plays a key role in faster processing because multiple blocks of a file can be decompressed by multiple parallel tasks. Of the two supported compression schemes in Table 2, only BZIP2 is splittable. The other algorithms (DEFLATE, GZIP, LZO, LZ4, CMX, and Snappy) require all blocks of a file to be together for decompression to work. Therefore, only one task at a time does the decompression, which defeats parallel processing.

After you enable one of the codecs that are listed in Table 2, it applies to LOAD, CREATE TABLE AS (CTAS), and INSERT operations. All of these DML operations then produce compressed files on HDFS with the corresponding file extensions:

  • BZIP2 files – stored as *.bz2
  • SNAPPY files – stored as *.snappy

Tip: Be aware that each small INSERT ... VALUES operation creates a small compressed file, which can adversely affect performance because of fragmentation.

To compress the data, follow these steps:

  1. Create a Hadoop table.
  2. Set Hadoop properties to enable compression and to specify the compression codec. To enable compression, set the property 'mapred.output.compress'='true'.
    You must also set the required compression codec, as shown in Table 2.
  3. Populate the table by using LOAD, INSERT, or CREATE TABLE AS ... (CTAS).
  4. Verify the storage on HDFS (optional).

Example 2

The examples in this section use the same SUPPLIER data from the previous section and show how to load it into tables that are compressed with BZIP2 and SNAPPY.

First, create the tables. Because the structure of the tables is the same as in the previous section, use the LIKE clause as a shortcut.

[code language="sql"]CREATE HADOOP TABLE SUPPLIER_BZ LIKE SUPPLIER;

CREATE HADOOP TABLE SUPPLIER_SNAPPY LIKE SUPPLIER;[/code]

The following examples show how to set the compression-related properties for BZIP2, which is one of the supported compression formats for text files:

[code language="sql"]SET HADOOP PROPERTY 'mapred.output.compress'='true';
SET HADOOP PROPERTY 'mapred.output.compression.type'='block';
SET HADOOP PROPERTY 'mapred.output.compression.codec'='org.apache.hadoop.io.compress.BZip2Codec';[/code]

You can verify that the values are set properly by selecting from the SYSHADOOP.BIGSQL_PROPERTIES view:

[code language="sql"]SELECT * from SYSHADOOP.BIGSQL_PROPERTIES;[/code]
+---------------------------------+------------------------------------------+
| NAME                            | VALUE                                    |
+---------------------------------+------------------------------------------+
| mapred.output.compress          | true                                     |
| mapred.output.compression.codec | org.apache.hadoop.io.compress.BZip2Codec |
| mapred.output.compression.type  | block                                    |
+---------------------------------+------------------------------------------+

After you enable the compression mode and set the required codec, use the following LOAD, INSERT, and CREATE TABLE AS statements to populate the table:

[code language="sql"]load hadoop using file url '/tmp/load/SUPPLIER.TXT'
with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true')
into table SUPPLIER_BZ;[/code]

Verify the LOAD by running a SELECT COUNT:

[code language="sql"]select count(*) from SUPPLIER_BZ;[/code]
+-------+ 
| 1     | 
+-------+ 
| 10000 | 
+-------+

The next example populates the table by doing an INSERT:

[code language="sql"]INSERT INTO SUPPLIER_BZ VALUES(10000,'Supplier#000010000','aTGLEusCiL4F PDBdv665XBJhPyCOB0i',19,'29-578-432-2146',8968.42,'furiously final ideas believe furiously. furiously final ideas');[/code]

Now, disable the compression and run a simple INSERT command:

[code language="sql"]SET HADOOP PROPERTY 'mapred.output.compress'='false';

SELECT * from SYSHADOOP.BIGSQL_PROPERTIES;[/code]

+---------------------------------+-------------------------------------------+
| NAME                            | VALUE                                     |
+---------------------------------+-------------------------------------------+
| mapred.output.compress          | false                                     |
| mapred.output.compression.codec | org.apache.hadoop.io.compress.BZip2Codec  |
| mapred.output.compression.type  | block                                     |
+---------------------------------+-------------------------------------------+
[code language="sql"]INSERT INTO SUPPLIER_BZ VALUES(10000,'Supplier#000010000','aTGLEusCiL4F PDBdv665XBJhPyCOB0i',19,'29-578-432-2146',8968.42,'furiously final ideas believe furiously. furiously final ideas');[/code]

Query the table to be sure that the INSERT worked:

[code language="sql"]select count(*) from SUPPLIER_BZ;[/code]
+-------+
| 1     |
+-------+
| 10002 |
+-------+
[code language="sql"]SELECT S_NAME FROM SUPPLIER_BZ where S_SUPPKEY<10;[/code]
+--------------------+
| S_NAME             |
+--------------------+
| Supplier#000000001 |
| Supplier#000000002 |
| Supplier#000000003 |
| Supplier#000000004 |
| Supplier#000000005 |
| Supplier#000000006 |
| Supplier#000000007 |
| Supplier#000000008 |
| Supplier#000000009 |
+--------------------+
[code language="sql"]SELECT S_NAME FROM SUPPLIER_BZ where S_SUPPKEY between 1 and 10;[/code]
+--------------------+
| S_NAME             |
+--------------------+
| Supplier#000000001 |
| Supplier#000000002 |
| Supplier#000000003 |
| Supplier#000000004 |
| Supplier#000000005 |
| Supplier#000000006 |
| Supplier#000000007 |
| Supplier#000000008 |
| Supplier#000000009 |
| Supplier#000000010 |
+--------------------+

Finally, you can see how the data is stored on HDFS. In Big SQL, the data is stored by default in HDFS in the following path: /biginsights/hive/warehouse/schemaname.db/tablename.

When you examine the tables, you see that the file names are actually suffixed with .bz2 when compression is enabled.

[code language="bash"]hadoop dfs -ls -h /biginsights/hive/warehouse/bigsql.db/supplier_bz*
Found 3 items

-rw-r--r-- 1 bigsql biadmin 156 2014-07-30 03:00 /biginsights/hive/warehouse/bigsql.db/supplier_bz/i_1406713892390_160_201407300300434_0.bz2

-rw-r--r-- 1 bigsql biadmin 148 2014-07-30 03:02 /biginsights/hive/warehouse/bigsql.db/supplier_bz/i_1406713892390_171_201407300302547_0

-rwxr-xr-x 1 bigsql biadmin 414.3 K 2014-07-30 02:50 /biginsights/hive/warehouse/bigsql.db/supplier_bz/part-m-00000.bz2[/code]

After you set a codec, you can easily switch to another codec. The following example shows how to switch to SNAPPY, the other codec that is supported for text files. If you disabled compression earlier, set 'mapred.output.compress' back to 'true' before you continue so that the new codec takes effect.

[code language="sql"]SET HADOOP PROPERTY 'mapred.output.compression.codec'='org.apache.hadoop.io.compress.SnappyCodec';[/code]

Run the LOAD operation:

[code language="sql"]load hadoop using file url '/tmp/load/SUPPLIER.TXT'
with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true')
into table SUPPLIER_SNAPPY;[/code]

Run an INSERT as you did before with the BZIP2 example, but with the SNAPPY compression scheme in effect, as shown below. Then you can compare how the same data is stored on the HDFS file system.
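For example, the following statement mirrors the earlier BZIP2 INSERT and reuses the same illustrative row:

[code language="sql"]-- Same illustrative row as the BZIP2 example, now written while the Snappy codec is in effect
INSERT INTO SUPPLIER_SNAPPY VALUES(10000,'Supplier#000010000','aTGLEusCiL4F PDBdv665XBJhPyCOB0i',19,'29-578-432-2146',8968.42,'furiously final ideas believe furiously. furiously final ideas');[/code]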

[code language="bash"]hadoop dfs -ls -h /biginsights/hive/warehouse/bigsql.db/supplier_snappy*
Found 3 items

-rw-r--r-- 1 bigsql biadmin 143 2014-07-30 03:10 /biginsights/hive/warehouse/bigsql.db/supplier_snappy/i_1406713892390_235_201407300310595_0.snappy

-rw-r--r-- 1 bigsql biadmin 148 2014-07-30 03:13 /biginsights/hive/warehouse/bigsql.db/supplier_snappy/i_1406713892390_39_201407300313151_0

-rwxr-xr-x 1 bigsql biadmin 799.3 K 2014-07-30 03:09 /biginsights/hive/warehouse/bigsql.db/supplier_snappy/part-m-00000.snappy[/code]

The following example uses a CTAS statement with a compression scheme: it reads from the previously created BZIP2 table and stores the same data in SNAPPY format.

You can do that by using the CREATE TABLE AS ... SELECT statement, which creates the table and loads the data in a single step.

[code language="sql"]CREATE HADOOP TABLE SUPPLIER_SNAPPY2 AS
SELECT * FROM SUPPLIER_BZ;

SELECT COUNT(*) from SUPPLIER_SNAPPY2;[/code]

+-------+
| 1     |
+-------+
| 10002 |
+-------+

When you examine the HDFS storage, you see that the new table is in SNAPPY format: the supplier_snappy2 directory contains a single file with the .snappy extension and no small fragmented files.

[code language="bash"]hadoop dfs -ls -h /biginsights/hive/warehouse/bigsql.db/supplier_snappy2*

-rw-r--r-- 1 bigsql biadmin 801.0 K 2014-07-30 08:33 /biginsights/hive/warehouse/bigsql.db/supplier_snappy2/i_1406713892390_44_201407300833718_0.snappy[/code]

Examine storage in the BigInsights 3.0 console by going to the DFS Files tab on the Files page.

(Screen capture: the DFS Files tab on the Files page of the BigInsights console)

 

Performance

Table 3: Comparison of compression formats for text tables

+------------+-------------------+-------------------+---------------------+------------------------+
| Text codec | Compression ratio | HDFS size (bytes) | LOAD time (seconds) | ANALYZE time (seconds) |
+------------+-------------------+-------------------+---------------------+------------------------+
| Plain text | N/A               | 581,276,292,567   | 2,307               | 3,194                  |
| BZIP2      | 82%               | 107,297,679,462   | 3,984               | 4,423                  |
| SNAPPY     | 51%               | 284,924,553,315   | 2,239               | 2,858                  |
+------------+-------------------+-------------------+---------------------+------------------------+

Compression ratio: Computed as (uncompressed data size - compressed data size) / uncompressed data size, expressed as a percentage. For example, (101795673868 - 68827306956) / 101795673868 = 0.3238..., or about 32%. Applying the same formula to Table 3, BZIP2 gives (581,276,292,567 - 107,297,679,462) / 581,276,292,567, or approximately 82%.

Load time: The time spent, in seconds, to load the data from an HDFS source into all TPCH tables created in Big SQL.

Analyze time: The time spent in seconds to generate statistics for all TPCH tables within a schema.

The above data shows the following results:

  • BZIP2 yields the best compression, with an 82% compression ratio. As an application developer, choose this compression scheme if saving storage space is your biggest concern.
  • SNAPPY offers quick load and analyze times, even though the compression ratio is only about 51%. As an application developer, choose this compression scheme if fast setup is your biggest requirement.

 

Conclusion

 

As the examples show, text files are a good exchange format and are easy to work with. However, text files can use more space than other formats, and read operations on them can be costly. Therefore, the text file format might not be the best choice if performance is a concern.

 

Acknowledgments

 

Many thanks to the co-authors who contributed materials and reviewed this article: Scott C Gray and Jesse F Chen.
Thanks also to the following people, in alphabetical order, who gave their time and skills to review this article: Cindy Saracco, Deirdre Longo, Ellen Patterson, and Seeling Cheung.
