IBM Support

Big SQL v4 and v5 on HDP Supported Data Types

Technical Blog Post


Abstract

Big SQL v4 and v5 on HDP Supported Data Types

Body

Big SQL Supported Data Types in v4 and v5 will be outlined in this blog. Big SQL uses the Hive Metastore as its underlying data representation and access method. The table below gives an overview of the data types that are supported in Big SQL 4.x and Big SQL 5.x. As well as outlines their representation in the Hive Metastore and the Big SQL DB2 catalogs.

Big SQL 5.x Big SQL 4.x Hive Metastore Big SQL DB2 Catalog
TINYINT TINYINT TINYINT SMALLINT
SMALLINT SMALLINT SMALLINT SMALLINT
INT INT INT INT
BIGINT BIGINT BIGINT BIGINT
REAL REAL FLOAT REAL
FLOAT FLOAT DOUBLE DOUBLE
DOUBLE DOUBLE DOUBLE DOUBLE
DECIMAL(p,s) DECIMAL(p,s) DECIMAL(p,s) DECIMAL(p,s)
STRING* STRING* STRING VARCHAR of 32,672 characters
VARCHAR(x) VARCHAR(x) VARCHAR(x) VARCHAR(x)
CHAR CHAR CHAR CHAR
TIMESTAMP TIMESTAMP TIMESTAMP(9) TIMESTAMP
DATE DATE TIMESTAMP(9) DATE
DATE STORED AS DATE DATE STORED AS DATE DATE DATE
BOOLEAN BOOLEAN BOOLEAN SMALLINT
VARBINARY N/A BINARY VARBINARY
BINARY N/A BINARY BINARY
ARRAY ARRAY ARRAY ARRAY
MAP/ ARRAY[INT] /ARRAY[VARCHAR(x)] MAP/ ARRAY[INT] /ARRAY[VARCHAR(x)] MAP ASSOCIATIVE_ARRAY
STRUCT /ROW STRUCT /ROW STRUCT ROW

*For performance reasons the usage of STRING in Big SQL is strongly discouraged. An appropriately sized varchar type should be chosen instead more details are in the STRING section below

Java and CPP I/O interfaces

There are 2 I/O interfaces used in Big SQL. The Java I/O interface supports more data types that the CPP I/O interface. For those data types that are not supported in the CPP I/O interface the recommended file format is ORC. These data types include: DATE stored as DATE, all complex types and var/binary types.

Numeric DataTypes

There are no differences between Big SQL 4.x and Big SQL 5.x with respect to numeric data types. Supported numeric types in Hive include TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE and DECIMAL.

Big SQL 5.x Big SQL 4.x Hive Metastore Big SQL DB2 Catalog
TINYINT TINYINT TINYINT SMALLINT
SMALLINT SMALLINT SMALLINT SMALLINT
INT INT INT INT
BIGINT BIGINT BIGINT BIGINT
REAL REAL FLOAT REAL
FLOAT FLOAT DOUBLE DOUBLE
DOUBLE DOUBLE DOUBLE DOUBLE
DECIMAL(p,s) N/A DECIMAL(p,s) DECIMAL(p,s)

TINYINT is supported only in the context of CREATE HADOOP TABLE statement in all other contexts it is unavailable with Big SQL.
Big SQL defines a REAL data type as a 32-bit floating point value. With Hive a FLOAT always refers to a 32-bit floating point value, which corresponds to the JAVA FLOAT data type. When a table is created using a REAL data-type in Big SQL, it is represented as a FLOAT data type in the Hive catalog.
In Hive, a DOUBLE always refers to a 64-bit double precision floating point value. With Big SQL the FLOAT or DOUBLE data-type is a 64-bit double precision floating point value. When a table is created with the FLOAT data-type in Big SQL, it is represented as a DOUBLE data-type in the Hive Metastore. With Big SQL the definition of a FLOAT data type is a synonym for DOUBLE.
Big SQL DECIMAL types are supported in Big SQL, however, Hive supports a maximum precision of 38 digits whereas Big SQL supports a maximum precision of 31 digits. Partitioning by decimal type is supported in Big SQL but prior to Big SQL 4.2 it is not recommended because partition elimination is not supported until Big SQL 4.2 for Decimal types.

String DataTypes

There are no differences between Big SQL 4.x and Big SQL 5.x with respect to string data types. Supported string data types in Hive include STRING, VARCHAR and CHAR.

Big SQL 5.x Big SQL 4.x Hive Metastore Big SQL DB2 Catalog
STRING* STRING* STRING VARCHAR of 32,672 characters
VARCHAR(x) VARCHAR(x) VARCHAR(x) VARCHAR(x)
CHAR CHAR CHAR CHAR

All these Hive string data types are supported in Big SQL but for performance reasons, it is recommended to use CREATE HADOOP TABLE with VARCHAR(x) data type instead of the STRING type or SET HADOOP PROPERTY ‘bigsql.string.size’. The Big SQL engine internally works with data in units of 32K pages and works most efficiently when the definition of a table allows a row to fit within 32k of memory. Once the calculated row size exceeds 32k, performance degradation can be seen for certain queries. Therefore, the use of STRING when defining columns of a table is strongly discouraged. Instead, modify references to STRING to explicit VARCHAR(x) that most appropriately fits the data size, or use the bigsql.string.size property to lower the default size of the VARCHAR to which the STRING is mapped when creating new tables.

Date/Time Types

There are no differences between Big SQL 4.x and Big SQL 5.x with respect to date/time data types. Supported data/time types in Hive include DATE and TIMESTAMP.

Big SQL 5.x Big SQL 4.x Hive Metastore Big SQL DB2 Catalog
TIMESTAMP TIMESTAMP TIMESTAMP(9) TIMESTAMP
DATE DATE TIMESTAMP(9) DATE
DATE STORED AS DATE DATE STORED AS DATE DATE DATE

By default, Big SQL stores the DATE type as a TIMESTAMP in Hive. In Big SQL 4.2, support was added to store the DATE as a DATE type in Hive by using the CREATE HADOOP TABLE …DATE STORED AS DATE clause.
You can not partition a Hadoop table using a TIMESTAMP type but you can partition a Hadoop table using a DATE type. When importing Hive DATE types (via HCAT_SYNC_OBJECTS) or when the DATE STORED AS DATE clause is used, DATE types in Hive are represented as DATE types in Big SQL. In this case the Java I/O interface is used when accessing and writing data to these tables. The usage of the ORC file format is encouraged when DATE types are used in this manner. More information on DATE types in Big SQL and Hive and be found in tips for populating tables with date types in Big SQL and Hive.

Misc Types

These types include BOOLEAN and BINARY in Hive. Big SQL 5.x introduces BINARY and VARBINARY data types.

Big SQL 5.x Big SQL 4.x Hive Metastore Big SQL DB2 Catalog
BOOLEAN BOOLEAN BOOLEAN SMALLINT
VARBINARY N/A BINARY VARBINARY
BINARY N/A BINARY BINARY

A Hadoop table can be created with the BOOLEAN type in Big SQL but it is mapped to a SMALLINT in the Big SQL DB2 catalog. BINARY and VARBINARY type is supported in Big SQL in Big SQL 5.x. A VAR/BINARY type is a sequence of bytes which can contain non-traditional data such as pictures or media data. The maximum size of VAR/BINARY type is 32767 bytes. Tables can not be partitioned on a VAR/BINARY data type. The Java I/O interface is used when reading and writing Big SQL tables with VAR/BINARY data type. ORC file format is recommended in Big SQL 5.x when these VAR/BINARY types are used.

Complex Types

There are no differences between Big SQL 4.x and Big SQL 5.x with respect to complex data types. Complex types in Hive include ARRAY, MAP, STRUCT and UNION type.

Big SQL 5.x Big SQL 4.x Hive Metastore Big SQL DB2 Catalog
ARRAY ARRAY ARRAY ARRAY
MAP/ ARRAY[INT] /ARRAY[VARCHAR(x)] MAP/ ARRAY[INT] /ARRAY[VARCHAR(x)] MAP ASSOCIATIVE_ARRAY
STRUCT /ROW STRUCT /ROW STRUCT ROW

Big SQL supports Hive as well as DB2 syntax for complex data types. The Java I/O interface is used to read and write data into a Big SQL table in which any complex types are defined. The usage of ORC file format is recommended when complex types are used. Tables can not be partitioned on a complex type. There is no supported Hive Union data-type in Big SQL.

Example: Create a table with an array column of varchars with no specified maximum size in Big SQL and in Hive

    jsqsh > create hadoop table complexTableArray     (product varchar(10), Big_SQL_version INT, complex_type varchar(15) ARRAY[])      row format delimited fields terminated by ‘,’ collection items terminated by ‘|’;    hive > create hadoop table complexTableArray     (product varchar(10), Big_SQL_version INT, complex_type ARRAY)     row format delimited fields terminated by ‘,’ collection items terminated by ‘|’;    

Note that one can also define a maximum array size in Big SQL by specifying a size of the array e.g.complex_type ARRAY[3] A MAP is an associative array such that you can query the array with INT or VARCHAR index keys whereas in the case of an ARRAY the key can only be an INT. With Big SQL you can create a table with a MAP column either by using the MAP or ARRAY keyword in the create table statement. This corresponds to the MAP data type in Hive catalog and the ASSOCIATIVE_ARRAY data type in the Big SQL DB2 catalog.
Example create a table with a varchar(8) key that is used to map to a varchar(15) value in Big SQL and equivalent statement in Hive.

    jsqsh > create hadoop table complexTableMap     (product varchar(10), Big_SQL_version INT, complex_type varchar(15) ARRAY[varchar(8)])      row format delimited fields terminated by ‘,’ collection items terminated by ‘|’ map keys terminated by ‘#’;    hive > create table complexTableMap      (product varchar(10), Big_SQL_version INT, complex_type MAP)     row format delimited fields terminated by ‘,’ collection items terminated by ‘|’ map keys terminated by ‘#’ ;    

A STRUCT also known as a structured object can be defined in the CREATE HADOOP TABLE statement of a Big SQL table. The STRUCT type is analogous to the Big SQL ROW type and table creation with both STRUCT and ROW types are supported with Big SQL . However in the Big SQL DB2 catalog, they are represented as ROW types.
Example create a table with a structure that have 3 varchar elements in it in Big SQL and equivalent statement in Hive.

    jsqsh > create hadoop table complexTableStruct     (product varchar(10), Big_SQL_version INT,      complex_type ROW(Big_SQL_ARR varchar(8), Big_SQL_MAP varchar(15), Big_SQL_STRUCT varchar(15)))      row format delimited fields terminated by ‘,’ collection items terminated by ‘|’ ;    hive > create hadoop table complexTableStruct     (product varchar(10), Big_SQL_version INT, complex_type STRUCT)      row format delimited fields terminated by ‘,’ collection items terminated by ‘|’;    

[{"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSCRJT","label":"IBM Db2 Big SQL"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

UID

ibm16259837