One of the challenges when accessing JSON data is referencing the fields within the JSON itself. The acknowledged strengths of JSON can also contribute to its complexity, particularly in the area of parsing. In addition to the existing built-in support for SerDes, IBM Db2 Big SQL now provides built-in SQL JSON functions that allow the user to store and query JSON data in the same manner that XML data can be handled using the set of XML functions provided in IBM Db2 Big SQL.

To demonstrate the use of these SQL JSON functions, examples are provided that show how to query JSON documents returned by YARN and HDFS REST API calls. As part of this exercise, performing a REST API call using a user defined function, as well as reading (JSON) data using WebHDFS, is also demonstrated.

IBM Db2 Big SQL Functionality Demonstrated In This Article

This article demonstrates the following features in IBM Db2 Big SQL:

  • Using the Db2 REST API user defined functions to perform YARN and WebHDFS REST API calls (all you do is pass in the URL and the rest is taken care of)
  • Use of stored procedures and user defined functions (scalar and table)
  • Returning result sets using stored procedures
  • Storing and handling results in the Binary Large Object (BLOB) and Character Large Object (CLOB) data types
  • Global variables
  • Array and row data types
  • Performing time series queries using the OVERLAPS predicate to answer questions such as the following:
    • Which jobs were executing on YARN during a specified time period
    • How many jobs were preempted while a given job was executing
  • Declared global temporary tables
  • SQL JSON functions
  • SQL XML functions

Setup and References

The examples in this article require the creation of Db2 Big SQL user defined functions. The links in this section provide instructions on how to access and create those functions. A link is also provided to an eBook that goes into great detail about the Db2 SQL JSON functions.

Please note that the functions and procedures either in this article or other references and links are provided “AS-IS” without warranty of any kind.

The Db2 REST User Defined Functions

The Db2 REST user defined functions are used to make REST API calls. Information about the Db2 REST functions and how to create them is available at the following link: https://developer.ibm.com/tutorials/dm-1105httprestdb2/

The DDL to create the Db2 REST UDFs and the associated jar file are here: DB2XML_jar_ddl. 

The Hive User Defined Function HIVE.FROM_UNIXTIME

The user defined function HIVE.FROM_UNIXTIME converts a number of seconds from the Unix epoch (1970-01-01 00:00:00 UTC) to a STRING that represents the equivalent TIMESTAMP value. The following Knowledge Center link provides directions for creating this and the other Hive user defined functions that are shipped as part of Big SQL:

https://www.ibm.com/support/knowledgecenter/en/SSCRJT_6.0.0/com.ibm.swg.im.bigsql.commsql.doc/doc/biga_hiveUDF.html

Ebook About the Db2 Version 11.1 JSON Functions

This eBook provides a wealth of additional information about the SQL JSON functions:  http://ibm.biz/db2json

Benefits of Using The Features and Options of The SQL JSON Functions

While this article is not intended to be an in-depth review of using the Big SQL JSON functions, here are a few features and options that are available:

Ability to Specify How Malformed JSON Data, Null Values, and Surrounding Quotes Are Handled

There are many options that can be specified with regard to how JSON data is interpreted. Examples are listed below, followed by a short sketch:

  • How a parsing error in a JSON record, or an encountered null or empty value, should be handled:
    • Return an error
    • Return NULL
    • Return a default value
  • How arrays are handled if encountered but not expected
  • Whether or not to omit surrounding quotes
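Below is a minimal sketch of these options using JSON_VALUE against an inline document (the document and values are illustrative only):

/* DEFAULT ... ON EMPTY fires because '$.cost' selects nothing from the document;
   NULL ON ERROR would apply instead if the document itself were malformed */
VALUES JSON_VALUE('{"price": 12.5}', '$.cost'
                  RETURNING VARCHAR(10)
                  DEFAULT 'N/A' ON EMPTY
                  NULL ON ERROR)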

Successful Handling of JSON Records with Multiple Newline Characters

JSON data files are routinely stored as plain text files. Typically, when a JSON SerDe is used, each JSON record must appear on a single line for parsing to succeed, because carriage return/linefeed characters define the end of each record. For example, the following JSON record contains multiple newline characters (each line that contains data ends with a newline, and each blank line is a newline) and cannot be successfully read using most SerDes:

{ "Currency":"EUR"

,"date":

"2018-04-18"

,"rates": {

"NZD": 1.975

,"CAD": 1.233


,"AUS": 1.238       }

}

When using the Big SQL JSON functions (such as JSON_TABLE, JSON_VALUE, and JSON_QUERY), there is no requirement that each JSON record fit on a single line.
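To illustrate, here is a small sketch (runnable from the Db2 command line processor; the values are taken from the record above) that extracts a rate from a literal containing embedded newlines:

VALUES JSON_VALUE('{ "Currency":"EUR"
,"date": "2018-04-18"
,"rates": { "NZD": 1.975, "CAD": 1.233, "AUS": 1.238 }
}', '$.rates.NZD' RETURNING DECIMAL(9,3) ERROR ON ERROR)
-- Returns 1.975 even though the record spans multiple lines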

No Need to Worry About Jar File Security, Setting of CLASSPATH or Permissions

With regard to security, it is standard practice to be cautious about jar files. Depending on your SerDe implementation, jar files may need to be downloaded that cannot be verified. With the Big SQL JSON functions this is not a concern; none of the following actions needs to be taken:

  • Downloading and/or copying jar files
  • Maintaining the CLASSPATH
  • Setting permissions

Querying JSON Data Returned From a REST API Call

Querying JSON data is always performed using the same set of SQL JSON functions. In the case of JSON data returned from a REST API call, the REST API call must be performed first to obtain the data. The Db2 REST function HTTPGETCLOB executes an HTTP GET against the URL passed in and returns the results. Before diving further into the SQL JSON functions, let's first demonstrate how to use the HTTPGETCLOB function to perform a YARN Resource Manager REST API call.

The section titled 'Setup and References' contains the steps for downloading and creating the Db2 REST functions.

Returning Application Statistics Using the YARN Resource Manager REST API

NOTE: The URI used in the examples provides the functionality as described by the link https://hadoop.apache.org/docs/r3.1.0/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html#Cluster_Application_Statistics_API.

Let's start by demonstrating a Resource Manager REST API call that returns statistics about the application with the application id 'application_1544877054077_0008':

WITH APPLICATION_STATISTICS (DOCUMENT_DATA) AS
(
VALUES DB2XML.HTTPGETCLOB('http://oceania.machine.com:8088/ws/v1/cluster/apps/application_1544877054077_0008','')
)
SELECT DOCUMENT_DATA FROM APPLICATION_STATISTICS

 

Below is the output that is returned:

 

{"app":{"id":"application_1544877054077_0008","user":"census1","name":"LOAD HADOOP USING FILE '/census/data'
INTO TABLE CENSUS_1991","queue":"default","state":"FINISHED","finalStatus":"SUCCEEDED",
"progress":100.0,"trackingUI":"History","trackingUrl":"http://oceania.machine.com:8088/proxy/application_1544877054077_0008/","diagnostics":"","clusterId":1544877054077,"applicationType":"MAPREDUCE","applicationTags":"",
"priority":0,"startedTime":1545176117231,"finishedTime":1545176134787,"elapsedTime":17556,
"amContainerLogs":"http://oceania.machine.com:8042/node/containerlogs/container_e09_1544877054077_0008_01_000001/census1l","amHostHttpAddress":"noosa4.oceania.machine.com:8042","amRPCAddress":"oceania.machine.com:35799",
"masterNodeId":"oceania.machine.com:45454","allocatedMB":-1,"allocatedVCores":-1,"reservedMB":-1,"reservedVCores":-1,"runningContainers":-1,"memorySeconds":95495,"vcoreSeconds":30,"queueUsagePercentage":0.0,"clusterUsagePercentage":0.0,"resourceSecondsMap":{"entry":{"key":"memory-mb","value":"95495"},"entry":{"key":"vcores","value":"30"}},"preemptedResourceMB":0,
"preemptedResourceVCores":0,"numNonAMContainerPreempted":0,"numAMContainerPreempted":0,"preemptedMemorySeconds":0,
"preemptedVcoreSeconds":0,"preemptedResourceSecondsMap":{},"logAggregationStatus":"SUCCEEDED",
"unmanagedApplication":false,"amNodeLabelExpression":"","timeouts":{"timeout":[{"type":"LIFETIME","expiryTime":"UNLIMITED","remainingTimeInSeconds":-1}]}}}
Before continuing with how to transform the JSON data into relational format, let’s first describe (using the Db2 command line processor) the query that was executed:
DESCRIBE
WITH APPLICATION_STATISTICS (DOCUMENT_DATA) AS
(
VALUES DB2XML.HTTPGETCLOB('http://oceania.machine.com:8088/ws/v1/cluster/apps/application_1544877054077_0008','')
)
SELECT DOCUMENT_DATA FROM APPLICATION_STATISTICS

Below is the output (some fields have been omitted):

 

SQL type | Column name   | Lob length | Schema | Data type
---------|---------------|------------|--------|----------
409 CLOB | DOCUMENT_DATA | 5242880    | SYSIBM | CLOB

Notice that the data type returned is a CLOB! That's right: Big SQL supports the BLOB and CLOB data types and has built-in functionality for returning and manipulating them. However, what Big SQL does not support for Hadoop or HBase tables (tables created using 'CREATE HADOOP/HBASE TABLE') is creating a table with a column of type BLOB or CLOB. The following table summarizes the use of the data types VARCHAR, BLOB, and CLOB in Big SQL:

 

Datatype              | Column in a Hadoop/HBase Table | Column in a Local Db2 Table in Big SQL | Column in Declared and Global Temporary Tables
VARCHAR and VARBINARY | Yes                            | Yes                                    | Yes
BLOB                  | No                             | Yes                                    | Yes
CLOB                  | No                             | Yes                                    | Yes

The following table summarizes the maximum lengths for each column, global variable, or declared type created with the VARCHAR, BLOB, and CLOB data types:

Datatype              | Maximum Length
VARCHAR and VARBINARY | 32672
BLOB                  | 2 GB
CLOB                  | 2 GB
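As a quick illustration (a sketch; the table names are hypothetical), a CLOB column is accepted in a created global temporary table but is rejected in a Hadoop table:

/* Accepted: a CLOB column in a created global temporary table */
CREATE GLOBAL TEMPORARY TABLE TEMP.CLOB_DEMO (DOC CLOB(2G)) ON COMMIT DELETE ROWS;

/* Rejected: a Hadoop table cannot declare a BLOB or CLOB column;
   VARCHAR(32672) is the largest character type available there */
-- CREATE HADOOP TABLE CLOB_DEMO_HADOOP (DOC CLOB);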

Returning JSON Data In Relational Format Using the SQL Function JSON_TABLE

The example above simply shows how to perform a REST API call using SQL and have its results returned. Next, let's perform that same query, but this time use the SQL JSON function JSON_TABLE to return the following fields from the returned JSON document:

  • Application id
  • Username
  • Application name
  • Queue that application was submitted to
  • The final status of the application
  • Time that the application started
  • Time that the application finished
  • Type of application (MapReduce, Spark, etc.)

Remember: while the examples are fully functional, only limited explanations of the JSON functions are provided. Please refer to the Big SQL Knowledge Center for more information.

To assist with readability, a global variable named HDFS_HOSTNAME is defined to hold the name node hostname:

CREATE OR REPLACE VARIABLE HDFS_HOSTNAME VARCHAR(254)

Below is a query that performs the same REST API call as above. The global variable HDFS_HOSTNAME, once set, is resolved as part of invoking the DB2XML.HTTPGETCLOB UDF:

SET HDFS_HOSTNAME = 'oceania.machine.com';

WITH APPLICATION_STATISTICS (DOCUMENT_DATA)
AS
(
VALUES DB2XML.HTTPGETCLOB('http://'||HDFS_HOSTNAME||':8088/ws/v1/cluster/apps/application_1544877054077_0008','')
)
SELECT ID, USERNAME, NAME, QUEUE, STATE, STATUS,
TIMESTAMP_FORMAT(hive.from_unixtime(startedTime/1000),'YYYY-MM-DD HH24:MI:SS',0) AS STARTTIME,
TIMESTAMP_FORMAT(hive.from_unixtime(finishedTime/1000),'YYYY-MM-DD HH24:MI:SS',0) AS FINISHTIME,
APP_TYPE
FROM APPLICATION_STATISTICS,
JSON_TABLE(DOCUMENT_DATA, 'strict $'
COLUMNS (
id VARCHAR(34) FORMAT JSON PATH '$.app.id',
username VARCHAR(10) FORMAT JSON PATH '$.app.user',
name VARCHAR(90) FORMAT JSON PATH '$.app.name',
queue VARCHAR(10) FORMAT JSON PATH '$.app.queue',
state VARCHAR(10) FORMAT JSON PATH '$.app.state',
status VARCHAR(12) FORMAT JSON PATH '$.app.finalStatus',
startedTime VARCHAR(15) FORMAT JSON PATH '$.app.startedTime',
finishedTime VARCHAR(15) FORMAT JSON PATH '$.app.finishedTime',
app_Type VARCHAR(12) FORMAT JSON PATH '$.app.applicationType'
)
ERROR ON ERROR) AS APPLICATION_STATISTICS_REST_API_QUERY

 

Below is the output of the query:

 

ID                    | USERNAME | NAME                                 | QUEUE       | STATE      | STATUS      | STARTTIME           | FINISHTIME          | APP_TYPE
"application_1544877" | "nevada" | "LOAD HADOOP...'/json_grocery_data'" | "westcoast" | "FINISHED" | "SUCCEEDED" | 2018-12-18 15:35:17 | 2018-12-18 15:45:34 | "MAPREDUCE"

 

When the JSON document has most of its data at the root ($) level, another form of query syntax can be considered:

SET HDFS_HOSTNAME = 'oceania.machine.com';

WITH APPLICATION_STATISTICS (DOCUMENT_DATA)
AS
(
VALUES DB2XML.HTTPGETCLOB('http://'||HDFS_HOSTNAME||':8088/ws/v1/cluster/apps/application_1565525816716_0006','')
)
SELECT APPLICATION_STATISTICS_REST_API_QUERY.*
FROM APPLICATION_STATISTICS,
JSON_TABLE(DOCUMENT_DATA, 'strict $'
COLUMNS (
"app.id" VARCHAR(34),
"app.user" VARCHAR(10),
"app.name" VARCHAR(90),
"app.queue" VARCHAR(10),
"app.state" VARCHAR(10),
"app.finalStatus" VARCHAR(12),
"app.startedTime" VARCHAR(15),
"app.finishedTime" VARCHAR(15),
"app.applicationType" VARCHAR(12)
) ERROR ON ERROR) AS APPLICATION_STATISTICS_REST_API_QUERY;

Omitting Surrounding Quotes from JSON Data

As stated earlier, there are many options that can be specified when returning JSON values. For example, in our original query all character fields were returned surrounded by double quotes. For the rest of the examples in this article, quotes will be omitted. Here is the same query, only this time with the OMIT QUOTES clause specified:

WITH APPLICATION_STATISTICS (DOCUMENT_DATA)
AS
(
VALUES DB2XML.HTTPGETCLOB('http://'||HDFS_HOSTNAME||':8088/ws/v1/cluster/apps/application_1544877054077_0008','')
)
SELECT ID, USERNAME, NAME, QUEUE, STATE, STATUS,
TIMESTAMP_FORMAT(hive.from_unixtime(startedTime/1000),'YYYY-MM-DD HH24:MI:SS',0) AS STARTTIME,
TIMESTAMP_FORMAT(hive.from_unixtime(finishedTime/1000),'YYYY-MM-DD HH24:MI:SS',0) AS FINISHTIME,
APP_TYPE
FROM APPLICATION_STATISTICS,
JSON_TABLE(DOCUMENT_DATA, 'strict $'
COLUMNS (
id VARCHAR(34) FORMAT JSON PATH '$.app.id' OMIT QUOTES,
username VARCHAR(10) FORMAT JSON PATH '$.app.user' OMIT QUOTES,
name VARCHAR(90) FORMAT JSON PATH '$.app.name' OMIT QUOTES,
queue VARCHAR(10) FORMAT JSON PATH '$.app.queue' OMIT QUOTES,
state VARCHAR(10) FORMAT JSON PATH '$.app.state' OMIT QUOTES,
status VARCHAR(12) FORMAT JSON PATH '$.app.finalStatus' OMIT QUOTES,
startedTime VARCHAR(15) FORMAT JSON PATH '$.app.startedTime' OMIT QUOTES,
finishedTime VARCHAR(15) FORMAT JSON PATH '$.app.finishedTime' OMIT QUOTES,
app_Type VARCHAR(12) FORMAT JSON PATH '$.app.applicationType' OMIT QUOTES
)
ERROR ON ERROR) AS APPLICATION_STATISTICS_REST_API_QUERY

Below is the output of the query:

 

ID                  | USERNAME | NAME                               | QUEUE     | STATE    | STATUS    | STARTTIME           | FINISHTIME          | APP_TYPE
application_1544877 | nevada   | LOAD HADOOP...'/json_grocery_data' | westcoast | FINISHED | SUCCEEDED | 2018-12-18 15:35:17 | 2018-12-18 15:45:34 | MAPREDUCE

It should be noted that there is no requirement that the OMIT QUOTES clause be specified for every field that is returned, as the sketch below shows.
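Here is a minimal, self-contained sketch (using an inline JSON literal rather than a REST call) in which quotes are omitted for only one of the two columns:

SELECT ID, NAME
FROM JSON_TABLE('{"app":{"id":"app_1","name":"demo"}}', 'strict $'
COLUMNS (
ID VARCHAR(34) FORMAT JSON PATH '$.app.id' OMIT QUOTES,   -- returns app_1
NAME VARCHAR(90) FORMAT JSON PATH '$.app.name'            -- returns "demo" (quotes retained)
) ERROR ON ERROR) AS T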

Returning and Querying an Array Containing JSON Data

Instead of issuing a REST API call that returns data about a single application, let's return data about all applications. When data about multiple applications is returned, the Resource Manager REST API returns it as a JSON array. While the JSON_TABLE specification provides support for handling arrays and nested arrays, the Big SQL implementation does not provide that syntax. To parse the nested array that is returned, the following SQL statements and functionality are used:

  • CREATE PROCEDURE
  • DECLARE statement for variables
  • DECLARE CURSOR WITH RETURN which allows results sets to be returned to the caller or to a client application
  • IF … THEN
  • WHILE … DO
  • SIGNAL
  • CREATE TYPE
  • CREATE VARIABLE
  • CREATE VIEW
  • ROW types
  • ARRAY types
  • The JSON_LEN function to determine the size of a JSON array
  • The JSON_TABLE function to convert the data into relational format
  • Global Temporary Tables

Returning JSON Data Result Sets Using a Stored Procedure

Below are the steps used to develop the stored procedures and SQL table functions that return JSON and XML data stored in HDFS:

Create a GLOBAL TEMPORARY TABLE, referenced later as TEMP.HDFS_DATA, to store each JSON record from the returned array individually:

CREATE GLOBAL TEMPORARY TABLE TEMP.HDFS_DATA
(FILENAME VARCHAR (64) NOT NULL WITH DEFAULT,
INSERT_TIME TIMESTAMP NOT NULL WITH DEFAULT,
UNIQUE_IDENTIFIER BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1),
HDFS_DATA CLOB
)
DISTRIBUTE BY (UNIQUE_IDENTIFIER)
ON COMMIT DELETE ROWS

 

If you are not familiar with created global temporary tables (CGTTs), here are some of their key benefits (a brief usage sketch follows the list):

  • The global temporary table can be defined up front and used just like a regular table by any session, with each session having its own copy
  • Any data inserted, updated, or deleted is local to each session, meaning the same table can be referenced by different users and each user sees only the data from their own session
  • They can be defined with no logging (default behavior)
  • No row locking is ever required
  • They can be referenced in SQL statements, functions, triggers, indexes and views
  • Supports identity columns
  • Supports NOT NULL and NOT NULL WITH DEFAULT
  • Can be hash partitioned (CREATE …. DISTRIBUTE BY <partitioning key> )
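For example (a sketch, assuming the table created above), rows inserted into the temporary table are visible only to the session that inserted them:

/* Each session sees only its own rows */
INSERT INTO TEMP.HDFS_DATA (FILENAME, HDFS_DATA) VALUES ('example.json', '{}');
SELECT COUNT(*) FROM TEMP.HDFS_DATA;   -- counts this session's rows; other sessions see 0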

Below is a description of the Global Temporary Table created:

Column Name       | Description
FILENAME          | Name of the file that contains the JSON data
INSERT_TIME       | Date and time that the data was inserted
UNIQUE_IDENTIFIER | Identity column value used to guarantee that each row can be uniquely identified if required
HDFS_DATA         | The data in the file

 

Define a few views and data types to make the stored procedure logic easier to read. (The variable names may contain 'JSON', but that has no effect on the logic.)

/* Create a global variable called 'HDFS_HOSTNAME' so that the hostname can easily be changed to match the HDFS namenode: */

CREATE OR REPLACE VARIABLE HDFS_HOSTNAME VARCHAR(254);

/* Create a view to return the results of the REST API call: */

CREATE OR REPLACE VIEW VCLOB(REST_API_RESULTS)
AS
(
SELECT DB2XML.HTTPGETCLOB('http://'||HDFS_HOSTNAME||':8088/ws/v1/cluster/apps','') FROM DUAL
)

/* Create a Global Variable to store the results returned from the view VCLOB: */

CREATE OR REPLACE VARIABLE V_JSON CLOB (5M)

/* Create an array type (of VARCHAR elements) for storing each JSON record individually: */

CREATE OR REPLACE TYPE JSON_ARRAY AS VARCHAR (32000) ARRAY[1000]

CREATE OR REPLACE VARIABLE JSON_RESULTS_ARRAY JSON_ARRAY

/* Create a view that 'explodes' (unnests) the JSON_RESULTS_ARRAY into individual rows using the UNNEST function: */

CREATE OR REPLACE VIEW V_DOCUMENT_DATA (JSON_FILE)
AS
(
SELECT JSON_FILE FROM UNNEST(JSON_RESULTS_ARRAY) AS T(JSON_FILE)
)
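As a quick check (a sketch; the literal array values are illustrative), populating the array variable and selecting from the view returns one row per array element:

SET JSON_RESULTS_ARRAY = ARRAY['{"a":1}', '{"a":2}'];
SELECT * FROM V_DOCUMENT_DATA;   -- two rows, one per JSON record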

 

Create the stored procedure, specifying the RESULT SETS clause and WITH RETURN so that the stored procedure can return data:

CREATE OR REPLACE PROCEDURE READ_REST_API_STORED_PROC
RESULT SETS 1
LANGUAGE SQL
BEGIN
DECLARE SQLCODE INTEGER;
DECLARE SQLSTATE CHAR(5);
DECLARE v_num_json_files INT DEFAULT 1;
DECLARE v_counter INT DEFAULT 1;
DECLARE V_JSON CLOB (5M);
/* Declare a cursor using the WITH RETURN clause so that the results will be returned when the stored procedure is invoked */
DECLARE REST_API_RESULTS CURSOR WITH RETURN FOR
SELECT ID, USERNAME, NAME, QUEUE, STATE, STATUS,
TIMESTAMP_FORMAT(hive.from_unixtime(startedTime/1000),'YYYY-MM-DD HH24:MI:SS',0) AS STARTTIME,
TIMESTAMP_FORMAT(hive.from_unixtime(finishedTime/1000),'YYYY-MM-DD HH24:MI:SS',0) AS FINISHTIME,
APP_TYPE
FROM TEMP.HDFS_DATA, JSON_TABLE(HDFS_DATA,'strict $'
COLUMNS (
id VARCHAR(34) FORMAT JSON PATH '$.id' OMIT QUOTES,
username VARCHAR(10) FORMAT JSON PATH '$.user' OMIT QUOTES,
name VARCHAR(90) FORMAT JSON PATH '$.name' OMIT QUOTES,
queue VARCHAR(10) FORMAT JSON PATH '$.queue' OMIT QUOTES,
state VARCHAR(10) FORMAT JSON PATH '$.state' OMIT QUOTES,
status VARCHAR(12) FORMAT JSON PATH '$.finalStatus' OMIT QUOTES,
startedTime VARCHAR(15) FORMAT JSON PATH '$.startedTime' OMIT QUOTES,
finishedTime VARCHAR(15) FORMAT JSON PATH '$.finishedTime' OMIT QUOTES,
app_Type VARCHAR(12) FORMAT JSON PATH '$.applicationType' OMIT QUOTES
) ERROR ON ERROR) AS CUR_API_QUERY;

/* Retrieve all JSON records returned from the REST API call */
SET V_JSON = (SELECT REST_API_RESULTS FROM VCLOB);

/* Determine the length of the JSON array in the results */
SET V_NUM_JSON_FILES = (SELECT SYSTOOLS.JSON_LEN(JSON_TO_BSON(V_JSON),'apps.app') FROM DUAL);

/* Assign each individual JSON record to an array location.
   The SQL/JSON path array index is 0-based while Db2 arrays are 1-based,
   so the path index is v_counter - 1 */
WHILE v_counter <= v_num_json_files
DO
SET JSON_RESULTS_ARRAY[v_counter] = JSON_QUERY(v_json,'$.apps.app['||(v_counter - 1)||']');

SET v_counter = v_counter + 1;

END WHILE;

/* Insert the results into TEMP.HDFS_DATA by 'exploding' (unnesting) the array */

INSERT INTO TEMP.HDFS_DATA (HDFS_DATA) SELECT JSON_RECORD FROM UNNEST(JSON_RESULTS_ARRAY) AS T(JSON_RECORD);

/* Return the results in relational format using the cursor declared earlier */
OPEN REST_API_RESULTS;

END @

 

 

Call the stored procedure. The results are returned in relational format:

CALL READ_REST_API_STORED_PROC()
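For example, from the Db2 command line processor (assuming HDFS_HOSTNAME has been set in the same session):

db2 "SET HDFS_HOSTNAME = 'oceania.machine.com'"
db2 "CALL READ_REST_API_STORED_PROC()"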

 

Returning JSON (or other types of data) Data Using a SQL Table Function

In the previous example, the JSON result set was returned using a stored procedure that encapsulated the SQL required to parse the JSON data and return it in relational format. But suppose we wanted a more generic way to query the JSON data, where the data is returned no differently than any other column data, and the SQL JSON functions are then applied to convert the data into relational format. Creating an SQL table function provides this capability. The steps are as follows:

Create the definitions used for returning the JSON data and the number of elements in the array:

CREATE OR REPLACE VARIABLE V_JSON_VARIABLE CLOB (5M) DEFAULT ((SELECT DB2XML.HTTPGETCLOB('http://'||HDFS_HOSTNAME||':8088/ws/v1/cluster/apps','') FROM DUAL));
CREATE OR REPLACE VARIABLE V_NUM_JSON_RECORDS INT DEFAULT ((SELECT SYSTOOLS.JSON_LEN(JSON_TO_BSON(V_JSON_VARIABLE),'apps.app') FROM DUAL));

Next, define a ROW type for storing each JSON record individually:

CREATE TYPE JSON_ROW AS ROW (DOCUMENT_DATA CLOB (5M))

CREATE TYPE JSON_ROWS AS JSON_ROW ARRAY[]

Execute the following (compiled) CREATE FUNCTION statement, which returns each JSON record as an array entry:

CREATE OR REPLACE FUNCTION READ_REST_API_ALL_APPS ()
RETURNS JSON_ROWS
LANGUAGE SQL
NO EXTERNAL ACTION
SPECIFIC READ_JSON_ALL_APPS_USING_REST_API
BEGIN
DECLARE v_counter INT DEFAULT 1;
DECLARE V_JSON CLOB (5M);
DECLARE JSON_RESULTS JSON_ROWS;

/* Copy the result set to the variable V_JSON */
SET V_JSON = v_json_variable;

/* Assign each individual JSON record to an array location
   (the 0-based SQL/JSON path index is v_counter - 1) */
WHILE v_counter <= v_num_json_records
DO
SET JSON_RESULTS[v_counter] = JSON_QUERY(v_json,'$.apps.app['||(v_counter - 1)||']');
SET v_counter = v_counter + 1;
END WHILE;
/* Return the results */
RETURN (JSON_RESULTS);
END @

 

Below is a query calling the table function READ_REST_API_ALL_APPS and referencing the JSON output:

/* Remember to set HDFS_HOSTNAME */

WITH DOCUMENT_DATA_CTE (DOCUMENT_DATA)
AS
(
SELECT * FROM TABLE (READ_REST_API_ALL_APPS()) AS T
)

SELECT ID, USERNAME, NAME, QUEUE, STATE, STATUS,
TIMESTAMP_FORMAT(hive.from_unixtime(startedTime/1000),'YYYY-MM-DD HH24:MI:SS',0) AS STARTTIME,
TIMESTAMP_FORMAT(hive.from_unixtime(finishedTime/1000),'YYYY-MM-DD HH24:MI:SS',0) AS FINISHTIME,
APP_TYPE
FROM DOCUMENT_DATA_CTE, JSON_TABLE(DOCUMENT_DATA,'strict $'
COLUMNS (id VARCHAR(34) FORMAT JSON PATH '$.id' OMIT QUOTES,
username VARCHAR(10) FORMAT JSON PATH '$.user' OMIT QUOTES,
name VARCHAR(90) FORMAT JSON PATH '$.name' OMIT QUOTES,
queue VARCHAR(10) FORMAT JSON PATH '$.queue' OMIT QUOTES,
state VARCHAR(10) FORMAT JSON PATH '$.state' OMIT QUOTES,
status VARCHAR(12) FORMAT JSON PATH '$.finalStatus' OMIT QUOTES,
startedTime VARCHAR(15) FORMAT JSON PATH '$.startedTime' OMIT QUOTES,
finishedTime VARCHAR(15) FORMAT JSON PATH '$.finishedTime' OMIT QUOTES,
app_Type VARCHAR(12) FORMAT JSON PATH '$.applicationType' OMIT QUOTES
) ERROR ON ERROR
) AS JSON_OUTPUT

 

A Few Key Points About SQL Functions

  • There are two types of SQL functions: inline (BEGIN ATOMIC ... END) and compiled (BEGIN ... END). The function READ_REST_API_ALL_APPS created above is a compiled function.
  • Compiled compound statements are also referred to as anonymous blocks
  • With compiled functions, a scalar function that returns an ARRAY of ROW values can be referenced in the FROM clause
  • When using compiled table functions the results are not pipelined, meaning the entire result is materialized before any rows are returned rather than rows being returned one at a time as requested; a minimal sketch of the two function styles follows
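Below is a minimal, hypothetical pair showing the two styles (the function names and logic are illustrative only):

/* Inline SQL function: BEGIN ATOMIC ... END */
CREATE OR REPLACE FUNCTION ADD_ONE_INLINE (N INT)
RETURNS INT
LANGUAGE SQL
BEGIN ATOMIC
RETURN N + 1;
END@

/* Compiled SQL function: BEGIN ... END, which permits logic such as WHILE loops */
CREATE OR REPLACE FUNCTION ADD_ONE_COMPILED (N INT)
RETURNS INT
LANGUAGE SQL
BEGIN
DECLARE RESULT INT;
SET RESULT = N + 1;
RETURN RESULT;
END@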

Accessing XML, JSON (or any other type of) Data at Rest Using WebHDFS

Suppose you want to retrieve data from a filesystem and perform “schema on read” as part of executing a SQL query that converts the XML or JSON data into relational format. Using the WebHDFS REST API, this is possible. To assist with retrieving data from a location supported by the WebHDFS REST API, the following functions are created:

  • A function that will list the filenames in an HDFS directory (or any URI supported by WebHDFS)
  • A function that then returns the contents of each file as a separate row

Once each file becomes an individual row, you can apply any and all SQL functionality against it, in this case the SQL JSON functions.

Define a few data types for handling the filenames returned:

/* Create a datatype of type ROW (a datatype that returns a row): */

CREATE OR REPLACE TYPE FILENAME_ROW AS ROW (FILENAME VARCHAR(60), MODIFICATIONTIME DATE)

/* Create a datatype called FILENAMES that is an array of rows: */

CREATE OR REPLACE TYPE FILENAMES AS FILENAME_ROW ARRAY[]

The global temporary table defined earlier is used here as well. Its definition is repeated below, extended with a MODIFICATIONTIME column that the READ_FILES function populates and returns:

CREATE GLOBAL TEMPORARY TABLE TEMP.HDFS_DATA
(
FILENAME VARCHAR (64) NOT NULL WITH DEFAULT,
MODIFICATIONTIME DATE,
INSERT_TIME TIMESTAMP NOT NULL WITH DEFAULT,
UNIQUE_IDENTIFIER BIGINT GENERATED ALWAYS AS IDENTITY (START WITH 1, INCREMENT BY 1),
HDFS_DATA CLOB
)
DISTRIBUTE BY (UNIQUE_IDENTIFIER)
ON COMMIT DELETE ROWS

 

Create an SQL function that lists the filenames in a directory using the WebHDFS REST API:

CREATE OR REPLACE FUNCTION LIST_FILENAMES
(IN DIRECTORY_PATH VARCHAR(500), IN START_DATE DATE DEFAULT CURRENT DATE, IN END_DATE DATE DEFAULT CURRENT DATE)
RETURNS FILENAMES
LANGUAGE SQL
READS SQL DATA
NO EXTERNAL ACTION
DETERMINISTIC
SPECIFIC LIST_FILENAMES_BY_DATE
BEGIN
DECLARE SQLSTATE CHAR(5) DEFAULT '00000';
DECLARE code, SQLCODE INTEGER DEFAULT 0;
DECLARE error_label CHAR(50) DEFAULT '';
DECLARE invalid_date_range CONDITION FOR SQLSTATE '70001';
DECLARE v_number_of_files INT DEFAULT 1;
DECLARE v_counter INT DEFAULT 0;
DECLARE results_counter INT DEFAULT 1;
DECLARE results FILENAMES;
DECLARE V_DATA CLOB (5M);
DECLARE HDFS_PATH VARCHAR(500);
DECLARE EPOCH_IN_DATE_FORMAT DATE;
DECLARE V_MODIFICATION_TIME VARCHAR(64);

/* Validate that START_DATE and END_DATE form a valid range */
IF END_DATE < START_DATE
THEN
SIGNAL invalid_date_range SET MESSAGE_TEXT = 'Date range is not valid';
END IF;

SET HDFS_PATH = DIRECTORY_PATH;

/* Retrieve the file names in the directory */
SET V_DATA = (VALUES DB2XML.HTTPGETCLOB('http://'||HDFS_HOSTNAME||':50070/webhdfs/v1'||hdfs_path||'/?op=LISTSTATUS',''));

/* Count the number of files in the directory */
SET V_NUMBER_OF_FILES = (VALUES (SYSTOOLS.JSON_LEN(JSON_TO_BSON(V_DATA),'FileStatuses.FileStatus')));

/* Assign each qualifying row whose modification time (date) falls within the
   requested range to the results array (the JSON path index v_counter is 0-based) */

WHILE v_counter < v_number_of_files DO

SET V_MODIFICATION_TIME = (VALUES json_query(V_DATA,'$.FileStatuses.FileStatus['||v_counter||'].modificationTime'));

SET EPOCH_IN_DATE_FORMAT = (VALUES (DATE((hive.from_unixtime(V_MODIFICATION_TIME/1000)))));

IF (EPOCH_IN_DATE_FORMAT >= START_DATE AND EPOCH_IN_DATE_FORMAT <= END_DATE)

THEN /* The row qualifies, so store it in the results array */

SET results[results_counter] = (json_query(V_DATA,'$.FileStatuses.FileStatus['||v_counter||'].pathSuffix' OMIT QUOTES), EPOCH_IN_DATE_FORMAT);
SET v_counter = v_counter + 1;

SET results_counter = results_counter + 1;

ELSE /* Move on to the next row and evaluate */

SET v_counter = v_counter + 1;

END IF;

END WHILE;

/* Return only the file names that matched */

RETURN RESULTS;
END@
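Here is a usage sketch (the directory path is borrowed from a later example; HDFS_HOSTNAME must already be set):

SET HDFS_HOSTNAME = 'oceania.machine.com';
SELECT FILENAME, MODIFICATIONTIME
FROM TABLE (LIST_FILENAMES('/bank_derivatives/jsondata', CURRENT DATE - 30 DAYS, CURRENT DATE)) AS T;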

 

Create an SQL function that uses the WebHDFS REST API to return the name of each file, its modification time, and the contents of the file.

The function accepts the following inputs:

  • The directory to access (mandatory)
  • A date range, to assist with limiting the number of files returned (if no date range is specified, the current date is used; see the logic in the LIST_FILENAMES function)

CREATE OR REPLACE FUNCTION READ_FILES (IN PATH VARCHAR(500), START_DATE DATE, END_DATE DATE DEFAULT CURRENT DATE)
RETURNS TABLE (FILENAME VARCHAR(128), MODIFICATIONTIME DATE, HDFS_DATA CLOB)
LANGUAGE SQL
MODIFIES SQL DATA
NO EXTERNAL ACTION
SPECIFIC READ_FILES_DATE
BEGIN ATOMIC
DECLARE V_FILENAME VARCHAR(128);
DECLARE V_MODIFICATIONTIME VARCHAR(32);
DECLARE FILE_CONTENTS CLOB;

/* For each filename returned, retrieve its JSON data and insert it into the temporary table TEMP.HDFS_DATA */
FOR ROW AS SELECT FILENAME, MODIFICATIONTIME FROM TABLE (LIST_FILENAMES(PATH, START_DATE, END_DATE)) AS T

DO
SET V_FILENAME = ROW.FILENAME;
SET V_MODIFICATIONTIME = ROW.MODIFICATIONTIME;

/* Retrieve the JSON data */
SET FILE_CONTENTS = (VALUES DB2XML.HTTPGETCLOB('http://'||HDFS_HOSTNAME||':50070/webhdfs/v1'||PATH||'/'||v_filename||'?op=OPEN',''));

/* Insert the file into the global temporary table TEMP.HDFS_DATA */
INSERT INTO TEMP.HDFS_DATA (FILENAME, MODIFICATIONTIME, HDFS_DATA) VALUES (V_FILENAME, V_MODIFICATIONTIME, FILE_CONTENTS);

END FOR;

/* Return the contents stored in TEMP.HDFS_DATA */

RETURN (SELECT FILENAME, MODIFICATIONTIME, HDFS_DATA FROM TEMP.HDFS_DATA);

END @

 

Below is a query that calls the READ_FILES table function, passing in the file directory path and the earliest modification date that a file can have in order to be returned:

WITH CTE (FILENAME, MODIFICATIONTIME, DOCUMENT_DATA)
AS
(
SELECT * FROM TABLE(READ_FILES(PATH => '/bank_derivatives/jsondata', START_DATE => (CURRENT DATE - 30 DAYS))) AS T
)
SELECT FILENAME, MODIFICATIONTIME, BANK_DERIVATIVES.*
FROM CTE,
JSON_TABLE(DOCUMENT_DATA, 'strict $' COLUMNS
(
base VARCHAR(14) FORMAT JSON PATH '$.base' OMIT QUOTES,
"date" VARCHAR(12) FORMAT JSON PATH '$.date' OMIT QUOTES,
rates VARCHAR(192) FORMAT JSON PATH '$.rates' OMIT QUOTES
) ERROR ON ERROR) AS BANK_DERIVATIVES

Accessing XML Data in Hadoop

The same approach used for accessing JSON data also applies to XML data. Using the SQL table functions already created, the following query returns XML data in relational format. Notice that the built-in functions THIS_MONTH and NEXT_MONTH are used to define the date range of files to retrieve:

WITH CTE (FILENAME, MODIFICATIONTIME, HDFS_DATA)
AS
(
SELECT FILENAME, MODIFICATIONTIME, XMLPARSE (DOCUMENT HDFS_DATA)
FROM TABLE (READ_FILES (PATH => '/insurance/claim_2018/data', START_DATE => THIS_MONTH(CURRENT DATE), END_DATE => NEXT_MONTH(CURRENT DATE))) AS T
)
SELECT YYZ."PO ID", YYZ."Part #", YYZ."Product Name", YYZ."Quantity", YYZ."Price", YYZ."Order Date"
FROM CTE P, XMLTABLE('$po/PurchaseOrder/item' PASSING P.HDFS_DATA AS "po"
COLUMNS "PO ID"       INTEGER       PATH '../@PoNum',
"Part #"        CHAR(10)      PATH 'partid',
"Product Name"  VARCHAR(50)   PATH 'name',
"Quantity"      INTEGER       PATH 'quantity',
"Price"         DECIMAL(9,2)  PATH 'price',
"Order Date"    DATE          PATH '../@OrderDate') AS YYZ

Using CREATE TABLE AS (CTAS) and CREATE VIEW

It is also possible to include the Big SQL JSON functions in CREATE TABLE AS (CTAS) and CREATE VIEW statements. Note that only the PARQUET and ORC storage formats should be used, because JSON and XML data often contain multiple newline characters which, if stored in a table defined using 'STORED AS TEXTFILE', may interfere with identifying the end of each row:

CREATE TABLE example:

CREATE HADOOP TABLE JSON.DATA
STORED AS PARQUET
AS
(
SELECT id, username, name, queue, state, finalStatus,
TIMESTAMP_FORMAT(hive.from_unixtime(CAST(startedTime/1000 AS BIGINT)),'YYYY-MM-DD HH24:MI:SS',0) AS StartTime,
TIMESTAMP_FORMAT(hive.from_unixtime(CAST(finishedTime/1000 AS BIGINT)),'YYYY-MM-DD HH24:MI:SS',0) AS FinishTime,
applicationType
FROM TABLE (READ_REST_API_ALL_APPS()) AS T,
JSON_TABLE (DOCUMENT_DATA,'strict $'
COLUMNS (
id VARCHAR(34) FORMAT JSON PATH '$.id',
username VARCHAR(10) FORMAT JSON PATH '$.user',
name VARCHAR(90) FORMAT JSON PATH '$.name',
queue VARCHAR(10) FORMAT JSON PATH '$.queue',
state VARCHAR(10) FORMAT JSON PATH '$.state',
finalStatus VARCHAR(12) FORMAT JSON PATH '$.finalStatus',
startedTime VARCHAR(15) FORMAT JSON PATH '$.startedTime',
finishedTime VARCHAR(15) FORMAT JSON PATH '$.finishedTime',
applicationType VARCHAR(12) FORMAT JSON PATH '$.applicationType'
)
ERROR ON ERROR) AS JUST_ANOTHER_QUERY
)

CREATE VIEW example:

CREATE VIEW VIEW.JSON AS
WITH DOCUMENT_DATA_CTE (DOCUMENT_DATA)
AS
(
SELECT * FROM TABLE (READ_REST_API_ALL_APPS()) AS T
)
SELECT ID, USERNAME, NAME, QUEUE, STATE, STATUS,
TIMESTAMP_FORMAT(hive.from_unixtime(startedTime/1000),'YYYY-MM-DD HH24:MI:SS',0) AS STARTTIME,
TIMESTAMP_FORMAT(hive.from_unixtime(finishedTime/1000),'YYYY-MM-DD HH24:MI:SS',0) AS FINISHTIME,
APP_TYPE
FROM
DOCUMENT_DATA_CTE, JSON_TABLE(DOCUMENT_DATA,'strict $' COLUMNS (
id VARCHAR(34) FORMAT JSON PATH '$.id' OMIT QUOTES,
username VARCHAR(10) FORMAT JSON PATH '$.user' OMIT QUOTES,
name VARCHAR(90) FORMAT JSON PATH '$.name' OMIT QUOTES,
queue VARCHAR(10) FORMAT JSON PATH '$.queue' OMIT QUOTES,
state VARCHAR(10) FORMAT JSON PATH '$.state' OMIT QUOTES,
status VARCHAR(12) FORMAT JSON PATH '$.finalStatus' OMIT QUOTES,
startedTime VARCHAR(15) FORMAT JSON PATH '$.startedTime' OMIT QUOTES,
finishedTime VARCHAR(15) FORMAT JSON PATH '$.finishedTime' OMIT QUOTES,
app_Type VARCHAR(12) FORMAT JSON PATH '$.applicationType' OMIT QUOTES
)
ERROR ON ERROR) AS JSON_OUTPUT
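Once created, the view hides the REST and JSON handling entirely; querying it is no different from querying any other view (a sketch, with an illustrative predicate):

SELECT ID, STARTTIME, FINISHTIME FROM VIEW.JSON WHERE STATE = 'FINISHED';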

 

Performing Temporal Queries Using the Overlaps Predicate Against YARN Resource Manager JSON Data

It is very common to want to understand why a job executing on YARN does not run with a consistent or predictable elapsed time, or perhaps simply to find out what other jobs were executing during a particular time period. For example, suppose you had a job that started at 2:00 pm and completed at 4:00 pm, and you wanted to know what other jobs were executing during that time period. You might construct a query with the following WHERE clause:

SELECT ID, USERNAME, NAME, QUEUE, STATE, STATUS ...
FROM TAB, JSON_TABLE (....)
WHERE TIME(STARTEDTIME) BETWEEN TIME('14:00:00') AND TIME('16:00:00')
AND TIME(FINISHEDTIME) <= TIME('16:00:00')

The above query will not provide an accurate account of all activities that may have had an impact on the elapsed time of your job:

  • It will not return any job that started before 2 pm and completed by 4 pm
  • It will not return any job that started before 2 pm and completed after 4 pm
  • It will not return any job that started after 2 pm (but before 4 pm) and completed after 4 pm

Rewriting the query to use the OVERLAPS predicate identifies all jobs that may have had an impact on a job’s elapsed execution time:

SELECT ID, USERNAME, NAME, QUEUE, STATE, STATUS ...
FROM TAB, JSON_TABLE (....)
WHERE (TIME('14:00:00'), TIME('16:00:00')) OVERLAPS (TIME(STARTEDTIME), TIME(FINISHEDTIME))

To read more about the use of the OVERLAPS predicate you can refer to the Knowledge Center:  https://www.ibm.com/support/knowledgecenter/en/SSCRJT_6.0.0/com.ibm.swg.im.bigsql.commsql.doc/doc/r_overlaps_predicate.html
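As a quick standalone check of the semantics (a sketch with literal times), the following returns 'overlaps' because a job that ran from 13:00 to 15:00 intersects the 14:00 to 16:00 window even though it started before the window opened:

VALUES CASE WHEN (TIME('14:00:00'), TIME('16:00:00')) OVERLAPS (TIME('13:00:00'), TIME('15:00:00'))
            THEN 'overlaps' ELSE 'no overlap' END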

Querying JSON Records That Have Different Field Names for the Same Data

Let's demonstrate one more benefit of using the SQL JSON functions: specifying a default value when a field name yields no result.

In the world of mergers and acquisitions, it is reasonable to expect that each corporation's data does not share an identical JSON schema, or, for that matter, the same JSON field names. Let's assume that grocery chain one has merged with grocery chain two, and that the following JSON fields, while storing data with the same type and meaning, use different field names:

Description of JSON Field | Field Name from Grocery Chain One | Field Name from Grocery Chain Two
Date of the transaction | transaction_date | trans_date
Store number at which the transaction was completed | storeno | store_number
Merchant code used at the point of sale (pharmacy, deli, carpet cleaner, etc.) | merchant_code | Field name does not exist in the record
Was the transaction completed at a self-checkout kiosk | self_checkout | selfcheckout

 

Below are examples of a JSON record from each grocery chain:

 

Grocery Chain ONE:

{......
"storeno":"ONE"
,"transaction_date": "2017-07-18"
,"merchant_code" : 2
,"self_checkout" : "Y"
}

Grocery Chain TWO:

{............
"selfcheckout" : "N"
,"store_number":"TWO"
,"trans_date": "2019-04-28"
}

 

Let's look at the field names "self_checkout" and "selfcheckout". While the corresponding values are present for both grocery chains, the field names are different. This means that if a query specifies PATH '$.self_checkout', a value will be returned when reading a JSON record from grocery chain one, but a null will be returned when reading a record from grocery chain two (because that field name does not exist in the JSON record). Also, in the case of the field named 'merchant_code', all rows from grocery chain two will return null (again because the field does not exist).

When a field name is missing in a JSON record, the options available are to return an error, a null, or a default value. The ability to return a default value instead of a null means that the DEFAULT ... ON EMPTY clause can be used to return the value of another JSON field. This allows a type of 'IF THEN ELSE' logic: if a field name from grocery chain one yields no result, then return the value of the corresponding field name from grocery chain two. Below is an example:

WITH DOCUMENT_DATA_CTE (JSON_GROCERY_DATA) AS
(
SELECT HDFS_DATA FROM TABLE (read_files(PATH => '/grocery/json_records', START_DATE => (CURRENT DATE - 730 DAYS))) AS T
)
SELECT STORE_NUMBER, TRANSACTIONDATE, SELF_CHECKOUT, MERCHANT_CODE
FROM DOCUMENT_DATA_CTE,
JSON_TABLE (JSON_GROCERY_DATA, 'strict $' COLUMNS
(
store_number VARCHAR(14) PATH '$.storeno' DEFAULT json_value(JSON_GROCERY_DATA,'$.store_number') ON EMPTY,
transactiondate VARCHAR(11) PATH '$.transaction_date' DEFAULT json_value(JSON_GROCERY_DATA,'$.trans_date') ON EMPTY,
self_checkout VARCHAR(1) PATH '$.self_checkout' DEFAULT json_value(JSON_GROCERY_DATA,'$.selfcheckout') ON EMPTY,
merchant_code VARCHAR(10) PATH '$.merchant_code' DEFAULT '-999' ON EMPTY
)
ERROR ON ERROR) AS GROCERY_RESULTS

Below is the output:

STORE_NUMBER | TRANSACTIONDATE | SELF_CHECKOUT | MERCHANT_CODE
ONE          | 2017-07-18      | Y             | 2
TWO          | 2019-04-28      | N             | -999

Conclusion

The ability to use IBM Db2 Big SQL to access all types of data, in particular XML and JSON data, whether at rest in HDFS or returned via a REST API, allows developers and users to apply the complete set of Big SQL's SQL functionality against that data no differently than if it were stored in HBase or relationally in Hadoop. This is further enhanced by the ability to create stored procedures and user defined (aggregate) functions in SQL, Java, and C++, which can be used to provide additional business value.

References

IBM Db2 Big SQL Knowledge Center: https://www.ibm.com/support/knowledgecenter/en/SSCRJT_6.0.0/com.ibm.swg.im.bigsql.welcome.doc/doc/welcome.html

Db2 REST User Defined Functions: https://developer.ibm.com/tutorials/dm-1105httprestdb2/

eBook about the IBM Db2 Big SQL JSON functions: http://ibm.biz/db2json
