Big SQL is integrated with Apache Spark as a technical preview starting in BigInsights 4.2, giving you a way to access Spark by using the SYSHADOOP.EXECSPARK built-in table function to invoke Spark jobs. Spark jobs that are launched by invoking SYSHADOOP.EXECSPARK run in a long-running Spark application called the Spark gateway, which is controlled by Big SQL. The Spark gateway is not enabled by default. To enable this feature, complete the following steps.

  1. Package your Spark classes in a JAR file, for example, ptf-SparkJob.jar. Copy each JAR file to the $BIGSQL_HOME/userlib directory on all nodes. The bigsql user must hold the EXECUTE permission on each JAR file. A ptf-examples.jar file is provided as a template after you install the Big SQL service. (A minimal sketch of such a class appears after these steps.)
  2. Update the Big SQL Spark configuration file (bigsql-spark.conf). There is a template in the $BIGSQL_HOME/conf/template directory. Define the “ptf” JAR files in spark.driver.extraClassPath and spark.executor.extraClassPath, separated by a colon (:), as shown in the following example:
    
    spark.driver.extraClassPath /usr/ibmpacks/current/bigsql/bigsql/userlib/ptf-examples.jar:/usr/ibmpacks/current/bigsql/bigsql/userlib/spark-csv_2.10-1.3.0.jar:/usr/ibmpacks/current/bigsql/bigsql/userlib/commons-csv-1.1.jar
    
    spark.executor.extraClassPath /usr/ibmpacks/current/bigsql/bigsql/userlib/ptf-examples.jar:/usr/ibmpacks/current/bigsql/bigsql/userlib/spark-csv_2.10-1.3.0.jar:/usr/ibmpacks/current/bigsql/bigsql/userlib/commons-csv-1.1.jar

    Copy the updated bigsql-spark.conf file to the $BIGSQL_HOME/conf directory on all nodes.

  3. To start the Spark gateway, run the following command:
    $BIGSQL_HOME/bin/bigsql start -spark

    To determine the status of the Spark gateway, run the following command:

    $BIGSQL_HOME/bin/bigsql status -spark

    To stop the Spark gateway, run the following command:

    $BIGSQL_HOME/bin/bigsql stop -spark

    The format of these commands is consistent with other Big SQL commands ($BIGSQL_HOME/bin/bigsql {actions} -{options}). The Spark gateway commands run in isolation, with only one dependency on Big SQL: the Scheduler must be up and running before you can start the Spark gateway; otherwise, an error is returned.

Note: If you need to change the “ptf” JAR files that are defined in bigsql-spark.conf, remember to stop the Spark gateway before copying the new files to all nodes so that the changes can take effect when you restart the Spark gateway.
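
For reference, the following Scala sketch outlines roughly what a class such as ReadCsvFile in ptf-examples.jar might look like. This is a minimal sketch, not the shipped implementation: the SparkPtf interface name, the method signatures, and the argument-map key used here are assumptions, so consult the source of the examples in ptf-examples.jar for the authoritative contract. The sketch also illustrates why spark-csv and its commons-csv dependency must be on both the driver and executor classpaths.

package com.ibm.biginsights.bigsql.examples

import java.util.{Map => JMap}

import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.types.StructType

// Minimal sketch of a polymorphic table function (PTF) for EXECSPARK.
// SparkPtf is assumed to be the interface supplied by the Big SQL Spark
// integration; its real name, package, and signatures may differ.
class ReadCsvFileSketch extends SparkPtf {

  // The unnamed SQL argument (the HDFS path) is assumed to arrive in the
  // arguments map; the key used here is hypothetical.
  private def csvPath(arguments: JMap[String, Object]): String =
    arguments.get("path").toString

  // Load the file with the spark-csv data source (Spark 1.x API). This is
  // the call that requires spark-csv and commons-csv on the classpath.
  private def load(ctx: SQLContext, arguments: JMap[String, Object]): DataFrame =
    ctx.read
      .format("com.databricks.spark.csv")
      .option("inferSchema", "true")
      .load(csvPath(arguments))

  // Called by Big SQL at compile time to discover the output schema
  // (this is what drives the DESCRIBE output shown later).
  override def describe(ctx: SQLContext, arguments: JMap[String, Object]): StructType =
    load(ctx, arguments).schema

  // Called at run time to produce the rows that the SQL query consumes.
  override def execute(ctx: SQLContext, arguments: JMap[String, Object]): DataFrame =
    load(ctx, arguments)

  // A negative value is assumed to mean "cardinality unknown" to the optimizer.
  override def cardinality(ctx: SQLContext, arguments: JMap[String, Object]): Long = -1L

  // Clean-up hook; nothing to release in this sketch.
  override def destroy(ctx: SQLContext, arguments: JMap[String, Object]): Unit = ()
}

Compile the class against the Spark version that ships with BigInsights 4.2 (the spark-csv_2.10 artifact above implies Scala 2.10 and the Spark 1.x API), package it into a JAR file, and deploy it as described in step 1.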

Example

In this example, a simple query reads customer data in CSV file format from HDFS.
Note: Ensure that the following JAR files are defined in spark.driver.extraClassPath and spark.executor.extraClassPath: ptf-examples.jar, spark-csv_2.10-1.3.0.jar, and commons-csv-1.1.jar. The spark-csv JAR file can be downloaded from the Spark Packages community (https://spark-packages.org/package/databricks/spark-csv), and it depends on the commons-csv JAR file, which can be downloaded from Apache Commons (https://commons.apache.org/proper/commons-csv). The ReadCsvFile class that is passed as the class argument in the following example is packaged in ptf-examples.jar, which is included in the Big SQL package.

a) A simple SELECT query against existing CSV files in HDFS


SELECT
  CAST(c0 as VARCHAR(20)),
  CAST(c1 as VARCHAR(20)),
  CAST(c2 as VARCHAR(20)),
  CAST(c3 as VARCHAR(20)),
  CAST(c4 as VARCHAR(20)),
  CAST(c5 as VARCHAR(20))
FROM TABLE (SYSHADOOP.EXECSPARK(
  language => 'scala',
  class => 'com.ibm.biginsights.bigsql.examples.ReadCsvFile',
  'hdfs://host.ibm.com:port/tmp/item_tpcds_1k-rows_wtmp.csv'
)) AS item
LIMIT 3 

The output of this query will be similar to the following example:


1                    2                    3                    4                    5                    6
-------------------- -------------------- -------------------- -------------------- -------------------- ---------------
I_ITEM_ID            I_REC_START_DATE     I_CURRENT_PRICE      I_BRAND_ID           I_BRAND              I_CLASS
AAAAAAAABAAAAAAA     1997-10-27           27.02                5003002              exportischolar #2    pop
AAAAAAAACAAAAAAA     1997-10-27           1.12                 1001001              amalgamalg #1        dresses

  3 record(s) selected.

b) The DESCRIBE statement is useful for getting the column data types from CSV files in HDFS
Note: The VARCHAR columns have a length of 128 because the bigsql.spark.string.size parameter is set to 128 in bigsql-spark.conf (for example, bigsql.spark.string.size 128).


DESCRIBE SELECT * 
FROM TABLE (SYSHADOOP.EXECSPARK( 
  language => 'scala', 
  class => 'com.ibm.biginsights.bigsql.examples.ReadCsvFile', 
  'hdfs://host.ibm.com:port/tmp/item_tpcds_1k-rows_wtmp.csv' )) AS item

The output of this query will be similar to the following example:


 Column Information

 Number of columns: 7

 SQL type              Type length  Column name                     Name length
 --------------------  -----------  ------------------------------  -----------
 449   VARCHAR                 128  C0                                        2
 393   TIMESTAMP                19  C1                                        2
 481   DOUBLE                    8  C2                                        2
 497   INTEGER                   4  C3                                        2
 449   VARCHAR                 128  C4                                        2
 497   INTEGER                   4  C5                                        2
 449   VARCHAR                 128  C6                                        2

The CSV file used in this example is based on TPC-DS data, which can be downloaded from http://www.tpc.org/tpc_documents_current_versions/current_specifications.asp.
A mini version of the ITEM table is attached here as well. Right-click the item_tpcds_1k-rows_wtmp link, select “Save Link As”, and save the file as item_tpcds_1k-rows_wtmp.csv (instead of .jpeg).
