Co Authored by Xiu Guo
This post is about how to enable analytic capabilities using spark technology for data that is present in Netezza. In IBM Spark Technology Center, we developed a Spark Connector for Netezza and published as part of the spark-packages, which can be used to access data in Netezza inside of the spark ecosystem and also to move data from Netezza tables to Spark. This post will demonstrate how one can define Netezza tables as a Spark Data Source using this package and perform analytics.
Overview of the package
Spark SQL provides data source API to plug in external data sources. Spark Netezza package is an implementation of data source API that enables users to map Netezza tables as Spark SQL data sources. This package supports all the language interfaces supported by Spark.
This package can be deployed as part of an application program or from Spark tools such as spark-shell, spark-sql. To use the package in the application, you have to specify it in your application’s build dependency. When using from Spark tools, add the package using –packages command line option. Netezza JDBC driver also should be added to the application dependencies.
For example, to use the Spark Netezza package with Spark’s interactive shell, start it as shown below:[code language=”bash” gutter=”false” highlight=”1″] $SPARK_HOME/bin/spark-shell –packages com.ibm.SparkTC:spark-netezza_2.10:0.1.1 –driver-class-path ~/nzjdbc.jar
Reading data from Netezza
A data frame can be defined corresponding to a Netezza table as shown below, and can be used to join with other data frames defined on other data sources.[code language=”scala” gutter=”false” highlight=”1-25″] scala> val nzoptions = Map("url" -> "jdbc:netezza://hostname:5480/gosales",
"user" -> "ugosales",
"password" -> "gosales",
"dbtable" -> "PRODUCT",
"numPartitions" -> "8")
scala> val productdf = sqlContext.read.format("com.ibm.spark.netezza").options(nzoptions).load()
|– PRODUCTNUMBER: integer (nullable = false)
|– INTRODUCTIONDATE: timestamp (nullable = false)
|– PRODUCTNAME: string (nullable = false)
|– PRODUCTTYPECODE: integer (nullable = false)
|– PRODUCTIONCOST: double (nullable = false)
|– MARGIN: double (nullable = false)
|– PICTURE: string (nullable = true)
|– PICTUREURL: string (nullable = true)
|– DESCRIPTION: string (nullable = true)
scala> productdf.filter("PRODUCTIONCOST > 100").select("PRODUCTNAME")
res15: org.apache.spark.sql.DataFrame = [PRODUCTNAME: string] [/code]
Saving data from Netezza into Spark
If you are using the data in the Netezza table interactively to perform advanced analytics, it is better to cache/persist
the data in the spark cluster to avoid repeated fetches of the same data. Data frames can be saved as tables in spark sql
or as parquet files.
[/code] [code language=”scala” gutter=”false” highlight=”1-25″] scala> productdf.saveAsTable("product")
scalla>val highcostProducts = sqlContext.sql("select * from product where productioncost > 100")
res17: Long = 23
Querying Netezza data source using Spark SQL CLI
Netezza data source can also be defined using SQL syntax with Spark SQL CLI as shown below:[code language=”bash” gutter=”false” highlight=”1″] $SPARK_HOME/bin/spark-sql –packages com.ibm.SparkTC:spark-netezza_2.10:0.1.1 –driver-class-path ~/nzjdbc.jar
[/code] [code language=”scala” gutter=”false” highlight=”1-25″]
CREATE TEMPORARY TABLE product_table
select * from product_table where productioncost > 100;
How data is transferred to spark?
This library uses the external table mechanism to stream the data from Netezza system to Spark nodes. Table data slices are mapped to RDD partitions to be read in parallel. User can control the number of partitions to create for a given data source. Maximum number of partitions will be the number of data slices of the Netezza table. If number of partitions specified are less than the data slices in Netezza, each RDD partition will be mapped to more than one data slice. The following diagram depicts how data is moved from Netezza appliance to Spark SQL datasource.
This post demonstrated how easy it is to transfer data from Netezza appliance to Spark and use the data stored in Netezza for big data analytics. Spark Netezza data source library is an open source GitHub project, you are welcome to report and fix any issues here.