Lab 5 Querying data with Big SQL

Now that you know how to work with HDFS and analyze your data with a spreadsheet-style tool, it’s a good time to explore how you can query your data with Big SQL. Big SQL provides broad SQL support based on the ISO SQL standard. You can issue queries using JDBC or ODBC drivers to access data that is stored in InfoSphere BigInsights in the same way that you access relational databases from your enterprise applications. The SQL query engine supports joins, unions, grouping, common table expressions, windowing functions, and other familiar SQL expressions.
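
For example, a query that combines a common table expression with a windowing function might look like the following sketch. (The sls_sales_fact table referenced here is created later in this lab; the query is shown only to illustrate the style of SQL that Big SQL accepts.)

[code language="sql"]-- illustrative sketch only: rank order methods by total quantity sold
WITH totals (order_method_key, total_qty) AS
( SELECT order_method_key, SUM(quantity)
  FROM sls_sales_fact
  GROUP BY order_method_key
)
SELECT order_method_key, total_qty,
       RANK() OVER (ORDER BY total_qty DESC) AS qty_rank
FROM totals;[/code]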

This tutorial uses sales data from a fictional company that sells and distributes outdoor products to third-party retailer stores as well as directly to consumers through its online store. It maintains its data in a series of FACT and DIMENSION tables, as is common in relational data warehouse environments. In this lab, you will explore how to create, populate, and query a subset of the star schema database to investigate the company’s performance and offerings. Note that BigInsights provides scripts to create and populate the more than 60 tables that comprise the sample GOSALESDW database. You will use fewer than 10 of these tables in this lab.

To execute the queries in this lab, you will use the open source Eclipse environment provided with the BigInsights 3.0 Quick Start Edition VMware image. Of course, you can use other tools or interfaces to invoke Big SQL, such as the Java SQL Shell (JSqsh), a command-line facility provided with BigInsights. However, Eclipse is a good choice for this lab, as it formats query results in a manner that’s easy to read and encourages you to collect your SQL statements into scripts for editing and testing.

After you complete the lessons in this module, you will understand how to:

  • Connect to the Big SQL server from Eclipse
  • Execute individual or multiple Big SQL statements
  • Create Big SQL tables in Hadoop
  • Populate Big SQL tables with data from local files
  • Query Big SQL tables using projections, restrictions, joins, aggregations, and other popular expressions
  • Create and query a view based on multiple Big SQL tables
  • Create and run a JDBC client application for Big SQL using Eclipse

Allow 45 – 60 minutes to complete this lab. Prior to starting this lab, you need to have a working BigInsights 3.0 environment. An earlier lab on Getting Started with Hadoop and BigInsights explains how you can acquire and launch the necessary software services.

If you want to learn even more about Big SQL, check out the SQL on Hadoop lab collection. Please post questions or comments about this lab to the forum on Hadoop Dev at https://developer.ibm.com/answers?community=hadoop.

5.1. Creating a project and executing Big SQL statements

To begin, create a BigInsights project and Big SQL script.

__1. Launch Eclipse using the icon on your desktop. Accept the default workspace when prompted.

image001

__2. Create a BigInsights project for your work. From the Eclipse menu bar, click File > New > Other. Expand the BigInsights folder, and select BigInsights Project, and then click Next.

__3. Type myBigSQL in the Project name field, and then click Finish.

image002

__4. If you are not already in the BigInsights perspective, a Switch to the BigInsights perspective window opens. Click Yes to switch to the BigInsights perspective.

__5. Create a new SQL script file. From the Eclipse menu bar, click File > New > Other. Expand the BigInsights folder, and select SQL script, and then click Next.

__6. In the New SQL File window, in the Enter or select the parent folder field, select myBigSQL. Your new SQL file is stored in this project folder.

__7. In the File name field, type aFirstFile. The .sql extension is added automatically. Click Finish.

image003

In the Select Connection Profile window, locate the Big SQL JDBC connection, which is the pre-defined connection to Big SQL 3.0 provided with the VMware image. Inspect the properties displayed in the Properties field. Verify that the connection uses the JDBC driver and database name shown in the Properties pane below.

image004

image005

About the driver selection

You may be wondering why you are using a connection that employs the com.ibm.db2.jcc.DB2Driver class. In 2014, IBM released a common SQL query engine as part of its DB2 and BigInsights offerings. Doing so provides for greater SQL commonality across its relational DBMS and Hadoop-based offerings. It also brings a greater breadth of SQL function to Hadoop (BigInsights) users. This common query engine is accessible through the DB2 driver. The Big SQL driver remains operational and offers connectivity to an earlier, BigInsights-specific SQL query engine. This lab focuses on using the common SQL query engine.

__8. Click Edit to edit this connection’s log in information.

image006

__9. Change the user name and password properties to match your user ID and password (e.g., biadmin / biadmin). Leave the remaining property values intact.

image007

__10. Click Test Connection to verify that you can successfully connect to the server.

__11. Check the Save password box and click OK.

__12. Click Finish to close the connection window. Your empty SQL script will be displayed.

__13. Copy the following statement into your SQL script:

[code language="sql"]create hadoop table test1 (col1 int, col2 varchar(5));[/code]

Because you didn’t specify a schema name for the table, it will be created in your default schema, which is your user name (biadmin). Thus, the previous statement is equivalent to

[code language="sql"]create hadoop table biadmin.test1 (col1 int, col2 varchar(5));[/code]
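
If you ever need to work in a different schema, you can either qualify each table name as shown above or change the default schema for your session. The following is a minimal sketch, assuming the DB2-style SET SCHEMA statement is available through the common SQL engine and that the schema name (testschema) is one you have created; adjust as needed:

[code language="sql"]-- sketch: switch the default schema for the session (hypothetical schema name)
SET SCHEMA testschema;
-- unqualified names now resolve to TESTSCHEMA instead of BIADMIN
create hadoop table test1 (col1 int, col2 varchar(5));[/code]
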
image005

In some cases, the Eclipse SQL editor may flag certain Big SQL statements as containing syntax errors. Ignore these false warnings and continue with your lab exercises.

__14. Save your file (press Ctrl + S or click File > Save).

__15. Right-click anywhere in the script to display a menu of options.

image008

__16. Select Run SQL or press F5. This causes all statements in your script to be executed.

__17. Inspect the SQL Results pane that appears toward the bottom of your display. (If desired, double-click the SQL Results tab to enlarge this pane. Then double-click the tab again to return the pane to its normal size.) Verify that the statement executed successfully. Your Big SQL database now contains a new table named BIADMIN.TEST1. Note that your schema and table name were folded to uppercase.

 

image009

image005

For the remainder of this lab, you should execute each SQL statement individually. To do so, highlight the statement with your cursor and press F5.

When you’re developing a SQL script with multiple statements, it’s generally a good idea to test each statement one at a time to verify that each is working as expected.

__18. From your Eclipse project, query the system for metadata about your test1 table:

[code language="sql"]select tabschema, colname, colno, typename, length
from syscat.columns where tabschema = USER and tabname = 'TEST1';[/code]

In case you’re wondering, syscat.columns is one of a number of views supplied over system catalog data automatically maintained for you by the Big SQL service.
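
Because the catalog is exposed through views, you can explore it with ordinary queries. For example, the following sketch lists the tables defined in your schema, assuming the companion syscat.tables view is available in your environment:

[code language="sql"]-- sketch: list the tables defined in your current schema
select tabschema, tabname, create_time
from syscat.tables
where tabschema = USER;[/code]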

__19. Inspect the SQL Results to verify that the query executed successfully, and click on the Result1 tab to view its output.

image010

__20. Finally, clean up the object you created in the database.

[code language="sql"]drop table test1;[/code]

__21. Save your file. If desired, leave it open to execute statements for subsequent exercises.

Now that you’ve set up your Eclipse environment and know how to create SQL scripts and execute queries, you’re ready to develop more sophisticated scenarios using Big SQL. In the next lesson, you will create a number of tables in your schema and use Eclipse to query them.

5.2. Creating sample tables and loading sample data

In this lesson, you will create several sample tables and load data into these tables from local files.

__1. Determine the location of the sample data in your local file system and make a note of it. You will need to use this path specification when issuing LOAD commands later in this lab.

image005

Subsequent examples in this section presume your sample data is in the /opt/ibm/biginsights/bigsql/samples/data directory. This is the location of the data on the BigInsights VMware image, and it is the default location in typical BigInsights installations.

__2. Create several tables to track information about sales. Issue each of the following CREATE TABLE statements one at a time, and verify that each completes successfully:

[code language="sql"]-- dimension table for region info

CREATE HADOOP TABLE IF NOT EXISTS go_region_dim
( country_key INT NOT NULL
, country_code INT NOT NULL
, flag_image VARCHAR(45)
, iso_three_letter_code VARCHAR(9) NOT NULL
, iso_two_letter_code VARCHAR(6) NOT NULL
, iso_three_digit_code VARCHAR(9) NOT NULL
, region_key INT NOT NULL
, region_code INT NOT NULL
, region_en VARCHAR(90) NOT NULL
, country_en VARCHAR(90) NOT NULL
, region_de VARCHAR(90), country_de VARCHAR(90), region_fr VARCHAR(90)
, country_fr VARCHAR(90), region_ja VARCHAR(90), country_ja VARCHAR(90)
, region_cs VARCHAR(90), country_cs VARCHAR(90), region_da VARCHAR(90)
, country_da VARCHAR(90), region_el VARCHAR(90), country_el VARCHAR(90)
, region_es VARCHAR(90), country_es VARCHAR(90), region_fi VARCHAR(90)
, country_fi VARCHAR(90), region_hu VARCHAR(90), country_hu VARCHAR(90)
, region_id VARCHAR(90), country_id VARCHAR(90), region_it VARCHAR(90)
, country_it VARCHAR(90), region_ko VARCHAR(90), country_ko VARCHAR(90)
, region_ms VARCHAR(90), country_ms VARCHAR(90), region_nl VARCHAR(90)
, country_nl VARCHAR(90), region_no VARCHAR(90), country_no VARCHAR(90)
, region_pl VARCHAR(90), country_pl VARCHAR(90), region_pt VARCHAR(90)
, country_pt VARCHAR(90), region_ru VARCHAR(90), country_ru VARCHAR(90)
, region_sc VARCHAR(90), country_sc VARCHAR(90), region_sv VARCHAR(90)
, country_sv VARCHAR(90), region_tc VARCHAR(90), country_tc VARCHAR(90)
, region_th VARCHAR(90), country_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;

-- dimension table tracking method of order for the sale (e.g., Web, fax)

CREATE HADOOP TABLE IF NOT EXISTS sls_order_method_dim
( order_method_key INT NOT NULL
, order_method_code INT NOT NULL
, order_method_en VARCHAR(90) NOT NULL
, order_method_de VARCHAR(90), order_method_fr VARCHAR(90)
, order_method_ja VARCHAR(90), order_method_cs VARCHAR(90)
, order_method_da VARCHAR(90), order_method_el VARCHAR(90)
, order_method_es VARCHAR(90), order_method_fi VARCHAR(90)
, order_method_hu VARCHAR(90), order_method_id VARCHAR(90)
, order_method_it VARCHAR(90), order_method_ko VARCHAR(90)
, order_method_ms VARCHAR(90), order_method_nl VARCHAR(90)
, order_method_no VARCHAR(90), order_method_pl VARCHAR(90)
, order_method_pt VARCHAR(90), order_method_ru VARCHAR(90)
, order_method_sc VARCHAR(90), order_method_sv VARCHAR(90)
, order_method_tc VARCHAR(90), order_method_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;

-- look up table with product brand info in various languages

CREATE HADOOP TABLE IF NOT EXISTS sls_product_brand_lookup
( product_brand_code INT NOT NULL
, product_brand_en VARCHAR(90) NOT NULL
, product_brand_de VARCHAR(90), product_brand_fr VARCHAR(90)
, product_brand_ja VARCHAR(90), product_brand_cs VARCHAR(90)
, product_brand_da VARCHAR(90), product_brand_el VARCHAR(90)
, product_brand_es VARCHAR(90), product_brand_fi VARCHAR(90)
, product_brand_hu VARCHAR(90), product_brand_id VARCHAR(90)
, product_brand_it VARCHAR(90), product_brand_ko VARCHAR(90)
, product_brand_ms VARCHAR(90), product_brand_nl VARCHAR(90)
, product_brand_no VARCHAR(90), product_brand_pl VARCHAR(90)
, product_brand_pt VARCHAR(90), product_brand_ru VARCHAR(90)
, product_brand_sc VARCHAR(90), product_brand_sv VARCHAR(90)
, product_brand_tc VARCHAR(90), product_brand_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;

-- product dimension table

CREATE HADOOP TABLE IF NOT EXISTS sls_product_dim
( product_key INT NOT NULL
, product_line_code INT NOT NULL
, product_type_key INT NOT NULL
, product_type_code INT NOT NULL
, product_number INT NOT NULL
, base_product_key INT NOT NULL
, base_product_number INT NOT NULL
, product_color_code INT
, product_size_code INT
, product_brand_key INT NOT NULL
, product_brand_code INT NOT NULL
, product_image VARCHAR(60)
, introduction_date TIMESTAMP
, discontinued_date TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;

-- look up table with product line info in various languages

CREATE HADOOP TABLE IF NOT EXISTS sls_product_line_lookup
( product_line_code INT NOT NULL
, product_line_en VARCHAR(90) NOT NULL
, product_line_de VARCHAR(90), product_line_fr VARCHAR(90)
, product_line_ja VARCHAR(90), product_line_cs VARCHAR(90)
, product_line_da VARCHAR(90), product_line_el VARCHAR(90)
, product_line_es VARCHAR(90), product_line_fi VARCHAR(90)
, product_line_hu VARCHAR(90), product_line_id VARCHAR(90)
, product_line_it VARCHAR(90), product_line_ko VARCHAR(90)
, product_line_ms VARCHAR(90), product_line_nl VARCHAR(90)
, product_line_no VARCHAR(90), product_line_pl VARCHAR(90)
, product_line_pt VARCHAR(90), product_line_ru VARCHAR(90)
, product_line_sc VARCHAR(90), product_line_sv VARCHAR(90)
, product_line_tc VARCHAR(90), product_line_th VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- look up table for products
CREATE HADOOP TABLE IF NOT EXISTS sls_product_lookup
( product_number INT NOT NULL
, product_language VARCHAR(30) NOT NULL
, product_name VARCHAR(150) NOT NULL
, product_description VARCHAR(765)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- fact table for sales
CREATE HADOOP TABLE IF NOT EXISTS sls_sales_fact
( order_day_key INT NOT NULL
, organization_key INT NOT NULL
, employee_key INT NOT NULL
, retailer_key INT NOT NULL
, retailer_site_key INT NOT NULL
, product_key INT NOT NULL
, promotion_key INT NOT NULL
, order_method_key INT NOT NULL
, sales_order_key INT NOT NULL
, ship_day_key INT NOT NULL
, close_day_key INT NOT NULL
, quantity INT
, unit_cost DOUBLE
, unit_price DOUBLE
, unit_sale_price DOUBLE
, gross_margin DOUBLE
, sale_total DOUBLE
, gross_profit DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;

-- fact table for marketing promotions
CREATE HADOOP TABLE IF NOT EXISTS mrk_promotion_fact
( organization_key INT NOT NULL
, order_day_key INT NOT NULL
, rtl_country_key INT NOT NULL
, employee_key INT NOT NULL
, retailer_key INT NOT NULL
, product_key INT NOT NULL
, promotion_key INT NOT NULL
, sales_order_key INT NOT NULL
, quantity SMALLINT
, unit_cost DOUBLE
, unit_price DOUBLE
, unit_sale_price DOUBLE
, gross_margin DOUBLE
, sale_total DOUBLE
, gross_profit DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;[/code]

 

image005

Let’s briefly explore some aspects of the CREATE TABLE statements shown here. If you have a SQL background, the majority of these statements should be familiar to you. However, after the column specification, there are some additional clauses unique to Big SQL – clauses that enable it to exploit Hadoop storage mechanisms (in this case, Hive). The ROW FORMAT DELIMITED clause specifies that fields are terminated by tabs ("\t") and lines are terminated by newline characters ("\n"). The table will be stored in a TEXTFILE format, making it easy for a wide range of applications to work with. For details on these clauses, refer to the Apache Hive documentation.
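
As a further illustration (not part of this lab's schema), the same clauses can describe other delimited layouts. The sketch below defines a hypothetical comma-delimited table using the same storage mechanism:

[code language="sql"]-- sketch: a hypothetical comma-delimited table stored as a text file
CREATE HADOOP TABLE IF NOT EXISTS sample_csv_table
( id INT NOT NULL
, descr VARCHAR(100)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;[/code]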

__3. Load data into each of these tables using sample data provided in files. One at a time, issue each of the following LOAD statements and verify that each completes successfully. Remember to change the SFTP and file path specifications (if needed) to match your environment. Each statement will return a warning message providing details on the number of rows loaded, etc.

[code language="sql"]load hadoop using file url 'sftp://biadmin:biadmin@bivm:22/opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.MRK_PROMOTION_FACT.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE MRK_PROMOTION_FACT overwrite;

load hadoop using file url
'sftp://biadmin:biadmin@bivm:22/opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE GO_REGION_DIM overwrite;

load hadoop using file url 'sftp://biadmin:biadmin@bivm:22/opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_ORDER_METHOD_DIM.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_ORDER_METHOD_DIM overwrite;

load hadoop using file url 'sftp://biadmin:biadmin@bivm:22/opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_BRAND_LOOKUP.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_PRODUCT_BRAND_LOOKUP overwrite;

load hadoop using file url 'sftp://biadmin:biadmin@bivm:22/opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_DIM.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_PRODUCT_DIM overwrite;

load hadoop using file url 'sftp://biadmin:biadmin@bivm:22/opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_LINE_LOOKUP.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_PRODUCT_LINE_LOOKUP overwrite;

load hadoop using file url 'sftp://biadmin:biadmin@bivm:22/opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_LOOKUP.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_PRODUCT_LOOKUP overwrite;

load hadoop using file url 'sftp://biadmin:biadmin@bivm:22/opt/ibm/biginsights/bigsql/samples/data/GOSALESDW.SLS_SALES_FACT.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_SALES_FACT overwrite;[/code]

 

image001

Let’s explore the LOAD syntax shown in these examples briefly. Each example loads data into a table using a file URL specification that relies on SFTP to locate the source file (in this case, a file on your local VM). In particular, the SFTP specification includes a valid user ID and password (biadmin/biadmin), the target host server and port (bivm:22), and the full path of the data file on that system. Note that the path is local to the Big SQL server (not your Eclipse client). The WITH SOURCE PROPERTIES clause specifies that fields in the source data are delimited by tabs ('\t'). The INTO TABLE clause identifies the target table for the LOAD operation. The OVERWRITE keyword indicates that any existing data in the table will be replaced by data contained in the source file. (If you wanted to simply add rows to the table’s content, you could specify APPEND instead.)

Using SFTP (or FTP) is one way in which you can invoke the LOAD command. If your target data already resides in your distributed file system, you can provide the DFS directory information in your file URL specification. Indeed, for optimal runtime performance, you may prefer to take that approach. In addition, you can load data directly from a remote relational DBMS via a JDBC connection. See the BigInsights Knowledge Center (product documentation) for details.
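
For example, if a copy of the region data already resided in your distributed file system, the LOAD might look roughly like the sketch below. The DFS path shown is hypothetical, and the exact URL forms supported are described in the Knowledge Center:

[code language="sql"]-- sketch: load from a file already stored in DFS (hypothetical path)
load hadoop using file url
'/user/biadmin/sampleData/GOSALESDW.GO_REGION_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE GO_REGION_DIM overwrite;[/code]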

__4. Query the tables to verify that the expected number of rows was loaded into each table. Execute each query that follows individually and compare the results with the number of rows specified in the comment line preceding each query.

[code language="sql"]-- total rows in GO_REGION_DIM = 21
select count(*) from GO_REGION_DIM;

-- total rows in sls_order_method_dim = 7
select count(*) from sls_order_method_dim;

-- total rows in SLS_PRODUCT_BRAND_LOOKUP = 28
select count(*) from SLS_PRODUCT_BRAND_LOOKUP;

-- total rows in SLS_PRODUCT_DIM = 274
select count(*) from SLS_PRODUCT_DIM;

-- total rows in SLS_PRODUCT_LINE_LOOKUP = 5
select count(*) from SLS_PRODUCT_LINE_LOOKUP;

-- total rows in SLS_PRODUCT_LOOKUP = 6302
select count(*) from SLS_PRODUCT_LOOKUP;

-- total rows in SLS_SALES_FACT = 446023
select count(*) from SLS_SALES_FACT;

-- total rows in MRK_PROMOTION_FACT = 11034
select count(*) from MRK_PROMOTION_FACT;[/code]

 

5.3. Querying tables with joins, aggregations and more

Now you’re ready to query your tables. Based on earlier exercises, you’ve already seen that you can perform basic SQL operations, including projections (to extract specific columns from your tables) and restrictions (to extract specific rows meeting certain conditions you specified). Let’s explore a few examples that are a bit more sophisticated.

In this lesson, you will create and run Big SQL queries that join data from multiple tables as well as perform aggregations and other SQL operations. Note that the queries included in this section are based on queries shipped with BigInsights as samples. Some of these queries return hundreds of thousands of rows; however, the Eclipse SQL Results page limits output to only 500 rows. Although you can change that value in the Data Management preferences section, retain the default setting for this lab.

__1. Join data from multiple tables to return the product name, quantity and order method of goods that have been sold. To do so, execute the following query.

[code language="sql"]-- Fetch the product name, quantity, and order method
-- of products sold.
-- Query 1
SELECT pnumb.product_name, sales.quantity,
meth.order_method_en
FROM
sls_sales_fact sales,
sls_product_dim prod,
sls_product_lookup pnumb,
sls_order_method_dim meth
WHERE
pnumb.product_language='EN'
AND sales.product_key=prod.product_key
AND prod.product_number=pnumb.product_number
AND meth.order_method_key=sales.order_method_key;[/code]

 

Let’s review a few aspects of this query briefly:

  • Data from four tables will be used to drive the results of this query (see the tables referenced in the FROM clause). Relationships between these tables are resolved through three join predicates specified as part of the WHERE clause. The query relies on three equi-joins to filter data from the referenced tables. (Predicates such as prod.product_number=pnumb.product_number help to narrow the results to product numbers that match in both tables.) An equivalent form using explicit JOIN syntax appears in the sketch after this list.
  • For improved readability, this query uses aliases in the SELECT and FROM clauses when referencing tables. For example, pnumb.product_name refers to "pnumb," which is the alias for the sls_product_lookup table. Once defined in the FROM clause, an alias can be used in the WHERE clause so that you do not need to repeat the complete table name.
  • The predicate pnumb.product_language='EN' helps to further narrow the result to English output only. This database contains thousands of rows of data in various languages, so restricting the language provides some optimization.
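
If you prefer explicit join syntax, the same query can be expressed with JOIN ... ON clauses. The sketch below is logically equivalent to Query 1:

[code language="sql"]-- Query 1 rewritten with explicit join syntax (equivalent results)
SELECT pnumb.product_name, sales.quantity, meth.order_method_en
FROM sls_sales_fact sales
JOIN sls_product_dim prod
  ON sales.product_key = prod.product_key
JOIN sls_product_lookup pnumb
  ON prod.product_number = pnumb.product_number
JOIN sls_order_method_dim meth
  ON meth.order_method_key = sales.order_method_key
WHERE pnumb.product_language = 'EN';[/code]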

image011

__2. Modify the query to restrict the order method to one type – those involving a Sales visit. To do so, add the following query predicate just before the semi-colon:

[code language="sql"]AND order_method_en='Sales visit'[/code]
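
For reference, the complete modified statement should look like this:

[code language="sql"]-- Query 2: restrict results to orders placed through a sales visit
SELECT pnumb.product_name, sales.quantity,
meth.order_method_en
FROM
sls_sales_fact sales,
sls_product_dim prod,
sls_product_lookup pnumb,
sls_order_method_dim meth
WHERE
pnumb.product_language='EN'
AND sales.product_key=prod.product_key
AND prod.product_number=pnumb.product_number
AND meth.order_method_key=sales.order_method_key
AND order_method_en='Sales visit';[/code]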

__3. Inspect the results, a subset of which is shown below:

image012

__4. To find out which sales method accounts for the greatest quantity of orders, add a GROUP BY clause (group by pll.product_line_en, md.order_method_en). In addition, invoke the SUM aggregate function (sum(sf.quantity)) to total the orders by product line and order method. Finally, this query cleans up the output a bit by using aliases (e.g., as Product) to substitute more readable column headers.

[code language="sql"]-- Query 3
SELECT pll.product_line_en AS Product,
md.order_method_en AS Order_method,
sum(sf.QUANTITY) AS total
FROM
sls_order_method_dim AS md,
sls_product_dim AS pd,
sls_product_line_lookup AS pll,
sls_product_brand_lookup AS pbl,
sls_sales_fact AS sf
WHERE
pd.product_key = sf.product_key
AND md.order_method_key = sf.order_method_key
AND pll.product_line_code = pd.product_line_code
AND pbl.product_brand_code = pd.product_brand_code
GROUP BY pll.product_line_en, md.order_method_en;[/code]

 

__5. Inspect the results, which should contain 35 rows. A portion is shown below.

image013
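
To make the method with the greatest total easy to spot, you can sort the grouped results. For example, a small variation on Query 3 adds an ORDER BY clause just before the semicolon so that the largest totals appear first:

[code language="sql"]-- optional: append to Query 3 before the semicolon to sort by total, descending
ORDER BY total DESC[/code]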

5.4. Optional: Using SerDes for non-traditional data

While data structured in CSV and TSV formats is often stored in BigInsights and loaded into Big SQL tables, you may also need to work with other types of data – data that might require the use of a serializer / deserializer (SerDe). SerDes are common in the Hadoop environment. You’ll find a number of SerDes available in the public domain, or you can write your own following typical Hadoop practices.

Using a SerDe with Big SQL is pretty straightforward. Once you develop or locate the SerDe you need, just add its JAR file to the appropriate BigInsights subdirectories. Then stop and restart the Big SQL service, and specify the SerDe class name when you create your table.

In this lab exercise, you will use a SerDe to define a table for JSON-based blog data. The sample blog file for this exercise is the same blog file you used as input to BigSheets in a prior lab.

__1. Download the hive-json-serde-0.2.jar into a directory of your choice on your local file system, such as /home/biadmin/sampleData. (As of this writing, the full URL for this SerDe is https://code.google.com/p/hive-json-serde/downloads/detail?name=hive-json-serde-0.2.jar)

__2. Register the SerDe with BigInsights.

__a. Stop the Big SQL server. From a terminal window, issue this command: $BIGINSIGHTS_HOME/bin/stop.sh bigsql

__b. Copy the SerDe .jar file to the $BIGSQL_HOME/userlib and $HIVE_HOME/lib directories.

__c. Restart the Big SQL server. From a terminal window, issue this command: $BIGINSIGHTS_HOME/bin/start.sh bigsql

Now that you’ve registered your SerDe, you’re ready to use it. In this section, you will create a table that relies on the SerDe you just registered. For simplicity, this will be an externally managed table – i.e., a table created over a user directory that resides outside of the Hive warehouse. This user directory will contain the table’s data in files. As part of this exercise, you will upload the sample blogs-data.txt file into the target DFS directory.

Creating a Big SQL table over an existing DFS directory has the effect of populating this table with all the data in the directory. To satisfy queries, Big SQL will look in the user directory specified when you created the table and consider all files in that directory to be the table’s contents. This is consistent with the Hive concept of an externally managed table.

Once the table is created, you’ll query that table. In doing so, you’ll note that the presence of a SerDe is transparent to your queries.

__3. If necessary, download the .zip file containing the sample data from the bottom half of the article on Analyzing social media and structured data with InfoSphere BigInsights. Unzip the file into a directory on your local file system, such as /home/biadmin. You will be working with the blogs-data.txt file.

From the Files tab of the Web console, navigate to the /user/biadmin/sampleData directory of your distributed file system. Use the create directory button to create a subdirectory named SerDe-Test.

image014

__4. Upload the blogs-data.txt file into /user/biadmin/sampleData/SerDe-Test.

image015

__5. Return to the Big SQL execution environment of your choice (e.g., Eclipse).

__6. Execute the following statement, which creates a TESTBLOGS table that includes a LOCATION clause that specifies the DFS directory containing your sample blogs-data.txt file:

[code language="sql"]create hadoop table if not exists testblogs (
Country String,
Crawled String,
FeedInfo String,
Inserted String,
IsAdult int,
Language String,
Postsize int,
Published String,
SubjectHtml String,
Tags String,
Type String,
Url String)
row format serde 'org.apache.hadoop.hive.contrib.serde2.JsonSerde'
location '/user/biadmin/sampleData/SerDe-Test';[/code]
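
Once the CREATE statement completes, you can query the table just as you queried the others; the SerDe is invoked transparently when rows are read. A simple sketch to confirm that the blog data is visible (column names follow the table definition above):

[code language="sql"]-- sketch: confirm the externally managed table returns blog records
select country, published, url
from testblogs
fetch first 5 rows only;[/code]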

 

5.5. Optional: Developing a JDBC client application with Big SQL

You can write a JDBC client application that uses Big SQL to open a database connection, execute queries, and process the results. In this optional exercise, you’ll see how writing a client JDBC application for Big SQL is like writing a client application for any relational DBMS that supports JDBC access.

__1. In the IBM InfoSphere BigInsights Eclipse environment, create a Java project by clicking File > New > Project. From the New Project window, select Java Project. Click Next.

image016

__2. Type a name for the project in the Project Name field, such as MyJavaProject. Click Next.

__3. Open the Libraries tab and click Add External Jars. Add the DB2 JDBC driver for BigInsights, located at /opt/ibm/biginsights/database/db2/java/db2jcc4.jar.

image017

__4. Click Finish. Click Yes when you are asked if you want to open the Java perspective.

__5. Right-click the MyJavaProject project, and click New > Package. In the Name field, in the New Java Package window, type a name for the package, such as aJavaPackage4me. Click Finish.

image018

__6. Right-click the aJavaPackage4me package, and click New > Class.

__7. In the New Java Class window, in the Name field, type SampApp. Select the public static void main(String[] args) check box. Click Finish.

image019

__8. Replace the default code for this class by copying or typing the following code into the SampApp.java file (you’ll find this code in /opt/ibm/biginsights/bigsql/samples/data/SampApp.java):

[code language="java"]package aJavaPackage4me;

//a. Import required package(s)
import java.sql.*;

public class SampApp {

/**
* @param args
*/

//b. set JDBC & database info
//change these as needed for your environment
static final String db = "jdbc:db2://YOUR_HOST_NAME:51000/bigsql";
static final String user = "YOUR_USER_ID";
static final String pwd = "YOUR_PASSWORD";

public static void main(String[] args) {
Connection conn = null;
Statement stmt = null;
System.out.println("Started sample JDBC application.");

try{
//c. Register JDBC driver — not needed for DB2 JDBC type 4 connection
// Class.forName("com.ibm.db2.jcc.DB2Driver");

//d. Get a connection
conn = DriverManager.getConnection(db, user, pwd);
System.out.println("Connected to the database.");

//e. Execute a query
stmt = conn.createStatement();
System.out.println("Created a statement.");
String sql;
sql = "select product_color_code, product_number from sls_product_dim " +
"where product_key=30001";
ResultSet rs = stmt.executeQuery(sql);
System.out.println("Executed a query.");

//f. Obtain results
System.out.println("Result set: ");
while(rs.next()){
//Retrieve by column name
int product_color = rs.getInt("PRODUCT_COLOR_CODE");
int product_number = rs.getInt("PRODUCT_NUMBER");
//Display values
System.out.print("* Product Color: " + product_color + "\n");
System.out.print("* Product Number: " + product_number + "\n");
}

//g. Close open resources
rs.close();
stmt.close();
conn.close();

}catch(SQLException sqlE){
// Process SQL errors
sqlE.printStackTrace();
}catch(Exception e){
// Process other errors
e.printStackTrace();
}

finally{
// Ensure resources are closed before exiting
try{
if(stmt!=null)
stmt.close();
}catch(SQLException sqle2){
} // nothing we can do

try{
if(conn!=null)
conn.close();
}
catch(SQLException sqlE){
sqlE.printStackTrace();
}// end catch
}// end finally block
System.out.println("Application complete");
}}[/code]

 

__a. After the package declaration, ensure that you include the packages that contain the JDBC classes that are needed for database programming (import java.sql.*;).

__b. Set up the database information so that you can refer to it. Be sure to change the user ID, password, and connection information as needed for your environment.

__c. Optionally, register the JDBC driver. The class name is provided here for your reference. When using the DB2 Type 4 JDBC driver, it’s not necessary to specify the class name.

__d. Open the connection.

__e. Run a query by submitting an SQL statement to the database.

__f. Extract data from the result set.

__g. Clean up the environment by closing all of the database resources.

__9. Save the file and right-click the Java file and click Run > Run as > Java Application.

__10. The results appear in the Console view of Eclipse:

Started sample JDBC application.
Connected to the database.
Created a statement.
Executed a query.
Result set:
* Product Color: 908
* Product Number: 1110
Application complete

 

To find the other tutorials in this series, go to the Overview tutorial.
