Lab 4: Querying structured data

In this lab, you will execute Big SQL queries to investigate data stored in Hadoop.  Big SQL provides broad SQL support based on the ISO SQL standard.  You can issue queries using JDBC or ODBC drivers to access data that is stored in Hadoop in the same way that you access relational databases from your enterprise applications. Multiple queries can be executed concurrently. The SQL query engine supports joins, unions, grouping, common table expressions, windowing functions, and other familiar SQL expressions.
 
This tutorial uses sales data from a fictional company that sells outdoor products to retailers and directly to consumers through a website. The firm maintains its data in a series of FACT and DIMENSION tables, as is common in relational data warehouse environments.  In this lab, you will explore how to create, populate, and query a subset of the star schema database to investigate the company's performance and offerings.  Note that Big SQL provides scripts to create and populate the more than 60 tables that comprise the sample GOSALESDW database. You will use fewer than 10 of these tables in this lab.
 
Prior to starting this lab, you must know how to connect to your Big SQL server and execute SQL from a supported tool.  If necessary, complete the prior lab on JSqsh before proceeding.
 
After you complete the lessons in this module, you will understand how to:
 

  • Create Big SQL tables that use Hadoop text file and Parquet file formats.
  • Populate Big SQL tables from local files and from the results of queries.
  • Query Big SQL tables using projections, restrictions, joins, aggregations, and other popular expressions.
  • Create and query a view based on multiple Big SQL tables.

 
Allow 1.5 hours to complete this lab. Please post questions or comments about this lab to the forum on Hadoop Dev at https://developer.ibm.com/answers?community=hadoop.
 

4.1.          Creating sample tables and loading sample data

 
In this lesson, you will create several sample tables and load data into these tables from local files.
 
__1.          Determine the location of the sample data in your local file system and make a note of it.  You will need to use this path specification when issuing LOAD commands later in this lab.
 

Subsequent examples in this section presume your sample data is in the /usr/ibmpacks/bigsql/4.0/bigsql/samples/data directory. This is the location of the data in typical Big SQL installations.

 
__2.          If necessary, launch your query execution tool (e.g., JSqsh) and establish a connection to your Big SQL server following the standard process for your environment.
 
__3.          Create several tables in your default schema.  Issue each of the following CREATE TABLE statements one at a time, and verify that each completed successfully:

 

-- dimension table for region info
CREATE HADOOP TABLE IF NOT EXISTS go_region_dim
( country_key         INT NOT NULL
, country_code        INT NOT NULL
, flag_image              VARCHAR(45)
, iso_three_letter_code   VARCHAR(9) NOT NULL
, iso_two_letter_code     VARCHAR(6) NOT NULL
, iso_three_digit_code    VARCHAR(9) NOT NULL
, region_key          INT NOT NULL
, region_code         INT NOT NULL
, region_en           VARCHAR(90) NOT NULL
, country_en          VARCHAR(90) NOT NULL
, region_de           VARCHAR(90), country_de   VARCHAR(90), region_fr        VARCHAR(90)
, country_fr          VARCHAR(90), region_ja    VARCHAR(90), country_ja        VARCHAR(90)
, region_cs           VARCHAR(90), country_cs   VARCHAR(90), region_da        VARCHAR(90)
, country_da          VARCHAR(90), region_el    VARCHAR(90), country_el        VARCHAR(90)
, region_es           VARCHAR(90), country_es   VARCHAR(90), region_fi        VARCHAR(90)
, country_fi          VARCHAR(90), region_hu    VARCHAR(90), country_hu        VARCHAR(90)
, region_id           VARCHAR(90), country_id   VARCHAR(90), region_it        VARCHAR(90)
, country_it          VARCHAR(90), region_ko    VARCHAR(90), country_ko        VARCHAR(90)
, region_ms           VARCHAR(90), country_ms   VARCHAR(90), region_nl        VARCHAR(90)
, country_nl          VARCHAR(90), region_no    VARCHAR(90), country_no        VARCHAR(90)
, region_pl           VARCHAR(90), country_pl   VARCHAR(90), region_pt        VARCHAR(90)
, country_pt          VARCHAR(90), region_ru    VARCHAR(90), country_ru        VARCHAR(90)
, region_sc           VARCHAR(90), country_sc   VARCHAR(90), region_sv        VARCHAR(90)
, country_sv          VARCHAR(90), region_tc    VARCHAR(90), country_tc        VARCHAR(90)
, region_th           VARCHAR(90), country_th   VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;

 

-- dimension table tracking method of order for the sale (e.g., Web, fax) 
CREATE HADOOP TABLE IF NOT EXISTS sls_order_method_dim
( order_method_key    INT NOT NULL
, order_method_code   INT NOT NULL
, order_method_en     VARCHAR(90) NOT NULL
, order_method_de     VARCHAR(90)
, order_method_fr    VARCHAR(90)
, order_method_ja     VARCHAR(90)
, order_method_cs    VARCHAR(90)
, order_method_da     VARCHAR(90)
, order_method_el    VARCHAR(90)
, order_method_es     VARCHAR(90)
, order_method_fi    VARCHAR(90)
, order_method_hu     VARCHAR(90)
, order_method_id    VARCHAR(90)
, order_method_it     VARCHAR(90)
, order_method_ko    VARCHAR(90)
, order_method_ms     VARCHAR(90)
, order_method_nl    VARCHAR(90)
, order_method_no     VARCHAR(90)
, order_method_pl    VARCHAR(90)
, order_method_pt     VARCHAR(90)
, order_method_ru    VARCHAR(90)
, order_method_sc     VARCHAR(90)
, order_method_sv    VARCHAR(90)
, order_method_tc     VARCHAR(90)
, order_method_th    VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;

 

-- look up table with product brand info in various languages
CREATE HADOOP TABLE IF NOT EXISTS sls_product_brand_lookup
( product_brand_code INT NOT NULL
, product_brand_en    VARCHAR(90) NOT NULL
, product_brand_de    VARCHAR(90), product_brand_fr   VARCHAR(90)
, product_brand_ja    VARCHAR(90), product_brand_cs   VARCHAR(90)
, product_brand_da    VARCHAR(90), product_brand_el   VARCHAR(90)
, product_brand_es    VARCHAR(90), product_brand_fi   VARCHAR(90)
, product_brand_hu    VARCHAR(90), product_brand_id   VARCHAR(90)
, product_brand_it    VARCHAR(90), product_brand_ko   VARCHAR(90)
, product_brand_ms    VARCHAR(90), product_brand_nl   VARCHAR(90)
, product_brand_no    VARCHAR(90), product_brand_pl   VARCHAR(90)
, product_brand_pt    VARCHAR(90), product_brand_ru   VARCHAR(90)
, product_brand_sc    VARCHAR(90), product_brand_sv   VARCHAR(90)
, product_brand_tc    VARCHAR(90), product_brand_th   VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;

 

-- product dimension table
CREATE HADOOP TABLE IF NOT EXISTS sls_product_dim
( product_key             INT NOT NULL
, product_line_code   INT NOT NULL, product_type_key    INT NOT NULL
, product_type_code   INT NOT NULL
, product_number      INT NOT NULL
, base_product_key    INT NOT NULL
, base_product_number     INT NOT NULL
, product_color_code INT
, product_size_code   INT
, product_brand_key   INT NOT NULL
, product_brand_code INT NOT NULL
, product_image           VARCHAR(60)
, introduction_date   TIMESTAMP
, discontinued_date   TIMESTAMP
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;

 

-- look up table with product line info in various languages
CREATE HADOOP TABLE IF NOT EXISTS sls_product_line_lookup
( product_line_code   INT NOT NULL
, product_line_en     VARCHAR(90) NOT NULL
, product_line_de     VARCHAR(90), product_line_fr    VARCHAR(90)
, product_line_ja     VARCHAR(90), product_line_cs    VARCHAR(90)
, product_line_da     VARCHAR(90), product_line_el    VARCHAR(90)
, product_line_es     VARCHAR(90), product_line_fi    VARCHAR(90)
, product_line_hu     VARCHAR(90), product_line_id    VARCHAR(90)
, product_line_it     VARCHAR(90), product_line_ko    VARCHAR(90)
, product_line_ms     VARCHAR(90), product_line_nl    VARCHAR(90)
, product_line_no     VARCHAR(90), product_line_pl    VARCHAR(90)
, product_line_pt     VARCHAR(90), product_line_ru    VARCHAR(90)
, product_line_sc     VARCHAR(90), product_line_sv    VARCHAR(90)
, product_line_tc     VARCHAR(90), product_line_th    VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

 

-- look up table for products
CREATE HADOOP TABLE IF NOT EXISTS sls_product_lookup
( product_number      INT NOT NULL
, product_language    VARCHAR(30) NOT NULL
, product_name            VARCHAR(150) NOT NULL
, product_description VARCHAR(765)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

 

-- fact table for sales
CREATE HADOOP TABLE IF NOT EXISTS sls_sales_fact
( order_day_key           INT NOT NULL
, organization_key        INT NOT NULL
, employee_key            INT NOT NULL
, retailer_key            INT NOT NULL
, retailer_site_key       INT NOT NULL
, product_key             INT NOT NULL
, promotion_key           INT NOT NULL
, order_method_key        INT NOT NULL
, sales_order_key         INT NOT NULL
, ship_day_key            INT NOT NULL
, close_day_key           INT NOT NULL
, quantity                INT
, unit_cost               DOUBLE
, unit_price              DOUBLE
, unit_sale_price         DOUBLE
, gross_margin            DOUBLE
, sale_total              DOUBLE
, gross_profit            DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
;

 

-- fact table for marketing promotions

CREATE HADOOP TABLE IF NOT EXISTS mrk_promotion_fact
( organization_key    INT NOT NULL
, order_day_key       INT NOT NULL
, rtl_country_key     INT NOT NULL
, employee_key            INT NOT NULL
, retailer_key            INT NOT NULL
, product_key             INT NOT NULL
, promotion_key       INT NOT NULL
, sales_order_key     INT NOT NULL
, quantity                SMALLINT
, unit_cost               DOUBLE
, unit_price              DOUBLE
, unit_sale_price     DOUBLE
, gross_margin            DOUBLE
, sale_total              DOUBLE
, gross_profit            DOUBLE
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

 

Let’s briefly explore some aspects of the CREATE TABLE statements shown here.  If you have a SQL background, the majority of these statements should be familiar to you.  However, after the column specification, there are some additional clauses unique to Big SQL – clauses that enable it to exploit Hadoop storage mechanisms (in this case, Hive).  The ROW FORMAT clause specifies that fields are to be terminated by tabs ('\t') and lines are to be terminated by new line characters ('\n').  The table will be stored in a TEXTFILE format, making it easy for a wide range of applications to work with.  For details on these clauses, refer to the Apache Hive documentation.
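If you want to confirm that the tables now exist in your schema, one option is to query the system catalog. The statement below is a minimal sketch that assumes the DB2-style SYSCAT.TABLES view is available in your Big SQL release; adjust the list of table names as needed.

-- Hedged sketch: confirm a few of the newly created tables exist in the current schema
-- (assumes the SYSCAT.TABLES catalog view is available in your release)
SELECT tabname
FROM syscat.tables
WHERE tabschema = CURRENT SCHEMA
AND tabname IN ('GO_REGION_DIM', 'SLS_SALES_FACT', 'MRK_PROMOTION_FACT');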

 
__4.          Load data into each of these tables using the sample data provided in files. Change the SFTP and file path specifications in each of the following examples to match your environment. Then, one at a time, issue each LOAD statement and verify that the operation completed successfully. LOAD returns a warning message that reports details such as the number of rows loaded.

load hadoop using file url
'sftp://yourID:yourPassword@myhost.ibm.com:22/usr/ibmpacks/bigsql/4.0/bigsql/samples/data/GOSALESDW.GO_REGION_DIM.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE GO_REGION_DIM overwrite;
load hadoop using file url 'sftp://yourID:yourPassword@myhost.ibm.com:22/usr/ibmpacks/bigsql/4.0/bigsql/samples/data/GOSALESDW.SLS_ORDER_METHOD_DIM.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_ORDER_METHOD_DIM overwrite;
load hadoop using file url 'sftp://yourID:yourPassword@myhost.ibm.com:22/usr/ibmpacks/bigsql/4.0/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_BRAND_LOOKUP.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_PRODUCT_BRAND_LOOKUP overwrite;
load hadoop using file url 'sftp://yourID:yourPassword@myhost.ibm.com:22/usr/ibmpacks/bigsql/4.0/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_DIM.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_PRODUCT_DIM overwrite;
load hadoop using file url 'sftp://yourID:yourPassword@myhost.ibm.com:22/usr/ibmpacks/bigsql/4.0/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_LINE_LOOKUP.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_PRODUCT_LINE_LOOKUP overwrite;
load hadoop using file url 'sftp://yourID:yourPassword@myhost.ibm.com:22/usr/ibmpacks/bigsql/4.0/bigsql/samples/data/GOSALESDW.SLS_PRODUCT_LOOKUP.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_PRODUCT_LOOKUP overwrite;
load hadoop using file url 'sftp://yourID:yourPassword@myhost.ibm.com:22/usr/ibmpacks/bigsql/4.0/bigsql/samples/data/GOSALESDW.SLS_SALES_FACT.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE SLS_SALES_FACT overwrite;
load hadoop using file url 'sftp://yourID:yourPassword@myhost.ibm.com:22/usr/ibmpacks/bigsql/4.0/bigsql/samples/data/GOSALESDW.MRK_PROMOTION_FACT.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE MRK_PROMOTION_FACT overwrite;

 

 

 

Let’s explore the LOAD syntax shown in these examples briefly.  Each example loads data into a table using a file URL specification that relies on SFTP to locate the source file.  In particular, the SFTP specification includes a valid user ID and password (yourID/yourPassword), the target host server and port (myhost.ibm.com:22), and the full path of the data file on that system.  The WITH SOURCE PROPERTIES clause specifies that fields in the source data are delimited by tabs ('\t').  The INTO TABLE clause identifies the target table for the LOAD operation.  The OVERWRITE keyword indicates that any existing data in the table will be replaced by data contained in the source file.  (If you wanted to simply add rows to the table's existing content, you could specify APPEND instead.)

Using SFTP (or FTP) is one way in which you can invoke the LOAD command.  If your target data already resides in your distributed file system, you can provide the DFS directory information in your file URL specification.  Indeed, for optimal runtime performance, you may prefer to take that approach.  See the BigInsights Knowledge Center (product documentation) for details.  In addition, you can load data directly from a remote relational DBMS via a JDBC connection; the Knowledge Center includes examples of that.
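For reference, the sketch below shows what a LOAD that reads a file already stored in your distributed file system might look like. The DFS path is hypothetical; substitute the location of your data and confirm the exact URL form for your release in the Knowledge Center.

-- Hedged sketch: load from a file already residing in the DFS
-- (the /tmp/bigsql_lab path is hypothetical)
load hadoop using file url '/tmp/bigsql_lab/GOSALESDW.GO_REGION_DIM.txt'
with SOURCE PROPERTIES ('field.delimiter'='\t')
INTO TABLE GO_REGION_DIM overwrite;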

__5.          Query the tables to verify that the expected number of rows was loaded into each table. Execute each query that follows individually and compare the results with the number of rows specified in the comment line preceding each query.

 

-- total rows in GO_REGION_DIM = 21
select count(*) from GO_REGION_DIM;

-- total rows in sls_order_method_dim = 7
select count(*) from sls_order_method_dim;

-- total rows in SLS_PRODUCT_BRAND_LOOKUP = 28
select count(*) from SLS_PRODUCT_BRAND_LOOKUP;

-- total rows in SLS_PRODUCT_DIM = 274
select count(*) from SLS_PRODUCT_DIM;

-- total rows in SLS_PRODUCT_LINE_LOOKUP = 5
select count(*) from SLS_PRODUCT_LINE_LOOKUP;

-- total rows in SLS_PRODUCT_LOOKUP = 6302
select count(*) from SLS_PRODUCT_LOOKUP;

-- total rows in SLS_SALES_FACT = 446023
select count(*) from SLS_SALES_FACT;

-- total rows in MRK_PROMOTION_FACT = 11034
select count(*) from MRK_PROMOTION_FACT;

4.2.          Querying the data with Big SQL

Now you’re ready to query your tables.   Based on earlier exercises, you’ve already seen that you can perform basic SQL operations, including projections (to extract specific columns from your tables) and restrictions (to extract specific rows that meet conditions you specify), such as the simple example shown below.   Let’s then explore a few examples that are a bit more sophisticated.
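A minimal sketch of a projection combined with a restriction against one of the tables you just loaded follows; the column choices are purely illustrative.

-- projection (specific columns) combined with a restriction (specific rows)
SELECT product_number, introduction_date
FROM sls_product_dim
WHERE discontinued_date IS NOT NULL
FETCH FIRST 10 ROWS ONLY;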
 
In this lesson, you will create and run Big SQL queries that join data from multiple tables as well as perform aggregations and other SQL operations.  Note that the queries included in this section are based on queries shipped with Big SQL client software as samples.
 
__1.          Join data from multiple tables to return the product name, quantity, and order method of goods that have been sold.  For simplicity, limit the number of returned rows to 20. To achieve this, execute the following query:

 

-- Fetch the product name, quantity, and order method of products sold.
-- Query 1
SELECT pnumb.product_name, sales.quantity,
meth.order_method_en
FROM
sls_sales_fact sales,
sls_product_dim prod,
sls_product_lookup pnumb,
sls_order_method_dim meth
WHERE
pnumb.product_language='EN'
AND sales.product_key=prod.product_key
AND prod.product_number=pnumb.product_number
AND meth.order_method_key=sales.order_method_key
fetch first 20 rows only;

 

Let’s review a few aspects of this query briefly:

  • Data from four tables will be used to drive the results of this query (see the tables referenced in the FROM clause).  Relationships between these tables are resolved through 3 join predicates specified as part of the WHERE clause.  The query relies on 3 equi-joins to filter data from the referenced tables.  (Predicates such as prod.product_number=pnumb.product_number help to narrow the results to product numbers that match in both tables.  An equivalent formulation that uses explicit JOIN syntax appears after this list.)
  • For improved readability, this query uses aliases in the SELECT and FROM clauses when referencing tables.  For example, pnumb.product_name refers to "pnumb," which is the alias for the sls_product_lookup table. Once defined in the FROM clause, an alias can be used in the WHERE clause so that you do not need to repeat the complete table name.
  • The predicate pnumb.product_language='EN' further narrows the result to English output only. This database contains thousands of rows of data in various languages, so restricting the language provides some optimization.
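If you prefer explicit join syntax, the same query can be expressed with JOIN ... ON clauses. The sketch below should return the same result as Query 1; only the formulation of the join predicates differs.

-- Query 1 rewritten with explicit JOIN ... ON syntax (equivalent results)
SELECT pnumb.product_name, sales.quantity,
meth.order_method_en
FROM sls_sales_fact sales
JOIN sls_product_dim prod
ON sales.product_key = prod.product_key
JOIN sls_product_lookup pnumb
ON prod.product_number = pnumb.product_number
JOIN sls_order_method_dim meth
ON meth.order_method_key = sales.order_method_key
WHERE pnumb.product_language = 'EN'
FETCH FIRST 20 ROWS ONLY;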

 
__2.          Modify the query to restrict the order method to one type: sales visits. To do so, add the following query predicate just before the FETCH FIRST 20 ROWS clause: AND order_method_en='Sales visit'

 

-- Fetch the product name, quantity, and order method
-- of products sold through sales visits. 
-- Query 2
SELECT pnumb.product_name, sales.quantity,
meth.order_method_en
FROM
sls_sales_fact sales,
sls_product_dim prod,
sls_product_lookup pnumb,
sls_order_method_dim meth
WHERE
pnumb.product_language='EN'
AND sales.product_key=prod.product_key
AND prod.product_number=pnumb.product_number
AND meth.order_method_key=sales.order_method_key
AND order_method_en='Sales visit'
FETCH FIRST 20 ROWS ONLY;

 

__3.          Inspect the results:
 
 
__4.          To find out which sales method accounts for the greatest quantity of orders, include a GROUP BY clause (GROUP BY pll.product_line_en, md.order_method_en). In addition, invoke the SUM aggregate function (sum(sf.quantity)) to total the orders by product and method. Finally, this query cleans up the output a bit by using column aliases (e.g., AS Product) to provide more readable column headers.

 

-- Query 3
SELECT pll.product_line_en AS Product,
md.order_method_en AS Order_method,
sum(sf.QUANTITY) AS total
FROM 
sls_order_method_dim AS md,
sls_product_dim AS pd,
sls_product_line_lookup AS pll,
sls_product_brand_lookup AS pbl,
sls_sales_fact AS sf
WHERE
pd.product_key = sf.product_key
AND md.order_method_key = sf.order_method_key
AND pll.product_line_code = pd.product_line_code
AND pbl.product_brand_code = pd.product_brand_code
GROUP BY pll.product_line_en, md.order_method_en;

 

__5.          Inspect the results, which should contain 35 rows.

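To see at a glance which product line and order method combination accounts for the greatest quantity, you can sort the aggregated results. The following variant is a sketch of Query 3 with an ORDER BY clause added; the result set is otherwise the same.

-- Query 3 with totals sorted in descending order
SELECT pll.product_line_en AS Product,
md.order_method_en AS Order_method,
sum(sf.QUANTITY) AS total
FROM
sls_order_method_dim AS md,
sls_product_dim AS pd,
sls_product_line_lookup AS pll,
sls_product_brand_lookup AS pbl,
sls_sales_fact AS sf
WHERE
pd.product_key = sf.product_key
AND md.order_method_key = sf.order_method_key
AND pll.product_line_code = pd.product_line_code
AND pbl.product_brand_code = pd.product_brand_code
GROUP BY pll.product_line_en, md.order_method_en
ORDER BY total DESC;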

4.3.          Creating and working with views

Big SQL supports views (virtual tables) based on one or more physical tables.  In this section, you will create a view that spans multiple tables. Then you’ll query this view using a simple SELECT statement.  In doing so, you’ll see that you can work with views in Big SQL much as you can work with views in a relational DBMS.
 
__1.          Create a view named MYVIEW that extracts information about product sales featured in marketing promotions.  Note that because the schema name is omitted in both the CREATE VIEW statement and the FROM clause, the current schema (your user name) is assumed.

 

create view myview as
select product_name, sales.product_key, mkt.quantity,
sales.order_day_key, sales.sales_order_key, order_method_en
from
mrk_promotion_fact mkt,
      sls_sales_fact sales,
      sls_product_dim prod,
      sls_product_lookup pnumb,
      sls_order_method_dim meth
where mkt.order_day_key=sales.order_day_key
      and sales.product_key=prod.product_key
      and prod.product_number=pnumb.product_number
      and pnumb.product_language='EN'
      and meth.order_method_key=sales.order_method_key;

 

__2.          Now query the view:

 

select * from myview  
order by product_key asc, order_day_key asc
fetch first 20 rows only;

 

__3.          Inspect the results:
 

 

4.4.          Populating a table with ‘INSERT INTO … SELECT’

Big SQL enables you to populate a table with data based on the results of a query.  In this exercise, you will use an INSERT INTO ... SELECT statement to retrieve data from multiple tables and insert that data into another table.  Executing an INSERT INTO ... SELECT statement exploits the machine resources of your cluster because Big SQL can parallelize both read (SELECT) and write (INSERT) operations.
 
__1.          Execute the following statement to create a sample table named sales_report:

 

-- create a sample sales_report table
CREATE HADOOP TABLE sales_report
(
product_key       INT NOT NULL,
product_name      VARCHAR(150),
quantity          INT,
order_method_en   VARCHAR(90)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

__2.          Now populate the newly created table with results from a query that joins data from multiple tables.

-- populate the sales_report data with results from a query
INSERT INTO sales_report
SELECT sales.product_key, pnumb.product_name, sales.quantity,
meth.order_method_en
FROM
sls_sales_fact sales,
sls_product_dim prod,
sls_product_lookup pnumb,
sls_order_method_dim meth
WHERE
pnumb.product_language='EN'
AND sales.product_key=prod.product_key
AND prod.product_number=pnumb.product_number
AND meth.order_method_key=sales.order_method_key 
and sales.quantity > 1000;

 

__3.          Verify that the previous query was successful by executing the following query:

 

-- total number of rows should be 14441
select count(*) from sales_report;

 

4.5.          Optional:  Storing data in an alternate file format (Parquet)

Until now, you’ve instructed Big SQL to use the TEXTFILE format for storing data in the tables you’ve created.  This format is easy to read (both by people and most applications), as data is stored in a delimited form with one record per line and new lines separating individual records.  It’s also the default format for Big SQL tables.
 
However, if you’d prefer to use a different file format for data in your tables, Big SQL supports several formats popular in the Hadoop environment, including Avro, sequence files, RC (record columnar) and Parquet.  While it’s beyond the scope of this lab to explore these file formats, you’ll learn how you can easily override the default Big SQL file format to use another format — in this case, Parquet.  Parquet is a columnar storage format for Hadoop that’s popular because of its support for efficient compression and encoding schemes.  For more information on Parquet, visit http://parquet.io/.
 
__1.          Create a table named big_sales_parquet.

 

CREATE HADOOP TABLE IF NOT EXISTS big_sales_parquet
( product_key     INT NOT NULL,
product_name       VARCHAR(150),
quantity               INT,
order_method_en    VARCHAR(90)
)
STORED AS parquetfile;

 

With the exception of the final line (which specifies the PARQUETFILE format), all aspects of this statement should be familiar to you by now.

 

__2.          Populate this table with data based on the results of a query.  Note that this query joins data from 4 tables you previously defined in Big SQL using a TEXTFILE format.  Big SQL will automatically reformat the result set of this query into a Parquet format for storage.

 

insert into big_sales_parquet
SELECT sales.product_key, pnumb.product_name, sales.quantity,
meth.order_method_en
FROM
sls_sales_fact sales,
sls_product_dim prod,
sls_product_lookup pnumb,
sls_order_method_dim meth
WHERE
pnumb.product_language='EN'
AND sales.product_key=prod.product_key
AND prod.product_number=pnumb.product_number
AND meth.order_method_key=sales.order_method_key 
and sales.quantity > 5500;

 

__3.          Query the table.  Note that your SELECT statement does not need to be modified in any way because of the underlying file format.

 

select * from big_sales_parquet;

 

__4.          Inspect the results. The query should return 471 rows.
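If you prefer to confirm the row count directly rather than scrolling through the full result set, a simple aggregate query will do it:

-- total number of rows in big_sales_parquet should be 471
select count(*) from big_sales_parquet;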

 

4.6.          Optional:  Working with external tables

The previous exercises in this lab caused Big SQL to store tables in a default location (in the Hive warehouse). Big SQL also supports the concept of an externally managed table – i.e., a table created over a user directory that resides outside of the Hive warehouse. This user directory contains all the table’s data in files.
 
As part of this exercise, you will create a DFS directory, upload data into it, and then create a Big SQL table over this directory.  To satisfy queries, Big SQL will look in the user directory specified when you created the table and consider all files in that directory to be the table's contents.  Once the table is created, you'll query it.
 
__1.          If necessary, open a terminal window.
 
__2.          Check the directory permissions for your DFS.

 

hdfs dfs -ls /

 
If the /user directory is not publicly writable, you will need to change its permissions so that you can create the necessary subdirectories for this lab using your standard lab user account.
 
From the command line, issue this command to switch to the root user ID temporarily:

 

su root

 

When prompted, enter the password for this account. Then switch to the hdfs ID.

 

su hdfs

 

While logged in as user hdfs, issue this command:

 

hdfs dfs -chmod 777 /user

 

Next, confirm the effect of your change:

 

hdfs dfs -ls /

 

 
Exit the hdfs user account:

 

exit

 

Finally, exit the root user account and return to your standard user account:

 

exit

 

__3.          Create directories in your distributed file system for the source data files and ensure public read/write access to these directories. (If desired, alter the DFS information as appropriate for your environment.)

 

hdfs dfs -mkdir /user/bigsql_lab
hdfs dfs -mkdir /user/bigsql_lab/sls_product_dim
hdfs dfs -chmod -R 777 /user/bigsql_lab

 

__4.          Upload the source data file into its DFS directory.  Change the local and DFS directory information below to match your environment.

 

hdfs dfs -copyFromLocal /your-dir/data/GOSALESDW.SLS_PRODUCT_DIM.txt /user/bigsql_lab/sls_product_dim/SLS_PRODUCT_DIM.txt

 

__5.          List the contents of the DFS directory into which you copied the file to validate your work.

 

hdfs dfs -ls /user/bigsql_lab/sls_product_dim

 
__6.          From your query execution environment (such as JSqsh), create an external Big SQL table for the sales product dimension (sls_product_dim_external).  Note that the LOCATION clause references the DFS directory into which you copied the sample data.

 

-- product dimension table stored in a DFS directory external to Hive
CREATE EXTERNAL HADOOP TABLE IF NOT EXISTS sls_product_dim_external
 ( product_key INT NOT NULL
, product_line_code INT NOT NULL
, product_type_key INT NOT NULL
, product_type_code INT NOT NULL
, product_number INT NOT NULL
, base_product_key INT NOT NULL
, base_product_number INT NOT NULL
, product_color_code INT
, product_size_code INT
, product_brand_key INT NOT NULL
, product_brand_code INT NOT NULL
, product_image VARCHAR(60)
, introduction_date TIMESTAMP
, discontinued_date TIMESTAMP
)                
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
location '/user/bigsql_lab/sls_product_dim';

 

__7.          Query the table.

 

select product_key, introduction_date from sls_product_dim_external
where discontinued_date is not null
fetch first 20 rows only;

 

__8.          Inspect the results.

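Because this external table is backed by the same data file you loaded into sls_product_dim earlier, a quick sanity check is to count its rows; the result should match the 274 rows you verified in section 4.1.

-- total rows in sls_product_dim_external should be 274
select count(*) from sls_product_dim_external;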

 

4.7.          Optional:  Creating and querying the full sample database

Big SQL ships with sample SQL scripts for creating, populating, and querying more than 60 tables.  These tables are part of the GOSALESDW schema — a schema that differs from the one used in this lab.  (You created tables in the default schema – i.e., your user ID’s schema. Since the JSqsh connection you created earlier used the bigsql ID, your schema was BIGSQL.)

If desired, use standard Linux operating system facilities to inspect the SQL scripts and sample data for the GOSALESDW schema in the samples directory. By default, this location is   /usr/ibmpacks/bigsql/4.0/bigsql/samples. Within this directory, you’ll find subdirectories containing (1) the full sample data for the GOSALESDW tables and (2) a collection of SQL scripts for creating, loading, and querying these tables.  Feel free to use the supplied scripts to create all GOSALESDW tables, load data into these tables, and query these tables.  Note that you may need to modify portions of these scripts to match your environment.
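If you do create the full GOSALESDW schema, you can either qualify table names with the schema or switch your current schema before querying. The statements below are a sketch; verify the exact syntax supported by your release in the Knowledge Center.

-- Hedged sketch: make GOSALESDW the default schema for subsequent statements
SET SCHEMA GOSALESDW;

-- or qualify the table name explicitly
SELECT count(*) FROM gosalesdw.sls_sales_fact;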
