Lab 5: Working with Complex Data Types

 
The previous lab on Querying Structured Data demonstrated how you can create and populate Big SQL tables with columns of primitive data types, such as INT, VARCHAR, and so on.  Big SQL also supports columns based on complex data types, specifically:

  • ARRAY, an ordered collection of values of the same data type.
  • MAP, a collection of key-value pairs. MAPs are also known as associative arrays.
  • STRUCT, a collection of fields of various data types. A STRUCT type is also known as a ROW type.

 
This lab introduces you to Big SQL’s support for complex types, employing Hive-compatible syntax in its exercises. (For additional syntax options, see the product Knowledge Center.) To help you readily understand how to work with complex types, you will create tables that are slight variations of some of the tables you created in the previous lab on Querying Structured Data.
 
Prior to starting this lab, you must know how to connect to your Big SQL server and execute SQL from a supported tool. If necessary, complete the earlier lab on JSqsh before proceeding.
 
After you complete all the lessons in this module, you will understand how to:

  • Create Big SQL tables that contain columns of ARRAY, MAP, and STRUCT types.
  • Populate these tables with data using LOAD and INSERT INTO . . . SELECT statements.
  • Query these tables using projection and restriction operators.

 
Allow ½ – 1 hour to complete this lab. Please post questions or comments about this lab to the forum on Hadoop Dev at https://developer.ibm.com/answers?community=hadoop.
 

5.1.          Working with the ARRAY type

 
In this lesson, you will explore Big SQL’s ARRAY type, which stores values of the same data type as an ordered collection in a single column of a table. First, you will create a variation of the sls_product_line_lookup table that you created earlier. Your new table will include 2 columns: an integer column for the product line code and an array column of varying-length character strings for product line descriptions in 3 different languages. Next, you’ll load data into this table from a file and query the table.
 
__1.          If desired, examine the definition of the sls_product_line_lookup table that you created in the Querying Structured Data lab to get some context for the table you will create in this exercise.
 
__2.          In your local file system, create a sample data file named product_line_array.csv using your favorite editor. Instructions in this lab are based on the vi editor.

vi product_line_array.csv

In vi, enter insert mode. Type

i

 
Cut and paste the following content into the file:

991|Camping Equipment,Campingausrüstung,Matériel de camping
992|Mountaineering Equipment,Bergsteigerausrüstung,Matériel de montagne
993|Personal Accessories,Accessoires,Accessoires personnels
994|Outdoor Protection,Outdoor-Schutzausrüstung,Articles de protection
995|Golf Equipment,Golfausrüstung,Matériel de golf

For simplicity, this sample data represents a subset of the data contained in the sls_product_line_lookup table.

To exit the file, press Esc and then enter

:wq

__3.          Inspect the contents of your file, ignoring any special language-specific characters that do not display properly from the shell:

cat product_line_array.csv

 
__4.          Launch your query execution tool (e.g., JSqsh) and establish a connection to your Big SQL server following the standard process for your environment.

 

__5.          Create the sls_product_line_lookup_array table, specifying a single ARRAY column for all of the product descriptions:

 

create hadoop table if not exists sls_product_line_lookup_array
(product_line_code      INT NOT NULL,
description             ARRAY<VARCHAR(50)>);

__6.          Load the sample data from your product_line_array.csv file into this table.  Adjust the FILE URL specification below as needed to match your environment, including the ID, password, host name, and directory information.

 

LOAD HADOOP USING FILE URL 
'sftp://yourID:yourPassword@myhost.ibm.com:22/yourDir/product_line_array.csv'
WITH SOURCE PROPERTIES ('field.delimiter'='|', 'collection.items.delimiter'=',')
INTO TABLE sls_product_line_lookup_array OVERWRITE;

 

Most of the LOAD HADOOP statement shown here should be familiar to you by now, so let’s concentrate on the SOURCE PROPERTIES clause. The properties specified in this clause must correspond to the characteristics of your source data file. For example, the field.delimiter property of this LOAD HADOOP statement indicates that the source file contains a pipe character (“|”) to separate field (column) values. The collection.items.delimiter property indicates how each element of the complex data type (an array, in this case) is delineated: in this example, by a comma. Consult the product Knowledge Center for additional information about using LOAD HADOOP with complex data types.
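To build intuition for how the two delimiters interact, here is a small Python sketch (illustrative only; LOAD HADOOP performs this parsing for you on the server) that splits one line of the sample file into a column value and an array:

```python
def parse_array_row(line, field_delim="|", item_delim=","):
    """Split one source line into (product_line_code, description_array),
    mimicking the field.delimiter and collection.items.delimiter
    properties of the LOAD HADOOP statement."""
    code, items = line.rstrip("\n").split(field_delim)
    return int(code), items.split(item_delim)

code, descriptions = parse_array_row(
    "991|Camping Equipment,Campingausrüstung,Matériel de camping")
# Note: Big SQL references array elements starting at 1 (description[1]),
# while Python lists are 0-based, so descriptions[0] is the English text.
print(code, descriptions[0])
```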

 

__7.          Inspect the informational message that appears after the LOAD operation completes.  Verify that the number of rows loaded into this table was 5 and that 0 rows were skipped.
 
__8.          Query the table, retrieving only the product line code, English description, and German description for product lines with English descriptions that begin with the word “Personal”:

select product_line_code, description[1] as English, description[2] as German
from sls_product_line_lookup_array
where description[1] like 'Personal%';

 

Note that this query effectively “flattens” the array’s content by identifying specific elements of interest and including each in a separate column of the result set. In this case, the query retrieves the first element of the array (which contains the product line’s English description) and the second element of the array (which contains the product line’s German description).
 
__9.          Inspect the results:
 
 
__10.       Attempt to query the table using improper syntax for the array column:
 

select * from sls_product_line_lookup_array;

 
__11.       Inspect the error message that appears.  When you include a complex column in the SELECT list or WHERE clause of a query, you must specify the individual element or field of interest.
 
 

5.2.          Working with the MAP type

 
Now that you know how to work with ARRAYs, it’s time to explore the MAP type. In this lesson, you will revisit the design of the product line table that you just created and implement a slightly different model. Instead of storing descriptions in an array, you’ll use key-value pairs in a MAP column. One advantage of this design is that you won’t need to remember the order in which the language descriptions are stored in your column. Instead of querying for description[1] to get data in English, you’ll be able to query the data using a more intuitive key value.
 
__1.          In your local file system, create a sample data file named product_line_map.csv using your favorite editor. Instructions in this lab are based on the vi editor.

vi product_line_map.csv

In vi, enter insert mode. Type

i

Cut and paste the following content into the file:

991|EN#Camping Equipment,DE#Campingausrüstung,FR#Matériel de camping
992|EN#Mountaineering Equipment,DE#Bergsteigerausrüstung,FR#Matériel de montagne
993|EN#Personal Accessories,DE#Accessoires,FR#Accessoires personnels
994|EN#Outdoor Protection,DE#Outdoor-Schutzausrüstung,FR#Articles de protection
995|EN#Golf Equipment,DE#Golfausrüstung,FR#Matériel de golf

Note that the data values shown here are the same as those used in the previous example. However, an additional delimiter appears. Specifically, the “#” symbol separates the key from the data value of each key-value pair.
 
To exit the file, press Esc and then enter

:wq

__2.          Inspect the contents of your file, ignoring any special language-specific characters that do not display properly from the shell:

cat product_line_map.csv
 

__3.          Launch your query execution tool (e.g., JSqsh) and establish a connection to your Big SQL server following the standard process for your environment.

 

__4.          Create the sls_product_line_lookup_map table, specifying a MAP column for the product line descriptions:

 

create hadoop table if not exists sls_product_line_lookup_map
(product_line_code      INT NOT NULL,
description             MAP<VARCHAR(2),VARCHAR(50)>);

__5.          Load the sample data from your product_line_map.csv file into this table.  Adjust the FILE URL specification below as needed to match your environment.

 

LOAD HADOOP USING FILE URL 
'sftp://yourID:yourPassword@myhost.ibm.com:22/yourDir/product_line_map.csv'
WITH SOURCE PROPERTIES ('field.delimiter'='|', 'collection.items.delimiter'=',', 'map.keys.delimiter'='#')
INTO TABLE sls_product_line_lookup_map OVERWRITE;

 

Note that the SOURCE PROPERTIES clause in this example sets the map.keys.delimiter property to the “#” symbol. If you inspect the contents of the sample data file, you’ll note that this symbol separates the key and data values of the key-value pairs to be included in the MAP column. The field.delimiter and collection.items.delimiter properties are also set to values appropriate for the sample file.
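As with the ARRAY example, the three delimiters can be sketched in Python terms (illustrative only; LOAD HADOOP does this parsing server-side) to show how one source line becomes a column value plus a key-value map:

```python
def parse_map_row(line, field_delim="|", item_delim=",", key_delim="#"):
    """Split one source line into (product_line_code, description_map),
    mimicking the field.delimiter, collection.items.delimiter, and
    map.keys.delimiter properties of the LOAD HADOOP statement."""
    code, items = line.rstrip("\n").split(field_delim)
    pairs = (item.split(key_delim, 1) for item in items.split(item_delim))
    return int(code), {key: value for key, value in pairs}

code, descriptions = parse_map_row(
    "993|EN#Personal Accessories,DE#Accessoires,FR#Accessoires personnels")
# descriptions['EN'] plays the same role as description['EN'] in the
# Big SQL query below: look up by key instead of by position.
print(code, descriptions["EN"])
```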

 

__6.          Inspect the informational message that appears after the LOAD operation completes.  Verify that the number of rows loaded into this table was 5 and that 0 rows were skipped.

 

__7.          Query the table, retrieving only the product line code, English description, and German description for product lines with English descriptions that begin with the word “Personal”:

 

select product_line_code, description['EN'] as English, description['DE'] as German
from sls_product_line_lookup_map
where description['EN'] like 'Personal%';

 

Compare this query to its equivalent in the previous exercise, which modeled the description data in an ARRAY column. Here, you retrieve the English description of a product line by referencing the key contained in the complex column.

 

__8.          Inspect the results.

 


 

__9.          Attempt to query the table using improper syntax.
 

select * from sls_product_line_lookup_map;

 
Once again, an error message appears. Recall that a MAP is an associative array, so you must specify the map key of interest when referencing the complex column in the SELECT list or WHERE clause of your query.
 
 

5.3.          Working with the STRUCT type

 
In this lesson, you will explore Big SQL’s support for the STRUCT or ROW type. As the name implies, this type enables you to store a structured collection of different data types in a single column.

 

In this lab module, you will create a variation of the go_region_dim table that you created in an earlier lab.  Your new table will include 3 columns:  an integer column for the country key, an integer column for the country code, and a STRUCT column for the collection of ISO codes that were represented as separate columns in the original table.

 

__1.         If desired, examine the definition of the go_region_dim table that you created in the Querying Structured Data lab to get some context for the table you will create in this exercise.

 

__2.         In your local file system, create a sample data file named go_region_dim_struct.csv using your favorite editor. Instructions in this lab are based on the vi editor.

vi go_region_dim_struct.csv

 
In vi, enter insert mode. Type
 

i

 
Cut and paste the following content into the file:
 

90001,1003,USA|US|840
90002,1004,CAN|CA|124
90003,1020,MEX|MX|484

To exit the file, press Esc and then enter

:wq

 
__3.          Inspect the contents of your file.

cat go_region_dim_struct.csv

 
__4.          Launch your query execution tool (e.g., JSqsh) and establish a connection to your Big SQL server following the standard process for your environment.
 
__5.          Create the go_region_dim_struct table, specifying a single STRUCT column for all the ISO codes.  Note that the first 2 elements of this column are VARCHARs, while the final element is an INT.

create hadoop table if not exists go_region_dim_struct
(country_key      INT NOT NULL,
country_code      INT NOT NULL,
isocodes          STRUCT<letter3: VARCHAR(3), letter2: VARCHAR(2), digit: INT>
);
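For intuition, the following Python sketch (not part of the lab; LOAD HADOOP handles this for you) shows how one source line maps onto this table's columns, using a NamedTuple as a stand-in for the STRUCT. Note that in this file the delimiter roles are reversed relative to the earlier examples: the comma separates columns, and the pipe separates the STRUCT's fields.

```python
from typing import NamedTuple

class IsoCodes(NamedTuple):
    """Stand-in for STRUCT<letter3: VARCHAR(3), letter2: VARCHAR(2), digit: INT>."""
    letter3: str
    letter2: str
    digit: int

def parse_struct_row(line, field_delim=",", item_delim="|"):
    """Split one source line into (country_key, country_code, isocodes),
    mimicking the field.delimiter and collection.items.delimiter
    properties of the LOAD HADOOP statement for this table."""
    key, code, codes = line.rstrip("\n").split(field_delim)
    letter3, letter2, digit = codes.split(item_delim)
    return int(key), int(code), IsoCodes(letter3, letter2, int(digit))

key, code, isocodes = parse_struct_row("90002,1004,CAN|CA|124")
# isocodes.letter2 corresponds to isocodes.letter2 in the Big SQL query.
print(key, isocodes.letter2, isocodes.digit)
```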

 
__6.          Load the sample data from your go_region_dim_struct.csv file into this table.  Adjust the FILE URL specification below as needed to match your environment.
 

LOAD HADOOP USING FILE URL 
'sftp://yourID:yourPassword@myhost.ibm.com:22/yourDir/go_region_dim_struct.csv'
WITH SOURCE PROPERTIES ('field.delimiter'=',','collection.items.delimiter'='|')
INTO TABLE go_region_dim_struct OVERWRITE;

 
__7.          Inspect the informational message that appears after the LOAD operation completes.  Verify that the number of rows loaded into this table was 3 and that 0 rows were skipped.
 
 
__8.          Query the table, retrieving only the country key, 2-letter ISO code, and ISO digit for the country with a 3-letter ISO code of “CAN” (Canada):

select country_key, isocodes.letter2 as l2, isocodes.digit as digit 
from go_region_dim_struct where isocodes.letter3 = 'CAN';

 
__9.          Inspect the results.
 
 
__10.       Attempt to query the table using improper syntax for the STRUCT column.
 

select * from go_region_dim_struct;

 

__11.       Inspect the error message that appears. As you’ve probably learned by now, when you query a complex column, you must specify the individual element or field of interest.
 
 

5.4.          Optional:  Exploring additional scenarios

 
This optional module explores two additional scenarios involving complex types.  The first scenario involves a multi-table join that retrieves data from columns containing complex data types.  The second scenario involves using a query result set to populate a table that contains a complex type column.

 

Before beginning this lab, verify that you have completed previous exercises for creating and populating the following tables:  go_region_dim_struct, sls_product_line_lookup_map, mrk_promotion_fact, and sls_product_dim.   The first two tables just mentioned are part of this lab on complex data types. The final two tables are part of an earlier lab on Querying Structured Data.

 

__1.          If necessary, launch your query execution tool (e.g., JSqsh) and establish a connection to your Big SQL server following the standard process for your environment.

 

__2.          Execute the following query, which joins data from multiple tables to obtain information about the results of marketing promotions and their associated product lines in countries with an ISO digit code of < 500. (Given the sample data that you loaded earlier into the go_region_dim_struct table, those countries are Canada and Mexico.)

 

select rs.country_key, rs.isocodes.letter3 as iso3code,
m.order_day_key, m.retailer_key, m.product_key,
pl.product_line_code, pl.description['EN'] as Eng_Desc,
m.promotion_key, m.sales_order_key, m.sale_total, m.gross_profit
from go_region_dim_struct rs, mrk_promotion_fact m,
sls_product_dim p, sls_product_line_lookup_map pl
where rs.country_key = m.rtl_country_key
and p.product_key = m.product_key
and p.product_line_code = pl.product_line_code
and rs.isocodes.digit < 500
fetch first 10 rows only;

Note that the SELECT list and WHERE clause of this query reference columns of both primitive and complex data types.

 

__3.          Inspect the results.


 

__4.          Next, explore your ability to use an INSERT … SELECT statement to populate a table containing columns of primitive and complex types. To begin, create a sample table named mycomplextest.

 

create hadoop table mycomplextest
(country_key        int,
isocodes            struct<letter3: varchar(3), letter2: varchar(2), digit: int>,
order_day_key       int,
retailer_key        int,
product_key         int,
product_line_code   int,
product_EN_desc     varchar(50)
);

 

Note that this table’s definition is very similar to the structure of the query result set you just retrieved. For simplicity, it omits some of the primitive columns. However, it does include one complex column of type STRUCT for all forms of ISO codes.

 

__5.          Populate this table with data from a query.

insert into mycomplextest
select rs.country_key, rs.isocodes,
m.order_day_key, m.retailer_key, m.product_key,
pl.product_line_code, pl.description['EN']
from go_region_dim_struct rs, mrk_promotion_fact m, sls_product_dim p,
sls_product_line_lookup_map pl
where rs.country_key = m.rtl_country_key
and p.product_key = m.product_key
and p.product_line_code = pl.product_line_code
and rs.isocodes.digit < 500
fetch first 10 rows only;

 

 

Examine this INSERT statement, noting the columns in the SELECT list. As you can see, rs.isocodes refers to a column of type STRUCT in the go_region_dim_struct table. Thus, the entire contents of this complex column will be inserted into the mycomplextest.isocodes column. (Because the SELECT list is used in the context of an INSERT statement, it’s acceptable to reference the entire rs.isocodes column without further qualification; that is, you don’t need to reference only a specific element or field of this column, as you would need to do if this SELECT statement weren’t part of a larger INSERT statement.)

Note also that the SELECT list references the pl.description column, which was defined as type MAP in the sls_product_line_lookup_map table. In this case, rather than inserting the entire contents of this MAP column into the mycomplextest table, the query will extract only the value for the ‘EN’ (English) key. This was done simply to illustrate that Big SQL’s INSERT . . . SELECT support enables you to work with the full content of a complex data type column or only specific fields or elements of a complex column.
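The distinction can be sketched as a Python analogy (illustrative only; the variable names below are hypothetical stand-ins, not Big SQL objects): copying a whole complex value versus projecting a single element of it.

```python
# rs_isocodes stands in for the STRUCT column rs.isocodes;
# pl_description stands in for the MAP column pl.description.
rs_isocodes = {"letter3": "CAN", "letter2": "CA", "digit": 124}
pl_description = {"EN": "Personal Accessories", "DE": "Accessoires"}

# SELECT rs.isocodes inside an INSERT: the whole complex value is
# copied into the target's STRUCT column ...
inserted_struct = dict(rs_isocodes)

# ... while SELECT pl.description['EN'] extracts a single element,
# which lands in a plain VARCHAR column of the target table.
inserted_varchar = pl_description["EN"]
```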

 

__6.          Query the table:

 

select country_key, isocodes.letter3 as iso3code, order_day_key,
retailer_key, product_key, product_line_code, product_EN_desc
from mycomplextest;

 

__7.          Inspect the results:
 
 
As you might imagine, there’s more to Big SQL complex types than this lab covers.  Consult the product documentation (Knowledge Center) for further details.
