Using Big SQL to create and query HBase tables
Although HBase provides useful data storage and retrieval capabilities, it lacks a rich query language. Fortunately, IBM’s Big SQL technology, a component of BigInsights, enables programmers to use industry-standard SQL to create, populate, and query Big SQL tables managed by HBase. Native HBase data retrieval operations (such as GET and SCAN) can be performed on these tables, too. This lab introduces you to the basics of using Big SQL with HBase. Later, you’ll have a chance to explore additional topics, such as data modeling.
Prior to starting this lab, you must have access to a BigInsights 4.0 environment with Big SQL running. In addition, you need to know how to connect to your Big SQL server and how to issue a Big SQL query using JSqsh or another query tool. If necessary, consult the Getting Started with Big SQL 4.0 lab (http://www.slideshare.net/CynthiaSaracco/big-sql40-hol) or the BigInsights product documentation for details before proceeding.
After you complete the lessons in this module, you will understand how to:
- Create Big SQL tables that use HBase as the underlying storage mechanism.
- Populate these tables using Big SQL INSERT and LOAD operations.
- Query these tables using projections and restrictions.
- Create views over Big SQL tables managed by HBase and query these views.
Allow 30 minutes to complete this lab.¬† Please post questions or comments about this lab or the technologies it describes to the forum on Hadoop Dev at https://developer.ibm.com/hadoop/.
To get started, you’ll create a Big SQL table named bigsqllab.reviews to capture information about product reviews. In contrast to the previous reviews table that you created natively in HBase, this Big SQL table will have a pre-defined set of columns just like any other SQL-based table. Moreover, your bigsqllab.reviews table definition will also specify how to map these SQL columns to HBase table attributes — specifically, a row key and columns in specific column families.
__1.¬†¬†¬†¬†¬†¬† If necessary, refresh your memory about the HBase reviews table created in an earlier lab. Recall that this table tracked reviews that users posted about various products. The table had 3 column families — summary, reviewer, and And, of course, it had a row key.
__2.¬†¬†¬†¬†¬† Launch your Big SQL query execution environment and connect to your Big SQL database following the standard process appropriate for your query environment. For example, if you’re using JSqsh and you previously created a Big SQL database connection named bigsql, you might enter
Modify this sample command as needed to match your environment.
__3. Create a Big SQL table named bigsqllab.reviews managed by HBase.
CREATE HBASE TABLE IF NOT EXISTS BIGSQLLAB.REVIEWS (
REVIEWID ¬†¬† ¬†¬†¬†¬†¬† varchar(10) primary key not null,
REVIEWERLOC ¬†¬†¬†¬†¬† varchar(30),
TIP¬†¬†¬†¬†¬†¬†¬†¬† ¬†¬†¬†¬†¬† varchar(100)
key¬†¬†¬†¬†¬†¬†¬†¬† mapped by (REVIEWID),
summary:product¬†¬† mapped by (PRODUCT),
summary:rating mapped by (RATING),
reviewer:name¬†¬†¬†¬† mapped by (REVIEWERNAME),
reviewer:location mapped by (REVIEWERLOC),
details:comment¬†¬† mapped by (COMMENT),
details:tip ¬†¬† mapped by (TIP)
|About this CREATE HBASE TABLE statement . . . .¬†¬† This statement creates a Big SQL table named REVIEWS in the BIGSQLLAB schema and instructs Big SQL to use HBase as the underlying storage manager. The COLUMN MAPPING clause specifies how SQL columns are to be mapped to HBase columns in column families. For example, the SQL REVIEWERNAME column is mapped to the HBase column family:column of ‘reviewer:name’.For simplicity, this example uses a 1:1 mapping of SQL columns to HBase columns. It also uses a single SQL column as the HBase row key. ¬† As you’ll see in a subsequent lab, you may want (or need) to map multiple SQL columns to one HBase column or to the HBase row key. Big SQL supports doing so.Also for simplicity, we’ve accepted various defaults for this table, including default HBase column family property settings (for VERSIONS and others) and binary encoding for storage.
Finally, note that the SQL REVIEWID column was defined as the SQL primary key. This is an informational constraint that can be useful for query optimization. However, it isn’t actively enforced.
__4.¬†¬†¬†¬†¬† If necessary, open a terminal window.
Use Hadoop file system commands to verify that the table was created. List the contents of the HBase data directory and confirm that a reviews subdirectory exists.
hdfs dfs -ls /apps/hbase/data/data/default
|The root HBase directory is determined at installation. The examples in this lab were developed for an environment in which HBase was configured to store data in /apps/hbase/data/data/default.¬†¬† If your HBase configuration is different, adjust the commands as needed to match your environment.|
__5.¬†¬†¬†¬† List the contents of your table’s subdirectory. Observe that this subdirectory contains another subdirectory with a system-generated name. (In the screen capture below, this is shown in the final line.)
hdfs dfs -ls /apps/hbase/data/data/default/bigsqllab.reviews
__6.¬†¬†¬†¬†¬† List the contents of the . . . /bigsqllab.reviews subdirectory with the system-generated name. (In the example above, this is the . . . /22de6a42d868c010cf6d6f5714ac5b90 ) Adjust the path specification below to match your environment.
hdfs dfs -ls /apps/hbase/data/data/default/bigsqllab.reviews/22de6a42d868c010cf6d6f5714ac5b90
Note that there are additional subdirectories for each of the 3 column families specified in the COLUMN MAPPING clause of your CREATE HBASE TABLE statement.
__7.¬†¬†¬†¬† Return to your Big SQL execution environment.
__8.¬†¬†¬†¬† Insert a row into your Big SQL reviews table.
insert into bigsqllab.reviews
values ('198','scarf','2','Bruno',null,'Feels cheap',null);
- Note that this INSERT statement looks the same as any other SQL statement. You didn’t need to insert one HBase cell value at a time, and you didn’t need to understand the underlying HBase table structure.
__9.¬†¬†¬†¬† Insert another row into your table, specifying values for only a subset of its columns.
insert into bigsqllab.reviews (reviewid, product, rating, reviewername)
__10.¬†¬†¬†¬† Use SQL to count the number of rows stored in your table, verifying that 2 are present.
select count(*) from bigsqllab.reviews;
__11.¬†¬†¬†¬† Execute a simple query to return specific information about reviews of products rated 3 or higher.
select reviewid, product, reviewername, reviewerloc
where rating >= 3;
As expected, only 1 row is returned.
Again, note that your SELECT statement looks like any other SQL SELECT — you didn’t need to add any special code because the underlying storage manager is HBase.
__12.¬†¬†¬†¬† Launch the HBase shell and verify that you can work directly with the reviews table from the shell. To do so, scan the table.
As you would expect, the final line of output reports that there are 2 rows in your table. In case you’re curious, \x00 is both the non-null marker and also the terminator used for variable length binary encoded values.
You can create views over Big SQL tables stored in HBase just as you can create views over Big SQL tables that store data in the Hive warehouse or in simple DFS files. Creating views over Big SQL HBase tables is straightforward, as you’ll see in this exercise.
__1.¬†¬†¬†¬† From your Big SQL query execution environment, create a view based on a subset of the reviews table that you created earlier.
create view bigsqllab.testview as
select reviewid, product, reviewername, reviewerloc
where rating >= 3;
Note that this view definition looks like any other SQL view definition. You didn’t need to specify any syntax unique to Hadoop or HBase.
__2.¬†¬†¬†¬† Query the view.
select reviewid, product, reviewername
In this exercise, you’ll explore how to use the Big SQL LOAD command to load data from a file into a Big SQL table managed by HBase. To do so, you will use sample data shipped with Big SQL that is typically installed with Big SQL client software. By default, this data is installed at /usr/ibmpacks/bigsql/4.0/bigsql/samples/data. It is also available online at https://hub.jazz.net/project/jayatheerthan/BigSQLSamples/overview#https://hub.jazz.net/git/jayatheerthan%252FBigSQLSamples/list/master/samples/data.
The sample data represents data exported from a data warehouse that tracks sales of outdoor products. It includes a series of FACT and DIMENSION tables. For this lab, you will create 1 DIMENSION table and load sample data from 1 file into it.
__1. ¬†¬† Create a Big SQL table in HBase named sls_product_dim.
-- product dimension table
CREATE HBASE TABLE IF NOT EXISTS bigsqllab.sls_product_dim
( product_key INT PRIMARY KEY NOT NULL
, product_line_code INT NOT NULL
, product_type_key INT NOT NULL
, product_type_code INT NOT NULL
, product_number INT NOT NULL
, base_product_key INT NOT NULL
, base_product_number INT NOT NULL
, product_color_code INT
, product_size_code INT
, product_brand_key INT NOT NULL
, product_brand_code INT NOT NULL
, product_image VARCHAR(60)
, introduction_date TIMESTAMP
, discontinued_date TIMESTAMP
key¬†¬†¬†¬†¬†¬†¬†¬† mapped by (PRODUCT_KEY),
data:line_code¬†¬†¬†¬†¬†¬†¬†¬†¬† mapped by (PRODUCT_LINE_CODE),
data:type_key¬†¬†¬†¬†¬†¬†¬†¬†¬†¬† mapped by (PRODUCT_TYPE_KEY),
data:type_code¬†¬†¬†¬†¬†¬†¬†¬†¬† mapped by (PRODUCT_TYPE_CODE),
data:number¬†¬†¬†¬†¬†¬† mapped by (PRODUCT_NUMBER),
data:base_key¬†¬†¬†¬†¬†¬†¬†¬†¬†¬† mapped by (BASE_PRODUCT_KEY),
data:base_number¬† mapped by (BASE_PRODUCT_NUMBER),
data:color¬†¬†¬†¬†¬†¬†¬† mapped by (PRODUCT_COLOR_CODE),
data:size¬†¬†¬†¬†¬†¬†¬†¬† mapped by (PRODUCT_SIZE_CODE),
data:brand_key¬†¬†¬†¬†¬†¬†¬†¬†¬† mapped by (PRODUCT_BRAND_KEY),
data:brand_code¬†¬†¬†¬†¬†¬†¬†¬† mapped by (PRODUCT_BRAND_CODE),
data:image¬†¬†¬†¬†¬†¬†¬† mapped by (PRODUCT_IMAGE),
data:intro_date¬†¬†¬†¬†¬†¬†¬†¬† mapped by (INTRODUCTION_DATE),
data:discon_date¬† mapped by (DISCONTINUED_DATE)
|HBase design for this table¬†¬† The HBase specification of this Big SQL statement placed nearly all SQL columns into a single HBase column family named ‘data’. As you know, HBase creates physical files for each column family; this is something you should take into consideration when designing your table’s structure. It’s often best to keep the number of column families per table small unless your workload involves many queries over mutually exclusive columns. For example, if you knew that PRODUCT_IMAGE data was rarely queried or frequently queried alone, you might want to store it separately (i.e., in a different column family:column).There’s something else worth noting about this table definition: the HBase columns have shorter names than the SQL columns. As you saw in earlier exercises, HBase stores full key information (row key, column family name, column name, and timestamp) with the key value. This consumes disk space. If you keep the HBase column family and column names short and specify longer, more meaningful names for the SQL columns, your design will minimize disk consumption and remain friendly to SQL programmers.|
__2.¬†¬†¬†¬† Load data into the table. Change the SFTP and file path specifications shown below to match your environment. ¬†This statement will return a warning message providing details on the number of rows loaded, etc.
LOAD HADOOP using file url
'sftp://yourID:yourPassword@rvm.svl.ibm.com:22/your-dir/data/GOSALESDW.SLS_PRODUCT_DIM.txt' with SOURCE PROPERTIES ('field.delimiter'='\t') INTO TABLE bigsqllab.sls_product_dim;
|A closer look at LOAD . . . .¬†¬†¬†¬† Let‚Äôs explore the LOAD syntax briefly. This statement loads data into a table using a file URL specification that relies on SFTP to locate the source file. In particular, the SFTP specification includes a valid user ID and password (yourID/yourPassword), the target host server and port (rvm.svl.ibm.com:22), and the full path of the data file on that system. The WITH SOURCE PROPERTIES clause specifies that fields in the source data are delimited by tabs (\t). The INTO TABLE clause identifies the target table for the LOAD operation.Using SFTP (or FTP) is one way in which you can invoke the LOAD command. If your target data already resides in your distributed file system, you can provide the DFS directory information in your file URL specification. Indeed, for optimal runtime performance, you may prefer to take that approach. In addition, you can load data directly from a remote relational DBMS via a JDBC connection. See the BigInsights Knowledge Center (product documentation) for details.|
__3.¬†¬†¬†¬† Count the number of rows in the table to verify that 274 are present.
-- total rows in SLS_PRODUCT_DIM = 274
select count(*) from bigsqllab.SLS_PRODUCT_DIM;
__4.¬†¬†¬†¬† Optionally, query the table.
select product_key, introduction_date, product_color_code
where product_key > 30000
fetch first 5 rows only;
If you don’t plan to complete any subsequent labs, you may want to clean up your environment. This optional exercise shows you how to drop the tables you created in this lab.
__1.¬†¬†¬†¬† Drop the reviews table and the bigsqllab.sls_product_dim table.
drop table bigsqllab.reviews;
drop table bigsqllab.sls_product_dim;
__2.¬†¬†¬†¬† Verify that these tables no longer exist. For example, query each table and confirm that you receive an error message indicating that the table name you provided is undefined (SQLCODE -204, SQLSTATE 42704).
select count(*) from bigsqllab.reviews;
select count(*) from bigsqllab.sls_product_dim;