Apache HBase is a distributed key-value store that runs on top of the Hadoop Distributed File System (HDFS). It’s modeled on Google’s Bigtable, the design behind Google’s Cloud Bigtable NoSQL database service, and it provides APIs to query the data. The data is organized, partitioned, and distributed by row key. Within each partition, the data is further physically partitioned by column family, each of which groups a collection of columns. The data model is well suited to wide tables where columns are dynamic and the data is generally sparse.
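To make the data model concrete, here is a minimal in-memory sketch in Python. It is not HBase client code: the names table, put, and scan are hypothetical, and the sketch only illustrates how cells are addressed by row key, column family, and column qualifier, and why sparse rows cost nothing for the columns they omit.

```python
from collections import defaultdict

# Logical view of an HBase-style table (illustration only, not the HBase API):
# row key -> column family -> column qualifier -> value.
# Only cells that were written exist, so sparse, wide rows stay cheap.
table = defaultdict(lambda: defaultdict(dict))

def put(row_key, family, qualifier, value):
    """Store one cell under (row key, column family, column qualifier)."""
    table[row_key][family][qualifier] = value

def scan(start_key, stop_key):
    """Return rows with start_key <= row key < stop_key, in row-key order."""
    return [(k, table[k]) for k in sorted(table) if start_key <= k < stop_key]

put("user#001", "profile", "name", "Ada")
put("user#001", "activity", "last_login", "2018-03-01")
put("user#002", "profile", "name", "Grace")  # this row has no "activity" cells

rows = scan("user#001", "user#003")
```

Because rows are kept in row-key order, a range scan touches only the contiguous slice of keys it asks for, which is also what makes partitioning by row key practical.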

Although HBase is a useful big data store, its access mechanisms are primitive: client-side APIs, map/reduce interfaces, and interactive shells. SQL access to HBase data is available either 1) through map/reduce or interface mechanisms like Apache Hive and Impala, or 2) through native SQL technologies like Apache Phoenix. The former is usually cheaper to implement and use, but its latency and efficiency don’t compare well with the latter, so it is often suitable only for offline analysis. Native SQL technologies, in contrast, typically run on purpose-built execution engines, often perform better, and can be considered online engines.

By leveraging Spark as a unified big data processing engine, we provide a new approach to SQL on HBase: HSpark. It benefits from Spark’s performance advantage over other approaches (Apache Phoenix, for example). HSpark can query HBase through the Spark Dataset API, and it also includes a command-line interface (CLI), HSpark Shell, that supports new DDL/DML commands.

About HSpark

By combining Spark with HBase, we hope to enable more Spark usage on the popular HBase data store in a SQL way, while also providing high-performance computing on the big data ecosystem. The HSpark GitHub project can be found at https://github.com/bomeng/HSpark.

The following sections provide a brief overview of basic HSpark and HSpark Shell syntax.

HSpark supported data types

HSpark natively supports several data types in its SQL grammar. When designing a schema, choose the data types that match your requirements. The supported data types are:
  • Boolean
  • Byte
  • Date
  • Double
  • Float
  • Integer
  • Long
  • String
  • Timestamp

HSpark Shell command syntax

HSpark Shell is a tool packaged with the HSpark release. Within the HSpark Shell, users can type in the DDL/DML commands to create and drop tables, import data into existing tables, and query against tables.

Create table

CREATE TABLE tablename (colname1 datatype1, colname2 datatype2, ...)

A SQL table on HBase is basically a logical table mapped to an HBase table. This mapping can be many-to-one in order to support schema-on-read for SQL access to HBase data.

The full statement also carries the options that map the SQL table onto HBase:

  • The <hbase_table_name> denotes the HBase table.
  • The keyCols constraint lists the columns that compose the HBase row key.
  • The nonKeyCols='<nonKeyCol1>,<colFamily1>,<colQualifier1>;' entry maps the non-key column <nonKeyCol1> to the column qualifier <colQualifier1> in column family <colFamily1> of the HBase table.
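As an illustration of the nonKeyCols format described above, the following Python sketch parses a string of 'column,family,qualifier;' entries into a lookup table. The function name parse_non_key_cols and the sample column names are hypothetical, and this is not how HSpark itself parses the property; it only shows the structure the string encodes.

```python
def parse_non_key_cols(spec):
    """Parse 'col,family,qualifier;col,family,qualifier;' into a dict.

    Returns {sql_column: (column_family, column_qualifier)}.
    """
    mapping = {}
    for entry in spec.split(";"):
        entry = entry.strip()
        if not entry:  # tolerate the trailing ';'
            continue
        parts = [p.strip() for p in entry.split(",")]
        if len(parts) != 3:
            raise ValueError(f"expected 'column,family,qualifier', got {entry!r}")
        column, family, qualifier = parts
        mapping[column] = (family, qualifier)
    return mapping

# Hypothetical mapping: two SQL columns stored in two column families.
mapping = parse_non_key_cols("name,profile,n;last_login,activity,ll;")
```

Each SQL column resolves to exactly one column family and qualifier pair, which is what lets several logical tables with different mappings sit on top of the same HBase table.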

If the specified table and column families do not exist in HBase, HSpark creates the HBase table so that the CREATE TABLE statement can succeed. In addition, the columns in the primary key cannot be mapped to another column family/column qualifier combination. Other normal SQL sanity checks, such as uniqueness of logical columns, are applied as well.
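The checks described above can be sketched in a few lines of Python. This is a hedged illustration rather than HSpark's actual validation code: check_schema and its arguments are hypothetical names, and only the two rules stated in the text (unique logical columns, key columns not mapped to a column family/qualifier) plus a basic "column must be declared" check are covered.

```python
def check_schema(columns, key_cols, non_key_mapping):
    """Return a list of problems found in a table definition (empty if none).

    columns:         logical column names, in declaration order
    key_cols:        columns that compose the HBase row key
    non_key_mapping: {column: (column_family, column_qualifier)}
    """
    problems = []
    seen = set()
    for col in columns:
        if col in seen:  # logical column names must be unique
            problems.append(f"duplicate column: {col}")
        seen.add(col)
    for col in key_cols:
        if col not in seen:
            problems.append(f"unknown key column: {col}")
        if col in non_key_mapping:  # key columns live in the row key only
            problems.append(f"key column mapped to a column family: {col}")
    for col in non_key_mapping:
        if col not in seen:
            problems.append(f"unknown non-key column: {col}")
    return problems

ok = check_schema(["id", "name"], ["id"], {"name": ("profile", "n")})
bad = check_schema(
    ["id", "name", "name"],
    ["id"],
    {"id": ("profile", "i"), "name": ("profile", "n")},
)
```

Collecting every problem in a list, rather than stopping at the first one, lets a user fix a faulty definition in one pass.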

Query table

SELECT ... FROM <table_name> WHERE ...

Queries simply use standard Spark SQL query syntax or the DataFrame API.

Drop table

DROP TABLE <table_name>

DROP TABLE does not delete the underlying HBase table; it only removes the SQL table and its schema.

Insert data

INSERT INTO TABLE <table_name> SELECT clause
INSERT INTO TABLE <table_name> VALUES (value, ...)

Every inserted row must include its HBase key. The regular SQL sanity checks (for example, uniqueness of logical columns) are also performed on insertion.

Bulk load

LOAD DATA LOCAL INPATH '<file_path>' INTO TABLE <table_name>

Bulk load provides a way for users to load data into an existing HBase table. Currently, only CSV files are supported.
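To show what preparing CSV data for such a load involves, here is a small Python sketch that reads CSV text, converts each field according to a schema built from the supported type names, and rejects rows whose key columns are empty. Everything in it is an assumption for illustration: read_rows and CONVERTERS are hypothetical, the text formats HSpark accepts for Boolean, Date, and Timestamp values are not stated in this article, and HSpark's own loader may behave differently.

```python
import csv
import io
from datetime import date, datetime

# Hypothetical converters, one per supported type name. The accepted text
# formats (ISO dates, "true"/"false") are assumptions, not HSpark behavior.
CONVERTERS = {
    "Boolean": lambda s: s.strip().lower() == "true",
    "Byte": int,
    "Date": date.fromisoformat,
    "Double": float,
    "Float": float,
    "Integer": int,
    "Long": int,
    "String": str,
    "Timestamp": datetime.fromisoformat,
}

def read_rows(csv_text, schema, key_cols):
    """Yield one typed dict per CSV line; schema is an ordered list of (name, type)."""
    for line_no, fields in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if len(fields) != len(schema):
            raise ValueError(
                f"line {line_no}: expected {len(schema)} fields, got {len(fields)}"
            )
        row = {}
        for (name, type_name), text in zip(schema, fields):
            # An empty field becomes None, i.e. a cell that is simply absent.
            row[name] = CONVERTERS[type_name](text) if text != "" else None
        missing = [k for k in key_cols if row[k] is None]
        if missing:  # a row without its key cannot be stored
            raise ValueError(f"line {line_no}: missing key column(s) {missing}")
        yield row

schema = [("id", "Integer"), ("name", "String"), ("score", "Double"), ("joined", "Date")]
data = "1,Ada,9.5,2018-03-01\n2,Grace,,2018-03-02\n"
rows = list(read_rows(data, schema, key_cols=["id"]))
```

The second sample row leaves score empty, which the sketch treats as a missing cell rather than an error; only an empty key column causes the row to be rejected.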

Other useful commands

We also support SHOW TABLES and DESCRIBE for catalog information. Type HELP in the CLI to display the list of supported commands.

A new developer pattern titled “Use Spark SQL to access NoSQL HBase tables” provides a closer look at using Spark SQL and the HSpark connector package to create and query data tables in HBase region servers.
