About HSpark

By combining Spark with HBase, we hope to enable more Spark usage on the popular HBase data store, but in a SQL way, while also providing high-performance computing on the big data ecosystem. The HSpark GitHub project can be found at https://github.com/bomeng/HSpark. The following sections provide a brief overview of basic HSpark and HSpark Shell syntax.
HSpark supported data types

HSpark natively supports several data types in its SQL grammar. When designing a schema, carefully select the data types that match your requirements.
HSpark Shell command syntax

HSpark Shell is a tool packaged with the HSpark release. Within the HSpark Shell, users can type DDL/DML commands to create and drop tables, import data into existing tables, and query against tables.
CREATE TABLE tablename (colname1 datatype1, colname2 datatype2, ...) TBLPROPERTIES (
    'hbaseTableName'='<hbase_table_name>',
    'keyCols'='<keyCol1>;<keyCol2>;...',
    'nonKeyCols'='<nonKeyCol1>,<colFamily1>,<colQualifier1>;<nonKeyCol2>,<colFamily2>,<colQualifier2>;...')
A SQL table on HBase is basically a logical table mapped to an HBase table. This mapping can be many-to-one in order to support schema-on-read for SQL access to HBase data. In the example above:
- <hbase_table_name> denotes the underlying HBase table.
- The keyCols property denotes the columns that compose the HBase row key.
- nonKeyCols='<nonKeyCol1>,<colFamily1>,<colQualifier1>;...' denotes the mapping of a non-key column to the HBase table's column qualifier <colQualifier1> of column family <colFamily1>.

The referenced HBase table must already exist for the CREATE TABLE statement to succeed. In addition, the columns in the primary key cannot be mapped to another column family/column qualifier combination. Other normal SQL sanity checks, such as uniqueness of logical columns, are applied as well.
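To make the mapping concrete, here is a sketch of a CREATE TABLE statement; the table, column, and column-family names (teacher, cf, and so on) are hypothetical, chosen only for illustration:

```sql
-- Hypothetical example: map a SQL table 'teacher' onto an existing
-- HBase table 'teacher_hbase'. All names here are illustrative.
CREATE TABLE teacher (
    grade INTEGER,
    class INTEGER,
    subject STRING,
    teacher_name STRING,
    teacher_age INTEGER
) TBLPROPERTIES (
    'hbaseTableName'='teacher_hbase',          -- existing HBase table
    'keyCols'='grade;class;subject',           -- composite row key
    'nonKeyCols'='teacher_name,cf,name;teacher_age,cf,age'
)
```

Here grade, class, and subject together compose the HBase row key, while teacher_name and teacher_age map to the qualifiers name and age in column family cf.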
SELECT ... FROM <table_name> WHERE ...
Query table syntax simply leverages Spark SQL query syntax or DataFrame syntax.
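For instance, assuming a mapped table named teacher with columns grade and teacher_name (hypothetical names), a query looks like ordinary Spark SQL:

```sql
-- Standard SELECT syntax against the mapped table
SELECT teacher_name
FROM teacher
WHERE grade = 5
```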
Dropping a table will not delete the underlying HBase table; it simply removes the SQL table and its schema.
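For example, to remove a previously created SQL mapping named teacher (a hypothetical table name):

```sql
DROP TABLE teacher
```

After this, the SQL schema is gone, but the HBase table it pointed to, and all of its data, remain in place.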
INSERT INTO TABLE <table_name> SELECT clause
INSERT INTO TABLE <table_name> VALUES (value, ...)
The HBase row key must be present when inserting data. The regular SQL sanity checks (for example, uniqueness of logical columns) are also performed on insertion.
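As a sketch, assuming a hypothetical teacher table whose row key is composed of grade, class, and subject, the two insert forms might look like this:

```sql
-- VALUES form: all row-key columns (grade, class, subject) must be supplied
INSERT INTO TABLE teacher VALUES (5, 2, 'math', 'Alice', 34)

-- SELECT form: copy rows from another table with a compatible schema
INSERT INTO TABLE teacher SELECT * FROM teacher_staging
```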
LOAD DATA LOCAL INPATH '<file_path>' INTO TABLE <table_name>
Bulk load provides a way for users to load data into an existing HBase table. Currently, only CSV files are supported.
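A minimal sketch, assuming a mapped table named teacher and a local CSV file whose fields line up with the table's columns (both names are hypothetical):

```sql
-- Bulk-load a local CSV file into the mapped table
LOAD DATA LOCAL INPATH '/tmp/teacher.csv' INTO TABLE teacher
```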
Other useful commands

We also support DESCRIBE for catalog information. Type HELP in the CLI to display a list of the commands it supports. A new developer pattern titled Use Spark SQL to access NoSQL HBase tables provides a closer look at using Spark SQL and the HSpark connector package to create and query data tables in HBase region servers.
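For instance, inside the HSpark Shell (the table name is hypothetical):

```sql
-- Show catalog information for a registered table
DESCRIBE teacher

-- List all commands the shell supports
HELP
```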