In IBM BigInsights 4.0, HBase support has been added to Big SQL. This means you can now run SQL workloads against data stored in HBase!
But what is HBase? And why should I use it?
- HBase is an open source, distributed, column-oriented DBMS that runs on top of Hadoop.
HBase excels at fetching a single row or small batches of rows from large datasets.
This is because HBase is essentially a key-value store. Internally, all of the rows in any given HBase table are maintained in sorted order (or indexed) on their “row key”. So, given a particular row key, HBase can do an index lookup to quickly retrieve the set of “values” associated with that key!
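To make that concrete, here’s a minimal Python sketch of the idea (purely illustrative — this is not how HBase is implemented, and the table contents are made up): keep the row keys in sorted order, binary-search for a key, and return the values associated with it.

```python
from bisect import bisect_left

# Toy model of an HBase table: row keys kept in sorted order, each
# mapping to its column values. (Illustrative only; real HBase stores
# sorted key-value pairs in HFiles spread across region servers.)
rows = {
    b"row-001": {b"cf:name": b"alice", b"cf:age": b"34"},
    b"row-002": {b"cf:name": b"bob",   b"cf:age": b"29"},
    b"row-007": {b"cf:name": b"carol", b"cf:age": b"41"},
}
sorted_keys = sorted(rows)

def get(row_key):
    """Binary-search the sorted key list, then fetch the row's values."""
    i = bisect_left(sorted_keys, row_key)
    if i < len(sorted_keys) and sorted_keys[i] == row_key:
        return rows[sorted_keys[i]]
    return None  # key not present

print(get(b"row-002"))  # a fast point lookup by row key
```

Because the keys stay sorted, a point lookup touches only a handful of entries no matter how large the table grows — which is exactly what makes HBase so good at single-row fetches.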
Because HBase is also a column store, narrow queries (fetching a small set of columns from an HBase table) also perform well.
However, HBase isn’t suited for every use case. In particular, it typically won’t perform well at table scans or aggregations over a large number of rows. It also requires more storage space than other Hadoop file formats and tends to take longer to populate (indexed data doesn’t come for free).
With this support, BigInsights 4.0 now offers industry leading SQL support over data stored in HBase.
Any queries that could be run against Big SQL Hive tables can now also be run against HBase tables! That includes (but isn’t limited to) support for the following:
- Nested sub-queries + common sub-query expressions
- Windowing and OLAP aggregate functions
- Grouping sets and the ROLLUP function
- Complex joins
Note: UPDATE and DELETE against HBase tables are not yet supported, but support is coming in the near future.
The following features are also offered for HBase tables to optimize performance:
- Predicate pushdown:
- Predicates are pushed down as close to the data as possible to limit unnecessary data transfer.
This also allows Big SQL to leverage HBase’s strengths by having HBase apply certain predicates itself. For example, predicates on a table’s row key will be pushed down to HBase so HBase can perform an index lookup and pass any qualifying data back to Big SQL for further processing.
- Secondary indexes:
- There can be only one row key for any HBase table, and if your access pattern does not fit that key, performance will suffer.
But fear not: if you’re having trouble choosing between two possible row keys for your HBase table, BigInsights 4.0 offers the ability to create secondary indexes. Secondary indexes provide the same retrieval benefits as a row key index but can be defined on any other set of columns in your table!
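Conceptually, a secondary index is just a second lookup structure that maps a non-key column’s values back to row keys. Here’s a hypothetical Python sketch (Big SQL builds and maintains its secondary indexes for you; the table, column, and helper names here are invented):

```python
# Toy HBase table keyed on an opaque row key; the column we actually
# query on (cf:city) is not part of that key.
table = {
    b"k1": {b"cf:city": b"Austin"},
    b"k2": {b"cf:city": b"Boston"},
    b"k3": {b"cf:city": b"Austin"},
}

# Secondary index: column value -> list of row keys. Built once here;
# a real index must also be kept in sync as rows are inserted.
city_index = {}
for row_key, cols in table.items():
    city_index.setdefault(cols[b"cf:city"], []).append(row_key)

def rows_for_city(city):
    """Index lookup yields row keys; each then becomes a fast point get,
    instead of scanning the whole table for matching cities."""
    return [table[k] for k in sorted(city_index.get(city, []))]
```

The payoff is the same as with the row key: a predicate on an indexed column turns a full scan into a few targeted lookups.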
- Pre-split HBase tables:
- As you populate your HBase tables, HBase will initially throw all of the data into a single region. Once the region grows above a certain threshold (256 MB by default), the region will be split into two smaller regions of equal size (128 MB).
Because a region is the unit of work in Big SQL, at most two nodes will be assigned to scan a table with two regions. In some cases, though, it may be preferable to have more nodes involved in the scan rather than sitting idle.
This is where pre-splitting comes in. For tables that you would like to spread across your cluster, you can define them as pre-split tables with the same number of initial regions as you have data nodes. Then, during a scan of the table, each of your data nodes will be assigned at least one region to read, rather than most nodes sitting idle while one or two do all the heavy lifting.
Pre-splitting also improves performance when populating your tables, as each region can be populated in parallel, instead of starting with a single region that HBase splits in two only once it becomes full.
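As a rough illustration of what pre-splitting sets up, here’s a hypothetical Python sketch that computes evenly spaced split points over a one-byte key space. (With Big SQL you declare the regions at table-creation time; this only shows the arithmetic behind choosing the boundaries, and the function name is made up.)

```python
def split_points(n_regions, key_space=256):
    """Return n_regions - 1 evenly spaced one-byte boundaries.
    n regions need n - 1 split points; rows whose first key byte
    falls between two boundaries land in that region."""
    step = key_space // n_regions
    return [bytes([i * step]) for i in range(1, n_regions)]

# Four regions need three boundaries, carving the key space into
# [0x00, 0x40), [0x40, 0x80), [0x80, 0xC0), and [0xC0, ...].
print(split_points(4))
```

With one initial region per data node, every node gets work from the very first scan (and the very first load) instead of waiting for HBase to split organically.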
- Salted HBase tables:
- Depending on your HBase row key selection, you may end up having lots of similar data (or data that is often queried together) stored in a single HBase region. In such a case, you could end up hitting a “region hot-spotting” problem where a single HBase region has to serve all of the Big SQL nodes running your queries and becomes a bottleneck.
This issue is particularly prevalent when monotonically increasing values are used for the leading column of your row key. For example, if your row key is a TIMESTAMP column, then all of the rows associated with a particular month will likely end up in the same region (since HBase keeps its data sorted on the row key). If a query comes in to process all of the data from that month, every node involved in the scan of the table will go to the region server hosting that region to ask for qualifying rows, resulting in a bottleneck on that region server.
To relieve this bottleneck, Big SQL offers the ability to “salt” your tables. Salting a table tells Big SQL to add a prefix to the beginning of your HBase row keys so that rows are mapped across many regions, making it far more likely that the qualifying rows in your queries are spread across your cluster and can be accessed in parallel by different Big SQL agents.
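Here’s a minimal, purely illustrative Python sketch of the salting idea. (This sketch derives the salt from the key itself so the same row always gets the same prefix, which is what keeps point lookups possible; the helper name and bucket count are made up, not Big SQL internals.)

```python
def salted_key(row_key, n_buckets):
    """Prefix the row key with a salt byte so rows spread across
    n_buckets regions. Deriving the salt from the key keeps point
    lookups possible: recompute the salt, then probe that region."""
    salt = sum(row_key) % n_buckets  # stand-in for a real hash
    return bytes([salt]) + row_key

# Monotonically increasing timestamp keys no longer all sort into
# one region: their salt bytes (and therefore prefixes) differ.
keys = [salted_key(b"2015-03-%02d" % d, 4) for d in range(1, 31)]
```

The trade-off is that a range scan on the original key must now probe every salt bucket, which is why salting helps most when hot-spotting, not range scanning, is the dominant problem.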
- Different encoding types:
- Users have a lot of control over how data is mapped between Big SQL and HBase via different encoding schemes.
The following encodings are currently supported: binary encoding, string encoding, custom encoding.
As the name suggests, binary encoding stores data in binary format in HBase. It is generally preferred over string encoding for both functionality and performance and is the default encoding used in Big SQL unless otherwise specified.
String encoding is much more user-friendly than binary encoding, as values stored in HBase will be in a human-readable format. However, strings tend to be expensive to parse, so string encoding generally results in worse performance than binary encoding. It can also lead to unexpected results, as numeric data is collated lexicographically rather than by value (e.g., “9” > “10” with string encoding).
Alternatively, a custom encoding scheme can be used where users can provide their own serializer/deserializer (SerDe) to re-define exactly how data should be mapped to/from HBase.
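The collation pitfall is easy to demonstrate in a few lines of Python, along with why a fixed-width big-endian binary encoding avoids it. (This is an illustrative sketch of the general principle, not Big SQL’s actual encoder.)

```python
import struct

# String encoding sorts numbers lexicographically, which misorders them:
assert b"9" > b"10"   # "9" sorts after "10" as text...
assert 9 < 10         # ...even though 9 is less than 10 numerically

# A fixed-width big-endian encoding keeps byte order aligned with
# numeric order for non-negative values, so comparisons behave:
def enc(n):
    return struct.pack(">i", n)  # 4-byte big-endian integer

assert enc(9) < enc(10)
assert sorted([enc(10), enc(9), enc(2)]) == [enc(2), enc(9), enc(10)]
```

Since HBase compares and sorts everything as raw bytes, an encoding whose byte order matches value order is what lets range predicates on numeric columns behave correctly.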
- Mature SQL query optimizer:
- Big SQL offers an extremely mature query optimizer, backed by IBM’s decades of experience optimizing SQL queries on distributed RDBMSes. Over the past few months, this optimizer has been tweaked to understand how it can take advantage of having data stored in HBase.
For example, if we’re joining an HBase table to another table (not necessarily another HBase table) on its row key, the optimizer will consider a NLJOIN (nested loop join) plan with the HBase table on the inner leg.
In such a plan, Big SQL will loop over every qualifying row in the outer table and do the following:
1. Extract the values that make up the row key of the inner HBase table
2. Use the values to perform an index lookup in HBase to determine if a matching row exists.
This access plan plays to HBase’s strengths in terms of being able to quickly and efficiently fetch single rows.
But the optimizer also knows that this isn’t always the best join strategy, particularly if there are many qualifying rows in the outer table, which would result in many separate RPC calls to HBase. In such a case, the optimizer may decide that another join strategy (perhaps a hash join) is a better choice.
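The nested-loop plan described above can be sketched in a few lines of Python, with toy data and dict lookups standing in for HBase index probes (the table names and contents are invented for illustration):

```python
# Outer table: (order_id, customer_key) pairs to be joined.
orders = [
    (101, b"c2"), (102, b"c9"), (103, b"c1"),
]
# Inner "HBase" table, keyed on its row key (the customer key).
customers = {
    b"c1": b"alice", b"c2": b"bob",
}

joined = []
for order_id, cust_key in orders:        # loop over qualifying outer rows
    row = customers.get(cust_key)        # index lookup on the row key
    if row is not None:                  # keep only matching rows
        joined.append((order_id, row))

print(joined)  # [(101, b'bob'), (103, b'alice')]
```

Each outer row costs one probe, which is cheap when the outer side is small; with millions of outer rows, those per-row probes (RPC calls in the real plan) are exactly what makes the optimizer consider a hash join instead.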
- Complex types:
- In BigInsights 4.0, you can define ARRAY, MAP, and STRUCT columns in all of your Big SQL tables (not just HBase tables), giving you more ways to work with and analyze your datasets.
- Universally improved performance over HBase in Big SQL v1:
- Internal testing has shown HBase in Big SQL v4 to be roughly 2-3x faster than HBase in Big SQL v1! Stay tuned for more blogs on Big SQL HBase performance!
Check out the IBM Knowledge Center for more information on HBase in Big SQL!