
Using Big SQL on HBase for Fast Point Queries - Hadoop Dev

Technical Blog Post



As organizations start moving their data into a distributed file system, they discover a wide variety of tools and may find it difficult to pick the one that matches their access model. What if SQL and JDBC are your primary client interfaces and you would like to keep using them? The good news is that many solutions offer SQL over Hadoop data. Big SQL, with its ability to support the broadest range of SQL with high performance and scalability, has been gaining popularity quickly. Now, what if your data access requires fast, sub-second response times? Big SQL can cater to that as well. With support for SQL over HBase, you get all the features available with other file formats, with the additional benefit of fast response times for small queries. Through a series of articles, we plan to go over some of the features of HBase and explain how Big SQL leverages them.

If one were to pick a single feature, the ability to do fast lookups by row key is what makes HBase a good candidate for queries requiring sub-second response times. No matter how big your HBase table is, you can expect to get rows back in a few milliseconds, provided you know the full row key or a key range that matches a narrow set of rows. Before going into the SQL side of things, let us take a look at how data is stored in an HBase table. Though often described as a column-oriented database, HBase is basically a KeyValue store: each value in an HBase table is uniquely identified by a key that consists of a row key, column family, column qualifier and timestamp.

Here is a sample of how data looks in HBase.

   rowkey_columnfamily_columnqualifier_timestamp_value
   row1_cf1_cq1_ts1_val#1111
   row1_cf1_cq1_ts2_val#1112
   row1_cf1_cq2_ts1_val#1121
   row2_cf1_cq3_ts1_val#2131
   row2_cf2_cq4_ts1_val#2231

The data is stored sorted and indexed by the key, and the entire key is persisted on disk along with each value. You can see how this helps when you want to retrieve a narrow set of rows. On the other hand, it makes HBase a very verbose data store: you pay the cost when storing the data as well as when reading it.
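The layout above can be modeled in a few lines of Python. This is a conceptual sketch of the KeyValue idea, not HBase internals: cells are kept sorted by their full coordinate, so a point lookup is a binary search, but every cell also repeats that full coordinate.

```python
import bisect

# Conceptual sketch of the KeyValue layout: every cell carries its full
# coordinate (row key, column family, column qualifier, timestamp), and
# cells are kept sorted by that coordinate. (Real HBase sorts timestamps
# newest-first; here they sort ascending for simplicity.)
cells = sorted([
    ("row1", "cf1", "cq1", 2, "val#1112"),
    ("row1", "cf1", "cq1", 1, "val#1111"),
    ("row1", "cf1", "cq2", 1, "val#1121"),
    ("row2", "cf1", "cq3", 1, "val#2131"),
    ("row2", "cf2", "cq4", 1, "val#2231"),
])

def point_lookup(row_key):
    """Binary-search the sorted cells for everything under one row key."""
    lo = bisect.bisect_left(cells, (row_key,))
    hi = bisect.bisect_left(cells, (row_key + "\x00",))
    return cells[lo:hi]
```

Because the store is sorted, the lookup touches only the narrow slice for the requested row key; the verbosity shows in every tuple repeating the key parts.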

Let that not make you turn away from HBase. Big SQL helps alleviate these issues by providing a column mapping clause that maps the SQL columns in a table to HBase entities. A user can have as many columns as they would like in their table; the trick is to not have many columns in the corresponding HBase table. Remember, each column carries a lot of baggage.

Before going into the different mapping possibilities, here is a sample Big SQL DDL to create a table with data stored in HBase. This creates a table named schemaname.tablename in HBase; in this case, it will be bigsql.mixed_encodings.

   CREATE HBASE TABLE mixed_encodings (c1 int, c2 int, c3 int, c4 varchar(20), c5 varchar(40), c6 int)
   COLUMN MAPPING
   (
     KEY MAPPED BY (c1,c2,c3),
     cf1:cq1 MAPPED BY (c4,c5) ENCODING DELIMITED FIELDS TERMINATED BY '#',
     cf1:cq2 MAPPED BY (c6) ENCODING USING SERDE 'com.ibm.biginsights.bigsql.serde.JSONSerDeForNonUTFStrings'
   );

This sample table has six SQL columns: the first three are mapped to the HBase row key; the next two are mapped to the HBase column cf1:cq1; the last one is mapped to cf1:cq2. You may also notice an ENCODING clause, which indicates how to encode the data before storing it in HBase. If you don’t care, leave this to Big SQL and it will choose the best encoding, which happens to be its built-in binary encoding. With binary encoding, you get the maximum pushdown and numeric collation. The DELIMITED encoding is a string encoding that keeps the data in HBase readable, and the USING SERDE encoding lets a user specify a SerDe class to encode/decode the data. Did I mention HBase does not care about types or how you store stuff? To it, everything is just raw bytes. That gives an application the flexibility to store anything it wants and interpret it however it likes, and this is one of the features Big SQL leverages. It lets you pack more than one SQL column into an HBase entity, which could be the row key or a column. It also lets you choose short names for the HBase columns so that you don’t add much to the column baggage, which includes the column family and qualifier names.

Hint: Keep the family and qualifier short! Use your naming skills to choose some fancy names for those SQL columns.

Here is a peek at how the data looks in HBase when the following insert statement is used.

INSERT INTO mixed_encodings values (1,2,3,'four','five','six');
   hbase(main):001:0> scan 'bigsql.mixed_encodings'
   ROW                                                            COLUMN+CELL
    \x00\x80\x00\x00\x01\x00\x80\x00\x00\x02\x00\x80\x00\x00\x03  column=cf1:cq1, timestamp=1447974104744, value=four#five
    \x00\x80\x00\x00\x01\x00\x80\x00\x00\x02\x00\x80\x00\x00\x03  column=cf1:cq2, timestamp=1447974104744, value={"c6":6}
   1 row(s) in 0.1870 seconds

When packing columns, one has to make some choices, which requires knowledge of the data access workload. Most likely, this is something one studied before choosing HBase as a data store; you probably chose HBase because you need to select small ranges of data using a set of predicates. Look at your predicates. Choose the ones that appear in the most queries, or the ones in the queries that run most frequently, and map these to the row key. Big SQL supports mapping multiple SQL columns to the row key and can push down query predicates on the columns forming the composite row key. In the example, these happen to be c1, c2 and c3, with c1 being the most used and most restrictive. Now, if a query comes in with predicates on c1, (c1, c2) or (c1, c2, c3), you are in luck.

The predicate pushdown mentioned above moves the work into the HBase region servers, closer to the data. Because HBase is a client/server database, clients have to send requests to region servers to read or write data. There are some cases where this can be bypassed, which will be covered in later articles. For now, think of predicate pushdown as a way to avoid costly network and disk I/O. Big SQL translates the predicates on the column(s) mapped to the row key into key ranges and applies them to the HBase scan requests. If the range matches one to a few thousand rows, the query completes with sub-second response time.

Hint: Use a composite row key and binary encoding to maximize predicate pushdown.
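The mechanics behind this hint can be sketched in Python. This is a conceptual model, not Big SQL internals: an order-preserving binary encoding for ints (flipping the sign bit so byte order matches numeric order, similar in spirit to the \x80... pattern visible in the scan output above), plus the translation of equality predicates on leading key columns into a scan range.

```python
import struct

def encode_int(v):
    # Order-preserving encoding for signed 32-bit ints: shifting by 2**31
    # flips the sign bit, so unsigned lexicographic byte order matches
    # signed numeric order. (A stand-in for Big SQL's binary encoding,
    # which also carries extras such as a null flag byte.)
    return struct.pack(">I", (v + 2**31) % 2**32)

def composite_key(*cols):
    return b"".join(encode_int(c) for c in cols)

def prefix_stop(prefix):
    """Smallest byte string sorting after every key starting with prefix."""
    b = bytearray(prefix)
    while b and b[-1] == 0xFF:
        b.pop()
    if not b:
        return b""  # open-ended: scan to end of table
    b[-1] += 1
    return bytes(b)

# Predicates c1 = 1 AND c2 = 2 become a (start, stop) key range:
start = composite_key(1, 2)
stop = prefix_stop(start)

keys = sorted(composite_key(a, b, c) for a in (1, 2) for b in (1, 2) for c in (1, 2))
matched = [k for k in keys if start <= k < stop]  # only the (1, 2, *) slice
```

With binary encoding the range check is a plain byte comparison, which is why predicates on leading key columns can be pushed all the way into the HBase scan.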

As with most indexing implementations, you need predicates on the leading part(s) of the row key for Big SQL to be able to push them down and get fast responses. What if you have queries on just c2? Or on another column like c4 that is not mapped to the row key? In such cases, you could create a secondary index on those columns. More on that later.

Shifting focus back to the columns in the row key: Big SQL also uses these to reduce the number of regions it needs to scan. A region can be thought of as a well-defined slice of data bounded by a start and stop key; a table in HBase is a collection of regions. When a query has no predicates, the entire set of regions has to be scanned. With predicates on the leading part(s) of the row key, Big SQL avoids this expensive full table scan and works on a subset of regions. Region elimination means less work when scheduling and distributing the query among Big SQL worker nodes, which can work on regions in parallel.

Hint: Deciding the row key is one of the most important parts of HBase table design. When coming up with column mapping, try to map SQL columns that are used in queries expecting fast response times to the row key.
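Region elimination itself is a simple interval-overlap check. Here is a sketch under the assumptions that regions are [start, stop) slices of the key space and that an empty key means open-ended, as in HBase; the region boundaries are made up for illustration.

```python
# Hypothetical table split into four regions; b"" means open-ended.
regions = [(b"", b"g"), (b"g", b"n"), (b"n", b"t"), (b"t", b"")]

def regions_to_scan(scan_start, scan_stop):
    """Keep only regions overlapping the [scan_start, scan_stop) key range."""
    hits = []
    for start, stop in regions:
        ends_before = stop != b"" and stop <= scan_start
        starts_after = scan_stop != b"" and scan_stop <= start
        if not (ends_before or starts_after):
            hits.append((start, stop))
    return hits
```

A narrow key range derived from row-key predicates, say [b"hat", b"hot"), touches a single region instead of all four, so only one worker needs to do any scanning.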

Now that we are done with the row key mapping, what can be done with all those other columns? In the example, there are only three more, but in real life they can number in the hundreds or more. Think of all the baggage that would have to be stored and transferred if you mapped the columns one to one. That is where dense column mapping comes in handy: you can put all your other columns into a single HBase column. If you know the projections your queries use, you know which sets of columns are retrieved together; putting columns that get queried together into one HBase column is an even better option. In Big SQL, an encoding can be specified at the HBase entity level, which means one can use a different encoding for the row key and for each HBase column. In most cases, though, it is best to use a single encoding for the entire table.

Hint: Use dense column mappings to minimize storage space and improve scan times. Look at the projection list of queries in your workload to decide which columns to pack together.
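A quick back-of-the-envelope sketch shows why dense mapping saves space. Every HBase cell repeats its key coordinate; here the 15-byte row key from the scan output stands in for that coordinate (the real one also includes family, qualifier and timestamp, so the savings are larger).

```python
# Row key bytes for (1, 2, 3), taken from the scan output above.
key = b"\x00\x80\x00\x00\x01\x00\x80\x00\x00\x02\x00\x80\x00\x00\x03"

# c4 and c5 mapped one-to-one: two cells, each repeating the key.
one_to_one = 2 * len(key) + len(b"four") + len(b"five")

# c4 and c5 packed with the DELIMITED '#' encoding: one cell, one key.
dense = len(key) + len(b"four#five")

def unpack(cell, delim="#"):
    """Split a packed cell value back into its SQL column values."""
    return cell.split(delim)
```

The packed layout stores and ships the key coordinate once per group of columns instead of once per column, and `unpack` recovers the individual values on read.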

When you are writing down the column mapping, you may wonder: why not use a:b, c:d as the HBase column names? They are short and should work well based on one of the hints in this article. Hold on! You need to understand what this means for the HBase table: it would end up with two column families, a and c. It is generally good to keep the number of column families to a minimum, preferably one, because HBase persists the data in each column family into separate HFiles. An HFile is the basic unit of storage in HBase, and to minimize disk I/O, one wants to read fewer files. Having a single column family is the best case, unless you have some columns that are rarely queried and/or are always queried separately. In this example, it works better to use a:b, a:c as the column names, which creates a single column family named a in the HBase table.

Hint: Try to use a single column family unless your workload calls for more.

For SQL columns mapped to HBase columns, you get some pushdown if there is a query predicate on the column that forms the leading part of the dense column mapping, e.g., c4. In this case, Big SQL pushes a column filter down into HBase. This server-side filtering is not as efficient as a key range lookup, since HBase still has to scan all the data, but it helps cut down what is sent back to the client.
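The effect of such a filter can be sketched as follows; this is a conceptual model of server-side filtering on the leading packed field, not Big SQL's actual filter class.

```python
def leading_field_matches(cell_value, wanted, delim="#"):
    """Check a predicate on the leading field of a '#'-packed cell value."""
    return cell_value.split(delim, 1)[0] == wanted

# Every cell is still examined on the region server (a full scan), but
# only matching cells travel back over the network to the client.
cells = ["four#five", "six#seven", "four#nine"]
sent_to_client = [c for c in cells if leading_field_matches(c, "four")]
```

Contrast this with the key-range case: here the work is linear in the table size, but the network transfer shrinks to the matching rows.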

With the dense column mapping described above, we minimize the rowkey_columnfamily_columnqualifier_timestamp baggage that gets returned with each value. What if we could avoid some more of it? This is possible by defining a table with a key-only mapping.

   CREATE HBASE TABLE keyonly (k0 int, k1 int, k2 int)
   COLUMN MAPPING
   (
     KEY MAPPED BY (k0,k1,k2)
   );

With such a table, Big SQL applies special filters to retrieve just the row key. Key-only tables are good for narrow tables where all columns are generally queried together. Keep in mind, however, that HBase limits the row key size, and the row key is the one thing that gets retrieved all the time. Pack enough columns into the row key, but try to keep it short.

Hint: You can define a key only Big SQL HBase table for narrow tables.
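In a key-only table every SQL column lives in the row key, so a scan that returns keys alone can reconstruct full rows. A sketch of the round trip, assuming the simple sign-bit-flipped big-endian int encoding used as a stand-in for Big SQL's binary encoding:

```python
import struct

def encode_key(cols):
    """Pack ints into an order-preserving composite row key."""
    return b"".join(struct.pack(">I", (c + 2**31) % 2**32) for c in cols)

def decode_key(key):
    """Recover the SQL column values from the row key bytes alone."""
    return [struct.unpack(">I", key[i:i + 4])[0] - 2**31
            for i in range(0, len(key), 4)]
```

No cell values need to be read at all; the key bytes are the row.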

Now, a word of caution. If you have started thinking of the HBase row key as a primary key in a relational database, the comparison is fair to a certain extent. However, you will not get a constraint violation error when you insert different values for an existing row key; HBase just creates another version of the value. Since Big SQL retrieves the latest version of a value, this means you may lose data without getting any errors. Before transferring data, make sure your row key is unique. If you are not sure, Big SQL has a special clause to handle such situations.

   CREATE HBASE TABLE force_key_unique (k0 int, c0 int)
   COLUMN MAPPING
   (
     KEY MAPPED BY (k0) FORCE KEY UNIQUE,
     f:q MAPPED BY (c0)
   );

Without the FORCE KEY UNIQUE clause, if the data had two rows with the same value for k0, only the last one would be visible to queries. With the clause, Big SQL appends a unique value to the row key before storing it in HBase, and both rows remain visible to queries. Note that this means more data gets stored in HBase, so use it only if you really need to.

Hint: Beware of the column version feature of HBase and ensure the column(s) mapped to the row key are unique to prevent data loss. Big SQL provides a FORCE KEY UNIQUE clause but use it only if you really need to.
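Both behaviors can be sketched side by side. This is a toy model, not HBase: versions pile up per row key, a "latest version" read sees only one, and the FORCE-KEY-UNIQUE-style fix appends a unique suffix (here a simple counter; the real clause uses an internally generated value).

```python
import itertools

store = {}  # row key -> list of versions, newest last

def put(row_key, value):
    # Same key again just adds another version; no constraint error.
    store.setdefault(row_key, []).append(value)

def scan_latest():
    """A reader asking for latest versions silently misses older ones."""
    return {k: v[-1] for k, v in store.items()}

seq = itertools.count()

def put_force_unique(row_key, value):
    # Append a unique suffix so every insert lands under a distinct key.
    put(row_key + b"#" + str(next(seq)).encode(), value)

put(b"k1", "first")
put(b"k1", "second")          # same key: "first" is no longer visible
put_force_unique(b"k2", "a")
put_force_unique(b"k2", "b")  # distinct keys: both rows stay visible
```

The trade-off is visible too: the unique suffix makes every row key, and hence every stored cell, a little larger.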

There is more to talk about in row key design, as well as other areas to consider while modeling data in HBase. For now, I hope these tips help you pack light and pack right, and enable you to quickly and efficiently retrieve what you need.

