Apache HBase is an open source, distributed NoSQL database that runs on top of the Hadoop Distributed File System (HDFS). It is well suited to fast read/write operations on large datasets, with high throughput and low input/output latency. Unlike traditional relational databases, however, HBase does not support SQL scripting or data types, and it requires the Java API to achieve equivalent functionality.
Apache Spark is a big data processing engine built for speed, ease of use, and sophisticated analytics. Like Spark, HBase is built for fast processing of large amounts of data, and the two are often combined in big data applications. HSpark connects to Spark and lets you run Spark SQL commands against an HBase data store, so you can manage and access your data with SQL.
This code pattern shows application developers who are familiar with SQL how to access HBase data tables with the SQL commands they already know. You will learn how to create and query data tables by using Apache Spark SQL and the HSpark connector package. You can then take advantage of the performance gains of HBase without having to learn the Java APIs traditionally required to access HBase data tables.
HSpark provides a new approach to supporting HBase. It leverages the unified big data processing engine of Spark, while also providing native SQL access to HBase data tables.
When you complete this pattern, you will understand how to:
- Install and configure Apache Spark and the HSpark connector.
- Create metadata for tables in Apache HBase.
- Write Spark SQL queries to retrieve HBase data for analysis.
The pattern walks through the following steps:
- Set up the environment (Apache Spark, Apache HBase, and HSpark).
- Create the tables using HSpark.
- Load the data into the tables.
- Query the data using the HSpark Shell.
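Table metadata is needed because HBase stores untyped byte arrays addressed by row key and `family:qualifier`, whereas SQL expects named, typed columns. The following is a minimal conceptual sketch of such a mapping, using only the Python standard library and an in-memory dictionary in place of HBase. It is not HSpark's implementation or API. The `ColumnMapping` class, the `SCHEMA` table, the `cf` column family, and the helper functions are all hypothetical names chosen for illustration.

```python
import struct
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnMapping:
    """Hypothetical metadata entry: one SQL column mapped onto HBase storage."""
    sql_name: str                    # column name as seen from SQL
    sql_type: str                    # "int" or "string" in this sketch
    family: Optional[str] = None     # None marks the row-key column
    qualifier: Optional[str] = None  # HBase column qualifier


# Assumed example table: id is the row key, the other columns live in family "cf".
SCHEMA = [
    ColumnMapping("id", "int"),
    ColumnMapping("customer", "string", "cf", "customer"),
    ColumnMapping("amount", "int", "cf", "amount"),
]


def encode(value, sql_type):
    """Turn a typed SQL value into the raw bytes a byte-oriented store holds."""
    if sql_type == "int":
        return struct.pack(">i", value)  # 4-byte big-endian signed integer
    if sql_type == "string":
        return value.encode("utf-8")
    raise ValueError(f"unsupported type: {sql_type}")


def decode(raw, sql_type):
    """Recover the typed SQL value from raw bytes, using the metadata type."""
    if sql_type == "int":
        return struct.unpack(">i", raw)[0]
    if sql_type == "string":
        return raw.decode("utf-8")
    raise ValueError(f"unsupported type: {sql_type}")


def put_row(store, schema, row):
    """Write one SQL-style row as a row key plus family:qualifier cells."""
    key_col = next(c for c in schema if c.family is None)
    row_key = encode(row[key_col.sql_name], key_col.sql_type)
    cells = store.setdefault(row_key, {})
    for col in schema:
        if col.family is not None:
            cell_name = f"{col.family}:{col.qualifier}"
            cells[cell_name] = encode(row[col.sql_name], col.sql_type)


def scan_rows(store, schema):
    """Yield typed rows in row-key byte order, as a table scan would."""
    key_col = next(c for c in schema if c.family is None)
    for row_key in sorted(store):
        cells = store[row_key]
        row = {key_col.sql_name: decode(row_key, key_col.sql_type)}
        for col in schema:
            if col.family is not None:
                cell_name = f"{col.family}:{col.qualifier}"
                row[col.sql_name] = decode(cells[cell_name], col.sql_type)
        yield row


store = {}  # in-memory stand-in for an HBase table
put_row(store, SCHEMA, {"id": 2, "customer": "globex", "amount": 75})
put_row(store, SCHEMA, {"id": 1, "customer": "acme", "amount": 120})
for row in scan_rows(store, SCHEMA):
    print(row)
```

In this sketch the stored values are plain bytes, and only the schema says how to read them back as integers or strings. A SQL layer over HBase needs the same kind of information before it can answer typed queries, which is why the pattern creates the tables through HSpark before loading and querying data.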
Ready to put this code pattern to use? For complete details on how to set up and run this application, see the README.md file.