Accessing tables created in Hive and files added to HDFS from Big SQL

Introduction

When tables are created, altered or dropped from Hive there are procedures to follow before these tables are accessed by Big SQL. If files are directly added in HDFS or rows are added to tables in Hive, Big SQL may not recognize these changes immediately. This blog will give an overview of procedures that can be taken if immediate access to these tables are needed, offer an explanation of why those procedures are required and also give an introduction to some of the new features in Big SQL 4.2 and later releases in this area.

Big SQL and Hive Metastore Introduction

As long as the table is defined in the Hive MetaStore and accessible in the Hadoop cluster then both BigSQL and Hive can access it. The Hive metastore stores the metadata for Hive tables, this metadata includes table definitions, location, storage format, encoding of input files, which files are associated with which table, how many files there are, types of files, column names, data types etc. Big SQL uses these low level API’s of Hive to physically read/write data. Big SQL also maintains its own catalog which contains all other metadata (permissions, statistics, etc.) specific to Big SQL.

Syncing of the Big SQL Catalog and Hive Metastore

When a table is created from Big SQL, the table is also created in Hive. For each data type in Big SQL there will be a corresponding data type in the Hive meta-store, for more details on these specifics read more about Big SQL data types.
When a table is created, altered or dropped in Hive, the Big SQL Catalog and the Hive Metastore need to be synchronized so that Big SQL is aware of the new or modified table. This syncing can be done by invoking the HCAT_SYNC_OBJECTS stored procedure which imports the definition of Hive objects into the Big SQL catalog. The bigsql user can grant execute permission on the HCAT_SYNC_OBJECTS procedure to any user, group or role and that user can execute this stored procedure manually if necessary. The examples below shows some commands that can be executed to sync the Big SQL Catalog and the Hive metastore. Note that we use regular expression matching where ‘.’ matches any single character and ‘*’ matches zero or more of the preceding element.

    GRANT EXECUTE ON PROCEDURE HCAT_SYNC_OBJECTS TO USER1;  CALL SYSHADOOP.HCAT_SYNC_OBJECTS(‘bigsql’,’mybigtable’,’a’,’MODIFY’,’CONTINUE’);    --Optional parameters also include IMPORT HDFS AUTHORIZATIONS or TRANSFER OWNERSHIP TO user  CALL SYSHADOOP.HCAT_SYNC_OBJECTS(‘bigsql’,’mybigtable’,’a’,’REPLACE’,’CONTINUE’, ‘IMPORT HDFS AUTHORIZATIONS’);  --Import tables from Hive that start with HON and belong to the bigsql schema  CALL SYSHADOOP.HCAT_SYNC_OBJECTS('bigsql', 'HON.*', 'a', 'REPLACE', 'CONTINUE')";

Performance tip – call the HCAT_SYNC_OBJECTS stored procedure using the ‘MODIFY’ instead of the ‘REPLACE’ option where possible. The ‘REPLACE’ option will drop and recreate the table in the Big SQL catalog and all statistics that were collected on that table would be lost.

Automatic Syncing of the Big SQL Catalog and Hive Metastore

New in Big SQL 4.2 is the ‘auto hcat sync’ feature – this feature will check to determine whether there are any tables created, altered or dropped from Hive and will trigger an automatic HCAT_SYNC_OBJECTS call if needed to sync the Big SQL catalog and the Hive Metastore. Auto hcat sync is the default in releases after 4.2.

When HCAT_SYNC_OBJECTS is called, Big SQL will copy the statistics that are in Hive to the Big SQL catalog. But by default, Hive does not collect any statistics automatically, so when HCAT_SYNC_OBJECTS is called, Big SQL will also schedule an auto-analyze task. For details read more about Auto-analyze in Big SQL 4.2 and later releases. Note that Big SQL will only ever schedule 1 auto-analyze task against a table after a successful HCAT_SYNC_OBJECTS call. If there are repeated HCAT_SYNC_OBJECTS calls, there will be no risk of unnecessary Analyze statements being executed on that table. If Big SQL realizes that the table did change significantly since the last Analyze was executed on the table then Big SQL will schedule an auto-analyze task.
Since Big SQL 4.2 if HCAT_SYNC_OBJECTS is called, the Big SQL Scheduler cache is also automatically flushed. The next section gives a description of the Big SQL Scheduler cache.

Introduction to the Big SQL Scheduler Cache

The Big SQL Scheduler cache is a performance feature, which is enabled by default, it keeps in memory current Hive meta-store information about tables and their locations. The Big SQL compiler has access to this cache so it can make informed decisions that can influence query access plans. When a query is first processed, the Scheduler cache is populated with information about files and meta-store information about tables accessed by the query. The Scheduler cache is flushed every 20 minutes. This time can be adjusted and the cache can even be disabled. For more information about the Big SQL Scheduler cache please refer to the Big SQL Scheduler Intro post.
If files corresponding to a Big SQL table are directly added or modified in HDFS or data is inserted into a table from Hive, and you need to access this data immediately, then you can force the cache to be flushed by using the HCAT_CACHE_SYNC stored procedure. The following examples shows how this stored procedure can be invoked:

    –-Tells the Big SQL Scheduler to flush its cache for a particular schema  CALL SYSHADOOP.HCAT_CACHE_SYNC (‘bigsql’);  –-Tells the Big SQL Scheduler to flush its cache for a particular object  CALL SYSHADOOP.HCAT_CACHE_SYNC (‘bigsql’,’mybigtable’);

Performance tip – where possible invoke this stored procedure at the table level rather than at the schema level.

Adding a partition directory of files to HDFS

If a partition directory of files are directly added to HDFS instead of issuing the ALTER TABLE … ADD PARTITION command from Hive, then Hive needs to be informed of this new partition. This can be done by executing the MSCK REPAIR TABLE command from Hive.

    hive> MSCK REPAIR TABLE mybigtable;

When the table is repaired in this way, then Hive will be able to see the files in this new directory and if the ‘auto hcat-sync’ feature is enabled in Big SQL 4.2 then Big SQL will be able to see this data as well. In Big SQL 4.2, if the ‘auto hcat-sync’ feature is not enabled (which is the default behavior) then you will need to call the HCAT_SYNC_OBJECTS stored procedure. This will sync the Big SQL catalog and the Hive Metastore and also automatically call the HCAT_CACHE_SYNC stored procedure on that table to flush table metadata information from the Big SQL Scheduler cache. If you are on versions prior to Big SQL 4.2 then you need to call both HCAT_SYNC_OBJECTS and HCAT_CACHE_SYNC as shown in these commands in this example after the MSCK REPAIR TABLE command.

    –-Tells the Big SQL Scheduler to flush its cache for a particular schema  CALL SYSHADOOP.HCAT_SYNC_OBJECTS(‘bigsql’,’mybigtable’,’a’,’MODIFY’,’CONTINUE’);    CALL SYSHADOOP.HCAT_CACHE_SYNC (‘bigsql’);

Summary

Prior to Big SQL 4.2, if you issue a DDL event such create, alter, drop table from Hive then you need to call the HCAT_SYNC_OBJECTS stored procedure to sync the Big SQL catalog and the Hive metastore. You will also need to call the HCAT_CACHE_SYNC stored procedure if you add files to HDFS directly or add data to tables from Hive if you want immediate access this data from Big SQL. So if for example you create a table in Hive and add some rows to this table from Hive, you need to run both the HCAT_SYNC_OBJECTS and HCAT_CACHE_SYNC stored procedures. In Big SQL 4.2 and beyond, you can use the ‘auto hcat-sync’ feature which will sync the Big SQL catalog and the Hive metastore after a DDL event has occurred in Hive if needed. Auto hcat-sync is the default in all releases after 4.2. Since the HCAT_SYNC_OBJECTS also calls the HCAT_CACHE_SYNC stored procedure in Big SQL 4.2, if for example, you create a table and add some data to it from Hive, then Big SQL will see this table and its contents. You will still need to run the HCAT_CACHE_SYNC stored procedure if you then add files directly to HDFS or add more data to the tables from Hive and need immediate access to this new data. In Big SQL 4.2 if you do not enable the ‘auto hcat-sync’ feature then you need to call the HCAT_SYNC_OBJECTS stored procedure to sync the Big SQL catalog and the Hive Metastore after a DDL event has occurred.

Tips

Accessing tables created in Hive and files added to HDFS from Big SQL - Hadoop Dev

Technical Blog Post

Abstract

Body