Big SQL Automatic Catalog Synchronization (Part 1 - Introduction)

Introduction
Automatic synchronization of the Hive metastore and the Big SQL catalog was introduced in Big SQL 4.2 and is a significant enhancement to how Big SQL manages its catalog tables. With this feature enabled, Big SQL automatically synchronizes Hive metastore changes into the Big SQL catalog, so that any Hive DDL operation (CREATE, ALTER, DROP) is automatically reflected in the Big SQL catalog. If a new table is created in Hive, for example, that table automatically becomes available in Big SQL.

This blog is the first in a three-part series that will outline all you need to know to start working with Big SQL’s Automatic Catalog Synchronization (Auto-Sync). Future blogs in the series will provide more detailed information on the feature’s architecture, configuration options and problem determination. This first blog is an introduction to Auto-Sync, discussing its significance, the problem it addresses and how it can be enabled/disabled.

Background
Big SQL and Hive share table metadata via the Hive metastore. By doing this, Big SQL can work with tables created in Hive and Hive can work with tables created in Big SQL. Big SQL also stores metadata locally in the Big SQL catalog, for ease of access and to facilitate query execution.

Generally, the Big SQL catalog and the Hive metastore are in sync, as shown in Figure 1.

Auto_Synchronization_insync — Fig.1 – Big SQL catalog and the Hive metastore in sync.

Under some circumstances, however, we may end up in an out-of-sync state that looks more like what we have in Figure 2. Here, due to DDL changes executed in Hive, Big SQL and Hive have a different view of the table definitions.

Auto_Synchronization_outofsync — Fig.2 – Big SQL catalog and the Hive metastore out of sync.

How Might This Happen?
A metadata mismatch is the result of a Hive metastore change occurring outside of Big SQL’s control, that is, a metadata update made via Hive that has yet to be picked up by Big SQL. Big SQL is unaware of these DDL operations, and therefore the catalog can fall out of sync with the Hive metastore.

For example, say we have a table called “mybigtable” that was originally created in Big SQL, with the Hive metadata and the Big SQL catalog data in sync as expected. Later, a user working in Hive adds a new integer column called “newcol” to the table.

At this point, the table definition in Hive now looks like this:
Auto_Synchronization_hiveDescribe

However, without Auto-Sync enabled, the same table, as far as Big SQL is concerned, still looks like this:
Auto_Synchronization_bigsqlDescribec
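
To make the mismatch concrete, here is a minimal Python sketch that models the two metadata stores as plain dictionaries and reports which columns Hive knows about but Big SQL does not. It is only a toy illustration: the column names col1 and col2 and the helper missing_columns are hypothetical, and neither product stores its metadata this way.

```python
# Toy model: each store maps a table name to an ordered list of (column, type).
# col1 and col2 are hypothetical; only newcol comes from the example above.
hive_metastore = {
    "mybigtable": [("col1", "int"), ("col2", "string"), ("newcol", "int")],
}
bigsql_catalog = {
    "mybigtable": [("col1", "int"), ("col2", "string")],
}


def missing_columns(source, target, table):
    """Return columns defined for `table` in `source` but absent from `target`."""
    known = set(target.get(table, []))
    return [col for col in source.get(table, []) if col not in known]


drift = missing_columns(hive_metastore, bigsql_catalog, "mybigtable")
print(drift)  # [('newcol', 'int')]
```

An empty result would mean the two views of the table agree; here the single entry corresponds to the column added through Hive.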

Big SQL Solution Prior to 4.2 Release – HCAT_SYNC_OBJECTS
Prior to version 4.2, Big SQL did provide a solution for this problem, but it required manual intervention: the user had to rectify the metadata inconsistency by manually executing the HCAT_SYNC_OBJECTS stored procedure.
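
As a rough illustration of the manual step, the Python sketch below only builds the text of a CALL statement for the procedure; it does not connect to a database. The schema name "bigsql", the helper hcat_sync_call, and the argument values shown (object type 'a', exists action 'REPLACE', error action 'CONTINUE') are assumptions for this example, so check the HCAT_SYNC_OBJECTS documentation for your Big SQL version before running anything like it.

```python
def hcat_sync_call(schema, table, object_types="a",
                   exists_action="REPLACE", error_action="CONTINUE"):
    """Build a CALL statement for SYSHADOOP.HCAT_SYNC_OBJECTS (text only)."""
    def quote(value):
        # Double any embedded single quotes so the SQL string literal stays valid.
        return "'" + value.replace("'", "''") + "'"

    args = ", ".join(
        quote(v)
        for v in (schema, table, object_types, exists_action, error_action)
    )
    return f"CALL SYSHADOOP.HCAT_SYNC_OBJECTS({args})"


# "bigsql" is a hypothetical schema name used only for this example.
statement = hcat_sync_call("bigsql", "mybigtable")
print(statement)
```

The resulting statement would then be submitted through whichever Big SQL client you normally use.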

Big SQL 4.2 Solution – Auto-Sync
Since Big SQL 4.2, Auto-Sync enables the Big SQL catalog to be kept up to date with the Hive metastore automatically. This feature can be enabled/disabled through the Ambari GUI (details below). When enabled, any DDL changes reflected in the Hive metastore will be picked up and automatically synced with the Big SQL catalog. In later versions of Big SQL, Auto-Sync is enabled by default.
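
Conceptually, the effect of Auto-Sync is that CREATE, ALTER and DROP operations made in Hive end up mirrored in the Big SQL catalog. The Python sketch below mimics that outcome on two dictionaries. It is a simplified model under our own assumptions, not a description of how Big SQL implements the feature; the function sync_pass and the table names "fresh" and "stale" are hypothetical.

```python
def sync_pass(hive, catalog):
    """Make `catalog` match `hive`; return the DDL-like changes applied."""
    applied = []
    for table, columns in hive.items():
        if table not in catalog:
            applied.append(("CREATE", table))
        elif catalog[table] != columns:
            applied.append(("ALTER", table))
        else:
            continue  # already in sync, nothing to copy
        catalog[table] = list(columns)
    # Tables that no longer exist in Hive are dropped from the catalog.
    for table in [t for t in catalog if t not in hive]:
        del catalog[table]
        applied.append(("DROP", table))
    return applied


hive = {"mybigtable": ["col1", "newcol"], "fresh": ["id"]}
catalog = {"mybigtable": ["col1"], "stale": ["x"]}

changes = sync_pass(hive, catalog)
print(changes)
print(sync_pass(hive, catalog))  # a second pass finds nothing left to do
```

After the first pass the toy catalog matches the toy metastore, which is the state Auto-Sync aims to maintain without user intervention.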

Enabling/Disabling Auto-Sync
Auto-Sync can be enabled/disabled via the Ambari GUI:

Go To

Shown below in Figure 3

Auto_Synchronization_EnableDisable — Fig.3 – Enabling/Disabling Big SQL Auto-Sync in Ambari.

It’s possible that a user will experience a lag of up to five minutes from the time when Auto-Sync is (re)enabled (or when Big SQL is restarted) to when it first processes DDL changes from Hive. After this initial lag, however, Big SQL will quickly process any DDL changes from the Hive metastore.

Depending on the version of Big SQL installed, automatic synchronization happens either at a fixed 60-second interval or, in later versions, at an interval set by bigsql.catalog.sync.sleep (the default is 30 seconds). We show how to configure this parameter in part 2 of this blog series.
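
To get a feel for what the interval means in practice, the following Python sketch estimates how long a Hive change waits before the next synchronization. It assumes, purely for illustration, that synchronization runs exactly at multiples of the interval and that processing time is negligible; the function seconds_until_next_poll is hypothetical.

```python
def seconds_until_next_poll(change_time, interval):
    """Seconds a change made at `change_time` waits for the next poll,
    assuming polls occur at 0, interval, 2 * interval, and so on."""
    return (-change_time) % interval


# Compare the fixed 60-second interval with the 30-second default.
waits_60 = [seconds_until_next_poll(t, 60) for t in (1, 20, 45)]
waits_30 = [seconds_until_next_poll(t, 30) for t in (1, 20, 45)]
print(waits_60)  # [59, 40, 15]
print(waits_30)  # [29, 10, 15]
```

Under these assumptions the wait is always shorter than one interval, so a shorter interval lowers the worst-case delay at the cost of more frequent synchronization work.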

Summary
In this blog we introduced Big SQL’s Automatic Synchronization (Auto-Sync). This is a Big SQL feature that provides automatic synchronization of table metadata between the Hive metastore and the Big SQL catalog. We outlined the significance of this feature and the problem it solves. We also looked at how to get started using this feature by enabling/disabling via Ambari.

In part 2 of this blog series we’ll take a closer look at the architecture of Auto-Sync, detailing how Big SQL provides this metadata synchronization solution.
