Big SQL Automatic Catalog Synchronization (Part 2 - Architecture)

Introduction
This blog is the second installment in a series that outlines all you need to know to start working with Big SQL’s Automatic Catalog Synchronization (Auto-Sync). In part 1 we introduced Auto-Sync, discussing its significance, the problem it addresses, and how it can be enabled or disabled via the Ambari GUI. In this blog we’ll provide more details on the feature’s architecture and configuration options.

Architecture
At a high level and as shown in Figure 1, Big SQL’s Auto-Sync feature can be thought of as containing two core components:

Event-File Generation
Event-File Processing

Fig.1 – Big SQL Auto-Sync Architecture

1. Event-File Generation
Whenever a DDL statement updates the Hive metastore, the relevant DDL event information is serialized to a JSON-formatted file, which is stored in a predetermined location on HDFS. This HDFS location is known as the events-directory and defaults to “/user/bigsql/sync”.

Figure 2 shows an example of a DDL event written to a JSON event-file. This event-file is associated with an ALTER TABLE statement executed against the table ‘mybigtable’. In fact, this is the file generated by our example in part 1, where a new column was added, via Hive, to our existing table ‘mybigtable’.

Fig.2 – Example of an Auto-Sync event-file

For any DDL statement executed in Hive (CREATE, DROP, ALTER) that causes a change in the Hive metastore, a new JSON event-file, similar to the one shown in Figure 2, is written to the events-directory on HDFS. For example, if 10 DDL statements are executed in Hive, we get 10 corresponding event-files in the events-directory, ready to be processed by Big SQL.
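
The one-file-per-statement behaviour described above can be illustrated with a minimal sketch. This is not Big SQL’s or Hive’s actual code: a local temporary directory stands in for the HDFS events-directory, and the function name write_event_file, the file naming scheme and the JSON field names ("operation", "table") are all hypothetical, since the real event-file schema is only shown in Figure 2.

```python
import json
import tempfile
import uuid
from pathlib import Path


def write_event_file(events_dir: Path, event: dict) -> Path:
    """Serialize one DDL event to its own JSON file in events_dir."""
    # A unique file name per event keeps events from overwriting each other
    # (hypothetical naming scheme, chosen only for this sketch).
    path = events_dir / f"{uuid.uuid4().hex}.json"
    path.write_text(json.dumps(event), encoding="utf-8")
    return path


# Assumption: a local temp directory stands in for /user/bigsql/sync on HDFS.
events_dir = Path(tempfile.mkdtemp(prefix="sync_"))

# Hypothetical event payloads; the real event-file fields may differ.
ddl_events = [
    {"operation": "CREATE_TABLE", "table": "mybigtable"},
    {"operation": "ALTER_TABLE", "table": "mybigtable"},
    {"operation": "DROP_TABLE", "table": "mybigtable"},
]

# One DDL statement produces exactly one event-file.
written = [write_event_file(events_dir, event) for event in ddl_events]
print(f"{len(written)} DDL events -> {len(list(events_dir.glob('*.json')))} event-files")
```

Writing each event to its own file means the processing side can handle, and remove, events independently of one another.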

2. Event-File Processing
Once event-files are present in the events-directory, they need to be processed. Every ‘n’ seconds, Big SQL automatically parses all files in this pre-configured directory on HDFS, processes the associated DDL events, and updates the Big SQL catalog where necessary to reflect the relevant metadata changes. Here ‘n’ is the time, in seconds, that Big SQL waits between processing passes (see the Configuration section below for more details). Once a DDL event has been successfully processed, the associated event-file is removed from the events-directory.
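
The processing cycle just described (scan the directory, apply each event, remove the file on success, wait, repeat) can be sketched roughly as follows. This is an illustration under stated assumptions rather than Big SQL’s implementation: a local directory stands in for HDFS, the names process_events_once and run_sync_loop are hypothetical, the apply_event callback stands in for the catalog update, and leaving a failed event-file in place for a later pass is an assumption, since the text above only states what happens on success.

```python
import json
import tempfile
import time
from pathlib import Path


def process_events_once(events_dir: Path, apply_event) -> int:
    """Process every event-file once; remove each file only on success."""
    processed = 0
    for path in sorted(events_dir.glob("*.json")):
        try:
            event = json.loads(path.read_text(encoding="utf-8"))
            apply_event(event)  # stand-in for updating the Big SQL catalog
        except Exception:
            # Assumption: a failed event-file stays in place for a later pass.
            continue
        path.unlink()  # successfully processed, so remove the event-file
        processed += 1
    return processed


def run_sync_loop(events_dir: Path, apply_event, sleep_seconds: float, passes: int) -> int:
    """Run a bounded number of passes, waiting sleep_seconds between them."""
    total = 0
    for i in range(passes):
        total += process_events_once(events_dir, apply_event)
        if i < passes - 1:
            time.sleep(sleep_seconds)
    return total


# Demo: one well-formed event-file and one that cannot be parsed.
events_dir = Path(tempfile.mkdtemp(prefix="sync_"))
(events_dir / "0001.json").write_text('{"table": "mybigtable"}', encoding="utf-8")
(events_dir / "0002.json").write_text("not valid json", encoding="utf-8")

applied = []
handled = run_sync_loop(events_dir, applied.append, sleep_seconds=0.01, passes=2)
remaining = sorted(p.name for p in events_dir.glob("*.json"))
print(handled, remaining)
```

The loop is bounded by passes only so that the sketch terminates; a real background service would run until it is stopped.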

Configuration
There are two Auto-Sync related configuration parameters that you should be aware of. There is usually no need to modify them, but their values can be changed if necessary.

The bigsql.catalog.sync.events Parameter
This is the HDFS directory that event-files are written to and processed from. This parameter is accessible via Ambari (as shown in Figure 3), under:

Hive – Configs – Advanced – Custom hive-site – bigsql.catalog.sync.events

Fig.3 – bigsql.catalog.sync.events Parameter

The bigsql.catalog.sync.sleep Parameter
This parameter is available in more recent versions of Big SQL. It determines the duration (in seconds) Big SQL will wait before re-processing event-files from the events-directory. The default for this parameter is 30 seconds and is configurable in Ambari (as shown in Figure 3), under:

Hive – Configs – Advanced – Custom hive-site – bigsql.catalog.sync.sleep

The valid values for bigsql.catalog.sync.sleep are between 1 and 60 (seconds). If a value less than 1 is specified, bigsql.catalog.sync.sleep defaults to 1 second. If a value greater than 60 is specified, bigsql.catalog.sync.sleep defaults to 60 seconds.
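
To make the defaults and the 1 to 60 second range concrete, here is a small sketch that reads the two parameters from a hive-site style XML fragment and applies the limits described above. It assumes the custom hive-site properties end up in the usual Hadoop name/value property layout; the function names clamp_sleep and read_sync_settings and the sample values are hypothetical, and this is not how Big SQL itself reads its configuration.

```python
import xml.etree.ElementTree as ET
from typing import Tuple

DEFAULT_EVENTS_DIR = "/user/bigsql/sync"  # default events-directory
DEFAULT_SLEEP_SECONDS = 30                # default wait between passes
MIN_SLEEP_SECONDS, MAX_SLEEP_SECONDS = 1, 60


def clamp_sleep(value: int) -> int:
    """Values below 1 become 1; values above 60 become 60."""
    return max(MIN_SLEEP_SECONDS, min(MAX_SLEEP_SECONDS, value))


def read_sync_settings(hive_site_xml: str) -> Tuple[str, int]:
    """Return (events_dir, sleep_seconds) from a hive-site style document."""
    root = ET.fromstring(hive_site_xml.strip())
    props = {p.findtext("name"): p.findtext("value") for p in root.iter("property")}
    events_dir = props.get("bigsql.catalog.sync.events", DEFAULT_EVENTS_DIR)
    sleep = int(props.get("bigsql.catalog.sync.sleep", DEFAULT_SLEEP_SECONDS))
    return events_dir, clamp_sleep(sleep)


# Hypothetical fragment: the sleep value is deliberately out of range.
SAMPLE_HIVE_SITE = """
<configuration>
  <property>
    <name>bigsql.catalog.sync.events</name>
    <value>/user/bigsql/sync</value>
  </property>
  <property>
    <name>bigsql.catalog.sync.sleep</name>
    <value>120</value>
  </property>
</configuration>
"""

settings = read_sync_settings(SAMPLE_HIVE_SITE)
defaults = read_sync_settings("<configuration/>")
print(settings, defaults)
```

With the sample fragment the out-of-range value 120 is reduced to 60, while an empty configuration falls back to the defaults for both parameters.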

A Note on Table Statistics and Scheduler Cache
When Auto-Sync processes event-files, Big SQL may also schedule an Auto-Analyze task and, from Big SQL 4.2 onwards, the Big SQL Scheduler cache is also automatically flushed. For more information see:

Summary
In this blog we presented a high-level view of the Big SQL Auto-Sync architecture. We saw how JSON-formatted DDL event-files are written to the events-directory (by default: /user/bigsql/sync) and processed to ensure the Big SQL catalog and Hive metastore stay synchronized. We also presented the main configuration parameters available for controlling Auto-Sync behaviour.

In the next and final blog of this series, we’ll look at problem determination and explain what you can do if you experience issues related to Big SQL’s Auto-Sync feature.

Big SQL Automatic Catalog Synchronization (Part 2 - Architecture) - Hadoop Dev
