This blog is the second installment in a series that outlines all you need to know to start working with Big SQL’s Automatic Catalog Synchronization (Auto-Sync). In part 1 we introduced Auto-Sync, discussing its significance, the problem it addresses and how it can be enabled or disabled via the Ambari GUI. In this blog we’ll provide more details on the feature’s architecture and configuration options.
At a high level, as shown in Figure 1, Big SQL’s Auto-Sync feature consists of two core components:
- Event-File Generation
- Event-File Processing
1. Event-File Generation
For every DDL statement that results in an update to the Hive metastore, the relevant DDL event information is serialized and written to a JSON-formatted file in a predetermined location on HDFS. This HDFS location is known as the events-directory and by default is: “/user/bigsql/sync”.
Figure 2 shows an example of a DDL event written to a JSON event-file. This event-file is associated with an ALTER TABLE statement executed against the table mybigtable. In fact, this is the file generated by our example in part 1, where a new column was added, via Hive, to our existing table, ‘mybigtable’.
Each DDL statement executed in Hive (CREATE, DROP, ALTER) that causes a change in the Hive metastore produces its own JSON event-file, similar to the one shown in Figure 2, in the events-directory on HDFS. For example, if 10 DDL statements are executed in Hive, we get 10 corresponding event-files in the events-directory, ready to be processed by Big SQL.
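As a rough illustration of the one-file-per-DDL-event idea, here is a minimal Python sketch. It is not Big SQL's implementation: it writes to a local temporary directory rather than HDFS, and the JSON field names (`eventType`, `table`, `timestamp`) and the helper `write_event_file` are invented for this example, since the real event-file layout is the one shown in Figure 2.

```python
import json
import tempfile
import time
import uuid
from pathlib import Path


def write_event_file(events_dir: Path, event_type: str, table: str) -> Path:
    """Serialize one DDL event to its own JSON file in events_dir."""
    events_dir.mkdir(parents=True, exist_ok=True)
    event = {
        "eventType": event_type,   # hypothetical field name
        "table": table,            # hypothetical field name
        "timestamp": time.time(),  # hypothetical field name
    }
    # A unique file name per event, so events never overwrite each other.
    path = events_dir / f"{uuid.uuid4().hex}.json"
    path.write_text(json.dumps(event))
    return path


# Local stand-in for the events-directory (the real default is on HDFS).
events_dir = Path(tempfile.mkdtemp()) / "sync"
for statement in ["CREATE_TABLE", "ALTER_TABLE", "DROP_TABLE"]:
    write_event_file(events_dir, statement, "mybigtable")

# Three DDL statements produce three event-files.
event_count = len(list(events_dir.glob("*.json")))
print(event_count)
```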
2. Event-File Processing
Event-files in the events-directory are picked up automatically. Every ‘n’ seconds, Big SQL scans this pre-configured HDFS directory, processes the DDL event in each file and updates the Big SQL catalog where necessary to reflect the metadata change. Here ‘n’ is the time, in seconds, that Big SQL waits between processing passes (see the Configuration section below for more details). Once a DDL event has been successfully processed, the associated event-file is removed from the events-directory.
There are a couple of Auto-Sync related configuration parameters that you should be aware of. There is usually no need to modify them, but their values can be changed if necessary.
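The scan, process and remove cycle described above can be sketched in Python as follows. This is only an illustration under stated assumptions: a local directory stands in for HDFS, a plain dictionary stands in for the Big SQL catalog, the event field names are hypothetical, and leaving an unprocessable file in place is an assumption of this sketch rather than documented Big SQL behaviour.

```python
import json
import tempfile
import time
from pathlib import Path


def apply_event(catalog: dict, event: dict) -> None:
    """Apply one DDL event to the stand-in catalog (field names are hypothetical)."""
    table = event["table"]
    if event["eventType"] == "DROP_TABLE":
        catalog.pop(table, None)
    else:
        catalog[table] = event.get("columns", [])


def process_events_once(events_dir: Path, catalog: dict) -> int:
    """Process every event-file once; return how many were applied and removed."""
    processed = 0
    for path in sorted(events_dir.glob("*.json")):
        try:
            apply_event(catalog, json.loads(path.read_text()))
        except (ValueError, KeyError, TypeError):
            continue  # assumption: a file that fails to process stays in place
        path.unlink()  # remove the event-file only after successful processing
        processed += 1
    return processed


def run_sync_loop(events_dir: Path, catalog: dict, sleep_seconds: float, passes: int) -> None:
    """Repeat the scan, waiting sleep_seconds ('n') between passes."""
    for _ in range(passes):
        process_events_once(events_dir, catalog)
        time.sleep(sleep_seconds)


# Demo: one valid event-file and one malformed file.
events_dir = Path(tempfile.mkdtemp())
good = {"eventType": "ALTER_TABLE", "table": "mybigtable", "columns": ["id", "new_col"]}
(events_dir / "0001.json").write_text(json.dumps(good))
(events_dir / "bad.json").write_text("{not valid json")

catalog = {}
processed = process_events_once(events_dir, catalog)
remaining = sorted(p.name for p in events_dir.glob("*.json"))
print(processed, catalog, remaining)
```

In this sketch `run_sync_loop` simply repeats the same pass with a pause in between, which is the role the sleep interval plays in Auto-Sync.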
- The bigsql.catalog.sync.events Parameter
This is the HDFS directory that event-files are written to and processed from. This parameter is accessible via Ambari (as shown in Figure 3), under:
Hive - Configs - Advanced - Custom hive-site - bigsql.catalog.sync.events
- The bigsql.catalog.sync.sleep Parameter
This parameter is available in more recent versions of Big SQL. It determines how long (in seconds) Big SQL waits between processing passes over the events-directory. The default is 30 seconds, and the value is configurable in Ambari (as shown in Figure 3), under:
Hive - Configs - Advanced - Custom hive-site - bigsql.catalog.sync.sleep
The valid values for bigsql.catalog.sync.sleep are between 1 and 60 (seconds). If a value less than 1 is specified, bigsql.catalog.sync.sleep defaults to 1 second. If a value greater than 60 is specified, bigsql.catalog.sync.sleep defaults to 60 seconds.
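The range rule above amounts to clamping the configured value to the interval from 1 to 60. A minimal Python sketch of that rule, where the function and constant names are made up for illustration and are not part of Big SQL:

```python
MIN_SLEEP_SECONDS = 1
MAX_SLEEP_SECONDS = 60


def effective_sync_sleep(configured: int) -> int:
    """Return the sleep interval that takes effect for a configured value."""
    # Values below 1 become 1; values above 60 become 60.
    return max(MIN_SLEEP_SECONDS, min(MAX_SLEEP_SECONDS, configured))


results = {value: effective_sync_sleep(value) for value in (0, 1, 30, 60, 120)}
print(results)  # {0: 1, 1: 1, 30: 30, 60: 60, 120: 60}
```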
When Auto-Sync processes event-files, Big SQL may also schedule an Auto-Analyze task and, from Big SQL 4.2 onwards, the Big SQL Scheduler cache is also flushed automatically. For more information see:
In this blog we presented a high-level view of the Big SQL Auto-Sync architecture. We saw how JSON-formatted DDL event-files are written to the events-directory (by default: /user/bigsql/sync) and processed to ensure the Big SQL catalog and Hive metastore stay synchronized at all times. We also presented the main configuration parameters available for controlling Auto-Sync behaviour.
In the next and final blog of this series, we’ll look at problem determination and explain what you can do if you experience issues related to Big SQL’s Auto-Sync feature.
- Big SQL Automatic Catalog Synchronization (Part 1 – Introduction)
- Big SQL Automatic Catalog Synchronization (Part 3 – Problem Determination)
- Big SQL Automatic Catalog Synchronization – Error Handling
- Automatic Hive catalog syncing to the Big SQL catalog
- Accessing tables created in Hive and files added to HDFS from Big SQL
- Hive and Big SQL catalogs are inconsistent