Big SQL Scheduler Intro and Memory Improvements in 4.2 - Hadoop Dev

Big SQL is IBM's SQL-on-Hadoop solution. In this blog we'll cover the basics of the Big SQL Scheduler component and take a technical look at the improvements we made for the 4.2 release.

What is the Big SQL Scheduler?

Big SQL follows a high-level deployment design similar to that of other Hadoop services: there is one head node and many worker nodes. On every node, we run Big SQL's database core, built on top of DB2 technology. On the head node we also run the Big SQL Scheduler, which serves as a liaison between the database core and the other Hadoop components.

The Scheduler understands how to talk to Apache Hive's Metastore for table information, to Apache HDFS's Namenode for file information, and to Apache HBase for region information. The Scheduler knows how to efficiently divide the work of a SQL statement into what are known as 'splits' in the Hadoop world. It also tracks the status of the Big SQL worker nodes and assigns splits to available workers based on many factors, including data locality. This separation of concerns lets the Big SQL database core do what it does best (SQL!) without introducing Hadoop-specific details, which stay nicely compartmentalized inside the Scheduler.

What does a typical query initialization sequence look like in the Scheduler?

At runtime, the Scheduler’s main job is to understand what worker nodes are available, and what work is to be assigned. The steps below roughly describe the work for initializing a query on a Hadoop table.

Big SQL Scheduler Typical Interaction

1) First, the database core contacts the Scheduler, notifying it of the need to query table t.
2) The Scheduler then asks the Hive Metastore for information about table t. The Metastore functions like a catalog, and responds with details such as the table's file format, the table's data folder locations in HDFS, whether the data is compressed, whether the data is partitioned, and more. Since the Metastore is an independent process, potentially running on a different node, the Scheduler communicates with it via a 'Remote Procedure Call' (RPC) mechanism. This step can be costly in terms of time.
3) With the table’s data location acquired in step (2), the Scheduler now asks HDFS for information about the table’s files. Again, this step can be quite costly since HDFS is also an independent service called via RPC, and a table can have an unbounded number of files.
4) Using this information, the Scheduler then proceeds to ‘split’ the files and to assign them to the available workers. At this point, the scheduler also tries to eliminate the scanning of unnecessary partitions.
5) Finally, the Scheduler returns control to the database core, and the database core drives other Big SQL components to fulfill the query on table t.
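To make steps (2) through (4) concrete, here is a minimal, hypothetical Java sketch using the public Hive Metastore and HDFS client APIs. The surrounding structure is illustrative only, not the actual Scheduler code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

public class SchedulerSketch {
    public static void main(String[] args) throws Exception {
        // Step (2): ask the Hive Metastore for table metadata (an RPC).
        HiveMetaStoreClient metastore = new HiveMetaStoreClient(new HiveConf());
        Table t = metastore.getTable("default", "t");
        String location = t.getSd().getLocation(); // e.g. hdfs://.../warehouse/t

        // Step (3): ask the HDFS Namenode for the table's files and block
        // locations (more RPCs; cost grows with the number of files).
        FileSystem fs = FileSystem.get(new Configuration());
        for (FileStatus file : fs.listStatus(new Path(location))) {
            BlockLocation[] blocks =
                fs.getFileBlockLocations(file, 0, file.getLen());
            // Step (4) would turn each block into a 'split' and assign it
            // to a worker, preferring the hosts in blocks[i].getHosts().
        }
    }
}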

Note that these steps are repeated for every single query that touches Hadoop tables, so we want this process to be as fast as possible. To help with that goal, we aggressively cache the results of steps (2) and (3). But there is always a price to pay: maintaining a cache means that we use significant memory, and that we have to deal with cache coherence.
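A minimal sketch of such a per-table cache, assuming a hypothetical TableMetadata value type that holds the results of steps (2) and (3); the real Scheduler cache is more elaborate:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public final class MetadataCache {
    // Hypothetical value type: file format, data locations, file/block info.
    public static final class TableMetadata { }

    private final ConcurrentMap<String, TableMetadata> byTable =
        new ConcurrentHashMap<>();

    // Pays the Metastore/Namenode RPC cost only on a cache miss.
    public TableMetadata get(String table) {
        return byTable.computeIfAbsent(table, MetadataCache::loadFromServices);
    }

    // Cache coherence: statements that change the table must invalidate it.
    public void invalidate(String table) {
        byTable.remove(table);
    }

    private static TableMetadata loadFromServices(String table) {
        // The RPCs from steps (2) and (3) would go here.
        return new TableMetadata();
    }
}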

Big SQL Scheduler now consumes 3-5x less memory for typical workloads

From Big SQL version 4.2 onward, for typical workloads on partitioned Hadoop tables, the Scheduler consumes 3-5x less memory than in previous versions. To achieve this, we looked closely at the memory footprint of all the data structures we keep at runtime. We knew that the memory consumption was mostly due to the cost of keeping the cache, but we also observed some interesting behaviors:

A) If your workload runs mostly on top of HBase tables, the Scheduler's cache memory consumption is minimal. This is because HBase manages its own HDFS files internally, so the information cached in the Scheduler is small, mostly coming from step (2); step (3) is not needed.

B) If your workload runs mostly on top of Hadoop tables, then roughly 35% of the Scheduler's memory was used to cache Hive metadata and the remaining 65% to cache HDFS metadata.

We did a deep dive on the findings from point (B) and found that the main culprit was a less-than-optimal in-memory representation. For example, we were keeping HDFS file names in memory as Java String objects. String objects are backed by char[] arrays, and a Java char is a UTF-16 code unit, meaning each char takes 2 bytes. When you are caching hundreds of thousands of file names, those bytes add up. We stopped using Java String objects and opted for a UTF-8 representation, effectively cutting memory usage for file names by 50% in the common (mostly ASCII) case.
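A minimal sketch of the idea, with a hypothetical CachedFileName wrapper (not the actual Scheduler class):

import java.nio.charset.StandardCharsets;

public final class CachedFileName {
    // ~1 byte per character for ASCII-heavy HDFS paths, vs. 2 with UTF-16.
    private final byte[] utf8;

    public CachedFileName(String name) {
        this.utf8 = name.getBytes(StandardCharsets.UTF_8);
    }

    // Decode back to a String only when the name is actually needed,
    // trading a little CPU time for roughly half the resident memory.
    public String asString() {
        return new String(utf8, StandardCharsets.UTF_8);
    }
}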

Another interesting observation was that we would keep around the objects generated by our RPC mechanism of choice, Apache Thrift. Thrift's code generator favors simplicity and avoids extra library dependencies. For example, a list<i32> in a Thrift specification gets translated in Java to an ArrayList<Integer>. Think about that for a second. A primitive int in Java consumes 4 bytes, yet an Integer object consumes at least 16 bytes: an object header (at least 8 bytes), the primitive instance field (4 bytes), and 4 bytes of padding, because the JVM aligns objects on 8-byte boundaries. And the ArrayList still needs a reference to each Integer on top of that. The solution here was to use Thrift for RPC only, and to quickly convert and discard the Thrift objects. We replaced these and many other structures with their primitive equivalents. In some cases, we also applied the Flyweight pattern to share objects instead of keeping multiple copies.
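A sketch of that conversion, assuming a hypothetical helper (the real code handles more cases than list<i32>):

import java.util.List;

public final class ThriftConversions {
    // An ArrayList<Integer> costs a slot reference plus a ~16-byte Integer
    // per element; an int[] costs 4 bytes per element plus one array header.
    public static int[] toPrimitive(List<Integer> boxed) {
        int[] out = new int[boxed.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = boxed.get(i);
        }
        return out; // the Thrift-generated list can now be garbage collected
    }
}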

Both of these examples trade CPU time for more efficient memory use. As we found out, it was a good tradeoff for the typical use cases of the Big SQL Scheduler.

Big SQL Scheduler now responds 25x faster for workloads with hundreds of thousands of partitions

Big SQL customers are now querying tables with hundreds of thousands of partitions while concurrently adding one or more partitions to those tables. One limitation we had in this area was that after a new partition was registered, we would invalidate the Scheduler cache for that particular table. This meant that we had to reload the cache, and on a table of this size, the reload was the single most expensive part of the query. Version 4.2 now applies delta changes to the cache, and for workloads that include a SELECT after a transaction that modifies partitions (think INSERT, LOAD, DROP PARTITION, etc.), we observe a 25x improvement when the table has 10,000 partitions or more.
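The shape of that change, as a hypothetical sketch (PartitionInfo and the cache layout are illustrative assumptions, not the actual Scheduler code):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public final class PartitionCache {
    // Hypothetical cached partition metadata: location, files, etc.
    public static final class PartitionInfo { }

    // table name -> (partition name -> cached partition metadata)
    private final Map<String, Map<String, PartitionInfo>> cache =
        new ConcurrentHashMap<>();

    // Pre-4.2 behavior: drop the whole entry, forcing a full reload
    // of every partition on the next query against the table.
    public void invalidate(String table) {
        cache.remove(table);
    }

    // 4.2 behavior: patch only the affected partition in place, so a
    // table with 100,000 partitions is not reloaded for one new partition.
    public void addPartition(String table, String name, PartitionInfo p) {
        cache.computeIfAbsent(table, k -> new ConcurrentHashMap<>())
             .put(name, p);
    }

    public void dropPartition(String table, String name) {
        Map<String, PartitionInfo> parts = cache.get(table);
        if (parts != null) {
            parts.remove(name);
        }
    }
}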

Some tips for further controlling memory usage in the Big SQL Scheduler

The Scheduler reports its memory usage every 5 minutes to the log at $BIGSQL_DIST_VAR/logs/bigsql-sched-recurring-diag-info.log. Inspecting this file can tell you whether the Scheduler is memory-constrained:

[xabriel@test-cluster bigsql]# cd $BIGSQL_DIST_VAR/logs
[xabriel@test-cluster logs]# tail bigsql-sched-recurring-diag-info.log
...
2016-07-11 13:14:33,979 INFO com.ibm.biginsights.bigsql.scheduler.server.RecurringDiag [scheduler-recurring-diag] : maximum-allocatable-memory (2,147,483,648); currently-total-allocated-memory (536,870,912); currently-used-memory (106,590,160); currently-free-memory (430,280,752)

The Scheduler is configured with a maximum of 2 GB of Java heap space by default. To raise or lower this limit, refer to the official documentation. Remember that:

  • The longer an HDFS file's name, the more memory it consumes in the cache.
  • The more HDFS files a Hadoop table has, the more memory the Scheduler uses. Hadoop in general prefers fewer, larger files over many small files.

This work was a joint effort between Xabriel J. Collazo Mojica, Abhayan Sundararajan, and Michelle Jou.

[{"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSCRJT","label":"IBM Db2 Big SQL"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

UID

ibm16259935