I’m often asked for top tips when it comes to tuning Big SQL for optimal performance, so I’ve pulled together the 7 tips I pass along most frequently. Rather than being in priority order, they’re listed in the order I think you’ll need them.
1. Big SQL database path:
Specify multiple paths, on multiple physical disks for the Big SQL database path (bigsql_db_path).
Specifying a set of paths for the Big SQL database path is a key decision to make prior to installing Big SQL 4.0 – as after installation, this is tricky (but not impossible) to change. Among other things, the Big SQL database path determines the location where Big SQL will store temporary working data during the execution of a query (assuming that working data does not fit into the memory assigned). Therefore, for queries that overflow to temporary storage, the performance of the storage underlying the Big SQL database path matters.
For this reason, the best practice is to spread the Big SQL database path across all of the physical disks assigned to store data in your Hadoop cluster. By default, only a single path (disk) per node is specified – so you need to change this during installation. Change the default to specify multiple paths, on multiple physical disks, separating each path with a comma. In the example below, /data1, /data2, /data3 etc… are paths on different physical disks:
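A minimal sketch of what this might look like (the exact form of the install property depends on your installer, and the mount points are illustrative):
bigsql_db_path=/data1,/data2,/data3,/data4,/data5,/data6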

If your Big SQL head node has fewer physical disks than the data nodes (which is the case in the standard reference architecture), then simply create the paths on the head node and assign multiple paths to the same physical disk. Using the example above, if your head node has 2 physical disks, then create paths /data1, /data3, /data5 etc… on disk 1, and /data2, /data4 etc… on disk 2.
The same principle applies if using GPFS – spread the GPFS file system across as many physical disks as possible, and then configure bigsql_db_path to a path on GPFS. For best performance, temporary data such as the Big SQL temporary space should not be replicated. GPFS allows users to create “filesets” with different replication factors – the default replication factor of the GPFS FPO file system is usually 3. The fileset has a root path and everything under the root directory inherits the fileset’s replication factor. So for optimal performance, bigsql_db_path should point to a directory in a fileset with replication factor of 1.
Spreading the IO across multiple physical disks in this manner will ensure Big SQL uses parallel IO when reading from and writing to the database path.
Big SQL is presumably not the only component running in your Hadoop cluster. You will also be running HDFS and are likely to be running MapReduce jobs, ZooKeeper, Hive, HBase, and so forth. Each of these components will also require disk resources to operate. So how should the disks on your cluster be shared between Big SQL, HDFS and MapReduce? The simple answer is to spread everything across all disks. I recommend allocating a portion of each data disk on the data nodes to store the Big SQL database, the HDFS dfs.data.dir, the yarn.nodemanager.local-dirs and the yarn.nodemanager.log-dirs. Spreading everything across all the disks in this manner is simple, and generally yields more than adequate IO performance.
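As a rough sketch (the directory names here are hypothetical – adjust them to your own mount points), a data node with disks mounted at /data1, /data2 and /data3 might be laid out like this, so that each component gets a slice of every disk:
dfs.data.dir=/data1/hdfs/data,/data2/hdfs/data,/data3/hdfs/data
yarn.nodemanager.local-dirs=/data1/yarn/local,/data2/yarn/local,/data3/yarn/local
yarn.nodemanager.log-dirs=/data1/yarn/logs,/data2/yarn/logs,/data3/yarn/logs
bigsql_db_path=/data1,/data2,/data3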
2. Resource sharing:
Determine what percentage of compute resource should be assigned to Big SQL.
A Hadoop cluster with Big SQL installed has its resources split between Big SQL and all other Hadoop components. The Big SQL percent usage (bigsql_pct_usage) install property specifies how much of the cluster’s resources (CPU and memory) are assigned to Big SQL. During install, specify a value for the bigsql_pct_usage property indicating the percentage of cluster resources you would like to allocate to Big SQL:
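For example, a 50:50 split between Big SQL and the rest of the cluster would look something like this (illustrative only – the exact form of the install property depends on your installer):
bigsql_pct_usage=50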

This can be tricky to get right at install time since it’s difficult to determine the optimal split of resources required between Big SQL and other Hadoop components until your workload is up and running. Fortunately, this property is relatively easy to change post install.
The exact resources you need to dedicate to Big SQL versus other Hadoop components depend upon the performance requirements of the Big SQL queries and Hadoop jobs within your workload. For example, you may have a 50:50 split of resources on your cluster, but find that the MapReduce ETL jobs in the cluster are not meeting the performance requirements of your organization, whereas users are more than happy with the performance of Big SQL. In this case you could re-assign cluster resources to dedicate more to MapReduce jobs and less to Big SQL.
To change the percentage of resources dedicated to Big SQL post install, use the AUTOCONFIGURE command. For example, to decrease Big SQL’s allocation from 50% to 40%, connect to the bigsql database and use:
CALL SYSHADOOP.BIG_SQL_SERVICE_MODE('ON');
AUTOCONFIGURE USING MEM_PERCENT 40 WORKLOAD_TYPE COMPLEX IS_POPULATED NO APPLY DB AND DBM;
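Once the new configuration has been applied, Big SQL can presumably be taken back out of service mode in the same way (hedging here – check the Knowledge Center for the exact arguments your release expects):
CALL SYSHADOOP.BIG_SQL_SERVICE_MODE('OFF');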
Corresponding adjustments will need to be made to the following YARN properties to increase YARN’s allocation (these are not done automatically by the AUTOCONFIGURE command):
- yarn.nodemanager.resource.memory-mb – found in the “Node Manager” section of the “YARN” configuration on the Ambari interface
- yarn.scheduler.maximum-allocation-mb – found in the “Scheduler” section of the “YARN” configuration on the Ambari interface
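As a rough worked example (illustrative numbers only, ignoring memory reserved for the operating system and other services): on data nodes with 128 GB (131072 MB) of RAM, moving Big SQL from 50% down to 40% frees roughly 10% of node memory for YARN, so yarn.nodemanager.resource.memory-mb could be raised from about 65536 MB to about 78643 MB, with yarn.scheduler.maximum-allocation-mb adjusted to match.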
3. Choose an optimized storage format
Big SQL 4.0 is optimized for the Parquet storage format.
The storage format used to store your Big SQL data is a critical contributor to the performance and efficiency of your cluster. Not only does the storage format have a direct influence on query times, it also impacts the volume of data stored (due to compression), CPU consumed to read and process that data, and data ingest times.
Although Big SQL 4.0 supports many of the popular Hadoop storage formats, for analytical workloads the Parquet storage format is recommended. Big SQL is highly optimized to process data in Parquet files, and internal tests have shown Parquet to have the most efficient compression ratio, and lowest query elapsed times.
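As a minimal sketch (the table and column names here are hypothetical), the STORED AS PARQUETFILE clause is all that is needed to pick the format, and data in an existing table of another format can be copied across with a simple INSERT…SELECT:
CREATE HADOOP TABLE SALES_PARQUET (
  SALE_ID BIGINT NOT NULL,
  SALE_DATE VARCHAR(10),
  AMOUNT DOUBLE)
STORED AS PARQUETFILE;
INSERT INTO SALES_PARQUET
  SELECT SALE_ID, SALE_DATE, AMOUNT FROM SALES_TEXT;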
4. PARTITION your data
Partition your data to improve query response times and cluster efficiency
Partitioning is a common practice in SQL over Hadoop solutions today. The aim of partitioning a table is to reduce the amount of data a query has to process in order to arrive at the final answer. By reducing the data processed, the query completes faster, and consumes fewer resources.
Both Big SQL and Hive use a similar partitioning scheme – specified by the “PARTITIONED BY” clause on the “CREATE HADOOP TABLE” statement. Big SQL stores different data partitions as separate files in HDFS and only scans the partitions/files required by the query.
For example, consider the following table holding details of items sold over a 5 year period:
CREATE HADOOP TABLE LINEITEM (
  L_ORDERKEY BIGINT NOT NULL,
  L_PARTKEY INTEGER NOT NULL,
  L_LINENUMBER INTEGER NOT NULL,
  L_QUANTITY FLOAT NOT NULL,
  L_DISCOUNT FLOAT NOT NULL,
  L_TAX FLOAT NOT NULL,
  L_SHIPINSTRUCT VARCHAR(25) NOT NULL,
  L_SHIPMODE VARCHAR(10) NOT NULL,
  L_COMMENT VARCHAR(44) NOT NULL)
PARTITIONED BY (L_SHIPDATE VARCHAR(10))
STORED AS PARQUETFILE;
The table is partitioned on the date the item was shipped – so data for every day over the 5 year period is stored in separate HDFS files (over 5 years, that’s approx. 1825 files). Now, if a query has a predicate based on the partitioning column, then Big SQL can eliminate all other partitions/files from the query without having to read them. For example, the following predicate would result in Big SQL reading and processing just 10 partitions to complete the query – instead of all 1825 – resulting in a large performance improvement:
l_shipdate >= date('2015-01-01') and l_shipdate <= date('2015-01-10')
To get the maximum benefit from partitioning, you should choose a partitioning key that is commonly referenced in range delimiting or equality predicates of the queries and that has a few hundred to a few thousand unique values – dates are commonly used. Avoid partitioning by unique identifiers or columns with high cardinality as this will create too many small files and can adversely impact performance.
Partitioning has many additional benefits including the ability to dynamically add and remove partitions to/from a table – making it very easy to roll-in new data and remove old data. In addition, when the ANALYZE command (see below) is run against a partitioned table, it will only ANALYZE those partitions that have no statistics already collected on them, and not the entire table.
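As a sketch of rolling partitions in and out (the partition values are hypothetical, and the exact ALTER TABLE … ADD/DROP PARTITION syntax supported by your release should be confirmed in the documentation):
ALTER TABLE LINEITEM ADD PARTITION (L_SHIPDATE = '2015-01-11');
ALTER TABLE LINEITEM DROP PARTITION (L_SHIPDATE = '2010-01-11');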
For more information on how to create partitioned tables, see the Big SQL Knowledge Center.
5. Use ANALYZE to gather statistics
Frequently use ANALYZE to update a table’s statistics
This is likely the single most important step in ensuring your Big SQL queries are executed efficiently and expediently. The ANALYZE command gathers statistics about the data within a table; these statistics are used by the Big SQL cost-based optimizer to make informed decisions about query execution – such as join order, join types, and when and where to sort data. Accurate and up-to-date statistics can improve query performance many fold.
ANALYZE should be run whenever:
- A new table is populated with data,
- An existing table’s data undergoes significant changes:
- new data added,
- old data removed,
- existing data is updated
However, statistics do not need to be gathered for all columns within a table. Only those columns that are referenced in predicates (including join predicates), used in aggregation functions, or used in informational constraints (see below) need to have statistics gathered. In addition, if there are sets of columns that have an inter-relationship, then column group statistics should be gathered on those columns. For example, columns storing the STATE and COUNTRY of a customer are likely to have an inter-relationship – all customers who live in the STATE of ‘California’ will have ‘USA’ for COUNTRY. By gathering column group statistics on STATE and COUNTRY together, the Big SQL optimizer is able to infer this relationship and use the information to plan query execution more efficiently.
The following example collects table and column level statistics on the ORDERS table, along with column group statistics on STATE and COUNTRY:
ANALYZE TABLE SIMON.ORDERS COMPUTE STATISTICS FOR COLUMNS ORDERKEY, CUSTKEY, ORDERSTATUS, TOTALPRICE, ORDERDATE, ORDERPRIORITY, STATE, COUNTRY, (STATE, COUNTRY);
The following query can be used to identify those tables in a schema that have no statistics:
select substr(tabname,1,20) as table, stats_time, card from syscat.tables where tabschema='SIMON' and card=-1;
And the following query to identify the columns within a table that have no statistics:
select substr(colname,1,20) as column, colcard from syscat.columns where tabschema='SIMON' and tabname='ORDERS' and colcard=-1;
6. Define Informational Constraints
Use Informational Constraints to define relationships within your schema.
Informational Constraints are much like the regular primary and foreign key constraints available in most RDBMSs today, except that the constraints are not enforced. One might ask what use a constraint is if it is not enforced – well, once defined, these constraints still provide useful information that the Big SQL optimizer can use when planning a query. As a simple example, if a query requires only distinct values from a column and a primary key constraint is defined on that column, the Big SQL optimizer knows that the column contains only unique values and no operations are necessary to ensure this uniqueness. Without such a constraint, the optimizer would have to sort and aggregate the data during query execution in order to guarantee uniqueness.
It is recommended to create primary and foreign key informational constraints to define relationships that exist in your schema. The following example defines primary keys on the ORDERS and LINEITEM tables, and a primary-foreign key relationship between ORDERS and LINEITEM:
alter table simon.orders add primary key (O_ORDERKEY) not enforced;
alter table simon.lineitem add primary key (L_ORDERKEY, L_LINENUMBER) not enforced;
alter table simon.lineitem add foreign key (L_ORDERKEY) references simon.orders (O_ORDERKEY) not enforced;
7. Use appropriate data types for the values being stored
In Big SQL 4.0 the use of STRING data types is strongly discouraged.
The STRING data type is commonly used in Hive to represent character data of variable (and undetermined) length. In Big SQL, the STRING type is mapped internally to VARCHAR(32K). The Big SQL engine works internally with data in units of 32K pages and is most efficient when the definition of a table allows a row to fit within 32K of memory. Once the calculated row size exceeds 32K, performance degradation can be seen for certain queries. Therefore, the use of STRING when defining the columns of a table is strongly discouraged. Instead, replace references to STRING with an explicit VARCHAR(n) that most appropriately fits the data, or use the bigsql.string.size property to lower the default size of the VARCHAR to which STRING is mapped when creating new tables.
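For instance (hypothetical table and column names), size character columns explicitly rather than declaring them as STRING:
-- NOTE is sized as VARCHAR(200) to fit the data, rather than STRING (i.e. VARCHAR(32K))
CREATE HADOOP TABLE CUSTOMER_NOTES (
  CUST_ID BIGINT NOT NULL,
  NOTE VARCHAR(200))
STORED AS PARQUETFILE;
Alternatively, lowering the bigsql.string.size property, as described above, shrinks the VARCHAR that STRING maps to for newly created tables.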
Is there anything else I should tune?
Well, there are other knobs within Big SQL and the larger Hadoop eco-system that can be tweaked to improve performance, but if you follow the 7 tips above then your cluster should hum along nicely. The Big SQL 4.0 installer automatically calculates the memory required for sorting, caching and the various heaps based on the memory available on the cluster and the Big SQL percent usage specified. As workloads are executed, Big SQL’s Self Tuning Memory Manager (STMM) monitors memory consumption across these consumers and automatically adjusts the allocations to tune for the workload being executed. So there’s no need to fiddle with Big SQL’s memory allocation – that’s all done automatically by STMM.
Big SQL 4.0 also has a built-in Workload Manager that can be used to prioritize work and prevent runaway queries. This workload manager is inherited from the workload manager in DB2 for Linux, UNIX and Windows. For more information, see the latest “DB2 Workload Management Guide and Reference”.
Finally, and by no means least, if you want the maximum performance out of your Big SQL cluster (and why wouldn’t you?), be sure to check that CPU scaling is disabled.
Additional information:
To get started, download the IBM Open Platform with Apache Hadoop.
Visit the Knowledge Center for installation and configuration information.