BigInsights Big SQL v4.2 brings lots of improvements to the ANALYZE Command. In particular, a lot of focus was paid to issues faced by users in previous versions. Here is a list of 10 major issues resolved for Analyze in v4.2.

1 – Automate all the things

As you know, Analyze collects statistics on tables. Without these statistics, Big SQL queries tend to perform poorly. One major challenge is Analyzing the tables that need it, when they need it. In Big SQL v4.2, Analyze is executed automatically on tables after a Load or HCAT_SYNC_OBJECTS operation. Analyze is also automatically scheduled to be executed on any table that has been queried, but does not have up-to-date statistics. You can read more about this in the Auto-Analyze article.

Another issue is, which columns to collect statistics on. Oftentimes, the list of columns needing statistics is not known beforehand. In these cases, the Analyze command has to be issued on the full list of the table’s columns. For tables with lots of columns (wide tables), this can be a tedious task. The ALL COLUMNS clause has been added to the Analyze syntax to make this exercise much easier. And not to worry, Analyze can handle lots of columns now – we’ve tested with over 900!

2 – Work smarter, not harder

Analyze execution can take a long time, especially on very large tables. To tackle this, Sampling has been introduced in v4.2. With Sampling, Analyze no longer has to scan the entire table in order to collect adequate statistics for optimal query performance. This drastically reduces the time Analyze takes to process large tables. Tests have shown over 5x improvement in Analyze execution time with Sampling.

3 – Out with the old, in with the new

It is generally better to have some stats than no stats at all. In previous versions of Big SQL, a new Analyze execution on a table would erase all previous statistics for the table. This is especially a problem when the new Analyze doesn’t include columns that had statistics before. The Cumulative Statistics feature has been introduced in v4.2 to address this. With this feature, previous statistics are maintained whenever a new Analyze execution is performed on a table.

Analyze v2 is now the default in Big SQL v4.2. With this, Analyze v1 has been deprecated. All the features mentioned earlier are only supported in Analyze v2. There are also some more advantages to Analyze v2, as we’ll see below.

4 – Hive bee-gone

Analyze v1 depends heavily on Hive. Among other things, it uses Hive’s Analyze to perform the scan of the table data. The Analyze v2 Tech Preview in Big SQL v4.1.0.2 also utilized Hive, but to a lesser extent. In both instances however, performance and functionality of Big SQL Analyze suffered due to certain limitations of Hive Analyze. In Big SQL v4.2, Analyze v2 has removed all dependencies on Hive Analyze. This has resulted in significant improvements to Big SQL Analyze execution. Additionally, custom modifications to Hive components are no longer needed.

5 – No More *YARN* Map/Reduce

With the removal of Hive Analyze dependencies also comes the removal of Map/Reduce. This was the mechanism used by Analyze v1 for scanning the table data. Analyze v2 instead utilizes Big SQL’s Java Reader for this task. This frees users from having to perform and maintain Map/Reduce and YARN configurations for Big SQL Analyze. Map/Reduce is also very resource intensive, so users will see a lot of improvement here too.

6 – Less is more

As mentioned above, Analyze v2 uses a lot less resources with the removal of Map/Reduce. Other optimizations have also been made in v4.2 that reduce resource consumption even further. In tests, we have seen Analyze v2 use only 25% of the CPU and 20% of the memory of Analyze v1, while executing in the same amount of time or faster.

7 – Why run one when you can have two at twice the performance?

One limitation that the use of Hive Analyze and Map/Reduce created was that Big SQL Analyze could not be executed in parallel. Running Analyzes in parallel is a great way to speed up execution on multiple tables. Therefore, Analyze v2 in Big SQL v4.2 allows for concurrent executions. Additionally, no delay between executions is needed. In tests, we were able to execute 24 Analyze statements concurrently. This is not an upper limit, we just didn’t feel the need to test for more. Performance of Analyze v2 improved over 11x faster than Analyze v1, when combined with Sampling.

8 – All together now

We did not forget about HBase tables! The headache here is that, in previous versions of Big SQL, Analyze on HBase tables with rowkey and/or secondary indexes is slow. This was due to Analyze performing the scan of the base table and indexes sequentially. That, coupled with those scans being Map/Reduce jobs, meant a significant amount of execution time. Analyze v2 in Big SQL v4.2 now processes an HBase table and its indexes all in a single scan. You’re going to like what you see.

9 – If at first you don’t succeed…

Deadlocks are an issue in any DBMS that has concurrent executions. This issue is compounded in parallel Analyze executions – executions that will all need to make use of the same catalog resources. Fortunately, Analyze v2 in Big SQL v4.2 has this covered with automatic deadlock retries. This feature is completely invisible to the user, but it so greatly improves usability that we felt it worth mentioning.

10 – No partition left behind

Limitations in Hive Analyze prevented scanning of the default partitions (__HIVE_DEFAULT_PARTITION__) of a table. In earlier versions of Big SQL, the presence of these partitions would sometimes cause Analyze to fail. Not being able to scan these partition can skew the statistics, resulting in sub-optimal query execution plans. Analyze v2 in Big SQL v4.2 no longer suffers from these limitations, and can scan all partitions of a table.

Additional Information

Visit the Knowledge Center for more details on the ANALYZE command.
To get started, download the IBM Open Platform with Apache Hadoop.
Visit the Knowledge Center for installation and configuration information.

2 comments on"10 Issues Resolved for Analyze in Big SQL v4.2"

  1. […] v2 does not use Map Reduce. Read more about why Big SQL Analyze moved away from Map Reduce in this issues resolved with Analyze v2 article. If you are on Big SQL v4.1.0.2 and the default Analyze v1 resulted in out of memory […]

  2. […] to create a new version number for it. Analyze v2 is now the default in v4.2 read more about Issues resolved with Analyze v2. Analyze v1 refers to the version that depends on the MapReduce framework and Analyze v2 brings all […]

Join The Discussion

Your email address will not be published. Required fields are marked *