BigInsights Big SQL v4.2 brings lots of improvements to the ANALYZE Command. In particular, a lot of focus was paid to issues faced by users in previous versions. Here is a list of 10 major issues resolved for Analyze in v4.2.
1 – Automate all the things
As you know, Analyze collects statistics on tables. Without these statistics, Big SQL queries tend to perform poorly. One major challenge is Analyzing the tables that need it, when they need it. In Big SQL v4.2, Analyze is executed automatically on tables after a Load or HCAT_SYNC_OBJECTS operation. Analyze is also automatically scheduled to be executed on any table that has been queried, but does not have up-to-date statistics. You can read more about this in the Auto-Analyze article.
Another issue is, which columns to collect statistics on. Oftentimes, the list of columns needing statistics is not known beforehand. In these cases, the Analyze command has to be issued on the full list of the tableâ€™s columns. For tables with lots of columns (wide tables), this can be a tedious task. The ALL COLUMNS clause has been added to the Analyze syntax to make this exercise much easier. And not to worry, Analyze can handle lots of columns now – we’ve tested with over 900!
2 – Work smarter, not harder
Analyze execution can take a long time, especially on very large tables. To tackle this, Sampling has been introduced in v4.2. With Sampling, Analyze no longer has to scan the entire table in order to collect adequate statistics for optimal query performance. This drastically reduces the time Analyze takes to process large tables. Tests have shown over 5x improvement in Analyze execution time with Sampling.
3 – Out with the old, in with the new
It is generally better to have some stats than no stats at all. In previous versions of Big SQL, a new Analyze execution on a table would erase all previous statistics for the table. This is especially a problem when the new Analyze doesnâ€™t include columns that had statistics before. The Cumulative Statistics feature has been introduced in v4.2 to address this. With this feature, previous statistics are maintained whenever a new Analyze execution is performed on a table.
Analyze v2 is now the default in Big SQL v4.2. With this, Analyze v1 has been deprecated. All the features mentioned earlier are only supported in Analyze v2. There are also some more advantages to Analyze v2, as weâ€™ll see below.
4 – Hive bee-gone
Analyze v1 depends heavily on Hive. Among other things, it uses Hiveâ€™s Analyze to perform the scan of the table data. The Analyze v2 Tech Preview in Big SQL v184.108.40.206 also utilized Hive, but to a lesser extent. In both instances however, performance and functionality of Big SQL Analyze suffered due to certain limitations of Hive Analyze. In Big SQL v4.2, Analyze v2 has removed all dependencies on Hive Analyze. This has resulted in significant improvements to Big SQL Analyze execution. Additionally, custom modifications to Hive components are no longer needed.
5 – No More *YARN* Map/Reduce
With the removal of Hive Analyze dependencies also comes the removal of Map/Reduce. This was the mechanism used by Analyze v1 for scanning the table data. Analyze v2 instead utilizes Big SQLâ€™s Java Reader for this task. This frees users from having to perform and maintain Map/Reduce and YARN configurations for Big SQL Analyze. Map/Reduce is also very resource intensive, so users will see a lot of improvement here too.
6 – Less is more
As mentioned above, Analyze v2 uses a lot less resources with the removal of Map/Reduce. Other optimizations have also been made in v4.2 that reduce resource consumption even further. In tests, we have seen Analyze v2 use only 25% of the CPU and 20% of the memory of Analyze v1, while executing in the same amount of time or faster.
7 – Why run one when you can have two at twice the performance?
One limitation that the use of Hive Analyze and Map/Reduce created was that Big SQL Analyze could not be executed in parallel. Running Analyzes in parallel is a great way to speed up execution on multiple tables. Therefore, Analyze v2 in Big SQL v4.2 allows for concurrent executions. Additionally, no delay between executions is needed. In tests, we were able to execute 24 Analyze statements concurrently. This is not an upper limit, we just didnâ€™t feel the need to test for more. Performance of Analyze v2 improved over 11x faster than Analyze v1, when combined with Sampling.
8 – All together now
We did not forget about HBase tables! The headache here is that, in previous versions of Big SQL, Analyze on HBase tables with rowkey and/or secondary indexes is slow. This was due to Analyze performing the scan of the base table and indexes sequentially. That, coupled with those scans being Map/Reduce jobs, meant a significant amount of execution time. Analyze v2 in Big SQL v4.2 now processes an HBase table and its indexes all in a single scan. Youâ€™re going to like what you see.
9 – If at first you don’t succeed…
Deadlocks are an issue in any DBMS that has concurrent executions. This issue is compounded in parallel Analyze executions â€“ executions that will all need to make use of the same catalog resources. Fortunately, Analyze v2 in Big SQL v4.2 has this covered with automatic deadlock retries. This feature is completely invisible to the user, but it so greatly improves usability that we felt it worth mentioning.
10 – No partition left behind
Limitations in Hive Analyze prevented scanning of the default partitions (__HIVE_DEFAULT_PARTITION__) of a table. In earlier versions of Big SQL, the presence of these partitions would sometimes cause Analyze to fail. Not being able to scan these partition can skew the statistics, resulting in sub-optimal query execution plans. Analyze v2 in Big SQL v4.2 no longer suffers from these limitations, and can scan all partitions of a table.
Visit the Knowledge Center for more details on the ANALYZE command.
To get started, download the IBM Open Platform with Apache Hadoop.
Visit the Knowledge Center for installation and configuration information.