In the first part of this blog (Big SQL Problem Determination Part 1), we discussed how to classify a problem into some high-level categories (Crash, Error, Performance, Sys Admin, and I Don’t Know), and then introduced the bigsql-support.sh data collection tool to automate the collection of Big SQL diagnostic data from a cluster depending on the type of problem.
In part 2 of this blog, we'll focus on performance problems and on applications that do not appear to be responsive (hangs).
These types of performance/hang problems are a different beast and require a different approach.
Performance is in fact a huge topic, and it is not feasible to cover how to achieve optimal performance in a short blog. Such coverage would really be a multi-part course, with different topics addressing different aspects of performance! That goes a bit beyond what I wanted to accomplish here.
What we will do here though is present a tool that can help collect information that can supplement any performance investigation in Big SQL. The data collected from this tool can help to:
- Identify where/why something appears to be stuck
- Identify what things are running slow
- Track resources (such as memory and CPU) over a period of time
- Compare different environments (QA vs. prod, etc.)
If any of this information above seems like it may help for an investigation, then read on!
Problem Type: “Performance” (or something is not responsive like a “hang”)
It is unlikely that you will be able to solve a performance/hang problem by looking (only) at the configurations and log files provided by the bigsql-support tool data collection in a post-mortem fashion.
Sure, there are obvious cases where a performance problem is caused by an underlying error condition that is spamming the log files, but those are the exception to the rule.
A performance/hang problem necessitates a diagnostic data collection that is:
- Generated DURING the execution of the slow/unresponsive operation.
- Performed at regular intervals so that you can later compare the delta metrics of performance counters over a period of time.
The script is called bigsql-collect-perf.sh, and you can find it here:
Pro tip! This same tool is also useful for analyzing memory usage in Big SQL over a period of time!
Before running the script
Before we get to some example use cases for the script, there are two things to be aware of before you dive in:
bigsql-collect-perf.sh wants to use pstack to generate stack/thread dumps. By default, many clusters do not have pstack installed.
bigsql@oak1 ~> which pstack
/bin/pstack
pstack is a free open-source tool that is provided by the gdb debugger rpm package.
If possible, please ensure that gdb (which also gives you pstack) is installed on all hosts. While this is not critical (the script can run without it), it does add a bit more to the data collection and is generally recommended. By itself, gdb is a useful diagnostic aid and debugger to have installed anyway.
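As a quick sanity check before launching the script, you can verify pstack availability on a host yourself. Here is a minimal sketch; the install hint assumes a RHEL-family system where the gdb package provides pstack:

```shell
# Check whether pstack is available on this host; if not, either
# install gdb (which provides pstack) or plan to run the script with -N.
if command -v pstack >/dev/null 2>&1; then
    PSTACK_PATH=$(command -v pstack)
    echo "pstack found at: $PSTACK_PATH"
else
    echo "pstack not installed; install gdb (provides pstack) or use -N"
fi
```

Running a similar check over ssh on every host in the cluster would confirm the whole environment is ready before a long collection starts.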
When you launch the bigsql-collect-perf.sh script and pstack is not installed, the script will report this with an error message and fail. If you choose to run without pstack installed, use the -N option to tell the script that pstack collections will not be done; this bypasses the sanity check and lets the script continue without it.
bigsql-collect-perf.sh operates using the concept of intervals and sleep time. It will execute all of its data collections, then sleep for some time, then wake up and execute the data collection again. Wash, rinse, repeat.
This is built into the script, and of course it does have default options for the number of iterations and the sleep time, however we cannot know ahead of time what type of performance issue you are investigating, so it is up to you to decide on what would be good values for the number of iterations and the time between each iteration rather than rely on the defaults.
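The collect-then-sleep loop described above can be pictured with a small sketch. This is an illustration of the pattern only, not the script's actual code; collect_data is a made-up stand-in for the real collection steps:

```shell
ITERATIONS=3      # corresponds to the -i option: number of collection passes
SLEEP_SECS=1      # corresponds to the -s option: sleep between passes (short for the demo)

collect_data() {
    # Placeholder for the real work: snapshots, stacks, OS metrics, etc.
    echo "collection pass $1 at $(date +%H:%M:%S)"
}

i=1
while [ "$i" -le "$ITERATIONS" ]; do
    collect_data "$i"
    if [ "$i" -lt "$ITERATIONS" ]; then
        sleep "$SLEEP_SECS"       # wash, rinse, repeat
    fi
    i=$((i + 1))
done
```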
This is controlled using these two options of the script:
-i number_of_iterations -s sleep_time_between_each_iteration_in_seconds
For example, if you have a query that you know will run for 3 hours, using a sleep time of 300 seconds (5 minutes) with an interval count of 36 will give you a data collection that spans those 3 hours.
Don't pick a sleep time that is too small (it causes excessive monitoring overhead and generates too much data). Conversely, a sleep time that is too large will not give you enough granularity in the collected data to spot performance problems. Some judgment is called for here.
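A quick way to derive the -i value from an expected runtime and your chosen sleep time (the variable names here are illustrative, not script options):

```shell
EXPECTED_RUNTIME_SECS=$((3 * 60 * 60))   # e.g. a query known to run ~3 hours
SLEEP_SECS=300                           # 5 minutes between collections
ITERATIONS=$((EXPECTED_RUNTIME_SECS / SLEEP_SECS))
echo "use: -i $ITERATIONS -s $SLEEP_SECS"
```

This reproduces the 3-hour example above: 10800 seconds divided by 300 seconds gives 36 iterations.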
Note that if you press ctrl-c, the script will terminate, but first it will shut down cleanly and tar up all the files it has collected so far. It is therefore fine to overshoot the amount of time to run the script and cancel it with ctrl-c afterward.
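That ctrl-c behavior is the classic trap-and-cleanup shell pattern. The following is a minimal sketch of the pattern, not the script's actual code; the file names are made up:

```shell
# Run the "collection" inside a subshell; when the subshell exits
# (normally, or via INT/TERM from ctrl-c), the trap fires and tars
# up whatever was gathered so far.
(
    tmpdir=$(mktemp -d)                  # working area for collected files

    cleanup() {
        tar -czf collected-so-far.tar.gz -C "$tmpdir" .
        rm -rf "$tmpdir"
    }
    trap cleanup INT TERM EXIT

    echo "sample diagnostic" > "$tmpdir/sample.txt"
)
# collected-so-far.tar.gz now exists regardless of how the subshell ended
```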
Running the script
1) As the bigsql user on the head node, create a directory to hold the result file, and then cd into that directory:
bigsql@oak1 ~> whoami
bigsql
bigsql@oak1 ~> pwd
/home/bigsql
bigsql@oak1 ~> mkdir collectinfo
bigsql@oak1 ~> cd collectinfo/
bigsql@oak1 ~/collectinfo>
2) From within that directory, execute the script using the absolute path to the script location. The following examples give some scenarios about how you may want to run it:
- For a general issue, where you don’t know which query or job may be causing the problem, do the following (substituting valid values for intervals and sleep_time):
/usr/ibmpacks/bigsql/188.8.131.52/bigsql/install/bigsql-collect-perf.sh -i intervals -s sleep_time
- For an issue where you can reproduce it by executing a query from the command line, put the query text into a file terminated by a semi-colon, and then execute the query through the collection script like this (substituting valid values for the query file name, intervals and sleep_time):
/usr/ibmpacks/bigsql/184.108.40.206/bigsql/install/bigsql-collect-perf.sh -Q queryfile -i intervals -s sleep_time
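Creating the query file is straightforward: put the SQL text, terminated by a semicolon, into a plain file. The query and table name below are made up purely for illustration:

```shell
# Write an example query file; this is what gets passed via -Q.
cat > queryfile <<'EOF'
SELECT COUNT(*) FROM myschema.mytable;
EOF
```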
- For a hang problem, where you think something is really stuck rather than merely slow, run the script with the -H option (for this hang collection, just use the defaults for interval and sleep time).
Pro tip! If you suspect something is really hung rather than just slow, then sometimes even the monitoring commands may themselves hang. This presents a problem: we need to generate diagnostics for the hang, but at the same time the script must avoid getting stuck itself! This is why the script has the special -H option for hangs. With -H, the script avoids any commands that interact with the services themselves and focuses mainly on "outsider" commands that invoke dumps, traces, and other diagnostics that should not be prone to the hang itself.
In all cases, the script produces a tar file that contains the data and moves it into the current working directory.
Here is an example output from a run. This is an example of collecting data for a hang scenario using the -H option:
bigsql@oak1 ~/collectinfo> /usr/ibmpacks/bigsql/220.127.116.11/bigsql/install/bigsql-collect-perf.sh -H
Mon Aug 21 11:44:08 PDT 2017: Starting program to collect hang data ( Iterations = 3, Sleeptime = 60 )
Mon Aug 21 11:44:11 PDT 2017: Started bigsql-coll-snapshot-util.sh on oak1.fyre.ibm.com, PID: 6575
Mon Aug 21 11:44:11 PDT 2017: Started bigsql-coll-snapshot-util.sh on oak2.fyre.ibm.com, PID: 6581
Mon Aug 21 11:44:11 PDT 2017: Started bigsql-coll-snapshot-util.sh on oak3.fyre.ibm.com, PID: 6589
Mon Aug 21 11:44:11 PDT 2017: Waiting for data collection to finish
Mon Aug 21 11:47:15 PDT 2017: Finished data collection program. Doing housekeeping now. Pls. wait
Mon Aug 21 11:47:15 PDT 2017: Collecting the files
Mon Aug 21 11:47:15 PDT 2017: Tarring up the files on oak1.fyre.ibm.com and copying them over
Mon Aug 21 11:47:22 PDT 2017: Successfully copied over file /tmp/IBMData_20170821_1144.oak1.fyre.ibm.com.tar.gz.
Mon Aug 21 11:47:22 PDT 2017: Remove temp dir now from oak1.fyre.ibm.com
Mon Aug 21 11:47:23 PDT 2017: Tarring up the files on oak2.fyre.ibm.com and copying them over
Mon Aug 21 11:47:31 PDT 2017: Successfully copied over file /tmp/IBMData_20170821_1144.oak2.fyre.ibm.com.tar.gz.
Mon Aug 21 11:47:31 PDT 2017: Remove temp dir now from oak2.fyre.ibm.com
Mon Aug 21 11:47:31 PDT 2017: Remove the tar file from host oak2.fyre.ibm.com
Mon Aug 21 11:47:32 PDT 2017: Tarring up the files on oak3.fyre.ibm.com and copying them over
Mon Aug 21 11:47:40 PDT 2017: Successfully copied over file /tmp/IBMData_20170821_1144.oak3.fyre.ibm.com.tar.gz.
Mon Aug 21 11:47:40 PDT 2017: Remove temp dir now from oak3.fyre.ibm.com
Mon Aug 21 11:47:40 PDT 2017: Remove the tar file from host oak3.fyre.ibm.com
Mon Aug 21 11:47:46 PDT 2017: Tarring up the files
Mon Aug 21 11:47:52 PDT 2017: Data collection complete.
Final tar file is: /home/bigsql/collectinfo/BigSqlDataCollection.20170821_114746.tar.gz
bigsql@oak1 ~/collectinfo> ls -l
total 123348
-rw-r--r-- 1 bigsql hadoop 126304522 Aug 21 11:47 BigSqlDataCollection.20170821_114746.tar.gz
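Once you have the final tar file, unpack it to inspect the per-host archives inside. A sketch; the filename here is taken from the example run above, so substitute the one your run reports:

```shell
ARCHIVE=BigSqlDataCollection.20170821_114746.tar.gz
if [ -f "$ARCHIVE" ]; then
    mkdir -p extracted
    tar -xzf "$ARCHIVE" -C extracted   # holds one IBMData_*.tar.gz per host
    ls -l extracted
else
    echo "archive not found: $ARCHIVE"
fi
```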
There you have it: two tools for collecting data to help solve problems!