Big SQL Problem Determination: Data Collection (Part 1) - Hadoop Dev


Introduction

Problem determination is not simple. The solution to a problem might be simple (if you’re lucky!), but going from PROBLEM ==> SOLUTION is where the time and effort are spent!

Let’s put on our detective hats! In order for a detective to solve a crime, they need evidence and clues. What information do you need in order to be able to properly investigate and solve a problem in Big SQL?

As a frequent investigator of diagnostics myself, often I do not know what information I need until I need it! We cannot always predict what information is going to have the clues to crack open the case.

Fortunately in Big SQL, we have a number of data collection tools to make this simpler. In the following sections of this blog, we’ll review what data to collect as determined by the nature of the problem you are experiencing.

Specifically, we’ll be introducing 2 data collection scripts that are built into Big SQL version 5.0.1:

  • bigsql-support.sh – a script that collects all Big SQL logs and configuration information.
  • bigsql-collect-perf.sh – a performance data collector that gathers “live” information and performance data at regular intervals.
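
Both scripts ship with the Big SQL installation on the head node. If you want to confirm that the support script is in place before you need it, here is a minimal sketch (the path matches the one used later in this post; the version directory varies with your software level, and bigsql-collect-perf.sh may live elsewhere depending on your level):

  # Hypothetical check: confirm the support collection script is installed.
  # The version directory (e.g. 5.0.1.0) varies with your software level.
  ls -l /usr/ibmpacks/bigsql/*/bigsql/install/bigsql-support.sh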

Step 1) Classify the problem

You can’t investigate a problem if you don’t know what the problem is!

Ask yourself these questions:

  • Crash: Is it a crash, did something (or some service) go down?
  • Error: Is it an error message, where something failed, but the services are still running?
  • Performance: Is something unresponsive, hung, or simply taking too long to return?
  • Sys Admin: Is it an install problem, or an issue with adding/dropping a node?
  • I Don’t Know: I have no clue what the problem is!

Bonus questions for extra fame and glory:

  • When did the problem happen?
  • Was the problem caused by a new or ad hoc task/query/workload?
  • Was the problem caused by a task/query/workload that historically worked fine, but has just started to be an issue recently?
  • Has anything changed in the environment between the time when the task/query/workload was successful to now when it is no longer successful?

Step 2) Collect data for the problem, depending on the problem type

Problem Type: “I Don’t Know”

If the problem description is too vague or there’s some confusion around what the actual problem is, then get everything!

In Ambari, click on IBM Big SQL -> Service Actions -> Collect Big SQL Logs. This launches the bigsql-support tool with a default set of options that collects all logs and configs from all nodes.

(Figure: the Collect Big SQL Logs service action in Ambari)

Since it’s collecting information from all nodes, the resulting file may be large, but it’s better to get everything rather than risk missing a key piece of information that could help solve the case. Drill down into the Ambari output to see the files that it created. The resulting tar.gz file will contain all of the diagnostics.

(Figure: Collect logs output)
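
Once the collection finishes, you can peek inside the archive before sending it anywhere. A minimal sketch, assuming the archive lands under /tmp/bigsql/support as in the command-line example later in this post (the exact file name will differ on your cluster):

  # Hypothetical helper: list the contents of the newest support archive
  # without extracting it. Adjust the path if your archive lands elsewhere.
  newest=$(ls -t /tmp/bigsql/support/bigsql_support_full_archive_*.tar.gz | head -1)
  tar -tzf "$newest" | head -20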

Problem Types: Crash, Error, or Sys Admin

From the OS shell command line, use the bigsql-support tool, executed from the head node as the bigsql user. The default syntax is as follows (referencing the appropriate version identifier for your software level, in this case 5.0.1.0), which will collect information from the head node only:
/usr/ibmpacks/bigsql/5.0.1.0/bigsql/install/bigsql-support.sh

That’s it! No arguments needed.
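
If you are logged in as root rather than the bigsql user, one way to launch the default collection is to switch users first. A minimal sketch, assuming a default 5.0.1.0 install and that bigsql is the service user on your cluster:

  # Run the default head-node collection as the bigsql service user.
  su - bigsql -c "/usr/ibmpacks/bigsql/5.0.1.0/bigsql/install/bigsql-support.sh"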

If you will be speaking to a support engineer for a PMR, this is the essential “must gather” first data collection for just about any problem you can think of, as it provides all the logs, software levels, configurations, and other background information from the head node that will form the starting point of any investigation.

Important:
The above syntax only gets information from the head node. This is a good starting point for investigating many issues; however, if a crash or error is specific to a particular worker node, then you will need to refine the data collection strategy so that it also collects information from other nodes, such as the Big SQL workers.

Of course, the easiest way to do this would be to simply collect it from all nodes! If you take that approach, then please see the method above for the “I don’t know” problem type.

Why not always just get the info from all nodes? ==> The amount of data from all the hosts in a cluster may be large and slow to transfer to interested parties (i.e., the Big SQL support team). Certainly, a full collection from every node is the most comprehensive, but often it’s overkill. Instead, whenever possible, try to limit the collection to the nodes that you know are interesting for a particular problem.

To collect information from workers:
Create a file that contains a list of the hosts, with each hostname on its own line, and use the -f option of the tool with this file as input.

Example contents of an input file “support_hosts.txt”:

  bigsql@oak ~/temp> cat support_hosts.txt
  oak1.fyre.ibm.com
  oak3.fyre.ibm.com
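
If you are not sure which hosts to list, one way to seed this file is from db2nodes.cfg (shown later in this post). This is a minimal sketch that simply lists every distinct host in the cluster; trim it down to the nodes of interest before running the collection:

  # Extract the unique hostnames (second column) from db2nodes.cfg and
  # write them one per line, which is the format the -f option expects.
  awk '{print $2}' /home/bigsql/sqllib/db2nodes.cfg | sort -u > support_hosts.txt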

Example data collection for a specific set of hosts:

  bigsql@oak1 ~/temp> /usr/ibmpacks/bigsql/5.0.1.0/bigsql/install/bigsql-support.sh -f support_hosts.txt
  Log file for this shell on oak1.fyre.ibm.com is: /tmp/bigsql/logs/bigsql-support-2017-08-21_09.57.47.4156.log
  bigsql-support.sh                                              100%   34KB  34.2KB/s   00:00
  bigsql-util.sh                                                 100%   23KB  23.4KB/s   00:00
  Spawned child 5147
  Log file for this shell on oak1.fyre.ibm.com is: /tmp/bigsql/logs/bigsql-support-2017-08-21_09.57.47.4156.log
  Executing db2support tool ...
  bigsql-support.sh                                              100%   34KB  34.2KB/s   00:00
  bigsql-util.sh                                                 100%   23KB  23.4KB/s   00:00
  Spawned child 5629
  Total numPid:2
  Waiting for 0, pid 5147
  Log file for this shell on oak3.fyre.ibm.com is: /tmp/bigsql/logs/bigsql-support-2017-08-21_09.57.47.4156.log
  chmod: changing permissions of ‘/tmp/bigsql/logs’: Operation not permitted
  Executing db2support tool ...
  Adding db2support file to archive ...
  Adding miscelleneous data collection to the archive ...
  Adding install logs files to archive ...
  Adding bigsql diagnostic logs logs to archive ...

  Support processing is complete on oak1.fyre.ibm.com. Log file can be found at /tmp/bigsql/logs/bigsql-support-2017-08-21_09.57.47.4156.log

  bigsql-support: Success on idx:0 host:oak1.fyre.ibm.com pid:5147[0]
  Waiting for 1, pid 5629
  Adding db2support file to archive ...
  Adding miscelleneous data collection to the archive ...
  Adding install logs files to archive ...
  Adding bigsql diagnostic logs logs to archive ...

  Support processing is complete on oak3.fyre.ibm.com. Log file can be found at /tmp/bigsql/logs/bigsql-support-2017-08-21_09.57.47.4156.log

  bigsql-support: Success on idx:1 host:oak3.fyre.ibm.com pid:5147[1]

  Support processing is complete on oak1.fyre.ibm.com. Log file can be found at /tmp/bigsql/logs/bigsql-support-2017-08-21_09.57.47.4156.log

  Support archive is:
  /tmp/bigsql/support/bigsql_support_full_archive_2017-08-21_09.57_ver1.tar.gz

Pro tips!!

Which nodes should I get info from?
Must gather always! : head node

For a crash or error problem, if you know that “something happened” on worker node n, then in addition to the head node, ensure that you collect information from the given worker by specifying the appropriate hosts and using -f like the above example.

How do I know which worker was responsible for an error?

Some error codes in Big SQL give you a hint about which node has produced the error. Take this error for example:

  [State: 58040][Code: -5105]: The statement failed because a Big SQL component encountered an error. Component receiving the error: "DDL FMP". Component returning the error: "DDL FMP". Log entry identifier: "[BSL-0-605a648c8]".. SQLCODE=-5105, SQLSTATE=58040, DRIVER=3.72.24  

The “Log entry identifier” has a clue in it. The syntax of this identifier is:
log_type-node_number-log_entry_identifier

Log type may be: BSL, NRL, SCL, etc. The Big SQL support tool collects all of these logs. The key thing to pay attention to is the node number. This is the logical node number as reflected in the first column of the db2nodes.cfg file.

  bigsql@oak1 ~> cat /home/bigsql/sqllib/db2nodes.cfg
  0 oak1.fyre.ibm.com 0 oak1.fyre.ibm.com
  1 oak2.fyre.ibm.com 0 oak2.fyre.ibm.com
  2 oak3.fyre.ibm.com 0 oak3.fyre.ibm.com
  3 oak2.fyre.ibm.com 1 oak2.fyre.ibm.com
  4 oak3.fyre.ibm.com 1 oak3.fyre.ibm.com

In the above example error message, the identifier BSL-0-605a648c8 points to logical node 0, which corresponds to the host oak1.fyre.ibm.com; for a data collection around this problem it would therefore be important to ensure data is collected from that host.
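
To automate that lookup, here is a minimal sketch (the identifier value is the one from the example above, and the db2nodes.cfg path is the default shown above; substitute your own values):

  # Pull the node number out of a "Log entry identifier" such as BSL-0-605a648c8
  # and map it to a hostname via the first two columns of db2nodes.cfg.
  log_id="BSL-0-605a648c8"
  node_num=$(echo "$log_id" | cut -d'-' -f2)
  awk -v n="$node_num" '$1 == n {print $2}' /home/bigsql/sqllib/db2nodes.cfg | sort -u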

For reference, see also Diagnosing Big SQL Errors.

Problem Type: Performance (or something is not responsive like a “hang”)

See Part 2 of the data collection blog

[{"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSCRJT","label":"IBM Db2 Big SQL"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

UID

ibm16259823