IBM Support

Troubleshooting Map Reduce Errors - Hadoop Dev

Technical Blog Post


Abstract

Troubleshooting Map Reduce Errors - Hadoop Dev

Body

The Big SQL LOAD HADOOP command uses a Map Reduce framework, for database administrators who are not so familiar with combing through map reduce logs, this blog hopes to shed some light on how to troubleshoot such issues. In this blog I will focus on one error I experienced on my environment and the steps I took to debug the issue. Most likely you won’t hit this exact issue but you can use these steps to figure out the root cause of your problem.

Example Error

    load hadoop using file url '/user/bigsql/hadoopds1000g/store_sales' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table store_sales overwrite;  SQL Exception(s) Encountered:  [State: 58040][Code: -5105]: The statement failed because a Big SQL component encountered an error. Component receiving the error: "DDL FMP". Component returning the error: "DDL FMP". Log entry identifier: "[BSL-0-783ead44f]".. SQLCODE=-5105, SQLSTATE=58040, DRIVER=3.72.24    

Interpreting the error code

Here’s a quick reminder on understanding what this error code [BSL-0-783ead44f] actually means. The node reporting the error is ‘0’. So you can either go to node 0 (head node) and cat /var/ibm/bigsql/logs/bigsql.log and search for this tag: BSL-0-783ead44f.
Or you can issue this statement from the head node:

    db2 "select varchar(line,500)  from table (SYSHADOOP.LOG_ENTRY('BSL-0-783ead44f'))";  Which returns a few lines, one of which is:  2017-11-03 15:05:25,967 ERROR com.ibm.biginsights.biga.udf.SqlReturn [Thread-806] : [BSL-0-783ead44f]: Internal error: Load job failed. See hadoop jobid:job_1508965120648_0010 logs for details.       

Navigate to the Map Reduce logs

First determine the web address of the map reduce job history. From the Ambari dashboard, click on Map Reduce, go to the Advanced tab and select the Advanced mapred-site drop down. Note the mapreduce.jobhistory.webapp.address field and paste the contents of this field into a new browser.

Or you can use the quick links to get to the Job History UI.
Job history ui

Click on failed load statement, in the example above 2 load statements had failed but we are drilling down on the job id ending in *10 as indicated from the error code in the example above.
Job History UI
There were 9 map tasks that failed, 1 that was killed and 2 that was successful. Click on the 9 failed map tasks hyperlink.
Map Reduce Log
There will be 9 map tasks with the state set to FAILED here is one of them:
Map Failures
Scrolling to the right I noticed a series of messages related to HDFS errors:
Map Reduce Failure 2
These errors indicate that there was an issue with trying to replicate the data to other nodes.

Checking Cluster Health

Going back to Ambari we can see a number of critical alerts that HDFS has reached its capacity.
Dashboard
And form the main metrics page, 100% of the HDFS disk space has been used:
HDFS usage

Resolution and Conclusion

The resolution to this problem was to free space on HDFS and retry the LOAD HADOOP command. Once I did that then the commands executed successfully. You can use the technique in this blog to debug other map reduce like issues not necessarily from LOAD HADOOP.

[{"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSCRJT","label":"IBM Db2 Big SQL"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

UID

ibm16259787