Troubleshooting Map Reduce Errors

The Big SQL LOAD HADOOP command uses a Map Reduce framework, for database administrators who are not so familiar with combing through map reduce logs, this blog hopes to shed some light on how to troubleshoot such issues. In this blog I will focus on one error I experienced on my environment and the steps I took to debug the issue. Most likely you won’t hit this exact issue but you can use these steps to figure out the root cause of your problem.

Example Error

    load hadoop using file url '/user/bigsql/hadoopds1000g/store_sales' with source properties ('field.delimiter'='|', 'ignore.extra.fields'='true') into table store_sales overwrite;  SQL Exception(s) Encountered:  [State: 58040][Code: -5105]: The statement failed because a Big SQL component encountered an error. Component receiving the error: "DDL FMP". Component returning the error: "DDL FMP". Log entry identifier: "[BSL-0-783ead44f]".. SQLCODE=-5105, SQLSTATE=58040, DRIVER=3.72.24

Interpreting the error code

Here’s a quick reminder on understanding what this error code [BSL-0-783ead44f] actually means. The node reporting the error is ‘0’. So you can either go to node 0 (head node) and cat /var/ibm/bigsql/logs/bigsql.log and search for this tag: BSL-0-783ead44f.
Or you can issue this statement from the head node:

    db2 "select varchar(line,500)  from table (SYSHADOOP.LOG_ENTRY('BSL-0-783ead44f'))";  Which returns a few lines, one of which is:  2017-11-03 15:05:25,967 ERROR com.ibm.biginsights.biga.udf.SqlReturn [Thread-806] : [BSL-0-783ead44f]: Internal error: Load job failed. See hadoop jobid:job_1508965120648_0010 logs for details.

Navigate to the Map Reduce logs

First determine the web address of the map reduce job history. From the Ambari dashboard, click on Map Reduce, go to the Advanced tab and select the Advanced mapred-site drop down. Note the mapreduce.jobhistory.webapp.address field and paste the contents of this field into a new browser.

Or you can use the quick links to get to the Job History UI.

Click on failed load statement, in the example above 2 load statements had failed but we are drilling down on the job id ending in *10 as indicated from the error code in the example above.

There were 9 map tasks that failed, 1 that was killed and 2 that was successful. Click on the 9 failed map tasks hyperlink.

There will be 9 map tasks with the state set to FAILED here is one of them:

Scrolling to the right I noticed a series of messages related to HDFS errors:

These errors indicate that there was an issue with trying to replicate the data to other nodes.

Checking Cluster Health

Going back to Ambari we can see a number of critical alerts that HDFS has reached its capacity.

And form the main metrics page, 100% of the HDFS disk space has been used:

Resolution and Conclusion

The resolution to this problem was to free space on HDFS and retry the LOAD HADOOP command. Once I did that then the commands executed successfully. You can use the technique in this blog to debug other map reduce like issues not necessarily from LOAD HADOOP.

Tips

Troubleshooting Map Reduce Errors - Hadoop Dev

Technical Blog Post

Abstract

Body

Example Error

Interpreting the error code

Navigate to the Map Reduce logs

Checking Cluster Health

Resolution and Conclusion

UID

Share your feedback

Need support?