In this exercise, you'll work directly with Apache Hadoop to perform some basic tasks involving the Hadoop Distributed File System (HDFS) and launching a sample application. All the work you'll perform here involves commands and interfaces provided with Hadoop from http://hadoop.apache.org. As mentioned earlier, Hadoop is part of IBM's InfoSphere BigInsights platform.
Allow 15 minutes to complete this lab module. Prior to this lab, you should have set up a working environment. See Getting Started with Hadoop and BigInsights for details.
Please post questions or comments about this lab to the forum on Hadoop Dev at https://developer.ibm.com/answers?community=hadoop.
2.1. Creating a directory in your distributed file system
__1. Click the BigInsights Shell icon.
__2. Select the Terminal icon to open a terminal window.
__3. Execute the following Hadoop file system command to create a directory in HDFS for your work:
hadoop fs -mkdir /user/biadmin/test
Note that HDFS is distinct from your Unix/Linux local file system directory, and working with HDFS requires using hadoop fs commands.
__1. Using standard Unix/Linux file system commands, list the contents of the /home/biadmin/licenses directory.
Note the BIlicense_en.txt file. It contains license information in English, and it will serve as a sample data file for a future exercise.
__2. Copy the BIlicense_en.txt file into the /user/biadmin/test directory you just created in HDFS.
hadoop fs -put /home/biadmin/licenses/BIlicense_en.txt /user/biadmin/test
__3. List the contents of your target HDFS directory to verify that the file was successfully copied.
hadoop fs -ls /user/biadmin/test
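Because HDFS is separate from the local file system, the listing above confirms that a file with the right name exists, but not that its contents match. One way to check this is to compare a checksum of the local file with a checksum of the bytes streamed back from HDFS. The following is a minimal sketch, not part of the lab materials: the function name verify_hdfs_copy is hypothetical, and it assumes the hadoop command is on your PATH and that the POSIX cksum utility is available.

```shell
# Hypothetical helper (not part of Hadoop): succeeds only if the local file
# and its HDFS copy have identical contents.
# Assumes the hadoop command is on the PATH.
verify_hdfs_copy() {
  local_file=$1
  hdfs_file=$2
  # cksum prints "<crc> <byte count>" when it reads from standard input.
  local_sum=$(cksum < "$local_file")
  # hadoop fs -cat streams the HDFS file to standard output.
  hdfs_sum=$(hadoop fs -cat "$hdfs_file" | cksum)
  [ "$local_sum" = "$hdfs_sum" ]
}

# Example call on the lab VM (paths taken from the steps above):
#   verify_hdfs_copy /home/biadmin/licenses/BIlicense_en.txt \
#     /user/biadmin/test/BIlicense_en.txt && echo "copy verified"
```

The function returns a nonzero status if the two checksums differ, including the case where the HDFS file cannot be read.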
WordCount is one of several sample MapReduce applications provided for Apache Hadoop. Written in Java, it simply scans through input document(s) and, for each word, returns the total number of occurrences found. You can read more about WordCount on the Apache wiki (http://wiki.apache.org/hadoop/WordCount).
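To make the expected result concrete, the sketch below approximates WordCount on a single machine with standard Unix tools. It is only an analogy for what the job computes, not how Hadoop executes it: the function name count_words is hypothetical, and the sketch assumes that words are whitespace-separated tokens and that matching is case-sensitive.

```shell
# Local, single-machine approximation of the result WordCount produces.
# count_words is a hypothetical helper name, not part of Hadoop.
count_words() {
  # Put each whitespace-separated token on its own line, drop empty lines,
  # sort so identical tokens are adjacent, count each run, and print
  # "word<TAB>count".
  tr -s '[:space:]' '\n' | grep -v '^$' | LC_ALL=C sort | uniq -c |
    awk '{ print $2 "\t" $1 }'
}

printf 'the quick fox\nthe lazy dog the end\n' | count_words
# Prints (tab-separated):
#   dog    1
#   end    1
#   fox    1
#   lazy   1
#   quick  1
#   the    3
```

On a small local file, this gives you a quick reference to compare against the job's output.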
Since launching MapReduce applications (or jobs) is a common practice in Hadoop, you'll explore how to do that with WordCount.
__1. Execute the following command to launch the sample WordCount application provided with your Hadoop distribution.
hadoop jar /opt/ibm/biginsights/IHC/hadoop-example.jar wordcount /user/biadmin/test WordCount_output
This command specifies that the wordcount application contained in the specified .jar file is to be launched. The input for this application is in the /user/biadmin/test directory of HDFS. The output of this job will be stored in HDFS in the WordCount_output subdirectory of the user executing this command (biadmin). Thus, the output directory will be /user/biadmin/WordCount_output. This directory will be created automatically as a result of executing this application.
NOTE: If the output folder already exists, or if you try to rerun a successful MapReduce job with the same parameters, you will receive an error message. This is the default behavior of the sample WordCount application.
__2. Inspect the output of your job.
hadoop fs -ls WordCount_output
In this case, the output was small and was written to a single file. If you had run WordCount against a larger volume of data, its output could have been split into multiple files (e.g., part-r-00000, part-r-00001, and so on).
__3. To view the contents of the part-r-00000 file, issue this command:
hadoop fs -cat WordCount_output/*00
Partial output is shown here:
__4. Optionally, inspect details about your job. Open a Web browser, or click on the web console icon on your desktop and open a new tab. Access the URL for Hadoop's Job Tracker (http://bivm.ibm.com:50030/jobtracker.jsp). Scroll to the Completed Jobs section to locate the Job ID associated with the Word Count application. Click on the Job ID link to review details, such as the number of Map and Reduce tasks launched for your application, the number of bytes read and written, etc. Partial output is shown in the second image that follows.
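If you want to rerun the WordCount job from step 1, the earlier note about existing output folders applies: the WordCount_output directory must be removed first. The following is a minimal sketch of one way to do that. The function name remove_output_dir is hypothetical, and the sketch assumes a Hadoop 1.x command set, where hadoop fs -rmr performs a recursive delete. Later Hadoop releases deprecate -rmr in favor of hadoop fs -rm -r, so check which form your installation expects.

```shell
# Hypothetical helper (not part of Hadoop): delete an HDFS output directory
# if it exists, so that a job can be rerun with the same output path.
# Assumes the hadoop command is on the PATH.
remove_output_dir() {
  out_dir=$1
  # hadoop fs -test -e exits with status 0 when the path exists.
  if hadoop fs -test -e "$out_dir"; then
    # Assumption: Hadoop 1.x syntax. On later releases, use
    # "hadoop fs -rm -r" instead of "hadoop fs -rmr".
    hadoop fs -rmr "$out_dir"
  fi
}

# Example call on the lab VM, before relaunching the job:
#   remove_output_dir WordCount_output
```

The delete is recursive, so confirm the path before calling the function.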
To find the other tutorials in this series, go to the Overview tutorial.