Lab 2: Issuing basic Hadoop commands

In this exercise, you’ll work directly with Apache Hadoop to perform some basic tasks involving the Hadoop Distributed File System (HDFS) and launching a sample application. All the work you’ll perform here involves commands and interfaces provided with Hadoop from http://hadoop.apache.org. As mentioned earlier, Hadoop is part of IBM’s InfoSphere BigInsights platform.

Allow 15 minutes to complete this lab module. Prior to this lab, you should have set up a working environment. See Getting Started with Hadoop and BigInsights for details.

Please post questions or comments about this lab to the forum on Hadoop Dev at https://developer.ibm.com/answers?community=hadoop.

2.1. Creating a directory in your distributed file system

__1. Click the BigInsights Shell icon.

image001

__2. Select the Terminal icon to open a terminal window.

image002

__3. Execute the following Hadoop file system command to create a directory in HDFS for your work:

hadoop fs -mkdir /user/biadmin/test

Note that HDFS is distinct from your Unix/Linux local file system directory, and working with HDFS requires using hadoop fs commands.
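To see this separation for yourself, you can check the same path in both file systems. The following is a minimal sketch, not part of the lab: the `where_is` helper is a hypothetical name, and it assumes that the `hadoop` command is on your PATH and that `hadoop fs -test -e` exits with status 0 when the path exists in HDFS.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop): report where a path exists.
# Assumption: `hadoop fs -test -e PATH` exits with status 0 when PATH exists in HDFS.
where_is() {
    # Check the local Unix/Linux file system with the standard test command.
    if [ -e "$1" ]; then echo "local: yes"; else echo "local: no"; fi
    # Check HDFS, which is only reachable through hadoop fs commands.
    if hadoop fs -test -e "$1"; then echo "HDFS: yes"; else echo "HDFS: no"; fi
}
```

Calling `where_is /user/biadmin/test` after the previous step should report the directory in HDFS only, unless a directory with the same name also happens to exist on your local disk.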

2.2. Copying data into HDFS

__1. Using standard Unix/Linux file system commands, list the contents of the /home/biadmin/licenses directory.

ls /home/biadmin/licenses

image003

Note the BIlicense_en.txt file. It contains license information in English, and it will serve as a sample data file for a future exercise.

__2. Copy the BIlicense_en.txt file into the /user/biadmin/test directory you just created in HDFS.

hadoop fs -put /home/biadmin/licenses/BIlicense_en.txt /user/biadmin/test

__3. List the contents of your target HDFS directory to verify that the file was successfully copied.

hadoop fs -ls /user/biadmin/test

image004
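Beyond listing the directory, one way to gain confidence in the copy is to compare byte counts. This is a rough sketch under stated assumptions: `same_size` is a hypothetical helper name, and the check only compares sizes, so it does not prove the contents are identical.

```shell
#!/bin/sh
# Hypothetical helper: succeed when a local file and an HDFS file
# contain the same number of bytes.
#   $1 = path in the local file system
#   $2 = path in HDFS
same_size() {
    # Count bytes of the local file; tr strips padding some wc versions add.
    local_bytes=$(wc -c < "$1" | tr -d ' ')
    # Stream the HDFS file to stdout and count its bytes the same way.
    hdfs_bytes=$(hadoop fs -cat "$2" | wc -c | tr -d ' ')
    [ "$local_bytes" -eq "$hdfs_bytes" ]
}
```

For this lab, the call would look like `same_size /home/biadmin/licenses/BIlicense_en.txt /user/biadmin/test/BIlicense_en.txt && echo "sizes match"`.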

2.3. Running a sample MapReduce application

WordCount is one of several sample MapReduce applications provided for Apache Hadoop. Written in Java, it simply scans through input document(s) and, for each word, returns the total number of occurrences found. You can read more about WordCount on the Apache wiki (http://wiki.apache.org/hadoop/WordCount).
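To make the idea concrete, here is a single-machine analog built from standard Unix tools. It is only an illustration of what the job computes, not the Hadoop implementation: `local_wordcount` is a hypothetical name, and the sketch assumes words are separated by whitespace and counted case-sensitively.

```shell
#!/bin/sh
# Local analog of WordCount (illustration only, not the Hadoop sample).
# Reads text on stdin; prints one "word<TAB>count" line per distinct word,
# sorted by word.
local_wordcount() {
    tr -s '[:space:]' '\n' |        # put each whitespace-separated word on its own line
        grep -v '^$' |              # drop the empty line that leading whitespace produces
        LC_ALL=C sort |             # bring identical words next to each other
        uniq -c |                   # prefix each distinct word with its count
        awk '{ print $2 "\t" $1 }'  # reorder the fields to word<TAB>count
}
```

For example, `local_wordcount < /home/biadmin/licenses/BIlicense_en.txt` would count the words in the sample file locally. The sort step plays a role loosely similar to the grouping that MapReduce performs between the map and reduce phases.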

Since launching MapReduce applications (or jobs) is a common practice in Hadoop, you’ll explore how to do that with WordCount.

__1. Execute the following command to launch the sample WordCount application provided with your Hadoop distribution.

hadoop jar /opt/ibm/biginsights/IHC/hadoop-example.jar wordcount /user/biadmin/test WordCount_output

This command specifies that the wordcount application contained in the specified .jar file is to be launched. The input for this application is in the /user/biadmin/test directory of HDFS. The output of this job will be stored in HDFS in the WordCount_output subdirectory of the user executing this command (biadmin). Thus, the output directory will be /user/biadmin/WordCount_output. This directory will be created automatically as a result of executing this application.

image008

NOTE: If the output folder already exists, or if you try to rerun a successful MapReduce job with the same parameters, you will receive an error message. This is the default behavior of the sample WordCount application.

image006
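If you want to rerun the job with the same output location, the existing output directory has to be removed first. The following is a sketch under stated assumptions: `rerun_wordcount` is a hypothetical wrapper, and it uses `hadoop fs -rmr`, the recursive delete available in older Hadoop releases; newer releases prefer `hadoop fs -rm -r`. Be careful, because this permanently deletes earlier results.

```shell
#!/bin/sh
# Hypothetical wrapper: remove a previous output directory, then launch WordCount.
#   $1 = HDFS input directory
#   $2 = HDFS output directory (deleted first if it already exists)
rerun_wordcount() {
    input_dir=$1
    output_dir=$2
    # Assumption: `-test -e` exits with status 0 when the path exists in HDFS.
    if hadoop fs -test -e "$output_dir"; then
        # Assumption: `-rmr` is available; on newer releases use `-rm -r` instead.
        hadoop fs -rmr "$output_dir"
    fi
    # Same jar and application name as the command used in this lab.
    hadoop jar /opt/ibm/biginsights/IHC/hadoop-example.jar wordcount "$input_dir" "$output_dir"
}
```

In this lab, the call would be `rerun_wordcount /user/biadmin/test WordCount_output`.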

__2. Inspect the output of your job.

hadoop fs -ls WordCount_output

image007

In this case, the output was small and was written to a single file. If you had run WordCount against a larger volume of data, its output would have been split into multiple files (e.g., part-r-00000, part-r-00001, and so on).

__3. To view the contents of the part-r-00000 file, issue this command:

hadoop fs -cat WordCount_output/*00

Partial output is shown here:

image008
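Because the job output is plain text, you can pipe it into standard Unix tools for a quick look at the results. The sketch below lists the most frequent words; `top_words` is a hypothetical helper, and it assumes each output line has the form word, tab, count.

```shell
#!/bin/sh
# Hypothetical helper: read WordCount output ("word<TAB>count") on stdin
# and print the N lines with the highest counts (N defaults to 10).
top_words() {
    # Sort numerically, descending, on the second tab-separated field,
    # then keep only the first N lines.
    LC_ALL=C sort -t "$(printf '\t')" -k2,2nr | head -n "${1:-10}"
}
```

A possible call is `hadoop fs -cat 'WordCount_output/part-r-*' | top_words 5`, which would also cover the case where the output is split across several part files.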

__4. Optionally, inspect details about your job. Open a Web browser, or click the Web console icon on your desktop and open a new tab. Access the URL for Hadoop’s JobTracker (http://bivm.ibm.com:50030/jobtracker.jsp). Scroll to the Completed Jobs section to locate the Job ID associated with the WordCount application. Click the Job ID link to review details, such as the number of Map and Reduce tasks launched for your application and the number of bytes read and written. Partial output is shown in the second image that follows.

image009

image010

To find the other tutorials in this series, go to the Overview tutorial.
