Apache Sqoop is a tool that is designed to efficiently transfer large amounts of data between Apache Hadoop and structured data stores such as relational databases. Sqoop provides an incremental import tool that you can use to retrieve only those rows that are newer than a previously imported set of rows. The Sqoop merge tool then enables you to combine two data sets, whereby entries in one data set overwrite entries in an older data set.

Let’s examine a typical scenario in detail. Consider a remote RDBMS (DB2, for example) with a STUDENT table that contains basic information about students. The STUDENT table includes a column named ID (INTEGER) and a column named TIME (DATE).

You can use the Sqoop incremental import tool to update the latest student information from DB2 to Hive, as shown in the following example:

sqoop import --incremental lastmodified --check-column TIME --last-value 2017-02-08...

The incremental import operation runs based on values in the TIME column and imports records from “2017-02-08” up to the time that the job starts. Corresponding information in the Sqoop log might look like the following example:

17/02/08 22:44:47 INFO tool.ImportTool: Incremental import based on column TIME
17/02/08 22:44:47 INFO tool.ImportTool: Lower bound value: '2017-02-08'
17/02/08 22:44:47 INFO tool.ImportTool: Upper bound value: '2017-02-09 22:44:47.564217'
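Conceptually, a lastmodified incremental import keeps only the rows whose check column falls between the saved last-value (the lower bound) and the time the job starts (the upper bound). The following Python sketch illustrates that filter with made-up STUDENT rows; it is an illustration of the semantics, not Sqoop's actual implementation:

```python
from datetime import datetime

# Hypothetical rows from the STUDENT table: (ID, TIME)
rows = [
    (1, datetime(2017, 2, 6)),
    (2, datetime(2017, 2, 8)),
    (3, datetime(2017, 2, 9)),
]

def incremental_rows(rows, last_value, upper_bound):
    """Keep rows whose check column (TIME) satisfies
    last_value <= TIME < upper_bound."""
    return [r for r in rows if last_value <= r[1] < upper_bound]

lower = datetime(2017, 2, 8)              # --last-value 2017-02-08
upper = datetime(2017, 2, 9, 22, 44, 47)  # upper bound taken at job start
new_rows = incremental_rows(rows, lower, upper)
print(new_rows)  # rows with ID 2 and 3 only
```

Rows older than the last-value (ID 1 in this sketch) are skipped, which is why repeated daily runs import only the new or updated records.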

The Hive STUDENT table is updated with the latest records, and a new file is created under the table directory in HDFS:

[ambari-qa@mauve1 ~]$  hadoop fs -ls /apps/hive/warehouse/student
Found 2 items
-rwxrwx---   1 ambari-qa hadoop         26 2017-02-08 23:13 /apps/hive/warehouse/student/part-m-00000
-rwxrwx---   1 ambari-qa hadoop         26 2017-02-09 21:51 /apps/hive/warehouse/student/part-m-00000_copy_1
The second file (part-m-00000_copy_1) contains the new records from 2017-02-08 to 2017-02-09.

Because the Hive table directory can grow significantly in size with daily incremental import jobs, it is a good idea to use the Sqoop merge tool to generate a more streamlined file.

Let’s take a look at another example. You can use the codegen command to generate code that interacts with database records. For example:

sqoop codegen --connect  --username  --password 
  --table STUDENT --outdir /tmp/sqoop --fields-terminated-by '\t'
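The generated class knows how to parse and serialize one STUDENT record using the field delimiter that you specify. As a rough illustration of what the generated (Java) record class provides, here is a hypothetical Python equivalent for the STUDENT table with tab-separated fields:

```python
class StudentRecord:
    """Hypothetical stand-in for the record class that Sqoop codegen
    emits: it parses and formats one STUDENT row."""
    DELIMITER = "\t"  # matches --fields-terminated-by '\t'

    def __init__(self, id, time):
        self.id = int(id)
        self.time = time

    @classmethod
    def parse(cls, line):
        """Build a record from one delimited line of a part file."""
        id, time = line.rstrip("\n").split(cls.DELIMITER)
        return cls(id, time)

    def to_line(self):
        """Serialize the record back to a delimited line."""
        return self.DELIMITER.join([str(self.id), self.time])

rec = StudentRecord.parse("42\t2017-02-08\n")
```

The merge tool needs this generated class (packaged in a jar) so that it can interpret the records in both data sets.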

You can then use the Sqoop merge tool to “flatten” two data sets into one, as shown in the following example:

sqoop merge --new-data /apps/hive/warehouse/student/part-m-00000
  --onto /apps/hive/warehouse/student/part-m-00000_copy_1
  --target-dir /tmp/sqoop_merge
  --jar-file /tmp/sqoop-ambari-qa/compile/9062c87c959e4090dcec5995a439b514/TIME.jar
  --class-name TIME
  --merge-key TIME
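During the merge, records in the two data sets are keyed by the --merge-key column, and when both data sets contain the same key, the entry from the newer data set wins. The following Python sketch shows that behavior with made-up records keyed by ID (the real tool runs as a MapReduce job over the generated record class):

```python
# Hypothetical records keyed by ID: an older import and a newer one
old_data = {1: ("Alice", "2017-02-07"), 2: ("Bob", "2017-02-07")}
new_data = {2: ("Bob", "2017-02-09"), 3: ("Carol", "2017-02-09")}

def merge(onto, new):
    """Entries in the new data set overwrite entries with the same key
    in the older data set, mirroring sqoop merge --merge-key."""
    merged = dict(onto)   # start from the --onto (older) data set
    merged.update(new)    # --new-data wins on key collisions
    return merged

result = merge(old_data, new_data)
```

The merged output contains each key exactly once, with the most recent version of each record, which is what flattens the accumulated part files into a single streamlined data set.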

You can also use the merge tool to extract a specific subset of the data in HDFS. The command itself is the same; the subset is determined by which part files you pass as input. For example, to extract the first two months of student data, you could merge only the part files that cover that period, using a command that is similar to the following example:

sqoop merge --new-data /apps/hive/warehouse/student/part-m-00000_copy_1
  --onto /apps/hive/warehouse/student/part-m-00000_copy_2
  --target-dir /tmp/student_first_two_month
  --jar-file /tmp/sqoop-ambari-qa/compile/9062c87c959e4090dcec5995a439b514/TIME.jar
  --class-name TIME
  --merge-key TIME

After the merge operation completes, you could import the data back into a Hive or HBase data store.

For more information about Sqoop incremental options, see Incremental Imports.
