Use an open-source tool to move large data to cloud object storage
Moving data from Hadoop Distributed File System to IBM Cloud Object Storage with Cloud Object Storage Distributed Copy
There are many occasions that require moving data from Hadoop Distributed File System (HDFS) to cloud object storage. Occasionally, data scientists may decide to archive part or all of their HDFS data to cloud object storage for later use, or to migrate part or all of their data and processing to a different cluster that is already connected to cloud object storage.
Today, hybrid clouds enable data and computations to be partitioned across clouds. Hybrid cloud and private cloud users can archive their data or migrate their work to the public cloud, where cloud object storage is most common. This means users could benefit from having an efficient way to move large data from HDFS to IBM Cloud Object Storage. This is where Cloud Object Storage Distributed Copy (COSDistCp) comes into the picture.
COSDistCp is based on the popular community tool, Distributed Copy (DistCp). The DistCp tool is a traditional Hadoop application designed to work with file system operations, which are different in the object storage world. Because DistCp wasn’t designed to work with object storage, it suffers from performance inefficiencies that are resolved in COSDistCp.
When DistCp copies large numbers of files, it uses Hadoop map jobs to copy files in parallel within the same Hadoop cluster or between clusters. To preserve integrity and correctness, DistCp first writes files with temporary names to the target location. Once all bytes are written, it renames the files to their permanent name. Rename is an atomic operation in a tradition file system. The problem is that object storage does not support a rename operation.
COSDistCp is optimized to work with IBM Cloud Object Storage by excluding the rename operation, saving the extra copy and delete operations–and copying complete files with final names. Moreover, by using modern object storage connectors (such as S3A or Stocator) COSDistCp uses APIs that support MD5 headers–helping ensure data writes a native object storage method to preserve integrity and correctness. Specifically, Stocator connects internally to IBM Cloud Object Storage using the IBM SDK, which generates MD5 hash and adds it to the write request.
We compared the performance of COSDistCp and DistCp by running experiments on a Hadoop cluster installed in a private environment. The cluster consisted of three worker nodes with access to HDFS storage. Each node had 48 cores and 256 GB RAM. We used the default configurations of both tools.
We performed 2 experiments moving files from HDFS to IBM Cloud Object Storage. In the first experiment, we moved 50 GB of data partitioned to 373 files. In the second experiment, we moved 5 GB of data partitioned to 5000 files. In each experiment, we repeated the copy operation 5 times.
In both experiments, the number of write operations for DistCp was about three times the number of write operations for COSDistCp. DistCp’s extra operations are due to its RENAME operation (which equals 1 PUT and 1 DELETE) per file, as illustrated in Figure 1.
Although the number of operations for DistCp was greater, the total copy time was similar for DistCp and COSDistCp in the first experiment. However, in the second experiment, COSDistCp performance was much better than DistCp. Here, the total DistCp copy time was 290% greater than COSDistCp (Figure 2). We believe the time overhead was due to the larger number of files, which caused more operation overhead as illustrated in Figure 1.
Furthermore, when the target is a remote cloud object storage, COSDistCp outperforms DistCp–even when copying just a few file–because of the network overhead related to the extra DistCp operations. In our remote cloud object storage experiment, we observed that DistCp’s total copy time was 134% greater than COSDistCp, as illustrated in Figure 3.
After the initial two comparisons, we experimented with different configuration options for the COSDistCp tool (as illustrated in Figure 4). We ran the tool in a private environment and on the cloud in the IBM Analytics Engine. We also examined different worker node numbers for the Hadoop clusters. The first tool configuration option we checked was the number of mappers (the copy processes, –m option). In most cases, there was no observable performance improvement when the number of mappers exceeded the number of cores in the cluster. Thus, a good rule of thumb is to set the number of mappers to equal the number of cores in the cluster. The second configuration option we checked was the copy strategy (strategy option). Here, we examined both uniform size and dynamic size; the results show no performance differences in these specific experiments.
The following command illustrates the detailed usage definition of COSDistCp for copying data from HDFS to IBM Cloud Object Storage with IAM authentication using the Stocator connector:
hadoop jar <path-to:cos-distcp>/target/cosdistcp-0.8-SNAPSHOT-jar-with-dependencies.jar com.ibm.cosdistcp.COSDistCp -D fs.stocator.scheme.list=cos -D fs.cos.impl=com.ibm.stocator.fs.ObjectStoreFileSystem -D fs.stocator.cos.impl=com.ibm.stocator.fs.cos.COSAPIClient -D fs.stocator.cos.scheme=cos -D fs.cos.<SERVICE_NAME>.endpoint=<endpoint> -D fs.cos.<SERVICE_NAME>.iam.service.id=<ServiceId-...> -D fs.cos.<SERVICE_NAME>.iam.api.key=<apikey> -libjars <path-to:stocator>/target/stocator-1.0.14-IBM-SDK.jar <in_dir> cos://<BUCKET_NAME.SERVICE_NAME/out_dir>
I’d like to thank Gil Vernik for all of his help with the COSDistCp development. For details on the COSDistCp build process and more usage instructions and examples, visit.