The big data revolution has led to an increased number of data processing and analytics applications. These big data analytics ecosystems require a robust, scalable, enterprise-level file system to store huge amounts of data. The default file system used in the Hadoop ecosystem for storing data is the Hadoop Distributed File System (HDFS).

The IBM Spectrum Scale file system offers an enterprise-class alternative to the Hadoop Distributed File System (HDFS) for building big data platforms. IBM Spectrum Scale is a high-performing, POSIX-compliant technology that is used in thousands of mission-critical commercial installations worldwide. The IBM Spectrum Scale file system can be deployed independently or with IBM's big data platform, which consists of IBM BigInsights for Apache Hadoop. IBM Spectrum Scale is now certified with the Hortonworks HDP 2.6 Hadoop distribution as well.

Apache Ambari is an open source tool used for management, provisioning, and monitoring of Hadoop clusters. Apache Ambari has a pluggable architecture in which any service can be added or removed easily. It also provides an easy-to-use Hadoop management web UI backed by its RESTful APIs. Major Hadoop distributions such as Hortonworks and IBM BigInsights use Apache Ambari for creating and managing big data Hadoop clusters.

The Ambari integration package leverages the pluggable architecture of the Ambari server to simplify adding Spectrum Scale as a service to an existing big data cluster. Furthermore, the Ambari service addition wizard makes configuring and installing the Spectrum Scale file system as a service straightforward.

Apache Ambari 2.4.0 and later provide a unified way of adding a custom service to Hadoop clusters using a management pack. This allows the same management pack to be added to different Hadoop distributions that use Ambari 2.4.0 or higher. For example, the same IBM Spectrum Scale management pack can be added to both IBM BigInsights 4.2.5 and the Hortonworks HDP 2.6 distribution.
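As a sketch of what registering a management pack looks like, the Ambari server CLI's `install-mpack` action (available since Ambari 2.4) can be used on the Ambari server host. The mpack path below is a placeholder, not the actual bundle name shipped with the integration package:

```shell
# Illustrative sketch: register a management pack with an Ambari 2.4+ server.
# The mpack path is a placeholder; use the actual bundle you downloaded.
MPACK=/tmp/SpectrumScaleMPack.tar.gz

if command -v ambari-server >/dev/null 2>&1; then
    # install-mpack unpacks the bundle and registers its stack/service definitions
    ambari-server install-mpack --mpack="$MPACK" --verbose
    ambari-server restart   # restart so the new service appears in the Add Service wizard
else
    echo "ambari-server not found; run this on the Ambari server host"
fi
```

After the restart, the new service shows up in the Ambari Add Service wizard like any built-in service.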

IBM BigInsights is an enterprise-ready platform for the Hadoop ecosystem. IBM BigInsights provides Apache Hadoop and its related open source projects as core components, along with several IBM features that deliver enterprise-class capabilities. It also provides a web management console, development tools such as Eclipse plug-ins and a text analytics workbench, analytics accelerators, visualization tools, and connectors to ingest and integrate data from a variety of data sources.

Hortonworks HDP is an enterprise-ready, secure, open source Apache Hadoop distribution based on a centralized architecture (YARN). HDP addresses the complete needs of data at rest, powers real-time customer applications, and delivers robust analytics that accelerate decision making and innovation. With the latest version, HDP 2.6, customers benefit from interactive query in seconds, enhanced data science, enterprise-grade security, and streamlined operations, in the cloud and on-premises, to harvest value from their data faster than previously possible.

The Ambari integration package provides a way of integrating the installation and provisioning of the IBM Spectrum Scale file system into an existing Hadoop cluster.

Upon installation, the transparency daemons replace the HDFS RPC daemons, such as the NameNode and DataNode, redirecting I/O requests to the IBM Spectrum Scale file system instead of HDFS.

The Ambari integration package allows IBM Spectrum Scale to be added as a service on an existing Hadoop cluster using Ambari. Once the IBM Spectrum Scale service is part of the Hadoop cluster, there is the flexibility to integrate and unintegrate it at any time.
After unintegration, I/O requests from the Hadoop clients route back to HDFS. The HDFS service panel reflects the HDFS Transparency daemon status; the transparency NameNode and DataNode daemons seamlessly replace the HDFS daemons.

The NameNode Ambari metrics shown on the HDFS service page are emitted by the transparency NameNode. The file system metadata is stored in the IBM Spectrum Scale file system; therefore, the transparency NameNode is a stateless daemon. Whenever a block request is made to the NameNode, it fetches the block locations from the metadata stored in the IBM Spectrum Scale file system. The transparency NameNode does not create the fsimage or edit logs.

In HDFS, the fsimage stores the NameNode inodes and related information across NameNode shutdowns, while the edit logs track the NameNode's runtime operations; after a crash, the NameNode recovers its state by replaying the edit logs against the fsimage. When IBM Spectrum Scale is integrated, the transparency NameNodes create neither fsimage nor edit logs, so merging them is unnecessary and no Secondary NameNode is needed. In a non-HA setup, the Secondary NameNode component (which merges the edit logs into the fsimage) is therefore removed from the HDFS service panel, and it is added back when the IBM Spectrum Scale service is unintegrated from the Ambari server. The stateless transparency NameNode also helps with disaster recovery, since file system recovery no longer depends on the edit logs and fsimage.

Spectrum Scale benefits over HDFS:

  • In-place data analytics. Spectrum Scale is POSIX-compliant, which supports a wide variety of applications and workloads. With the Spectrum Scale HDFS Transparency Connector, you can analyze file and object data in place, with no data transfer or data movement.
  • Flexible deployment modes. You can run IBM Spectrum Scale not only on commodity storage-rich servers, but also on IBM Elastic Storage Server (ESS) for a higher-performance, massive-scale storage system for your Hadoop workloads. You can even deploy Spectrum Scale on a traditional SAN storage system for HDP.
  • Enterprise-class data management features, accessible through POSIX-compliant APIs or the command line
  • Unified File and Object support (NFS, SMB, Object)
  • FIPS and NIST compliant data encryption
  • Cold data compression
  • Disaster Recovery
  • Snapshot support for point-in-time data captures
  • Policy-based information lifecycle management capabilities to manage PBs of data
  • Mature, enterprise-level data backup and archive solutions (including tape)
  • Remote cluster mounts
  • Seamless secure tiering to Cloud Object stores
  • HDFS Transparency Connector

    The IBM Spectrum Scale HDFS Transparency Connector (part of the IBM Spectrum Scale offering) offers a set of interfaces that allows applications to use the HDFS client to access IBM Spectrum Scale through native HDFS RPC requests. All data transmission and metadata operations in HDFS go through the RPC mechanism and are processed by the NameNode and DataNode services within HDFS. IBM Spectrum Scale HDFS Transparency integrates both the NameNode and the DataNode services and responds to requests from the HDFS client. In other words, the HDFS client can continue to access Spectrum Scale seamlessly, just as it did with HDFS.
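    Because the transparency services answer the same RPCs, ordinary HDFS client commands need no changes. A hedged sketch (the paths here are arbitrary examples, not from the original setup):

```shell
# Unchanged HDFS client usage: the same commands work whether the NameNode is
# the stock HDFS daemon or the Spectrum Scale transparency daemon.
if command -v hdfs >/dev/null 2>&1; then
    hdfs dfs -put /etc/hosts /tmp/hosts.copy   # data lands in the Spectrum Scale file system
    hdfs dfs -ls /tmp/hosts.copy
    hdfs dfs -rm /tmp/hosts.copy
    HDFS_DEMO=ran
else
    echo "hdfs client not found on this host"
    HDFS_DEMO=skipped
fi
```

    The client-side behavior is identical; only the storage backing the daemons differs.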

    Spectrum Scale HDFS Transparency Connector Architecture


    Key advantages of the Spectrum Scale Transparency Connector include:

  • No Spectrum Scale Client is needed on every Hadoop node. HDFS client can access data on Spectrum Scale as it does with HDFS storage.
  • Full Kerberos support for more Hadoop components (e.g., Impala, which calls the HDFS client directly without going through the Hadoop FileSystem interface, distcp, webhdfs)
  • Leverages HDFS client cache
  • HDFS compliant APIs or shell-interface command
  • Application client isolation from storage. Application client may access data in the IBM Spectrum Scale filesystem without having a GPFS client installed.
  • Improved security management by Kerberos authentication and encryption for RPCs
  • Simplified file system monitoring by Hadoop Metrics2 integration
  • Installing the Ambari integration package on an existing Hadoop cluster or a new cluster:

    This section provides an overview of adding Spectrum Scale as a service on an existing Hadoop cluster and shows how easily it can be integrated with the current stack.

    Download the integration package or management pack from the wiki, and set up the GPFS repository containing the transparency connector RPM. The integration package and HDFS Transparency connector can be downloaded from this link (BI 4.2.5 and HDP 2.6).

    For BigInsights 4.2.5 and Hortonworks HDP 2.6, run:

    # ./SpectrumScaleIntegrationPackageInstaller-2.4.2.0.bin

    For BigInsights 4.2.0 or below, use this integration package instead:

    # ./gpfs.hdfs-transparency.ambari-iop_4.2-1.noarch.bin

    Installing this integration package registers IBM Spectrum Scale as a service in the existing Hadoop stack. This allows the Ambari GUI wizard to identify IBM Spectrum Scale as a service available for installation.
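    One quick way to confirm the transparency packages landed on a node is to query the RPM database. A guarded sketch (the package-name pattern is an assumption based on the installer names above):

```shell
# Check for installed GPFS / HDFS Transparency packages on this node.
# The name pattern is an assumption; adjust to match your downloaded RPMs.
if command -v rpm >/dev/null 2>&1; then
    rpm -qa | grep -i 'gpfs\|transparency' || echo "no GPFS packages found"
else
    echo "rpm not available on this host"
fi
PKG_CHECK=done
```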

    Actions panel for adding the service.
    Custom service addition panel in the Ambari server GUI.
    Assignment of GPFS_MASTER on one of the nodes.
    Assignment of GPFS_NODE on all the nodes of the cluster.
    IBM Spectrum Scale Customize Services panel, used for configuring the file system parameters.
    Final review panel of the Spectrum Scale service.
    Installation completion of the Spectrum Scale service in the Ambari server.

    The cluster is now created successfully. This can be verified from the command prompt of one of the nodes by running this command:

    # /usr/lpp/mmfs/bin/mmlscluster

    GPFS cluster information
    ========================
    GPFS cluster name: bigpfs.gpfs.net
    GPFS cluster id: 4605732645497527881
    GPFS UID domain: bigpfs.gpfs.net
    Remote shell command: /usr/bin/ssh
    Remote file copy command: /usr/bin/scp
    Repository type: CCR

    Node Daemon node name IP address Admin node name Designation
    --------------------------------------------------------------------------
    1 c902f10x09.gpfs.net 172.16.1.91 c902f10x13.gpfs.net quorum
    2 c902f10x10.gpfs.net 172.16.1.93 c902f10x14.gpfs.net quorum
    3 c902f10x11.gpfs.net 172.16.1.95 c902f10x15.gpfs.net
    4 c902f10x12.gpfs.net 172.16.1.97 c902f10x16.gpfs.net quorum

    The status of all the nodes can also be verified using this command:

    # /usr/lpp/mmfs/bin/mmgetstate -a

    Node number Node name GPFS state
    ------------------------------------------
    1 c902f10x09 active
    2 c902f10x10 active
    3 c902f10x11 active
    4 c902f10x12 active
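    A simple scripted health check can be built on top of this output, for example counting nodes that are not yet "active". The sketch below embeds the sample output captured above; on a live cluster, replace the heredoc with the real `/usr/lpp/mmfs/bin/mmgetstate -a` invocation:

```shell
# Count GPFS nodes not in the "active" state from mmgetstate -a output.
# Sample output is embedded; on a live cluster use:
#   STATE="$(/usr/lpp/mmfs/bin/mmgetstate -a)"
STATE="$(cat <<'EOF'
Node number Node name GPFS state
------------------------------------------
1 c902f10x09 active
2 c902f10x10 active
3 c902f10x11 active
4 c902f10x12 active
EOF
)"

# Skip the two header lines, then count rows whose state column is not "active"
inactive=$(printf '%s\n' "$STATE" | awk 'NR>2 && $3 != "active" {n++} END {print n+0}')
echo "nodes not active: $inactive"   # prints: nodes not active: 0
```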

    IBM Spectrum Scale Service Panel shown in Ambari Server.

    The Spectrum Scale service is deployed as a two-component service:
    1. GPFS_MASTER
    2. GPFS_NODE

    The GPFS_MASTER component creates the file system and adds the GPFS nodes to it. The GPFS_NODE components are the nodes that mount the file system.
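    These two components map naturally onto Ambari's service definition model, with one MASTER component and many SLAVE components. As an illustration only — the element names follow Ambari's metainfo.xml schema, but the Spectrum Scale package's actual service definition may differ — the shape is roughly:

```xml
<metainfo>
  <schemaVersion>2.0</schemaVersion>
  <services>
    <service>
      <name>GPFS</name>
      <displayName>Spectrum Scale</displayName>
      <components>
        <component>
          <name>GPFS_MASTER</name>
          <category>MASTER</category>
          <cardinality>1</cardinality>      <!-- exactly one master -->
        </component>
        <component>
          <name>GPFS_NODE</name>
          <category>SLAVE</category>
          <cardinality>1+</cardinality>     <!-- one or more nodes -->
        </component>
      </components>
    </service>
  </services>
</metainfo>
```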

    After you have successfully added the IBM Spectrum Scale file system, the HDFS service panel displays the status of the transparency connector daemons, such as the NameNodes and DataNodes.

    HDFS service panel as displayed in the Ambari server after Spectrum Scale integration; the NameNodes and DataNodes are transparency daemons.

    Upgrading the IBM Spectrum Scale service using the Ambari server in BigInsights:

    IBM Spectrum Scale Service can be updated using the Ambari Server GUI as well.

    Separate components can be upgraded independently; the user has to specify the new repository location from which the upgrade will be performed.

    1. Upgrading Spectrum Scale: This option upgrades IBM Spectrum Scale using the latest RPMs provided in the repository.

    2. Upgrading Transparency: This option upgrades the transparency connector RPM on all the GPFS_NODE hosts.

    Spectrum Scale Repo Update configuration
    IBM Spectrum Scale Service Upgrade options.

    Running TeraSort MapReduce Benchmark with IBM Spectrum Scale

    The TeraSort benchmark can be run on the cluster once IBM Spectrum Scale is integrated with the Hadoop cluster. This benchmark exercises the CPU and memory of the cluster. The file size can be varied according to the cluster's available resources.

    # hadoop jar /usr/iop/4.2.5.0-0000/hadoop-mapreduce/hadoop-mapreduce-examples.jar teragen 100000 /tmp/DjTeragen/
    WARNING: Use "yarn jar" to launch YARN applications.
    17/06/27 08:38:10 INFO impl.TimelineClientImpl: Timeline service address: http://c902f10x10.gpfs.net:8188/ws/v1/timeline/
    17/06/27 08:38:10 INFO client.RMProxy: Connecting to ResourceManager at c902f10x10.gpfs.net/172.16.1.93:8050
    17/06/27 08:38:11 INFO terasort.TeraSort: Generating 100000 using 2
    17/06/27 08:38:11 INFO mapreduce.JobSubmitter: number of splits:2
    17/06/27 08:38:11 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1498566741990_0007
    17/06/27 08:38:12 INFO impl.YarnClientImpl: Submitted application application_1498566741990_0007
    17/06/27 08:38:12 INFO mapreduce.Job: The url to track the job: http://c902f10x14.gpfs.net:8088/proxy/application_1498566741990_0007/
    17/06/27 08:38:12 INFO mapreduce.Job: Running job: job_1498566741990_0007
    17/06/27 08:38:17 INFO mapreduce.Job: Job job_1498566741990_0007 running in uber mode : false
    17/06/27 08:38:17 INFO mapreduce.Job: map 0% reduce 0%
    17/06/27 08:38:21 INFO mapreduce.Job: map 50% reduce 0%
    17/06/27 08:38:22 INFO mapreduce.Job: map 100% reduce 0%
    17/06/27 08:38:22 INFO mapreduce.Job: Job job_1498566741990_0007 completed successfully
    17/06/27 08:38:22 INFO mapreduce.Job: Counters: 31
    File System Counters
    FILE: Number of bytes read=0
    FILE: Number of bytes written=265820
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=164
    HDFS: Number of bytes written=10000000
    HDFS: Number of read operations=8
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=4
    Job Counters
    Launched map tasks=2
    Other local map tasks=2
    Total time spent by all maps in occupied slots (ms)=4796
    Total time spent by all reduces in occupied slots (ms)=0
    Total time spent by all map tasks (ms)=4796
    Total vcore-milliseconds taken by all map tasks=4796
    Total megabyte-milliseconds taken by all map tasks=17188864
    Map-Reduce Framework
    Map input records=100000
    Map output records=100000
    Input split bytes=164
    Spilled Records=0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=74
    CPU time spent (ms)=2840
    Physical memory (bytes) snapshot=545300480
    Virtual memory (bytes) snapshot=10188140544
    Total committed heap usage (bytes)=580386816
    org.apache.hadoop.examples.terasort.TeraGen$Counters
    CHECKSUM=214574985129000
    File Input Format Counters
    Bytes Read=0
    File Output Format Counters
    Bytes Written=10000000
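    The counters above are easy to sanity-check: TeraGen writes fixed 100-byte rows (a 10-byte key plus a 90-byte value), so the 100000 rows requested on the command line should account for exactly the 10000000 bytes reported under "HDFS: Number of bytes written":

```shell
# TeraGen rows are a fixed 100 bytes each, so expected output size is rows * 100.
ROWS=100000
EXPECTED_BYTES=$((ROWS * 100))
echo "$EXPECTED_BYTES"   # 10000000, matching the counter above
```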

    The data generated in the TeraGen stage is stored under the IBM Spectrum Scale mount point. This data is then given as input to TeraSort.

    # hadoop jar /usr/iop/4.2.5.0-0000/hadoop-mapreduce/hadoop-mapreduce-examples.jar terasort /tmp/DjTeragen/ /tmp/DjTeraoutput
    WARNING: Use "yarn jar" to launch YARN applications.
    17/06/27 08:45:27 INFO terasort.TeraSort: starting
    17/06/27 08:45:28 INFO input.FileInputFormat: Total input paths to process : 2
    Spent 147ms computing base-splits.
    Spent 2ms computing TeraScheduler splits.
    Computing input splits took 150ms
    Sampling 2 splits of 2
    Making 1 from 100000 sampled records
    Computing parititions took 271ms
    Spent 423ms computing partitions.
    17/06/27 08:45:29 INFO impl.TimelineClientImpl: Timeline service address: http://c902f10x10.gpfs.net:8188/ws/v1/timeline/
    17/06/27 08:45:29 INFO client.RMProxy: Connecting to ResourceManager at c902f10x10.gpfs.net/172.16.1.93:8050
    17/06/27 08:45:29 INFO mapreduce.JobSubmitter: number of splits:2
    17/06/27 08:45:29 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1498566741990_0008
    17/06/27 08:45:30 INFO impl.YarnClientImpl: Submitted application application_1498566741990_0008
    17/06/27 08:45:30 INFO mapreduce.Job: The url to track the job: http://c902f10x14.gpfs.net:8088/proxy/application_1498566741990_0008/
    17/06/27 08:45:30 INFO mapreduce.Job: Running job: job_1498566741990_0008
    17/06/27 08:45:35 INFO mapreduce.Job: Job job_1498566741990_0008 running in uber mode : false
    17/06/27 08:45:35 INFO mapreduce.Job: map 0% reduce 0%
    17/06/27 08:45:40 INFO mapreduce.Job: map 100% reduce 0%
    17/06/27 08:45:45 INFO mapreduce.Job: map 100% reduce 100%
    17/06/27 08:45:45 INFO mapreduce.Job: Job job_1498566741990_0008 completed successfully
    17/06/27 08:45:45 INFO mapreduce.Job: Counters: 50
    File System Counters
    FILE: Number of bytes read=10400006
    FILE: Number of bytes written=21202869
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=10000214
    HDFS: Number of bytes written=10000000
    HDFS: Number of read operations=9
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=2
    Job Counters
    Launched map tasks=2
    Launched reduce tasks=1
    Data-local map tasks=1
    Rack-local map tasks=1
    Total time spent by all maps in occupied slots (ms)=5624
    Total time spent by all reduces in occupied slots (ms)=5330
    Total time spent by all map tasks (ms)=5624
    Total time spent by all reduce tasks (ms)=2665
    Total vcore-milliseconds taken by all map tasks=5624
    Total vcore-milliseconds taken by all reduce tasks=2665
    Total megabyte-milliseconds taken by all map tasks=20156416
    Total megabyte-milliseconds taken by all reduce tasks=19102720
    Map-Reduce Framework
    Map input records=100000
    Map output records=100000
    Map output bytes=10200000
    Map output materialized bytes=10400012
    Input split bytes=214
    Combine input records=0
    Combine output records=0
    Reduce input groups=100000
    Reduce shuffle bytes=10400012
    Reduce input records=100000
    Reduce output records=100000
    Spilled Records=200000
    Shuffled Maps =2
    Failed Shuffles=0
    Merged Map outputs=2
    GC time elapsed (ms)=129
    CPU time spent (ms)=5770
    Physical memory (bytes) snapshot=4995510272
    Virtual memory (bytes) snapshot=18434174976
    Total committed heap usage (bytes)=5077729280
    Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
    File Input Format Counters
    Bytes Read=10000000
    File Output Format Counters
    Bytes Written=10000000
    17/06/27 08:45:45 INFO terasort.TeraSort: done

    Conclusion

    IBM Spectrum Scale provides a robust, reliable, enterprise-level alternative to the HDFS file system used in existing Hadoop clusters. The Ambari integration package and management packs provide a simple, easy-to-use method of adding the Spectrum Scale service to existing Hadoop clusters. The HDFS Transparency connector gives any existing big data application or HDFS client a seamless way to interact with the IBM Spectrum Scale file system, just as it would with HDFS. The Ambari server offers a convenient management console for managing, monitoring, and upgrading the IBM Spectrum Scale service. The Hortonworks HDP 2.6 Hadoop distribution is now also certified to run seamlessly with the IBM Spectrum Scale file system.

    For more detailed instructions, refer to the IBM Knowledge Center (Big data and analytics).
