Stocator is a storage connector that links Apache Spark with IBM Cloud Object Storage or any other object store based on the OpenStack Swift API. This post demonstrates how to quickly set up a development environment with Stocator and IBM Cloud Object Storage. We’ll then show you how to use Stocator with Apache Spark.

Software requirements

Before you can run Stocator with Apache Spark and IBM Cloud Object Storage, you need to set up Eclipse IDE, Apache Spark, an Object Storage account, and Stocator itself.

Development environment

If you’re not familiar with Eclipse IDE, the setup procedures are standard. Just go to https://eclipse.org and download the version suited for your operating system. You also need:

  • Java 1.7 or higher
  • Apache Maven

Apache Spark

You can obtain the Spark release as a pre-built package. Navigate to http://spark.apache.org/downloads.html and download the latest Apache Spark release pre-built for Hadoop.

If you wish to use the master branch of Apache Spark, follow the instructions at http://spark.apache.org/developer-tools.html.

Object Storage

If you don’t have an existing IBM Cloud Object Storage account, the easiest way to get one is to go to the IBM Cloud Object Storage page and create a free-of-charge Lite plan. There is no charge for small accounts, and this will be enough for our needs.

Stocator

Navigate to the Stocator branches page and find the latest X.Y.Z-ibm-sdk branch, which is based on the IBM Cloud Object Storage Java SDK, where X.Y.Z is the most recent Stocator release version. For example, if the current Stocator release is 1.0.22, then the 1.0.22-ibm-sdk branch is the one based on the IBM COS Java SDK. More information on branches can be found on the Stocator project page.

git clone https://github.com/SparkTC/stocator
cd stocator
git fetch
git checkout -b 1.0.22-ibm-sdk origin/1.0.22-ibm-sdk
mvn clean install -DskipTests

This will generate “target/stocator-1.0.22-IBM-SDK.jar”.

Create the Eclipse project

cd stocator
mvn eclipse:eclipse

Import the Stocator project into your Eclipse workspace: open Eclipse, navigate to File > Import > General > Existing Projects into Workspace, and select the Stocator folder.

Configure Stocator in Apache Spark

Next, we’ll configure Spark to use Stocator.

  1. Edit spark/conf/core-site.xml:
    <property>
    <name>fs.cos.impl</name>
    <value>com.ibm.stocator.fs.ObjectStoreFileSystem</value>
    </property>
    <property>
    <name>fs.stocator.cos.impl</name>
    <value>com.ibm.stocator.fs.cos.COSAPIClient</value>
    </property>
  2. Make sure the Stocator jar is on the class path of Spark and deployed on all the nodes. Then update spark/conf/spark-defaults.conf:

    spark.driver.extraClassPath = /stocator-1.0.22-IBM-SDK.jar
    spark.executor.extraClassPath = /stocator-1.0.22-IBM-SDK.jar
  3. To show more log messages for Stocator, edit spark/conf/log4j.properties and add this line to the end:
    log4j.logger.com.ibm.stocator=DEBUG
  4. The last step is to configure Stocator with your IBM Cloud Object Storage account. Open the IBM Cloud dashboard, navigate to Object Storage, and open the credentials section. You should see credentials of the form:
    {
      "apikey": "123",
      "endpoints": "https://cos-service.bluemix.net/endpoints",
      "iam_apikey_description": "Auto generated apikey during resource-key operation for Instance - abc",
      "iam_apikey_name": "auto-generated-apikey-123",
      "iam_role_crn": "role",
      "iam_serviceid_crn": "identity-123::serviceid:ServiceId-XYZ",
      "resource_instance_id": "abc"
    }
  5. Edit spark/conf/core-site.xml and add the following properties, taking the API key and service ID from the credentials above:

    <property>
    <name>fs.cos.mycos.iam.api.key</name>
    <value>123</value>
    </property>
    <property>
    <!-- Open https://cos-service.bluemix.net/endpoints and choose the relevant endpoint. You can also obtain this value from the “location” section of the IBM Cloud dashboard. -->
    <name>fs.cos.mycos.endpoint</name>
    <value>https://s3-api.us-geo.objectstorage.softlayer.net</value>
    </property>
    <property>
    <name>fs.cos.mycos.iam.service.id</name>
    <value>ServiceId-XYZ</value>
    </property>
  6. And you’re done!
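
If you would rather not edit XML by hand, the same settings can be assembled in code. The sketch below is only an illustration: the names CosCredentials and stocatorProperties are hypothetical, and it assumes the service ID is the last colon-separated segment of iam_serviceid_crn, as the sample values in steps 4 and 5 suggest.

```scala
// Hypothetical container for the two credential fields used in step 5.
case class CosCredentials(apiKey: String, serviceIdCrn: String)

// Builds the Hadoop properties from steps 1 and 5 for one service name.
// `service` is the name used in cos:// URIs, for example "mycos".
def stocatorProperties(
    service: String,
    endpoint: String,
    creds: CosCredentials): Map[String, String] = {
  // Assumption: the service ID is the text after the last ':' in the CRN.
  val crn = creds.serviceIdCrn
  val serviceId = crn.substring(crn.lastIndexOf(':') + 1)
  Map(
    "fs.cos.impl" -> "com.ibm.stocator.fs.ObjectStoreFileSystem",
    "fs.stocator.cos.impl" -> "com.ibm.stocator.fs.cos.COSAPIClient",
    s"fs.cos.$service.iam.api.key" -> creds.apiKey,
    s"fs.cos.$service.endpoint" -> endpoint,
    s"fs.cos.$service.iam.service.id" -> serviceId
  )
}

// Placeholder values copied from the sample credentials in this post.
val props = stocatorProperties(
  "mycos",
  "https://s3-api.us-geo.objectstorage.softlayer.net",
  CosCredentials("123", "identity-123::serviceid:ServiceId-XYZ"))

// Print the key/value pairs in a stable order.
props.toSeq.sortBy(_._1).foreach { case (k, v) => println(s"$k=$v") }
```

In spark-shell, where sc is already defined, these pairs could be applied with props.foreach { case (k, v) => sc.hadoopConfiguration.set(k, v) } before the first cos:// access.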

You are now ready to run Apache Spark.

To make sure everything is working, create a small array and persist it as an object:

val data = Array(1, 2, 3, 4)
val distData = sc.parallelize(data)
distData.saveAsTextFile("cos://mybucket.mycos/data.txt")
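
The URI above follows the pattern cos://<bucket>.<service>/<object>, where the service name (mycos here) is the same one used in the fs.cos.mycos.* property names. A small sketch that makes the pattern explicit; cosUri is a hypothetical helper, not part of Stocator:

```scala
// Hypothetical helper: builds a URI of the form cos://<bucket>.<service>/<path>.
def cosUri(bucket: String, service: String, path: String): String =
  s"cos://$bucket.$service/${path.stripPrefix("/")}"

val uri = cosUri("mybucket", "mycos", "data.txt")
println(uri) // prints cos://mybucket.mycos/data.txt
```

In spark-shell, sc.textFile(uri).collect() should then read back the four values saved earlier, as strings.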

Make sure you explore the logs to see how Apache Spark really works with object stores: you’ll see the magic that happens in Stocator. And if you’re ready to improve on it and contribute your updates, go for it!
