This article is co-authored by Chris Snow (chris.snow@uk.ibm.com) and Pierre Regazzoni (pierrer@us.ibm.com).

IBM® BigInsights™ on Cloud provides Hadoop-as-a-service on IBM’s SoftLayer® global cloud infrastructure. It supports a large variety of ways that you can process your data and integrate with other services.

So how do you get started with a cloud data service such as IBM® BigInsights™ on Cloud? You might be asking yourself:

  • “How do I programmatically perform action X on my data on service Y?”
  • “How do I programmatically move data between service Y and service Z?”

These questions usually need to be addressed early and quickly in the project’s life-cycle. As such, they are usually addressed during sprint zero to create the basic skeleton and plumbing for the project so that future sprints can truly add incremental value in an efficient way [1].

Starting from scratch, each question can easily take anywhere from a few hours to a few days to answer by the time you have researched the different options and developed some skeleton code.

In this blog post, we would like to introduce you to an open source project [2] that provides working examples of over 30 actions and integrations (and growing). You can run these examples against your own IBM® BigInsights™ cluster and the services you wish to integrate with. All you need to do is provide the connection details for your IBM® BigInsights™ cluster and the service(s) you wish to connect to, then run a single command to see the example execute against your environment. It is possible to set up and run the examples in under five minutes!

The current examples are listed below:

HDFS (Using Knox API – WebHDFS)

    List folder contents using Groovy
    Create a folder using Groovy
    Upload a file using Groovy
    List folder contents using cURL
    Create a folder using cURL
    Upload a file using cURL
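
To give a flavour of what these examples look like, here is a minimal Groovy sketch that lists a folder through the Knox WebHDFS API. The gateway URL, topology name ('default'), and credentials are placeholders, and it assumes the gateway's SSL certificate is already trusted by your JVM:

    import groovy.json.JsonSlurper

    // Placeholder values - replace with your own cluster details
    def gateway  = 'https://your-cluster-host:8443/gateway/default'
    def username = 'your-username'
    def password = 'your-password'

    // WebHDFS LISTSTATUS returns the contents of a directory as JSON
    def conn = new URL("${gateway}/webhdfs/v1/tmp?op=LISTSTATUS").openConnection()
    def auth = "${username}:${password}".toString().bytes.encodeBase64().toString()
    conn.setRequestProperty('Authorization', "Basic ${auth}")

    def listing = new JsonSlurper().parseText(conn.inputStream.text)
    listing.FileStatuses.FileStatus.each { println it.pathSuffix }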

Ambari

    Get cluster name and then services installed on cluster
    Perform HDFS Service Check via Ambari REST
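
For example, a minimal Groovy sketch that asks Ambari for the cluster name and then lists the services installed on it might look like the following (the host, port, and credentials are placeholders; the Ambari port can differ between clusters):

    import groovy.json.JsonSlurper

    // Placeholder values - replace with your own Ambari details
    def ambari = 'https://your-cluster-host:9443'
    def auth   = 'Basic ' + 'your-username:your-password'.bytes.encodeBase64().toString()

    // Small helper that performs an authenticated GET and parses the JSON response
    def get = { path ->
        def conn = new URL("${ambari}${path}").openConnection()
        conn.setRequestProperty('Authorization', auth)
        new JsonSlurper().parseText(conn.inputStream.text)
    }

    // First fetch the cluster name, then the services installed on that cluster
    def clusterName = get('/api/v1/clusters').items[0].Clusters.cluster_name
    get("/api/v1/clusters/${clusterName}/services").items.each {
        println it.ServiceInfo.service_name
    }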

BigR

    Connect to BigR

BigSQL

    Connect to Big SQL from Groovy
    Insert/Select with Big SQL from Groovy
    Load/Select with Big SQL from Groovy
    Connect to Big SQL from Java
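
Connecting to Big SQL from Groovy is plain JDBC with the DB2 driver. A minimal sketch follows; the host, port (which varies by BigInsights version), database name, and credentials are placeholders, and db2jcc.jar must be on the classpath (the project's DownloadLibs task fetches it from the cluster):

    import groovy.sql.Sql

    // Placeholder connection details - check the Big SQL port for your cluster
    def db = Sql.newInstance('jdbc:db2://your-cluster-host:51000/bigsql',
                             'your-username', 'your-password',
                             'com.ibm.db2.jcc.DB2Driver')

    // List a few tables from the catalog to prove the connection works
    db.eachRow('SELECT TABSCHEMA, TABNAME FROM SYSCAT.TABLES FETCH FIRST 5 ROWS ONLY') {
        println "${it.TABSCHEMA}.${it.TABNAME}"
    }
    db.close()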

Hive

    Connect to Hive from Groovy
    Connect to Hive from Java
    Start a Hive Beeline Session
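
Connecting to Hive from Groovy follows the same JDBC pattern, just with the Hive driver on the classpath instead. In this sketch the host, port, and the ;ssl=true flag are assumptions; adjust them to match your cluster:

    import groovy.sql.Sql

    // Placeholder details - hive-jdbc and its dependencies must be on the classpath
    def hive = Sql.newInstance('jdbc:hive2://your-cluster-host:10000/default;ssl=true',
                               'your-username', 'your-password',
                               'org.apache.hive.jdbc.HiveDriver')
    hive.eachRow('SHOW TABLES') { println it }
    hive.close()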

Spark (run inside an SSH session on the BigInsights cluster)

    Submit a Spark Python job
    Submit a Spark Scala job

Spark Streaming (run inside an SSH session on the BigInsights cluster)

    Submit a Spark Streaming Python job

Oozie (Using Knox API)

    Submit a Java MapReduce job using Groovy
    Submit a Java MapReduce job using cURL
    Submit a Java Spark job using Groovy

HBase

    Connect to HBase using Groovy
    Manipulate Schema and Perform CRUD Operations using Groovy
    Connect to HBase using Java
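
The REST route into HBase also goes through the Knox gateway. The sketch below queries the HBase REST server version; the /hbase path and topology name are assumptions based on standard Knox service mappings, and the usual placeholder caveats apply:

    // Placeholder values - assumes the Knox certificate is trusted by the JVM
    def gateway = 'https://your-cluster-host:8443/gateway/default'
    def conn = new URL("${gateway}/hbase/version").openConnection()
    conn.setRequestProperty('Authorization',
        'Basic ' + 'your-username:your-password'.bytes.encodeBase64().toString())
    conn.setRequestProperty('Accept', 'application/json')
    println conn.inputStream.text   // REST server version information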

WebHCat/Templeton (Using Knox API)

    Execute a MapReduce Job using Groovy
    Execute a Pig Job using Groovy
    Execute a Hive Job using Groovy
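
A quick way to check that WebHCat is reachable before submitting jobs is its status endpoint, which Knox exposes under /templeton/v1. A minimal Groovy sketch (placeholder host and credentials):

    // Placeholder values - assumes the Knox certificate is trusted by the JVM
    def gateway = 'https://your-cluster-host:8443/gateway/default'
    def conn = new URL("${gateway}/templeton/v1/status").openConnection()
    conn.setRequestProperty('Authorization',
        'Basic ' + 'your-username:your-password'.bytes.encodeBase64().toString())
    println conn.inputStream.text   // expect something like {"status":"ok","version":"v1"}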

Knox

    Run a Knox shell client session
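
The Knox shell client is itself Groovy based: you start a session with java -jar bin/shell.jar and evaluate scripts against the gateway. The sketch below lists an HDFS directory using the client DSL; the class names follow the Knox 0.x client libraries, and the gateway URL and credentials are placeholders:

    import org.apache.hadoop.gateway.shell.Hadoop
    import org.apache.hadoop.gateway.shell.hdfs.Hdfs

    // Placeholder gateway URL and credentials
    session = Hadoop.login('https://your-cluster-host:8443/gateway/default',
                           'your-username', 'your-password')
    println Hdfs.ls(session).dir('/').now().string   // directory listing as JSON
    session.shutdown()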

Cloudant

    Pull data from a Cloudant database to HDFS using Spark
    Push data from HDFS to a Cloudant database using Spark

Object Store (Swift, S3)

    Pull data from an object store to HDFS using Spark
    Push data from HDFS to an object store using Spark

dashDB

    Pull data from a dashDB database to HDFS using Spark
    Push data to a dashDB database using Spark
    Pull data from a dashDB database using Big SQL

Elasticsearch

    Push data to Elasticsearch using Spark
    Pull data from Elasticsearch to HDFS using Spark

Let’s look at a few use cases for moving data between BigInsights and dashDB. As you can see from the list above, there are three examples you can try:

  1. Pull data from a dashDB database to HDFS using Spark
  2. Push data to a dashDB database using Spark
  3. Pull data from a dashDB database using Big SQL

The first two examples use Apache Spark to move data between dashDB and BigInsights, and the third example uses Big SQL.
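
To give a flavour of the Big SQL approach, the sketch below creates a Hadoop table and populates it from dashDB with a LOAD HADOOP statement, driven from Groovy over JDBC. The hosts, credentials, and table names are placeholders, and the exact LOAD HADOOP options shown are illustrative rather than the project's actual code:

    import groovy.sql.Sql

    // Placeholder Big SQL connection - port and database name vary by BigInsights version
    def bigsql = Sql.newInstance('jdbc:db2://your-cluster-host:51000/bigsql',
                                 'your-username', 'your-password',
                                 'com.ibm.db2.jcc.DB2Driver')

    // Create a Hadoop table, then populate it from a dashDB table.
    // BLUDB is the standard dashDB database name; STAFF is a made-up source table.
    bigsql.execute 'CREATE HADOOP TABLE staff_copy (id INT, name VARCHAR(50))'
    bigsql.execute '''
        LOAD HADOOP USING JDBC CONNECTION URL 'jdbc:db2://your-dashdb-host:50000/BLUDB'
        WITH PARAMETERS (user = 'dash-user', password = 'dash-password')
        FROM TABLE STAFF INTO TABLE staff_copy APPEND
    '''
    bigsql.close()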

Running the examples is simple. First, check out and set up the project:

  1. Clone the repository: git clone https://github.com/snowch/biginsight-examples.git
  2. Copy connection.properties_template to connection.properties
  3. Edit connection.properties to add your connection details for BigInsights and any optional services, such as dashDB, that you want to connect to
  4. Export the cluster certificate from your browser
  5. In connection.properties, uncomment the line # known_hosts:allowAnyHosts
  6. Set up the driver libraries by running ./gradlew DownloadLibs (Unix) or gradlew.bat DownloadLibs (Windows) to download the libraries from the cluster

Now let’s run one example.

On Unix, run:

./gradlew -p examples/DashDBIntegrationWithBigSQL Example

On Windows, run:

gradlew.bat -p examples/DashDBIntegrationWithBigSQL Example

The above command creates a table in Hadoop and populates it with data from dashDB. If you developed this code from scratch, you could easily burn a few hours or a few days on it. Instead, in around five minutes, you can see working example code running against your own environment.

To run the whole set of examples at once, you can run:

./gradlew test (Unix)

gradlew.bat test (Windows)

(detailed output for the tests can be found in the folder ./build/test/).

For more information and to get the project's code, visit: https://github.com/snowch/biginsight-examples

We also encourage you to look at the code and share comments and ideas about future examples you’d like to see.

[1] https://www.scrumalliance.org/community/articles/2013/september/what-is-sprint-zero
[2] https://github.com/snowch/biginsight-examples
