Working with text and excel files in IBM Watson Studio using IBM Cloud Object Storage

IBM Cloud Object Storage makes use of the distributed storage technologies provided by the IBM Cloud Object Storage System (formerly Cleversafe). The COS API is used to work with the storage accounts.

A data scientist works with text, csv and excel files frequently. The files are stored and retrieved from IBM Cloud Object Storage. For analyzing the data in IBM Watson Studio using Python, the data from the files needs to be retrieved from Object Storage and loaded into a Python string, dict or a pandas dataframe.

IBM Watson Studio provides an integration with IBM Cloud Object Storage system. The integration support loads the file from the Cloud Object Storage into a ibm_botocore.response.StreamingBody object but this object cannot be directly used and requires transformation.

Learning objectives

This tutorial covers the following:

  • Create re-usable method for retrieving files into IBM Cloud Object Storage using Python on IBM Watson Studio.

  • Load a text file data from IBM Cloud Object Storage into a Python string.

  • Load an excel file into a Python Pandas DataFrame.

Prerequisites

Before beginning this tutorial, you need the following:

By signing up for the Watson Studio, two services will be created – Spark and ObjectStore in your IBM Cloud account.

Estimated Time

This tutorial will take 30 mins to complete. The creation of re-usable functions in Python will take 10 mins. The loading of text file into a Python string will take 10 mins. The loading of an excel file into a Pandas Dataframe will take 10 mins.

Steps

In the Jupyter notebook on IBM Watson Studio, perform the below steps.

Install necessary packages

Install the below packages:

Run the command !pip install ibm-cos-sdk to install the package.

Import modules

Import the below modules:

   import ibm_boto3
   from botocore.client import Config
   import json
   import pandas as pd

Insert the IBM Cloud Object Storage credentials

Insert the IBM Cloud Object Storage credentials from the menu drop-down on the file as shown below:

alt

Create an Object Storage client

Create a client that can be used to retrieve files from Object Storage or write files to Object Storage.

   cos = ibm_boto3.client('s3',
                       ibm_api_key_id=credentials_1['IBM_API_KEY_ID'],
                       ibm_service_instance_id=credentials_1['IAM_SERVICE_ID'],
                       ibm_auth_endpoint=credentials_1['IBM_AUTH_ENDPOINT'],
                       config=Config(signature_version='oauth'),
                       endpoint_url=credentials_1['ENDPOINT'])

Create a function to retrieve a file from Cloud Object Storage

The below function retrieves the file contents into a ibm_botocore.response.StreamingBody instance and returns it.

def get_file(filename):
    '''Retrieve file from Cloud Object Storage'''
    fileobject = cos.get_object(Bucket=credentials_1['BUCKET'], Key=filename)['Body']
    return fileobject

Load the file contents

Text file into a Python string

The below function takes the ibm_botocore.response.StreamingBody instance and returns the contents in a variable of type string.

def load_string(fileobject):
    '''Load the file contents into a Python string'''
    text = fileobject.read()
    return text

Text file in json format into a Python dict

The below function takes the ibm_botocore.response.StreamingBody instance and returns the contents in a variable of type dict.

def load_dict(fileobject):
    '''Load the file contents into a Python dict'''
    text = fileobject.read()
    dictformat = json.loads(text)
    return dictformat

Excel file into a Pandas dataframe

The below function takes the ibm_botocore.response.StreamingBody instance and the sheet name. It returns the sheet contents in a Pandas dataframe.

   def load_df(fileobject,sheetname):
       '''Load file contents into a Pandas dataframe'''
       excelFile = pd.ExcelFile(fileobject)
       df = excelFile.parse(sheetname)
       return df

Summary

This tutorial has covered the aspects of loading files of text and excel formats from IBM Cloud Object Storage using Python on IBM Watson Studio.

Legend