IBM Cloud Object Storage makes use of the distributed storage technologies provided by the IBM Cloud Object Storage System (formerly Cleversafe). The COS API is used to work with the storage accounts.
A data scientist works with text, csv and excel files frequently. The files are stored and retrieved from IBM Cloud Object Storage. For analyzing the data in IBM Watson Studio using Python, the data from the files needs to be retrieved from Object Storage and loaded into a Python string, dict or a pandas dataframe.
IBM Watson Studio provides an integration with IBM Cloud Object Storage system. The integration support loads the file from the Cloud Object Storage into a ibm_botocore.response.StreamingBody
object but this object cannot be directly used and requires transformation.
Learning objectives
This tutorial covers the following:
Create re-usable method for retrieving files into IBM Cloud Object Storage using Python on IBM Watson Studio.
Load a text file data from IBM Cloud Object Storage into a Python string.
Load an excel file into a Python Pandas DataFrame.
Prerequisites
Before beginning this tutorial, you need the following:
An IBM Cloud account. If not, sign up for an account
Sign up for IBM’s Watson Studio.
By signing up for the Watson Studio, two services will be created – Spark
and ObjectStore
in your IBM Cloud account.
Estimated Time
This tutorial will take 30 mins to complete. The creation of re-usable functions in Python will take 10 mins. The loading of text file into a Python string will take 10 mins. The loading of an excel file into a Pandas Dataframe will take 10 mins.
Steps
In the Jupyter notebook on IBM Watson Studio, perform the below steps.
Install necessary packages
Install the below packages:
Run the command !pip install ibm-cos-sdk
to install the package.
Import modules
Import the below modules:
import ibm_boto3
from botocore.client import Config
import json
import pandas as pd
Insert the IBM Cloud Object Storage credentials
Insert the IBM Cloud Object Storage credentials from the menu drop-down on the file as shown below:
Create an Object Storage client
Create a client that can be used to retrieve files from Object Storage or write files to Object Storage.
cos = ibm_boto3.client('s3',
ibm_api_key_id=credentials_1['IBM_API_KEY_ID'],
ibm_service_instance_id=credentials_1['IAM_SERVICE_ID'],
ibm_auth_endpoint=credentials_1['IBM_AUTH_ENDPOINT'],
config=Config(signature_version='oauth'),
endpoint_url=credentials_1['ENDPOINT'])
Create a function to retrieve a file from Cloud Object Storage
The below function retrieves the file contents into a ibm_botocore.response.StreamingBody
instance and returns it.
def get_file(filename):
'''Retrieve file from Cloud Object Storage'''
fileobject = cos.get_object(Bucket=credentials_1['BUCKET'], Key=filename)['Body']
return fileobject
Load the file contents
Text file into a Python string
The below function takes the ibm_botocore.response.StreamingBody
instance and returns the contents in a variable of type string.
def load_string(fileobject):
'''Load the file contents into a Python string'''
text = fileobject.read()
return text
Text file in json format into a Python dict
The below function takes the ibm_botocore.response.StreamingBody
instance and returns the contents in a variable of type dict.
def load_dict(fileobject):
'''Load the file contents into a Python dict'''
text = fileobject.read()
dictformat = json.loads(text)
return dictformat
Excel file into a Pandas dataframe
The below function takes the ibm_botocore.response.StreamingBody
instance and the sheet name. It returns the sheet contents in a Pandas dataframe.
def load_df(fileobject,sheetname):
'''Load file contents into a Pandas dataframe'''
excelFile = pd.ExcelFile(fileobject)
df = excelFile.parse(sheetname)
return df
Summary
This tutorial has covered the aspects of loading files of text and excel formats from IBM Cloud Object Storage using Python on IBM Watson Studio.
IBM Watson Studio: Analyze data using RStudio and Jupyter in a configured, collaborative environment that includes IBM value-adds, such as managed Spark.