Overview

Skill Level: Any Skill Level

This recipe helps you create, configure, compile, and execute a DataStage job that can read data from files on Google Cloud Storage.

Ingredients

1. IBM Infosphere Information Server Datastage 11.7.1 and above

2. DB2 database

3. Google Cloud Account

Step-by-step

  1. Description

    Google Cloud Storage is an online file storage web service for storing and accessing data on Google Cloud Platform infrastructure. The service combines the performance and scalability of Google’s cloud with advanced security and sharing capabilities.

    Information Server provides a native Google Cloud Storage Connector to read / write data from the files on Google Cloud Storage and integrate it into the ETL job design.

    We demonstrate a sample use case here which performs a read operation on Google Cloud Storage using Google Cloud Storage Connector. The datastage job includes a Google Cloud Storage Connector as source stage and a DB2 Connector as target, where the file data from Google Cloud Storage is written to a table located on DB2.

    Screen-Shot-2020-01-22-at-4.03.50-pm

    In this recipe, I will show you how we can configure Google Cloud Storage Connector properties to read data from Google Cloud Storage.

  2. Configure Google Cloud Storage Connection properties

    1. Download the Google service account credentials json file and copy it to any location on Engine tier.

    https://cloud.google.com/iam/docs/creating-managing-service-account-keys

    2. Provide the fully qualified path to the above json file under Credentials file in the Connection Properties as follows:

    Screen-Shot-2020-01-22-at-2.02.48-pm-2

  3. Configure Google Cloud Storage Connector Properties to read multiple files from Google Cloud Storage

    1. Select the Read Mode as “Read Multiple files using Wildcards” and provide the Bucket name from which the files have to be read.

    2. Provide the File Name property with wildcards from which data has to be read from Google Cloud Storage. Wildcards supported include * and ? . If there is any need for more filters in the file name, “Read multiple files using Regex Expression” can be used.

    3. Incase of using multiple files read option, all the files matching the wildcard/Regex should have the same schema.

    4. Choose the File format as Delimited. Six file formats are supported currently: Delimited, CSV, Parquet, Avro, JSON, Excel. Any file format can be selected as per the requirement.

    5. Once the file format is selected, optional formatting properties such as header, delimiters, quotation mark etc can be provided as per the usage requirement.

    Screen-Shot-2020-01-22-at-3.55.24-pm

    6. Under Output tab, provide the column name and type details of data, that needs to be read from Google Cloud Storage as follows:

    Screen-Shot-2020-01-22-at-3.55.34-pm

    7. Provide the table name and the connection details of DB2 in the DB2 Connector stage.

    8. Compile and run the job. The data from files located on Google Cloud Storage is written to a table on DB2.

     Screen-Shot-2020-01-22-at-4.01.16-pm

  4. References

    https://cloud.google.com/iam/docs/creating-managing-service-account-keys

    https://cloud.google.com/storage/

Join The Discussion