Overview

Skill Level: Beginner

This recipe helps you create, configure, compile, and run a DataStage job that writes link data to a Microsoft Azure Data Lake Storage Gen2 filesystem.

Ingredients

1. InfoSphere Information Server 11.7.1.0 + Patch JR60868 + Patch JR61087 + Patch JR61543

2. Microsoft Azure account

3. Azure Data Lake Storage credentials (client ID, client secret key, tenant ID)

Step-by-step

  1. Overview

    Microsoft Azure Data Lake Storage Gen2 combines the file system semantics of Azure Data Lake Storage Gen1 with the high availability and disaster recovery capabilities of Azure Blob storage.

    Information Server DataStage provides an ADLS Connector that can write new files to, and read existing files from, Azure Data Lake Storage Gen2. The connector can be used in ETL workflows.

    This recipe shows the steps to configure the connector in write mode, that is, how data can be loaded into existing filesystems on Gen2 as files, or as files inside directories.

    The use case considered here reads data from a Snowflake database table and writes it as a file to ADLS Gen2 storage, enabling multi-cloud data movement.

     

              [Screenshot: job design – Snowflake Connector linked to ADLS Connector]

     The above job design shows the Snowflake Connector on the source side and the ADLS Connector on the target side; the data read from the Snowflake database table is written as a file into ADLS Gen2.

  2. Steps to configure Snowflake Connector

    1. Provide the necessary Connection properties on the Snowflake Connector stage –

          [Screenshot: Snowflake Connector connection properties]

    2. Set the Generate SQL at runtime property to Yes in the Usage properties.

    3. In the Table name property, enter the name of the table whose data is to be written as a file into Gen2 –

          [Screenshot: Table name property]

    4. Provide the schema of the table to be read on the Columns tab of the connector properties. (An optional sanity check of the source table is sketched after this list.)

          [Screenshot: Columns tab with the table schema]
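
    As an optional sanity check before running the job, the source table can be previewed outside DataStage with the Snowflake connector for Python. This is a minimal sketch, not part of the recipe itself: the account, credentials, and table name (MY_TABLE) are placeholders for your own values, and the query simply mirrors the kind of SELECT the connector issues when Generate SQL at runtime is Yes.

        # Minimal sketch: preview the Snowflake source table.
        # All connection values and the table name are placeholders.
        import snowflake.connector

        conn = snowflake.connector.connect(
            account="my_account",
            user="my_user",
            password="my_password",
            warehouse="my_warehouse",
            database="my_database",
            schema="public",
        )
        try:
            cur = conn.cursor()
            # Roughly the SELECT issued when Generate SQL at runtime = Yes
            cur.execute("SELECT * FROM MY_TABLE LIMIT 5")
            for row in cur.fetchall():
                print(row)
        finally:
            conn.close()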

  3. Steps to configure the ADLS Connector and run the job

    1. Provide the required WebHDFS URL, Tenant ID, Client ID and Client Secret Key connection properties.

          [Screenshot: ADLS Connector connection properties]

    The URL for connecting to Gen2 has the format –

    https://{storageaccountname}.dfs.core.windows.net/{filesystem}

    For example, a storage account named mystorageacct with a filesystem named myfs would use https://mystorageacct.dfs.core.windows.net/myfs.

    Here, filesystem is the name of a filesystem that already exists in the storage account. The connector currently has no provision to create the filesystem, so it needs to be created in the Azure portal (or programmatically, as sketched below).

          [Screenshot: filesystem in the Azure portal]

     To create an Azure Storage account, refer to – https://docs.microsoft.com/en-us/azure/storage/blobs/data-lake-storage-quickstart-create-account#create-an-account-using-the-azure-portal

    To obtain the Tenant ID, Client ID, and Client Secret Key, refer to – https://developer.ibm.com/recipes/tutorials/how-to-procure-tenant-id-client-id-and-client-secret-key-to-connect-to-microsoft-azure-data-lake-storage-gen2/
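
    As an alternative to the portal, the filesystem can also be created programmatically. The following is a hedged sketch using the Azure SDK for Python (the azure-identity and azure-storage-file-datalake packages); the account name, credentials, and filesystem name are placeholders for your own values.

        # Sketch: create the Gen2 filesystem with the Azure SDK for Python,
        # as an alternative to creating it in the Azure portal.
        from azure.identity import ClientSecretCredential
        from azure.storage.filedatalake import DataLakeServiceClient

        credential = ClientSecretCredential(
            tenant_id="<tenant-id>",
            client_id="<client-id>",
            client_secret="<client-secret-key>",
        )
        service = DataLakeServiceClient(
            account_url="https://<storageaccountname>.dfs.core.windows.net",
            credential=credential,
        )
        # Same effect as creating the filesystem in the portal.
        service.create_file_system(file_system="<filesystem>")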

    2. Enter the name of the file in the File name property. If the file name is of the form directory/filename and the directory does not exist in the filesystem, the directory is created; if it already exists, the file is created inside it.

    Each time the job runs, the file is created anew; appending to an existing file is not supported.

         [Screenshot: File name and File format properties]

    In this job the File format is set to CSV; the other possible values are Delimited, Parquet, Avro, JSON, Excel, and ORC. The required File format properties change according to the selected file format. The sketch below illustrates the directory-creation and overwrite behaviour described above.
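
    To make that behaviour concrete, here is an illustration of the equivalent operation through the Azure SDK for Python; it is not what the connector executes internally, and the filesystem, path, and sample data are placeholders.

        # Illustration only: writing to "reports/output.csv" creates the
        # "reports" directory if it is missing, and overwrite=True re-creates
        # the file on every run, mirroring the connector's no-append behaviour.
        from azure.identity import ClientSecretCredential
        from azure.storage.filedatalake import DataLakeServiceClient

        service = DataLakeServiceClient(
            account_url="https://<storageaccountname>.dfs.core.windows.net",
            credential=ClientSecretCredential(
                "<tenant-id>", "<client-id>", "<client-secret-key>"
            ),
        )
        fs = service.get_file_system_client("<filesystem>")
        fs.get_file_client("reports/output.csv").upload_data(
            b"col1,col2\n1,a\n", overwrite=True
        )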

    3. Compile and run the job.

          [Screenshot: job compiled and run]

     

    4. The created file can be seen in the Azure portal; it can also be verified from a script, as sketched below –

          [Screenshot: created file in the Azure portal]
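
    If portal access is not convenient, the output can also be verified from a script. This is a minimal sketch using the same Azure SDK for Python as above; the filesystem name and file path are placeholders that should match what was set in the File name property.

        # Sketch: verify the job output by listing the filesystem and reading
        # back the file the connector wrote. All names are placeholders.
        from azure.identity import ClientSecretCredential
        from azure.storage.filedatalake import DataLakeServiceClient

        service = DataLakeServiceClient(
            account_url="https://<storageaccountname>.dfs.core.windows.net",
            credential=ClientSecretCredential(
                "<tenant-id>", "<client-id>", "<client-secret-key>"
            ),
        )
        fs = service.get_file_system_client("<filesystem>")

        # List every path the job created, including any directories.
        for path in fs.get_paths():
            print(path.name)

        # Download the file and show the first few CSV lines.
        data = fs.get_file_client("reports/output.csv").download_file().readall()
        print(data.decode("utf-8").splitlines()[:5])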
