Perform advanced ETL operations with DataStage

In this tutorial, we’ll learn how to use DataStage to perform extract, transform, load (ETL) operations on data stored in Netezza Performance Server. IBM DataStage® on IBM Cloud Pak for Data enables users to create, edit, load, and run DataStage jobs that integrate data from various sources to glean meaningful and valuable information. Netezza® Performance Server for IBM Cloud Pak for Data is a powerful advanced analytics platform that provides a trusted data warehouse and enables true operationalization of data science and machine learning in place and at scale.

Learning objectives

The purpose of this tutorial is to design a DataStage job to satisfy the following problem statement:

As a data engineer, you have been asked by the line of business that you support to create a data table that contains all employees with total compensation greater than $50,000. The output must also contain each employee’s department name and the mean compensation of the employees in that department who earn more than $50,000. In addition, the output must be sorted in descending order, based on the mean compensation. Finally, the application that will consume this output data expects the full name of the employee to be in one field, formatted as first name, then middle initial, then last name.

In this tutorial, you will learn how to:

  • Create a job in DataStage
  • Load data from Netezza Performance Server into DataStage
  • Perform transformations, such as modifying tables, joining tables, aggregating table data, and sorting table data
  • Write tabular data from DataStage into a table in Netezza Performance Server
  • Run jobs
  • View logs for jobs

Prerequisites

To complete this tutorial, you need access to an IBM Cloud Pak for Data cluster on which you can provision the DataStage service, as well as an IBM Netezza Performance Server instance. To load data from the command line, you also need the Netezza CLI (for the nzload command).

Estimated time

Completing this tutorial should take about an hour.

About the data

The input for this tutorial consists of the following two tables.

The DEPT table contains departmental information and has the following columns:

Column Name   Data Type
DEPTNO        VARCHAR(3)
DEPTNAME      VARCHAR(28)
MGRNO         VARCHAR(6)
ADMRDEPT      VARCHAR(3)
LOCATION      VARCHAR(5)

The EMP table contains employee information and has the following columns:

Column Name   Data Type
EMPNO         INTEGER
FIRSTNME      VARCHAR(9)
MIDINIT       VARCHAR(1)
LASTNAME      VARCHAR(10)
WORKDEPT      VARCHAR(3)
PHONENO       VARCHAR(4)
HIREDATE      DATE
JOB           VARCHAR(8)
EDLEVEL       SMALLINT
SEX           VARCHAR(1)
BIRTHDATE     DATE
SALARY        DECIMAL(10,2)
BONUS         DECIMAL(8,2)
COMM          DECIMAL(8,2)

The values in the WORKDEPT column of the EMP table come from the DEPTNO column of the DEPT table.
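
For illustration, once the tables are loaded (see Step 2), this relationship means you can pair each employee with a department using a simple join. The query below is a sketch of that pairing and is not part of the tutorial steps:

SELECT E.FIRSTNME, E.LASTNAME, D.DEPTNAME
FROM EMP E
JOIN DEPT D ON D.DEPTNO = E.WORKDEPT;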

Steps

Step 1. Set up Netezza CLI and the Netezza ODBC driver for DataStage

To connect to Netezza from DataStage, you need to download and install the Netezza ODBC driver in your is-engine-conductor-0 pod.

NOTE: If you are performing these steps in a Linux environment, set the Netezza CLI installation path (/usr/local/nz) as NETEZZA_HOME instead of NZClientDir in the dsenv file:

NETEZZA_HOME=/usr/local/nz
export NETEZZA_HOME
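
The connection you create later in DataStage refers to a data source named NZSQL. If that data source is not already defined on the engine, it would be declared in the engine’s .odbc.ini file with an entry along the following lines. This is a minimal sketch; the exact driver library path and the set of supported keys depend on your driver version and installation location:

[NZSQL]
Driver = /usr/local/nz/lib64/libnzodbc.so
Description = Netezza ODBC data source
Servername = <host>
Port = 5480
Database = DATASTAGE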

Step 2. Load data into Netezza Performance Server

Before you connect to IBM Netezza Performance Server from DataStage on IBM Cloud Pak for Data, you should create the required tables in your IBM Netezza Performance Server instance. Once the tables are created, you can populate them with data.

Create tables in Netezza Performance Server

  • Log in to your IBM Netezza Performance Server console. Click on the (☰) hamburger menu in the top-left corner, then click Data.

    Data under hamburger menu on Netezza Performance Server

  • Click on Create database +.

    Create database

  • Provide the name of the database as DATASTAGE and click Create.

    Create new database

  • The database is now created and you can see that a schema with your username has been created.

    Created database

  • Add data to the DATASTAGE database under this schema. Click on the (☰) hamburger menu, then click Query editor.

    Query editor

  • Specify the DATASTAGE database and set the schema to the one named after <your_username>. Copy the contents of the CreateNPSTables.sql file, paste them in the section below the schema name, then click Run.

    Run the CreateNPSTables SQL

If everything runs successfully, you should see a result that says “CREATE TABLE SUCCEEDED.”

NOTE: To drop the tables, you can uncomment and use commented lines 2 and 3 in the SQL.
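
If you do not have the CreateNPSTables.sql file at hand, the DDL it contains should look roughly like the following, based on the column definitions in the About the data section (the commented lines 2 and 3 are the DROP statements mentioned in the note):

-- CreateNPSTables.sql (reconstruction based on the table definitions above)
-- DROP TABLE DEPT;
-- DROP TABLE EMP;
CREATE TABLE DEPT (
    DEPTNO   VARCHAR(3),
    DEPTNAME VARCHAR(28),
    MGRNO    VARCHAR(6),
    ADMRDEPT VARCHAR(3),
    LOCATION VARCHAR(5)
);
CREATE TABLE EMP (
    EMPNO     INTEGER,
    FIRSTNME  VARCHAR(9),
    MIDINIT   VARCHAR(1),
    LASTNAME  VARCHAR(10),
    WORKDEPT  VARCHAR(3),
    PHONENO   VARCHAR(4),
    HIREDATE  DATE,
    JOB       VARCHAR(8),
    EDLEVEL   SMALLINT,
    SEX       VARCHAR(1),
    BIRTHDATE DATE,
    SALARY    DECIMAL(10,2),
    BONUS     DECIMAL(8,2),
    COMM      DECIMAL(8,2)
);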

Load the table data

You can load data into the Netezza Performance Server tables either by using the nzload command of the Netezza CLI or by running SQL insert statements in the Netezza Performance Server console.

Load the table data using the Netezza CLI

  • Download the dept.csv and emp.csv files.
  • Run the following commands to load the data from the dept.csv and emp.csv files into the DEPT and EMP tables, respectively:

NOTE: Replace <user> with your Netezza Performance Server username, <password> with your Netezza Performance Server password, and <host> with the Netezza host.

nzload -u <user> -pw <password> -host <host> -db DATASTAGE -t DEPT -delim ',' -df dept.csv
nzload -u <user> -pw <password> -host <host> -db DATASTAGE -t EMP -delim ',' -df emp.csv

If everything runs successfully, you should see the following output indicating that the load session for each table completed successfully:

Load session of table 'DEPT' completed successfully
Load session of table 'EMP' completed successfully
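
To double-check the load, you can count the rows from the Query editor (the expected counts depend on your CSV files):

SELECT COUNT(*) FROM DEPT;
SELECT COUNT(*) FROM EMP;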

Load the table data using the Netezza Performance Server console

An alternate method of loading data into the tables is to use the Netezza Performance Server console.

  • Back on the console, go to the Query editor. As before, select the DATASTAGE database and set the schema to the one named after <your_username>. Copy the contents of the LoadDataStageInput.sql file, paste them in the section below the schema name, then click Run.

    Run the LoadDataStageInput sql

If everything runs successfully, you should see a result that says “INSERT 0 1.”

NOTE: To truncate (empty) the tables, you can uncomment and use commented lines 2 and 3 in the SQL.
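
For reference, the insert statements in LoadDataStageInput.sql presumably follow a pattern like the one below; the rows shown here are illustrative examples, not the actual tutorial data:

INSERT INTO DEPT VALUES ('A00', 'SPIFFY COMPUTER SERVICE DIV.', '000010', 'A00', 'USA');
INSERT INTO EMP VALUES (10, 'CHRISTINE', 'I', 'HAAS', 'A00', '3978', '1995-01-01', 'PRES', 18, 'F', '1963-08-24', 152750.00, 1000.00, 4220.00);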

Step 3. Provision the DataStage service

  • Log in to your IBM Cloud Pak for Data console, go to the Services tab, use the Category pull-down and select Data governance, then click on the DataStage Edition tile.

    Deploy DataStage Edition service

  • Follow the instructions to provision DataStage. After the service has been provisioned, you should be able to see Transform data under (☰) hamburger menu > Organize on IBM Cloud Pak for Data.

    Transform data under hamburger menu

Step 4. Add the table definitions

  • Click Transform data.

    Open DataStage

  • You will see that a project named dstage1 is already available. We will use this existing project. Click on your project to open it.

    Open project

NOTE: To create a new project, click on + Create, which will open up a modal. Provide the name of the project and click Create. The project takes a few minutes to be created and once ready, it will be visible on the page. Click on your project to open it.

We will need to provide the definitions of the input tables DEPT and EMP here to access their data in DataStage. For this, we will first need to add the Netezza Performance Server instance as a connection in DataStage. We can then import the table definitions using this connection.

  • Click on the Connections tab and click + Create. A modal is displayed. On the General tab, provide the name of the connection as Netezza. Switch to the Connection tab.

    Add new connection to DataStage

  • On the Connection tab, select the connector type as Netezza. Specify NZSQL, DATASTAGE, <your NPS user name>, and <your NPS password> as the values for data source, database, user name, and password, respectively. Click OK.

    Add new connection to DataStage - 2

  • On the next screen, click Save.

    Add new connection to DataStage - 3

The new connection named Netezza should now be listed on the Connections tab.

Created connection

  • Switch to the Table definitions tab, click on the import icon on the right-hand side, and in the modal that opens, select Netezza, and click Next.

    Import table definition - 1

  • Select the schema, which should be the same as <your NPS user name>, then click Next.

    Import table definition - 2

  • Select the table whose definition you wish to import. In this case, select the DEPT table and click Next.

    Import table definition - 3

  • The columns of the DEPT table will be displayed and by default, all columns will be selected. Click Finish.

    Import table definition - 4

  • Click Save to save the table definition. Notice that the table definition will be saved as <schema name>.DEPT.

    Import table definition - 5

  • Repeat the above steps to import the EMP table definition. The table definition will be saved as <schema name>.EMP. Finally, you should be able to see the table definitions for both the DEPT and the EMP tables listed on the Table definitions tab.

    Imported table definitions

Step 5. Create and design a parallel job in DataStage

  • Click on the Jobs tab, then click + Create > Parallel job.

    Create parallel job

  • A new tab with the name Job_1* opens up, where you can now start designing the parallel job.

    Canvas

Take a few minutes to look at the various buttons and menus available:

  • The Show/Hide Palette button can be used to show or hide the palette on the left of the screen, which contains the connectors and stages that can be used while designing the job.
  • The Save, Compile, and Run icons are used to save, compile, and run the job.
  • The Compare icon is used to compare this job with another job in the project.
  • The View menu has a number of options, including:
    • Log to view the job run log (available once the job has been run)
    • View code to view the OSH code (available once the job has been successfully compiled)
    • Switch view to switch to another open job
    • Properties to view the properties of the job
  • The Schedule icon is used to set a schedule to run the job.
  • The Settings menu has a number of options, including:
    • Apply horizontal layout, which arranges all the connectors and stages in the canvas in a horizontal manner with data flowing from left to right
    • The ability to view/hide annotations, arrows, link names, and the type of the connector/stage.
    • Auto propagation to specify if you want automatic propagation of metadata
    • Smart palette, which applies smart logic based on usage patterns to reorder the items available in the palette; if disabled, the items in the palette display in alphabetical order
    • Smart stage suggestions applies smart logic based on usage patterns to suggest the next stage that you might want to add to the job

Before we begin designing the job, let’s save this empty job by clicking on the Save icon. In the modal that pops up, provide the name of the job as <user>_job, where <user> is your name, then click Save.

Save the job

NOTE: Remember to periodically save the job while designing it. Changes are not auto-saved and you will lose your progress if the browser is closed without saving the job.

The first step in the job is to bring in the input tables DEPT and EMP, which have already been loaded into the Netezza Performance Server instance:

  • Drag a Netezza connector from the palette onto the canvas. In the modal that opens up, select the DATASTAGE\<schema_name>.DEPT table definition, then click Next.

    Add Netezza connector

  • On the next screen, the columns in the table definition will be displayed. All the columns will be selected by default. Click Add to job.

    Add Netezza connector - add to job

  • Drag another Netezza connector to the canvas and repeat the steps given above, but this time, select the DATASTAGE\<schema_name>.EMP table definition instead. Once you complete the steps, you should see the two Netezza connectors on the canvas.

    Add Netezza connectors - completed

Next we will use a Modify stage to perform the following modifications to the MIDINIT and WORKDEPT columns in the EMP table:

  • Change the name of the WORKDEPT column to DEPTNO in order to ensure that both the tables have the same name for the column that contains the department number information. This is needed for a future step where we will join the two tables.

  • Replace any NULL MIDINIT values with " ". This is needed for a future step where we will combine the FIRSTNME, MIDINIT, and LASTNAME columns to create the FULLNAME of the employee.

  • Drag and drop a Modify stage next to the Netezza connector for the EMP table. Provide the output of the EMP table Netezza connector as the input to the Modify stage. For this, click on the little blue dot on the right side of the Netezza connector and drag the mouse pointer to the Modify stage.

NOTE: For another method to connect the Netezza connector to the Modify stage, click on the Netezza connector to select it, then drag and drop the Modify stage. The Modify stage will automatically be connected to the Netezza connector.

Add modify for EMP

  • Now that an output link has been added to the Netezza connector for the EMP table, we can double-click on this connector to open the Stage page.

  • Here, we will need to provide the connection details for the data that needs to be imported using this connector. Provide the data source, database, user name, and password as NZSQL, DATASTAGE, <your NPS user name>, and <your NPS password>. Verify that the table name is <schema_name>.EMP, then click OK.

    Stage page for EMP

  • Next, drag and drop a Join stage to the canvas and provide the output of the Modify stage as the input.

    Add join stage

  • Double-click on the Modify stage to open up the Stage page. On the Properties tab, we will provide two specifications. Click + Add Specification and add the following specifications in the spaces available:

DEPTNO=WORKDEPT

MIDINIT=handle_null(MIDINIT," ")

Modify stage - specification

  • Switch to the Outputs tab, click on the row for WORKDEPT to expand it, update the column name to DEPTNO, and click Apply > OK.

    Modify stage - output column

  • Both tables now have a column called DEPTNO, which can be used to join them. Provide the output of the DEPT table Netezza connector as the second input to the Join stage.

  • Double-click on the Netezza connector for the DEPT table. Just as in the EMP table Netezza connector, provide the data source, database, user name, and password as NZSQL, DATASTAGE, <your NPS user name>, and <your NPS password>. Also verify that the table name for this connector is <schema_name>.DEPT, then click OK.

    Stage page for DEPT

  • Double-click the Join stage to bring up the Stage page. Here, you can verify that DEPTNO is being used as the join key and that the join type is Inner. If the join key is blank, click on the drop-down list and select DEPTNO from the list. Click OK to go back to the canvas.

    Connect DEPT to Join

  • Next, add a Transformer stage to the canvas and provide the output of the Join stage as the input.

    Add transformer stage

  • Double-click the Transformer stage to open the Stage page. We will add two stage variables to this stage. On the Properties tab, click + Add. A new entry is added in the table below. Use this to define a stage variable named TOTALCOMP, which will represent the total compensation (the sum of bonus, commission, and salary) of the employee. Double-click on {derivation} to open the Derivation Builder.

    Transformer - add TOTALCOMP

  • Build the derivation rule for TOTALCOMP by finding the BONUS, COMM, and SALARY input columns in the table. Clicking on the entries in the table will insert them in the Derivation at the top. Type in + signs between them in order to complete the derivation. You can also use the search bar to find the fields. When inserted in the Derivation, the column names will be prepended with the identifier for the input link. Click OK to go back to the Stage page.

    Transformer - build TOTALCOMP derivation
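
When assembled, the finished derivation should look something like the following, where <Link> stands for the name of the input link coming into the Transformer stage:

<Link>.BONUS + <Link>.COMM + <Link>.SALARY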

  • Update the name of the stage variable to TOTALCOMP, the SQL type to Double, the precision to 9, and the scale to 2.

    Transformer - complete TOTALCOMP

  • Repeat the process above to add another Stage variable of FULLNAME, which will represent the complete name of the employee. Provide the Derivation as CompactWhiteSpace(<Link>.FIRSTNME:" ":<Link>.MIDINIT:" ":<Link>.LASTNAME), the name of the stage variable as FULLNAME, the SQL type as Varchar, the precision as 36, and the scale as 0. Click OK to save the changes and return to the canvas.

NOTE: <Link> is a placeholder for the input link coming into the Transformer stage. When you search for the FIRSTNME, MIDINIT, and LASTNAME variables in the inputs table on the Derivation Builder screen, the link name is already specified for all the input variables. CompactWhiteSpace is a function that will compact any continuous white spaces into a single white space. CompactWhiteSpace is also available in the table shown on the Derivation Builder screen. : (colon) is the operator used for concatenation.

Transformer - complete FULLNAME

  • Next, add a Join stage and an Aggregator stage to the canvas, and connect the Transformer stage to both of them so that the output of the Transformer stage is provided as their input.

    Add join and aggregator

  • Since the output links have been added, we can expose the two stage variables TOTALCOMP and FULLNAME as outputs of the Transformer stage. Once that is done, these values will be available as inputs in the subsequent stages.

  • Double-click on the Transformer stage to open the Stage page. Go to the Outputs tab and click on +Add twice to add two new entries in the table. Update one entry with the Derivation value as TOTALCOMP, the column name as TOTALCOMP, the SQL type as Double, the precision as 9, the scale as 2, and set Nullable to true. Update the second entry with the Derivation value as FULLNAME, the column name as FULLNAME, the SQL type as Varchar, the precision as 36, and the scale as 0.

  • We also need to add a constraint here, which will ensure that only the records with TOTALCOMP of more than 50000 are sent in the output. Click on the empty space under Constraint to open the Derivation Builder. Specify the derivation as TOTALCOMP>50000.

  • Switch to the second output link by clicking on the Link_<number> under output name and repeat the above steps to add the two stage variables to the output and to add the constraint. Click OK to save the changes and return to the canvas.

    Transformer - add output columns and constraint

  • Provide the output of the Aggregator stage as the input to the Join stage, double-click on the Aggregator stage to open the Stage page, and select DEPTNAME as the grouping key.

    Aggregator - add grouping key

  • Scroll down to the aggregations and select the aggregation type as Calculation, column as TOTALCOMP, calculation type as Mean Value, then click + Add Calculation.

    Aggregator - add calculation

  • Go to the Outputs tab and verify that you can see two output columns: MEAN_TOTALCOMP and DEPTNAME. Click OK to save the changes and return to the canvas.

    Aggregator - confirm output columns

  • Drag and drop a Sort stage on the canvas and provide the output of the Join stage as the input.

    Add Sort stage

  • Double-click on the Join stage to open the Stage page. On the Stage tab, verify that the join key is DEPTNAME and the join type is Inner.

    Confirm Join stage

  • Go to the Outputs tab and verify that you can see FULLNAME and MEAN_TOTALCOMP in the output column list. Click OK to go back to the canvas.

    Confirm Join stage outputs

  • Double-click on the Sort stage to open the Stage page. Specify the sorting key as MEAN_TOTALCOMP and the sort order as Descending, then click OK.

    Update Sort stage

  • Drag a Peek stage onto the canvas and connect the output of the Sort stage as the input. The Peek stage can be used to look at the intermediate state of the data at any point in the job. Adding a Peek stage here will enable us to look at the data that comes out of the Sort stage.

    Add Peek stage

  • Double-click the Peek stage. By default, it displays the first 10 records from each data partition. Enable All Records (After Skip) and click OK. This will ensure that the Peek stage outputs all the records and not just the first 10.

    Update Peek stage

  • Drag a Netezza connector to the canvas. In the modal that opens up, click on Add connector and configure details later.

    Add output connector

  • Provide the output of the Peek stage as the input to the Netezza output connector. Double-click the Netezza connector to open the Stage page. Provide the data source as NZSQL, the database as DATASTAGE, and provide your Netezza Performance Server username and password as the user name and password, respectively.

    Output connector - add details

  • Scroll down and under the Usage section, provide the table name as <schema_name>.DEPTEMP and update the table action to Replace, then click OK.

    Output connector - add details - 2

The job design is complete, and the job is now ready to be compiled and run.
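
To see the whole design at a glance, here is a rough SQL equivalent of what the job computes. This is a sketch for intuition only, not something you need to run; the CompactWhiteSpace cleanup is approximated with simple concatenation:

WITH T AS (
    SELECT E.FIRSTNME || ' ' || COALESCE(E.MIDINIT, ' ') || ' ' || E.LASTNAME AS FULLNAME,
           E.SALARY + E.BONUS + E.COMM AS TOTALCOMP,
           D.DEPTNAME
    FROM EMP E
    JOIN DEPT D ON D.DEPTNO = E.WORKDEPT
)
SELECT T.FULLNAME, T.DEPTNAME, T.TOTALCOMP, A.MEAN_TOTALCOMP
FROM T
JOIN (
    SELECT DEPTNAME, AVG(TOTALCOMP) AS MEAN_TOTALCOMP
    FROM T
    WHERE TOTALCOMP > 50000
    GROUP BY DEPTNAME
) A ON A.DEPTNAME = T.DEPTNAME
WHERE T.TOTALCOMP > 50000
ORDER BY A.MEAN_TOTALCOMP DESC;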

Step 6. Compile and run the parallel job

  • Click the Save icon to save the job. Once the job is saved, click on the Compile icon to compile it. If compilation is successful, you should see a green check mark and the message “Compiled successfully.”

    Save compile

  • Click the Run icon to run the job. In the modal that opens up, click Run.

    Run job

  • Once the job completes, you should see a message that says “Run result: Successful with warnings.” Click Log to view the warnings.

    Run result successful

  • Scroll down and look for the warnings. You can see that the warning was raised because the job attempted to drop the <schema_name>.DEPTEMP table, which does not exist at this point. You can ignore this warning. If you run the job again, the warning should go away as this time the table already exists, and you should see a “Run result: Successful” message.

    View job log

Step 7. View the output

The output was saved in the <schema_name>.DEPTEMP table in Netezza Performance Server. We can view the contents of this table in the Netezza Performance Server console.

  • Go to the Netezza Performance Server console and open the Query editor. As before, select the DATASTAGE database and set the schema to the one named after <your_username>. In the section below the schema name, type the following SQL statement and then click Run.
SELECT * FROM DEPTEMP;

If everything runs successfully, you should see the table contents under the Result section.

View output in Netezza console
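
If you want to confirm the descending sort directly, a query along these lines should work; the column names below come from the stages you configured earlier, so adjust them if your output columns are named differently:

SELECT FULLNAME, DEPTNAME, TOTALCOMP, MEAN_TOTALCOMP
FROM DEPTEMP
ORDER BY MEAN_TOTALCOMP DESC;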

While you can view the contents of the table by going to the Netezza Performance Server instance, you can also verify the output within the job canvas by using the Peek stage. Because the Peek stage was added just after the Sort stage, the data it displays is the output of the Sort stage. This same data is then passed to the Netezza connector, which writes it into a table in the Netezza Performance Server instance.

Go back to your job on DataStage in Cloud Pak for Data. Within the job log, look for the entries similar to the one in the image below that show the Peek results.

View Peek results

NOTE: In this particular case, there will be two such entries with the Peek results because the input data was divided into two partitions based on DataStage presets.

Summary

This tutorial has shown you an example of how you can utilize the power of DataStage on IBM Cloud Pak for Data for performing ETL transformations on data stored in Netezza Performance Server. To keep developing your skills, check out the Getting started with IBM Cloud Pak for Data learning path.