In this tutorial, we’ll learn how to use DataStage to perform extract, transform, load (ETL) operations on data stored in Netezza Performance Server. IBM DataStage® on IBM Cloud Pak for Data enables users to create, edit, load, and run DataStage jobs, which can be used to integrate data from various sources and glean meaningful, valuable information. Netezza® Performance Server for IBM Cloud Pak for Data is a powerful advanced analytics platform that provides a trusted data warehouse and enables true operationalization of data science and machine learning in place and at scale.
Learning objectives
The purpose of this tutorial is to design a DataStage job to satisfy the following problem statement:
As a data engineer, you have been asked by the line of business that you support to create a data table that contains all employees with compensation greater than $50,000. The table must also contain each employee’s department name and the mean compensation of the employees in that department who earn more than $50,000. In addition, the output must be sorted in descending order based on that mean compensation. Finally, the application that will consume this output data expects the full name of the employee in a single field, formatted as first name, then middle initial, then last name.
In this tutorial, you will learn how to:
- Create a job in DataStage
- Load data from Netezza Performance Server into DataStage
- Perform transformations, such as modifying tables, joining tables, aggregating table data, and sorting table data
- Write tabular data from DataStage into a table in Netezza Performance Server
- Run jobs
- View logs for jobs
Prerequisites
Estimated time
Completing this tutorial should take about one hour.
About the data
The input for this tutorial consists of the following two tables.
The DEPT table contains departmental information and has the following columns:
Column Name | Data Type |
---|---|
DEPTNO | VARCHAR(3) |
DEPTNAME | VARCHAR(28) |
MGRNO | VARCHAR(6) |
ADMRDEPT | VARCHAR(3) |
LOCATION | VARCHAR(5) |
The EMP table contains employee information and has the following columns:
Column Name | Data Type |
---|---|
EMPNO | INTEGER |
FIRSTNME | VARCHAR(9) |
MIDINIT | VARCHAR(1) |
LASTNAME | VARCHAR(10) |
WORKDEPT | VARCHAR(3) |
PHONENO | VARCHAR(4) |
HIREDATE | DATE |
JOB | VARCHAR(8) |
EDLEVEL | SMALLINT |
SEX | VARCHAR(1) |
BIRTHDATE | DATE |
SALARY | DECIMAL(10,2) |
BONUS | DECIMAL(8,2) |
COMM | DECIMAL(8,2) |
The values in the WORKDEPT column of the EMP table come from the DEPTNO column of the DEPT table.
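For orientation before building the job, here is a rough SQL sketch of the result the finished job produces. This is illustrative only; the tutorial implements the same logic with DataStage stages (Modify, Join, Transformer, Aggregator, Sort) rather than a query:

```sql
-- Illustrative only: employees with total compensation above 50,000,
-- with their department name and the department's mean compensation
-- (computed over those same filtered employees), sorted by that mean.
WITH comp AS (
    SELECT TRIM(e.FIRSTNME) || ' ' || COALESCE(e.MIDINIT, ' ') || ' ' || TRIM(e.LASTNAME) AS FULLNAME,
           d.DEPTNAME,
           e.SALARY + e.BONUS + e.COMM AS TOTALCOMP
    FROM EMP e
    JOIN DEPT d ON d.DEPTNO = e.WORKDEPT
    WHERE e.SALARY + e.BONUS + e.COMM > 50000
)
SELECT FULLNAME,
       DEPTNAME,
       AVG(TOTALCOMP) OVER (PARTITION BY DEPTNAME) AS MEAN_TOTALCOMP
FROM comp
ORDER BY MEAN_TOTALCOMP DESC;
```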
Steps
Step 1. Set up Netezza CLI and the Netezza ODBC driver for DataStage
To connect to Netezza from DataStage, you need to download and install the Netezza ODBC driver in your `is-engine-conductor-0` pod:
- Follow the client software installation steps to install the Netezza CLI.
- Follow the instructions to configure access to Netezza databases from DataStage.
NOTE: If you are performing these steps in a Linux environment, you will need to set the Netezza CLI install path (`/usr/local/nz`) as `NETEZZA_HOME` instead of `NZClientDir` in the dsenv file:

```sh
NETEZZA_HOME=/usr/local/nz
export NETEZZA_HOME
```
Step 2. Load data into Netezza Performance Server
Before you connect to IBM Netezza Performance Server from DataStage on IBM Cloud Pak for Data, you should create the required tables in your IBM Netezza Performance Server instance. Once the tables are created, you can populate these tables with the data.
Create tables in Netezza Performance Server
- Log in to your IBM Netezza Performance Server console. Click on the (☰) hamburger menu in the top-left corner, then click Data.
- Click on Create database +.
- Provide the name of the database as `DATASTAGE` and click Create. The database is now created, and you can see that a schema with your username has been created.
- Add data to the DATASTAGE database under this schema. Click on the (☰) hamburger menu, then click Query editor.
- Specify the DATASTAGE database and set the schema to the one named `<your_username>`. Copy the contents of the CreateNPSTables.sql, paste it in the section below the schema name, then click Run.
If everything runs successfully, you should see a result that says “CREATE TABLE SUCCEEDED.”
NOTE: To drop the tables, you can uncomment and use commented lines 2 and 3 in the SQL.
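CreateNPSTables.sql is provided with the tutorial. Based on the column definitions in the About the data section, its statements should look roughly like the following sketch (not necessarily the exact file contents):

```sql
CREATE TABLE DEPT (
    DEPTNO   VARCHAR(3),
    DEPTNAME VARCHAR(28),
    MGRNO    VARCHAR(6),
    ADMRDEPT VARCHAR(3),
    LOCATION VARCHAR(5)
);

CREATE TABLE EMP (
    EMPNO     INTEGER,
    FIRSTNME  VARCHAR(9),
    MIDINIT   VARCHAR(1),
    LASTNAME  VARCHAR(10),
    WORKDEPT  VARCHAR(3),
    PHONENO   VARCHAR(4),
    HIREDATE  DATE,
    JOB       VARCHAR(8),
    EDLEVEL   SMALLINT,
    SEX       VARCHAR(1),
    BIRTHDATE DATE,
    SALARY    DECIMAL(10,2),
    BONUS     DECIMAL(8,2),
    COMM      DECIMAL(8,2)
);
```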
Load the table data
You can load data into the Netezza Performance Server tables either by using the `nzload` command of the Netezza CLI or by running SQL INSERT statements in the Netezza Performance Server console.
Load the table data using the Netezza CLI
- Download the dept.csv and emp.csv files.
- Run the following commands to load the data from the dept.csv and emp.csv files into the DEPT and EMP tables, respectively:
NOTE: Replace `<user>` with your Netezza Performance Server username, `<password>` with your Netezza Performance Server password, and `<host>` with the Netezza host.

```sh
nzload -u <user> -pw <password> -host <host> -db DATASTAGE -t DEPT -delim ',' -df dept.csv
nzload -u <user> -pw <password> -host <host> -db DATASTAGE -t EMP -delim ',' -df emp.csv
```
If everything runs successfully, you should see output indicating that the load session for each table completed successfully:

```
Load session of table 'DEPT' completed successfully
Load session of table 'EMP' completed successfully
```
Load the table data using the Netezza Performance Server console
An alternate method of loading data into the tables is to use the Netezza Performance Server console.
Back on the console, go to the Query editor. As before, specify the DATASTAGE database and set the schema to the one named `<your_username>`. Copy the contents of the LoadDataStageInput.sql, paste it in the section below the schema name, then click Run.
If everything runs successfully, you should see a result that says “INSERT 0 1.”
NOTE: To truncate (empty) the tables, you can uncomment and use commented lines 2 and 3 in the SQL.
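LoadDataStageInput.sql is provided with the tutorial; as the “INSERT 0 1” result suggests, it is a series of plain INSERT statements, roughly of this shape (the column values shown here are illustrative, not the actual tutorial data):

```sql
INSERT INTO DEPT (DEPTNO, DEPTNAME, MGRNO, ADMRDEPT, LOCATION)
VALUES ('A00', 'SPIFFY COMPUTER SERVICE DIV.', '000010', 'A00', 'LOC01');

INSERT INTO EMP (EMPNO, FIRSTNME, MIDINIT, LASTNAME, WORKDEPT, PHONENO, HIREDATE,
                 JOB, EDLEVEL, SEX, BIRTHDATE, SALARY, BONUS, COMM)
VALUES (10, 'CHRISTINE', 'I', 'HAAS', 'A00', '3978', '1995-01-01',
        'PRES', 18, 'F', '1963-08-24', 152750.00, 1000.00, 4220.00);
```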
Step 3. Provision the DataStage service
Log in to your IBM Cloud Pak for Data console, go to the Services tab, use the Category pull-down and select Data governance, then click on the DataStage Edition tile.
Follow the instructions to provision DataStage. After the service has been provisioned, you should be able to see Transform data under the (☰) hamburger menu > Organize on IBM Cloud Pak for Data.
Step 4. Add the table definitions
Click Transform data.
You will see that a project named `dstage1` is already available. We will use this existing project. Click on your project to open it.
NOTE: To create a new project, click on + Create, which will open up a modal. Provide the name of the project and click Create. The project takes a few minutes to be created and once ready, it will be visible on the page. Click on your project to open it.
We will need to provide the definitions of the input tables DEPT and EMP here to access their data in DataStage. For this, we will first need to add the Netezza Performance Server instance as a connection in DataStage. We can then import the table definitions using this connection.
- Click on the Connections tab and click + Create. A modal is displayed. On the General tab, provide the name of the connection as `Netezza`, then switch to the Connection tab.
- On the Connection tab, select the connector type as `Netezza`. Specify `NZSQL`, `DATASTAGE`, `<your NPS user name>`, and `<your NPS password>` as the values for data source, database, user name, and password, respectively. Click OK. On the next screen, click Save.
- The new connection named `Netezza` should now appear on the Connections tab.
- Switch to the Table definitions tab, click on the import icon on the right-hand side, and in the modal that opens, select Netezza, then click Next.
- Select the schema, which should be the same as `<your NPS user name>`, then click Next.
- Select the table whose definition you wish to import: select the `DEPT` table and click Next.
- The columns of the DEPT table will be displayed, with all columns selected by default. Click Finish.
- Click Save to save the table definition. Notice that the table definition is saved as `<schema name>.DEPT`.
- Repeat the above steps to import the EMP table definition, which is saved as `<schema name>.EMP`.

Finally, you should be able to see the table definitions for both the DEPT and EMP tables listed on the Table definitions tab.
Step 5. Create and design a parallel job in DataStage
Click on the Jobs tab, then click + Create > Parallel job.
A new tab with the name `Job_1*` opens up, where you can now start designing the parallel job.
Take a few minutes to look at the various buttons and menus available:
- The Show/Hide Palette button can be used to show or hide the palette on the left of the screen, which contains the connectors and stages that can be used while designing the job.
- The Save, Compile, and Run icons are used to save, compile, and run the job.
- The Compare icon is used to compare this job with another job in the project.
- The View menu has a number of options, including:
- Log to view the job run log (available once the job has been run)
- View code to view the OSH code (available once the job has been successfully compiled)
- Switch view to switch to another open job
- Properties to view the properties of the job
- The Schedule icon is used to set a schedule to run the job.
- The Settings menu has a number of options, including:
- Apply horizontal layout, which arranges all the connectors and stages in the canvas in a horizontal manner with data flowing from left to right
- The ability to view/hide annotations, arrows, link names, and the type of the connector/stage.
- Auto propagation to specify if you want automatic propagation of metadata
- Smart palette, which applies smart logic based on usage patterns to reorder the items available in the palette; if disabled, the items in the palette display in alphabetical order
- Smart stage suggestions, which applies smart logic based on usage patterns to suggest the next stage that you might want to add to the job
Before we begin designing the job, let’s save this empty job by clicking on the Save icon. In the modal that pops up, provide the name of the job as `<user>_job`, where `<user>` is your name, then click Save.
NOTE: Remember to periodically save the job while designing it. Changes are not auto-saved and you will lose your progress if the browser is closed without saving the job.
The first step in the job is to bring in the input tables DEPT and EMP, which have already been loaded into the Netezza Performance Server instance:
- Drag a Netezza connector from the palette onto the canvas. In the modal that opens up, select the `DATASTAGE\<schema_name>.DEPT` table definition, then click Next.
- On the next screen, the columns in the table definition will be displayed, with all columns selected by default. Click Add to job.
- Drag another Netezza connector to the canvas and repeat the steps above, but this time select the `DATASTAGE\<schema_name>.EMP` table definition instead. Once you complete the steps, you should see the two Netezza connectors on the canvas.
Next we will use a Modify stage to perform the following modifications to the `MIDINIT` and `WORKDEPT` columns in the EMP table:

- Change the name of the WORKDEPT column to `DEPTNO` to ensure that both tables use the same name for the column that contains the department number. This is needed for a later step where we will join the two tables.
- Replace any NULL `MIDINIT` values with `" "`. This is needed for a later step where we will combine the FIRSTNME, MIDINIT, and LASTNAME columns to create the FULLNAME of the employee.

Drag and drop a Modify stage next to the Netezza connector for the EMP table. Provide the output of the EMP table Netezza connector as the input to the Modify stage. To do this, click on the little blue dot on the right side of the Netezza connector and drag the mouse pointer to the Modify stage.
NOTE: For another method to connect the Netezza connector to the Modify stage, click on the Netezza connector to select it, then drag and drop the Modify stage. The Modify stage will automatically be connected to the Netezza connector.
Now that an output link has been added to the Netezza connector for the EMP table, we can double-click on this connector to open the Stage page.
Here, we need to provide the connection details for the data that is imported through this connector. Provide the data source, database, user name, and password as `NZSQL`, `DATASTAGE`, `<your NPS user name>`, and `<your NPS password>`. Verify that the table name is `<schema_name>.EMP`, then click OK.

Next, drag and drop a Join stage to the canvas and provide the output of the Modify stage as the input.
Double-click on the Modify stage to open up the Stage page. On the Properties tab, we will provide two specifications. Click + Add Specification and add the following specifications in the spaces available:
```
DEPTNO=WORKDEPT
MIDINIT=handle_null(MIDINIT," ")
```

Switch to the Outputs tab, click on the row for WORKDEPT to expand it, update the column name to `DEPTNO`, and click Apply > OK.

Both tables now have a column called DEPTNO, which can be used to join them. Provide the output of the DEPT table Netezza connector as the second input to the Join stage.
Double-click on the Netezza connector for the DEPT table. Just as with the EMP table Netezza connector, provide the data source, database, user name, and password as `NZSQL`, `DATASTAGE`, `<your NPS user name>`, and `<your NPS password>`. Also verify that the table name for this connector is `<schema_name>.DEPT`, then click OK.

Double-click the Join stage to bring up the Stage page. Here, you can verify that `DEPTNO` is being used as the join key and the join type is `Inner`. If the join key is blank, click on the drop-down list and select DEPTNO from the list. Click OK to go back to the canvas.

Next, add a Transformer stage to the canvas and provide the output of the Join stage as the input.
Double-click the Transformer stage to open the Stage page. We will add two stage variables to this stage. On the Properties tab, click + Add. A new entry is added in the table below. Use this to define a stage variable named `TOTALCOMP`, which will represent the total compensation (the sum of bonus, commission, and salary) of the employee. Double-click on {derivation} to open the Derivation Builder.

Build the derivation rule for `TOTALCOMP` by finding the BONUS, COMM, and SALARY input columns in the table. Clicking on the entries in the table will insert them in the Derivation at the top. Type in `+` signs between them to complete the derivation. You can also use the search bar to find the fields. When inserted in the Derivation, the column names will be prepended with the identifier for the input link. Click OK to go back to the Stage page.

Update the name of the stage variable to `TOTALCOMP`, the SQL type to `Double`, the precision to `9`, and the scale to `2`.

Repeat the process above to add another stage variable, `FULLNAME`, which will represent the complete name of the employee. Provide the Derivation as `CompactWhiteSpace(<Link>.FIRSTNME:" ":<Link>.MIDINIT:" ":<Link>.LASTNAME)`, the name of the stage variable as `FULLNAME`, the SQL type as `Varchar`, the precision as `36`, and the scale as `0`. Click OK to save the changes and return to the canvas.
NOTE: `<Link>` is a placeholder for the input link coming into the Transformer stage. When you search for the `FIRSTNME`, `MIDINIT`, and `LASTNAME` variables in the inputs table on the Derivation Builder screen, the link name is already specified for all the input variables. `CompactWhiteSpace` is a function that compacts any run of continuous white spaces into a single white space; it is also available in the table shown on the Derivation Builder screen. `:` (colon) is the operator used for concatenation.
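Put together, the two stage-variable derivations look like the following; `Lnk_Join` is a stand-in for whatever your input link is actually named:

```
TOTALCOMP: Lnk_Join.BONUS + Lnk_Join.COMM + Lnk_Join.SALARY
FULLNAME:  CompactWhiteSpace(Lnk_Join.FIRSTNME : " " : Lnk_Join.MIDINIT : " " : Lnk_Join.LASTNAME)
```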
Next, add a Join and an Aggregator stage to the canvas. Connect the Transformer stage to both these stages such that the output of the Transformer stage is provided as the input to both these stages.
Since the output links have been added, we can provide the two stage variables `TOTALCOMP` and `FULLNAME` as outputs of the Transformer stage; once that is done, these values will be available as inputs in the subsequent stages.

Double-click on the Transformer stage to open the Stage page. Go to the Outputs tab and click + Add twice to add two new entries in the table. Update one entry with the Derivation value as `TOTALCOMP`, the column name as `TOTALCOMP`, the SQL type as `Double`, the precision as `9`, the scale as `2`, and set Nullable to `true`. Update the second entry with the Derivation value as `FULLNAME`, the column name as `FULLNAME`, the SQL type as `Varchar`, the precision as `36`, and the scale as `0`.

We also need to add a constraint here, which ensures that only the records with a `TOTALCOMP` of more than 50,000 are sent to the output. Click on the empty space under Constraint to open the Derivation Builder. Specify the derivation as `TOTALCOMP>50000`.

Switch to the second output link by clicking on the `Link_<number>` under the output name and repeat the above steps to add the two stage variables to the output and to add the constraint. Click OK to save the changes and return to the canvas.

Provide the output of the Aggregator stage as the input to the Join stage, double-click on the Aggregator stage to open the Stage page, and select DEPTNAME as the grouping key.
Scroll down to the aggregations and select the aggregation type as `Calculation`, the column as `TOTALCOMP`, and the calculation type as `Mean Value`, then click + Add Calculation.

Go to the Outputs tab and verify that you can see two output columns: MEAN_TOTALCOMP and DEPTNAME. Click OK to save the changes and return to the canvas.
Drag and drop a Sort stage on the canvas and provide the output of the Join stage as the input.
Double-click on the Join stage to open the Stage page. On the Stage tab, verify that the join key is `DEPTNAME` and the join type is `Inner`.

Go to the Outputs tab and verify that you can see FULLNAME and MEAN_TOTALCOMP in the output column list. Click OK to go back to the canvas.

Double-click on the Sort stage to open the Stage page. Specify the sorting key as `MEAN_TOTALCOMP` and the sort order as `Descending`, then click OK.

Drag a Peek stage onto the canvas and connect the output of the Sort stage as the input. The Peek stage can be used to look at the intermediate state of the data at any point in the job. Adding a Peek stage here will enable us to look at the data that comes out of the Sort stage.
Double-click the Peek stage. By default, it displays the first 10 records from each data partition. Enable All Records (After Skip) and click OK. This will ensure that the Peek stage outputs all the records and not just the first 10.
Drag a Netezza connector to the canvas. In the modal that opens up, click on Add connector and configure details later.
Provide the output of the Peek stage as the input to the Netezza output connector. Double-click the Netezza connector to open the Stage page. Provide the data source as `NZSQL`, the database as `DATASTAGE`, and provide your Netezza Performance Server username and password as the user name and password, respectively.

Scroll down and, under the Usage section, provide the table name as `<schema_name>.DEPTEMP` and update the table action to `Replace`, then click OK.
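At this point, the completed flow on the canvas should look roughly like the following schematic (stage positions and link names on your canvas will differ; the lower line indicates the Transformer’s second output link feeding the second Join directly):

```
Netezza (EMP) ─▶ Modify ─┐
                         ├─▶ Join ─▶ Transformer ─▶ Aggregator ─▶ Join ─▶ Sort ─▶ Peek ─▶ Netezza (DEPTEMP)
Netezza (DEPT) ──────────┘               │                         ▲
                                         └── second output link ───┘
```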
The job design is complete, and the job is now ready to be compiled and run.
Step 6. Compile and run the parallel job
Click the Save icon to save the job. Once the job is saved, click on the Compile icon to compile it. If compilation is successful, you should see a green check mark and the message “Compiled successfully.”
Click the Run icon to run the job. In the modal that opens up, click Run.
Once the job completes, you should see a message that says “Run result: Successful with warnings.” Click Log to view the warnings.
Scroll down and look for the warnings. You can see that the warning was raised because the job attempted to drop the `<schema_name>.DEPTEMP` table, which does not exist at this point. You can ignore this warning. If you run the job again, the warning should go away because the table now exists, and you should see a “Run result: Successful” message.
Step 7. View the output
The output was saved in the `<schema_name>.DEPTEMP` table in Netezza Performance Server. We can view the contents of this table in the Netezza Performance Server console.
- Go to the Netezza Performance Server console and open the Query editor. As before, specify the DATASTAGE database and set the schema to the one named `<your_username>`. In the section below the schema name, type the following SQL statement, then click Run.

```sql
SELECT * FROM DEPTEMP;
```
If everything runs successfully, you should see the table contents under the Result section.
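As an optional sanity check (a suggestion, not part of the tutorial files), you can also confirm the descending sort of the mean compensation, assuming the DEPTNAME and MEAN_TOTALCOMP columns are present in your DEPTEMP table:

```sql
-- Each department's mean compensation should be no larger than the previous one
SELECT DISTINCT DEPTNAME, MEAN_TOTALCOMP
FROM DEPTEMP
ORDER BY MEAN_TOTALCOMP DESC;
```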
While you can view the contents of the table in the Netezza Performance Server instance, you can also verify the output from the job canvas using the Peek stage. Because the Peek stage was added just after the Sort stage, the data it displays is the output of the Sort stage; this same data is then passed to the Netezza connector, which writes it into a table in the Netezza Performance Server instance.
Go back to your job in DataStage on Cloud Pak for Data. Within the job log, look for the entries that show the Peek results.
NOTE: In this particular case, there will be two such entries with the Peek results because the input data was divided into two partitions based on DataStage presets.
Summary
This tutorial has shown you an example of how you can utilize the power of DataStage on IBM Cloud Pak for Data for performing ETL transformations on data stored in Netezza Performance Server. To keep developing your skills, check out the Getting started with IBM Cloud Pak for Data learning path.