Large-scale enterprises tend to have a large amount of linked data spread across multiple tables and files. There is a real need for the ability to gain insight about the quality and business content of these tables and files in order to identify and locate the correct information quickly. Watson Knowledge Catalog on IBM Cloud Pak for Data enables users to discover and analyze assets present in various data sources.
The data discovery feature of Watson Knowledge Catalog enables users to gain insight about the quality and content of their data assets using one of two methods. When the size and number of assets is unknown, Quick scan provides a quick view of their data quality. When there is a good first-level understanding of the data assets, Automated discovery provides the means to run a deeper analysis on the assets.
In this tutorial, you will learn how to use Quick scan and Automated discovery on IBM Cloud Pak for Data to discover and analyze data assets from data sources.
In this tutorial, you will:
- Discover data assets using Quick scan.
- Look through the results of Quick scan.
- Discover data assets using Automated discovery.
- Look through the results of Automated discovery.
- IBM Cloud account
- Db2 on IBM Cloud
- IBM Cloud Pak for Data
- Watson Knowledge Catalog on Cloud Pak for Data
- The steps in Incorporate enterprise governance in your data must be completed.
This tutorial will take approximately 60 minutes to complete.
About the data
This tutorial uses synthetic patient healthcare data created using Synthea. The data sets provided as part of this tutorial are:
- PATIENTS – demographic information about patients
- PAYERS – information about payers such as an insurance company
- PROVIDERS – information about providers such as a primary care physician
- ORGANIZATIONS – information about organizations that the providers are associated with
- ENCOUNTERS – information about an encounter between patients/providers/payers/organizations
- ALLERGIES – allergy information for a patient
- CAREPLANS – information about care plans devised for a patient
- CONDITIONS – information about identified conditions for a patient
- IMMUNIZATIONS – information about the immunizations provided to a patient
- MEDICATIONS – information about the medications that a patient is taking
- OBSERVATIONS – patients’ healthcare observations such as weight/height/blood pressure taken during an encounter
- PROCEDURES – information about procedures that a patient has undergone
NOTE: This tutorial provides instructions for discovering and analyzing the PATIENTS and ENCOUNTERS tables. For an extended version of this tutorial, load all 12 data sets.
About data discovery
You can gain insight about the quality and business content of the tables and files within your data sources using data discovery. Data discovery comes in two flavors: Quick scan and Automated discovery.
Use Quick scan to get a quick first-level understanding of your data. This is especially helpful when you have an unknown number of assets of unknown size. Quick scan quickly analyzes a sample of each table or file to generate insights such as the quality score of the data, the business terms within the data, and the data classes for the data. You can publish the data to one or more catalogs once you have reviewed it.
Once you have a fair understanding of the kind of data present in your data sources, you can run a deeper analysis on perhaps the most interesting or most useful of the assets using Automated discovery. Unlike Quick scan, with Automated discovery, the metadata and analysis results are automatically imported into the default catalog. The analysis results are also made available for viewing and updating within a project, and these results include the data quality score, automatically assigned data classes, business terms, data types, formats, frequency distributions, and more.
NOTE: As part of this tutorial, you will first run Quick scan on your data and glance through the results. Next, you will run Automated discovery on your data. In the next tutorial within this series, you will perform deeper analysis on the Automated discovery results for your data.
Step 1. Load data into the data sources
NOTE: Ensure that you have provisioned Db2 on IBM Cloud before proceeding.
Download the data sets
- Download and extract the contents of the Healthcare-Data.zip file.
Seed the Db2 database
Open a browser and log into your Db2 instance.
Navigate to the hamburger (☰) menu in the upper-left corner, expand Load, and click on Load Data. Click browse files and select the patients.csv file from the downloaded archive, then click Next.
Choose your schema (the name will match your Db2 username) and click + New Table. Under Create a new Table, provide
PATIENTSas the name of the table and click Create > Next.
On the next screen, you can define the data types for the columns in your table. The screen shows the data types that have been detected by Db2. For now, click Next to continue with the defaults. On the next screen, click Begin Load.
The PATIENTS table will now be created by loading the records found in patients.csv file.
Repeat the steps to load encounters.csv file into ENCOUNTERS table.
NOTE: For the extended version of this tutorial, load all 12 data sets provided in the Healthcare-Data.zip archive. The following table provides the table names for each input file.
|File Name||Table Name|
Get the database connection info
Log into IBM Cloud and navigate to your Db2 resource.
Go to Service Credentials, click New credential +. A new set of Service credentials is created. Expand the entry to look at the credentials.
From the service credentials, extract the following values. You will need them later when you create a platform connection for this Db2 instance in your Cloud Pak for Data instance.
The SSL certificate is in the base64 encoded DER format and Cloud Pak for Data requires that the certificate should be in the PEM format. Save the certificate_base64 value in a file (db2_certificate.pfx). Then run the following command in a terminal (or command prompt) to convert the certificate:
openssl base64 -d -A -in db2_certificate.pfx -out decoded_db2_certificate.pem
- The decoded_db2_certificate.pem file contains the decoded certificate needed while creating the platform connection on your Cloud Pak for Data instance.
Step 2. Add connection to CPD
You will now add the Db2 instance as a platform connection to IBM Cloud Pak for Data.
Log into your IBM Cloud Pak for Data instance.
To add a new data source, go to the upper left hamburger (☰) menu, expand Data and click on Platform connections.
At the overview, click New connection +.
Provide a name for the connection. Provide the connection details (database name, host, port, username and password) that were obtained earlier. Click the checkbox for “The port is configured to accept SSL connections” and provide the converted SSL certificate (from the decoded_db2_certificate.pem file) in the textbox for SSL Certificate. Click on Test to test the connection. Once the test is successful, you will see a notification at the top of the screen. Click Create to create the connection.
Step 3. Perform Quick scan data discovery
The next step is to perform a Quick scan of the data in the connection.
NOTE: Ensure that the steps outlined in the Incorporate enterprise governance in your data tutorial have been completed, and the required governance artifacts are available.
Navigate to the upper-left hamburger (☰) menu, expand Governance and click on Data discovery.
Click on Quick scan.
Click Select a connection > Find or add connection.
Select the Db2 connection you had added as a platform connection in IBM Cloud Pak for Data and click Add.
Click Browse to select the discovery root. This will enable you to restrict which schemas within the Db2 database will be discovered.
Select the schema where you had created the tables and click Select.
Back on the Quick scan job creation screen, click Select a project > Add a project.
Give the project a name (HealthcareAnalysis-QS), an optional description (Data Quality project for Healthcare data using Quick scan) and click Create.
Ensure that the checkboxes for Analyze data quality and Use machine learning to assign terms are checked, then click Discover to start discovering the assets.
A new Quick scan job is kicked off. An entry for the job is displayed on the Pending analysis tab with the status as Analyzing. Give it about a minute and click the Refresh icon to refresh the list.
When the analysis completes, the record won’t be visible on the Pending analysis tab anymore and will be shown under the Action required tab with the status as Ready for review. Click on the Job ID.
Step 4. Review Quick scan discovery results
- Take a few minutes to look at all the information presented on the screen. All the discovered assets are displayed on the screen. You can use the left-hand panel to choose the type of assets you would like to see – files, schemas, tables, and columns. Look through the other filter options available – for example, for Column asset types, you can filter on Tables, Schemas, Quality score, Assigned business term, Assigned data classes, and more.
NOTE: If you have loaded all 12 tables into Db2, as per the extended version of this tutorial, you will see that 12 tables have been discovered during data discovery.
Click on Table to show the table assets and look for the PATIENTS data asset. The quality score for the PATIENTS data asset is shown under the Quality column. Click on PATIENTS.
All the columns in the PATIENTS data asset are shown on the screen. Look through the information provided for each column. Most of the values shown are read-only and cannot be updated. For example, for the BIRTHPLACE column the quality score is 97%, but you cannot see the reasoning behind the score. The Assigned data class is Person Name with a confidence of 67%, which is obviously incorrect, but it is a read-only field, so it cannot be changed. The only thing that can be edited is the Assigned business term. There is no business term assigned to the BIRTHPLACE column, and Patient Birthplace is the Suggested business term. To edit the business term for the column, click on the corresponding pencil icon under Actions.
A pop-up window is displayed for setting the business term for the column. To assign a business term from the Suggested business terms, click on the checkmark against the business term. Or you can use the search box to look for the business term to assign and click on the search result that is shown. Click Update to save the changes.
You can see that the Assigned business term for BIRTHPLACE has now been updated. If you wish to, you can repeat these steps to view and update the business terms for the other columns. (NOTE: you may need to use the arrows at the bottom of the page to view all the columns.) Click on the back arrow to go back to the discovery results for the discovery job.
NOTE: Quick scan provides a first glance at your data, which enables you to figure out which data assets you want to analyze thoroughly using Automated discovery. You will be performing Automated discovery on your entire data in the next few sections of this tutorial, and you will be performing in-depth data analysis on the data assets in the next tutorial within this series.
As in case of PATIENTS data asset, you can continue with updates for the rest of the tables if you wish. Once you are satisfied with the updates, you can publish the assets to a catalog in order to make them available for other users. Select all the tables that you have reviewed and wish to publish and click Publish assets.
A pop-up window is shown, listing the tables that have been selected for publish. Use the drop-down to select the catalog you wish to publish the assets to and click Publish.
The status of the assets will be updated to Submitted and once the assets are published, the status will be updated to Published. Click Data discovery in the breadcrumbs to go back to the Quick scan results screen.
Once the assets are published to the catalog, the Quick scan job status will be updated to Reviewed, and the job will be moved to the Reviewed tab on the Quick scan results screen.
Step 5. Perform automated data discovery
You have performed Quick scan on your data to view the data assets and to get an idea about their data quality. Now, you can perform deeper analysis on your data assets using automated data discovery.
NOTE: Ensure that the steps outlined in the Incorporate enterprise governance in your data tutorial have been completed, and the required governance artifacts are available.
Click on New discovery job > Automated discovery.
Click Select a connection and choose the same Db2 connection you selected for Quick scan.
Click Browse to select the discovery root. This will enable you to restrict which schemas/tables within the Db2 database will be discovered.
Unlike the Quick scan, Automated discovery allows you to drill further down and choose tables within the schema. Thus, you can pick and choose which tables you want to perform the deeper quality analysis on. But for now, select the schema so you can perform Automated discovery on all the tables in the schema. Select the schema where you had created the tables and click Select.
Back on the Automated discovery job creation screen, click Select a project > Add a project.
Give the project a name (HealthcareAnalysis) and click Create.
Click Go to project settings to update the project’s settings.
The project settings will be opened in a new browser tab. Click on the Keys and relationships tab, scroll down to Foreign key settings and update Maximum percentage of allowed orphan values to
99, Minimum percentage of common distinct values to
0.1, and Minimum confidence for the relationships to
0.1. Click Save to save the changes and close the browser tab.
Back on the Automated discovery job creation screen, check the checkbox for Analyze columns, Analyze data quality, and Use data sampling. Select Use a random sampling, provide the seed as
1234and the percentage as
10. Click Discover to start discovering the assets.
NOTE: While the Automated discovery job creation screen lets you publish results to catalog once the discovery and analysis processes have completed (using the Publish results to catalog option), keep the box unchecked for now. You will be publishing the results to the catalog as part of another tutorial in this series.
NOTE: While you can provide a different seed, using the same seed will result in the same records being selected in your sample as in case of this tutorial. As a result, you will be able to see exactly the same results once the data discovery process is complete.
- A new Automated discovery job is kicked off and your schema name should be listed under discovered assets. It should take about a few minutes to import and analyze the assets. The status will be reflected on the screen. Click the refresh icon or refresh your browser to get the updated status. Finally, the status of the import and analyze processes will show up as Finished. Click on the project name (HealthcareAnalysis) to go into your project.
Step 6. Review Automated discovery results
Take a look at the different tabs present on the screen, which give a lot of information about the project. The Data asset tab gives information about the data assets in the project. The Dashboard tab gives analytics about the most recent analysis performed on the assets. The Data rules tab gives information about the rules applied to the project and the ones available to be applied. The Relationships tab gives information about the relationships between the data assets. The Settings tab is where the project specific settings can be applied.
Find and click on the PATIENTS data asset to see more details about it.
The Columns tab shows information about the columns in the asset. You can see that the discovery process found 25 columns in the PATIENTS data asset. The table provides information regarding the analysis status and last analyzed time of the columns, as well as information identified and auto-assigned by the analysis process such as data classes, business terms, format, and data types. Scroll and view all the information present in the table against each record. The data quality analysis results, the data quality score, and the status of primary key analysis are also shown on this page. In the left bottom of the screen, you can see the data quality scores for each column in the PATIENTS table.
Look through the other tabs. The Governance tab gives information about suggested business terms for the data asset and provides the ability to assign them or to reject previously assigned terms. The Data quality tab provides information about the data quality score and lists the columns/values that do not satisfy the criteria as specified by the data quality dimensions. The Data classes tab gives a summary of the data classes that were found and the ones that were selected. Data types tab provides a summary of the types of data found in the data asset. The Rules tab gives information about rules added to the asset and the rule violations found on running the rules.
Finally, click on the Keys tab. Here you can see the results of the primary key analysis process. The status column shows that currently there are six columns that were identified as Candidate primary keys. By default, the filter is set to Non rejected keys, meaning only the Candidate and Selected keys are displayed in the list. You can use the filter to show the rejected keys. (currently there are none) Clicking Show foreign keys can show you all foreign key relationships that the data asset participates in (currently there are none, since relationship analysis has not been run on the data assets in the project). Run primary key analysis provides the ability to re-run primary key analysis on this asset.
Now, let us look at one of the columns within the PATIENTS data asset. Go back to the Columns tab and click BIRTHDATE.
Go through the different tabs to view information about the BIRTHDATE column in the PATIENTS data asset. The Column properties tab provides metadata about values present in the column. The Data quality tab lists the values that do not satisfy the criteria as specified by the data quality dimensions. The Rules tab gives information about rules added to the asset that are related to the column and the rule violations found on running the rules. The Data classes tab specifies the various data classes detected for the column, along with the confidence that Watson Knowledge Catalog has for each data class. It also mentions the data class that was selected for the column and allows you to change the selected data class. The Governance tab gives information about suggested business terms for the column and provides the ability to assign or reject previously assigned terms. The Data types tab specifies the various data types detected for the column, the confidence that Watson Knowledge Catalog has for each detected data type, and specifies the data type that was selected. The Formats tab gives information about the different formats found for the data in the column, along with the number of records that adhere to each of the formats. Finally, Frequency distribution, as the name suggests, gives the frequency distribution of the values in the column.
In this tutorial, you have learned to use data discovery on the IBM Cloud Pak for Data platform to discover data assets in your data sources. You ran both Quick scan and Automated discovery on your data assets. IBM Watson Knowledge Catalog identified the governance artifacts for the assets and performed automatic assignment of these artifacts to your data. It then calculated a quality score for your data assets.
You also observed the differences between Quick scan and Automated discovery. While Quick scan gave a quick view of the quality scores of the assets, Automated discovery — in addition to the scores — also displayed data quality dimension violations, which give more insight into the reasoning behind the quality scores. Both Quick scan and Automated discovery identified and assigned data classes and business terms to assets, but while Quick scan only allows updates to the assigned business terms, Automated discovery allows updates for both. In addition, Automated discovery also identifies and allows updates for data types and formats for each column. Finally, Automated discovery also provided the ability to identify and select primary keys for data assets, as well as foreign key relationships between the data assets.
This tutorial is part of the An introduction to the DataOps discipline series. To continue along, take a look at the next tutorial titled Analyze discovered data to gain insights on the quality of your data, where you will work with the Automated discovery results and learn how to update the auto-identified governance artifacts, how to look for relationships between assets, and learn how the quality score for your data is computed.