Smart data remediation and curation with the Federated DataOps API

As the world becomes more interconnected, the data that people and machines generate across distributed networks creates many challenges. According to Gartner, the world generates more than 2.5 quintillion bytes a day, or about 500 million 2-hour HD movies per day. Data at different geographic locations can have different quality and provenance. To extract insights from this massive amount of information, data scientists and data engineers spend an enormous amount of time cleansing, understanding, curating, and managing the data. Orchestrating, understanding, and managing data at the edge of the network is even more challenging because the devices and infrastructure in these distributed environments change continually as new devices join, the network topology changes, or the context changes.

To solve these challenges, enable data to be managed without manual intervention, and prepare data for AI pipelines, IBM researchers created two APIs: the Data Quality for AI API and the Federated DataOps API. Which API to use depends on the use case.

The Data Quality for AI API quantifies data quality and remediates data with explanations. The Federated DataOps API extends the Data Quality for AI API by solving data and label remediation for tabular text data sets in distributed environments. Each function provides a quality score and an explanation to guide remediation. These metrics quantify data issues as a score between 0 and 1, where 1 indicates that no issues were detected and any value less than 1 indicates that the data needs to be corrected. Currently, the metrics target text tabular data sets and accept input as a comma-separated values (CSV) file.
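As a rough illustration of how such a score could be read (the scoring formula is internal to the APIs, so the fraction-of-clean-values calculation below is only an assumption):

     # Illustrative only: interpret a 0-1 quality score as the fraction of values
     # that passed the checks (assumed formula, not the API's internal one)
     total_values = 1000
     flagged_values = 50    # for example, nulls or out-of-vocabulary entries

     score = (total_values - flagged_values) / total_values
     print(score)           # 0.95 -> less than 1, so remediation is recommended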

Version 1.0 of the Federated DataOps API is concerned with assessing data quality at a single distributed site. Future iterations of the library are planned to enable federated, distributed data quality assessments, such as schema reconciliation methods like column mapping and label standardization.

Consider a data scientist, Sam, who works at a multinational pharmaceutical company and is tasked with curating a large amount of data for drug discovery. He and his team are pressed to accelerate data preparation for the next stage of the AI pipeline. The data is sourced from several locations around the world and does not all conform to the same standards. For example, the column headings for a depression drug compound were inconsistent: both psilocybin and psilocin were used as headings. In some instances, he and his team found data sets with misaligned columns. Sam used the Federated DataOps API to normalize the terabytes of data and speed up the process.

In this tutorial, we provide a step-by-step guide to using the Distributed AI Federated DataOps API to solve these data challenges. The API provides data quality and label management methods for text and categorical columns in tabular data sets used for machine learning tasks such as data assessment and remediation.

Following is a complete Python notebook to help you get started using the API. This tutorial also covers a few basic steps, such as getting access to a trial subscription of the Distributed AI APIs on the IBM API Hub platform.

The tutorial uses a Python notebook, which you can run in your preferred IDE.

Prerequisites

To complete this tutorial, you need:

  • An IBM ID
  • Python 3.8
  • Python notebook IDE

Estimated time

It should take you approximately 30 minutes to complete this tutorial.

Steps

Step 1. Environment setup

To set up your environment:

  1. Navigate to the Distributed AI APIs documentation page, and click Get trial subscription.

    Trial subscription

  2. Log in on the registration page if you already have an IBM ID. Otherwise, create a new IBM ID for yourself.

  3. After you log in, the system entitles you with a trial subscription and takes you to the My IBM page. Locate the Trial for Distributed AI APIs tile, and click Launch.

  4. On the My APIs page, click the Distributed AI APIs tile. When the page opens, locate the Key management section, expand the row to see both the Client ID and Client secret, and click the visibility (eye) icon to reveal the actual values. Make a note of these values because they are the API keys you use throughout this tutorial.

    Key management

  5. Create a config.json file with the API key values that you received.

     {  
          "x-ibm-client-id":  "REPLACE_THIS_WITH_YOUR_CLIENT_ID",
          "x-ibm-client-secret": "REPLACE_WITH_YOUR_CLIENT_SECRET"
     }
    
  6. Install the requests Python package using pip.

     pip install requests
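
If you prefer to keep the keys out of your notebook code, you can load config.json and pass its contents as HTTP headers on each request. The following is a minimal sketch; it assumes the service expects the client ID and secret as x-ibm-client-id and x-ibm-client-secret headers, matching the keys in config.json.

     import json
     import requests

     # Load the API keys that were saved in config.json
     with open('config.json') as f:
         headers = json.load(f)   # {"x-ibm-client-id": "...", "x-ibm-client-secret": "..."}

     # Example (assumption): pass the keys as headers on each call
     # r = requests.post(url, files=data_file, headers=headers, verify=False)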
    

Step 2. Invoke Distributed AI Federated DataOps API

On the API documentation page, look for Distributed AI Federated DataOps API.

Federated DataOps API

Step 3. Python notebook example

The example Python notebook analyzes, remediates, and curates quality data that is ready for the AI pipeline. Step through the Python notebook code in your preferred Python notebook IDE.

The Federated DataOps API is a collection of library functions that are used to perform data quality on data sets used for machine learning and business intelligence tasks such as data assessment and remediation. The functions used in this notebook example are data and label management methods that are used for quality assessment and remediation for text and categorical columns in tabular data sets.

Methods for data and label management are:

  • Model Creation: Creates a fasttext model to learn the relationship between column values in the data set to validate or impute data
  • Data Validation: Determines if data input is out of vocabulary (that is, not present in the data set)
  • Data Imputation: Offers remediation to impute null column values
  • Data Noise: Identifies data inputted in the incorrect column

Import libraries for calling APIs and model creation

     import requests
     import pickle
     import json
     import os
     import pandas as pd   # pandas is used later to load and reshape the data sets

Data/label management requirements

The data/label management functions expect a pretrained fasttext model and a tabular data set (in .csv format) to use many of the library functions. To create a fasttext model, you can use the generate_nlp_model API provided by this library, but first you must provide part of the data set to be used as the corpus. This can either be a subset of the data carefully curated for quality samples or a subset generated as part of the test set.

In this example, we use the Drug Review data set.

The Drug Review data set, used for machine learning tasks such as classification, regression, and clustering, is already split into a training set and a test set. We use the test set as the corpus for the fasttext model, a library that learns text representations and text classifiers. Before loading the training and test sets, convert the files to CSV format.
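The raw Drug Review files are distributed as tab-separated (.tsv) files by the UCI Machine Learning Repository. Assuming that layout, one way to convert them is with pandas; the .tsv file names below are assumptions based on the .csv names used later in the notebook.

     import pandas as pd

     # Convert the tab-separated raw files into the CSV files used in this tutorial
     for name in ['drugLibTrain_raw', 'drugLibTest_raw']:
         pd.read_csv(f'{name}.tsv', sep='\t').to_csv(f'{name}.csv', encoding='utf-8', index=False)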

Preparing the corpus

  1. Load the test set (renamed to a .csv file) as a Pandas DataFrame.
  2. From reviewing the data set, notice that there are columns that might not be useful as part of the training corpus (that is, columns that are not categorical text, or free-form natural language values such as the review columns). Therefore, we remove them from the DataFrame.
  3. Save the DataFrame locally as a .csv (corpus.csv) file. The Federated DataOps API currently accepts a .csv file format.
  4. Call the generate_nlp_model API using the corpus.csv file that you generated in the previous step.

     # Load the test set and keep only the categorical text columns
     df_test = pd.read_csv('drugLibTest_raw.csv', encoding='utf8')
     df_test
     test = df_test[["urlDrugName","effectiveness","sideEffects","condition"]]
     # Save the corpus in the .csv format that the API expects
     test.to_csv('corpus.csv', encoding='utf-8', index=False)
    

Invoke REST service endpoint to create utilities for data evaluation

Retrieve the REST endpoint (IP and port) as reported when the REST server was started, and invoke the model generation service. The service accepts multipart/form-data requests with the following argument.

  • data_file: Path to data file to use for corpus. Type=file
url = 'URL/generate_nlp_model'
data_file = [('data_file', ('corpus.csv', open('corpus.csv', 'rb'), 'file/csv'))]
r = requests.post(url, files=data_file, verify=False)

Check response code

r
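Displaying r in a notebook cell shows the HTTP status (for example, <Response [200]>). If you prefer an explicit check that stops the notebook on a failed request, a small addition is:

     # Print the status code and raise an exception if the request failed
     print(r.status_code)
     r.raise_for_status()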

Download the compressed file returned from the API call, which contains the utilities for the data management methods:

  • The fasttext model (keyed vectors file), which is used to validate and impute data during the assessment
  • The cooccurrence matrix (pickle file), which is used to impute data based on the statistics of the test set
  • The data/column mapping (pickle file), which is used to validate that data is correctly remediated in the appropriate column

     # Stream the compressed utilities file returned by the API to disk
     with open('dataset_utils.zip', 'wb') as fd:
         for chunk in r.iter_content(chunk_size=128):
             fd.write(chunk)
    

Before proceeding, unzip the dataset_utils.zip file that was downloaded in the previous cell.
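If you prefer to extract the archive from the notebook itself, a minimal sketch using the Python standard library follows; adjust the target path if the archive already contains a dataset_utils folder.

     import zipfile

     # Extract the fasttext keyed vectors and pickle files so that they are
     # available under dataset_utils/ as referenced in the following cells
     with zipfile.ZipFile('dataset_utils.zip') as zf:
         zf.extractall()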

Data imputation API example

The data imputation API evaluates a data set and suggests values to impute for missing column values, using either a cooccurrence matrix (imputing values based on the statistical likelihood given by the column neighbors) or the n surrounding neighbors to predict the null value.

To test the data imputation API, you must first modify the training set for example purposes because the Drug Review data set does not contain missing values.

  1. Load the training set of the Drug Review data set.
  2. For example purposes, add null values to the condition column for rows where "xanax" is the drug in the urlDrugName column.
  3. Use these columns to build a subset CSV file for this example (the API expects a CSV file for the data file).

     # Load the training set and keep the drug name and condition columns
     df_train = pd.read_csv('drugLibTrain_raw.csv', encoding='utf8')
     df_train = df_train[["urlDrugName","condition"]]
     # Blank out the condition for xanax rows to simulate missing values
     df_train.loc[df_train['urlDrugName'] == "xanax", 'condition'] = ''
     df_train.to_csv('imputation_example.csv', encoding='utf-8', index=False)
    
Invoke REST service endpoint

Retrieve the REST endpoint (IP and port) as reported when the REST server was started, and invoke the data imputation service. The service accepts multipart/form-data requests with the following arguments.

  • data_file: Path to the data file to evaluate. Type=file
  • model: Path to saved keyed vectors of the fasttext model that is used to predict. Type=file
  • matrix: Path to the pickled file of the cooccurrence matrix that is used to determine the likelihood of the value to impute. Type=file
  • dictionary: Path to the pickled file of the data/column map to ensure that the imputed value is appropriate. Type=file
url = 'URL/data_imputation'

multiple_files = [
    ('data_file', ('imputation_example.csv', open('imputation_example.csv', 'rb'), 'file/csv')),
    ('model', ('dataset_utils/fasttext.kv', open('dataset_utils/fasttext.kv', 'rb'), 'file/kv')),
    ('matrix', ('dataset_utils/fasttext_matrix.pkl', open('dataset_utils/fasttext_matrix.pkl', 'rb'), 'file/pickle')),
    ('dictionary', ('dataset_utils/label_col_map.pkl', open('dataset_utils/label_col_map.pkl', 'rb'), 'file/pickle'))]

r = requests.post(url, files=multiple_files, verify=False)
Check response code

r
Retrieve the results of the data quality score and remediation recommendations

In this example, the score returned from the server compares the number of null values in the data set to the number of complete values. Remediation shows the row number and column with the missing value, followed by the percentages of the data values that should be used for imputation.

r.text
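The raw response text can also be parsed programmatically. The sketch below assumes that the body is JSON and that it exposes a score and a remediation section; the exact field names are assumptions for illustration only.

     # Hypothetical sketch: parse the response body (field names are assumptions)
     result = json.loads(r.text)
     print(result.get('score'))
     print(result.get('remediation'))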

Data validation example

The data validation API evaluates a data set and determines if there is data that is out of vocabulary (OOV), that is, data that was not used to train the fasttext model. This could be because there were not enough samples for the value to be represented or because the data is misspelled from its original representation. This service identifies those OOV values and uses the fasttext model's keyed vectors to find the most similar in-vocabulary data value to replace each OOV value with.

To test the data validation API, use a subset of the training set of the Drug Review data set, specifically the categorical text columns, because those values are the most likely to be invalid.

  1. Load the training set of the Drug Review data set.
  2. Use the effectiveness, sideEffects, and condition categorical columns to build a subset .csv file for this example (the API expects a CSV file for the data file).

     df_train = pd.read_csv('drugLibTrain_raw.csv', encoding='utf8')
     df_train = df_train[["effectiveness","sideEffects","condition"]]
     df_train.to_csv('data_validation_example.csv', encoding='utf-8', index=False)
    
Invoke REST service endpoint

Retrieve the REST endpoint (IP and port) as reported when the REST server was started, and invoke the data validation service. The service accepts multipart/form-data requests with the following arguments.

  • data_file: Path to the data file to evaluate. Type=file
  • model: Path to the saved keyed vectors of the fasttext model used to predict. Type=file

      url = 'URL/data_validation'
    
      multiple_files = [
          ('data_file', ('data_validation_example.csv', open('data_validation_example.csv', 'rb'), 'file/csv')),
          ('model', ('dataset_utils/fasttext.kv', open('dataset_utils/fasttext.kv', 'rb'), 'file/kv'))]
    
      r = requests.post(url, files=multiple_files, verify=False)
    
Check response code

r
Retrieve results of data quality score and remediation recommendations

In this example, the score returned from the server compares the number of OOV values in the data set to the number of in-vocabulary values. Remediation shows the row number and column with the out-of-vocabulary value, followed by the most similar in-vocabulary substitutions.

r.text

Data noise example

The data noise API evaluates a data set and determines if there is data that is in the wrong column, likely due to a manual input error. To test the data noise API, you must first modify the training set for example purposes because the Drug Review data set does not contain noisy column values.

  1. Load the training set of the Drug Review data set.
  2. For example purposes, change the values in the condition column where "adhd" is the condition, substituting "Mild Side Effects".
  3. Build a subset .csv file for this example (the API expects a CSV file for the data file).

     df_train = pd.read_csv('drugLibTrain_raw.csv', encoding='utf8')
     df_train = df_train[["urlDrugName","effectiveness","sideEffects","condition"]]
     df_train.loc[df_train['condition'] == "adhd", 'condition'] = 'Mild Side Effects'
     df_train.to_csv('data_noise_example.csv', encoding='utf-8', index=False)
    
Invoke REST service endpoint

Retrieve the REST endpoint (IP and port) as reported when the REST server was started, and invoke the data noise service. The service accepts multipart/form-data requests with the following arguments.

  • data_file: Path to the data file to evaluate. Type=file
  • model: Path to the saved keyed vectors of the fasttext model used to test OOV. Type=file
  • dictionary: Path to the pickled file of the data/column map to ensure that the imputed value is appropriate. Type=file
url = 'URL/data_noise'

multiple_files = [
    ('data_file', ('data_noise_example.csv', open('data_noise_example.csv', 'rb'), 'file/csv')),
    ('model', ('dataset_utils/fasttext.kv', open('dataset_utils/fasttext.kv', 'rb'), 'file/kv')),
    ('dictionary', ('dataset_utils/label_col_map.pkl', open('dataset_utils/label_col_map.pkl', 'rb'), 'file/pickle'))]

r = requests.post(url, files=multiple_files, verify=False)
Check response code

r
Retrieve results of data quality score and remediation recommendations

In this example, the score returned from the server is the number of noisy data values in the data set compared to the number of correct data values. Remediation shows the row number and column with the incorrect value, followed by its true column value.

r.text

Note: The API documentation page also has a Try this API feature, which is a REST client UI. You can use it to invoke the APIs while reading the documentation in the same context.

Summary

This tutorial explained how to obtain API keys and easily invoke the Distributed AI APIs hosted in IBM Cloud. The APIs in the suite let you invoke different algorithms to meet your application needs. If you have any questions or queries after the trial subscription, email us at resai@us.ibm.com.