Detect and mitigate model bias using Watson OpenScale

Discover how IBM Watson OpenScale identifies and tackles bias in Amazon SageMaker models

By

Sunil Gajula

This article demonstrates how IBM Watson OpenScale can be used to monitor models created using Amazon SageMaker to identify and reduce bias and drift. It also shows how AWS Services and IBM Cloud Pak for Data can be effortlessly integrated to create an effective solution.

Artificial intelligence (AI) and machine learning (ML) are now used across industries to address business challenges. For instance, the financial sector may employ AI and ML to address issues with customer segmentation, fraud detection, or loan defaults.

Data that resides in AWS services or in customer data centers can be stored in an Amazon Simple Storage Service (S3) bucket for model training and deployment. Watson OpenScale connects to the AI models and deployments in an Amazon SageMaker service instance to detect and mitigate bias and drift, which can also improve the quality and accuracy of predictions.

To accomplish this, it's important to keep the following in mind:

  • Data can reside on-premises (in a customer data center) or on AWS cloud.
  • IBM Cloud Pak for Data can connect with various data sources using a platform connection.
  • Through the use of data virtualization, many data sources from various locations are connected and combined into a single virtual data view.
  • The IBM Watson Knowledge Catalog (WKC) ensures that data access and data quality adhere to corporate policies and guidelines.
  • Curated data can be pushed to Amazon S3 from various data sources using IBM DataStage or Data Refinery jobs through extract, transform, and load (ETL).
  • The curated data that’s available in S3 can be used to build and train custom models in Amazon SageMaker.
  • IBM Watson OpenScale monitors and evaluates AI model results to make sure they are fair, understandable, and compliant.
  • IBM Cognos Analytics integrates reporting, modeling, and dashboards to help you understand organizational data and make sound business decisions.

Here is the reference architecture for this process:

Reference architecture

Overview of IBM Cloud Pak for Data

Most businesses hold large volumes of data that they can use to produce insights that help them solve problems and achieve their organizational objectives.

IBM Cloud Pak for Data enables users to connect to data, manage it, locate it, and analyze it. All data users can access the data from a single, unified interface that supports numerous services that work together.

Learn more about IBM Cloud Pak for Data.

Platform connections

With IBM Cloud Pak for Data, users can connect to many different data sources so that data can be accessed quickly and easily. Data can reside either in a corporate data center or on any cloud.

The platform connections page provides access to these platform-level connections.

Here are the supported AWS services:

  • Amazon RDS for MySQL
  • Amazon RDS for Oracle
  • Amazon RDS for PostgreSQL
  • Amazon Redshift
  • Amazon S3

IBM Cloud Pak for Data supports nearly 80 different data sources. Check out the full list of data sources that can be connected to from Cloud Pak for Data.

IBM Watson Knowledge Catalog

Enterprise governance is a set of processes and practices that enable you to align with strategic objectives, assess and manage risk, and ensure that your company's resources are used responsibly. IBM Cloud Pak for Data includes several services and features that can help you govern your enterprise more effectively.

Watson Knowledge Catalog provides a secure enterprise catalog management platform that is supported by a data governance framework. With the Watson Knowledge Catalog service, you can create catalogs of curated assets that are supported by a governance framework.

The data governance framework ensures that data access and data quality adhere to the established standards and guidelines for the organization. By combining user roles, permissions, and collaborator roles that control which actions users can perform, Watson Knowledge Catalog offers fine-grain control over which users can complete which tasks.

Watson Knowledge Catalog includes data lineage, which tracks data movement by recording the data's origin, transformations, and destination. This gives businesses tools for tracing the history of their data and the technical details of how the data is used.

Watson Knowledge Catalog also includes model inventories and AI Factsheets, which track the lifecycles of machine learning models from training to production as part of an AI governance approach. This indicates which models are in use and which still require development or validation.

IBM DataStage

IBM DataStage services can be used to create and execute data flows that move and transform data. DataStage allows you to connect to a variety of data sources, integrate and transform data, and transport it to your target system in batches or in real time, so you can compose data flows quickly and accurately.

DataStage is a data integration tool that moves and transforms data between operational, transactional, and analytical target systems. Data integration specialists use DataStage to create flows that process and transform data. It includes hundreds of pre-built transformation functions, parallel processing capabilities, and platform connectivity to connect directly to enterprise applications, cloud data sources, relational and NoSQL systems, REST endpoints, and more. These flows can be deployed, managed, administered, and reused to integrate data across numerous systems throughout the enterprise.

DataStage flow
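
As a rough illustration of the extract, transform, and load pattern that such a flow follows, the Python sketch below cleans a small CSV extract and, optionally, writes the result to Amazon S3. It is not DataStage code: the column names, bucket name, and object key are hypothetical, and the upload step assumes that the boto3 SDK is installed and that AWS credentials are configured.

```python
import csv
import io


def curate_rows(raw_csv: str) -> str:
    """Read CSV text, drop incomplete or malformed rows, trim whitespace, return CSV text."""
    reader = csv.DictReader(io.StringIO(raw_csv))
    if reader.fieldnames is None:
        return ""  # empty input: nothing to curate
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        # Rows with extra columns are stored under the key None; skip them.
        # Rows with missing columns get None values; skip those and blanks too.
        if None in row or any(v is None or v.strip() == "" for v in row.values()):
            continue
        writer.writerow({key: value.strip() for key, value in row.items()})
    return out.getvalue()


def upload_to_s3(body: str, bucket: str, key: str) -> None:
    """Load step: write the curated CSV to an S3 object (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the transform runs without the AWS SDK

    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))


# Hypothetical sample extract with one incomplete record.
raw = "customer_id,income,loan_status\n1, 52000 ,paid\n2,,default\n3,61000,paid\n"
curated = curate_rows(raw)
print(curated)

# To load the result into a hypothetical bucket and key:
# upload_to_s3(curated, "my-curated-data", "loans/curated.csv")
```

Keeping the transform separate from the upload means the cleaning logic can be checked locally before anything is written to S3.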

IBM Data Virtualization

Data virtualization connects many data sources across different locations and unifies all of this data into a single virtual data view. This read-only data view makes it easier to get value out of your data.

After you create connections to your data sources, you can quickly view all of your organization's data. This virtual data view enables real-time analytics without moving data, duplication, ETLs, or additional storage requirements, so processing times are greatly accelerated.

Centralized authentication and authorization are enforced so that platform users access data sources in a trusted environment. All communication between the environment and the application is encrypted using IBM technology and standard SSL/TLS protocols.

IBM Queryplex is a unique new data virtualization technology that enables real-time distributed analytics that access numerous sources without the need to move or copy data.

There are three key benefits to using Queryplex as a data virtualization solution:

  1. With Queryplex, you can access data from anywhere. Through a small software agent, each data repository joins the Queryplex constellation. These sources may be broadly distributed among all data centers or confined to only one.

  2. The constellation can include a wide variety of data repositories. Without having to connect to or write code for each individual database, data analytics can operate against a variety of sources at once.

  3. The close network ties between data repositories within the Queryplex constellation enable those sources to work collaboratively to answer the analytical queries in the applications. Instead of sending information back to a coordinating service for processing, nearby nodes collaborate to carry out aggregation, joins, and other actions.
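
The third point can be illustrated with a toy example of partial aggregation. In this sketch, each node reduces its own rows to a small (sum, count) pair, and the pairs are then merged to produce a global average, so raw rows never travel to a central coordinator. This is a simplification for illustration only: it is not how Queryplex is implemented, and all function names are hypothetical.

```python
from functools import reduce
from typing import List, Tuple

Partial = Tuple[float, int]  # (sum of values, number of values)


def local_partial(values: List[float]) -> Partial:
    """Work done on a single node: summarize local rows without shipping them."""
    return (sum(values), len(values))


def merge(a: Partial, b: Partial) -> Partial:
    """Combine two partial results; in a mesh, neighboring nodes could do this step."""
    return (a[0] + b[0], a[1] + b[1])


def mesh_average(nodes: List[List[float]]) -> float:
    """Compute a global average from per-node partial results."""
    partials = [local_partial(values) for values in nodes]
    total, count = reduce(merge, partials, (0.0, 0))
    if count == 0:
        raise ValueError("no rows to aggregate")
    return total / count


# Three hypothetical data repositories holding different numbers of rows.
nodes = [[10.0, 20.0], [30.0], [40.0, 50.0, 60.0]]
print(mesh_average(nodes))
```

Only one small pair per node moves across the network, regardless of how many rows each repository holds.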

Federation and edge computing vs. Queryplex’s computational mesh

Amazon SageMaker

Amazon SageMaker is a fully managed machine learning service. With SageMaker, data scientists and developers can quickly and easily build and train machine learning models, and then directly deploy them into a production-ready hosted environment. SageMaker includes an integrated Jupyter authoring notebook instance for easy access to the data sources for exploration and analysis.

For models deployed in Amazon SageMaker, you can log payload and feedback data in IBM Watson OpenScale, measure performance accuracy, and use the bias detection, explainability, and auto-debias functions.

Find out more about Amazon SageMaker.

Integrating Amazon SageMaker with Watson OpenScale

Machine learning models are deployed in high-stakes scenarios across numerous industries, so it is crucial to ensure a model's performance after deployment. Monitoring models while they are in use is essential for ensuring their continued dependability and performance.

You can use one of the following approaches to set up Watson OpenScale to work with Amazon SageMaker.

Specify an Amazon SageMaker ML service instance

You can use the configuration interface to add a machine learning provider to Watson OpenScale. The Amazon SageMaker service instance should be configured using the Watson OpenScale tool. The AI models and deployments are kept in an Amazon SageMaker service instance.

Connect your Amazon SageMaker service instance

Watson OpenScale connects to the AI models and deployments in your Amazon SageMaker service instance. To link the service to Watson OpenScale, go to the Configure tab, add a machine learning provider, and click the Edit icon. Add a name and description, and specify whether the environment is pre-production or production. Then provide the following information, which is specific to this type of service instance:

  • Access key ID: The AWS access key ID (aws_access_key_id) verifies who you are and authenticates and authorizes calls you make to AWS.
  • Secret access key: Your AWS secret access key (aws_secret_access_key) is required to verify who you are and authenticate and authorize the calls you make to AWS.
  • Region: This is the region where your access key ID was created. Keys are stored and used in the region where they were created and cannot be transferred to another region.
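
If you script this setup rather than typing values into the interface, the three values above can be collected from environment variables so that secrets stay out of source code. The sketch below assumes the standard variable names read by the AWS CLI and SDKs (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION). The function name and the keys of the returned dictionary are hypothetical and are not a Watson OpenScale API.

```python
import os
from typing import Dict, Mapping

# Maps an illustrative field name to the environment variable that holds it.
REQUIRED_VARS = {
    "access_key_id": "AWS_ACCESS_KEY_ID",
    "secret_access_key": "AWS_SECRET_ACCESS_KEY",
    "region": "AWS_DEFAULT_REGION",
}


def load_sagemaker_credentials(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Return the three connection values, or fail loudly if any is missing."""
    missing = [var for var in REQUIRED_VARS.values() if not env.get(var)]
    if missing:
        raise ValueError("missing environment variables: " + ", ".join(missing))
    return {field: env[var] for field, var in REQUIRED_VARS.items()}


# Example with placeholder values instead of the real process environment.
example_env = {
    "AWS_ACCESS_KEY_ID": "AKIAEXAMPLE",
    "AWS_SECRET_ACCESS_KEY": "example-secret",
    "AWS_DEFAULT_REGION": "us-east-1",
}
print(sorted(load_sagemaker_credentials(example_env)))
```

Failing early on a missing value gives a clearer error than a rejected connection later in the configuration.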

Next, select the deployed models and configure your monitors. On the Insights dashboard, you should see a list of deployed models. Click Add to dashboard, select the deployments that you want to monitor, and click Configure.

Payload logging with the Amazon SageMaker machine learning engine

Add your Amazon SageMaker machine learning engine

A machine learning engine other than IBM Watson Machine Learning is bound as “custom” using metadata, because direct integration with a non-IBM Watson Machine Learning service is not possible. For more information, see Add your Amazon SageMaker machine learning engine.
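
To give a sense of what a logged payload might contain, the sketch below pairs a scoring request with the model's response. It assumes that the SageMaker endpoint accepts CSV input and that the record uses a fields/values layout for both the request and the response. The function name, feature names, and the "prediction" field are hypothetical, so check the payload logging documentation for the exact format your deployment needs.

```python
import csv
import io
from typing import Any, Dict, List


def build_payload_record(
    feature_names: List[str],
    csv_body: str,
    predictions: List[Any],
    response_time_ms: int,
) -> Dict[str, Any]:
    """Pair a CSV scoring request with its predictions in a fields/values layout."""
    # Parse each non-empty CSV line into a list of numeric feature values.
    rows = [[float(cell) for cell in row] for row in csv.reader(io.StringIO(csv_body)) if row]
    if any(len(row) != len(feature_names) for row in rows):
        raise ValueError("every row must have one value per feature name")
    if len(rows) != len(predictions):
        raise ValueError("expected one prediction per scored row")
    return {
        "request": {"fields": list(feature_names), "values": rows},
        "response": {"fields": ["prediction"], "values": [[p] for p in predictions]},
        "response_time": response_time_ms,
    }


# Hypothetical request of two rows and the predictions returned for them.
record = build_payload_record(["age", "income"], "35,52000\n41,61000\n", [0, 1], 120)
print(record["request"]["values"])
print(record["response"]["values"])
```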

Check out this video for an in-depth look at AI governance with IBM OpenScale and OpenPages with Amazon SageMaker:


IBM Watson OpenScale

With IBM Watson OpenScale, you can monitor model quality and log payloads, regardless of where the model is hosted. This article uses an Amazon SageMaker model to demonstrate the independent and open nature of Watson OpenScale. Watson OpenScale is an open environment that enables organizations to automate and operationalize their AI. It provides a platform for managing AI and machine learning models on the cloud or anywhere else that they might be deployed.

OpenScale offers the following benefits:

  • Open by design: Watson OpenScale enables the monitoring and management of deep learning and machine learning models that are deployed on any model-hosting engine, and created using any available framework or IDE.

  • Drive fairer outcomes: Watson OpenScale detects model biases and helps mitigate them to address fairness issues. The platform offers plain-text explanations of the data ranges that have been impacted by model bias, along with visualizations that make the impact on business results clear to data scientists and business users. As biases are identified, Watson OpenScale automatically builds a de-biased companion model that runs alongside the deployed model, giving users a preview of the anticipated fairer results without replacing the original.

  • Explain transactions: Watson OpenScale helps enterprises bring transparency and auditability to AI-infused applications by generating explanations for the individual transactions being scored, including the attributes that were used to make the prediction and the weight of each attribute.

  • Automate the creation of AI: Neural Network Synthesis (NeuNetS) creates neural networks by essentially architecting a unique design specifically for a given data set. NeuNetS supports text and image classification models.
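
To make the fairness idea in the list above concrete, here is a minimal sketch of the disparate impact ratio, a common fairness metric that compares the rate of favorable outcomes for a monitored group with the rate for a reference group. The group names, outcome labels, and sample records are hypothetical, and Watson OpenScale's own fairness computation may differ in its details.

```python
from typing import Iterable, List, Tuple

Record = Tuple[str, str]  # (group, outcome)


def favorable_rate(records: Iterable[Record], group: str, favorable: str) -> float:
    """Fraction of a group's records that received the favorable outcome."""
    outcomes = [outcome for g, outcome in records if g == group]
    if not outcomes:
        raise ValueError(f"no records for group {group!r}")
    return sum(1 for outcome in outcomes if outcome == favorable) / len(outcomes)


def disparate_impact(
    records: List[Record], monitored: str, reference: str, favorable: str
) -> float:
    """Ratio of favorable rates: monitored group divided by reference group."""
    reference_rate = favorable_rate(records, reference, favorable)
    if reference_rate == 0:
        raise ValueError("reference group has no favorable outcomes")
    return favorable_rate(records, monitored, favorable) / reference_rate


# Hypothetical loan decisions: 2 of 4 approved versus 4 of 5 approved.
records = (
    [("female", "approved")] * 2
    + [("female", "denied")] * 2
    + [("male", "approved")] * 4
    + [("male", "denied")] * 1
)
ratio = disparate_impact(records, "female", "male", "approved")
print(f"disparate impact: {ratio:.3f}")
# A commonly cited rule of thumb (the four-fifths rule) flags ratios below 0.8.
print("possible bias" if ratio < 0.8 else "within threshold")
```

A ratio near 1 means both groups receive favorable outcomes at similar rates. The further the ratio falls below 1, the less often the monitored group receives the favorable outcome.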

IBM Watson OpenScale flow

IBM Cognos Analytics

IBM Cognos Analytics provides self-service analytics that are infused with AI and machine learning, which enables you to produce attractive visualizations and communicate the results through dashboards and reports.

Using the Cognos Analytics service makes it easier for you to interpret data with features such as:

  • Automated data preparation
  • Automated modeling
  • Automated creation of visualizations and dashboards
  • Data exploration

Cognos Analytics is a fast, flexible, and complete business intelligence and analytics solution that enterprises can use to improve decision quality and accelerate decision making.

IBM Cognos Analytics with Watson goes beyond traditional business intelligence (BI) by using AI to forecast future events, predict outcomes, and explain why they might occur.

This figure gives a typical view of the Cognos Analytics dashboard:

Cognos Analytics dashboard

Summary

In this article, you learned how to use Watson OpenScale to monitor models that have been created using Amazon SageMaker, and how easy it is to integrate AWS Services with IBM Cloud Pak for Data.

Cloud Pak for Data is a cloud-native solution that helps streamline the data ecosystem: through a single, unified interface, you can connect to data, manage it, discover it, and analyze it.

IBM Watson OpenScale enables businesses to move AI initiatives from development into production. This helps them ensure fair outcomes, comply with relevant laws, and boost confidence in AI by offering complete explainability and monitoring.

Learn more about IBM Cloud Pak for Data.

Visit the IBM Developer AWS hub page for more on how IBM solutions integrate with AWS.