
Train a classification algorithm with past decisions in a business process workflow

Artificial intelligence (AI) can be combined with business process management in many ways. For example, AI can transform unstructured data into data that a process can work with through techniques such as image recognition or natural language processing. Assistants and bots can provide a better experience to business users, and several IBM Watson services can help achieve those goals. A lot of business data also flows through business processes, and this data can be used with AI. This article demonstrates how to take advantage of this data and apply machine learning techniques to optimize process execution. If, for every decision that needs to be made as part of a business process, you can get a recommendation based on the decisions that were made in the past in similar situations, then your processes are greatly enhanced.

The recommendation service scenario

Imagine an insurance company that has set up a workflow process to approve or reject insurance claims. Some of those insurance claims are simple because, typically, the amount is small and the customer’s history and claim circumstances are straightforward. Such claims can be approved automatically or at least, follow a fast approval path. Some claims are more complex, and their approval path includes more steps. Assume that the approval decision or the decision on which path to follow is a human task and this task is defined as part of a workflow process. Then, it becomes interesting to consider whether a machine learning algorithm can help figure out which decision to take, based on past decisions.

This scenario can be adapted to any human decision process. In the insurance claim example, the decision consists of approving or rejecting a claim, which amounts to a yes-or-no decision. Such decisions translate into a binary classification machine learning problem. However, if the decision consists of dispatching a process to one of several subprocesses, the scenario becomes a multiclass classification problem.

This article uses IBM Business Automation Workflow to build the claim approval process, IBM Business Automation Insights to store the business process data, and IBM Watson for artificial intelligence. In particular, it uses Watson Studio for building the machine learning model and Watson Machine Learning for deploying that model.

Overview of the solution

First, a diagram outlines how all of the different elements and cloud services work together to build the recommendation service.


Everything starts with the business process itself, which runs in IBM Business Automation Workflow. As the process is running, the business data of the process, which in this scenario contains information about the insured person and the claim, is captured by the Business Automation Insights service, which stores all of the process operational data and in particular the claim data in an HDFS data lake. The role of this Business Automation Insights service is really to capture and store this data so that the processes can be monitored and as the name indicates, provide you with insights on the process. Business Automation Insights can render various dashboards, for example, to monitor the process efficiency. In this insurance claim scenario, you are more interested in the data that is associated with activities and processes (the claim information) rather than in the operational data.

After the data is captured in HDFS, it can be used to train a machine learning model. After the model is trained with existing claim data and approval decisions, it should be able to provide recommendations on whether to approve or reject new claims.

The trained model needs to be deployed, which is the role of the IBM Watson Machine Learning service. This service stores the machine learning model and provides an endpoint through which the model can be run to return a classification. Finally, the Business Automation Workflow process invokes the AI model through this endpoint and transforms the result into a recommendation within the process user interface.

In this article, you will:

  • Learn how to load time series data, in IBM Business Automation Insights, from a specific tracking point in the Business Automation Workflow process
  • Explore the format of the data and read it
  • Create an Apache Spark machine learning pipeline, which will be the recommendation model
  • Train and evaluate the model
  • Persist a pipeline and model in the Watson Machine Learning repository
  • Deploy a model for online scoring using a Watson Machine Learning API
  • Score sample scoring data by using the Watson Machine Learning API
  • Invoke the model to create a recommendation service in a Business Automation Workflow coach

Setting up the solution

To illustrate how to combine all of the technologies, this article comes with a business process definition that you can download.

The overall code is also available as a Jupyter Notebook that you can download.

To run the solution that is presented in this article, make sure that the following elements are installed:

  • IBM Business Automation Workflow.
  • IBM Business Automation Insights. Business Automation Insights must be installed and connected to an HDFS data lake.
  • IBM Watson Machine Learning service on IBM Cloud. You can use a free tier.

After you’ve installed the various elements, ensure that you have:

  • Credentials for your Business Automation Workflow instance
  • Credentials for the HDFS used by Business Automation Insights
  • Watson Machine Learning credentials

Note that the Notebook requires a Python 3.5 and Spark 2.1 kernel to run.

Tracking data in Business Automation Insights

Download and import the process definition from the Business Automation Workflow Center.

[Image: tracking data]

Then, open the Claim Approval Sample process application in Process Designer. As you explore the process application, you see one Claim Approval process, which has been defined as a single user task.

[Image: claim approval]

For this process, four classes of business data have been created: claim, customer, vehicle, and recommendation. The claim business data represents the data of the insurance claim. It references a customer and a vehicle. The recommendation object contains information from the AI recommendation that’s being built. This object will be examined later.

Note that this example is not intended to reflect a real claim approval system, which is notably more complex. The claim contains information on the vehicle such as the make, type, model, and year. It also contains customer information, in particular, the creditScore property, which represents the customer’s insurance score, as well as information about the claim itself such as the estimated amount, the assessment that was made, and the assessor. The example uses only some of this information.

Because this is not a real process, I initialize the claim object with some random data.

[Image: random data]

The main task in this process is to approve or reject an insurance claim and to decide (based on the claim data) whether to set the approved attribute of the claim to true or to false.

After the approval decision is made, that is, when the approval task is finished, the decision is stored in Business Automation Insights so that it can be fed to the machine learning model. For this purpose, a tracking point is introduced after the approval task. A tracking point is a point in a process at which status and data are sent to Business Automation Insights. Each tracking point can store the appropriate data. This example stores the claim data that the machine learning algorithm is to learn from. The value of the approved property of the claim, which holds the decision, is stored too.

[Image: tracking points]

Each tracking point stores the information that was specified when its tracking group was created. The tracking group is a model of the data that needs to be stored in Business Automation Insights. The tracking point definition specifies the tracking group and the mapping from the claim data to the tracking group data.

Also, note the name of the tracking group, IBMBPMRSTraining_Claims, which is necessary to find the data within HDFS in the next step.
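
To make the idea of a mapping from claim data to tracking group data more concrete, here is a minimal plain Python sketch. It is only an illustration: the nested claim layout and the `to_tracked_fields` helper are assumptions made for this example, and only the flat field names match the tracked fields that appear later in the article. In the real application, the mapping is configured in Process Designer.

```python
# Hypothetical nested claim object; the property names are assumptions made
# for this sketch, not the sample application's actual business objects.
claim = {
    "approved": True,
    "approvedAmount": 1905,
    "estimateAmount": 1905,
    "customer": {"creditScore": 646},
    "vehicle": {"make": "VW", "model": "Golf", "type": "car", "year": 2015},
}

# Each tracked field name is mapped to the path of its value inside the claim.
TRACKING_GROUP_MAPPING = {
    "approved": ("approved",),
    "approvedAmount": ("approvedAmount",),
    "creditScore": ("customer", "creditScore"),
    "estimateAmount": ("estimateAmount",),
    "vehicleMake": ("vehicle", "make"),
    "vehicleModel": ("vehicle", "model"),
    "vehicleType": ("vehicle", "type"),
    "vehicleYear": ("vehicle", "year"),
}


def to_tracked_fields(claim, mapping=TRACKING_GROUP_MAPPING):
    """Flatten a nested claim into the flat record stored at the tracking point."""
    tracked = {}
    for field, path in mapping.items():
        value = claim
        for key in path:
            value = value[key]
        tracked[field] = value
    return tracked


tracked = to_tracked_fields(claim)
print(tracked)
```

The point of the sketch is that the tracking group is flat: nested customer and vehicle properties end up as top-level fields of the stored record.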

Creating some data to train the system

At this point, it’s necessary to create some data to train the system. You can continue the exercise even with little data, but you must run the process from the Process Portal 10 to 20 times.

As you run the process, you can see the coach making some recommendations for you. Because no recommendation service has been created yet, those recommendations are fake. However, you should still follow those recommendations when you create the initial data because by doing so, you create a set of initial data for which the machine learning model will be easy to create.

If you want to be able to train a model without running the process, the alternative is to download the training file and place this file in your HDFS system. You then must make sure that you update the part of the code that reads the Business Automation Insights data so that it points to this file instead of the Business Automation Insights data path.

The format of the Business Automation Insights data

After the process has run several times, events are stored in Business Automation Insights. Business Automation Insights stores many different types of events, but in this scenario I am interested in the events that are registered when the tracking point is reached by the process. Every time a process goes through the tracking point, a record is added to HDFS in the form of JSON data.

Within HDFS, the tracked data is partitioned by the following elements:

  • The identifier and version number of the Business Automation Workflow business process application
  • The tracking group identifier

Therefore, HDFS file names start with the following path:

[hdfs root]/ibm-bai/bpmn-timeseries/[processAppId]/[processAppVersionId]/tracking/[trackingGroupId]

Remember, the tracking group name is IBMBPMRSTraining_Claims.
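
The path layout above can be assembled with a small helper. The following is a sketch only: the `tracking_data_path` function, its parameter names, and the example host are hypothetical, and the default `*/*` suffix assumes that you want to match every directory and file below the tracking group directory.

```python
def tracking_data_path(hdfs_root, process_app_id, process_app_version_id,
                       tracking_group_id, suffix="*/*"):
    """Build the HDFS path of the events recorded for one tracking group."""
    parts = [
        hdfs_root.rstrip("/"),
        "ibm-bai",
        "bpmn-timeseries",
        process_app_id,
        process_app_version_id,
        "tracking",
        tracking_group_id,
    ]
    if suffix:
        # Wildcards match any directory or file name below the tracking group.
        parts.append(suffix)
    return "/".join(parts)


# Placeholder values, for illustration only.
example_path = tracking_data_path(
    "hdfs://namenode.example.com:8020", "app-id", "version-id", "group-id"
)
print(example_path)
```

Keeping the path construction in one function makes it easier to point the rest of the code at a different location, such as a downloaded training file.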

How to find an application ID and version and the tracking group ID

In this example, when the process is imported into the Business Automation Workflow instance, the process application IDs, versions, and the tracking group ID do not change. Therefore, to run the example, predefined IDs could be used. However, in a real scenario you want to be able to retrieve all of the IDs by using the Business Automation Workflow REST API.

This API is documented in detail in the IBM Knowledge Center.

To retrieve the process application ID and version number, use the processApps REST API that provides information on process applications. The following code searches for the Claim Approval Sample application and assumes that only one version or snapshot is installed. Make sure that you change the host and credentials to your Business Automation Workflow credentials.

import json
import requests
import urllib3

# Replace the placeholders with your Business Automation Workflow host and credentials.
bpmrestapiurl = 'https://<bpmhost>:<port>/rest/bpm/wle/v1'
bpmusername = '<username>'
bpmpassword = '<password>'

headers = urllib3.util.make_headers(basic_auth='{username}:{password}'.format(username=bpmusername, password=bpmpassword))

url = bpmrestapiurl + '/processApps'
# verify=False disables TLS certificate verification; use it only with test systems.
response = requests.get(url, headers=headers, verify=False)

# The unpacking fails unless exactly one application has the expected name.
[processApp] = [x for x in json.loads(response.text).get('data').get('processAppsList') if x.get('name') == 'Claim Approval Sample']
processAppId = processApp.get('ID')

# Note that the first 5 characters of the process app id below are removed
# because the REST API returns the process application id with a 5-character prefix that is '2066.'.
# This prefix marks the identifier as a process application id but you won't need this prefix later.

print("the process application id: " + processAppId[5:])
snapshot = processApp.get('installedSnapshots')[0]
processAppVersionId = snapshot.get('ID')
print("the process application version id: " + processAppVersionId)
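
The list-unpacking idiom used above fails with a terse error when zero or several applications match the name. As a hedged alternative, the following sketch shows a small helper that reports what went wrong. The `select_by_name` function is hypothetical, and the sample response is an abridged stand-in that contains only the keys used by the code above.

```python
def select_by_name(items, name):
    """Return the single item whose 'name' equals name, or raise a clear error."""
    matches = [item for item in items if item.get("name") == name]
    if len(matches) != 1:
        raise LookupError(
            "expected exactly one item named %r, found %d" % (name, len(matches))
        )
    return matches[0]


# Hypothetical, abridged response body used only to exercise the helper.
sample_response = {
    "data": {
        "processAppsList": [
            {
                "name": "Claim Approval Sample",
                "ID": "2066.638d314f-12db-43c3-9051-89f3ce992393",
                "installedSnapshots": [
                    {"ID": "2064.4310cecf-969e-48ce-9ac3-00e73de5dfb9"}
                ],
            },
            {
                "name": "Another Application",
                "ID": "2066.00000000-0000-0000-0000-000000000000",
                "installedSnapshots": [],
            },
        ]
    }
}

process_app = select_by_name(
    sample_response["data"]["processAppsList"], "Claim Approval Sample"
)
print(process_app["ID"])
```

The same helper can be applied to the list of tracking groups in the next step.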

You can now retrieve the tracking group ID. For this, you use the Business Process Manager assets API that provides information on the assets that are contained in the process application.

url = bpmrestapiurl + '/assets'
response = requests.get(url, headers=headers, verify=False, params={'processAppId': processAppId, 'filter': 'type=TrackingGroup' })

[trackingGroupId] = [x.get('poId') for x in json.loads(response.text).get('data').get('TrackingGroup') if x.get('name') == 'IBMBPMRSTraining_Claims']

# Note that the first 3 characters of the tracking group id below are removed
# because the REST API returns the tracking group id with a 3-character prefix that is '14.'.
# This prefix marks the identifier as a tracking group id but you won't need this prefix later.

print('The tracking group id : ' + trackingGroupId[3:])
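
Slicing by a fixed number of characters works as long as the prefix is exactly the expected one. A slightly more defensive variant, sketched below with a hypothetical `strip_type_prefix` helper, checks the prefix before removing it. The identifiers are the predefined IDs of the sample application.

```python
def strip_type_prefix(identifier, prefix):
    """Remove a type prefix such as '2066.' or '14.' from a REST API identifier."""
    if not identifier.startswith(prefix):
        raise ValueError(
            "identifier %r does not start with %r" % (identifier, prefix)
        )
    return identifier[len(prefix):]


process_app_id = strip_type_prefix(
    "2066.638d314f-12db-43c3-9051-89f3ce992393", "2066."
)
tracking_group_id = strip_type_prefix(
    "14.f1cf87ab-29ae-4b54-901a-6601b4539132", "14."
)
print(process_app_id)
print(tracking_group_id)
```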

Now that you know the process application ID, version, and the tracking group ID, you can query the tracked data in HDFS.

Using Spark SQL to read Business Automation Insights data

Business Automation Insights stores data in HDFS. As described previously, the events coming from the Business Automation Workflow instance are stored in JSON files. The following code creates the Spark session and uses the spark.read.json construct to read the JSON files.

from pyspark.sql import SparkSession

hdfs_root = 'hdfs://your hdfs root here'

processAppId = '638d314f-12db-43c3-9051-89f3ce992393'
processAppVersionId = '2064.4310cecf-969e-48ce-9ac3-00e73de5dfb9'
trackingGroupId = 'f1cf87ab-29ae-4b54-901a-6601b4539132'

spark = SparkSession.builder.getOrCreate()
spark.conf.set("dfs.client.use.datanode.hostname", "true")

try:
  # Read every JSON file stored under the tracking group directory.
  timeseries = spark.read.json(hdfs_root + "/ibm-bai/bpmn-timeseries/" + processAppId + '/' + processAppVersionId + '/tracking/' + trackingGroupId + '/*/*')
  # Register the data frame as a view so that Spark SQL can query it by name.
  timeseries.createOrReplaceTempView("timeseries")
  print('The data contains ' + str(timeseries.count()) + ' events')
except Exception:
  print('Exception while reading data, please ensure data was created in BAI')

Note that the various IDs are specified in the HDFS path. The path also uses HDFS wildcards: here, each * character matches any directory or file name at that level of the path. In the JSON structure, the trackedFields member groups all of the tracked data of the tracking point. You can create a Spark SQL query that returns only the tracked fields.

businessdata = spark.sql("SELECT trackedFields.* from timeseries")
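
To illustrate what selecting only the tracked fields means for a single record, here is a plain Python sketch that does not need Spark. The sample event is hypothetical: only the trackedFields member and its field names come from this article, and the other member is an assumption added to show that it is left out.

```python
import json

# Hypothetical event record: only the trackedFields member and its field names
# come from the article; the eventTime member is an assumption for illustration.
sample_event = json.dumps({
    "eventTime": "2019-01-01T00:00:00Z",
    "trackedFields": {
        "approved.string": "true",
        "approvedAmount.integer": 1905,
        "creditScore.integer": 646,
        "estimateAmount.integer": 1905,
        "vehicleMake.string": "VW",
        "vehicleModel.string": "Golf",
        "vehicleType.string": "car",
        "vehicleYear.integer": 2015,
    },
})


def select_tracked_fields(raw_event):
    """Return only the trackedFields member of one JSON event record."""
    event = json.loads(raw_event)
    return event.get("trackedFields", {})


row = select_tracked_fields(sample_event)
print(sorted(row))
```

Each key of the resulting record corresponds to one column of the data frame returned by the Spark SQL query.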

At this point, you should get a result that looks like the following example.

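+---------------+----------------------+-------------------+----------------------+------------------+-------------------+------------------+-------------------+
|approved.string|approvedAmount.integer|creditScore.integer|estimateAmount.integer|vehicleMake.string|vehicleModel.string|vehicleType.string|vehicleYear.integer|
+---------------+----------------------+-------------------+----------------------+------------------+-------------------+------------------+-------------------+
|           true|                  1905|                646|                  1905|                VW|               Golf|               car|               2015|
|           true|                  1842|                731|                  1842|                VW|               Golf|               car|               2015|
|          false|                  2605|                506|                  2605|                VW|               Golf|               car|               2015|
|           true|                   641|                872|                   641|                VW|               Golf|               car|               2015|
|          false|                  2853|                789|                  2853|                VW|               Golf|               car|               2015|
+---------------+----------------------+-------------------+----------------------+------------------+-------------------+------------------+-------------------+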

 |-- approved.string: string (nullable = true)
 |-- approvedAmount.integer: long (nullable = true)
 |-- creditScore.integer: long (nullable = true)
 |-- estimateAmount.integer: long (nullable = true)
 |-- vehicleMake.string: string (nullable = true)
 |-- vehicleModel.string: string (nullable = true)
 |-- vehicleType.string: string (nullable = true)
 |-- vehicleYear.integer: long (nullable = true)

You can see that the Spark data frame contains the tracking data, and you can now train a machine learning classifier with this data.

Create an Apache Spark machine learning model

Watson Machine Learning supports a growing number of IBM or open source machine learning and deep learning packages. This article uses Spark ML, and in particular, the Random Forest Classifier algorithm.

Adaptation of data

As you have seen in the data frame schema, the names of the fields in JSON, and therefore the column names, contain the data types such as string or integer. The following code starts by renaming the columns to remove the type information.

Then, the StringIndexer class transforms the approved column, which is a column of type string that contains only true or false values, into a numeric column with 0 and 1 values so that the classifier can understand and learn from the decision.

The VectorAssembler class creates a new features column that contains the features from which to build the model. This is required by the Random Forest Classifier algorithm. The IndexToString class transforms the prediction returned by the model, which is a 0 or 1 value, back into a true or false string.

from pyspark.ml.feature import OneHotEncoder, StringIndexer, IndexToString, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml import Pipeline, Model

businessdata = businessdata.withColumnRenamed("approved.string", "approved")
businessdata = businessdata.withColumnRenamed("creditScore.integer", "creditScore")
businessdata = businessdata.withColumnRenamed("estimateAmount.integer", "estimateAmount")
businessdata = businessdata.withColumnRenamed("approvedAmount.integer", "approvedAmount")

features = ["approvedAmount", "creditScore", "estimateAmount"]
approvalColumn = "approved"

approvalIndexer = StringIndexer(inputCol='approved', outputCol="label").fit(businessdata)

assembler = VectorAssembler(inputCols=features, outputCol="features")

labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=approvalIndexer.labels)
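
The round trip between string labels and numeric indexes can be sketched in plain Python. This is not the Spark implementation, only an approximation of its default behavior, in which the most frequent label receives index 0. The helper names are hypothetical, and the alphabetical tie-breaking is an assumption made for this sketch.

```python
from collections import Counter


def fit_label_index(values):
    """Order distinct labels by descending frequency, most frequent first."""
    counts = Counter(values)
    # Ties are broken alphabetically here; that choice is an assumption.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [label for label, _ in ordered]


def to_index(labels, value):
    """Map a string label to its numeric index, like a fitted indexer."""
    return float(labels.index(value))


def to_label(labels, index):
    """Map a numeric prediction back to its string label."""
    return labels[int(index)]


approved = ["true", "true", "false", "true", "false"]
labels = fit_label_index(approved)
indexed = [to_index(labels, value) for value in approved]
restored = [to_label(labels, index) for index in indexed]
print(labels, indexed, restored)
```

Because the index of a label depends on how often it occurs in the training data, true is not guaranteed to map to 1. This is why the label converter in the code above reuses the labels of the fitted indexer instead of assuming a fixed order.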

Creating the model

The model can then be built from the RandomForestClassifier algorithm.

rf = RandomForestClassifier(labelCol="label", featuresCol="features")

I then split the data into training data and test data, train the model, and compute the resulting accuracy of the model with the test data.

businessdata = businessdata[features+['approved']]
# Keep about 80% of the rows for training and 20% for testing; 24 is the random seed.
split_data = businessdata.randomSplit([0.8, 0.20], 24)
train_data = split_data[0]
test_data = split_data[1]
pipeline = Pipeline(stages=[approvalIndexer, assembler, rf, labelConverter])
model = pipeline.fit(train_data)

predictions = model.transform(test_data)
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)

print("Accuracy = %g" % accuracy)
print("Test Error = %g" % (1.0 - accuracy))
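
The accuracy metric is the fraction of test rows for which the predicted index equals the label index. The following plain Python sketch shows the computation on made-up values. The `accuracy` helper and the sample lists are illustrative only and are not results from the article's data.

```python
def accuracy(true_labels, predicted_labels):
    """Return the fraction of predictions that match the true labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("both sequences must have the same length")
    if not true_labels:
        raise ValueError("at least one row is required")
    correct = sum(
        1 for truth, guess in zip(true_labels, predicted_labels) if truth == guess
    )
    return correct / len(true_labels)


# Made-up label and prediction indexes, for illustration only.
test_labels = [0.0, 0.0, 1.0, 0.0, 1.0]
test_predictions = [0.0, 1.0, 1.0, 0.0, 1.0]

acc = accuracy(test_labels, test_predictions)
print("Accuracy = %g" % acc)
print("Test Error = %g" % (1.0 - acc))
```

With only 10 to 20 process runs, the test set contains very few rows, so the measured accuracy can change noticeably from one split to another.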

Storing the model in Watson Machine Learning

I use Watson Machine Learning to store the resulting model. After the model is stored, Watson Machine Learning makes it possible to create an HTTP scoring endpoint, which is then used as the recommendation service.

The following code stores the created model and pipeline in Watson Machine Learning using the Python client API for Watson Machine Learning. Note that you need to specify the authentication information from your instance of the Watson Machine Learning service in the following code.

!pip install watson-machine-learning-client

from watson_machine_learning_client import WatsonMachineLearningAPIClient

# Authenticate to Watson Machine Learning service on IBM Cloud.
# These values can be found on the Service Credentials tab of the service instance created in IBM Cloud.
wml_credentials = {
  "url": "",  # place the service URL here
  "access_key": "place access key here",
  "username": "place username here",
  "password": "place password here",
  "instance_id": "place instance id key here"
}

client = WatsonMachineLearningAPIClient(wml_credentials)
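
If you prefer not to keep credentials in the notebook itself, one option is to read them from environment variables. The sketch below assumes hypothetical variable names such as WML_URL and a hypothetical `load_wml_credentials` helper. Only the keys of the resulting dictionary come from the code above.

```python
import os

# Hypothetical environment variable names chosen for this sketch,
# mapped to the credential keys used in the article's code.
ENV_TO_CREDENTIAL = {
    "WML_URL": "url",
    "WML_ACCESS_KEY": "access_key",
    "WML_USERNAME": "username",
    "WML_PASSWORD": "password",
    "WML_INSTANCE_ID": "instance_id",
}


def load_wml_credentials(environ=os.environ):
    """Build the credentials dictionary from environment variables."""
    missing = [name for name in ENV_TO_CREDENTIAL if not environ.get(name)]
    if missing:
        raise KeyError(
            "missing environment variables: " + ", ".join(sorted(missing))
        )
    return {key: environ[name] for name, key in ENV_TO_CREDENTIAL.items()}


# Example with a plain dictionary standing in for the real environment.
example_environ = {
    "WML_URL": "https://wml.example.com",
    "WML_ACCESS_KEY": "key",
    "WML_USERNAME": "user",
    "WML_PASSWORD": "secret",
    "WML_INSTANCE_ID": "instance",
}
example_credentials = load_wml_credentials(example_environ)
print(sorted(example_credentials))
```

The returned dictionary has the same shape as the wml_credentials dictionary, so it can be passed to the client constructor in its place.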

You can now save the model and the training data.

published_model_details = client.repository.store_model(model=model, meta_props={'name':'Claim Approval Recommendation Model'}, training_data=train_data, pipeline=pipeline)

Deploying the model

Now that the model is stored in the Watson Machine Learning repository, I need to deploy it in a runtime environment. I start by retrieving the model ID.

model_uid = client.repository.get_model_uid(published_model_details)

I list the existing deployments. A free tier in Watson Machine Learning allows no more than five deployments.

client.deployments.list()

I use the deployments client API to create a new deployment for my model.

deployment_details = client.deployments.create(asset_uid=model_uid, name='Recommendation Prediction Model')

The URL that lets me score against the published model is part of the deployment details.

recommendation_url = client.deployments.get_scoring_url(deployment_details)

Testing the recommendation URL

I can now test the scoring URL with some data to see how it works. I do so by providing the values for a new claim.

import json
recommendation_data = {"fields": ["approvedAmount", "creditScore", "estimateAmount"],"values": [[2000, 500, 2000]]}

scoring_response = client.deployments.score(recommendation_url, recommendation_data)

print(json.dumps(scoring_response, indent=3))
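
The scoring payload pairs a list of field names with rows of values in the same order. The sketch below builds such a payload from a claim and reads one column back from a response. The helper names are hypothetical, and the sample response is an assumption: it supposes that the service answers in the same fields and values layout and includes the predictedLabel column produced by the pipeline, which you should check against the response printed above.

```python
FEATURES = ["approvedAmount", "creditScore", "estimateAmount"]


def build_scoring_payload(claim, features=FEATURES):
    """Build a one-row payload whose values follow the order of the fields."""
    return {
        "fields": list(features),
        "values": [[claim[name] for name in features]],
    }


def extract_field(scoring_response, field, row=0):
    """Read one column of one row from a fields/values style response."""
    position = scoring_response["fields"].index(field)
    return scoring_response["values"][row][position]


payload = build_scoring_payload(
    {"approvedAmount": 2000, "creditScore": 500, "estimateAmount": 2000}
)

# Hypothetical response used only to exercise extract_field; the actual
# columns and values returned by the service may differ.
sample_response = {
    "fields": FEATURES + ["prediction", "predictedLabel"],
    "values": [[2000, 500, 2000, 1.0, "false"]],
}
recommendation = extract_field(sample_response, "predictedLabel")
print(payload)
print(recommendation)
```

Looking up the column by name instead of by a fixed position keeps the code working if the response contains additional columns.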

Invoking the recommendation REST endpoint from the Business Automation Workflow process

To display a recommendation for a decision on a claim within the Business Automation Workflow process user interface itself, I invoke the AI model from a Business Automation Workflow service. If you go back to the Process Designer, within the list of available service flows, you see a service flow called ‘Invoke Watson ML Service Flow’. This is the service that calls the recommendation REST endpoint in Watson Machine Learning.

[Image: Invoke machine learning flow service]

This service flow is implemented in JavaScript. As shown in the previous image, the code sends a POST request to the Watson Machine Learning model endpoint.

In this script, you must specify the credentials to Watson Machine Learning as well as the recommendation URL.

The result of the recommendation service is displayed in the process user interface (the coach) after the service has been called. In the following image, you see the definition of the coach. It contains two different parts, one for the ‘I recommend’ and another one for ‘I do not recommend.’ The visibility of each portion depends on the result of the recommendation service.

[Image: Claim approval]

The system is now ready to return recommendations about the insurance claim.



This article helps you understand how to create a recommendation service for your Business Automation Workflow process with Watson Machine Learning and Business Automation Insights. You are encouraged to explore more possibilities of Watson Studio and Watson Machine Learning, in particular, the capability to retrain the model when more data becomes available.