
Data loss prevention in an API world

Data is the lifeblood of many organizations. Organizations that let data leak often face an embarrassing PR crisis. If the data belongs to their customers, they might also face civil or criminal liability. To reduce the risk of data leaks, many organizations deploy data loss prevention (DLP) software to inspect information in outgoing documents, emails, and other communications. Only documents that do not appear to contain suspicious amounts of restricted data are allowed to pass automatically. Documents that appear to have lists of sensitive or private data, such as credit card numbers, are returned to the sender.

Organizations also communicate with the outside world by calling the application programming interfaces (APIs) of services such as IBM Text to Speech or Twilio SendGrid. These APIs are typically called from internal business applications, which often have information that should not be allowed to leave the company. There are two ways to implement data loss prevention for these calls. You can integrate it into each and every application. Or, you can write a proxy and have all the API calls go to that proxy first. The proxy then forwards each call to the third-party API.

The examples in this tutorial show how to use the proxy method, because it is much more modular. It is easy to add support for additional applications and additional third-party APIs. Modular software is also easier to debug. Finally, in many organizations there is a disconnect between the application department and the security department. It is often much easier to add a security feature in a proxy than it is to make the application department modify their applications.

Learning objectives

In this tutorial, you learn the following skills:

  • Write an HTTP proxy in Node.js and run it on the IBM Cloud.
  • Parse API calls to identify the potentially sensitive information and avoid relaying it.

Prerequisites

To follow this tutorial, you need the following skills and tools:

Estimated time

  • Completing the steps in this tutorial should take around 3 hours.

Steps

First you create the sample application for this tutorial, which is an app that accepts text input from users and turns it into audio. Then you create a proxy that relays information, and you identify the API for a service. Finally, you learn how to capture potentially dangerous calls and how to identify and address dangerous content.

Step 1: Create the sample application

To demonstrate the techniques in this tutorial you need an application that calls a service. I wrote an application that accepts textual input from the user and turns it into audio using IBM Text to Speech. You can read the Service endpoint documentation.

To get started with the Text to Speech app, complete the following steps:

  1. Create an AI > Text to Speech resource type.
  2. Make a note of the API key and the URL.
  3. Start a Node.js application on the IBM Cloud. For detailed instructions, see Getting started with IBM Cloud Node.js applications.
  4. Replace the app.js file with the source code in the 01_sayme_app.js sample file. Make sure to change the textToSpeechConf configuration on line 24 to use your own API key and URL (the direct URL, for now).
  5. Replace the public/index.html file with the source code in 02_sayme_index.html.
  6. Run the modified application. It displays a form. Users can submit text and download an MP3 with the spoken text.

This Node.js application is simple for the purpose of this tutorial. The interesting part is that it calls the Text to Speech service.

Understand the text2Speech function

The following function converts text to speech. It has three parameters: the text to be converted, an indicator of whether it goes through the proxy or not, and a callback for after the audio is available:

const text2Speech = (text, useProxy, cb) => {

You receive the audio file as multiple chunks, each as a Buffer object. The following array stores these objects until you are ready to concatenate them into a single file:

    var audio = [];
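
As a minimal sketch of this pattern, the following fragment stores two chunks in an array and then joins them with Buffer.concat. The chunk contents are made up for illustration and stand in for pieces of an audio response:

```javascript
// Hypothetical chunks standing in for pieces of an audio response
const chunks = [];
chunks.push(Buffer.from("first part, "));
chunks.push(Buffer.from("second part"));

// Buffer.concat joins the stored chunks, in order, into one Buffer
const whole = Buffer.concat(chunks);
console.log(whole.length);      // total number of bytes across all chunks
console.log(whole.toString());  // the joined content as a string
```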

Understand the HTTP request

You send the following request to the Text to Speech service. In this case, the request is very simple. You just send the text that you are converting to audio:

    const reqBody = {text: text};

You can use the following configuration parameters when you create the HTTP request. Use the POST method because the data could be long:

    var reqOpts = {
        method: "POST",
        auth: `apikey:${textToSpeechConf.iam_apikey}`

For more information, see the Authentication documentation for the Text to Speech service.

The following header fields are passed to the Text to Speech service and affect its function:

        headers: {
            "Content-Type": "application/json",
            "Accept": textToSpeechConf.audioFormat
        }
    };   // reqOpts

For more information, see the Synthesize audio documentation.

To decide which URL to use, the app uses the ternary operator, which is more readable than using an if statement:

    const useUrl = useProxy ? textToSpeechConf.proxyUrl :
        textToSpeechConf.directUrl;

The Node.js HTTPS library expects to get the host name, path, and other data from the parameters inside the request options rather than a URL. The HTTPS library documentation for Node.js says that you can put the URL first, but that approach doesn’t work. The url.parse function creates those parameters, and then Object.assign combines those parameters with the parameters that you previously put in reqOpts:

    reqOpts = Object.assign(url.parse(`${useUrl}/v1/synthesize`),
        reqOpts);
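
Current Node.js documentation marks url.parse as a legacy API. If you prefer to avoid it, the same merge can probably be done with the WHATWG URL class. The following is only a sketch of that idea: the buildRequestOptions helper and the proxy.example.com host are hypothetical and do not appear in 01_sayme_app.js.

```javascript
// Sketch: build https.request options with the WHATWG URL class instead of
// the legacy url.parse. The helper name and the sample URL are hypothetical.
const buildRequestOptions = (baseUrl, reqOpts) => {
    const target = new URL(`${baseUrl}/v1/synthesize`);
    const urlOpts = {
        protocol: target.protocol,              // for example "https:"
        hostname: target.hostname,
        path: target.pathname + target.search
    };
    // Only set the port when the URL names one explicitly
    if (target.port)
        urlOpts.port = target.port;
    // As in the tutorial, values in reqOpts win over values from the URL
    return Object.assign(urlOpts, reqOpts);
};

const exampleOpts = buildRequestOptions("https://proxy.example.com",
    {method: "POST", headers: {"Content-Type": "application/json"}});
console.log(exampleOpts);
```

The resulting object carries the host name and path from the URL together with the method and headers from the original options, which is the shape that https.request expects.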

The following code creates the HTTPS request itself:

    const httpsReq = https.request(reqOpts, res => {

Understand how to handle the request

The callback function (the last parameter) is called after part of the result is available. However, in the case of long responses such as an audio file, the first part of the result is available much earlier than the entire file is available. Therefore, to gather the entire file, you need to register callbacks for when a chunk of data is available and when the file is received completely:

    const httpsReq = https.request(reqOpts, res => {
        res.on("data", chunk => audio[audio.length] = chunk);
        res.on("end", () => cb(Buffer.concat(audio)));
    }); // https.request

    httpsReq.on("error", err =>
         console.log(`Text 2 Speech API error: ${err}`));
    httpsReq.write(JSON.stringify(reqBody));

The request isn’t sent out until it is fully created. The .end function tells the system it can send it out:

    httpsReq.end();
};    // text2Speech

Note: There is also an API for the Text to Speech service (see Synthesize audio). This tutorial does not use that API, so that the HTTPS exchange itself is easier to see. If you decide to use it in your own applications, the process to use the proxy is similar. Just change the url parameter in the constructor. For more information, see Authentication.

Step 2: Create a proxy that relays information

The next step is to create a proxy to relay requests to the Text to Speech service and responses back to the application.

  1. You create an application using the 03_proxy.js source code. If necessary, replace the host name on line 33 with the one you use, and modify the proxy URL in your application to use the proxy application’s host name. See 01_sayme_app.js, on line 27.

    When you use this technique with your own applications, determine where the application is provided with the URL to the service. The URL might be in the appEnv variable or a configuration file. Modify that URL to point to the proxy.

  2. After you use the proxy for a few requests, go to /log to see those requests, as shown in the following screen capture.

Log of requests to the proxy

Note that the log is deleted after you view it. If you refresh the page, you get an empty page.

Now, to understand how the proxy works, learn more about the initial proxy definitions, the HTTP request bodies, the log file, the actual proxy, the new HTTP request, callbacks, and messages to the log file.

Initial proxy definitions

The proxy program at 03_proxy.js starts with the following standard IBM Cloud Node.js definitions:

const express = require('express');

// cfenv provides access to your Cloud Foundry environment
// for more info, see: https://www.npmjs.com/package/cfenv
const cfenv = require('cfenv');

// create a new express server
const app = express();

// get the app environment from Cloud Foundry
const appEnv = cfenv.getAppEnv();

A proxy needs to forward requests, which requires either the http library or the https library. In this tutorial, you need to use https, like the following example, to avoid sending a password (the API access key) in cleartext:

// We need to send HTTPS requests
const https = require("https");

The querystring library not only parses query strings, it also creates them. This is important because Express parses the query string, and you need the unparsed version to relay to the service:

const qs = require("querystring");

If the back-end application uses POST, PUT or PATCH, you need to read the data:

const bodyParser = require("body-parser");

Note that body-parser does not handle multipart bodies (request bodies made up of several parts, such as those sent with the multipart/form-data content type). If the proxied application uses them, you need a different solution.

If you want to use the same proxy for multiple applications, you might need a more sophisticated method to provide the host name. One possibility is to embed it in the authentication information (because the authentication information is always an input to the API rather than something the API handles by itself). In this tutorial, however, the proxy uses a constant host name for simplicity:

// The host for which we are a proxy
const host = "stream.watsonplatform.net";

HTTP request bodies

Most HTTP requests use methods that don’t use the HTTP body. For the exceptions, POST, PUT, and PATCH, you need to use the body-parser library to read the body:

// If needed read the body.
// bodyParser.text() normally doesn't deal with all mime types - 
// the type parameter forces that behavior
app.post("*", bodyParser.text({type: "*/*"}));
app.put("*", bodyParser.text({type: "*/*"}));
app.patch("*", bodyParser.text({type: "*/*"}));

Even if the body is in a different format, such as JSON or URL encoding, it is still easiest to read it as text when you just need to relay it.

The log file

To be able to write a policy about API calls, you need to know what API calls are made, so you log them. The log information is stored in the following array:

// log file for requests
var log = [];

The following function call is the way you display the log (if the proxied service already uses GET on /log, switch to a different path):

// Show and delete the log
app.get("/log", (req, res) => {
    res.send(`<PRE>${log.reduce((a, b) => a+b, "")}</PRE>`);

It uses a template literal to create the HTML. To turn the array into a single string, the program uses the reduce function.

Make sure you consider the use case for the log. The person in your organization who builds the security policy performs an operation with the application and then views the resulting API calls. To make it easy to identify the calls that are part of the operation, you need to delete the log after it is displayed. Then after the next operation you only have the entries for that operation:

    log = [];

});
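
The show-and-delete behavior can be sketched without Express as a plain function. In this sketch, demoLog and showAndClearLog are hypothetical names, and join("") is used as a shorter equivalent of the reduce call:

```javascript
// Stand-in for the proxy's log array, with two made-up entries
let demoLog = ["GET to /a\n", "POST to /b\n"];

// Return the log as HTML and then empty it, as the /log handler does
const showAndClearLog = () => {
    // join("") gives the same string as reduce((a, b) => a+b, "")
    const html = `<PRE>${demoLog.join("")}</PRE>`;
    demoLog = [];
    return html;
};

console.log(showAndClearLog());  // both entries inside the PRE element
console.log(showAndClearLog());  // an empty PRE element
```

The second call returns an empty element, which matches the empty page you get when you refresh /log.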

The actual proxy

The following call sets up the actual proxy:

app.all("*", (req, res) => {

It applies to all methods and all paths (except for those previously handled, in this case /log).

The new HTTP request

Most of the request headers stay unchanged. The exception is the host. You don’t want to call yourself; you want to call the service you are a proxy for, as shown in the following code:

    var headers = req.headers;
    headers["host"] = host;

If there is a query, you need to create it again in the original format, as shown in the following code fragment from 03_proxy.js:

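    // Deal with the query if there is one. Don't add any
    // characters if there isn't. Express sets req.query to an
    // empty object when the request has no query string.
    var query = "";
    if (Object.keys(req.query).length > 0)
        query = "?" + qs.stringify(req.query);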
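
To see what the rebuilt query looks like, here is a small stand-alone sketch. The rebuildQuery helper and the sample parameters are hypothetical. Because Express provides an empty object when a request has no query string, the helper checks for keys rather than relying on truthiness:

```javascript
const qs = require("querystring");

// Rebuild the query part of a URL from a parsed query object.
// Returns "" when there are no parameters, so no stray "?" is added.
const rebuildQuery = query => {
    if (!query || Object.keys(query).length === 0)
        return "";
    return "?" + qs.stringify(query);
};

console.log(rebuildQuery({}));
console.log(rebuildQuery({voice: "en-US_AllisonVoice", accept: "audio/mp3"}));
```

Note that qs.stringify percent-encodes reserved characters, so the slash in audio/mp3 is relayed as %2F.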

The program creates the options for the new request, including its HTTP headers, as shown in the following code fragment:

    // The options that go in the HTTP header
    var proxiedReqOpts = {
          host: host,
          path: req.path + query,
          method: req.method,
          headers: headers
    };

Gather the returned values in the following array:

    var retVal = [];

Create the actual request that gets sent to the service, as shown in the following code:

    const proxiedReq = https.request(proxiedReqOpts, proxiedRes => {

Callbacks to the HTTP request

As previously shown, you gather the data until it is done, and then you relay it:

        proxiedRes.on("data", chunk => retVal.push(chunk));
        proxiedRes.on("end", () =>
            res.send(Buffer.concat(retVal)));
        proxiedRes.on("error", err =>
            res.send(JSON.stringify(err) + "<hr />" + retVal));
    });

This approach is simpler than forwarding each chunk as you get it, and efficient enough for files of a few kilobytes, such as the MP3 file that you receive.

If the request has a body, send that to the service too, as shown in the following code:

    // Some requests have a body
    if ((req.method === "POST") ||
        (req.method === "PUT") ||
        (req.method === "PATCH"))
        proxiedReq.write(req.body);

The following call tells Node.js that you are done with the request and it can send it:

    proxiedReq.end();

Messages written to the log file

The log is purely a development feature, and you should delete it before you go into production, because keeping request bodies in a log would be a security violation. If you nevertheless need the log in the production environment, add the following code so that at least your proxy won’t suffer from having a huge data structure:

    if (log.length > 1000)
        log = [];

Add the new message to the log (and the body, if relevant), as shown in the following code:

    log.push(`${new Date()}: ${req.method} to ${req.url}\n`);
    if ((req.method === "POST") || 
           (req.method === "PUT") || (req.method === "PATCH"))
       log.push(`\t${req.body}\n`);
});

// start server on the specified port and binding host
app.listen(appEnv.port, '0.0.0.0', () =>
    console.log("server starting on " + appEnv.url));

Step 3: Identify an API for a service

Now you identify the calls that the application makes, especially the potentially dangerous ones, so that you can block them. For this tutorial, you don’t need to discover the calls, because you already know the exact format from the application’s source code (01_sayme_app.js, lines 36-47).

However, other applications might use an API library that is not so clear. In that case, you can run each operation through the proxy and view the log, as shown in the following screen capture:

Run operations through the proxy and view the log

The screen capture shows that there is one operation, which uses the POST method to the /text-to-speech/api/v1/synthesize path. The operation gets information in a body in JSON format, and the text to be spoken (where you need to apply the DLP) is in the field called text.

Step 4: Capture potentially dangerous calls

Now that you know what you need, the next step is to capture the potentially dangerous calls and get the text that could contain keywords that your company doesn’t want to leak out. The sample code at 04_proxy_modify.js, lines 60-71, shows how to do it.

The following sections show how it works.

Capturing and understanding a request

The method is POST, so to capture it you use app.post. The URL is the one you discovered. The callback function gets three parameters, rather than the standard two: the request, the response, and the function to call next. The following code snippet from the 04_proxy_modify.js file shows how to create express middleware, which does something with the request and then returns it for normal processing:

// Get attempts to say something and allow them or not
app.post("/text-to-speech/api/v1/synthesize", (req, res, next) => {

The parser you use for the body, bodyParser.text(), keeps it as a string. But here you need it parsed:

    var requestBody = JSON.parse(req.body);

Modifying a request

You can modify fields in the request body, for example, the text:

    requestBody.text = "I said that " + requestBody.text;

To send the new body, you need to turn it back into text. Also, the content length is no longer correct, so you should update it. The header counts bytes rather than characters, so measure the body with Buffer.byteLength, as shown in the following code:

    req.body = JSON.stringify(requestBody);
    req.headers["content-length"] = Buffer.byteLength(req.body);

After the request is modified, return it to the normal processing stream. Then the proxy function call gets it and sends it to the service:

    next();    
});
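
The body rewrite can also be tried on its own, outside the proxy. The following sketch uses a hypothetical rewriteBody helper that is not part of 04_proxy_modify.js. It assumes the body is UTF-8 encoded and shows why the content length should be measured in bytes: for text with non-ASCII characters, the number of bytes is larger than the number of characters.

```javascript
// Sketch of the body rewrite as a pure function. rewriteBody is a
// hypothetical helper, not part of 04_proxy_modify.js.
const rewriteBody = rawBody => {
    const parsed = JSON.parse(rawBody);
    parsed.text = "I said that " + parsed.text;
    const body = JSON.stringify(parsed);
    // Content-Length counts bytes, so measure the UTF-8 encoded body
    return {body: body, contentLength: Buffer.byteLength(body)};
};

// "caf\u00e9" ends in one non-ASCII character that takes two bytes in UTF-8
const rewritten = rewriteBody(JSON.stringify({text: "caf\u00e9"}));
console.log(rewritten.body.length);     // characters in the new body
console.log(rewritten.contentLength);   // bytes in the new body, one more
```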

Step 5: Identify and handle dangerous content

Now that you can get the text, you can use regular expressions to decide what content is dangerous and what to do with it. For example, imagine you fear people using the Text to Speech service to read out credit card numbers. A credit card number (except for American Express) is sixteen digits, which are typically written as four groups of four digits each, possibly separated by white space (such as a space or tab). The regular expression is /(\d{4}\W*){4}/g. Note that \W matches any non-word character, so this expression also accepts other separators, such as hyphens.
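
To get a feel for what this regular expression accepts, you can run it against a few sample strings. The samples in this sketch are made up for illustration:

```javascript
const cardPattern = /(\d{4}\W*){4}/g;

// Hypothetical sample inputs
const samples = [
    "4111 1111 1111 1111",     // groups separated by spaces
    "4111-1111-1111-1111",     // \W also matches hyphens
    "4111111111111111",        // no separators at all
    "call me at 555 1234"      // too few digits, so no match
];

for (const sample of samples) {
    // match() returns null when nothing matches
    const found = sample.match(cardPattern);
    console.log(`${sample} => ${found === null ? "no match" : "match"}`);
}
```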

Capturing and understanding a request

To inspect the request, you first need to capture it, as shown in the following code:

// Get attempts to say something and allow them or not
app.post("/text-to-speech/api/v1/synthesize", (req, res, next) => {

    // Here we DO need the parsed request body
    const requestBody = JSON.parse(req.body);

Identifying and classifying credit card numbers

The following line retrieves all the strings inside the text that match the regular expression, all those that look like credit card numbers. When you use the g flag to find matches globally, rather than the first one, the result is an array of strings. To learn more, see match() method.
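
As a rough sketch of this step, the following fragment collects the matches and splits them by their first digit. The classifyCardNumbers helper, its return shape, and the sample text are assumptions made for illustration, and the sketch does not decide what the proxy should do with each group:

```javascript
const cardRegex = /(\d{4}\W*){4}/g;

// Split the card-like strings in a text by their first digit.
// classifyCardNumbers is a hypothetical helper for illustration.
const classifyCardNumbers = text => {
    // With the g flag, match() returns every match, or null if none
    const matches = text.match(cardRegex) || [];
    return {
        startsWithFive: matches.filter(m => m[0] === "5"),
        others: matches.filter(m => m[0] !== "5")
    };
};

const classified = classifyCardNumbers(
    "Use 5500 0000 0000 0004 or 4111-1111-1111-1111.");
console.log(classified.startsWithFive.length, classified.others.length);
```

Falling back to an empty array when match() returns null keeps the rest of the code free of null checks.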

Redacting credit card numbers and modifying the request accordingly

If none of the credit card numbers start with a five, the replace function replaces the credit card numbers with REDACTED. Note that the following code uses the same regular expression as the match function call earlier:

    else {   // Assume everything else is Visa
        requestBody.text =
            requestBody.text.replace(/(\d{4}\W*){4}/g, "REDACTED");
        req.body = JSON.stringify(requestBody);
        // Content-Length is measured in bytes, not characters
        req.headers["content-length"] = Buffer.byteLength(req.body);
        next();
    }
});
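
One detail of the redaction is easy to miss: because each group in the pattern also takes the separators after it, the match includes any white space or punctuation that follows the last group. The sketch below shows the effect and compares it with a variant pattern that only matches separators between groups. The variant and the sample text are assumptions for illustration and are not part of the tutorial's sample code:

```javascript
// The tutorial's pattern: each group of digits takes any separators after it
const tutorialPattern = /(\d{4}\W*){4}/g;
// Hypothetical variant: separators are only matched between groups
const variantPattern = /\d{4}(\W*\d{4}){3}/g;

const sampleText = "Card 4111 1111 1111 1111 please";

// prints "Card REDACTEDplease"
console.log(sampleText.replace(tutorialPattern, "REDACTED"));
// prints "Card REDACTED please"
console.log(sampleText.replace(variantPattern, "REDACTED"));
```

For spoken output the missing space probably makes little difference, but the variant may be preferable if the redacted text is also logged or displayed.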

Summary

Implementing data loss prevention on an API is more complicated than implementing it for documents, because there are relatively few document types in common use, whereas APIs are a lot more diverse. Still, as the steps in this tutorial show, data loss prevention for APIs isn’t very difficult, and it allows you to block another potential channel for information leaks.

Now you know how to put a proxy in front of your own organization’s applications to both review the API calls (if they are RESTful APIs called over HTTP) and then reject those that leak information.

Ori Pomerantz