
Generating data for anomaly detection

Every data project starts with data. Data is a very broad term. It can be structured or unstructured, big or small, fast or slow, and accurate or noisy.

I’ve written a series of articles on deep learning and developing cognitive IoT solutions, starting with Introducing deep learning and long short-term memory networks. The next articles are about using Deeplearning4j, Apache SystemML, and TensorFlow (TensorSpark) for anomaly detection. My use case is anomaly detection for IoT time-series data from a vibration (accelerometer) sensor. To effectively demo the process of creating a deep learning solution on these different technologies, I need data. I need structured, fast, and big data, which can be noisy too.

To have a simple framework for creating data, I’ve written a test data simulator, which is part of a bigger time series and machine learning toolkit.

This simulator generates data from sampling various physical models, and you can decide on the degree of noise and switch between different states (healthy and broken) of the physical model for anomaly detection and classification tasks.
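To illustrate the noise part, here is one common way to add a configurable degree of noise to a clean model value. This is a sketch using Box-Muller Gaussian noise; the `noiseLevel` parameter is illustrative, not the simulator's actual setting:

```javascript
// Zero-mean, unit-variance Gaussian noise via the Box-Muller transform.
function gaussian() {
  var u = 1 - Math.random(); // in (0, 1], so Math.log(u) is defined
  var v = Math.random();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Corrupt a clean sample: noiseLevel = 0 returns the value unchanged,
// larger values bury the signal in noise.
function addNoise(value, noiseLevel) {
  return value + noiseLevel * gaussian();
}
```

With this kind of knob, the same physical model can produce anything from perfectly clean to heavily corrupted sensor readings.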

For now, I’ve implemented the Lorenz Attractor model. This is a very simple, but still very interesting, physical model. Lorenz was one of the pioneers of chaos theory and he was able to show that a very simple model that consists of just three equations and four model parameters can create a chaotic system that is highly sensitive to initial conditions and that also oscillates between multiple semi-stable states where state transitions are very hard to predict.
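The Lorenz system can be sketched in a few lines of JavaScript. This is a minimal illustration of the same Euler-stepping idea, not the simulator's actual flow code; the parameter names h, a, b, and c match the ones used later in the tutorial:

```javascript
// One Euler step of the Lorenz system:
//   dx/dt = a(y - x), dy/dt = x(b - z) - y, dz/dt = xy - cz
// h is the step size; a, b, and c are the model constants.
function lorenzStep(state, h, a, b, c) {
  var x = state.x, y = state.y, z = state.z;
  return {
    x: x + h * a * (y - x),
    y: y + h * (x * (b - z) - y),
    z: z + h * (x * y - c * z)
  };
}

// Start near the attractor and advance 3000 steps (30 seconds at 100 Hz).
var state = { x: 2, y: 3, z: 4 };
for (var i = 0; i < 3000; i++) {
  state = lorenzStep(state, 0.008, 10, 28, 8 / 3);
}
```

Despite the chaotic behavior, the trajectory never diverges; it keeps oscillating within the bounded "butterfly" region of the attractor.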

I’m using Node-RED as the runtime platform for the simulator because it is a very fast way of implementing data-centric applications. Node-RED is open source and runs entirely on Node.js. If you want to learn more about Node-RED, check out these videos.

Because the data simulator is completely implemented as a Node-RED flow, we can use Node-RED from the IoT Starter Boilerplate on the IBM Cloud. Of course, the data simulator can run on any Node-RED instance, even on a Raspberry Pi, where it can be used to simulate sensor data on the edge.

Creating the test data simulator

While it was challenging to create the test data simulator, you can get the simulator up and running in four main steps:

  1. Deploy the Node-RED IoT Starter boilerplate to the IBM Cloud.
  2. Deploy the test data simulator flow.
  3. Test the test data simulator.
  4. Get the IBM Watson IoT Platform credentials to consume the data using MQTT from any place in the world.

Before you get started, you’ll need an IBM Cloud account. (Sign up for an IBM Cloud account.)

  1. Log in to your IBM Cloud account.
  2. Create the IoT app including Node-RED. Follow these steps in this tutorial.
  3. After successfully deploying the app, in the left menu, click Connections.
  4. On the Internet of Things Platform tile, click View credentials.
  5. Write down the values of the following properties; you will need them later when you work with one of the three technologies (Deeplearning4j, Apache SystemML, and TensorFlow (TensorSpark)):

    • org
    • apiKey
    • apiToken

  6. Open the Node-RED flow editor.
  7. Using your mouse, select all the nodes in Flow 1, and then press the Delete key to empty the flow.

  8. From the upper-right menu, click Import > Clipboard.
  9. Open this simulatorflow.json file in my GitHub repo; copy the JSON object to the clipboard.
  10. On the Import nodes window, paste the JSON object to the text field, and click Import.

    Note: Make sure that you are pasting a JSON document from your clipboard and not HTML. In its next version, Node-RED will have Git integration, which will make this step easier.

    The following flow is displayed in the Flow 1 tab.

  11. Click Deploy. The message “Successfully deployed” will display.

The debug tab displays the generated messages. Congratulations! Your Test Data Generator is working.


Understanding this Node-RED flow

You’ve got it working, but what is going on in this Node-RED flow?

Consider the node labeled with the word timestamp.


This node is an inject node and it generates messages at defined intervals. This is very useful as a starting point for our simulator. In a real-life scenario, this node would be replaced with nodes that are connected to accelerometer sensors. Because we are generating the accelerometer values using the Lorenz Attractor model, we can ignore the timestamp payload on the messages and react only to the message object itself, which we will see later.

Double-click the timestamp node. Notice that the node generates 100 messages per second (a sampling rate of 100 Hz).

Note: Because of the Nyquist theorem, the sampling rate has to be at least twice the highest frequency that we want to capture. While 100 Hz (one message every 0.01 second) is far too low for real vibration data, it is high enough for the sake of this tutorial. In a real-world scenario, you would sample at 20 or 40 kHz.
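The Nyquist relationship from the note is simple enough to state as code (a trivial sketch):

```javascript
// Nyquist: the sampling rate must be at least twice the highest
// frequency of interest, so the highest recoverable frequency is
// half the sampling rate.
function maxRecoverableHz(sampleRateHz) {
  return sampleRateHz / 2;
}

var simulatorLimit = maxRecoverableHz(100);   // 50 Hz at the simulator's rate
var vibrationLimit = maxRecoverableHz(20000); // 10 kHz with a 20 kHz setup
```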


Next, look at the function node. It is the heart of the simulator.


Double-click this node and see the following function code:

var h = context.get('h')||0.008;
var a = context.get('a')||10;
var b = context.get('b')||28;
var c = context.get('c')||8/3;
var x = context.get('x')||2;
var y = context.get('y')||3;
var z = context.get('z')||4;

x += h*a*(y-x);
y += h*(x*(b-z)-y);
z += h*(x*y-c*z);

context.set('x',x);
context.set('y',y);
context.set('z',z);

msg.payload = {x:x, y:y, z:z};

return msg;

Note: The initial parameters of the model are h, a, b, and c. We also initialize x, y, and z to some starting values; the equations are the actual model. They depend on h, a, b, c, x, y, and z. In every time step (currently 100 per second), the model is advanced one step into the future: x, y, and z are updated using the constants h, a, b, and c and the previous x, y, and z values.

You need to set a limit on the output for two reasons:

  • At the current sample rate (100 messages per second), you’ll use up the free 200 MB per month on the Watson IoT Platform within a couple of hours.
  • The downstream analysis might not be able to cope with this data rate.
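A back-of-the-envelope check makes the first point concrete (120 bytes is an assumed average message size, not a measured one):

```javascript
// Hours until a monthly data quota is exhausted at a given message rate.
function hoursUntilQuota(msgsPerSec, bytesPerMsg, quotaMB) {
  var bytesPerHour = msgsPerSec * bytesPerMsg * 3600;
  return (quotaMB * 1024 * 1024) / bytesPerHour;
}

// 100 msgs/s at ~120 bytes each exhausts 200 MB in roughly 5 hours.
var hours = hoursUntilQuota(100, 120, 200);
```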

Now let’s look at the limit to max 3000 function node. Currently, the maximum is set to 30 seconds worth of data using a simple count.


Double-click the node to see the function code:

var count = context.get('count') || 0;
count += 1;
context.set('count',count);
if (count <= 3000) {
   return msg;
}

Now, consider the reset node. The function node associated with this node is set to send the next 30 seconds worth of data to the message queue.


Double-click the function node. It is implemented as follows:

context.set('count',0);
return msg;

The next two steps switch this simulator between the broken and healthy states. To simulate faulty or broken data, click the broken inject node, which triggers the associated function node.


The only thing this node does is update the Lorenz Attractor model constants:

context.set('h',0.008);
context.set('a',30);
context.set('b',128);
context.set('c',28/3);

return msg;

And, of course, take a look at the function to switch it back to a healthy state.


context.set('h',0.008);
context.set('a',10);
context.set('b',28);
context.set('c',8/3);

return msg;

Last, but not least, let’s look at how this data travels to the IBM Watson IoT Platform’s MQTT message broker.


You can leave the configuration as it is; the credentials are injected for you by Cloud Foundry running on IBM Cloud.



You’ve successfully deployed a test data simulator creating a time series of events sampled from a physical model. You can also switch between two states (healthy and broken) for anomaly detection and classification.

You can now use this test data in the other tutorials in this series. You’ll be able to develop deep learning cognitive IoT solutions for anomaly detection with Deeplearning4j, Apache SystemML, and TensorFlow (TensorSpark).