Understand Anomaly Detection using moving z-score (optional)
Although not strictly necessary, a basic understanding of how the anomaly detection algorithm used in this tutorial works is very beneficial. I won’t make it too complicated, I promise.
Basically the algorithm depends on two statistical measures, mean and standard deviation:
- Mean
Mean (in layman’s terms also called average) is a measure of the central tendency of your data. It can be calculated as follows:
This means you just sum up all individual values and divide by the number of values you’ve summed up.
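Written out in standard textbook notation, for n values x_1, ..., x_n (the symbol names are the usual ones and are assumed here), the mean is:

```latex
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```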
- Standard Deviation
Standard deviation is a measure of how widely the data is spread around the mean and can be calculated as follows:
(Note that the mean, written as x-bar in the previous formula, is now denoted by the Greek letter mu, but I won’t go into the details here.) What you can observe is that, although the formula looks slightly more complicated, the distance between every data point (or measurement) and the mean is evaluated and aggregated into a single number. So the farther the data points are spread from the mean, the higher the standard deviation. This is important because if your data is already widely distributed around the mean, a value has to lie even farther away from the mean to count as an anomaly.
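In standard notation this is usually written as shown here. I am assuming the population form, which divides by n; the sample form divides by n - 1 instead.

```latex
\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^2}
```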
So now we have all the ingredients to calculate the z-score, which is defined as follows:
This means that for every measurement you just subtract the mean and divide the result by the standard deviation. So we are nearly done. The only thing left to do is to turn this into a “moving z-score”, since we want to detect anomalies on time series, right?
The trick is simple. Instead of using ALL measurements for calculating mean and standard deviation, we only take the latest k measurements into account. This approach is called windowing and is described here very nicely.
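For a single measurement x, with mean mu and standard deviation sigma, the standard definition reads:

```latex
z = \frac{x - \mu}{\sigma}
```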
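To make the windowing idea concrete, here is a minimal Python sketch of a moving z-score. It is not the code running in the NodeRED flow. The function name `moving_z_scores` is made up for illustration, and I am assuming that the window includes the current measurement, that the population standard deviation is used, and that a window with zero spread yields a z-score of 0.

```python
from collections import deque
from math import sqrt


def moving_z_scores(values, k):
    """Return the z-score of each value relative to the latest k values.

    Assumptions (not taken from the tutorial's flow): the window includes
    the current value, the population standard deviation is used, and a
    window with zero spread gives a z-score of 0.0.
    """
    window = deque(maxlen=k)  # automatically drops the oldest value
    scores = []
    for x in values:
        window.append(x)
        n = len(window)
        mean = sum(window) / n
        std = sqrt(sum((v - mean) ** 2 for v in window) / n)
        scores.append(0.0 if std == 0 else (x - mean) / std)
    return scores


# Four steady readings followed by a sudden jump, with a window of k = 4.
print(moving_z_scores([1, 1, 1, 1, 5], k=4))
```

The steady readings score 0, and only the jump at the end produces a z-score that clearly differs from 0. A larger k makes the score less sensitive to short bursts; a smaller k reacts faster but is noisier.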
Deploy the application
To keep things fast-track, you can just click on the deploy button below, which will automatically deploy a NodeRED data flow tool acting as a device simulator in the IBM Cloud. It also comes with a pre-configured “edge” implementation of the algorithm mentioned above. As already mentioned, this tutorial is based on a full-length tutorial, which can be found here.
The full tutorial mentioned above explains the concept of “Cognitive IoT”, where advanced machine learning algorithms and neural networks can be trained and run in various locations (on the edge of an IoT system, in a batch processing system or in a real-time data processing system in the cloud). In this fast-track tutorial, however, we concentrate only on the edge rule, which has been obtained by running the full-stack process mentioned above.
Please click on the following deploy button:
What this does is create a NodeRED instance in the IBM Cloud with a data flow pre-configured for our application. Please log in with your IBM Bluemix account and click on “deploy”.
Understand what’s happening
After successful deployment you’ll see a screen like this. Please click on “view app”:
You’ll be taken to the NodeRED flow editor, where you can see the already deployed and running application. Please have a look; it should look roughly like the following:
So let me walk you through each element:
- NodeRED is free, open source and runs everywhere: in the IBM Cloud, in any other cloud or data center, on your laptop and even on an IoT Gateway like a Raspberry Pi. So consider this flow to be running on an IoT Gateway connected to an elevator and measuring the voltage of the main driver motor. As we don’t have a Raspberry Pi in place, we are just simulating these sensor values using an “Inject” node in NodeRED. Otherwise you would see a dedicated sensor node here
- In addition we want to send data upstream to the cloud, so let’s add a time-stamp. It is always good to generate a time-stamp as close as possible to the sensor (temporally and spatially), so that this value can be referred to as “event time” rather than “processing time”
- To stream this data to the IBM Watson IoT Platform via MQTT, only one simple node is necessary
- In addition we want to create a little dashboard to monitor the voltage sensor values
- In order to achieve this we need to shift and shuffle the values a bit
- Of course we want to plot the moving z-score as well – in parallel to the voltage in order to really understand what’s going on
- Again we need to do some shifting and shuffling as a preparation
- Now we generate an alert message in case the z-score drops below -0.5, which means some major fluctuation has taken place recently
- We just display this alert message under the two other charts
- In order to get rid of the message once in a while we reset it
- And delay the deletion by 5s, so the message stays displayed for 5s
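The alert part of the flow (threshold, display, delayed reset) can be sketched in a few lines of Python. This is only an illustration of the logic described above, not the NodeRED nodes themselves. The class name `AlertDisplay`, the message text and the explicit `now` argument (seconds) are assumptions I made to keep the sketch self-contained and easy to test.

```python
ALERT_THRESHOLD = -0.5  # z-score threshold used in this tutorial
HOLD_SECONDS = 5.0      # how long the message stays on the dashboard


class AlertDisplay:
    """Hypothetical stand-in for the alert, reset and delay nodes."""

    def __init__(self):
        self.message = ""
        self.clear_at = None  # time at which the message is removed

    def update(self, z_score, now):
        """Process one z-score at time `now` and return the shown text."""
        if z_score < ALERT_THRESHOLD:
            # Raise (or refresh) the alert and schedule its deletion.
            self.message = f"Alert: z-score {z_score:.2f} is below {ALERT_THRESHOLD}"
            self.clear_at = now + HOLD_SECONDS
        elif self.clear_at is not None and now >= self.clear_at:
            # The hold time has passed, so the message is reset.
            self.message = ""
            self.clear_at = None
        return self.message


display = AlertDisplay()
print(repr(display.update(0.1, now=0.0)))   # no alert yet
print(repr(display.update(-0.8, now=1.0)))  # alert raised
print(repr(display.update(0.0, now=3.0)))   # still shown, 5s not over
print(repr(display.update(0.0, now=6.5)))   # cleared again
```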
Observe what's going on using the real-time dashboard
In order to open the dashboard just click on the dashboard tab and on the dashboard icon as shown here:
You will observe two time series charts (run charts), one for voltage and one for the moving z-score:
Wait for some time until you observe a z-score below -0.5 and you’ll see that an alert message is generated. Of course you can also trigger something more important, like initiating an emergency shutdown of the system or raising an alert, either by sending an email/SMS directly from the Edge using a NodeRED node (see twilio or email for this) or by sending the alert upstream to the cloud using MQTT. The latter would be a perfect example of how Edge analytics can reduce the amount of data transferred to the cloud by adding intelligence to the Edge gateway device.
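To illustrate the data reduction, here is a small Python sketch of an edge-side filter that only passes on readings worth sending upstream. The function name `select_upstream`, the message fields (`ts`, `voltage`, `zscore`) and the sample values are all made up for this example; the actual MQTT publishing step is left out on purpose.

```python
def select_upstream(readings, threshold=-0.5):
    """Keep only the readings whose z-score is below the alert threshold."""
    return [r for r in readings if r["zscore"] < threshold]


# Invented sample readings, only to show the shape of the data.
readings = [
    {"ts": 1, "voltage": 230.1, "zscore": 0.3},
    {"ts": 2, "voltage": 229.8, "zscore": -0.1},
    {"ts": 3, "voltage": 221.4, "zscore": -1.7},
    {"ts": 4, "voltage": 230.0, "zscore": 0.2},
]

to_send = select_upstream(readings)
print(f"{len(to_send)} of {len(readings)} readings would be sent upstream")
```

In this sketch only the one anomalous reading leaves the gateway, while the other three stay local.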
Thanks for sticking with me; we are done. I hope this tutorial was of value to you. Please let me know your questions, comments and thoughts below…