This article demonstrates how to use the AnomalyDetector operator, which is capable of detecting anomalous subsequences in a streaming time series.

Introduction

The AnomalyDetector operator performs online anomaly detection on a time series. More specifically, it reports subsequences whose pattern deviates from the recent pattern of the incoming time series. This type of operator has uses in many industries. One example is the medical industry: by using this operator in conjunction with patient monitors, medical staff can be alerted immediately to changes in patient vital signs.

A time series is a sequence of data points, typically consisting of successive measurements made over a time interval. Examples of time series are ocean tides, counts of sunspots, and the daily closing value of the Dow Jones Industrial Average.

Time series – Wikipedia, the free encyclopedia
https://en.wikipedia.org/wiki/Time_series

The following image was developed using actual output from the AnomalyDetector operator. As the time series was ingested by the operator, the anomaly detection algorithm analyzed the patterns to determine if there were any anomalies. The orange area was reported by the AnomalyDetector operator as being anomalous.

 

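[Image: chart of the input time series, with the region reported as anomalous by the AnomalyDetector operator shaded in orange]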

 

How it works

The AnomalyDetector operator maintains a recent history of the input time series, which is referred to as the reference pattern. Whenever the AnomalyDetector ingests a tuple, that tuple is added to a buffer called the current pattern (the current pattern is essentially the most recent set of data points received). When this occurs, the operator compares the current pattern with the reference pattern. This comparison operation calculates a score that indicates how similar or dissimilar the current pattern is compared with the reference pattern. The higher the score, the more dissimilar the patterns are.

The following example will demonstrate in more detail how the underlying anomaly detection algorithm works.

Example

In this example, I will provide a high-level demonstration of the algorithm that is used by the AnomalyDetector operator. Rather than discuss every possible parameter, I will focus only on those parameters that are necessary to understand the algorithm.

For this example, the following parameter values will be used:

referenceLength: 10
patternLength: 3
patternCount: 5 (default)

The referenceLength parameter specifies the size of the reference pattern.
The patternLength parameter specifies the size of the current pattern.
The patternCount parameter specifies how many times the current pattern will be compared against sub-sequences of the reference pattern.

 

For this example, I will use the following time series. The red square represents the reference pattern, which has a length of 10 (as defined by the referenceLength parameter). The blue square represents the current pattern, which has a length of 3 (as defined by the patternLength parameter).

Note: The boxes in the following images include both the start and end data points. For example, the blue box in the image below includes points 8, 9 and 10 (in interval notation, this would be written as [8,10]).

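[Image: the example time series, with the reference pattern in a red box and the current pattern in a blue box]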

 

The first step is to add a new point to this time series. When the new point is added, the current pattern will be updated to include the new point. (The reference pattern does not get updated until the end, once all of the comparison operations are performed.)

[Image: the time series after the new point is added, with the current pattern shifted to include it]
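To make this step concrete, here is a minimal Python sketch of the two buffers. It is only an illustration of the description above, not the operator's implementation: the data values are made up, and the names `reference`, `current` and `add_point` are hypothetical.

```python
from collections import deque

REFERENCE_LENGTH = 10  # referenceLength from the example
PATTERN_LENGTH = 3     # patternLength from the example

# Points 1..10 of an invented series (values are illustrative only).
history = [4.0, 5.0, 4.5, 5.5, 5.0, 4.0, 4.5, 5.0, 5.5, 5.0]

# The reference pattern holds the last 10 points; the current pattern
# holds the last 3 (points 8, 9 and 10 before the new point arrives).
reference = deque(history, maxlen=REFERENCE_LENGTH)
current = deque(history[-PATTERN_LENGTH:], maxlen=PATTERN_LENGTH)


def add_point(value):
    """Add a new point to the current pattern only; the reference waits."""
    current.append(value)  # the oldest point of the current pattern drops out


add_point(9.0)           # point 11 arrives
print(list(current))     # [5.5, 5.0, 9.0]
print(len(reference))    # 10 (unchanged until the comparisons are done)
```

Using a `deque` with `maxlen` means the oldest point falls out automatically as each new point is appended.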

 

Once the new point has been added, the operator will begin comparing the current pattern with sub-sequences of the reference pattern. Each sub-sequence has a length of 3, which is the same as the length of the current pattern (defined by the patternLength parameter). The following image demonstrates what the first sub-sequence looks like:

[Image: the first sub-sequence of the reference pattern, spanning points 1 to 3]

The above image shows that the first sub-sequence spans 3 points (from 1 to 3, inclusive). The anomaly detection algorithm will compare this sub-sequence of the reference pattern with the current pattern and calculate a score. Once this has completed, the sub-sequence will shift one step to the right and another comparison will be done (the number of steps that the sub-sequence shifts can be set using the stepSize parameter). There will be a total of 5 sub-sequence comparisons performed, as specified by the patternCount parameter.
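The positions of the sub-sequences can be worked out with a few lines of Python. This is a sketch under the assumptions of this example; the function name `subsequence_windows` is hypothetical, and the bounds check is my own addition rather than documented operator behavior.

```python
def subsequence_windows(reference_length, pattern_length, pattern_count, step_size=1):
    """Return 1-based, inclusive (start, end) pairs for each comparison window."""
    windows = []
    for i in range(pattern_count):
        start = 1 + i * step_size
        end = start + pattern_length - 1
        if end > reference_length:
            # Assumption: a window may not run past the reference pattern.
            raise ValueError("window runs past the end of the reference pattern")
        windows.append((start, end))
    return windows


windows = subsequence_windows(reference_length=10, pattern_length=3, pattern_count=5)
print(windows)  # [(1, 3), (2, 4), (3, 5), (4, 6), (5, 7)]
```

With the example values, the five windows cover points 1 through 7 of the reference pattern.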

The following images demonstrate the remaining sub-sequence comparisons.

[Images: the four remaining sub-sequence comparisons, with the sub-sequence shifted one step to the right each time]

Once all of the comparisons have completed, an aggregated score is calculated and compared against the value specified by the confidence parameter. If the aggregated score is equal to or greater than the confidence value, the current pattern is considered anomalous and the AnomalyDetector operator submits a tuple containing information about the anomalous pattern.
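The scoring step can be sketched in Python as follows. Please treat this strictly as an illustration: the operator's actual scoring function is not described here, so the Euclidean distance and the averaging used below are stand-ins that I chose, and the function names are hypothetical. The threshold check is written as "equal to or greater than", following the description of the confidence parameter.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length patterns (a stand-in metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def aggregated_score(reference, current, pattern_count, step_size=1):
    """Average the distances between the current pattern and each sub-sequence.

    Averaging is an assumption made for this sketch only.
    """
    n = len(current)
    scores = [
        distance(reference[i * step_size : i * step_size + n], current)
        for i in range(pattern_count)
    ]
    return sum(scores) / len(scores)


def is_anomalous(score, confidence):
    """A pattern is reported when its score reaches the confidence value."""
    return score >= confidence


flat_reference = [1.0] * 10
print(aggregated_score(flat_reference, [1.0, 1.0, 1.0], pattern_count=5))  # 0.0
print(aggregated_score(flat_reference, [1.0, 1.0, 5.0], pattern_count=5))  # 4.0
```

A current pattern that matches the reference scores 0, while one that ends in a spike scores higher, which mirrors the rule that a higher score means more dissimilar patterns.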

The last step is to update the reference pattern to include the new time series point. Once this is done, the process will repeat.

[Image: the reference pattern updated to include the new point]
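Putting the steps together, the whole cycle can be sketched as a small Python generator. This is a toy model of the walkthrough, not the operator itself: the Euclidean distance, the averaging, the warm-up behavior (no scoring until the reference pattern is full) and the name `detect` are all assumptions made for illustration.

```python
from collections import deque
import math


def detect(series, reference_length=10, pattern_length=3, pattern_count=5,
           step_size=1, confidence=4.0):
    """Yield (index, score) for each point whose current pattern is flagged."""
    reference = deque(maxlen=reference_length)
    current = deque(maxlen=pattern_length)
    for index, value in enumerate(series):
        current.append(value)  # step 1: add the new point to the current pattern
        if len(reference) == reference_length:  # assumed warm-up rule
            ref, cur = list(reference), list(current)
            scores = []
            for i in range(pattern_count):  # step 2: compare against each sub-sequence
                window = ref[i * step_size : i * step_size + pattern_length]
                scores.append(math.sqrt(sum((w - c) ** 2 for w, c in zip(window, cur))))
            score = sum(scores) / len(scores)  # step 3: aggregate (stand-in: mean)
            if score >= confidence:
                yield index, score
        reference.append(value)  # step 4: update the reference pattern last


# A flat series with a single spike at index 12.
series = [1.0] * 12 + [6.0] + [1.0] * 12
print(list(detect(series)))  # [(12, 5.0), (13, 5.0), (14, 5.0)]
```

The spike is flagged for as long as it remains inside the current pattern, which is three points because patternLength is 3.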

Operator Details

In the previous section, I discussed the underlying algorithm that drives the AnomalyDetector operator. This section covers some important aspects of the operator itself. The complete documentation for the AnomalyDetector operator can be found on the AnomalyDetector Knowledge Center page.

Parameters

The AnomalyDetector operator comes with a number of parameters. Details for each of the available parameters can be found on the AnomalyDetector Knowledge Center page. However, there are some important parameters that I want to highlight here.

patternLength – Specifies the length of the ‘current pattern’
referenceLength – The number of tuples to store as part of the ‘reference pattern’
patternCount – The number of subsequence patterns that the current pattern will be compared against
stepSize – Specifies how many steps the sliding window will shift (default value is 1)
confidence – Limits the output to only those sequences that have a score equal to or greater than the specified value

Inputs

The AnomalyDetector operator analyzes a single, continuous time series. The inputTimeseries parameter must be set to an attribute on the input port with a type of float64.

Outputs

There are four output functions that return information about detected anomalies:

getSubsequence() – Returns a list<float64> that contains the anomalous pattern.
getScore() – Returns the calculated score of the anomalous pattern.
getStartTime() – Returns the start time of the anomalous pattern (can only be used if the inputTimestamp parameter is specified)
getEndTime() – Returns the end time of the anomalous pattern (can only be used if the inputTimestamp parameter is specified)

 

Sample on GitHub

You will find a working sample on GitHub that contains the AnomalyDetector operator: https://github.com/IBMStreams/samples/tree/master/timeseries/AnomalyDetectorSample

In this sample, the incoming time series represents the number of packets per second that a NIC received, sampled every second over a 3-minute (180-second) period. Here is an example of what the incoming data looks like:

[Image: chart of the incoming data, packets per second over 180 seconds]

As can be seen from the above, there are 2 obvious anomalies around 60 seconds and 130 seconds. After streaming the data through the AnomalyDetector operator, the following scores (confidence values) were calculated.

[Image: chart of the scores calculated by the AnomalyDetector operator over the same period]

From the above chart, we can see that around the same time that the packet count spiked, the score returned by the AnomalyDetector jumps dramatically.

Conclusion

The AnomalyDetector operator is easy to implement and yet powerful in its capabilities. The operator is available in the com.ibm.streams.timeseries toolkit packaged with Streams 4.0.0.0 and later.

 

 

10 comments on "Anomaly Detection in Streams"

  1. wonderful article …. very well explained.. & thank you very much

  2. DanDebrunner January 07, 2016

    Great article.

    I looked at the docs and it’s not obvious to me what range the score can take, so it’s unclear how I would select a confidence value. Naïvely I would have expected between 0 and 1, but your graph shows much higher values.

    The documentation for getScore() is less than helpful, a most interesting sentence:

    “Returns the anomaly score of the patternscore of anomaly score of the pattern”

    • James Cancilla January 08, 2016

      Hi Dan,

      This is a great question and something I have thought about before. One of the challenges with the score is that it does not have a defined range of values. The value of the score is relative to the input data, so the range of values that the score can take will vary depending on the data you send it.

      Normalizing the score to be between 0 and 1 was something that was discussed early on in the design process. The challenge here is that, in theory, the score does not have an upper bound. The more anomalous the current pattern is relative to the reference pattern, the larger the score will be. Without an upper bound, normalization becomes difficult.

      One of the approaches I generally take in order to determine the confidence value is by using a trial and error approach. I stream the expected data through the operator and get a baseline for the score values. Then I stream the expected data with anomalous patterns through the operator and compare the anomalous scores with the baseline scores. From there, I make a judgement call as to what the confidence value should be. That being said, I am certainly open to hearing other ideas you may have with regards to determining the confidence value.

      With regards to the docs, I’ll take a look at fixing that. Thanks for bringing that to my attention.

  3. YMDH_sathish_Palaniappan September 23, 2016

    Nice article !

    I would like to understand: is there a way to retrieve the score even when there is no anomaly (within the confidence level) in the current input data?

    • James Cancilla September 23, 2016

      Unfortunately it is not possible to retrieve only the score from the operator when there is no anomaly detected.

      However, there is a “trick” you can use if you really want to access all of the scores. In the AnomalyDetector operator, set the confidence level to 0. This will cause the operator to submit a tuple for every input tuple it receives. Then use a Custom operator downstream to filter out tuples that are below a certain confidence level. Since the Custom operator is receiving all of the scores, you could choose to perform a different action on the scores that are below the confidence level (e.g. save them, submit them via a second output port, etc.).

  4. Undoubtedly great article!

    Can I get hold of the mathematical function in place, or even an empirical version of it, which would bring closure on how high an anomalous score can be? For instance, in the example on GitHub (https://github.com/IBMStreams/samples/tree/master/timeseries/AnomalyDetectorSample) you have used 3E9 as the value of the confidence parameter, which makes the operator consider inputs in [57,66] as anomalous. So without any trial and error, how can I estimate the confidence value I should use?

    Thanks!

    • James Cancilla September 08, 2017

      The underlying algorithm used to calculate the score is proprietary and is not something I can provide details on. What I can say is that the score is determined, at least in part, by the distance between the current pattern and the reference pattern.

      The challenge with this operator (and the underlying algorithm being used) is that the score values are highly dependent on the input data. Different data sets are going to result in a different range of scores. Currently, the only way to determine the confidence value is via trial and error. I have discussed the issue of normalizing the scores with our research team in order to simplify this process, but at present there is no solution available.

  5. abhinavankur January 18, 2018

    Hello James,

    I’ve used this operator and it works considerably well. However, I have a question on seasonality. How should I go about incorporating seasonality in the data? For example, if I need to find anomalies during a week and the confidence for Wednesday is remarkably higher than that of Saturday [because of the domain], how should I design my application or configure the parameters of the operator to address this concept drift?

    Thanks!

  6. If the AnomalyDetector is using a very small history (reference pattern) compared with the season length, it may follow the seasonality without treating seasonal effects as anomalies. As the calculated score depends on the input data, which behaves seasonally, the score itself can have a seasonal component. This makes it hard to define a single threshold for the score without getting false positives.

    The better way to deal with seasonal data is to first decompose the data into trend, seasonal, and residual components and then to analyze only the residual component with the AnomalyDetector operator. For the decomposition of seasonal time series, the STD2 operator can be used. This approach is discussed in the article “Detecting Anomalies in Seasonal Data” here:

    https://developer.ibm.com/streamsdev/2016/05/03/detecting-anomalies-in-seasonal-data/
