The¬†AnomalyDetector operator in the time series toolkit can be an invaluable tool for detecting¬†anomalies in real-time. If you are not familiar with the AnomalyDetector operator,¬†please take a look at this article first:¬†Anomaly Detection in Streams.¬†In summary, this operator works by comparing the current¬†time series data with a reference pattern (the reference pattern is updated based on a recent history of input time series).¬†The more the current time series pattern diverges from the¬†reference pattern, the more anomalous the current pattern is.
When dealing with time series data that is seasonal, detecting anomalies can be difficult. Let’s take the following example:
The above graph contains seasonal time series data, with each season having the same general shape. As the graph indicates, there is¬†an unexpected spike (i.e. an anomaly) at t=126. This spike may actually be a useful data point and it would be helpful¬†if the AnomalyDetector operator could detect this. Unfortunately,¬†by using the data as it is, the AnomalyDetector operator will not be able to definitively detect this spike. The following graph shows¬†the anomalous scores returned by the AnomalyDetector operator. The higher the score, the more likely the data is anomalous at the given time.
There are some interesting observations that from the above first. First, there are 3 larger spikes in the scores, which¬†seems to indicate that multiple anomalies. The first spike is likely a false positive due to the fact that the AnomalyDetector has not yet had time to settle. The second spike does occur around t=126, which seems to indicate that the anomaly was detected to some degree. However, there is a third spike that occurs around t=135 and is actually larger than the spike at t=126. This third spike may be due to the fact that the anomaly at t=126 is now¬†part of the reference pattern and so it’s causing data around t=135 to have a larger score.
Regardless of how you choose to interpret this graph, what is clear is that the AnomalyDetector was not able to definitively determine¬†where the anomaly exists¬†from the input time series graph. Now let’s look at one possible technique¬†that can be used to aid in detecting the anomaly.
Decomposing the Time Series Data
When I originally tackled this problem, my first thought was “wouldn’t be great if I could¬†somehow¬†remove or flatten the seasons so that I was left with only the noise?” Thankfully, this is exactly what the STD2 operator in the time series toolkit does. The STD2 operator (which standands for¬†Seasonal¬†Trend¬†Decomposition) is capable of decomposing a time series graph into three distinct components: season, trend and residual. The residual component is the time series with the season and the trend remove. For more information on how the STD2 operator works, take a look at the article¬†Real-Time Decomposition of Time Series.
By streaming the original¬†time series data through the STD2 operator and extracting the residual component, I get the following graph:
From this graph, it is¬†very obvious¬†that there is an anomaly at t=126. By using¬†the residual component as the input to the AnomalyDetector operator, the operator should easily be able to to detect¬†the anomaly¬†from the above graph. Here is a graph showing the anomalous scores returned by the AnomalyDetector operator when using the¬†residual values¬†as the input:
As expected, the result of the AnomalyDetector¬†clearly indicates that there is an anomaly¬†around t=126. These results are far more definitive than the results¬†I originally received when streaming the data directly into the AnomalyDetector operator.
For reference, here is what the graph of the SPL application that contains both the STD2 operator and AnomalyDetector operator. The source code for this application¬†can be found in the¬†STD2Samples project on GitHub, called “STD2Anomaly”.