Did you know you could use Anomaly Detection within Watson Discovery Service to identify and respond to changes in your data over time? You can track anomalies in trends of specific entities, sentiment, categories and more over time with a simple modification of an API call and no in-depth data science skills. For example, Anomaly Detection allows you to monitor a particular brand or organization for mentions in the news, so you can quickly take action if there is a spike in the number of mentions. There are many business domains where such a spike may be particularly relevant, including, but not limited to, advertising technology and brand monitoring. In this post, we’re illustrating how this works with pre-enriched news data, but Anomaly Detection is available for any data within Watson Discovery Service, including your own data that you ingest into the service.
You can create your own collections and upload files into your instance of Watson Discovery (Lite, Standard, or Advanced plan). All files uploaded to the collections go through an enrichment pipeline, which enables cognitive search of the data. Each instance also comes with a pre-enriched news collection. For a technical introduction to Watson Discovery please visit: https://developer.ibm.com/tv/building-with-watson-a-technical-intro-to-watson-discovery-service/. For this post, we will use the Watson Discovery pre-enriched news collection, but you can use your private Discovery collections as well.
All Discovery collections can be queried programmatically through the Discovery API. Discovery supports several types of queries, filters, and aggregations, all of which are covered in the online documentation. In this example, we’re using anomaly detection with a Timeslice Query on news data.
Timeslice Queries in Watson Discovery
A Timeslice Query is a type of aggregation query in Watson Discovery. With this query, you can get a count of the number of documents grouped by a date field within your collection. For example, in a collection containing hotel reviews, a Timeslice Query can be used to return the number of reviews per day by grouping on the date the review was posted.
Valid date interval values are minute, hour, day, week, month, and year. The syntax is timeslice(<field>,<interval>,<time_zone>). To use Timeslice, the time fields in your documents must be of the date data type and in UNIX time format. You can create a Timeslice if your documents contain date fields with values such as 1496228512.
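As a rough illustration of the syntax above, here is a minimal Python sketch that converts a date to UNIX time and assembles a timeslice string. The helper names and the field name publication_date are hypothetical, and treating the time zone as optional is an assumption, not something taken from the script.

```python
from datetime import datetime, timezone

# Interval units listed in this post.
VALID_UNITS = {"minute", "hour", "day", "week", "month", "year"}


def to_unix_seconds(dt):
    """Convert a timezone-aware datetime to whole UNIX seconds."""
    return int(dt.timestamp())


def build_timeslice(field, interval, time_zone=None):
    """Assemble timeslice(<field>,<interval>,<time_zone>).

    Assumption: the time zone argument may be left out.
    """
    # Strip an optional leading count such as the "6" in "6hour".
    unit = interval.lstrip("0123456789")
    if unit not in VALID_UNITS:
        raise ValueError("unsupported interval: " + interval)
    parts = [field, interval]
    if time_zone:
        parts.append(time_zone)
    return "timeslice(" + ",".join(parts) + ")"


# The sample value from this post corresponds to 31 May 2017, 11:01:52 UTC.
stamp = to_unix_seconds(datetime(2017, 5, 31, 11, 1, 52, tzinfo=timezone.utc))
query = build_timeslice("publication_date", "1day")
```

Validating the unit up front gives a clear local error for a mistyped interval instead of a rejected request.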
Anomaly Detection in Timeslice Queries
With anomaly detection, you can identify unusual spikes or dips in the number of documents within a time period.
To request anomaly detection in a Timeslice query, simply add “anomaly:true” to the Timeslice aggregation. There is no extra coding to be done besides adding the parameter.
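To make this concrete, here is a small Python sketch of what adding the parameter and reading the result might look like. It assumes the parameter goes in as the last comma-separated argument and that anomalous buckets in the response carry an "anomaly" key. The field name and the sample buckets are made up for illustration and are not real service output.

```python
def timeslice_with_anomaly(field, interval):
    """Build a timeslice aggregation string with anomaly detection switched on."""
    return "timeslice(" + ",".join([field, interval, "anomaly:true"]) + ")"


def anomalous_buckets(buckets):
    """Keep only the buckets flagged as anomalous.

    Assumption: flagged buckets carry an "anomaly" key and others do not.
    """
    return [bucket for bucket in buckets if bucket.get("anomaly") is not None]


# Hypothetical field name.
aggregation = timeslice_with_anomaly("publication_date", "1day")

# Made-up buckets for illustration only.
sample = [
    {"key_as_string": "2017-05-29", "matching_results": 12},
    {"key_as_string": "2017-05-30", "matching_results": 87, "anomaly": 1.0},
    {"key_as_string": "2017-05-31", "matching_results": 10},
]
flagged = anomalous_buckets(sample)
```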
To run these queries programmatically, you can download this Python script. (Note: The script is saved as a .txt file, please rename to
- Perform a timeslice aggregation to find anomalous timeslices in news article frequency for a given target
- For each anomalous timeslice, extract the top 5 keywords in that timeslice
- For each anomalous timeslice, search for the news article title which best matches the extracted top 5 keywords.
To use the script, you must set the following environment variables:
DISCO_USER: Username for discovery account
DISCO_PASS: Password for discovery account
DISCO_EID: News Environment ID
DISCO_CID: News Collection ID
To get the Discovery username and password, log in to your Bluemix account and click on the Discovery instance. If you do not have a Discovery instance yet, create one from the Catalog section, where you will find Discovery under Watson Services.
Once the Discovery instance is created, click on it to access service credentials.
Copy the username and password and set the DISCO_USER and DISCO_PASS environment variables in a terminal.
For the pre-enriched news collection, the collection_id is “news”, and the environment_id is “system”.
If you are using your own private collection then make sure you use the corresponding values.
Go ahead and copy those two values and set the environment variables.
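If you want to confirm the variables are visible before running anything, a check along these lines can help. This is a sketch, not the code from the script, and load_discovery_config is a hypothetical helper name.

```python
import os

# The four variables named in this post.
REQUIRED_VARS = ("DISCO_USER", "DISCO_PASS", "DISCO_EID", "DISCO_CID")


def load_discovery_config(environ=None):
    """Return the required settings as a dict, or fail listing what is missing."""
    if environ is None:
        environ = os.environ
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

Calling load_discovery_config() with no arguments reads the real environment, while passing a plain dict makes the check easy to try out without touching your shell.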
Before we run the Python script, there are a few dependencies we need to download.
Download the requirements.txt file and use pip to install the required dependencies by typing the following in a terminal.
pip install -r requirements.txt
Now we are ready to execute the Python script. You can invoke the script with two positional arguments: target and interval. Target is the name of the entity you would like to check for anomalies. Interval is the length of each time interval to process, for example 1hour, 6hour, or 1day.
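For readers curious how two positional arguments like these could be handled, here is a minimal argparse sketch. It is an illustration rather than the script's actual parsing code, and the accepted interval pattern (a count followed by a unit) is an assumption based on the examples above.

```python
import argparse
import re

# A count followed by one of the interval units, e.g. "1day" or "6hour".
INTERVAL_RE = re.compile(r"^(\d+)(minute|hour|day|week|month|year)$")


def parse_interval(text):
    """Split an interval such as "6hour" into (6, "hour")."""
    match = INTERVAL_RE.match(text)
    if match is None:
        raise argparse.ArgumentTypeError("invalid interval: " + text)
    return int(match.group(1)), match.group(2)


def parse_args(argv):
    """Parse the two positional arguments: target and interval."""
    parser = argparse.ArgumentParser(description="Find anomalies in news mentions.")
    parser.add_argument("target", help="entity name to check for anomalies")
    parser.add_argument("interval", type=parse_interval, help="for example 1day")
    return parser.parse_args(argv)


args = parse_args(["Symantec", "1day"])
```

Using parse_interval as the argparse type means a malformed interval is reported as a usage error before any query is sent.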
Go ahead and type this in the terminal in the directory where the script is located.
python anomaly_workflow.py Symantec 1day
At the time of writing this post, four anomalies were detected for Symantec Corporation. You should see output similar to the following:
You should also see the plot window showing the news articles which were the reason for the anomaly.
The Python script is commented in all the important sections, so go ahead and read through the source code if you are interested, and start tweaking!
This is just one of the many ways you can use the power of Watson Discovery to get insights from your content or the pre-enriched news content. The ability to detect and automatically react to anomalies in your news stream can be a very valuable tool within the Watson Discovery service.
Learn more about Watson Discovery
- Journey: Bring your own data to Watson Discovery Service
- Integrating Slack with Watson Discovery News
- Watson Discovery service on Bluemix
- Watson Discovery on IBM