“Those who fail to learn from history are doomed to repeat it”
These words are often attributed to Winston Churchill (although they were probably first written by the philosopher George Santayana), and they have recently taken on new meaning in the context of security analytics.
The most important development in the world of security is the widespread adoption of analytical techniques to distill massive amounts of logs, flows and raw data into a few meaningful, actionable events. Applying machine learning techniques to security has made this both possible and commonplace. It is changing the way we implement security and is a major shift for us as security professionals.
What is sometimes forgotten is that machine learning and analytics are more effective with more historical data than with less. This holds for supervised and unsupervised techniques alike. Small data causes many problems, a few being:
- Noise becomes a real issue, both in target variables and in features. I recommend the book “The Signal and the Noise: Why So Many Predictions Fail – but Some Don’t” by Nate Silver to understand more.
- “Normal” behavior can easily be flagged as an outlier
- Overfitting becomes very hard to avoid
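To make the second point concrete, here is a minimal sketch of how a short history can turn normal behavior into an "outlier". The daily login counts, the `is_outlier` helper and the z-score threshold of 3 are all illustrative assumptions, not taken from any particular product or from real data. A simple z-score rule stands in for whatever model a real analytics system would use.

```python
from statistics import mean, stdev

def is_outlier(history, value, threshold=3.0):
    """Flag `value` if it lies more than `threshold` sample standard
    deviations away from the mean of `history` (a plain z-score rule)."""
    mu = mean(history)
    sigma = stdev(history)
    return abs(value - mu) / sigma > threshold

# Hypothetical daily login counts: busy weekdays, quiet weekends.
one_week = [98, 102, 100, 97, 103, 22, 18]

# Short history: only five weekdays have been observed so far.
short_history = one_week[:5]

# Long history: eight full weeks, so weekends are part of the baseline.
long_history = one_week * 8

# A perfectly ordinary Saturday with 21 logins.
saturday_logins = 21

print(is_outlier(short_history, saturday_logins))  # flagged: weekends never seen
print(is_outlier(long_history, saturday_logins))   # not flagged: weekends are known
```

With only five weekdays of history the Saturday value is flagged, because the baseline has never seen a weekend. With eight weeks of history the same value falls well within the expected range.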
Hence the words of Churchill/Santayana or, as Mark Twain is reputed to have said, “history doesn’t repeat itself but it often rhymes”. Simply put, the more history you have, the better and more reliable your analytical results are.
Long-term retention of data has always been a mandate driven by compliance. For example, most companies interpret PCI as requiring data to be retained for 13 months. But this has often been implemented using impractical “frozen archives” – archived data that takes weeks to bring back online, making historical compliance reporting possible but very painful.
Since the quality of the analytics generally improves with the amount of data the algorithms have to work with, the need to retain data online is coming to the forefront. Newer security retention systems let you kill two birds with one stone: they make long-term retention for compliance easy while helping the machine learning algorithms perform better. As an example, read the developerWorks article on how to enable years of readily accessible online storage of Guardium data here: https://www.ibm.com/developerworks/library/se-sonarg-big-data-security-guardium-trs/index.html