Digital Developer Conference: Cloud Security 2021 -- Build the skills to secure your cloud and data Register free

Developing a distributed, real-time threat detection engine

Large enterprises process a vast amount of logs every day through their SIEM (Security, Information, and Event Management) systems in an attempt to prevent security breaches. The volume of these logs, along with the false positives and queries that come with them, are only ever increasing. This makes it more and more difficult over time to run large searches and detect threats before any incidents can take place.

In the case of an enterprise company like IBM, there is an overwhelming amount of traffic that comes through the network perimeter. Firewalls on the network perimeter, in DMZs, and throughout other zones of the IBM network must all be able to handle enormous amounts of data in real time.

To make analysts’ lives easier in responding to threats in a timely manner, our team set out to architect an improved custom rule engine that could process complex events in real time with very high throughput speed and low latency, all while remaining distributed to maintain robust fault tolerance. This was the driving goal behind developing the System of Logs Intrusion Detection Engine (SLIDE) as our solution, with Apache Flink as the central hero of our project.

The team who developed SLIDE included myself and four interns: Ayush Khandelwal, Tabor Kvasnicka, Taner Seytgaziyev, and Rory Yang. This article describes the process of designing and developing the threat detection engine, including the technologies we used, use cases that SLIDE supports, and our implementation.

High level architecture of SLIDE

We primarily used open-source Apache software products for our threat detection engine. The highly scalable and efficient architecture is meant to ingest, parse, filter, and provide alerts on a data stream to a custom Kibana dashboard. We used these Apache products in the architecture (see the following figure): NiFi, HBase, Kafka, and Flink.

High level architecture of the SLIDE threat detection system

You will notice that SLIDE is ingesting data from two sources: indicators of compromise (IOCs) from ThreatConnect and firewall logs from Palo Alto Networks.

Here is the complete list of technologies that we used:

  • Palo Alto Networks. Palo Alto Networks offers a flexible networking architecture, which makes it simple for security teams to deploy the next-generation firewalls into almost any networking environment.
  • ThreatConnect. ThreatConnect is a threat intelligence platform that provides us with Indicators of Compromise (IOCs) that aid in mitigating threats. ThreatConnect’s REST API was used to pull new IOCs to be correlated with Palo Alto Firewall logs to help in contextualizing potential adversaries.
  • NiFi. Apache NiFi is a flow-based programming tool that allows for powerful and reliable data processing and distribution through an intuitive graphical interface. It’s great for ingesting, enriching, and transporting data from multiple sources to analytic platforms.
  • Kafka. Apache Kafka is a highly scalable, fault-tolerant distributed streaming platform that uses a publish-subscribe messaging system to pass messages in real-time from one endpoint to another. We used Kafka to pull Palo Alto next-gen firewall logs from NiFi and push processed firewall logs into Apache Flink.
  • HBase. Apache HBase is a column-oriented NoSQL database that provides fast record lookups and updates for very large tables on top of Hadoop and HDFS. We use HBase as our database system because it allows for fast lookups and updates of IOC datapoints that were previously ingested.
  • Flink. Apache Flink is a next-generation framework for stateful operations over bounded data sets and unbounded data streams. The asynchronous and incremental checkpointing algorithm enables Flink to maintain application state that is several terabytes in size. This aspect was especially important for SLIDE since we needed a way to process enormous amounts of data efficiently. Most of our use cases for SLIDE were implemented using the Flink DataStream API because of its transformation operators that work seamlessly with streams and windows. Furthermore, Flink is the core of our engine where much of the performance improvements are contained. By leveraging Flink, we can match the huge volume of logs and alerts that enterprises constantly receive on a scale that is orders of magnitude greater than current systems. Meanwhile, we are also using it to design an engine that reduces false positives, reduces time-to-detection, and other refined features to support the work of SOC analysts.
  • Elastic Stack. The Elastic Stack includes Elasticsearch, Logstash, and Kibana. Elasticsearch is a distributed search and analytics engine with the ability to hold unstructured data. It defines different feeds using indexes and the data within those indexes via mappings. While there are several different tools designed for interacting and visualizing the data from Elasticsearch, we used Kibana because it is made by the Elastic team and is designed specifically to be the industry standard for interacting with Elasticsearch data. Kibana is a data-visualization tool used for log and time-series analytics, application monitoring, and operational intelligence use cases. We created several dashboards in Kibana to visualize the alerts from Flink which provide a high-level overview of the alerts, designed to be a method to quickly see when something odd is happening on the network, further allowing a SOC analyst to dig deeper into the alerts.
  • Docker. Docker is a platform which enables the concept of containerization: the use of containers to deploy applications. A container is a running process that is isolated from the host and other running processes, that may also be containers, by means of its added encapsulation features. The features of Docker allowed us to build, run, and interact with the end-to-end solution without worrying about issues related to dependency failure or improperly-configured components.

Use cases for SLIDE

It made the most sense that an intrusion detection engine would be used in a Security Operations Center (SOC) by SOC analysts to investigate alerts. There are many SOC use cases, but the ones for SLIDE centered around firewalls since the engine is absorbing data from Palo Alto Logs and ThreatConnect indicators of compromise (IOCs). After deciding on the end user, six use cases were researched, documented, and implemented.

Most of the use cases dealt with mitigating threats at layers three, four, and seven of the OSI Model. It was found that a lot of malicious activity happens at these layers, especially in regard to DoS and DDoS attacks; therefore, some focus was aimed toward helping a potential SOC Analyst to mitigate the most malicious threats to an organization.

Here are the use cases we implemented (and the layer of the OSI Model each use case addresses):

  • Flag firewall logs that contain malicious IPs as identified by IBM’s Internal Command and Control blacklist (Layer 3).

  • Pinpoint potential firewall port-filtering misconfiguration issues:

    • Flag IP addresses that are associated with malicious ports, with number of connection attempts (layer 7).

    • Flag IP addresses that are associated with insecure ports, with number of connection attempts (layer 7).

  • Correlate the Indicators of Compromise (IOCs) provided by ThreatConnect with the firewall logs by using indicator ratings and confidence rating to help in flagging the threats.

  • Define a baseline of normal traffic by gathering real-time metrics over a window of time to later give alerts on abnormal activity, or spikes, in traffic.

  • Mitigate potential LAND (Local Area Network Denial) attacks by pushing alerts to a dashboard if the source and destination IP addresses are the same X amount of times in a given window of time, and the action value in the firewall log is deny. Otherwise, don’t alert (Layer 4).

  • Gather real-time metrics on website destination IP addresses to detect potential phishing campaigns from nation-states (Layer 3).

The following figure shows the flow of processors that was created in NiFi to generate the destination IP addresses.

Flow of processors created for SLIDE in NiFi

View image larger

Implementation of our threat detection engine

The real power of the threat detection engine consists of Flink DataStream API programs on the backend and a Kibana dashboard on the front end. These two major pieces working together allowed for a full end to end solution, from the filtering and parsing of firewall logs to color-coded bar graphs representing the volume of each different category of threat for actionable insights.

In regard to the technical implementation of use cases for SLIDE, most of the use cases were done using the Flink DataStream API because of its transformation operators that work seamlessly with streams and windows (windows represent time). Functions for the DataStream API include map(), reduce(), and aggregate(), among others, and can be defined by writing classes that extend interfaces, writing an Anonymous Inner Class in Java, or writing lambda functions in Java or Scala. We used Java for the DataStream API programs to filter the data we wanted, as most of us were already accustomed to this language.

The pictures below represent snapshots of the DataStream API programs and the standard outputs after running the main functions. These are all running in the IntelliJ editor. You may also notice that Kafka and Zookeeper have to be running with a JSON log file being passed into each program in order for everything to run properly. This was done using commands inside of a basic Unix terminal, and allowed us to test each individual use case.

The following figure shows which Botnet Command and Control (C&C) Server IP addresses were found as well as the DataStream API code.

Botnet Command and Control Server IP addresses found and the DataStream API code

The following figure shows abnormal activity alerts for TCP and UDP.

abnormal activity alerts for TCP and UDP

The following figure shows an alert for matching source and destination IP addresses that were found, helping to mitigate LAND attacks.

alert for matching source and destination IP addresses found

The following figure shows a POC for what could be a phishing campaign detection with London, United Kingdom showing up several times, and the IP addresses associated with it. This use case would have to be iterated upon further as to not generate too many false positives from friendly cities. It would also need to be associated with click events from email clients.

POC for what could be a phishing campaign detection with London, United Kingdom showing up several times, and the IP addresses associated with it

Kibana dashboard

As mentioned above, Kibana is a data-visualization tool used for log and time-series analytics, application monitoring, and operational intelligence use cases. We created several dashboards in Kibana to visualize the alerts from Flink which provide a high-level overview of the alerts. The Kibana dashboard allows us to view the output from Flink in an aesthetically pleasing format, and also allows for quick insights.

The following figure shows the total alerts of all use cases with the Home tab selected:

Kibana dashboard showing total alerts of all use cases with the Home tab selected

These next two figures represent a visualization of two specific use cases: the Botnet C&C and ThreatConnect IOCs tabs selected respectively.

Kibana dashboard showing visualization of the Botnet C&C

Kibana dashboard showing visualization of ThreatConnect IOCs

Summary

Through our intern project, we were able to explore the efficiency, reliability, and scalability of our architecture. We successfully met our goal to design and develop an end-to-end threat detection engine that can handle millions of logs a day and provide us with real-time analytics and alerts. SLIDE is a custom rule engine that allows for the speedy ingestion and parsing of logs, addressing numerous use cases, and for visualizing alerts through a personalized dashboard. And, all of this was achieved by using the latest open-source Apache technologies, such as Nifi, Kafka, HBase, and Flink along with ElasticSearch, Kibana, and Docker.

In the future, we plan to connect more log sources to our engine and expand its functionality. All components in SLIDE are horizontally scalable and the architecture is set up in a way that makes it incredibly easy to connect new log sources and import a new set of rules using our rule engine parser. SLIDE is a highly efficient, scalable, fault-tolerant, low-latency distributed threat detection engine that can run real time event processing to provide metrics and analytics.