Cybercrime is growing, and as it grows, it becomes more costly and time-consuming to manage. The problem is the constant evolution of threats and of the techniques used to attack systems and defeat their defenses. In this article, we’ll explore the use of machine learning algorithms in threat detection and management.

As our lives become more digital, the data aggregated about us is increasingly at risk. Whether it’s social media, commerce websites, IoT devices, or tracking data from our smartphones and web browsers, entities on the Internet know more about us than we can fathom. The problem is the value of this data and, in many cases, the ease of gaining access to it.

It’s not surprising, then, that machine learning is being applied to computer security to help protect computer systems and our data online. Accenture found in 2019 that only 28 percent of organizations had deployed machine learning methods, yet those methods delivered the second-highest cost savings of any security technology. In this article, we’ll explore security technologies, how machine learning (and deep learning) is being applied to them, and what’s next in the fight against cybercrime and data theft.

Early non-AI approaches

Security threats appeared with some of the earliest personal computers (such as the “Brain” virus that infected PCs in 1986). With no network to travel over, this virus used floppy disks to infect a system and propagate itself further. Once computers became accessible to multiple people (either as multi-user systems or through networking), large-scale security challenges emerged.

An early method, which is still in use today, is blacklisting (see Figure 1). In this approach, a virus database stores signatures (such as cryptographic hashes) of known viruses, and this information is used to regulate access to a system.

Figure 1. Blacklisting through cryptographic signatures
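
As a minimal sketch of how hash-based blacklisting works, the following Python checks a file’s SHA-256 digest against a signature database. The hash value and database here are placeholders for illustration; a real anti-virus engine would query a large, regularly updated database of signatures.

```python
import hashlib

# Hypothetical signature database: SHA-256 digests of known-malicious files.
KNOWN_MALWARE_HASHES = {
    "275a021bbfb6489e54d471899f7db9d1663fc695ec2fe2a2c4538aabf651fd0f",  # placeholder entry
}

def sha256_of_file(path: str) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_blacklisted(path: str) -> bool:
    """Return True if the file's signature matches a known-malware hash."""
    return sha256_of_file(path) in KNOWN_MALWARE_HASHES
```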

A related approach to blacklisting is called whitelisting. A whitelist defines the list of acceptable entities that can access or execute on a system. Both methods are useful, and they complement one another: the whitelist restricts access, and the blacklist deals with threats when they arrive.

Anti-virus applications use signatures to detect potential threats, with the downside that detection is restricted to known viruses.

Network security commonly relies on firewalls and whitelist-like configurations to limit accessibility. Firewalls are configured through rules that define the hosts, applications, and protocols that may communicate with the network.

Email has become a common transport for security threats, in addition to unwanted communication. Early methods to limit this so-called spam included crude filters that used keywords or senders to block access.

Early AI approaches

All of the previously discussed approaches, from signature-based scanning to firewalls and filtering, rely on prior knowledge of the threats. They’re unable to adapt to new threats without being updated. Machine learning can help to fill this gap.

In the context of anti-virus mechanisms, one interesting approach is to ignore signatures and focus on the potential behavior of a program (see Figure 2). Binary analysis, through what’s called static analysis, can reveal a program’s intent as a set of features, such as registering for auto-start or disabling security controls. These features might not always represent malicious intent, but by analyzing and clustering programs based on their discovered behaviors (encoded as a feature vector), it’s possible to identify malware based on its relationship to other malicious programs. When a program clusters with known malicious programs, it might represent a threat.

Figure 2. Behavior-based malware detection
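
To make the clustering idea concrete, here is a small sketch using scikit-learn’s k-means. The behavioral features and labels are invented for illustration; a real system would extract far richer feature vectors through static analysis.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical behavioral features from static analysis, one row per program:
# [registers_for_autostart, disables_security_controls, writes_to_system_dir, opens_network_socket]
features = np.array([
    [1, 1, 1, 1],   # known malware samples
    [1, 1, 0, 1],
    [0, 0, 0, 1],   # benign samples
    [0, 0, 1, 0],
])
labels = ["malware", "malware", "benign", "benign"]

# Cluster programs by behavior; an unknown sample is flagged if it lands
# in a cluster dominated by known-malicious programs.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

unknown = np.array([[1, 1, 1, 0]])   # feature vector of an unlabeled program
cluster = kmeans.predict(unknown)[0]
members = [labels[i] for i, c in enumerate(kmeans.labels_) if c == cluster]
print("flag as potential malware:", members.count("malware") > members.count("benign"))
```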

Network security can also benefit from machine learning. A neural network trained on web-attack payloads can be applied at a firewall to identify potentially malicious packets and block them from entering the network.
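
As a rough sketch of what training such a classifier might look like, the following uses scikit-learn to fit a small neural network on character n-grams of HTTP payloads. The payloads, labels, and model size are illustrative assumptions, not a production rule set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical training data: raw HTTP payloads labeled benign (0) or attack (1).
payloads = [
    "GET /index.html HTTP/1.1",
    "GET /images/logo.png HTTP/1.1",
    "GET /search?q=' OR '1'='1 HTTP/1.1",                # SQL injection attempt
    "GET /page?name=<script>alert(1)</script> HTTP/1.1", # XSS attempt
]
labels = [0, 0, 1, 1]

# Character n-grams capture the tokens attackers rely on (quotes, angle brackets, keywords).
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),
    MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
)
model.fit(payloads, labels)

# A firewall hook could score each incoming payload and drop those classified as attacks.
print(model.predict(["GET /profile?id=1 UNION SELECT password FROM users"]))
```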

One of the earliest examples of AI in security was the filtering of spam email. Instead of applying crude filters to email, probabilities were applied using Bayesian filters. In this approach, the user identifies whether an email is spam, and the algorithm adjusts the probabilities of all of the words based on that label. Over time, the algorithm learns that a word like “refinance” appears far more often in spam than in legitimate mail and weights messages containing it toward the spam class. Bayesian techniques have been the baseline methods for email filtering since the 1990s.
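
A naive Bayes classifier over word counts captures the essence of this approach. The sketch below uses scikit-learn with a tiny, invented corpus; a real filter would be trained on thousands of user-labeled messages.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus: each message is labeled by the user as spam (1) or not (0).
messages = [
    "refinance your mortgage at a low rate",
    "you have won a free prize claim now",
    "meeting moved to 3pm see agenda attached",
    "lunch tomorrow to discuss the project",
]
labels = [1, 1, 0, 0]

# The filter learns per-word probabilities; words like "refinance" end up
# weighted toward the spam class.
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(messages, labels)

# Probability that a new message is [not spam, spam].
print(spam_filter.predict_proba(["refinance now and claim your free rate"]))
```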

Immunity-based approaches

Immunity-based approaches use a biologically inspired mechanism for security. Similar to the way our immune system uses T cells to find and identify infections, artificial immune systems (AIS) discriminate between the files and applications that should be present on a host and those that should be removed.

One approach within AIS is to allow a security application to learn the normal operation of a host or device. This training provides the AIS model with a view of how a healthy host operates. After training is complete, the model monitors the host in operation and, when it detects anomalous behavior, identifies and isolates the threat.

AIS has been studied extensively since its inception in the 1990s, resulting in four major algorithms: negative selection, artificial immune networks, clonal selection, and dendritic cell algorithms. Each is inspired by its biological counterpart and provides solutions to complex problems. Even individual elements of the immune system are studied for their applicability, such as T cells for detection and attack and B cells for immune memory of future encounters.

The negative-selection mechanism exists to provide tolerance for cells that should be present in the host; the cells selected for destruction are the foreign antigens. In this way, negative selection is similar to the whitelist approach.
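
A toy version of negative selection might look like the following. The “self” set, the matching rule (substring matching rather than the classic r-contiguous matching), and the detector parameters are all simplifications for illustration.

```python
import random

# Strings describing normal ("self") activity on the host, such as process names.
SELF = ["svchost.exe", "explorer.exe", "lsass.exe", "backup.sh"]
ALPHABET = "abcdefghijklmnopqrstuvwxyz."
DETECTOR_LEN = 4

def matches(detector, sample):
    """A detector 'matches' a sample if it appears as a substring (a simple
    stand-in for the r-contiguous matching used in classic negative selection)."""
    return detector in sample

def generate_detectors(n, rng):
    """Negative selection: keep only randomly generated detectors that match
    nothing in the self set, so surviving detectors react only to non-self."""
    detectors = []
    while len(detectors) < n:
        candidate = "".join(rng.choice(ALPHABET) for _ in range(DETECTOR_LEN))
        if not any(matches(candidate, s) for s in SELF):
            detectors.append(candidate)
    return detectors

def is_anomalous(sample, detectors):
    return any(matches(d, sample) for d in detectors)

rng = random.Random(0)
detectors = generate_detectors(1000, rng)
print(is_anomalous("svchost.exe", detectors))       # always False: detectors never match self
print(is_anomalous("cryptolocker.exe", detectors))  # coverage of non-self grows with the detector set
```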

But like our own immune systems, more sophisticated attacks could turn these AIS-based systems against their own hosts, resulting in a denial-of-service style of attack.

Deep learning applied

Deep learning is a relatively recent machine learning approach that builds on earlier neural network research. As the name implies, deep learning models are made up of deep pipelines of neural network layers (see Figure 3). One of the most common network types is the convolutional neural network (CNN), which has proven effective at identifying objects in images. Another is the Long Short-Term Memory (LSTM) network, which is recurrent and well suited to time-series problems.

Figure 3. Deep learning model

The early stages of these networks detect generic features of the input, and the later stages detect more specific features related to the task at hand. For example, early layers capture features that would not be recognizable to a human attempting to understand them, but later layers could capture features such as the stripes on an object, which would be useful in distinguishing a zebra from a horse.

Deep learning has also been found effective for malware classification. In this context, an executable application consists of a series of bytes that have a defined structure, along with numeric instructions that run on a given processor architecture. Rather than use the executable’s numeric instructions directly, these instructions are converted using an embedded encoding (see Figure 4). This entails translating each instruction’s numeric value into a higher-dimensional space (similar to the way words are encoded into vectors with word2vec for use by deep learning).

Figure 4. Malware detection with deep learning

The embedded encodings are then fed to a deep neural network (DNN), whose convolutional and pooling layers yield a classification. The DNN was trained on a set of executables representing both normal programs and malware, and it successfully isolated the features that distinguish a malware program from a typical one. FireEye demonstrated this approach with upwards of 96-percent accuracy in detecting malware in the wild.
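
A minimal Keras sketch of this style of architecture is shown below. The vocabulary size, sequence length, and layer dimensions are assumptions for illustration, not the parameters of FireEye’s published model.

```python
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 256   # assumed instruction/byte vocabulary
SEQ_LEN = 4096     # fixed-length window of instruction tokens per executable

# Instruction tokens are mapped into a dense embedding space, then convolution
# and pooling layers extract local patterns before the malware/benign decision.
model = tf.keras.Sequential([
    layers.Input(shape=(SEQ_LEN,), dtype="int32"),
    layers.Embedding(input_dim=VOCAB_SIZE, output_dim=8),
    layers.Conv1D(filters=64, kernel_size=8, activation="relu"),
    layers.MaxPooling1D(pool_size=4),
    layers.Conv1D(filters=64, kernel_size=8, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # probability the sample is malware
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```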

Another interesting example of deep learning in security is in the domain of intrusion detection systems (IDS). In one example, an LSTM-based DNN was built that used system-call traces to categorize anomalous intrusions. But instead of a single detector that models the language of system calls, an ensemble of detectors is used to minimize false-alarm rates (see Figure 5). Each detector is an LSTM DNN that is fed sequences of system calls, and the detectors can differ in structure, be trained on different data sets, or use different parameters during training.

Figure 5. Intrusion detection with LSTM deep learning
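
The sketch below outlines such an ensemble in Keras. It simplifies the published approach by scoring traces with supervised binary classifiers rather than true system-call language models, and the vocabulary size, sequence length, and detector sizes are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

NUM_SYSCALLS = 400   # assumed size of the system-call vocabulary
SEQ_LEN = 100        # length of each system-call trace window

def build_detector(lstm_units):
    """One member of the ensemble: an LSTM over system-call sequences that
    produces an anomaly score for a trace."""
    model = tf.keras.Sequential([
        layers.Input(shape=(SEQ_LEN,), dtype="int32"),
        layers.Embedding(input_dim=NUM_SYSCALLS, output_dim=16),
        layers.LSTM(lstm_units),
        layers.Dense(1, activation="sigmoid"),   # anomaly score in [0, 1]
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    return model

# Detectors differ in size (and, in practice, in training data) to decorrelate their errors.
ensemble = [build_detector(units) for units in (32, 64, 128)]

def anomaly_score(trace):
    """Average the ensemble's scores; combining detectors reduces false alarms
    relative to any single detector."""
    trace = np.asarray(trace)[None, :]   # add a batch dimension
    return float(np.mean([m.predict(trace, verbose=0) for m in ensemble]))
```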

This approach was used successfully at Seoul National University, where it achieved a detection rate of 99.8 percent with a 5.5-percent false-alarm rate. The method shows promise by looking only at the system calls an application makes rather than at the entire application itself.

Deep learning has been applied to numerous areas in security, from malware detection and intrusion detection to malicious-code detection, and DNNs have proven very effective across these tasks. But one problem with DNNs is that they are vulnerable to what are called adversarial examples. In the context of visual CNNs, these are inputs constructed specifically to fool the DNN: an image that appears to a human to be noise, or is imperceptibly perturbed, is nonetheless classified with high confidence as something else entirely.
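
The Fast Gradient Sign Method (FGSM) is one of the simplest ways such examples are generated. The sketch below assumes a Keras image classifier with pixel values scaled to [0, 1] and integer class labels; the epsilon value is an illustrative choice.

```python
import tensorflow as tf

def fgsm_perturb(model, image, true_label, epsilon=0.01):
    """Fast Gradient Sign Method: nudge each pixel in the direction that most
    increases the model's loss, producing a near-identical image the
    classifier may nonetheless mislabel."""
    image = tf.convert_to_tensor(image, dtype=tf.float32)   # shape (1, H, W, C)
    label = tf.convert_to_tensor(true_label)
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy()
    with tf.GradientTape() as tape:
        tape.watch(image)
        prediction = model(image)
        loss = loss_fn(label, prediction)
    gradient = tape.gradient(loss, image)
    adversarial = image + epsilon * tf.sign(gradient)
    return tf.clip_by_value(adversarial, 0.0, 1.0)
```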

Adversarial Robustness Toolbox

While deep learning is the state of the art in various problem domains, these models are susceptible to attack. Luckily, there are ways to validate a DNN implementation and understand its robustness in the face of adversarial examples. The Adversarial Robustness Toolbox (ART) is one way to help defend against adversarial examples and to measure the robustness of a given DNN implementation.

ART is written in Python and includes implementations of many state-of-the-art methods for attacking and defending DNN classifiers. It supports most of the popular deep learning frameworks (such as TensorFlow, Keras, and PyTorch), and while it is focused primarily on visual classifiers, future adaptations will cover speech, text, and time-series applications.
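
As a rough sketch, evaluating a Keras classifier’s robustness with ART might look like the following. Exact class and argument names vary across ART versions, and the model, test images, and integer labels are assumed to already exist.

```python
import numpy as np
from art.estimators.classification import KerasClassifier
from art.attacks.evasion import FastGradientMethod

def evaluate_robustness(model, x_test, y_test, eps=0.05):
    """Wrap a compiled Keras classifier with ART, generate FGSM adversarial
    examples, and compare accuracy on clean versus adversarial inputs."""
    classifier = KerasClassifier(model=model, clip_values=(0.0, 1.0))
    attack = FastGradientMethod(estimator=classifier, eps=eps)
    x_adv = attack.generate(x=x_test)

    clean_acc = np.mean(np.argmax(classifier.predict(x_test), axis=1) == y_test)
    adv_acc = np.mean(np.argmax(classifier.predict(x_adv), axis=1) == y_test)
    return clean_acc, adv_acc
```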

Human-in-the-loop AI

When we build machine learning models today, humans are obviously in the loop. We define the algorithms, collect and label data, and deploy the trained models. But human in the loop in this context refers to humans supporting models after deployment, helping to increase the model’s effectiveness and also to train it in the field. As shown in Figure 6, our machine learning model can take an action when it’s confident of its classification of the input data. But when the model is uncertain whether the input is innocuous or a threat, it can rely on a human to inspect the input.

Figure 6. Improving accuracy with human-in-the-loop

Based on the human’s decision, new learning can take place: the input is labeled as a threat so that, in the future, a quick, autonomous action can deal with it. The human could also randomly inspect autonomous actions to audit the model’s high-confidence decisions if false positives are an issue.
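
A minimal sketch of this confidence-threshold routing is shown below, assuming a scikit-learn-style classifier with a predict_proba method; the threshold, class ordering, and action names are illustrative.

```python
def triage(model, sample, review_queue, threshold=0.9):
    """Act autonomously only when the model is confident; otherwise defer to a
    human analyst and keep the reviewed sample for future retraining."""
    probability = model.predict_proba([sample])[0][1]   # assumed P(threat) for class 1
    if probability >= threshold:
        return "block"                  # confident it is a threat: act immediately
    if probability <= 1 - threshold:
        return "allow"                  # confident it is benign
    review_queue.append(sample)         # uncertain: escalate to a human analyst
    return "escalate"
```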

Machines are good at recognizing patterns based on existing knowledge, but they aren’t as good at dealing with new scenarios. For this reason, human in the loop provides an important feedback mechanism to increase the effectiveness of your model.

Going further

Security is a typical arms race. Systems and software are secured against a known and perceived set of threats. Exploits are created to bypass those protections, which drives system and software vendors to update. This vicious cycle continues as new exploits (known as zero-days) fetch ever-larger sums based on the device of interest, the software version, and the capability gained through the exploit.

Machine learning can help to end this cycle by addressing not only existing threats but also new threats that have not yet been seen. This is a crossroads for machine learning and security: the machine learning community generally shares algorithms and techniques, but it’s the data that fuels machine learning, and data is also the competitive advantage.

Machine learning will fundamentally improve security solutions, but it requires a new openness and a new level of collaboration that extends beyond algorithm research. Deep learning shows promise for the future of security technologies, and with tools like the Adversarial Robustness Toolbox to measure and improve robustness, our online interactions should become safer.