Intrusion Detection Systems and Confusion Matrix

Rahul Bhardwaj
6 min read · Jun 3, 2021

Introduction

This article gives a short, plain-language overview of the main types of intrusion detection systems (IDS), introduces the confusion matrix, and explains why the confusion matrix plays a vital role in evaluating an IDS.

IDS stands for Intrusion Detection System. An IDS identifies malicious network traffic and computer usage that a traditional firewall typically cannot identify.

This is vital for protecting computer systems against actions that compromise their security, especially at an important organization such as a bank or a national or international agency.

Based on how an intrusion is detected, IDSs can be broadly categorized into two groups: Signature-based Intrusion Detection Systems (SIDS) and Anomaly-based Intrusion Detection Systems (AIDS).

Signature-based intrusion detection systems (SIDS)

A signature-based IDS works from stored knowledge of known attacks. The system is told in advance which kinds of behavior should be considered suspicious. For example, the pattern of an earlier intrusion is stored in a database, and the IDS raises an alert whenever it observes activity that matches that pattern. This approach is also called Knowledge-Based Detection or Misuse Detection.
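The signature-matching idea can be illustrated with a minimal sketch. The signature names, patterns, and example requests here are hypothetical and chosen only for illustration; a real SIDS uses far richer rule formats than plain substring checks.

```python
# Hypothetical signature database: substrings taken from earlier intrusions.
KNOWN_SIGNATURES = {
    "sql_injection": "' OR '1'='1",
    "path_traversal": "../../",
    "shell_download": "wget http",
}

def match_signatures(event: str) -> list:
    """Return the names of all stored signatures found in the event text."""
    return [name for name, pattern in KNOWN_SIGNATURES.items() if pattern in event]

def is_intrusion(event: str) -> bool:
    """Flag the event if it matches at least one stored signature."""
    return bool(match_signatures(event))

known_attack = "GET /login?user=admin' OR '1'='1"
# Same attack idea, but encoded differently, so no stored pattern matches.
novel_attack = "GET /login?user=admin%27%20OR%201=1"

print(is_intrusion(known_attack))  # True
print(is_intrusion(novel_attack))  # False
```

The second request slips through because it matches nothing in the database, even though it is a variation of a known attack.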

This is an effective way to detect known intrusions, but it becomes useless against an attack that does not match any signature already in the database. Anomaly-based intrusion detection systems (AIDS) were developed to overcome this limitation.

Anomaly-based intrusion detection system (AIDS)

In AIDS, a normal model of the behavior of a computer system is created using machine learning, statistical-based or knowledge-based methods. Any significant deviation between the observed behavior and the model is regarded as an anomaly, which can be interpreted as an intrusion. The assumption for this group of techniques is that malicious behavior differs from typical user behavior. The behaviors of abnormal users which are dissimilar to standard behaviors are classified as intrusions.

An AIDS is developed in two phases. In phase one, the model is trained on known behavior. In phase two, it is presented with behavior it has not seen before, and its reaction is measured as its performance.
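As a rough sketch of the statistical flavor of this idea, the example below builds a "normal" model from the mean and standard deviation of one feature and flags values that deviate too far from it. The feature (login attempts per minute), the sample values, and the threshold of 3 standard deviations are all assumptions made for illustration, not something prescribed by any particular IDS.

```python
from statistics import mean, stdev

def fit_normal_model(samples):
    """Phase one: summarize known-normal behavior by its mean and standard deviation."""
    return mean(samples), stdev(samples)

def is_anomaly(value, model, threshold=3.0):
    """Flag a value whose z-score against the normal model exceeds the threshold."""
    mu, sigma = model
    return abs(value - mu) / sigma > threshold

# Hypothetical feature: login attempts per minute during normal operation.
normal_logins = [4, 5, 6, 5, 4, 6, 5, 5, 4, 6]
model = fit_normal_model(normal_logins)

# Phase two: show the model values it has not seen and observe its reaction.
for observed in (5, 7, 40):
    print(observed, is_anomaly(observed, model))
```

With this data, 5 and 7 are treated as normal while 40 is flagged as an anomaly. Lowering the threshold would catch more attacks but would also raise more false alarms.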

Based on Input Data Sources

IDSs can also be classified by their input data source. If the IDS inspects data from the host system and its audit sources, such as the operating system, Windows server logs, firewall logs, application system audits, or database logs, it is called a Host-based IDS (HIDS).

If the IDS instead monitors the network, extracting and inspecting data packets from the network traffic to catch intrusions coming from outside, it is called a Network-based IDS (NIDS).

A newly created IDS should not be deployed until it has been properly tested. In other words, we should know how effective our model is: how many intrusions is our IDS actually going to protect us from?

There are many metrics for doing that but one of the main and simplest is the confusion matrix.

What is a Confusion Matrix?

A confusion matrix is a very useful way to summarize the performance of a classification model. It is used in machine learning to show how many of the model’s outputs are correct and how many are wrong. The name comes from the fact that the matrix shows where the model confuses one class with another, although the similar-sounding terms used in it can admittedly be confusing as well.

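For example, if we have a binary classification model with a positive and a negative outcome, the matrix would look something like this:

                     Predicted Positive     Predicted Negative
Actual Positive      True Positive (TP)     False Negative (FN)
Actual Negative      False Positive (FP)    True Negative (TN)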

Positive and Negative are the outputs given by the model. The word True in front means the output matches the actual outcome, and the word False means the output is wrong. A False Positive, for example, is a case the model labeled positive that was actually negative.
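To make the four terms concrete, here is a minimal sketch that counts them from a list of actual labels and a list of model outputs. The label encoding (1 for attack, 0 for normal) and the sample data are assumptions chosen for illustration.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, TN, FN for binary labels (1 = attack, 0 = normal)."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == 1 and a == 1:
            tp += 1  # predicted attack, really an attack
        elif p == 1 and a == 0:
            fp += 1  # predicted attack, really normal
        elif p == 0 and a == 0:
            tn += 1  # predicted normal, really normal
        else:
            fn += 1  # predicted normal, really an attack
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(actual, predicted)
print(counts)  # {'TP': 3, 'FP': 1, 'TN': 3, 'FN': 1}
```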

A confusion matrix isn’t limited to 2×2, though. Its size depends on how many discrete outputs the classification model can give. For example, a classifier with 10 possible outputs produces a 10×10 matrix.

The same logic applies here. The classifier has 10 possible outputs (0–9). The predicted values lie along the horizontal axis and the actual values along the vertical axis. Pick any cell of the matrix, say (3,5), where the count is 1. That means the model output 5 for an input whose actual value was 3, and this happened once.
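The multi-class case can be sketched the same way. The example below assumes class labels are integers starting at 0 and uses made-up data. Rows index the actual class and columns index the predicted class, matching the layout described above.

```python
def confusion_matrix(actual, predicted, num_classes):
    """Build a matrix whose rows are actual classes and columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for a, p in zip(actual, predicted):
        matrix[a][p] += 1
    return matrix

# Hypothetical labels: one input whose actual value is 3 was predicted as 5.
actual    = [3, 3, 5, 5, 7, 0]
predicted = [3, 5, 5, 5, 7, 0]
matrix = confusion_matrix(actual, predicted, num_classes=10)

print(matrix[3][5])  # 1: actual 3 was predicted as 5 once
print(matrix[5][5])  # 2: actual 5 was predicted correctly twice
```

Correct predictions land on the diagonal of the matrix, so any non-zero cell off the diagonal marks a pair of classes the model confuses.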

This kind of analysis has applications in many fields. For example, the results of Covid-19 tests can be sorted into four categories: True Positive, False Positive, True Negative, and False Negative. This should give you some idea of how important this kind of analysis is.

Now let us see in a little more detail how this can be used in IDS performance check. Let us again consider the 2×2 confusion matrix.

With a little math we can calculate some more performance measures out of it:

  • True Positive Rate (TPR): The ratio between the number of correctly predicted attacks and the total number of attacks. If all intrusions are detected, the TPR is 1, which is extremely rare for an IDS. TPR is also called the Detection Rate (DR) or the Sensitivity. The TPR can be expressed mathematically as: TPR = TP / (TP + FN)
  • False Positive Rate (FPR): The ratio between the number of normal instances incorrectly classified as an attack and the total number of normal instances. We want this to be as close to 0 as possible. The FPR can be expressed mathematically as: FPR = FP / (FP + TN)
  • False Negative Rate (FNR): A false negative occurs when the detector fails to identify an anomaly and classifies it as normal. The FNR is the ratio between the number of missed attacks and the total number of attacks, and can be expressed mathematically as: FNR = FN / (FN + TP)
  • Classification Rate (CR) or Accuracy: The CR measures how accurate the IDS is in detecting normal or anomalous traffic behavior. It is the ratio of all correctly predicted instances to all instances: CR = (TP + TN) / (TP + TN + FP + FN)
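The four measures above follow directly from the four counts in the 2×2 matrix. Here is a minimal sketch, where the function name and the example counts are made up for illustration:

```python
def ids_rates(tp, fp, tn, fn):
    """Compute the four measures from the cells of a 2x2 confusion matrix."""
    return {
        "TPR": tp / (tp + fn),                  # detected attacks / all attacks
        "FPR": fp / (fp + tn),                  # false alarms / all normal instances
        "FNR": fn / (fn + tp),                  # missed attacks / all attacks
        "CR": (tp + tn) / (tp + fp + tn + fn),  # correct predictions / all instances
    }

# Hypothetical test run: 100 attacks and 100 normal instances.
rates = ids_rates(tp=90, fp=5, tn=95, fn=10)
for name, value in rates.items():
    print(f"{name}: {value:.3f}")
```

For these counts the sketch gives TPR 0.900, FPR 0.050, FNR 0.100 and CR 0.925. TPR and FNR always add up to 1, since every attack is either detected or missed.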

A curve, known as the ROC (receiver operating characteristic) curve, can be plotted with FPR on the x-axis and TPR on the y-axis. Each point on the curve corresponds to one threshold setting of the system under test. For different settings the IDS will produce different counts in the four cells of the confusion matrix. Here we’ll focus on the important part, i.e. the relation between FPR and TPR.
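The sketch below shows how such a curve can be traced. It assumes the IDS outputs a suspicion score between 0 and 1 for each instance and raises an alert when the score reaches a threshold. The scores, labels, and thresholds are made up for illustration.

```python
def roc_points(scores, labels, thresholds):
    """Return one (FPR, TPR) point per threshold; an alert is raised when score >= threshold."""
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < t and y == 0)
        points.append((fp / (fp + tn), tp / (tp + fn)))
    return points

# Hypothetical suspicion scores and true labels (1 = attack, 0 = normal).
scores = [0.95, 0.80, 0.70, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

for fpr, tpr in roc_points(scores, labels, thresholds=[1.1, 0.75, 0.35, 0.0]):
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

A threshold above every score flags nothing and gives the point (0, 0). A threshold of 0 flags everything and gives (1, 1). The thresholds in between trace the rest of the curve.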

An IDS with a curve like the one in blue passes, because its False Positive rate stays very low relative to its True Positive rate. Obviously you cannot expect a model that is right 100% of the time, but a model with this level of accuracy can reasonably be put to work.

So, that’s it for this article. Now you know how a simple concept like the confusion matrix can be of such great use.

Reference: Survey of intrusion detection systems: techniques, datasets and challenges
