K-Means Clustering and its Use Cases in Security Domain
What is K-Means clustering ?
K-Means clustering is a Machine Learning algorithm that helps in making groups or clusters in our data. But why do we want to our data to be grouped in that way? Grouping data provides a better understanding of the data even if its already been presented in a graphical way.
Not only this, grouping our data can be helpful segregate different types of things in different categories for example Market segmentation, document clustering. It is even helpful in computer vision where you can go into the granular depth of identifying the objects and how much of the image’s area are they in, called image segmentation.
Let us start from a smaller dataset’s point of view.
Suppose we have a dataset of Age and Salary whose graph looks something like this:
As you can see we can’t be sure but from the visuals we get an intuition that there are three kinds of people in this dataset : one having younger age and less salary, one having older age and more salary and one having less salary and older age. So, this is something that K-Means algorithm helps us to do.
Below is the graph made after implementing the K-Means algorithm on python with matplotlib library:
Made pretty simple right? So, this is the basic thing done with the help of K-Means clustering.
How does the K-Means clustering work? What is the “K” in K-Means?
Now you might be thinking okay that’s a cool thing that it finds out the clusters in the graph. But how does it do that? What is this algorithm. So, let me explain it in this way. First we have to choose how many clusters we want. Of course we also have a method for that but let me come to that in the later part. First, let us assume the number of clusters here which is the K.
So, for the previously given dataset let us take the K to be 3. Now what our algorithm will do, it will take three centroids which will become the centers of these (about to be) clusters with the help of Euclidean geometry.
First, it will randomly set these centroids anywhere in the graph:
Now, it will group the data points according to there distances from these centroids. For that it will use the perpendicular bisector which will easily segregate the data points according to the distances.
And the centroids will change their positions according to the new data points sorted for them.
Now again new sorting will be done according to the new perpendicular bisectors of these new centroids.
As you can see there has been some changes in the clustering with respect to the previous clusters. So, this will keep on going in cycle until the centroids no more change their positions. And thus, we get the final clusters:
NOTE: The above method shows just the way how K-Means work. In actual manipulation of data first the scaling needs to be done due to the difference in X and Y scale.
How to choose the value of K?
To choose the value of K we use the Elbow method. In this method we choose a series of Ks for our dataset and find the square of sum of distances of each point from the respective centroid in the respective cluster and add these distances along with those for the other centroids.
For example let K=3, as in above example. Then for each of the three centroids find the square of distance from each point of the respective cluster and add them. Let the sums be S1, S2 and S3 called sum of squared errors. Now for K the total sum of squared errors will be S1+S2+S3. Similarly, calculate this for multiple values of K (say 1 to 9) and plot a graph for all of them.
Here as you can see there is an abrupt change in the slope for k=3. So, 3 is the appropriate value of k. Why only the elbow point? Well, the point where the elbow is made is the junction point of the scenario ahead of what there will be multiple clusters having almost the same distance from centroids. And before that, the distance between the points and clusters will be too much and will be changing. So the elbow point creates a perfect picture for our desired state by grouping into clusters.
Uses of K-Means clustering in Security Domain
Due to its ability to categorize data and simplicity of implementation it has been found by quite some researchers to be useful in anomaly detection, outlier detection, fraud detection etc.
Let’s us understand in brief about these kind of detections first:
Anomaly detection: Detection of a behavior in the system that do not conform to the expected behavior.
Outlier detection: Detection of behavior in the system that do conforms to the expected behavior but vary a little too much from the rest of the records.
Fraud Detection: These are the false pretenses made in the intention of gaining something from the system.
In 2015 Chauhan and Shukla in their research paper laid an approach to for outlier detection using K-Means cluster algorithm that was useful to give basics of concepts in outlier detection for beginner researchers.
Another paper was published by Münzet in 2007 proposing a novel anomaly detection method in data mining using the K-Means clustering algorithm. The model designed was concluded to be faster in terms of detection quality. The concept used was basically the same that we earlier discussed. He used unlabeled records in the dataset and divided it into clusters of regular traffic and anomalies using the K-means algorithm.
K-Means is but a basic algorithm which can be integrated with other algorithms to get more efficiency. In 2018 Aung& Min published a paper where they created a hybrid model for detecting Denial of Service (DoS) attacks, Probing (Probe) attacks, User-to-Root (U2R) attacks and Remote- to-Local (R2L) attacks. Let’s understand these in brief before going any further:
Denial of Service (DoS) attack: It is an attack meant to shut down a machine or network, making it inaccessible to its intended users.
Probing attach: A probe is an attack which is deliberately crafted so that its target detects and reports it with a recognizable “fingerprint” in the report. The attacker then uses the collaborative infrastructure to learn the detector’s location and defensive capabilities from this report.
User-to-Root attack: Initially attacker access normal user account, later gain access to the root by exploiting the vulnerabilities of the system.
Remote-to-Local attack: Unauthorized access to the system through methods like guessing password etc.
In the paper they found out that they can group similar nature of attack group by using K-means algorithm. And then with Random Forest algorithm they can be classified into normal and attack connections. They used KDDCup 99 dataset which show some impressive results.
As you can see from the above table the precision went up to some really great level, the false-positive rate also showed an enhancement result nearly equal to zero.
So, that’s all about this article. Thanks for reading.
References:
- A Review of Current Machine Learning Approaches for Anomaly Detection in Network Traffic by Wasim A. Ali, ManasaK. N, Malika Bendechache, Mohammed FadhelAljunaid, P. Sandhya.
- An Analysis of K-means Algorithm Based Network Intrusion Detection System by Yi YI Yung and Myat Myat Min.