Hackers, I see you: Network Intrusions

In today's hyperconnected world, billions of packets travel between computers every second. Unfortunately, not all of this traffic is well intentioned, and some of it can cause significant harm. With machine learning, we can use classification models to differentiate normal network traffic from nefarious traffic. To do this effectively, however, choosing the right model is critical.

Challenge

To tackle this problem, we’ll train and test against nearly 500,000 data points from an old competition held at SIGKDD 1999. The original task was to correctly distinguish intrusions and attacks from normal connections. You can read more about the challenge here. Back then, even the winners’ results were not great. Let’s see what almost 20 years of machine learning advancement can do for this problem now.
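The post itself doesn’t include code, but as a rough, reproducible starting point, here is a minimal sketch of how such a data set could be loaded. It assumes the 10% KDD Cup ’99 subset that ships with scikit-learn (fetch_kddcup99), roughly 494,000 connection records; the original data source and preprocessing may well have differed.

```python
import pandas as pd
from sklearn.datasets import fetch_kddcup99

# Assumed data source: the 10% KDD Cup '99 subset bundled with scikit-learn (~494k rows).
raw = fetch_kddcup99(percent10=True, as_frame=True)
X, y = raw.data.copy(), raw.target

# Some scikit-learn versions return byte strings here; decode them to plain text.
def _decode(v):
    return v.decode() if isinstance(v, bytes) else v

y = y.map(_decode)
for col in X.select_dtypes(include="object").columns:
    X[col] = X[col].map(_decode)

# One-hot encode the three categorical features so tree models can consume them.
X = pd.get_dummies(X, columns=["protocol_type", "service", "flag"]).astype(float)
```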

We’ll try the following two models: Random Forest and Gradient Boosting.

Data

[Figure: breakdown of attack types in the data set]

Because our data set is so unbalanced and many attack types barely have any occurrences, I will group all attacks with fewer than 100 occurrences into a single “Others” class. This yields better classification performance for them as a group than trying to predict each one individually.
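A sketch of that grouping, continuing from the loading snippet above; the 100-occurrence threshold is the one stated here, while the 80/20 stratified train/test split is my own assumption.

```python
from sklearn.model_selection import train_test_split

# Group every attack type with fewer than 100 occurrences into a single "Others" class.
counts = y.value_counts()
rare = counts[counts < 100].index
y = y.where(~y.isin(rare), other="Others")

# Assumed 80/20 stratified train/test split (the post's exact split isn't stated).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```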

Random Forest

Confusion Matrix

Confusion matrices are great for visualizing misclassifications. On the left are the actual network traffic types, and on the bottom is what each connection was predicted as. Pay attention to the “normal” axis. With the random forest model, our errors are concentrated in the “others” category, 21.6% of which is misclassified.

[Figure: Random Forest confusion matrix]
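For reference, a hedged sketch of how a random forest and a row-normalized confusion matrix like the one above could be produced; the hyperparameters are illustrative, not the post’s actual settings.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Illustrative hyperparameters -- the post's actual settings aren't specified.
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)

# Row-normalized confusion matrix: rows are actual classes, columns are predictions.
labels = list(rf.classes_)
cm = confusion_matrix(y_test, rf.predict(X_test), labels=labels, normalize="true")
print(pd.DataFrame(cm, index=labels, columns=labels).round(3))
```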

Precision-Recall Curve

[Figure: Random Forest precision-recall curve]

The precision-recall curve makes it quite clear that our classification performance for every category except “others” is near perfect. Only the “others” category gives us trouble, simply because we did not have enough data points to train on for it.
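A sketch of how per-class precision-recall curves like these can be drawn, using one-vs-rest binarized labels and the forest’s predicted probabilities; it continues from the random forest snippet above.

```python
import matplotlib.pyplot as plt
from sklearn.metrics import average_precision_score, precision_recall_curve
from sklearn.preprocessing import label_binarize

# One-vs-rest precision-recall curve per class.
classes = rf.classes_
y_test_bin = label_binarize(y_test, classes=classes)
proba = rf.predict_proba(X_test)  # columns follow rf.classes_

for i, cls in enumerate(classes):
    precision, recall, _ = precision_recall_curve(y_test_bin[:, i], proba[:, i])
    ap = average_precision_score(y_test_bin[:, i], proba[:, i])
    plt.plot(recall, precision, label=f"{cls} (AP={ap:.3f})")

plt.xlabel("Recall")
plt.ylabel("Precision")
plt.legend(loc="lower left")
plt.show()
```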

Gradient Boosting

Generally speaking, gradient boosting yields better results than a random forest, but at a cost: it requires lots of hyperparameter tuning, and because the trees are built sequentially, training traditionally wasn’t parallelizable. This time constraint usually meant choosing random forests, but nowadays, with XGBoost, we can do gradient boosting much faster and benefit from the better results. Let’s see if we can squeeze a little more performance out of this model.
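A minimal XGBoost training sketch, continuing from the earlier snippets; the hyperparameter values are placeholders rather than tuned settings.

```python
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder

# XGBoost expects integer class labels, so encode the string labels first.
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
y_test_enc = le.transform(y_test)

# Illustrative hyperparameters; the post's tuned values aren't given.
xgb = XGBClassifier(n_estimators=100, max_depth=6, learning_rate=0.1, n_jobs=-1)
xgb.fit(X_train, y_train_enc)
print("Test accuracy:", xgb.score(X_test, y_test_enc))
```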

Confusion Matrix

Compared with the random forest’s confusion matrix, XGBoost’s shows a noticeable improvement in “others” classification. We now miss only 15.7% instead of the random forest’s 21.6%.

[Figure: XGBoost confusion matrix]

Precision-Recall Curve

The precision-recall curve for XGBoost is nearly identical to our random forest model’s as well.

[Figure: XGBoost precision-recall curve]

Performance

Since gradient boosting can keep adding trees indefinitely, we need to stop it at an appropriate epoch (boosting round). Our log-loss and classification error graphs show performance plateauing around 60 epochs, which would probably be a good place to stop training.

[Figure: XGBoost log-loss per epoch]

[Figure: XGBoost classification error per epoch]
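As a sketch of how that stopping point can be found programmatically, XGBoost’s scikit-learn wrapper can track both metrics on a held-out set every round and halt once they stop improving. It continues from the XGBoost snippet above; the 10-round patience and other settings are assumptions, and this API shape requires a reasonably recent xgboost (1.6 or later).

```python
from xgboost import XGBClassifier

# Track log-loss and classification error on a held-out set at every boosting round,
# mirroring the two plots above, and stop once they stop improving.
xgb_monitored = XGBClassifier(
    n_estimators=200,
    max_depth=6,
    learning_rate=0.1,
    eval_metric=["mlogloss", "merror"],
    early_stopping_rounds=10,  # assumed patience, not the post's setting
    n_jobs=-1,
)
xgb_monitored.fit(X_train, y_train_enc, eval_set=[(X_test, y_test_enc)], verbose=False)

history = xgb_monitored.evals_result()["validation_0"]
print("best iteration:", xgb_monitored.best_iteration)
print("log-loss at best iteration:", history["mlogloss"][xgb_monitored.best_iteration])
```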
