In today’s hyperconnected world, billions of packets are sent back and forth between computers every second. Unfortunately, not all of this traffic is well intentioned, and some of it can cause significant harm. With machine learning, we can use classification models to distinguish normal network traffic from malicious traffic. To do this effectively, however, choosing the right model is critical.
To tackle this problem, we’ll train and test on nearly 500,000 data points from an old competition held at SIGKDD 1999. The original task was to correctly distinguish intrusions or attacks from normal connections. You can read up more about the challenge here. Back then, though, even the winning results were not great. Let’s see what almost 20 years of machine learning advancement can do for this problem now.
Because our data set is so unbalanced and many attack types barely have any occurrences, I will group all attack types with fewer than 100 occurrences into a single “others” group. Pooling these rare attacks should yield better classification performance on them as a whole than trying to predict each one individually.
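The grouping step can be sketched in a few lines. This is a minimal illustration, not the code used for the results in this post: the function name `group_rare_labels` and the toy label counts are assumptions made up for the example, and only the threshold of 100 comes from the text.

```python
from collections import Counter


def group_rare_labels(labels, min_count=100, other_name="others"):
    """Replace every label seen fewer than min_count times with other_name."""
    counts = Counter(labels)
    return [lab if counts[lab] >= min_count else other_name for lab in labels]


# Toy labels with invented counts (not the real class distribution).
labels = ["normal"] * 300 + ["smurf"] * 150 + ["spy"] * 2 + ["perl"] * 3
grouped = group_rare_labels(labels)

# The two rare attack types are now pooled into one "others" class.
print(Counter(grouped))
```

The counts should be computed on the training data only, so that the decision about which classes are “rare” does not depend on the test set.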
Confusion matrices are great for visualizing misclassifications. The actual network traffic types are on the left, and the types each connection was predicted as are along the bottom. Pay attention to the “normal” axis. With a random forest model, the main problem is the “others” category, 21.6% of which is misclassified.
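Percentages like the one above come from normalizing each row of the confusion matrix by the number of actual examples in that class. Here is a small sketch of that calculation in plain Python; the helper name `confusion_matrix_rows` and the toy labels are assumptions for illustration, and in practice a library routine such as scikit-learn's `confusion_matrix` would normally do this work.

```python
def confusion_matrix_rows(y_true, y_pred, classes):
    """Return {actual: {predicted: fraction}}, with each non-empty row summing to 1."""
    counts = {a: {p: 0 for p in classes} for a in classes}
    for actual, predicted in zip(y_true, y_pred):
        counts[actual][predicted] += 1

    normalized = {}
    for a in classes:
        total = sum(counts[a].values())
        normalized[a] = {
            p: (counts[a][p] / total if total else 0.0) for p in classes
        }
    return normalized


# Toy data, invented for illustration: one of five "others" is predicted "normal".
classes = ["normal", "smurf", "others"]
y_true = ["normal"] * 4 + ["smurf"] * 4 + ["others"] * 5
y_pred = ["normal"] * 4 + ["smurf"] * 4 + ["others"] * 4 + ["normal"]

cm = confusion_matrix_rows(y_true, y_pred, classes)
# Share of actual "others" that were given any other label.
miss_rate = 1.0 - cm["others"]["others"]
print(f"others misclassified: {miss_rate:.1%}")
```

Reading across the “others” row then shows not only how many were missed but also which class they were mistaken for.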
The precision-recall curves make it clear that classification performance is near perfect for every category except “others”. That category gives us trouble simply because we did not have enough data points to train on.
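For a multi-class problem, a precision-recall curve is usually drawn one class at a time, treating that class as positive and everything else as negative. The sketch below shows how the points of one such curve can be computed from predicted scores. The function name and the toy scores are assumptions for illustration, and ties between scores are not handled specially; scikit-learn's `precision_recall_curve` is the usual tool for real data.

```python
def precision_recall_points(is_positive, scores):
    """One-vs-rest (precision, recall) pairs, lowering the threshold one example at a time."""
    total_pos = sum(is_positive)
    if total_pos == 0:
        raise ValueError("need at least one positive example")

    # Visit examples from the most to the least confident prediction.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    points = []
    for i in order:
        if is_positive[i]:
            tp += 1
        else:
            fp += 1
        points.append((tp / (tp + fp), tp / total_pos))
    return points


# Toy example: 1 marks an actual "others" connection, scores are made up.
is_positive = [1, 1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.1]
points = precision_recall_points(is_positive, scores)
for precision, recall in points:
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

A class with few training examples tends to receive less reliable scores, which pulls its curve away from the top-right corner.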
Generally speaking, gradient boosting yields better results than a random forest, but at a cost: boosted models require a lot of hyperparameter tuning and, because each tree is built on the errors of the previous ones, training traditionally could not be parallelized. This time constraint usually meant choosing random forests, but nowadays XGBoost makes gradient boosting much faster, so we can benefit from the better results. Let’s see if we can squeeze a little more performance out of this model.
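To see why boosting is sequential by nature, here is a toy sketch of gradient boosting for squared error, using one-split “stumps” on a single feature. It is not XGBoost's algorithm and not the classifier used in this post; the function names, the data, and the learning rate are all assumptions chosen to keep the example small. The point is only that each round needs the predictions from the round before it.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split that best fits the residuals (least squares)."""
    best = None
    # Skipping the largest value guarantees both sides of the split are non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left)
        error += sum((r - right_mean) ** 2 for r in right)
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Return the training mean squared error after each boosting round."""
    predictions = [sum(ys) / len(ys)] * len(ys)  # start from the mean
    losses = []
    for _ in range(n_rounds):
        # Each new stump is fitted to what the current ensemble still gets wrong,
        # so round k cannot start before round k - 1 has finished.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return losses


# Made-up data with three rough plateaus.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0]
losses = boost(xs, ys)
print(f"first round MSE={losses[0]:.4f}, last round MSE={losses[-1]:.4f}")
```

A random forest, by contrast, grows its trees independently of one another, so they can all be built at the same time.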
Compared with the random forest’s confusion matrix, XGBoost’s shows a noticeable improvement on the “others” category. We are now misclassifying only 15.7% of it instead of the random forest’s 21.6%, a relative reduction of roughly 27%.
The precision-recall curve for XGBoost is nearly identical to that of our random forest model.
Since gradient boosting is a technique that can keep running indefinitely, we need to stop it at an appropriate epoch. Our log-loss and classification error graphs show performance plateauing around 60 epochs. This would probably be a good point to stop training our model.
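Rather than reading the stopping point off a graph, the same decision can be automated with a patience rule: stop once the validation loss has not improved for a set number of rounds. The sketch below is a plain-Python illustration with an invented loss curve. The function name, the patience of 10, and the curve itself are assumptions, not values from this experiment. XGBoost offers a built-in version of this idea through its `early_stopping_rounds` option.

```python
def best_stopping_round(validation_losses, patience=10, min_delta=0.0):
    """Return the index of the best round seen before patience ran out."""
    best_round = 0
    best_loss = float("inf")
    for round_index, loss in enumerate(validation_losses):
        if loss < best_loss - min_delta:
            # A real improvement: remember it and keep going.
            best_loss = loss
            best_round = round_index
        elif round_index - best_round >= patience:
            # No improvement for `patience` rounds in a row: give up.
            break
    return best_round


# Invented curve: improves for 60 rounds, then slowly gets worse.
improving = [1.0 / (1 + r) for r in range(60)]
worsening = [1.0 / 60 + 0.001 * k for k in range(1, 41)]
stop = best_stopping_round(improving + worsening)
print(f"best round index: {stop}")
```

The loss should be measured on held-out validation data, since training loss keeps decreasing long after the model has stopped generalizing better.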