This is the report for the adAPT machine learning challenge. This document will:
- Describe the challenge
- Describe the implemented solution and rationale
  - Feature engineering
  - Model selection
  - Cross-validation technique
  - Evaluation
  - Results
  - Improvements upon my results
- Present and describe a diagram of a pipeline for the model deployed in Elastic
In this challenge, I was asked to use network packet captures (pcap files) to classify malware using a machine learning model.
The full set of rules can be found in challenge.md or README.md.
This section describes my solution and rationale. The time limit on this challenge forced design decisions that favored a working solution over a very robust one.
TCP/IP network packets have a lot of inherent features by the nature of their structure and protocols. Some of those structural features were used directly in the feature engineering. Some were derived from the structural features. The table below describes the selected features and how they were generated.
I used scapy as my pcap parser, which also has the notion of application layers. A DNS packet may have these layers:

- IP
- UDP
- DNS

whereas an HTTP request packet may have:

- IP
- TCP
- HTTP
- HTTPRequest
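As a minimal sketch of this parsing step (the pcap path is a placeholder, and the HTTP layer import assumes scapy >= 2.4.3):

```python
from scapy.all import rdpcap
from scapy.layers.dns import DNS
from scapy.layers.http import HTTPRequest  # bundled HTTP layer, scapy >= 2.4.3
from scapy.layers.inet import IP, TCP, UDP

# Read every packet from a capture file (placeholder path).
packets = rdpcap("capture.pcap")

for pkt in packets:
    if IP not in pkt:
        continue
    # Transport protocol and application layer, as used for the
    # protocol and app_layer features described below.
    transport = "TCP" if TCP in pkt else "UDP" if UDP in pkt else "other"
    if DNS in pkt:
        app = "DNS"
    elif HTTPRequest in pkt:
        app = "HTTPRequest"
    else:
        app = "unknown"
    print(pkt[IP].src, pkt[IP].dst, transport, app)
```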
Feature Name | Raw / Derived | Possible Values | Selection Rationale | Notes |
---|---|---|---|---|
protocol | derived | True / False (One-Hot Encoded) | Different software can utilize various IP protocols and this can identify likelihood of malware using one over the other | The protocol was derived from the scapy IP layer and describes IPv4, IPv6, and UDP |
app_layer | derived | True / False (One-Hot Encoded) | Will help capture the application layer behavior of benign and malicious software | The application layer was derived from the scapy layers and custom classes in the feature generation |
source_port | raw | True / False (One-Hot Encoded) | If malware uses unique ports to communicate this feature will be valuable | The port from which the packet was sent |
dest_port | raw | True / False (One-Hot Encoded) | See source_port | The port to which the packet was sent |
proto_packet_length | raw | Continuous [0-1) | Different functions of software may have predictable packet lengths | The length of the packet as captured at the TCP/UDP scapy layer |
ip_packet_length | raw | Continuous [0-1) | See proto_packet_length | The length of the packet as captured at the IP scapy layer |
base_domain | derived | True / False (One-Hot Encoded) | The domain itself may lead to detection in a similar fashion as IP address. This is a poor implementation of a domain reputation service and a real service with a score would serve this purpose better | The Domain portion of a URL (e.g., "microsoft.com" from "www.microsoft.com") |
tld | derived | True / False (One-Hot Encoded) | This is a poor implementation of TLD reputation. A real reputation service with a score for a given TLD would be more useful | The top-level domain portion of a URL (e.g., "com" from "www.microsoft.com") |
url_entropy | derived | Numeric >= 0 | The entropy of the URL may lead to discovery of generated (i.e., malicious) domains | The Shannon entropy of the URL string (if present, otherwise 0) |
host_entropy | derived | Numeric >= 0 | Malware may utilize domains that are also widely used for legitimate purposes (think raw.githubusercontent.com) so analyzing just the host information can lead to identifying malware that uses this technique | The Shannon entropy of the Host portion of the URL string (if present, otherwise 0) |
base_domain_entropy | derived | Numeric >= 0 | The base domain entropy is searching for the inverse of the host_entropy - when malware uses legitimate hostnames on malicious domains (e.g., www.evil.ga) | The Shannon entropy of the Domain portion of the URL string (if present, otherwise 0) |
host_length | derived | Numeric >= 0 | The length of only the hostname may indicate certain strains of malware that use particularly long names | The length of the Host portion of the URL string (if present, otherwise 0) |
proto_packet_entropy | derived | Numeric >= 0 | The scapy packet contains the packet cache in bytes and I am making an assumption that this value can be used to capture smuggling of predictable (low-entropy) data by malware | The Shannon entropy of the TCP/UDP packet cache |
source_ip_class_a | derived | True / False (One-Hot Encoded) | The source IP address first octet may indicate popular networks for malware (e.g., national infrastructure) | The first octet of the source IP address |
source_ip_class_b | derived | True / False (One-Hot Encoded) | Similar to source_ip_class_a rationale | The first 2 octets of the source IP address |
source_ip_class_c | derived | True / False (One-Hot Encoded) | Similar to source_ip_class_a rationale | The first 3 octets of the source IP address |
dest_ip_class_a | derived | True / False (One-Hot Encoded) | The destination IP address first octet may indicate popular networks for malware (e.g., national infrastructure) | The first octet of the destination IP address |
dest_ip_class_b | derived | True / False (One-Hot Encoded) | Similar to dest_ip_class_a rationale | The first 2 octets of the destination IP address |
dest_ip_class_c | derived | True / False (One-Hot Encoded) | Similar to dest_ip_class_a rationale | The first 3 octets of the destination IP address |
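Several of the derived features in the table reduce to two small helpers: a Shannon entropy computation over a string and an octet-prefix split of an IP address. The sketch below is my reconstruction of those computations; the function names are mine, not necessarily those used in the challenge code:

```python
import math
from collections import Counter

def shannon_entropy(s: str) -> float:
    """Shannon entropy (bits per character) of a string; 0.0 for empty input."""
    if not s:
        return 0.0
    n = len(s)
    return -sum((c / n) * math.log2(c / n) for c in Counter(s).values())

def ip_prefixes(ip: str) -> tuple:
    """Class A/B/C-style prefixes, e.g. ('10', '10.0', '10.0.1') for '10.0.1.5'."""
    octets = ip.split(".")
    return ".".join(octets[:1]), ".".join(octets[:2]), ".".join(octets[:3])

# Generated-looking hosts tend to score higher than familiar ones:
print(shannon_entropy("www.microsoft.com"))
print(shannon_entropy("x9f3qz7kd2.evil.ga"))
print(ip_prefixes("10.0.1.5"))
```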
For my model I opted for a supervised binary classification neural network. I chose this model because I already had labelled data and the challenge was to develop a model based on those data. I feel that an unsupervised model would sufficiently identify similar types of software, though it would better serve different questions such as classifying atypical software (anomaly detection) or software that behaves similarly (multi-class categorization).
I used tensorflow to implement the neural network.
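The report does not record the exact architecture, so the following is only an illustrative Keras sketch of a binary classifier over a feature vector like the one described above; the layer sizes, optimizer, and feature count are assumptions:

```python
import tensorflow as tf

NUM_FEATURES = 7000  # assumption: roughly the feature count noted later in this report

model = tf.keras.Sequential([
    tf.keras.Input(shape=(NUM_FEATURES,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # outputs P(malicious)
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=["accuracy"])
```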
For cross-validation, I split the labelled data into 3 parts:
- 60% for training
- 40% for testing and cross-validation, split evenly into 20% for each
This technique was highlighted in the machine learning courses I took as a good fit for this particular situation.
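The split itself is straightforward; below is a minimal sketch using scikit-learn's train_test_split (whether the original code used scikit-learn is an assumption, and the data here is a random placeholder):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 7000)          # placeholder feature matrix
y = np.random.randint(0, 2, size=1000)  # placeholder benign/malicious labels

# 60% train / 40% held out, then split the held-out 40% evenly into
# 20% test and 20% cross-validation.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0)
X_test, X_cv, y_test, y_cv = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=0)
```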
Initially, I was using both IP addresses (source and destination) as well as the hostname of packets from which it could be retrieved (e.g., HTTP requests and DNS packets). Using those one-hot encoded features, I was getting ~97% classification accuracy.
While writing this report, I concluded that the IP addresses and hostnames were weak signals to include in the feature set and removed them. After reducing the features in that way, the model achieved 99.314% accuracy, confirming my suspicion that removing those weak signals increases the accuracy of the model.
This challenge did not specify how to evaluate results, so I decided to train and then test (on both the testing and cross-validation sets) to determine whether the model was performing well. Finally, I ran the entire data set through the trained model to measure its accuracy on the full set of data.
The final model accuracy is ~99.314% with ~11.37% loss. The results are still clouded by the lack of features that would allow generalization to future inputs.
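As a sketch of that evaluation flow, reusing the hypothetical model and split variables from the earlier sketches:

```python
# Report accuracy and loss on each evaluation set, including the full
# labelled data set as described above.
for name, (features, labels) in {
    "test": (X_test, y_test),
    "cross-validation": (X_cv, y_cv),
    "full labelled set": (X, y),
}.items():
    loss, acc = model.evaluate(features, labels, verbose=0)
    print(f"{name}: accuracy={acc:.4%} loss={loss:.4f}")
```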
- Feature improvements
  - More derived features / fewer OHE features - As of this writing, the URL itself is not included in the feature set, but the domain of identified URLs is, as well as certain IP address features (Class A, etc.). I would prefer to do string analysis on the URL (counting non-ASCII characters, vowels, consonants, etc.) instead of the one-hot techniques I employed; a sketch of that kind of analysis appears at the end of this report. The number of features is currently almost 7000, which is huge, and the value of all those one-hot encoded features is not necessarily high. That said, it is what I chose in order to complete this challenge within the time constraints.
  - URL features are not necessarily reliable - The domain is difficult to classify, though reputation engines do this at greater scale, and utilizing that threat intelligence would be useful
  - Utilizing GeoIP to resolve the autonomous system number and geographic region would provide stronger features than simply using IP addresses
  - More application layer protocols would provide a richer feature set
    - SSH
    - FTP
    - SMTP
    - SMB
- Technique improvements
  - I expect more in-depth analysis of the packets would yield better results. For example, I did not inspect HTTP responses because they were difficult to parse in the time allotted. Also, inspecting HTTPS packets could help identify whether a certificate was compromised; a certificate reputation engine would be useful there
  - Integrate more threat intelligence - Integrating a WHOIS service would allow me to enrich the IP address data with stronger features, including GeoIP and ASN, as well as potentially the registrar
- Evaluation improvements
  - I realized while writing this that I could have omitted one of the malicious data sets to determine whether the model could generalize to detect it as malware (without having been trained on that data). Further, I could identify additional pcap data sets to determine whether the model properly classifies them. I did not perform these steps due to time constraints.
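As a sketch of the string analysis proposed in the feature improvements above (the function and feature names are mine, purely illustrative):

```python
def url_string_features(url: str) -> dict:
    """Illustrative character-composition features for a URL string."""
    lower = url.lower()
    vowels = sum(c in "aeiou" for c in lower)
    consonants = sum(c.isalpha() for c in lower) - vowels
    digits = sum(c.isdigit() for c in url)
    non_ascii = sum(ord(c) > 127 for c in url)
    n = len(url) or 1  # guard against division by zero on empty input
    return {
        "url_length": len(url),
        "vowel_ratio": vowels / n,
        "consonant_ratio": consonants / n,
        "digit_ratio": digits / n,
        "non_ascii_count": non_ascii,
    }

print(url_string_features("http://x9f3qz7kd2.evil.ga/payload"))
```

A handful of numeric features like these would replace thousands of one-hot columns while still capturing the character-composition signal.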