The ongoing battle between malware distributors and those seeking to prevent the onslaught of malicious code has, so far, favoured the former. Anti-virus methods are faltering under the rapid evolution and distribution of new malware, with obfuscation and detection-evasion techniques exacerbating the issue. Recent research has monitored low-level opcodes to detect malware. Such dynamic analysis reveals the code at runtime, allowing its true behaviour to be examined. While previous research has used machine learning techniques to detect malware accurately from dynamic runtime opcodes, the underpinning datasets have been poorly sampled and inadequate in size. Further, these datasets are always fixed in size, and no attempt, to our knowledge, has been made to examine the cost of retraining malware classification models on datasets that grow continually. Researchers in the literature discuss the explosion of malware, yet opcode analyses have relied on fixed-size datasets, with no consideration of how such models will cope with retraining on escalating datasets. The research presented here examines this problem and makes several novel contributions to the current body of knowledge. First, the performance of 23 machine learning algorithms is investigated on the largest run-trace dataset in the literature. Second, following an extensive hyperparameter selection process, the performance of each classifier is compared on both accuracy and computational cost (CPU time). Lastly, the cost of retraining and testing updatable and non-updatable classifiers, both parallelized and non-parallelized, is examined with simulated escalating datasets. This provides insight into how deployed malware classifiers would perform under dataset escalation. We find that parallelized RandomForest, using 4 cores, provides the optimal performance, with high accuracy and low training and testing times.
- machine learning
- cybersecurity
- research
- innovation
- trust
- resilience
- translation
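The retraining-cost comparison the abstract describes can be sketched in scikit-learn. This is a minimal illustration, not the paper's method: the synthetic data, chunk sizes, and hyperparameters are assumptions standing in for the opcode run-trace dataset. It contrasts a non-updatable, parallelized RandomForest (retrained from scratch on the full accumulated data, `n_jobs=4`) with an updatable classifier (`SGDClassifier`, incrementally trained via `partial_fit` on only the newest chunk).

```python
# Sketch: retraining cost of a non-updatable vs. an updatable classifier
# on a simulated escalating dataset. Illustrative only; data and
# hyperparameters are placeholders, not the paper's opcode run traces.
import time

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=6000, n_features=50, random_state=0)
classes = np.unique(y)

# Updatable model: persists across escalation steps, sees only new chunks.
sgd = SGDClassifier(random_state=0)

for n in (2000, 4000, 6000):  # simulated dataset escalation
    new_chunk = slice(n - 2000, n)

    # Non-updatable model: must retrain on the full accumulated dataset,
    # parallelized across 4 cores.
    rf = RandomForestClassifier(n_estimators=100, n_jobs=4, random_state=0)
    t0 = time.perf_counter()
    rf.fit(X[:n], y[:n])
    rf_time = time.perf_counter() - t0

    # Updatable model: incremental update on the newest chunk only.
    t0 = time.perf_counter()
    sgd.partial_fit(X[new_chunk], y[new_chunk], classes=classes)
    sgd_time = time.perf_counter() - t0

    print(f"n={n}: RF full retrain {rf_time:.3f}s, SGD update {sgd_time:.3f}s")
```

As the dataset escalates, the RandomForest retraining time grows with the accumulated data, while the incremental update cost stays roughly constant; the paper's finding is that the parallelized forest nonetheless offers the best accuracy/cost trade-off.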