A Cost Analysis of Machine Learning Using Dynamic Runtime Opcodes for Malware Detection

    Research output: Contribution to journal › Article


    The ongoing battle between malware distributors and those seeking to prevent the onslaught of malicious code has, so far, favored the former. Anti-virus methods are faltering under the rapid evolution and distribution of new malware, and obfuscation and detection-evasion techniques exacerbate the issue. Recent research has monitored low-level opcodes to detect malware. Such dynamic analysis reveals the code at runtime, allowing the true behaviour to be examined. While previous research has used machine learning techniques to accurately detect malware using dynamic runtime opcodes, the underpinning datasets have been poorly sampled and inadequate in size. Further, those datasets are always of fixed size and no attempt, to our knowledge, has been made to examine the cost of retraining malware classification models on datasets that grow continually. Although researchers discuss the explosion of malware in the literature, opcode analyses have given no regard to how their models will cope with retraining on escalating datasets. The research presented here examines this problem and makes several novel contributions to the current body of knowledge. First, the performance of 23 machine learning algorithms is investigated on the largest run-trace dataset in the literature. Second, following an extensive hyperparameter selection process, the performance of each classifier is compared on both accuracy and computational cost (CPU time). Lastly, the cost of retraining and testing updatable and non-updatable classifiers, both parallelized and non-parallelized, is examined with simulated escalating datasets. This provides insight into how implemented malware classifiers would perform under dataset escalation. We find that parallelized RandomForest, using 4 cores, provides the optimal performance, with high accuracy and low training and testing times.
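    The retraining-cost comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' experimental setup: the nearest-centroid model is a toy stand-in for an updatable classifier, the data are synthetic vectors rather than real opcode traces, and every name here (CentroidClassifier, make_batch, simulate_escalation) is hypothetical. It only shows how CPU time might be recorded when a model is rebuilt from scratch versus updated incrementally as a dataset grows.

```python
import random
import time


class CentroidClassifier:
    """Toy updatable model: one running mean vector per class (illustrative only)."""

    def __init__(self, n_features):
        self.sums = {0: [0.0] * n_features, 1: [0.0] * n_features}
        self.counts = {0: 0, 1: 0}

    def update(self, X, y):
        # Fold new samples into the running sums; earlier samples are not revisited.
        for row, label in zip(X, y):
            sums = self.sums[label]
            for i, value in enumerate(row):
                sums[i] += value
            self.counts[label] += 1

    def predict(self, row):
        # Return the label whose class mean is nearest (squared Euclidean distance).
        best_label, best_dist = None, float("inf")
        for label, sums in self.sums.items():
            n = self.counts[label]
            if n == 0:
                continue
            dist = sum((v - s / n) ** 2 for v, s in zip(row, sums))
            if dist < best_dist:
                best_label, best_dist = label, dist
        return best_label


def make_batch(size, n_features, rng):
    # Synthetic stand-in for opcode-frequency vectors; label 1 plays the role of "malware".
    X, y = [], []
    for _ in range(size):
        label = rng.randint(0, 1)
        X.append([rng.gauss(float(label), 1.0) for _ in range(n_features)])
        y.append(label)
    return X, y


def simulate_escalation(n_batches=5, batch_size=200, n_features=20, seed=0):
    rng = random.Random(seed)
    seen_X, seen_y = [], []
    incremental = CentroidClassifier(n_features)
    probe, _ = make_batch(50, n_features, rng)  # fixed rows used to compare both models
    report = []
    for _ in range(n_batches):
        X, y = make_batch(batch_size, n_features, rng)
        seen_X.extend(X)
        seen_y.extend(y)

        # Non-updatable path: rebuild the model on everything seen so far.
        start = time.process_time()
        scratch = CentroidClassifier(n_features)
        scratch.update(seen_X, seen_y)
        scratch_cpu = time.process_time() - start

        # Updatable path: fold in only the newly arrived batch.
        start = time.process_time()
        incremental.update(X, y)
        incremental_cpu = time.process_time() - start

        report.append({
            "n_samples": len(seen_X),
            "scratch_cpu_s": scratch_cpu,
            "incremental_cpu_s": incremental_cpu,
            "models_agree": all(
                scratch.predict(r) == incremental.predict(r) for r in probe
            ),
        })
    return report


if __name__ == "__main__":
    for row in simulate_escalation():
        print(row)
```

    The sketch uses time.process_time so that the figures reflect CPU time rather than wall-clock time, in line with the cost measure named in the abstract. It does not cover parallelization, and the timings it prints say nothing about the classifiers evaluated in the paper.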


    • A Cost Analysis of Machine Learning Using Dynamic Runtime Opcodes for Malware Detection

      Rights statement: Copyright 2019 Elsevier. This manuscript is distributed under a Creative Commons Attribution-NonCommercial-NoDerivs License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which permits distribution and reproduction for non-commercial purposes, provided the author and source are cited.

      Accepted author manuscript, 888 KB, PDF-document

      Embargo ends: 07/05/2020


    Original language: English
    Pages (from-to): 138-155
    Journal: Computers & Security
    Journal publication date: 01 Aug 2019
    Early online date: 07 May 2019
    Publication status: Published - 01 Aug 2019

      Research areas

    • malware, machine learning, Cybersecurity, research, innovation, trust, resilience, translation, security

    ID: 164943350