Abstract
In this paper, we present a benchmark dataset for training and evaluating static machine learning models on Portable Executable (PE) malware, specifically for detecting known vulnerabilities in malware. Our goal is to enable further research on defending against malware by exploiting its bugs or weaknesses. Existing malware datasets offer limited coverage of exploitable malware. Our dataset addresses this gap by drawing on the malware vulnerability database Malvuln and the software vulnerability database ExploitDB to create a new dataset with 864 vulnerable malware samples, 35,241 non-vulnerable malware samples, 1,425 vulnerable benign samples, and 7,905 non-vulnerable benign samples, annotated with timestamps, families, threat mapping, vulnerability mapping, and obfuscation analysis. This 4-class dataset lays the foundation for future research on malware analysis and vulnerability exploitation using machine learning. We also provide baseline results using state-of-the-art malware classification models to benchmark the dataset: the binary tasks achieve F1 scores above 0.90, and the multi-class task attains an F1 score of 0.958.
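
The class sizes reported above are heavily imbalanced, which matters when reading the F1 scores. The following is a minimal sketch, not the authors' code, that tabulates the four classes from the counts in the abstract. The two marginal groupings (malware vs. benign, vulnerable vs. non-vulnerable) are an assumption made for illustration; the abstract does not state how the binary tasks are defined.

```python
# Class counts as reported in the abstract, keyed by (origin, status).
counts = {
    ("malware", "vulnerable"): 864,
    ("malware", "non-vulnerable"): 35_241,
    ("benign", "vulnerable"): 1_425,
    ("benign", "non-vulnerable"): 7_905,
}

total = sum(counts.values())


def share(predicate):
    """Fraction of all samples whose (origin, status) key satisfies predicate."""
    return sum(n for key, n in counts.items() if predicate(key)) / total


# Per-class share of the full dataset.
for (origin, status), n in counts.items():
    print(f"{status} {origin}: {n} ({n / total:.1%})")

# Hypothetical binary groupings (an assumption, not taken from the paper).
malware_share = share(lambda key: key[0] == "malware")
vulnerable_share = share(lambda key: key[1] == "vulnerable")
print(f"total samples: {total}")
print(f"malware share: {malware_share:.1%}")
print(f"vulnerable share: {vulnerable_share:.1%}")
```

Under these groupings, vulnerable samples make up only about 5% of the dataset, so a per-class or macro-averaged metric is more informative than raw accuracy.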
Original language | English |
---|---|
Title of host publication | Proceedings of the Conference on Applied Machine Learning for Information Security (CAMLIS 2024) |
Publisher | IEEE Xplore |
Publication status | Accepted - 04 Aug 2024 |
Event | Conference on Applied Machine Learning for Information Security, Arlington, United States. Duration: 24 Oct 2024 → 25 Oct 2024. Conference number: 2024. https://www.camlis.org/ |
Conference
Conference | Conference on Applied Machine Learning for Information Security |
---|---|
Abbreviated title | CAMLIS |
Country/Territory | United States |
City | Arlington |
Period | 24/10/2024 → 25/10/2024 |
Internet address | https://www.camlis.org/ |