PEVuln: a benchmark dataset for using machine learning to detect vulnerabilities in PE malware

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper, we present a benchmark dataset for training and evaluating static machine learning models for PE malware, specifically for detecting known vulnerabilities in malware. Our goal is to enable further research into defending against malware by exploiting its bugs or weaknesses. Current malware datasets have limitations regarding exploitable malware. To address these gaps, we use the malware vulnerability database Malvuln and the software vulnerability database ExploitDB to create a new dataset of 864 vulnerable malware samples, 35,241 non-vulnerable malware samples, 1,425 vulnerable benign samples, and 7,905 non-vulnerable benign samples, annotated with timestamps, families, threat mapping, vulnerability mapping, and obfuscation analysis. This 4-class dataset lays a foundation for future research on analysing and exploiting vulnerabilities in malware using machine learning. We also provide baseline results using state-of-the-art malware classification models to benchmark performance on the dataset: the binary tasks achieve F1 scores above 0.90, and the multi-class task attains an F1 score of 0.958.
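The four class counts above imply a strongly imbalanced dataset, which matters when reading the reported F1 scores. The following is a minimal sketch that only restates the counts from the abstract and derives each class's share. The label names are hypothetical (the abstract does not give the dataset's actual label strings), and the inverse-frequency weighting is a common heuristic shown for illustration, not something the authors are stated to have used.

```python
# Class counts exactly as stated in the abstract.
# The key names are hypothetical labels chosen for this sketch.
counts = {
    "vulnerable_malware": 864,
    "non_vulnerable_malware": 35_241,
    "vulnerable_benign": 1_425,
    "non_vulnerable_benign": 7_905,
}

total = sum(counts.values())

# Fraction of the whole dataset that each class represents.
shares = {name: n / total for name, n in counts.items()}

# Inverse-frequency weights: total / (number_of_classes * class_count).
# Assumption: an illustrative heuristic, not the paper's training setup.
weights = {name: total / (len(counts) * n) for name, n in counts.items()}

for name, n in counts.items():
    print(f"{name:24s} {n:>6d}  share={shares[name]:.2%}  weight={weights[name]:.2f}")
```

The counts sum to 45,435 samples, of which the vulnerable malware class makes up under 2%, so a per-class or macro-averaged metric is more informative than overall accuracy for this kind of split.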
Original language: English
Title of host publication: Proceedings of the Conference on Applied Machine Learning for Information Security (CAMLIS 2024)
Publisher: IEEE Xplore
Publication status: Accepted - 04 Aug 2024
Event: Conference on Applied Machine Learning for Information Security - Arlington, United States
Duration: 24 Oct 2024 to 25 Oct 2024
Conference number: 2024
https://www.camlis.org/

Conference

Conference: Conference on Applied Machine Learning for Information Security
Abbreviated title: CAMLIS
Country/Territory: United States
City: Arlington
Period: 24/10/2024 to 25/10/2024
