Workload-aware DRAM error prediction using machine learning

Lev Mukhanov, Konstantinos Tovletoglou, Hans Vandierendonck, Dimitrios Nikolopoulos, Georgios Karakonstantis

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

19 Citations (Scopus)
480 Downloads (Pure)

Abstract

The aggressive scaling of technology may have helped to meet the growing demand for higher memory capacity and density, but it has also made DRAM cells more prone to errors. This has triggered considerable interest in modeling DRAM behavior, either to predict errors in advance or to adjust DRAM circuit parameters for a better tradeoff between energy efficiency and reliability. Existing modeling efforts have studied the impact of a few operating parameters and temperature on DRAM reliability using custom FPGA setups; however, they neglected the combined effect of workload-specific features, which can be systematically investigated only on a real system. In this paper, we present the results of our study on workload-dependent DRAM error behavior within a real server, considering various operating parameters such as the refresh rate, voltage and temperature. We show that the rate of single- and multi-bit errors may vary across workloads by 8x, indicating that program-inherent features can affect DRAM reliability significantly. Based on this observation, we extract 249 features, such as the memory access rate, the rate of cache misses, the memory reuse time and data entropy, from various compute-intensive, caching and analytics benchmarks. We apply several supervised learning methods to construct the DRAM error behavior model for 72 server-grade DRAM chips using the memory operating parameters and the extracted program-inherent features. Our results show that, with an appropriate choice of program features and supervised learning method, the rate of single- and multi-bit errors can be predicted for a specific DRAM module with an average error of less than 10.5%, as opposed to the 2.9x estimation error obtained for a conventional workload-unaware error model. Our model enables designers to predict DRAM errors in advance in less than a second and to study the impact of any workload and applied software optimizations on DRAM reliability.
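As a rough illustration of the modeling idea in the abstract, the sketch below predicts an error rate from a vector that combines operating parameters with workload features. Everything in it is an assumption made for illustration: the abstract does not name the learning methods used, so k-nearest-neighbour regression is swapped in as one example of a supervised learner; the five features stand in for the 249 extracted ones; and the sample values are made up, not measurements from the paper.

```python
from math import dist

def knn_predict(samples, query, k=3):
    """Average the targets of the k training samples closest to `query`."""
    nearest = sorted(samples, key=lambda s: dist(s[0], query))[:k]
    return sum(target for _, target in nearest) / len(nearest)

def mean_relative_error(predicted, measured):
    """Average of |predicted - measured| / measured over all pairs."""
    pairs = list(zip(predicted, measured))
    return sum(abs(p - m) / m for p, m in pairs) / len(pairs)

# Hypothetical, made-up samples (NOT data from the paper). Each feature is
# assumed to be pre-scaled to [0, 1]:
# (refresh_period, voltage, temperature, access_rate, data_entropy) -> error rate
SAMPLES = [
    ((0.2, 0.9, 0.3, 0.1, 0.5), 4.0),
    ((0.2, 0.9, 0.3, 0.8, 0.9), 12.0),
    ((0.9, 0.5, 0.8, 0.1, 0.5), 40.0),
    ((0.9, 0.5, 0.8, 0.8, 0.9), 95.0),
]

# Same operating point as the last two samples, with a memory-heavy workload.
query = (0.9, 0.5, 0.8, 0.7, 0.8)
estimate = knn_predict(SAMPLES, query, k=1)
print(estimate)
```

In this sketch a workload-unaware model would correspond to dropping the last two features, so that samples sharing an operating point become indistinguishable even though their error rates differ.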

Original language: English
Title of host publication: IEEE International Symposium on Workload Characterization (IISWC) 2019: Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 106-118
Number of pages: 13
ISBN (Electronic): 9781728140452
ISBN (Print): 9781728140469
Publication status: Published - 19 Mar 2020
Event: IEEE International Symposium on Workload Characterization - Orlando, United States
Duration: 03 Nov 2019 to 05 Nov 2019
http://www.iiswc.org/iiswc2019/index.html

Conference

Conference: IEEE International Symposium on Workload Characterization
Country/Territory: United States
City: Orlando
Period: 03/11/2019 to 05/11/2019
