Abstract
Improving the energy efficiency of DRAM is becoming very challenging due to the growing demand for storage capacity and to failures induced by the manufacturing process. To protect against failures, vendors adopt conservative margins for the refresh period and supply voltage. Prior work has shown that these margins are overly pessimistic and will become impractical due to high power costs, especially in future DRAM technologies.
In this paper, we present a new technique for automatically scaling the DRAM refresh period under reduced supply voltage while minimizing the probability of failures. The main idea behind the proposed approach is that DRAM error behavior is workload-dependent and can be predicted from inherent program features. We use a Machine Learning (ML) method to build a workload-aware DRAM error behavior model based on program features extracted from real workloads during our DRAM error characterization campaign. With such a model, we identify the marginal value of the DRAM refresh period under relaxed voltage for each DRAM module of a server, which enables us to reduce DRAM power.
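To make the idea of a workload-aware error model concrete, the sketch below trains a classifier that predicts whether a given (refresh period, voltage) operating point is error-free from program features, then searches for the largest safe refresh period. The feature set, the synthetic labels standing in for the characterization campaign, and the scikit-learn model are illustrative assumptions, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical per-workload program features (e.g. from performance counters)
# plus the DRAM operating point under test: refresh period (ms) and voltage (V).
n = 2000
X = np.column_stack([
    rng.uniform(0, 1, n),        # memory-access intensity
    rng.uniform(0, 1, n),        # row-buffer locality
    rng.uniform(0.1, 16, n),     # memory footprint (GB)
    rng.uniform(64, 1024, n),    # refresh period (ms)
    rng.uniform(1.1, 1.35, n),   # supply voltage (V)
])
# Synthetic stand-in for characterization outcomes:
# 1 = at least one error observed at this operating point, 0 = error-free.
y = ((X[:, 3] > 512) & (X[:, 4] < 1.2) & (X[:, 0] > 0.5)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(f"held-out accuracy: {model.score(X_te, y_te):.2f}")

def marginal_refresh(features, vdd, p_max=0.01):
    """Largest refresh period whose predicted error probability stays below p_max."""
    best = 64.0  # JEDEC-nominal 64 ms as the safe fallback
    for refresh_ms in np.arange(64, 1025, 32):
        p_err = model.predict_proba([[*features, refresh_ms, vdd]])[0, 1]
        if p_err < p_max:
            best = refresh_ms
    return best

print("marginal refresh period:", marginal_refresh([0.3, 0.8, 4.0], vdd=1.2), "ms")
```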
We implement a temperature-driven OS governor that automatically sets the module-specific marginal DRAM parameters discovered by the ML model. Our governor reduces DRAM power by 24% on average while minimizing the probability of failures. Unlike previous studies, our technique: i) does not require intrusive hardware changes; ii) is implemented on a real server; iii) uses a mechanism that prevents abnormal DRAM error behavior; and iv) can be easily deployed in data centers.
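As a rough illustration of such a governor, the sketch below polls per-DIMM temperature and applies the most aggressive ML-derived operating point whose temperature bin covers the current reading, falling back to nominal settings when it gets too hot. The table values, the sensor read, and the apply_dram_params() hook are hypothetical placeholders for the platform-specific mechanism (e.g. memory-controller registers exposed via BIOS/BMC).

```python
import time

# Hypothetical per-module table produced offline by the ML model:
# list of (max temperature in deg C, (refresh period in ms, voltage in V)).
SAFE_POINT = (64, 1.35)  # JEDEC-nominal fallback
TABLE = {
    "DIMM0": [(45, (512, 1.20)), (55, (256, 1.25)), (999, SAFE_POINT)],
    "DIMM1": [(45, (384, 1.20)), (55, (192, 1.25)), (999, SAFE_POINT)],
}

def read_temp(dimm):
    # Placeholder: a real governor would read the DIMM thermal sensor,
    # e.g. a hwmon/jc42 sysfs node (exact path varies by platform).
    return 42.0

def apply_dram_params(dimm, refresh_ms, vdd):
    # Placeholder for the platform-specific write of refresh/voltage settings.
    print(f"{dimm}: refresh={refresh_ms} ms, vdd={vdd} V")

def governor_step():
    for dimm, bins in TABLE.items():
        t = read_temp(dimm)
        # Pick the most aggressive (longest-refresh) point whose
        # temperature bin still covers the current reading.
        for t_max, (refresh_ms, vdd) in bins:
            if t <= t_max:
                apply_dram_params(dimm, refresh_ms, vdd)
                break

if __name__ == "__main__":
    for _ in range(3):  # a real governor would loop indefinitely
        governor_step()
        time.sleep(5)   # polling period is a tuning choice
```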
| Original language | English |
|---|---|
| Journal | IEEE Transactions on Computers |
| Early online date | 26 Oct 2020 |
| DOIs | |
| Publication status | Early online date - 26 Oct 2020 |
Keywords
- DRAM, GuardBands, reliability, low-power electronics, energy consumption