Abstract
The rapid growth in the volume of processed data is driving aggressive DRAM scaling to meet the demand for higher memory density and bandwidth. As a result of this demand, the memory subsystem is projected to account for a considerable portion of the overall power consumption in most multicore systems. However, DRAM scaling is hampered by the adoption of pessimistic circuit parameters, which are selected for reliable operation under worst-case conditions. Such an approach may guarantee error-free storage of data, but the power and performance overheads it incurs raise doubts about its efficiency in the future. This dissertation focuses on the characterization and modeling of DRAM behaviour under non-nominal circuit parameters, and on the design of energy-saving techniques for real systems that preserve reliable system operation.
Initially, we present a characterization of DRAM reliability under relaxed circuit parameters and various operating conditions. To this end, we have developed an experimental framework on two server systems, together with a thermal testbed that controls the DRAM temperature. This allows us to understand the major effects of workloads on DRAM error behaviour under realistic conditions. We analyze the correlation between DRAM error behaviour and the circuit parameters, temperature, and workload-dependent features. We then apply supervised learning methods to construct a prediction model of DRAM error behaviour based on the main features identified by our characterization. The prediction allows us to relax the DRAM circuit parameters just enough to avoid errors while achieving the maximum possible energy savings. We also develop a benchmark that can stress the system even when the server is deployed in the field.
Furthermore, we propose a heterogeneous-reliability memory framework that allocates critical data in a reliable memory domain, while the rest of the data are allocated in a variably-reliable memory domain. Critical data are thus stored reliably in memory that operates under nominal circuit parameters, whereas data that can tolerate errors are stored in more energy-efficient memory that operates under relaxed circuit parameters. We introduce a programming interface that exposes the capabilities of the framework to users, and a governor that scales the DRAM circuit parameters dynamically. We extend the system with a checkpoint and restart mechanism to ensure that data can be restored even in the worst case. Because the framework is implemented on a real system, it also enables users to evaluate the fault tolerance and approximation techniques of their applications.
Finally, we devise software techniques that exploit the refresh-by-access property of DRAM. We change the order in which accesses reach the memory controller by re-ordering the tasks of an application, so as to minimize the time that data reside in memory without being accessed. This in turn decreases the probability of errors. We extend our methodology into an application-specific compiler that bounds the access interval for all application data. We achieve this by controlling the size and order of the tasks while tracing the data accessed by each task. To study the refresh-by-access property, we develop a simulator that uses binary instrumentation to measure the interval between accesses quickly. We use the outputs of the simulator to better understand the refresh-by-access property and to improve existing DRAM fault injection schemes. By taking into account the actual duration for which data were stored in memory, we obtain more representative fault injection for DRAM operating under relaxed circuit parameters.
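As an illustration of the access-interval measurement described above, the following is a minimal sketch and not the simulator developed in the dissertation. It assumes a hypothetical trace of `(timestamp, byte_address)` pairs for accesses that actually reach DRAM, a hypothetical fixed row size (`ROW_BYTES`), and a simple linear address-to-row mapping; real mappings are specific to the memory controller. The function names are ours. Since activating a row restores the charge of its cells, a row whose longest gap between accesses stays below the relaxed refresh period is implicitly refreshed by the accesses themselves.

```python
# Hypothetical row size; real address-to-row mappings are controller-specific.
ROW_BYTES = 8192


def max_access_intervals(trace, end_time, row_bytes=ROW_BYTES):
    """Return the longest interval (seconds) without an access, per DRAM row.

    trace: iterable of (timestamp_seconds, byte_address), sorted by time.
    The gap between the last access to a row and end_time is also counted,
    because the data stay in memory until the end of the observed window.
    """
    last_seen = {}  # row -> timestamp of the most recent access
    worst = {}      # row -> longest gap observed so far
    for timestamp, address in trace:
        row = address // row_bytes
        if row in last_seen:
            gap = timestamp - last_seen[row]
            worst[row] = max(worst.get(row, 0.0), gap)
        last_seen[row] = timestamp
    for row, timestamp in last_seen.items():
        worst[row] = max(worst.get(row, 0.0), end_time - timestamp)
    return worst


def rows_at_risk(trace, end_time, refresh_period, row_bytes=ROW_BYTES):
    """Rows whose longest access gap exceeds the given refresh period."""
    intervals = max_access_intervals(trace, end_time, row_bytes)
    return sorted(row for row, gap in intervals.items() if gap > refresh_period)


if __name__ == "__main__":
    # Illustrative trace only; timestamps and addresses are made up.
    example = [(0.0, 0), (0.5, 100), (1.0, 8192), (3.5, 8200), (4.0, 0)]
    print(max_access_intervals(example, end_time=4.0))
    print(rows_at_risk(example, end_time=4.0, refresh_period=3.0))
```

Under these assumptions, the per-row gaps could also serve as the storage duration fed to a fault injector, so that errors are injected only into data that stayed unaccessed for longer than the relaxed refresh period.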
Date of Award | Jul 2021 |
---|---|
Original language | English |
Awarding Institution | |
Sponsors | EC-Horizon 2020 |
Supervisor | Georgios Karakonstantis (Supervisor), Dimitrios S. Nikolopoulos (Supervisor) & Maire O'Neill (Supervisor) |
Keywords
- DRAM
- memory
- energy efficiency
- error characterization
- real system
- relaxed circuit parameters
- heterogeneous-reliability memory
- dependable systems
- modeling
- design
- refresh rate