The exponential growth in user and application data entails new means for providing fault tolerance and protection against data loss. High Performance Com- puting (HPC) storage systems, which are at the forefront of handling the data del- uge, typically employ hardware RAID at the backend. However, such solutions are costly, do not ensure end-to-end data integrity, and can become a bottleneck during data reconstruction. In this paper, we design an innovative solution to achieve a flex- ible, fault-tolerant, and high-performance RAID-6 solution for a parallel file system (PFS). Our system utilizes low-cost, strategically placed GPUs — both on the client and server sides — to accelerate parity computation. In contrast to hardware-based approaches, we provide full control over the size, length and location of a RAID array on a per file basis, end-to-end data integrity checking, and parallelization of RAID array reconstruction. We have deployed our system in conjunction with the widely-used Lustre PFS, and show that our approach is feasible and imposes ac- ceptable overhead.
|Title of host publication||Handbook on Data Centers|
|Editors||Samee Ullah Khan, Albert Y. Zomaya|
|Publication status||Published - 2015|