Java, with its dynamic runtime environment and garbage-collected (GC) memory management, is a very popular choice for big data processing engines. Its runtime provides convenient mechanisms for distributing workloads without having to worry about explicit memory allocation and deallocation. However, efficient memory usage is a recurring issue. In particular, our evaluation shows that garbage collection incurs severe overheads when handling a large number of data objects: more than 60% of execution time can be consumed by garbage collection. We present a set of unconventional strategies to counter this issue that rely on data layout transformations to drastically reduce the number of objects, and on restructuring the code to shorten the lifetime of objects. We encapsulate the implementation in Apache Spark, making it transparent to software developers. Our preliminary results show an average speedup of 1.54 and a maximum of 8.23 over a range of applications, datasets and GC types. In practice, this can provide a substantial reduction in execution time or allow a sizeable reduction in the amount of compute power needed for the same task.
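To illustrate the kind of data layout transformation the abstract refers to (this is a minimal sketch, not the paper's actual implementation), consider replacing a row-oriented collection of record objects with a columnar layout of primitive arrays. The class and method names below are hypothetical; the point is that the columnar form allocates two arrays for N records instead of N heap objects, which is the object-count reduction that relieves GC pressure:

```java
import java.util.ArrayList;
import java.util.List;

public class LayoutDemo {
    // Row layout: one heap object per record, so N records mean
    // N objects the garbage collector must trace.
    static final class Point {
        final double x, y;
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    static double sumRowLayout(List<Point> pts) {
        double s = 0;
        for (Point p : pts) s += p.x + p.y;
        return s;
    }

    // Columnar layout: the same N records stored in two primitive
    // arrays, so the GC sees a constant number of objects.
    static final class PointColumns {
        final double[] x, y;
        PointColumns(double[] x, double[] y) { this.x = x; this.y = y; }
    }

    static double sumColumnar(PointColumns c) {
        double s = 0;
        for (int i = 0; i < c.x.length; i++) s += c.x[i] + c.y[i];
        return s;
    }

    public static void main(String[] args) {
        int n = 1000;
        List<Point> rows = new ArrayList<>();
        double[] xs = new double[n], ys = new double[n];
        for (int i = 0; i < n; i++) {
            rows.add(new Point(i, 2.0 * i));   // n record objects
            xs[i] = i;
            ys[i] = 2.0 * i;                   // 2 arrays total
        }
        PointColumns cols = new PointColumns(xs, ys);
        // Both layouts compute the same aggregate.
        System.out.println(sumRowLayout(rows) == sumColumnar(cols));
    }
}
```

The second idea mentioned in the abstract, shortening object lifetimes, is complementary: short-lived objects die in the young generation and are cheap to collect, whereas objects that survive into the old generation (as `rows` above would under load) make full collections expensive.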
Title of host publication: 25th IEEE International Conference on High-Performance Computing, Data and Analytics
Number of pages: 10
Publication status: Published - 11 Feb 2019
Fenacci, D., Vandierendonck, H., & Nikolopoulos, D. (2019). Code and Data Transformations to Address Garbage Collector Performance in Big Data Processing. In 25th IEEE International Conference on High-Performance Computing, Data and Analytics. IEEE. https://doi.org/10.1109/HiPC.2018.00040