Code and Data Transformations to Address Garbage Collector Performance in Big Data Processing

Research output: Chapter in Book/Report/Conference proceeding · Conference contribution



Java, with its dynamic runtime environment and garbage-collected (GC) memory management, is a very popular choice for big data processing engines. Its runtime provides convenient mechanisms to implement workload distribution without requiring direct memory allocation and deallocation. However, efficient memory usage is a recurring issue. In particular, our evaluation shows that garbage collection has serious drawbacks when handling a large number of data objects: more than 60% of execution time can be consumed by garbage collection. We present a set of unconventional strategies to counter this issue that rely on data layout transformations to drastically reduce the number of objects, and on changing the code structure to reduce the lifetime of objects. We encapsulate the implementation in Apache Spark, making it transparent to software developers. Our preliminary results show an average speedup of 1.54× and a maximum of 8.23× over a range of applications, datasets and GC types. In practice, this can provide a substantial reduction in execution time, or allow a sizeable reduction in the amount of compute power needed for the same task.
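To illustrate the kind of data layout transformation the abstract refers to (this is a minimal sketch, not the paper's actual implementation), the example below contrasts a row-oriented layout, where N records allocate N heap objects the collector must trace, with a columnar layout that stores the same data in a fixed number of primitive arrays. The class and field names are purely illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Row-oriented layout: one heap object per record, so N records
// produce N objects for the garbage collector to trace.
class PointRow {
    final double x, y;
    PointRow(double x, double y) { this.x = x; this.y = y; }
}

// Columnar layout: the same data in two primitive arrays. The object
// count is constant regardless of N, which sharply reduces GC work.
class PointColumns {
    final double[] x, y;
    PointColumns(int n) { x = new double[n]; y = new double[n]; }

    double sumX() {
        double s = 0;
        for (double v : x) s += v;
        return s;
    }
}

public class LayoutSketch {
    public static void main(String[] args) {
        int n = 1_000_000;

        // Row layout: one million PointRow objects on the heap.
        List<PointRow> rows = new ArrayList<>(n);
        for (int i = 0; i < n; i++) rows.add(new PointRow(i, -i));

        // Columnar layout: two double[] arrays hold the same data.
        PointColumns cols = new PointColumns(n);
        for (int i = 0; i < n; i++) { cols.x[i] = i; cols.y[i] = -i; }

        double rowSum = 0;
        for (PointRow p : rows) rowSum += p.x;

        // Both layouts compute the same answer; only the object count differs.
        System.out.println(rowSum == cols.sumX());
    }
}
```

The same idea underlies columnar formats used in data processing engines; the paper additionally restructures code so that temporary objects die quickly, which this sketch does not show.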
Original language: English
Title of host publication: 25th IEEE International Conference on High-Performance Computing, Data and Analytics
Publisher: IEEE
Number of pages: 10
ISBN (Electronic): 978-1-5386-8386-6
Publication status: Published - 11 Feb 2019

Publication series
ISSN (Electronic): 2640-0316

