On overcoming HPC challenges of trillion-scale real-world graph datasets

Mohsen Koohi Esfahani, Paolo Boldi, Hans Vandierendonck, Peter Kilpatrick, Sebastiano Vigna

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Citation (Scopus)

Abstract

Progress in High-Performance Computing in general, and High-Performance Graph Processing in particular, is highly dependent on the availability of publicly accessible, relevant, and realistic datasets. To ensure continuation of this progress, we (i) investigate and optimize the process of generating large sequence similarity graphs as an HPC challenge and (ii) demonstrate this process in creating MS-BioGraphs, a new family of publicly available real-world edge-weighted graph datasets with up to 2.5 trillion edges, that is, 6.6 times larger than the largest recently published graph. The largest graph is created by matching (i.e., all-to-all similarity aligning) 1.7 billion protein sequences. The MS-BioGraphs family also includes seven subgraphs with different sizes and direction types. We describe the two main challenges we faced in generating large graph datasets and our solutions, namely (i) optimizing data structures and algorithms for this multi-step process and (ii) the WebGraph parallel compression technique. The datasets are available online at https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs.


Original language: English
Title of host publication: 2023 IEEE International Conference on Big Data (BigData): proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9798350324457
ISBN (Print): 9798350324464
Publication status: Published - 22 Jan 2024
Event: 2023 IEEE International Conference on Big Data - Sorrento, Italy
Duration: 15 Dec 2023 to 18 Dec 2023
https://doi.org/10.1109/BigData59044.2023

Conference

Conference: 2023 IEEE International Conference on Big Data
Abbreviated title: BigData 2023
Country/Territory: Italy
City: Sorrento
Period: 15/12/2023 to 18/12/2023
