Dataset Announcement: MS-BioGraphs, Trillion-Scale Public Real-World Sequence Similarity Graphs

Mohsen Koohi Esfahani, Paolo Boldi, Hans Vandierendonck, Peter Kilpatrick, Sebastiano Vigna

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)


Progress in High-Performance Computing in general, and High-Performance Graph Processing in particular, is highly dependent on the availability of publicly-accessible, relevant, and realistic data sets.

In this paper, we announce the publication of MS-BioGraphs, a new family of publicly available, real-world, edge-weighted graph datasets with up to 2.5 trillion edges, which is 6.6 times larger than the largest graph published recently.

We briefly review the two main challenges we faced in generating large graph datasets and our solutions, namely (i) optimizing data structures and algorithms for this multi-step process and (ii) the WebGraph parallel compression technique. We also study some characteristics of MS-BioGraphs.
Original language: English
Title of host publication: 2023 IEEE International Symposium on Workload Characterization (IISWC’23)
Publication status: Published - Oct 2023


  • high-performance computing
  • Graph datasets
  • graphs
  • graph algorithms

