MS-BioGraphs MS

Dataset

Description

This dataset contains an edge-weighted graph that represents the similarity between protein sequences of the Metaclust dataset. The LAST sequence alignment algorithm (https://gitlab.com/mcfrith/last) was used to match the sequences, and the graph is compressed in WebGraph format (https://webgraph.di.unimi.it).

This dataset is the whole graph with 1.7 billion vertices and 2.5 trillion edges.
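
To give a feel for this scale, the following back-of-the-envelope sketch derives the average number of edges per vertex from the rounded figures above, and estimates the size of a naive uncompressed edge list. The 8-bytes-per-edge layout (a 4-byte neighbour ID plus a 4-byte weight) is an illustrative assumption, not the dataset's actual storage format, and the function names are hypothetical.

```python
# Rounded figures quoted in the description above.
NUM_VERTICES = 1.7e9
NUM_EDGES = 2.5e12

# Assumption (not from the dataset documentation): an uncompressed edge
# is stored as a 4-byte neighbour ID plus a 4-byte weight.
BYTES_PER_EDGE_ASSUMED = 8


def average_degree(num_vertices, num_edges):
    """Average number of stored edges per vertex."""
    return num_edges / num_vertices


def uncompressed_terabytes(num_edges, bytes_per_edge):
    """Size of a flat edge list in terabytes (1 TB = 10**12 bytes)."""
    return num_edges * bytes_per_edge / 1e12


if __name__ == "__main__":
    degree = average_degree(NUM_VERTICES, NUM_EDGES)
    size_tb = uncompressed_terabytes(NUM_EDGES, BYTES_PER_EDGE_ASSUMED)
    print(f"average edges per vertex: about {degree:.0f}")
    print(f"naive edge list: about {size_tb:.0f} TB")
```

Under these assumptions the graph averages roughly 1,470 edges per vertex and a flat edge list would need about 20 TB, which suggests why a compressed representation such as WebGraph is used.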

For more information about the dataset, its features, and downloading, please visit: https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-MS

For more information about the creation of the dataset, please visit: https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-Sequence-Similarity-Graph-Datasets/

For sample code for loading and validating the dataset, please refer to: https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs-Validation/

The following link contains the other datasets of this family: https://blogs.qub.ac.uk/DIPSA/MS-BioGraphs

Date made available: 10 Aug 2023
Publisher: Queen's University Belfast
Date of data production: 2022
