New data science methods for precision medicine based on topological data analysis

  • Ciara F. Loughrey

Student thesis: Masters Thesis, Master of Philosophy

Abstract

The Mapper algorithm is a topological data analysis tool that can extract meaningful information from high-dimensional datasets. The algorithm simplifies complex data by using both partial clustering and dimensionality reduction to produce a low-dimensional similarity graph. This graph is a robust summary of the intrinsic structure of a dataset, capturing global and local relationships between samples. By transforming a high-dimensional dataset into a graph representation, the Mapper algorithm allows easy visualisation and exploration of big data. As such, Mapper is a promising tool for precision medicine, a field of research associated with extremely large and complicated biomedical datasets that require powerful techniques for data analysis. Mapper has been used to perform patient stratification from healthcare data, identifying subgroups of patients with similar molecular profiles, disease mechanisms, treatment responses, and prognoses. However, the algorithm has numerous parameters, and its application is constrained by the resulting large space of possible graphs, which the user must interrogate manually.
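The construction described above can be illustrated with a minimal sketch. It is a toy example under stated assumptions, not the implementation used in this thesis: the function names `mapper_graph` and `_single_linkage`, the one-dimensional lens, the evenly spaced overlapping intervals, and the distance threshold `eps` are all illustrative choices. The lens plays the role of dimensionality reduction, and clustering is performed only within each interval of the lens (the partial clustering step).

```python
import math
from itertools import combinations


def _single_linkage(points, members, eps):
    """Group `members` (point indices) into connected components of the
    graph that links two points when their distance is at most `eps`."""
    unvisited = set(members)
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, stack = {seed}, [seed]
        while stack:
            cur = stack.pop()
            near = {j for j in unvisited
                    if math.dist(points[cur], points[j]) <= eps}
            unvisited -= near
            cluster |= near
            stack.extend(near)
        clusters.append(frozenset(cluster))
    return clusters


def mapper_graph(points, lens, n_intervals=4, overlap=0.25, eps=1.5):
    """Toy Mapper: returns (nodes, edges).

    points      : list of coordinate tuples
    lens        : function mapping a point to one real value (assumed 1-D)
    n_intervals : number of intervals covering the lens range
    overlap     : fraction of an interval's length added on each side
    eps         : distance threshold for the within-interval clustering
    """
    values = [lens(p) for p in points]
    lo, hi = min(values), max(values)
    length = (hi - lo) / n_intervals or 1.0  # guard against a constant lens
    pad = overlap * length

    nodes = []  # each node is a frozenset of point indices
    for i in range(n_intervals):
        a = lo + i * length - pad
        b = lo + (i + 1) * length + pad
        members = [j for j, v in enumerate(values) if a <= v <= b]
        nodes.extend(_single_linkage(points, members, eps))

    # Two nodes are joined when their clusters share at least one sample.
    edges = {(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]}
    return nodes, edges


# Usage: eight points on a line, with the x-coordinate as the lens.
pts = [(float(x), 0.0) for x in range(8)]
nodes, edges = mapper_graph(pts, lens=lambda p: p[0], n_intervals=2)
```

Even in this small sketch the output depends on the lens, the number of intervals, the overlap, and the clustering threshold, which reflects the parameter-selection burden noted above.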

This thesis aims to streamline applications of the Mapper algorithm for disease subgroup discovery and classification in precision medicine. First, I address the challenge of manual analysis of the Mapper graph by introducing a novel method to search for unique subgroups of patients within a Mapper graph. In particular, I propose an algorithm that performs hotspot detection on a graph to locate anomalous and interconnected communities of nodes (i.e. hotspots) representing homogeneous subsets of patients. Simultaneously, I propose to address the problem of Mapper parameter selection by using the presence of a hotspot as a parameter selection criterion. I explore artificial datasets using this algorithm and demonstrate the effectiveness of the hotspot detection method for subgroup discovery in Mapper.

As the second contribution of this thesis, I apply the hotspot detection algorithm in a real-world setting and demonstrate that it can extract a subset of patients with atypical survival outcomes from a larger cohort of breast cancer patients. I investigate two publicly available gene expression datasets from oestrogen-receptor positive breast tumours, which are typically associated with a favourable prognosis. The hotspot detection algorithm uncovers a unique hotspot of patients with a shared gene signature and higher levels of tumour recurrence, who may benefit from modified interventions or treatment options in the future.

Finally, I address the challenge of classifying unseen patients as members of the hotspot group by proposing a novel Mapper-based classification algorithm. The results show that standard machine-learning predictive algorithms produce inconsistent results in this task across five cancer gene expression datasets. The second proposed algorithm of this thesis, called ‘Mapper k-Nearest Neighbour’, extends the k-Nearest Neighbours algorithm for hotspot membership prediction. Mapper k-Nearest Neighbour is shown to achieve reliably high prediction performance, accommodating different hotspot scenarios produced from variable parameter selections in Mapper.

Date of Award: Dec 2024
Original language: English
Awarding Institution
  • Queen's University Belfast
Sponsors: Northern Ireland Department for the Economy
Supervisor: Anna Jurek-Loughrey (Supervisor) & Nick Orr (Supervisor)

Keywords

  • topological data analysis
  • precision medicine
  • mapper algorithm
  • cancer
  • patient stratification
  • machine learning
  • artificial intelligence
  • bioinformatics
  • data science
