Cross-dataset transcription-based quality control for tumour samples and pre-clinical models

  • Nathan Brown

Student thesis: Doctoral Thesis, Doctor of Philosophy

Abstract

Within cancer research, patient datasets comprising gene expression profiles and clinico-pathological data are publicly available from a range of repositories. Reusing and reanalysing these datasets for discovery and validation has benefits in terms of reducing replication and minimising redundancy. However, varying degrees of tumour purity and mislabelling of tumour tissue may lead to propagation of errors in downstream analysis. A quality control method estimating tumour of origin and cell-type composition has the potential to facilitate more robust and stringent biomarker discovery and development.
Using robust semi-supervised clustering, we identified 24 molecular subtypes in a gold-standard reference dataset (TCGA PanCancer Atlas). These molecular subtypes were characterised by gene expression, survival, mutations, gene set enrichment and cell type, and were used as a reference against which to compare, with respect to quality control, a CUP dataset, 32 primary tumour clinical datasets, the Cancer Cell Line Encyclopaedia (CCLE) 2D pre-clinical model dataset, 16 3D pre-clinical model datasets and 19 Patient-Derived Xenograft (PDX) pre-clinical model datasets. As a comparison tool we used a version of a previously published cross-dataset label transfer method, Gene Expression Compositional Assignment (GECA), that corrects for imbalance in the training set.
We validated the classifier by the percentage of samples in external query datasets assigned to each subtype, with assignment based on molecular similarity rather than on tissue of origin or histology alone. The analysis has demonstrated issues with commonly used cell lines in terms of their similarity, by gene expression, to the expected tumour subtypes. PDX models, by contrast, represented primary tumours more effectively than the 2D and 3D models.
The analysis has demonstrated a need to screen publicly available data to ensure more accurate representation and therefore more reliable results in the future.

Thesis is embargoed until 31 July 2026.
Date of Award: Jul 2024
Original language: English
Awarding Institution
  • Queen's University Belfast
Sponsors: Northern Ireland Department for the Economy
Supervisors: Richard Kennedy (Supervisor) & Jaine Blayney (Supervisor)

Keywords

  • Pre-clinical models
  • cancer
  • transcriptomics
  • R
  • Cross Dataset
  • classification
  • cluster analysis
