Deep learning of proteomics data

Student thesis: Doctoral Thesis (Doctor of Philosophy)

Abstract

Next-generation sequencing technology has propelled the field of biology into the big data era, and continual advancements in computing have made it easier to explore complex biological systems. However, analysing such highly complex data with conventional machine learning algorithms can be troublesome, as these techniques require a considerable amount of feature engineering. Fortunately, a subfield of machine learning known as deep learning has recently shown promise in overcoming these issues. Such algorithms were initially applied in genomic and transcriptomic settings. However, advancements in sequencing technology have allowed proteomics to mature to the level where deep learning is now a viable option. This thesis primarily considers the application of deep learning to modelling the various properties of a protein.

Although deep learning addresses some of the initial problems encountered when analysing omic data, a series of challenges remains when applying deep learning algorithms. Even with the latest approaches, a deep learning model often requires a large amount of labelled data, which can be costly and time-consuming to acquire. Without an adequate amount of data, standard deep learning approaches often underperform traditional machine learning algorithms. These models are also black-box algorithms, which makes the predictions they produce difficult to interpret.

Given the variability within proteins, it can be difficult to summarise the data effectively, as information about the protein can be lost through feature engineering. In each research chapter of this thesis, we use deep learning to address the shortcomings of applying conventional machine learning to protein data. In the first technical chapter, we begin with the use of state-of-the-art subword encoding schemes. We show that these new representations are more beneficial and practical for pre-training than the standard baselines. In the next chapter, we go a step further and address the issue of applying deep learning models to smaller datasets. In doing so, we explore how metric learning can be used to form a robust model architecture that is capable of learning and ranking proteins from just a few labelled examples. After this, we consider an approach that combines pre-training and metric learning to reach a new state-of-the-art by using large unsupervised networks. In this chapter, we leverage a BERT model that has been pre-trained on a vast quantity of proteomic data to model a collection of regression tasks using only a minimal amount of data. We adopt a triplet network structure to fine-tune the BERT model for each dataset and evaluate its performance on a set of downstream tasks. These first three strategies were tested on four protein property prediction tasks: plasma membrane localisation, thermostability, peak absorption wavelength and enantioselectivity.
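
The triplet objective behind such a network can be illustrated with a minimal sketch. It assumes a Euclidean distance and a fixed margin, neither of which is stated in this abstract. The function names are hypothetical, and the fine-tuned BERT encoder is stood in for by plain lists of floats.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss (assumed form, not taken from the thesis).

    The loss is zero once the anchor is closer to the positive than to the
    negative by at least `margin`; otherwise it grows with the violation.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-d embeddings standing in for encoder outputs.
well_separated = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
violating = triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0])
```

In the first call the negative already lies far enough from the anchor, so the loss vanishes. In the second the roles are swapped and the loss is positive, which is the signal a triplet network would use to pull similar proteins together and push dissimilar ones apart.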

Additionally, this thesis includes a further two chapters that consider other challenges encountered when modelling protein data. The first applies pre-training to improve upon the state-of-the-art in phosphorylation site modelling using a brand-new convolutional transformer-based model. We evaluate our approach on a general phosphorylation site dataset and a variety of kinase-specific datasets. To emphasise that this is an example of white-box deep learning, we visualise the model's features to gain a better understanding of the prediction for each site.

The final research chapter considers a state-of-the-art approach to modelling the interactions between proteins and drugs. In this chapter, we leverage a set of BERT-style models that have been pre-trained on vast quantities of both protein and drug data. The encodings produced by each model are then utilised as node representations for a graph convolutional neural network, which in turn models the interactions without the need to simultaneously fine-tune both protein and drug BERT models to the task. We evaluate the performance of our approach on two drug-target interaction datasets that were previously used as benchmarks in recent work.
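
As a rough illustration of how fixed encodings can serve as node features, the sketch below implements a single graph convolution step in the widely used form ReLU(Â H W), where Â is the symmetrically normalised adjacency matrix with self-loops. This is an assumption about the general technique, not the thesis's architecture: the graph layout, the layer form and all names here are hypothetical, and the pre-computed BERT encodings are replaced by small lists of floats.

```python
import math


def normalised_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square adjacency matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)]
            for row in x]


def gcn_layer(adj, features, weights):
    """One graph convolution: propagate, transform, then apply ReLU."""
    propagated = matmul(normalised_adjacency(adj), features)
    transformed = matmul(propagated, weights)
    return [[max(0.0, v) for v in row] for row in transformed]


# Hypothetical two-node graph: node 0 is a protein, node 1 is a drug,
# joined by one edge. The features stand in for frozen encoder outputs.
adj = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]  # identity, so only propagation acts
out = gcn_layer(adj, features, weights)
```

Because the node features are computed once and held fixed, only the comparatively small layer weights would need training, which is one way to read the abstract's remark about avoiding simultaneous fine-tuning of both encoders.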
Date of Award: Dec 2021
Original language: English
Awarding Institution
  • Queen's University Belfast
Sponsors: Northern Ireland Department for the Economy
Supervisors: Neil Robertson & Barry Devereux

Keywords

  • Deep learning
  • proteomics
