Abstract
Deep learning has proven to be a useful tool for modelling protein properties. However, given the variability in the length of proteins, it can be difficult to summarise the sequence of amino acids effectively. In many cases, as a result of using fixed-length representations, information about long proteins can be lost through truncation, or model training can be slow due to the use of excessive padding. In this work, we aim to overcome these problems by expanding upon the original vocabulary used to represent the protein sequence. To this end, we utilise two prominent subword algorithms that have been previously used to reach state-of-the-art results in various Natural Language Processing tasks. The algorithms are used to encode the original protein sequence into a set of subsequences before they are analysed by a Doc2Vec model. The pre-trained encodings produced by each algorithm are tested on a variety of downstream tasks: four protein property prediction tasks (plasma membrane localization, thermostability, peak absorption wavelength, enantioselectivity) as well as drug-target affinity prediction tasks over two datasets. Our results significantly improve on the state-of-the-art for these tasks, demonstrating the benefits of using subword compression algorithms for modelling proteins.
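The encoding step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the two subword algorithms, so the byte-pair-encoding (BPE) style merge procedure below is an assumption chosen for illustration, not the authors' implementation. The function names (`learn_bpe_merges`, `apply_merge`, `encode`) and the toy sequences are hypothetical.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace each adjacent occurrence of `pair` in `tokens` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` merge rules from amino-acid sequences (toy BPE)."""
    # Base vocabulary: one token per amino-acid letter.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Merge the most frequent pair; ties go to the pair seen first.
        best_pair = max(pair_counts, key=pair_counts.get)
        merges.append(best_pair)
        corpus = [apply_merge(tokens, best_pair) for tokens in corpus]
    return merges


def encode(sequence, merges):
    """Encode one sequence by applying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


if __name__ == "__main__":
    # Hypothetical toy sequences, not taken from any dataset in the paper.
    toy = ["MKTAYIAKQR", "MKTLLVAKQR", "MKTAAAKQR"]
    rules = learn_bpe_merges(toy, num_merges=5)
    for seq in toy:
        print(seq, "->", encode(seq, rules))
```

Each merge adds one multi-residue token to the vocabulary, so an encoded protein is never longer than the original sequence and is usually shorter. The resulting token lists could then be passed, one list per protein, to a Doc2Vec implementation to obtain fixed-length embeddings.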
Original language | English |
---|---|
Title of host publication | 42nd Annual International Conferences of the IEEE Engineering in Medicine and Biology Society in conjunction with the 43rd Annual Conference of the Canadian Medical and Biological Engineering Society: Proceedings |
Pages | 2361-2367 |
Number of pages | 7 |
Volume | 2020 |
DOIs | |
Publication status | Published - 27 Aug 2020 |
Publication series
Name | Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference |
---|---|
ISSN (Print) | 2375-7477 |
Fingerprint
Dive into the research topics of 'Expanding the Vocabulary of a Protein: Application of Subword Algorithms to Protein Sequence Modelling'. Together they form a unique fingerprint.
Student theses
- Deep learning of proteomics data
  Lennox, M. (Author), Robertson, N. (Supervisor) & Devereux, B. (Supervisor), Dec 2021
  Student thesis: Doctoral Thesis › Doctor of Philosophy