Expanding the Vocabulary of a Protein: Application of Subword Algorithms to Protein Sequence Modelling

Mark Lennox, Neil Robertson, Barry Devereux

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)
147 Downloads (Pure)

Abstract

Deep learning has proven to be a useful tool for modelling protein properties. However, given the variability in the length of proteins, it can be difficult to summarise the sequence of amino acids effectively. In many cases, as a result of using fixed-length representations, information about long proteins can be lost through truncation, or model training can be slow due to the use of excessive padding. In this work, we aim to overcome these problems by expanding upon the original vocabulary used to represent the protein sequence. To this end, we utilise two prominent subword algorithms that have been previously used to reach state-of-the-art results in various Natural Language Processing tasks. The algorithms are used to encode the original protein sequence into a set of subsequences before they are analysed by a Doc2Vec model. The pre-trained encodings produced by each algorithm are tested on a variety of downstream tasks: four protein property prediction tasks (plasma membrane localization, thermostability, peak absorption wavelength, enantioselectivity) as well as drug-target affinity prediction tasks over two datasets. Our results significantly improve on the state-of-the-art for these tasks, demonstrating the benefits of using subword compression algorithms for modelling proteins.
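As a rough illustration of the encoding step described in the abstract, the sketch below learns a small subword vocabulary over amino-acid strings using byte-pair encoding (BPE), one widely used subword algorithm. The abstract does not name the two algorithms the authors used, so BPE is an assumption made for illustration only. The function names (`learn_bpe_merges`, `apply_merge`, `encode`) and the toy sequences are likewise hypothetical and are not taken from the paper.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` merge rules from a list of amino-acid strings.

    Toy BPE: start from single residues and repeatedly merge the most
    frequent adjacent token pair. Stops early once no pair occurs twice.
    """
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Break ties on the pair itself so the result is deterministic.
        best_pair, count = max(pair_counts.items(), key=lambda kv: (kv[1], kv[0]))
        if count < 2:
            break
        merges.append(best_pair)
        corpus = [apply_merge(tokens, best_pair) for tokens in corpus]
    return merges


def encode(sequence, merges):
    """Split a sequence into residues, then apply the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


if __name__ == "__main__":
    # Hypothetical toy sequences, not real proteins from the paper's datasets.
    training = ["MKTAYIAKQR", "MKTAYLLKQR", "MKTAGGAKQR"]
    merges = learn_bpe_merges(training, num_merges=5)
    print(encode("MKTAYIAKQR", merges))
```

Each encoded sequence is a list of variable-length subsequences that concatenates back to the original string, so it is shorter in tokens than the residue-level representation without discarding any residues. Token lists of this kind are the sort of input a document-embedding model such as Doc2Vec can consume, treating each protein as a document.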

Original language: English
Title of host publication: 42nd Annual International Conferences of the IEEE Engineering in Medicine and Biology Society in conjunction with the 43rd Annual Conference of the Canadian Medical and Biological Engineering Society: Proceedings
Pages: 2361-2367
Number of pages: 7
Volume: 2020
Publication status: Published - 27 Aug 2020

Publication series

Name: Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual International Conference
ISSN (Print): 2375-7477
