Expanding the Vocabulary of a Protein: Application of Subword Algorithms to Protein Sequence Modelling

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Deep learning has proven to be a useful tool for modelling protein properties. However, given the variability in the length of proteins, it can be difficult to summarise the sequence of amino acids effectively. In many cases, as a result of using fixed-length representations, information about long proteins can be lost through truncation, or model training can be slow due to the use of excessive padding. In this work, we aim to overcome these problems by expanding upon the original vocabulary used to represent the protein sequence. To this end, we utilise two prominent subword algorithms that have been previously used to reach state-of-the-art results in various Natural Language Processing tasks. The algorithms are used to encode the original protein sequence into a set of subsequences before they are analysed by a Doc2Vec model. The pre-trained encodings produced by each algorithm are tested on a variety of downstream tasks: four protein property prediction tasks (plasma membrane localization, thermostability, peak absorption wavelength, enantioselectivity) as well as drug-target affinity prediction tasks over two datasets. Our results significantly improve on the state-of-the-art for these tasks, demonstrating the benefits of using subword compression algorithms for modelling proteins.
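As a rough illustration of how a subword algorithm can expand an amino-acid vocabulary, the sketch below learns byte-pair-encoding (BPE) style merges on a few toy sequences. The abstract does not name the two algorithms used, so the choice of BPE here is an assumption, and the function names (`learn_bpe_merges`, `apply_merge`, `encode`) and the toy sequences are hypothetical and not taken from the paper.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` in `tokens` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` merge rules from amino-acid strings (illustrative only)."""
    # Start with every sequence split into single-residue tokens.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs across the whole corpus.
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken by tuple order for determinism.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges


def encode(sequence, merges):
    """Tokenise a sequence by applying the learned merges in the order they were learned."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Toy sequences, invented for this example (not real proteins from the paper).
toy_sequences = ["MKVLAAGMKVL", "MKVLGG", "AAMKVL"]
merges = learn_bpe_merges(toy_sequences, num_merges=3)
tokens = encode("MKVLAAGMKVL", merges)
print(merges)
print(tokens)
```

Each merge adds one multi-residue token to the vocabulary, so frequent motifs are represented by fewer tokens and the encoded sequence becomes shorter. The resulting token lists could then be passed to a document-embedding model such as Doc2Vec, as the abstract describes.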
Original language: English
Title of host publication: 42nd Annual International Conferences of the IEEE Engineering in Medicine and Biology Society in conjunction with the 43rd Annual Conference of the Canadian Medical and Biological Engineering Society: Proceedings
Number of pages: 7
Publication status: Accepted - 10 Apr 2020
Event: 42nd Annual International Conferences of the IEEE Engineering in Medicine and Biology Society in conjunction with the 43rd Annual Conference of the Canadian Medical and Biological Engineering Society - Montreal, Canada
Duration: 20 Jul 2020 to 24 Jul 2020
https://embc.embs.org/2020/

Conference

Conference: 42nd Annual International Conferences of the IEEE Engineering in Medicine and Biology Society in conjunction with the 43rd Annual Conference of the Canadian Medical and Biological Engineering Society
Abbreviated title: EMBC 2020
Country: Canada
City: Montreal
Period: 20/07/2020 to 24/07/2020


  • Cite this

    Lennox, M., Robertson, N., & Devereux, B. (Accepted/In press). Expanding the Vocabulary of a Protein: Application of Subword Algorithms to Protein Sequence Modelling. In 42nd Annual International Conferences of the IEEE Engineering in Medicine and Biology Society in conjunction with the 43rd Annual Conference of the Canadian Medical and Biological Engineering Society: Proceedings