Single-sequence protein structure prediction using language models and deep learning

ORAL

Abstract

AlphaFold2 and related computational systems predict protein structure using deep learning and co-evolutionary relationships encoded in multiple sequence alignments (MSAs). Despite the outstanding performance of AlphaFold2, predicting the structures of single sequences using deep learning remains a challenge. AlphaFold2's requirement for co-evolutionary information from MSAs degrades its performance on proteins that lack sequence homologs, currently estimated at ~20% of all metagenomic protein sequences and ~11% of eukaryotic and viral proteins. Moreover, protein design and studies quantifying the effects of sequence variation on function require single-sequence structure prediction. I will describe the development of an end-to-end differentiable recurrent geometric network that uses a protein language model (AminoBERT) to learn latent structural information from unaligned protein sequences, together with a geometric module that compactly represents Cα backbone geometry in terms of the Frenet-Serret formulas. Our model outperforms AlphaFold2 and RoseTTAFold on orphan proteins and classes of designed proteins while achieving a 10⁶-fold reduction in compute time. These findings demonstrate the practical and theoretical strengths of protein language models in structure prediction.
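
For intuition, the Frenet-Serret formulas describe how an orthonormal frame of tangent T, normal N, and binormal B evolves along a curve parameterized by arc length s, governed entirely by curvature κ and torsion τ:

    dT/ds = κN,    dN/ds = −κT + τB,    dB/ds = −τN

A discrete analogue of this construction suffices to trace a Cα backbone: with the virtual Cα-Cα distance approximately fixed (~3.8 Å for trans peptide bonds), two angles per residue determine the full trace. The sketch below illustrates the idea; the function name, the fixed bond length, and the frame-update details are illustrative assumptions and are not taken from the authors' implementation.

    import numpy as np

    CA_CA = 3.8  # approximate trans Cα-Cα virtual bond length in Å (assumption)

    def extend_chain(kappas, taus, bond_length=CA_CA):
        """Trace a Cα backbone from per-residue discrete curvature (kappa)
        and torsion (tau) angles using a running Frenet-like frame.
        Illustrative sketch only, not the authors' geometric module."""
        t = np.array([1.0, 0.0, 0.0])  # tangent
        n = np.array([0.0, 1.0, 0.0])  # normal
        b = np.cross(t, n)             # binormal
        points = [np.zeros(3)]
        for kappa, tau in zip(kappas, taus):
            # Torsion: spin the (n, b) pair about the tangent t.
            n, b = (np.cos(tau) * n + np.sin(tau) * b,
                    -np.sin(tau) * n + np.cos(tau) * b)
            # Curvature: bend the tangent within the t-n plane.
            t, n = (np.cos(kappa) * t + np.sin(kappa) * n,
                    -np.sin(kappa) * t + np.cos(kappa) * n)
            b = np.cross(t, n)  # keep the frame right-handed and orthonormal
            points.append(points[-1] + bond_length * t)
        return np.stack(points)

    # Example: a 10-point Cα trace from nine (kappa, tau) pairs.
    coords = extend_chain(np.full(9, 0.5), np.full(9, 0.8))
    print(coords.shape)  # (10, 3)

Because the chain is built by composing differentiable rotations and translations, Cα coordinates can be produced end-to-end from per-residue angle predictions, which is what makes this representation attractive for a fully differentiable network.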

Publication: https://doi.org/10.1101/2021.08.02.454840

Presenters

  • Nazim Bouatta

    Harvard Medical School

Authors

  • Nazim Bouatta

    Harvard Medical School