A hike through the Protein Evolutionary Landscape
ORAL
Abstract
Masked language models can predict a protein's residue preference at a specific position using the complete sequence context. We propose a scheme to calculate the masked residue profile for the entire sequence in a single forward pass using unmasked residue embeddings. This allows us to efficiently calculate the pseudo-perplexity of a sequence, a measure of the model's uncertainty in assigning residues. Naturally occurring proteins typically have a low pseudo-perplexity, and in designed proteins a low pseudo-perplexity can be used as a scalar proxy for function. We use pseudo-perplexity as a rough estimate of a protein's fitness, and explore protein morphospace by generating high-fitness paths between homologous proteins. We do this with a directed-evolution-based approach in which we iteratively select the low pseudo-perplexity mutants that are most proximal to the target sequence. Navigation towards the target relies on an alignment procedure powered by language model embeddings, and proposals for the mutations are drawn from the masked residue profile of the sequence. We cross-validate the reliability of the interpolated paths by folding the sequences with AlphaFold, and consistently achieve high pLDDT scores.
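As an illustration of the quantity described above, the following minimal sketch computes a pseudo-perplexity from a masked residue profile. It assumes the commonly used definition, the exponential of the negative mean log-probability that the profile assigns to the residue actually present at each position; the abstract does not state the exact formula, so this is an assumption. The function name `pseudo_perplexity`, the 20-letter alphabet ordering, and the layout of `profile` (one row of 20 probabilities per position) are hypothetical choices made for the example, and the profile itself would come from the language model, which is not shown here.

```python
import math

# Assumed alphabet ordering for the columns of the profile (hypothetical choice).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def pseudo_perplexity(sequence, profile):
    """Return exp(-mean_i log p_i), where p_i is the probability the
    profile assigns to the residue observed at position i.

    sequence: string over the 20 standard amino acids.
    profile:  list of rows, one per position, each holding 20 probabilities
              in AMINO_ACIDS order (assumed to be supplied by a model).
    """
    if len(sequence) == 0:
        raise ValueError("sequence must not be empty")
    if len(profile) != len(sequence):
        raise ValueError("profile must have one row per residue")

    total_log_prob = 0.0
    for residue, row in zip(sequence, profile):
        p = row[AA_INDEX[residue]]
        if p <= 0.0:
            # A residue the profile rules out gives unbounded perplexity.
            return math.inf
        total_log_prob += math.log(p)

    return math.exp(-total_log_prob / len(sequence))


# A maximally uncertain (uniform) profile gives a value close to 20,
# the size of the alphabet.
uniform_profile = [[1.0 / 20] * 20 for _ in "MKV"]
print(pseudo_perplexity("MKV", uniform_profile))
```

Under this definition the value ranges from 1, when the profile is certain of every observed residue, up to the alphabet size for a uniform profile, and beyond it when the profile favors residues other than the observed ones. Lower values correspond to sequences the model finds more natural.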
* NIH R35 GM138341 (BM, PK) and a Simons Investigator award (BM)
Presenters
- Pranav Kantroo, Yale University
Authors
- Pranav Kantroo, Yale University
- Benjamin B Machta, Yale University
- Gunter P Wagner, Yale University