Higher-Order Sequence Statistics in Protein Families: The Potts Model and MSA Transformer Showdown

POSTER

Abstract

Potts and Ising Hamiltonian models traditionally describe the physics of magnetic materials, but have recently found important applications in understanding the biophysics of protein sequences. Recent "generative" machine learning models for protein sequences, including Potts models and MSA-Transformer (MSA-T), build on shared statistical insights but differ in their approaches. While Potts models assume pairwise interactions between amino acids, MSA-T is claimed to capture effects induced by effective potentials beyond pairwise interactions, possibly leading to superior performance in reproducing higher-order sequence statistics. We compare these models on Kinase and RR Domain protein families and find that their relative performance depends on how phylogeny is treated. MSA-T performs well without phylogenetic corrections, but once phylogeny is accounted for, the Potts model outperforms MSA-T. Our findings suggest that MSA-T implicitly corrects for phylogeny in unweighted datasets, but that the physics-based Potts model better captures cooperative interactions of biophysical origin when phylogenetic relationships are considered.
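As a rough illustration of the two ingredients named in the abstract, the sketch below shows what a pairwise Potts energy and a simple phylogenetic reweighting of an alignment might look like. It is not the authors' implementation. The function names, the data layout for fields and couplings, the sign convention of the energy, and the 0.8 identity threshold are all assumptions chosen for illustration; the reweighting shown is the commonly used scheme of down-weighting each sequence by its number of near-identical neighbors.

```python
from itertools import combinations


def potts_energy(seq, fields, couplings):
    """Pairwise Potts energy of one aligned sequence.

    Assumed convention: E(S) = sum_i h_i(s_i) + sum_{i<j} J_ij(s_i, s_j).
    fields[i] maps a residue to h_i; couplings[(i, j)] maps a residue
    pair to J_ij. Both layouts are hypothetical.
    """
    # Single-site (field) terms.
    energy = sum(fields[i][a] for i, a in enumerate(seq))
    # Pairwise coupling terms: the model has no terms beyond pairs.
    for i, j in combinations(range(len(seq)), 2):
        energy += couplings[(i, j)][(seq[i], seq[j])]
    return energy


def phylogenetic_weights(msa, identity_threshold=0.8):
    """Weight each sequence by 1 / (number of sequences, itself included,
    whose fractional identity to it is at least identity_threshold).

    The threshold value is an assumption, not taken from the abstract.
    """
    weights = []
    for s in msa:
        neighbors = sum(
            1
            for t in msa
            if sum(a == b for a, b in zip(s, t)) / len(s) >= identity_threshold
        )
        weights.append(1.0 / neighbors)
    return weights
```

With weights of this kind, closely related sequences contribute less to the alignment statistics, which is one way the "phylogenetic corrections" mentioned above could be applied before comparing models.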

*This research was supported by National Institutes of Health grant number R35-GM132090, and by NIH Computer Equipment Grant (OD020095). Gratitude is also expressed to the OWLSNEST high performance cluster at Temple University for its computing support in this project.

Presenters

  • Kisan Khatri

    • Department of Physics, Temple University, Philadelphia, PA, USA

Authors

  • Kisan Khatri

    • Department of Physics, Temple University, Philadelphia, PA, USA
  • Ronald M Levy

    • Department of Physics, Department of Chemistry, Temple University, Philadelphia, PA, USA
  • Allan Haldane

    • Department of Physics, Temple University, Philadelphia, PA, USA