Leveraging protein language models to identify complex trait associations with previously inaccessible classes of functional rare variants

Last updated:: 7 June 2026

Author(s):: Seon-Kyeong Jang, Zitian Wang, Richard Border, Dinh Tuan, Angela Wei, Ulzee An, Sriram Sankararaman, Vasilis Ntranos, Jonathan Flint, Noah Zaitlen
Publish date:: 19 November 2025
Journal:: Cell Genomics
PubMed ID:: 41265447
DOI:: 10.1016/j.xgen.2025.101068

Abstract

Protein language models (PLMs) improve variant effect predictions, but their role in gene discovery for complex traits remains unclear. We introduce an allelic series-based regression test that uses PLM-derived variant effect predictions as proxies for effect sizes, identifying ∼46% more associations than standard burden tests. Extending this to isoform-level analysis, we find 26 gene-trait pairs with stronger associations in non-canonical versus canonical transcripts, highlighting isoform-specific effects. Finally, we identify evolutionary plausible variants (EPVs), missense variants assigned higher likelihoods than the wild-type alleles by PLMs, representing 0.45% of missense variants. EPVs show higher allele frequencies than synonymous variants, consistent with differential selection pressures, and are linked to nine traits, including protective associations with low-density lipoprotein (LDL) and bone mineral density. Together, our results demonstrate how PLMs can enhance rare-variant interpretation and gene-trait association discovery in exome data.
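
The two ideas in the abstract that lend themselves to a worked example are the EPV definition (a missense variant that the PLM scores as more likely than the wild-type allele) and the use of PLM scores as stand-ins for variant effect sizes. The sketch below illustrates both on toy data. It is not the authors' allelic series test: it swaps in a plain weighted burden score followed by a one-predictor least-squares fit. The log-likelihood values, the weighting rule `max(0, -LLR)`, and all function names are assumptions made for illustration only.

```python
from statistics import mean

def llr(logp_alt, logp_ref):
    """PLM log-likelihood ratio: alternate amino acid vs. wild type."""
    return logp_alt - logp_ref

def is_epv(logp_alt, logp_ref):
    """True when the PLM assigns the variant a higher likelihood than wild type."""
    return llr(logp_alt, logp_ref) > 0.0

def plm_weight(logp_alt, logp_ref):
    """Assumed weighting rule: more negative LLR means larger predicted effect."""
    return max(0.0, -llr(logp_alt, logp_ref))

def weighted_burden(genotypes, weights):
    """Per-individual sum of rare-allele counts, each scaled by its weight."""
    return [sum(g * w for g, w in zip(row, weights)) for row in genotypes]

def ols_slope(x, y):
    """Slope of a one-predictor least-squares fit of y on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

# Toy (log p(alt), log p(ref)) pairs for three missense variants in one gene.
variant_scores = [(-6.0, -1.0), (-3.0, -1.5), (-0.8, -1.2)]

epv_flags = [is_epv(a, r) for a, r in variant_scores]
weights = [plm_weight(a, r) for a, r in variant_scores]

# Rows are individuals, columns are variants (rare-allele counts).
genotypes = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
phenotype = [2.5, 0.75, 0.0, 0.0]  # made-up trait values

burden = weighted_burden(genotypes, weights)
slope = ols_slope(burden, phenotype)

print("EPV flags:", epv_flags)
print("Weighted burden:", burden)
print("Fitted slope:", slope)
```

In this toy setup only the third variant qualifies as an EPV, and it receives zero weight in the burden score. A real analysis would also need covariates, a proper test statistic, and allele-frequency filtering, none of which is shown here.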