Sparse Autoencoders for Low-$N$ Protein Function Prediction and Design

Tsui, Darin; Talreja, Kunal; Aghazadeh, Amirali

arXiv:2508.18567 (cs)

[Submitted on 25 Aug 2025]

Abstract: Predicting protein function from amino acid sequence remains a central challenge in data-scarce (low-$N$) regimes, limiting machine learning-guided protein design when only small amounts of assay-labeled sequence-function data are available. Protein language models (pLMs) have advanced the field by providing evolutionarily informed embeddings, and sparse autoencoders (SAEs) have enabled decomposition of these embeddings into interpretable latent variables that capture structural and functional features. However, the effectiveness of SAEs for low-$N$ function prediction and protein design has not been systematically studied. Herein, we evaluate SAEs trained on fine-tuned ESM2 embeddings across diverse fitness extrapolation and protein engineering tasks. We show that SAEs, with as few as 24 sequences, consistently outperform or compete with their ESM2 baselines in fitness prediction, indicating that their sparse latent space encodes compact and biologically meaningful representations that generalize more effectively from limited data. Moreover, steering predictive latents exploits biological motifs in pLM representations, yielding top-fitness variants in 83% of cases compared to designing with ESM2 alone.
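
The pipeline the abstract describes (encode pLM embeddings into sparse latents, fit a predictor on a small labeled set, then steer a predictive latent) can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the authors' implementation: the SAE weights are random rather than trained, the embeddings and labels are synthetic stand-ins for ESM2 embeddings and assay data, and the top-k encoder, ridge regressor, dimensions, and all names (`sae_encode`, `sae_decode`, `fit_ridge`) are hypothetical choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: embedding width d, number of SAE latents m,
# and N labeled sequences (24 echoes the low-N setting in the abstract).
d, m, N, k = 16, 64, 24, 8

# Stand-in SAE weights. Random here; a real SAE would be trained on pLM embeddings.
W_enc = rng.normal(scale=0.1, size=(d, m))
b_enc = np.zeros(m)
W_dec = rng.normal(scale=0.1, size=(m, d))
b_dec = np.zeros(d)

def sae_encode(x, k=k):
    """ReLU encoder followed by top-k sparsification (one common SAE variant)."""
    pre = np.maximum(x @ W_enc + b_enc, 0.0)
    # Value of the k-th largest activation in each row.
    thresh = np.partition(pre, -k, axis=-1)[..., -k][..., None]
    # Keep activations at or above that value, zero the rest.
    return np.where(pre >= thresh, pre, 0.0)

def sae_decode(z):
    """Linear decoder mapping sparse latents back to embedding space."""
    return z @ W_dec + b_dec

def fit_ridge(Z, y, lam=1.0):
    """Closed-form ridge regression: w = (Z^T Z + lam * I)^{-1} Z^T y."""
    A = Z.T @ Z + lam * np.eye(Z.shape[1])
    return np.linalg.solve(A, Z.T @ y)

# Synthetic stand-ins for per-sequence embeddings and fitness labels.
X = rng.normal(size=(N, d))
y = rng.normal(size=N)

Z = sae_encode(X)   # sparse latent features, shape (N, m)
w = fit_ridge(Z, y) # one regression weight per latent

# Steering: raise the latent with the largest regression weight, then decode
# back to embedding space.
j = int(np.argmax(w))
z_steered = Z[0].copy()
z_steered[j] += 1.0
x_steered = sae_decode(z_steered)
```

Under this linear predictor, raising latent `j` by one unit changes the predicted fitness by exactly `w[j]`. Turning a steered embedding into an actual sequence requires a decoding step through the language model, which this sketch leaves out.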

Comments:	15 pages, 4 figures
Subjects:	Machine Learning (cs.LG); Quantitative Methods (q-bio.QM)
Cite as:	arXiv:2508.18567 [cs.LG]
	(or arXiv:2508.18567v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2508.18567

Submission history

From: Darin Tsui
[v1] Mon, 25 Aug 2025 23:56:39 UTC (5,794 KB)