MeMDLM: De Novo Membrane Protein Design with Masked Discrete Diffusion Protein Language Models
Authors:
Shrey Goel,
Vishrut Thoutam,
Edgar Mariano Marroquin,
Aaron Gokaslan,
Arash Firouzbakht,
Sophia Vincoff,
Volodymyr Kuleshov,
Huong T. Kratochvil,
Pranam Chatterjee
Abstract:
Masked Diffusion Language Models (MDLMs) have recently emerged as a strong class of generative models, paralleling state-of-the-art (SOTA) autoregressive (AR) performance across natural language modeling domains. While there have been advances in AR as well as both latent and discrete diffusion-based approaches for protein sequence design, masked diffusion language modeling with protein language m…
▽ More
Masked Diffusion Language Models (MDLMs) have recently emerged as a strong class of generative models, paralleling state-of-the-art (SOTA) autoregressive (AR) performance across natural language modeling domains. While there have been advances in AR as well as both latent and discrete diffusion-based approaches for protein sequence design, masked diffusion language modeling with protein language models (pLMs) is unexplored. In this work, we introduce MeMDLM, an MDLM tailored for membrane protein design, harnessing the SOTA pLM ESM-2 to de novo generate realistic membrane proteins for downstream experimental applications. Our evaluations demonstrate that MeMDLM-generated proteins exceed AR-based methods by generating sequences with greater transmembrane (TM) character. We further apply our design framework to scaffold soluble and TM motifs in sequences, demonstrating that MeMDLM-reconstructed sequences achieve greater biological similarity to their original counterparts compared to SOTA inpainting methods. Finally, we show that MeMDLM captures physicochemical membrane protein properties with similar fidelity as SOTA pLMs, paving the way for experimental applications. In total, our pipeline motivates future exploration of MDLM-based pLMs for protein design.
△ Less
Submitted 22 October, 2024;
originally announced October 2024.
Caduceus: Bi-Directional Equivariant Long-Range DNA Sequence Modeling
Authors:
Yair Schiff,
Chia-Hsiang Kao,
Aaron Gokaslan,
Tri Dao,
Albert Gu,
Volodymyr Kuleshov
Abstract:
Large-scale sequence modeling has sparked rapid advances that now extend into biology and genomics. However, modeling genomic sequences introduces challenges such as the need to model long-range token interactions, the effects of upstream and downstream regions of the genome, and the reverse complementarity (RC) of DNA. Here, we propose an architecture motivated by these challenges that builds off…
▽ More
Large-scale sequence modeling has sparked rapid advances that now extend into biology and genomics. However, modeling genomic sequences introduces challenges such as the need to model long-range token interactions, the effects of upstream and downstream regions of the genome, and the reverse complementarity (RC) of DNA. Here, we propose an architecture motivated by these challenges that builds off the long-range Mamba block, and extends it to a BiMamba component that supports bi-directionality, and to a MambaDNA block that additionally supports RC equivariance. We use MambaDNA as the basis of Caduceus, the first family of RC equivariant bi-directional long-range DNA language models, and we introduce pre-training and fine-tuning strategies that yield Caduceus DNA foundation models. Caduceus outperforms previous long-range models on downstream benchmarks; on a challenging long-range variant effect prediction task, Caduceus exceeds the performance of 10x larger models that do not leverage bi-directionality or equivariance.
△ Less
Submitted 5 June, 2024; v1 submitted 4 March, 2024;
originally announced March 2024.