Pre-trained Molecular Language Models with Random Functional Group Masking
Authors:
Tianhao Peng,
Yuchen Li,
Xuhong Li,
Jiang Bian,
Zeke Xie,
Ning Sui,
Shahid Mumtaz,
Yanwu Xu,
Linghe Kong,
Haoyi Xiong
Abstract:
Recent advances in computational chemistry have leveraged transformer-based language models, such as MoLFormer, pre-trained on vast numbers of simplified molecular-input line-entry system (SMILES) sequences, to understand and predict molecular properties and activities, a critical step in fields like drug discovery and materials science. To further improve performance, researchers have introduced graph neural networks with graph-based molecular representations, such as GEM, incorporating the topology, geometry, and 2D or even 3D structures of molecules into pre-training. Since most molecular graphs in existing studies were automatically converted from SMILES sequences, it is reasonable to assume that transformer-based language models might be able to implicitly learn structure-aware representations from SMILES sequences. In this paper, we propose a SMILES-based Molecular Language Model that randomly masks SMILES subsequences corresponding to specific molecular Functional Groups to incorporate structure information of atoms during the pre-training phase. This technique aims to compel the model to better infer molecular structures and properties, thus enhancing its predictive capabilities. Extensive experimental evaluations across 11 benchmark classification and regression tasks in the chemical domain demonstrate the robustness and superiority of the proposed model. Our findings reveal that it outperforms existing pre-training models, whether based on SMILES or graphs, in 9 out of the 11 downstream tasks, ranking as a close second in the remaining ones.
Submitted 2 November, 2024;
originally announced November 2024.
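The functional-group masking idea described in the abstract above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the pattern table `FUNCTIONAL_GROUP_PATTERNS`, the `<mask>` token, the `mask_prob` parameter, and the function names are hypothetical, and the abstract does not specify the authors' actual procedure. Plain substring matching is also a deliberate simplification: it misses equivalent ways of writing the same group and can match a larger group (for example, `C(=O)O` also appears inside an ester), so a real pipeline would presumably rely on substructure matching from a cheminformatics toolkit.

```python
import random

# Hypothetical, illustrative table: SMILES substrings for a few common groups.
FUNCTIONAL_GROUP_PATTERNS = {
    "carboxylic_acid": "C(=O)O",
    "nitro": "[N+](=O)[O-]",
    "nitrile": "C#N",
}

MASK_TOKEN = "<mask>"  # assumed mask token, not taken from the paper


def find_group_spans(smiles):
    """Return sorted, non-overlapping (start, end, name) spans of known groups."""
    candidates = []
    for name, pattern in FUNCTIONAL_GROUP_PATTERNS.items():
        start = smiles.find(pattern)
        while start != -1:
            candidates.append((start, start + len(pattern), name))
            start = smiles.find(pattern, start + 1)
    # Leftmost first; on ties, prefer the longer match.
    candidates.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    spans, last_end = [], 0
    for start, end, name in candidates:
        if start >= last_end:
            spans.append((start, end, name))
            last_end = end
    return spans


def mask_functional_groups(smiles, mask_prob=0.5, rng=None):
    """Replace randomly chosen functional-group spans with MASK_TOKEN.

    Returns the masked string and the list of original substrings that
    a model would be trained to reconstruct.
    """
    if rng is None:
        rng = random.Random()
    spans = find_group_spans(smiles)
    chosen = [s for s in spans if rng.random() < mask_prob]
    if spans and not chosen:
        # Always mask at least one group when any group is present.
        chosen = [rng.choice(spans)]
    pieces, labels, cursor = [], [], 0
    for start, end, _ in chosen:
        pieces.append(smiles[cursor:start])
        pieces.append(MASK_TOKEN)
        labels.append(smiles[start:end])
        cursor = end
    pieces.append(smiles[cursor:])
    return "".join(pieces), labels


if __name__ == "__main__":
    # 4-cyanobenzoic acid, written so both groups match the table above.
    print(mask_functional_groups("c1cc(C#N)ccc1C(=O)O", rng=random.Random(0)))
```

Masking whole group spans rather than individual tokens means the model has to reconstruct a chemically meaningful unit from its surrounding context, which is the intuition the abstract gives for learning structure-aware representations from SMILES alone.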
Nonlinear Propagation in Multimode and Multicore Fibers: Generalization of the Manakov Equations
Authors:
Sami Mumtaz,
René-Jean Essiambre,
Govind P. Agrawal
Abstract:
This paper begins with an investigation of nonlinear transmission in space-division multiplexed (SDM) systems using multimode fibers exhibiting a rapidly varying birefringence. A primary objective is to generalize the Manakov equations, well known in the case of single-mode fibers. We first investigate a reference case where linear coupling among the spatial modes of the fiber is weak and, after averaging over birefringence fluctuations, we obtain new Manakov equations for multimode fibers. Such averaging drastically reduces the number of intermodal nonlinear terms since all four-wave-mixing terms average out. Cross-phase modulation terms still affect multimode transmission but their effectiveness is reduced. We then verify the accuracy of our new Manakov equations by transmitting multiple PDM-QPSK signals over different modes of a multimode fiber and comparing the numerical results with those obtained by solving the full stochastic equation. The agreement is excellent in all cases studied. A great benefit of the new equations is that they reduce the computation time by a factor of 10 or more. Another important feature observed is that birefringence fluctuations improve system performance by reducing the impact of fiber nonlinearities. Finally, multimode fibers with strong random coupling among all spatial modes are considered. Linear coupling is modeled using the random matrix theory approach. We derive new Manakov equations for multimode fibers in that regime and show that such fibers can perform better than single-mode fibers for a large number of propagating spatial modes.
Submitted 27 July, 2012;
originally announced July 2012.
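For reference, the single-mode Manakov equation that the abstract above sets out to generalize is commonly written in the following textbook form. This is not an equation taken from the paper: fiber loss is neglected, and the symbols (Jones vector A, group-velocity dispersion β₂, nonlinear parameter γ) and the sign convention are assumptions that vary between references.

```latex
% Single-mode Manakov equation after averaging over rapidly varying
% birefringence (textbook form; loss neglected, sign convention assumed).
\frac{\partial \mathbf{A}}{\partial z}
  + \frac{i\beta_2}{2}\,\frac{\partial^2 \mathbf{A}}{\partial t^2}
  = i\,\frac{8}{9}\,\gamma\,|\mathbf{A}|^2\,\mathbf{A},
\qquad
\mathbf{A} = \begin{pmatrix} A_x \\ A_y \end{pmatrix},
\quad
|\mathbf{A}|^2 = |A_x|^2 + |A_y|^2 .
```

The factor 8/9 is the result of averaging the nonlinear term over random polarization states; the abstract describes applying the same kind of averaging to the intermodal nonlinear terms of a multimode fiber.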