Search | arXiv e-print repository

OmniGenBench: A Modular Platform for Reproducible Genomic Foundation Models Benchmarking

Authors: Heng Yang, Jack Cole, Yuan Li, Renzhi Chen, Geyong Min, Ke Li

Abstract: The code of nature, embedded in DNA and RNA genomes since the origin of life, holds immense potential to impact both humans and ecosystems through genome modeling. Genomic Foundation Models (GFMs) have emerged as a transformative approach to decoding the genome. As GFMs scale up and reshape the landscape of AI-driven genomics, the field faces an urgent need for rigorous and reproducible evaluation. We present OmniGenBench, a modular benchmarking platform designed to unify the data, model, benchmarking, and interpretability layers across GFMs. OmniGenBench enables standardized, one-command evaluation of any GFM across five benchmark suites, with seamless integration of over 31 open-source models. Through automated pipelines and community-extensible features, the platform addresses critical reproducibility challenges, including data transparency, model interoperability, benchmark fragmentation, and black-box interpretability. OmniGenBench aims to serve as foundational infrastructure for reproducible genomic AI research, accelerating trustworthy discovery and collaborative innovation in the era of genome-scale modeling.

Submitted 20 May, 2025; originally announced May 2025.

arXiv:2410.01784 [pdf, other]

OmniGenBench: Automating Large-scale in-silico Benchmarking for Genomic Foundation Models

Authors: Heng Yang, Jack Cole, Ke Li

Abstract: The advancements in artificial intelligence in recent years, such as Large Language Models (LLMs), have fueled expectations for breakthroughs in genomic foundation models (GFMs). The code of nature, hidden in diverse genomes since the very beginning of life's evolution, holds immense potential for impacting humans and ecosystems through genome modeling. Recent breakthroughs in GFMs, such as Evo, have attracted significant investment and attention to genomic modeling, as they address long-standing challenges and transform in-silico genomic studies into automated, reliable, and efficient paradigms. In the context of this flourishing era of consecutive technological revolutions in genomics, GFM studies face two major challenges: the lack of GFM benchmarking tools and the absence of open-source software for diverse genomics. These challenges hinder the rapid evolution of GFMs and their wide application in tasks such as understanding and synthesizing genomes, problems that have persisted for decades. To address these challenges, we introduce GFMBench, a framework dedicated to GFM-oriented benchmarking. GFMBench standardizes benchmark suites and automates benchmarking for a wide range of open-source GFMs. It integrates millions of genomic sequences across hundreds of genomic tasks from four large-scale benchmarks, democratizing GFMs for a wide range of in-silico genomic applications. Additionally, GFMBench is released as open-source software, offering user-friendly interfaces and diverse tutorials, applicable for AutoBench and complex tasks like RNA design and structure prediction. To facilitate further advancements in genome modeling, we have launched a public leaderboard showcasing the benchmark performance derived from AutoBench. GFMBench represents a step toward standardizing GFM benchmarking and democratizing GFM applications.

Submitted 2 October, 2024; originally announced October 2024.

Comments: https://github.com/yangheng95/OmniGenomeBench

arXiv:2407.04424 [pdf]

Benchmarking structure-based three-dimensional molecular generative models using GenBench3D: ligand conformation quality matters

Authors: Benoit Baillif, Jason Cole, Patrick McCabe, Andreas Bender

Abstract: Three-dimensional (3D) deep molecular generative models offer the advantage of goal-directed generation based on 3D-dependent properties, such as binding affinity for structure-based design within binding pockets. Traditional benchmarks created to evaluate SMILES or molecular graph generators, such as GuacaMol or MOSES, are of limited use for evaluating 3D generators as they do not assess the quality of the generated molecular conformation. In this work, we hence developed GenBench3D, which implements a new benchmark for models producing molecules within a binding pocket. Our main contribution is the Validity3D metric, evaluating the conformation quality using the likelihood of bond lengths and valence angles based on reference values observed in the Cambridge Structural Database. The LiGAN, 3D-SBDD, Pocket2Mol, TargetDiff, DiffSBDD and ResGen models were benchmarked. We show that only between 0% and 11% of generated molecules have valid conformations. Performing local relaxation of generated molecules in the pocket considerably improved the Validity3D for all models by a minimum increase of 40%. For LiGAN, 3D-SBDD, or TargetDiff, the set of valid relaxed molecules shows on average higher Vina score (i.e. worse) than the set of raw generated molecules, indicating that the binding affinity of raw generated molecules might be overestimated. The other scoring functions, which give higher importance to ligand strain, only yield improved scores when using valid relaxed molecules. Using valid relaxed molecules, TargetDiff and Pocket2Mol show better median Vina, Glide and Gold PLP scores than other models. We have publicly released GenBench3D on GitHub for broader use: https://github.com/bbaillif/genbench3d

Submitted 5 July, 2024; originally announced July 2024.

arXiv:2307.07072 [pdf, other]

Rician likelihood loss for quantitative MRI using self-supervised deep learning

Authors: Christopher S. Parker, Anna Schroder, Sean C. Epstein, James Cole, Daniel C. Alexander, Hui Zhang

Abstract: Purpose: Previous quantitative MR imaging studies using self-supervised deep learning have reported biased parameter estimates at low SNR. Such systematic errors arise from the choice of Mean Squared Error (MSE) loss function for network training, which is incompatible with Rician-distributed MR magnitude signals. To address this issue, we introduce the negative log Rician likelihood (NLR) loss. Methods: A numerically stable and accurate implementation of the NLR loss was developed to estimate quantitative parameters of the apparent diffusion coefficient (ADC) model and intra-voxel incoherent motion (IVIM) model. Parameter estimation accuracy, precision and overall error were evaluated in terms of bias, variance and root mean squared error and compared against the MSE loss over a range of SNRs (5 - 30). Results: Networks trained with NLR loss show higher estimation accuracy than MSE for the ADC and IVIM diffusion coefficients as SNR decreases, with minimal loss of precision or total error. At high effective SNR (high SNR and small diffusion coefficients), both losses show comparable accuracy and precision for all parameters of both models. Conclusion: The proposed NLR loss is numerically stable and accurate across the full range of tested SNRs and improves parameter estimation accuracy of diffusion coefficients using self-supervised deep learning. We expect the development to benefit quantitative MR imaging techniques broadly, enabling more accurate parameter estimation from noisy data.

Submitted 13 July, 2023; originally announced July 2023.

Comments: 16 pages, 6 figures

arXiv:2301.04424 [pdf, other]

Riemannian Geometry and Molecular Similarity II: Kähler Quantization

Authors: Daniel J. Cole, Stuart J. Hall, Thomas Murphy, Rachael Pirie

Abstract: Shape-similarity between molecules is a tool used by chemists for virtual screening, with the goal of reducing the cost and duration of drug discovery campaigns. This paper reports an entirely novel shape descriptor as an alternative to the previously described RGMolSA descriptors \cite{cole2022riemannian}, derived from the theory of Riemannian geometry and Kähler quantization (KQMolSA). The treatment of a molecule as a series of intersecting spheres allows us to obtain the explicit Riemannian metric which captures the geometry of the surface, which can in turn be used to calculate a Hermitian matrix $\mathbb{M}$ as a directly comparable surface representation. The potential utility of this method is demonstrated using a series of PDE5 inhibitors considered to have similar shape. The method shows promise in its capability to handle different conformers, and compares well to existing shape similarity methods. The code and data used to produce the results are available at: https://github.com/RPirie96/KQMolSA.

Submitted 11 January, 2023; originally announced January 2023.

Comments: 18 pages

MSC Class: 53Z15

arXiv:2201.04230 [pdf, other]

Riemannian Geometry and Molecular Surfaces I: Spectrum of the Laplacian

Authors: Daniel J. Cole, Stuart J. Hall, Rachael Pirie

Abstract: Ligand-based virtual screening aims to reduce the cost and duration of drug discovery campaigns. Shape similarity can be used to screen large databases, with the goal of predicting potential new hits by comparing to molecules with known favourable properties. This paper presents the theory underpinning RGMolSA, a new alignment-free and mesh-free surface-based molecular shape descriptor derived from the mathematical theory of Riemannian geometry. The treatment of a molecule as a series of intersecting spheres allows the description of its surface geometry using the Riemannian metric, obtained by considering the spectrum of the Laplacian. This gives a simple vector descriptor constructed of the weighted surface area and eight non-zero eigenvalues, which capture the surface shape. We demonstrate the potential of our method by considering a series of PDE5 inhibitors that are known to have similar shape as an initial test case. RGMolSA displays promise when compared to existing shape descriptors and in its capability to handle different molecular conformers. The code and data used to produce the results are available via GitHub: https://github.com/RPirie96/RGMolSA.

Submitted 10 January, 2022; originally announced January 2022.

Comments: 21 pages, 10 figures

arXiv:2107.07977 [pdf, other]

An Uncertainty-Aware, Shareable and Transparent Neural Network Architecture for Brain-Age Modeling

Authors: Tim Hahn, Jan Ernsting, Nils R. Winter, Vincent Holstein, Ramona Leenings, Marie Beisemann, Lukas Fisch, Kelvin Sarink, Daniel Emden, Nils Opel, Ronny Redlich, Jonathan Repple, Dominik Grotegerd, Susanne Meinert, Jochen G. Hirsch, Thoralf Niendorf, Beate Endemann, Fabian Bamberg, Thomas Kröncke, Robin Bülow, Henry Völzke, Oyunbileg von Stackelberg, Ramona Felizitas Sowade, Lale Umutlu, Börge Schmidt, et al. (9 additional authors not shown)

Abstract: The deviation between chronological age and age predicted from neuroimaging data has been identified as a sensitive risk-marker of cross-disorder brain changes, growing into a cornerstone of biological age-research. However, Machine Learning models underlying the field do not consider uncertainty, thereby confounding results with training data density and variability. Also, existing models are commonly based on homogeneous training sets, often not independently validated, and cannot be shared due to data protection issues. Here, we introduce an uncertainty-aware, shareable, and transparent Monte-Carlo Dropout Composite-Quantile-Regression (MCCQR) Neural Network trained on N=10,691 datasets from the German National Cohort. The MCCQR model provides robust, distribution-free uncertainty quantification in high-dimensional neuroimaging data, achieving lower error rates compared to existing models across ten recruitment centers and in three independent validation samples (N=4,004). In two examples, we demonstrate that it prevents spurious associations and increases power to detect accelerated brain-aging. We make the pre-trained model publicly available.

Submitted 16 July, 2021; originally announced July 2021.

arXiv:2007.08374 [pdf]

doi 10.1016/j.mayocp.2020.01.027

Brain volume: An important determinant of functional outcome after acute ischemic stroke

Authors: Markus D. Schirmer, Kathleen L. Donahue, Marco J. Nardin, Adrian V. Dalca, Anne-Katrin Giese, Mark R. Etherton, Steven J. T. Mocking, Elissa C. McIntosh, John W. Cole, Lukas Holmegaard, Katarina Jood, Jordi Jimenez-Conde, Steven J. Kittner, Robin Lemmens, James F. Meschia, Jonathan Rosand, Jaume Roquer, Tatjana Rundek, Ralph L. Sacco MD, Reinhold Schmidt, Pankaj Sharma, Agnieszka Slowik, Tara M. Stanne, Achala Vagal, Johan Wasselius, et al. (16 additional authors not shown)

Abstract: Objective: To determine whether brain volume is associated with functional outcome after acute ischemic stroke (AIS). Methods: We analyzed cross-sectional data of the multi-site, international hospital-based MRI-GENetics Interface Exploration (MRI-GENIE) study (July 1, 2014 - March 16, 2019) with clinical brain magnetic resonance imaging (MRI) obtained on admission for index stroke and functional outcome assessment. Post-stroke outcome was determined using the modified Rankin Scale (mRS) score (0-6; 0: asymptomatic; 6: death) recorded between 60 and 190 days after stroke. Demographics and other clinical variables including acute stroke severity (measured as National Institutes of Health Stroke Scale score), vascular risk factors, and etiologic stroke subtypes (Causative Classification of Stroke) were recorded during index admission. Results: Utilizing the data from 912 acute ischemic stroke (AIS) patients (65+/-15 years of age, 58% male, 57% history of smoking, and 65% hypertensive) in a generalized linear model, brain volume (per 155.1 cm^3) was associated with age (beta -0.3 (per 14.4 years)), male sex (beta 1.0) and prior stroke (beta -0.2). In the multivariable outcome model, brain volume was an independent predictor of mRS (beta -0.233), with reduced odds of worse long-term functional outcomes (OR: 0.8, 95% CI 0.7-0.9) in those with larger brain volumes. Conclusions: Larger brain volume quantified on clinical MRI of AIS patients at time of stroke purports a protective mechanism. The role of brain volume as a prognostic, protective biomarker has the potential to forge new areas of research and advance current knowledge of mechanisms of post-stroke recovery.

Submitted 16 July, 2020; originally announced July 2020.

Journal ref: Mayo Clinic Proceedings. Vol. 95. No. 5. Elsevier, 2020

arXiv:2002.03419 [pdf, other]

The Alzheimer's Disease Prediction Of Longitudinal Evolution (TADPOLE) Challenge: Results after 1 Year Follow-up

Authors: Razvan V. Marinescu, Neil P. Oxtoby, Alexandra L. Young, Esther E. Bron, Arthur W. Toga, Michael W. Weiner, Frederik Barkhof, Nick C. Fox, Arman Eshaghi, Tina Toni, Marcin Salaterski, Veronika Lunina, Manon Ansart, Stanley Durrleman, Pascal Lu, Samuel Iddi, Dan Li, Wesley K. Thompson, Michael C. Donohue, Aviv Nahon, Yarden Levy, Dan Halbersberg, Mariya Cohen, Huiling Liao, Tengfei Li, et al. (71 additional authors not shown)

Abstract: We present the findings of "The Alzheimer's Disease Prediction Of Longitudinal Evolution" (TADPOLE) Challenge, which compared the performance of 92 algorithms from 33 international teams at predicting the future trajectory of 219 individuals at risk of Alzheimer's disease. Challenge participants were required to make a prediction, for each month of a 5-year future time period, of three key outcomes: clinical diagnosis, Alzheimer's Disease Assessment Scale Cognitive Subdomain (ADAS-Cog13), and total volume of the ventricles. The methods used by challenge participants included multivariate linear regression, machine learning methods such as support vector machines and deep neural networks, as well as disease progression models. No single submission was best at predicting all three outcomes. For clinical diagnosis and ventricle volume prediction, the best algorithms strongly outperform simple baselines in predictive ability. However, for ADAS-Cog13 no single submitted prediction method was significantly better than random guesswork. Two ensemble methods based on taking the mean and median over all predictions, obtained top scores on almost all tasks. Better than average performance at diagnosis prediction was generally associated with the additional inclusion of features from cerebrospinal fluid (CSF) samples and diffusion tensor imaging (DTI). On the other hand, better performance at ventricle volume prediction was associated with inclusion of summary statistics, such as the slope or maxima/minima of biomarkers. TADPOLE's unique results suggest that current prediction algorithms provide sufficient accuracy to exploit biomarkers related to clinical diagnosis and ventricle volume, for cohort refinement in clinical trials for Alzheimer's disease. However, results call into question the usage of cognitive test scores for patient selection and as a primary endpoint in clinical trials.

Submitted 27 December, 2021; v1 submitted 9 February, 2020; originally announced February 2020.

Comments: Presents final results of the TADPOLE competition. 60 pages, 7 tables, 14 figures

Journal ref: Machine Learning for Biomedical Imaging (MELBA), Dec 2021

arXiv:1910.03349 [pdf, other]

Analysis of an Automated Machine Learning Approach in Brain Predictive Modelling: A data-driven approach to Predict Brain Age from Cortical Anatomical Measures

Authors: Jessica Dafflon, Walter H. L Pinaya, Federico Turkheimer, James H. Cole, Robert Leech, Mathew A. Harris, Simon R. Cox, Heather C. Whalley, Andrew M. McIntosh, Peter J. Hellyer

Abstract: The use of machine learning (ML) algorithms has significantly increased in neuroscience. However, from the vast extent of possible ML algorithms, which one is the optimal model to predict the target variable? What are the hyperparameters for such a model? Given the plethora of possible answers to these questions, in recent years, automated machine learning (autoML) has been gaining attention. Here, we apply an autoML library called TPOT which uses a tree-based representation of machine learning pipelines and conducts a genetic-programming based approach to find the model and hyperparameters that most closely predict the subject's true age. To explore autoML and evaluate its efficacy within neuroimaging datasets, we chose a problem that has been the focus of previous extensive study: brain age prediction. Without any prior knowledge, TPOT was able to scan through the model space and create pipelines that outperformed the state-of-the-art accuracy for Freesurfer-based models using only thickness and volume information for anatomical structure. In particular, we compared the performance of TPOT (mean absolute error (MAE): $4.612 \pm .124$ years) and a Relevance Vector Regression (MAE $5.474 \pm .140$ years). TPOT also suggested interesting combinations of models that do not match the current most used models for brain prediction but generalise well to unseen data. AutoML showed promising results as a data-driven approach to find optimal models for neuroimaging applications.

Submitted 8 October, 2019; originally announced October 2019.

arXiv:1612.02572 [pdf]

Predicting brain age with deep learning from raw imaging data results in a reliable and heritable biomarker

Authors: James H Cole, Rudra PK Poudel, Dimosthenis Tsagkrasoulis, Matthan WA Caan, Claire Steves, Tim D Spector, Giovanni Montana

Abstract: Machine learning analysis of neuroimaging data can accurately predict chronological age in healthy people and deviations from healthy brain ageing have been associated with cognitive impairment and disease. Here we sought to further establish the credentials of "brain-predicted age" as a biomarker of individual differences in the brain ageing process, using a predictive modelling approach based on deep learning, and specifically convolutional neural networks (CNN), and applied to both pre-processed and raw T1-weighted MRI data. Firstly, we aimed to demonstrate the accuracy of CNN brain-predicted age using a large dataset of healthy adults (N = 2001). Next, we sought to establish the heritability of brain-predicted age using a sample of monozygotic and dizygotic female twins (N = 62). Thirdly, we examined the test-retest and multi-centre reliability of brain-predicted age using two samples (within-scanner N = 20; between-scanner N = 11). CNN brain-predicted ages were generated and compared to a Gaussian Process Regression (GPR) approach, on all datasets. Input data were grey matter (GM) or white matter (WM) volumetric maps generated by Statistical Parametric Mapping (SPM) or raw data. Brain-predicted age represents an accurate, highly reliable and genetically-valid phenotype, that has potential to be used as a biomarker of brain ageing. Moreover, age predictions can be accurately generated on raw T1-MRI data, substantially reducing computation time for novel data, bringing the process closer to giving real-time information on brain health in clinical settings.

Submitted 8 December, 2016; originally announced December 2016.

arXiv:1407.2181 [pdf, ps, other]

doi 10.1007/s11120-014-0027-3

Constrained geometric dynamics of the Fenna-Matthews-Olson complex: The role of correlated motion in reducing uncertainty in excitation energy transfer

Authors: Alexander S. Fokas, Daniel J. Cole, Alex W. Chin

Abstract: The Fenna-Matthews-Olson (FMO) complex of green sulphur bacteria is an example of a photosynthetic pigment-protein complex, in which the electronic properties of the pigments are modified by the protein environment to promote efficient excitonic energy transfer from antenna complexes to the reaction centres. Many of the electronic properties of the FMO complex can be extracted from knowledge of the static crystal structure. However, the recent observation and analysis of long-lasting quantum dynamics in the FMO complex point to protein dynamics as a key factor in protecting and generating quantum coherence under laboratory conditions. While fast inter- and intramolecular vibrations have been investigated extensively, the slow dynamics which effectively determine the optical inhomogeneous broadening of experimental ensembles have received less attention. Our study employs constrained geometric dynamics to study the flexibility in the protein network by efficiently generating the accessible conformational states from the published crystal structure. Statistical and principal component analysis reveal highly correlated low frequency motions between functionally relevant elements, including strong correlations between pigments that are excitonically coupled. Our analysis reveals a hierarchy of structural interactions which enforce these correlated motions, from the level of monomer-monomer interfaces right down to the alpha helices, beta sheets and pigments. In addition to inducing strong spatial correlations across the conformational ensemble, we find that the overall rigidity of the FMO complex is exceptionally high. We suggest that these observations support the idea of highly correlated inhomogeneous disorder of the electronic excited states, which is further supported by the remarkably low variance of the excitonic couplings of the conformational ensemble.

Submitted 8 July, 2014; originally announced July 2014.

Journal ref: Fokas, A.S. and Cole, D.J. and Chin A.W. Photosynth. Res., 2014, Online

arXiv:1305.5532 [pdf, ps, other]

doi 10.1021/jz3004188

Ligand Discrimination in Myoglobin from Linear-Scaling DFT+U

Authors: Daniel J. Cole, David D. O'Regan, Mike C. Payne

Abstract: Myoglobin modulates the binding of diatomic molecules to its heme group via hydrogen-bonding and steric interactions with neighboring residues, and is an important benchmark for computational studies of biomolecules. We have performed calculations on the heme binding site and a significant proportion of the protein environment (more than 1000 atoms) using linear-scaling density functional theory and the DFT+U method to correct for self-interaction errors associated with localized 3d states. We confirm both the hydrogen-bonding nature of the discrimination effect (3.6 kcal/mol) and assumptions that the relative strain energy stored in the protein is low (less than 1 kcal/mol). Our calculations significantly widen the scope for tackling problems in drug design and enzymology, especially in cases where electron localization, allostery or long-ranged polarization influence ligand binding and reaction.

Submitted 23 May, 2013; originally announced May 2013.

Comments: 15 pages, 3 figures. Supplementary material 8 pages, 3 figures. This version matches that accepted for J. Phys. Chem. Lett. on 10th May 2012

Journal ref: J. Phys. Chem. Lett., 2012, 3 (11), 1448-1452

arXiv:1302.4696 [pdf, other]

doi 10.1088/0953-8984/25/15/152101

Electrostatic considerations affecting the calculated HOMO-LUMO gap in protein molecules

Authors: Greg Lever, Daniel J Cole, Nicholas D M Hine, Peter D Haynes, Mike C Payne

Abstract: A detailed study of energy differences between the highest occupied and lowest unoccupied molecular orbitals (HOMO-LUMO gaps) in protein systems and water clusters is presented. Recent work questioning the applicability of Kohn-Sham density-functional theory to proteins and large water clusters (E. Rudberg, J. Phys.: Condens. Mat. 2012, 24, 072202) has demonstrated vanishing HOMO-LUMO gaps for these systems, which is generally attributed to the treatment of exchange in the functional used. The present work shows that the vanishing gap is, in fact, an electrostatic artefact of the method used to prepare the system. Practical solutions for ensuring the gap is maintained when the system size is increased are demonstrated. This work has important implications for the use of large-scale density-functional theory in biomolecular systems, particularly in the simulation of photoemission, optical absorption and electronic transport, all of which depend critically on differences between energies of molecular orbitals.

Submitted 19 February, 2013; originally announced February 2013.

Comments: 13 pages, 4 figures

Showing 1–14 of 14 results for author: Cole, J