
A Comprehensive Benchmark for RNA 3D Structure-Function Modeling

Authors: Luis Wyss, Vincent Mallet, Wissam Karroucha, Karsten Borgwardt, Carlos Oliver

Abstract: The relationship between RNA structure and function has recently attracted interest within the deep learning community, a trend expected to intensify as nucleic acid structure models advance. Despite this momentum, a lack of standardized, accessible benchmarks for applying deep learning to RNA 3D structures hinders progress. To this end, we introduce a collection of seven benchmarking datasets specifically designed to support RNA structure-function prediction. Built on top of the established Python package rnaglib, our library streamlines data distribution and encoding, provides tools for dataset splitting and evaluation, and offers a comprehensive, user-friendly environment for model comparison. The modular and reproducible design of our datasets encourages community contributions and enables rapid customization. To demonstrate the utility of our benchmarks, we report baseline results for all tasks using a relational graph neural network.

Submitted 20 May, 2025; v1 submitted 27 March, 2025; originally announced March 2025.

arXiv:2402.09330

3D-based RNA function prediction tools in rnaglib

Authors: Carlos Oliver, Vincent Mallet, Jérôme Waldispühl

Abstract: Understanding the connection between complex structural features of RNA and biological function is a fundamental challenge in evolutionary studies and in RNA design. However, building datasets of RNA 3D structures and making appropriate modeling choices remains time-consuming and lacks standardization. In this chapter, we describe the use of rnaglib to train supervised and unsupervised machine learning-based function prediction models on datasets of RNA 3D structures.

Submitted 3 May, 2024; v1 submitted 14 February, 2024; originally announced February 2024.

arXiv:2401.14819

Endowing Protein Language Models with Structural Knowledge

Authors: Dexiong Chen, Philip Hartout, Paolo Pellizzoni, Carlos Oliver, Karsten Borgwardt

Abstract: Understanding the relationships between protein sequence, structure and function is a long-standing biological challenge with manifold implications from drug design to our understanding of evolution. Recently, protein language models have emerged as the preferred method for this challenge, thanks to their ability to harness large sequence databases. Yet, their reliance on expansive sequence data and parameter sets limits their flexibility and practicality in real-world scenarios. Concurrently, the recent surge in computationally predicted protein structures unlocks new opportunities in protein representation learning. While promising, the computational burden carried by such complex data still hinders widely-adopted practical applications. To address these limitations, we introduce a novel framework that enhances protein language models by integrating protein structural data. Drawing from recent advances in graph transformers, our approach refines the self-attention mechanisms of pretrained language transformers by integrating structural information with structure extractor modules. This refined model, termed Protein Structure Transformer (PST), is further pretrained on a small protein structure database, using the same masked language modeling objective as traditional protein language models. Empirical evaluations of PST demonstrate its superior parameter efficiency relative to protein language models, despite being pretrained on a dataset comprising only 542K structures. Notably, PST consistently outperforms the state-of-the-art foundation model for protein sequences, ESM-2, setting a new benchmark in protein function prediction. Our findings underscore the potential of integrating structural information into protein language models, paving the way for more effective and efficient protein modeling. Code and pretrained models are available at https://github.com/BorgwardtLab/PST.

Submitted 26 January, 2024; originally announced January 2024.

arXiv:2207.02968

Unsupervised Manifold Alignment with Joint Multidimensional Scaling

Authors: Dexiong Chen, Bowen Fan, Carlos Oliver, Karsten Borgwardt

Abstract: We introduce Joint Multidimensional Scaling, a novel approach for unsupervised manifold alignment, which maps datasets from two different domains, without any known correspondences between data instances across the datasets, to a common low-dimensional Euclidean space. Our approach integrates Multidimensional Scaling (MDS) and Wasserstein Procrustes analysis into a joint optimization problem to simultaneously generate isometric embeddings of data and learn correspondences between instances from two different datasets, while only requiring intra-dataset pairwise dissimilarities as input. This unique characteristic makes our approach applicable to datasets without access to the input features, such as solving the inexact graph matching problem. We propose an alternating optimization scheme to solve the problem that can fully benefit from the optimization techniques for MDS and Wasserstein Procrustes. We demonstrate the effectiveness of our approach in several applications, including joint visualization of two datasets, unsupervised heterogeneous domain adaptation, graph matching, and protein structure alignment. The implementation of our work is available at https://github.com/BorgwardtLab/JointMDS

Submitted 16 February, 2023; v1 submitted 6 July, 2022; originally announced July 2022.

Comments: ICLR 2023, see https://openreview.net/forum?id=lUpjsrKItz4

arXiv:2206.01008

Approximate Network Motif Mining Via Graph Learning

Authors: Carlos Oliver, Dexiong Chen, Vincent Mallet, Pericles Philippopoulos, Karsten Borgwardt

Abstract: Frequent and structurally related subgraphs, also known as network motifs, are valuable features of many graph datasets. However, the high computational complexity of identifying motif sets in arbitrary datasets (motif mining) has limited their use in many real-world datasets. By automatically leveraging statistical properties of datasets, machine learning approaches have shown promise in several tasks with combinatorial complexity and are therefore a promising candidate for network motif mining. In this work we seek to facilitate the development of machine learning approaches aimed at motif mining. We propose a formulation of the motif mining problem as a node labelling task. In addition, we build benchmark datasets and evaluation metrics which test the ability of models to capture different aspects of motif discovery such as motif number, size, topology, and scarcity. Next, we propose MotiFiesta, a first attempt at solving this problem in a fully differentiable manner with promising results on challenging baselines. Finally, we demonstrate through MotiFiesta that this learning setting can be applied simultaneously to general-purpose data mining and interpretable feature extraction for graph classification tasks.

Submitted 7 June, 2022; v1 submitted 2 June, 2022; originally announced June 2022.

arXiv:2109.09432

Edge-similarity-aware Graph Neural Networks

Authors: Vincent Mallet, Carlos G. Oliver, William L. Hamilton

Abstract: Graphs are a ubiquitous data representation because they are flexible and compact. For instance, the 3D structure of RNA can be efficiently represented as $\textit{2.5D graphs}$, graphs whose nodes are nucleotides and whose edges represent chemical interactions. In this setting, we have biological evidence of the similarity between the edge types, as some chemical interactions are more similar than others. Machine learning on graphs has recently experienced a breakthrough with the introduction of Graph Neural Networks. This algorithm can be framed as a message passing algorithm between graph nodes over graph edges. These messages can depend on the edge type they are transmitted through, but no method currently constrains how a message is altered when the edge type changes. Motivated by the RNA use case, in this project we introduce a graph neural network layer which can leverage prior information about similarities between edges. We show that despite the theoretical appeal of including this similarity prior, the empirical performance is not enhanced on the tasks and datasets we include here.

Submitted 20 September, 2021; originally announced September 2021.
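
The edge-typed message passing described in the abstract above can be illustrated with a toy example. This is a minimal sketch, not the layer proposed in the paper: the graph, the edge-type names (`backbone`, `base_pair`), the scalar features, and the per-type scalar weights are all hypothetical, and a real layer would learn weight matrices rather than fixed scalars.

```python
from collections import defaultdict

# Hypothetical toy graph: nodes are nucleotides, edges carry an interaction type.
edges = [
    (0, 1, "backbone"),
    (1, 2, "backbone"),
    (0, 2, "base_pair"),
]
# Hypothetical scalar node features.
features = {0: 1.0, 1: 2.0, 2: 4.0}
# Hypothetical per-edge-type weights (assumed fixed here, learned in practice).
type_weight = {"backbone": 0.5, "base_pair": 2.0}


def message_passing_step(features, edges, type_weight):
    """One round: each node adds neighbour features scaled by the edge-type weight."""
    incoming = defaultdict(float)
    for u, v, edge_type in edges:
        w = type_weight[edge_type]
        incoming[v] += w * features[u]  # message u -> v
        incoming[u] += w * features[v]  # edges are undirected: also v -> u
    return {node: features[node] + incoming[node] for node in features}


updated = message_passing_step(features, edges, type_weight)
```

Because each message is scaled only by the weight of its own edge type, nothing in this sketch ties the weights of similar edge types together; that missing constraint is the gap the abstract points to.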

arXiv:2107.14028

Estimating Respiratory Rate From Breath Audio Obtained Through Wearable Microphones

Authors: Agni Kumar, Vikramjit Mitra, Carolyn Oliver, Adeeti Ullal, Matt Biddulph, Irida Mance

Abstract: Respiratory rate (RR) is a clinical metric used to assess overall health and physical fitness. An individual's RR can change from their baseline due to chronic illness symptoms (e.g., asthma, congestive heart failure), acute illness (e.g., breathlessness due to infection), and over the course of the day due to physical exhaustion during heightened exertion. Remote estimation of RR can offer a cost-effective method to track disease progression and cardio-respiratory fitness over time. This work investigates a model-driven approach to estimate RR from short audio segments obtained after physical exertion in healthy adults. Data was collected from 21 individuals using microphone-enabled, near-field headphones before, during, and after strenuous exercise. RR was manually annotated by counting perceived inhalations and exhalations. A multi-task Long Short-Term Memory (LSTM) network with convolutional layers was implemented to process mel-filterbank energies, estimate RR in varying background noise conditions, and predict heavy breathing, indicated by an RR of more than 25 breaths per minute. The multi-task model performs both classification and regression tasks and leverages a mixture of loss functions. It was observed that RR can be estimated with a concordance correlation coefficient (CCC) of 0.76 and a mean squared error (MSE) of 0.2, demonstrating that audio can be a viable signal for approximating RR.

Submitted 28 July, 2021; originally announced July 2021.

Comments: International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC) 2021
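
The concordance correlation coefficient (CCC) reported in the abstract above measures agreement rather than mere correlation: it penalizes any offset or scale difference between predictions and annotations. The sketch below implements the standard definition by Lin using population moments; the function name and the sample values are illustrative assumptions, not data or code from the paper.

```python
from statistics import fmean


def concordance_cc(y_true, y_pred):
    """Lin's concordance correlation coefficient, using population moments."""
    if len(y_true) != len(y_pred) or len(y_true) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_t, mean_p = fmean(y_true), fmean(y_pred)
    var_t = fmean((t - mean_t) ** 2 for t in y_true)
    var_p = fmean((p - mean_p) ** 2 for p in y_pred)
    cov = fmean((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    # The squared mean difference is what separates CCC from Pearson correlation.
    denom = var_t + var_p + (mean_t - mean_p) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined when both sequences are the same constant")
    return 2 * cov / denom


# Illustrative values only: predictions shifted by a constant of 1.
shifted = concordance_cc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

In the shifted example the Pearson correlation would be 1, while the CCC drops to 4/7 because of the constant bias.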

arXiv:2009.00664

doi 10.1093/bioinformatics/btab844

VeRNAl: Mining RNA Structures for Fuzzy Base Pairing Network Motifs

Authors: Carlos Oliver, Vincent Mallet, Pericles Philippopoulos, William L. Hamilton, Jerome Waldispuhl

Abstract: RNA 3D motifs are recurrent substructures, modelled as networks of base pair interactions, which are crucial for understanding structure-function relationships. The task of automatically identifying such motifs is computationally hard, and remains a key challenge in the field of RNA structural biology and network analysis. State-of-the-art methods solve special cases of the motif problem by constraining the structural variability in occurrences of a motif, and narrowing the substructure search space. Here, we relax these constraints by posing the motif finding problem as a graph representation learning and clustering task. This framing takes advantage of the continuous nature of graph representations to model the flexibility and variability of RNA motifs in an efficient manner. We propose a set of node similarity functions, clustering methods, and motif construction algorithms to recover flexible RNA motifs. Our tool, VeRNAl, can be easily customized by users to desired levels of motif flexibility, abundance and size. We show that VeRNAl is able to retrieve and expand known classes of motifs, as well as to propose novel motifs.

Submitted 18 October, 2021; v1 submitted 1 September, 2020; originally announced September 2020.

arXiv:1911.00435

doi 10.5195/ledger.2020.194

Difficulty Scaling in Proof of Work for Decentralized Problem Solving

Authors: Pericles Philippopoulos, Alessandro Ricottone, Carlos G. Oliver

Abstract: We propose DIPS (Difficulty-based Incentives for Problem Solving), a simple modification of the Bitcoin proof-of-work algorithm that rewards blockchain miners for solving optimization problems of scientific interest. The result is a blockchain which redirects some of the computational resources invested in hash-based mining towards scientific computation, effectively reducing the amount of energy 'wasted' on mining. DIPS builds the solving incentive directly in the proof-of-work by providing a reduction in block hashing difficulty when optimization improvements are found. A key advantage of this scheme is that decentralization is preserved and no additional protocol layers are required on top of the standard blockchain. We study two incentivization schemes and provide simulation results showing that DIPS is able to reduce the amount of hash-power used in the network while generating solutions to optimization problems.

Submitted 1 November, 2019; originally announced November 2019.
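
The core idea in the abstract above, lowering the hashing difficulty for a block that carries an improved solution, can be sketched as follows. This is an illustration under stated assumptions and not either of the two incentivization schemes studied in the paper: the constants `BASE_DIFFICULTY_BITS` and `IMPROVEMENT_DISCOUNT_BITS`, the "lower score is better" convention, and all function names are hypothetical.

```python
import hashlib

BASE_DIFFICULTY_BITS = 16      # hypothetical: leading zero bits required by default
IMPROVEMENT_DISCOUNT_BITS = 4  # hypothetical: bits waived for an improved solution


def required_bits(candidate_score, best_score):
    """Difficulty for a block; a lower score is assumed to be a better solution."""
    improved = candidate_score is not None and candidate_score < best_score
    return BASE_DIFFICULTY_BITS - (IMPROVEMENT_DISCOUNT_BITS if improved else 0)


def meets_target(block_bytes, nonce, bits):
    """True if SHA-256(block || nonce) starts with at least `bits` zero bits."""
    digest = hashlib.sha256(block_bytes + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") >> (256 - bits) == 0


def mine(block_bytes, bits, max_nonce=2**32):
    """Brute-force search for a nonce that satisfies the difficulty, or None."""
    for nonce in range(max_nonce):
        if meets_target(block_bytes, nonce, bits):
            return nonce
    return None


# A block carrying a better solution (9.0 < 10.0) is mined at reduced difficulty.
bits_with_solution = required_bits(candidate_score=9.0, best_score=10.0)
nonce = mine(b"example block", bits_with_solution)
```

With these assumed constants, a miner who includes an improvement needs on average 2^12 hash attempts instead of 2^16, which is the sense in which optimization work substitutes for hashing work.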

arXiv:1905.12033

Leveraging binding-site structure for drug discovery with point-cloud methods

Authors: Vincent Mallet, Carlos G. Oliver, Nicolas Moitessier, Jerome Waldispuhl

Abstract: Computational drug discovery strategies can be broadly placed in two categories: ligand-based methods which identify novel molecules by similarity with known ligands, and structure-based methods which predict molecules with high-affinity to a given 3D structure (e.g. a protein). However, ligand-based methods do not leverage information about the binding site, and structure-based approaches rely on the knowledge of a finite set of ligands binding the target. In this work, we introduce TarLig, a novel approach that aims to bridge the gap between ligand and structure-based approaches. We use the 3D structure of the binding site as input to a model which predicts the ligand preferences of the binding site. The resulting predictions could then offer promising seeds and constraints in the chemical space search, based on the binding site structure. TarLig outperforms standard models by introducing a data-alignment and augmentation technique. The recent popularity of Volumetric 3DCNN pipelines in structural bioinformatics suggests that this extra step could help a wide range of methods to improve their results with minimal modifications.

Submitted 28 May, 2019; originally announced May 2019.

arXiv:1708.09419

Proposal for a fully decentralized blockchain and proof-of-work algorithm for solving NP-complete problems

Authors: Carlos G. Oliver, Alessandro Ricottone, Pericles Philippopoulos

Abstract: We propose a proof-of-work algorithm that rewards blockchain miners for using computational resources to solve NP-complete puzzles. The resulting blockchain will publicly store and improve solutions to problems with real world applications while maintaining a secure and fully functional transaction ledger.

Submitted 2 September, 2017; v1 submitted 30 August, 2017; originally announced August 2017.

Showing 1–11 of 11 results for author: Oliver, C