-
An ELIXIR scoping review on domain-specific evaluation metrics for synthetic data in life sciences
Authors:
Styliani-Christina Fragkouli,
Somya Iqbal,
Lisa Crossman,
Barbara Gravel,
Nagat Masued,
Mark Onders,
Devesh Haseja,
Alex Stikkelman,
Alfonso Valencia,
Tom Lenaerts,
Fotis Psomopoulos,
Pilib Ó Broin,
Núria Queralt-Rosinach,
Davide Cirillo
Abstract:
Synthetic data has emerged as a powerful resource in the life sciences, offering solutions for data scarcity, privacy protection and accessibility constraints. By creating artificial datasets that mirror the characteristics of real data, it allows researchers to develop and validate computational methods in controlled environments. Despite its promise, the adoption of synthetic data in the life sciences hinges on rigorous evaluation metrics designed to assess its fidelity and reliability. To explore the current landscape of synthetic data evaluation metrics in several life sciences domains, the ELIXIR Machine Learning Focus Group performed a systematic review of the scientific literature following the PRISMA guidelines. Six critical domains were examined to identify current practices for assessing synthetic data. Findings reveal that, while generation methods are rapidly evolving, systematic evaluation is often overlooked, limiting researchers' ability to compare, validate, and trust synthetic datasets across different domains. This systematic review underscores the urgent need for robust, standardized evaluation approaches that not only bolster confidence in synthetic data but also guide its effective and responsible implementation. By laying the groundwork for establishing domain-specific yet interoperable standards, this scoping review paves the way for future initiatives aimed at enhancing the role of synthetic data in scientific discovery, clinical practice and beyond.
Submitted 17 June, 2025;
originally announced June 2025.
-
Persistence and extinction dynamics in a stochastic predator-prey model with emergent Allee effects
Authors:
Carlos Granados,
Leon A. Valencia
Abstract:
The Allee effect describes a decline in population fitness at low densities, potentially leading to extinction. In predator-prey systems, an emergent Allee effect can arise due to interactions such as density-dependent maturation rates and predation constraints. This work studies a stochastic predator-prey model where the prey population is structured into juvenile and adult stages, with maturation following a nonlinear function. We introduce Ito-type stochastic perturbations in mortality rates to account for environmental variability. We first establish the positivity of solutions and derive sufficient conditions for the stability of the trivial equilibrium, prey extinction, and conditional predator extinction. We then analyze prey persistence under specific maturation rate functions. Finally, numerical simulations illustrate the theoretical results and their ecological implications.
Submitted 21 March, 2025;
originally announced March 2025.
-
The Stochastic Gause predator-prey model: noise-induced extinctions and invariance
Authors:
Leon Alexander Valencia,
Jorge Mario Ramirez Osorio,
Jorge Andres Sanchez
Abstract:
This paper explores a stochastic Gause predator-prey model with bounded or sub-linear functional response. The model, described by a system of stochastic differential equations, captures the influence of stochastic fluctuations on predator-prey dynamics, with particular focus on the stability, extinction, and persistence of populations. We provide sufficient conditions for the existence and boundedness of solutions, analyze noise-induced extinction events, and investigate the existence of unique stationary distributions for the case of Holling type I functional response. Our analysis highlights the critical role of noise in determining long-term ecological outcomes, demonstrating that even in cases where deterministic models predict stable coexistence, stochastic noise can drive populations to extinction or alter the system's dynamics significantly.
Submitted 8 September, 2024;
originally announced September 2024.
-
Parallel Model Exploration for Tumor Treatment Simulations
Authors:
Charilaos Akasiadis,
Miguel Ponce-de-Leon,
Arnau Montagud,
Evangelos Michelioudakis,
Alexia Atsidakou,
Elias Alevizos,
Alexander Artikis,
Alfonso Valencia,
Georgios Paliouras
Abstract:
Computational systems and methods are often used in biological research, including the understanding of cancer and the development of treatments. Simulations of tumor growth and its response to different drugs are of particular importance, but also of challenging complexity. The main challenges are, first, to calibrate the simulators so as to reproduce real-world cases and, second, to search the parameter space for specific values that correspond to effective drug treatments. In this work, we combine a multi-scale simulator for tumor cell growth and a Genetic Algorithm (GA) as a heuristic search method for finding good parameter configurations in reasonable time. The two modules are integrated into a single workflow that can be executed in parallel on high-performance computing infrastructures. In effect, the GA is used to calibrate the simulator, and then to explore different drug delivery schemes. Among these schemes, we aim to find those that minimize tumor cell size and the probability of emergence of drug-resistant cells in the future. Experimental results illustrate the effectiveness and computational efficiency of the approach.
Submitted 22 February, 2022; v1 submitted 25 March, 2021;
originally announced March 2021.
-
Unveiling new disease, pathway, and gene associations via multi-scale neural networks
Authors:
Thomas Gaudelet,
Noel Malod-Dognin,
Jon Sanchez-Valle,
Vera Pancaldi,
Alfonso Valencia,
Natasa Przulj
Abstract:
Diseases involve complex processes and modifications to the cellular machinery. The gene expression profile of the affected cells contains characteristic patterns linked to a disease. Hence, biological knowledge pertaining to a disease can be derived from a patient cell's profile, improving our diagnostic ability, as well as our grasp of disease risks. This knowledge can be used for drug re-purposing, or by physicians to evaluate a patient's condition and co-morbidity risk. Here, we look at differential gene expression obtained from microarray technology for patients diagnosed with various diseases. Based on this data and cellular multi-scale organization, we aim to uncover disease-disease links, as well as disease-gene and disease-pathway associations. We propose neural networks with structures inspired by the multi-scale organization of a cell. We show that these models are able to correctly predict the diagnosis for the majority of the patients. Through the analysis of the trained models, we predict and validate disease-disease, disease-pathway, and disease-gene associations with comparisons to known interactions and literature search, proposing putative explanations for the novel predictions that come from our study.
Submitted 10 April, 2020; v1 submitted 28 January, 2019;
originally announced January 2019.
-
An expanded evaluation of protein function prediction methods shows an improvement in accuracy
Authors:
Yuxiang Jiang,
Tal Ronnen Oron,
Wyatt T Clark,
Asma R Bankapur,
Daniel D'Andrea,
Rosalba Lepore,
Christopher S Funk,
Indika Kahanda,
Karin M Verspoor,
Asa Ben-Hur,
Emily Koo,
Duncan Penfold-Brown,
Dennis Shasha,
Noah Youngs,
Richard Bonneau,
Alexandra Lin,
Sayed ME Sahraeian,
Pier Luigi Martelli,
Giuseppe Profiti,
Rita Casadio,
Renzhi Cao,
Zhaolong Zhong,
Jianlin Cheng,
Adrian Altenhoff,
Nives Skunca
, et al. (122 additional authors not shown)
Abstract:
Background: The increasing volume and variety of genotypic and phenotypic data is a major defining characteristic of modern biomedical sciences. At the same time, the limitations in technology for generating data and the inherently stochastic nature of biomolecular events have led to a discrepancy between the volume of data and the amount of knowledge gleaned from it. A major bottleneck in our ability to understand the molecular underpinnings of life is the assignment of function to biological macromolecules, especially proteins. While molecular experiments provide the most reliable annotation of proteins, their relatively low throughput and restricted purview have led to an increasing role for computational function prediction. However, accurately assessing methods for protein function prediction and tracking progress in the field remain challenging. Methodology: We have conducted the second Critical Assessment of Functional Annotation (CAFA), a timed challenge to assess computational methods that automatically assign protein function. One hundred twenty-six methods from 56 research groups were evaluated for their ability to predict biological functions using the Gene Ontology and gene-disease associations using the Human Phenotype Ontology on a set of 3,681 proteins from 18 species. CAFA2 featured significantly expanded analysis compared with CAFA1, with regard to data set size, variety, and assessment metrics. To review progress in the field, the analysis also compared the best methods participating in CAFA1 to those of CAFA2. Conclusions: The top performing methods in CAFA2 outperformed the best methods from CAFA1, demonstrating that computational function prediction is improving. This increased accuracy can be attributed to the combined effect of the growing number of experimental annotations and improved methods for function prediction.
Submitted 2 January, 2016;
originally announced January 2016.
-
Integrating epigenomic data and 3D genomic structure with a new measure of chromatin assortativity
Authors:
Vera Pancaldi,
Enrique Carrillo-de-Santa-Pau,
Biola Maria Javierre,
David Juan,
Peter Fraser,
Mikhail Spivakov,
Alfonso Valencia,
Daniel Rico
Abstract:
Network analysis is a powerful way of modeling chromatin interactions. Assortativity is a network property used in social sciences to identify factors affecting how people establish social ties. We propose a new approach, using chromatin assortativity to integrate the epigenomic landscape of a specific cell type with its chromatin interaction network and thus investigate which proteins or chromatin marks mediate genomic contacts. We use high-resolution Promoter Capture Hi-C and Hi-Cap data as well as ChIA-PET data from mouse embryonic stem cells to investigate promoter-centered chromatin interaction networks and calculate the presence of specific epigenomic features in the chromatin fragments constituting the nodes of the network. We estimate the association of these features to the topology of four chromatin interaction networks and identify features localized in connected areas of the network. Polycomb Group proteins and associated histone marks are the features with the highest chromatin assortativity in promoter-centered networks. We then ask which features distinguish contacts among promoters from contacts between promoters and other genomic elements. We observe higher chromatin assortativity of the actively elongating form of RNA Polymerase II (RNAPII) compared to inactive forms only in interactions between promoters and other elements. Contacts among promoters and contacts between promoters and other elements have different characteristic epigenomic features. We identify a possible role for the elongating form of RNAPII in mediating interactions among promoters, enhancers and transcribed gene bodies. Our approach facilitates the study of multiple genome-wide epigenomic profiles, considering network topology and allowing the comparison of chromatin interaction networks.
Submitted 30 May, 2016; v1 submitted 1 December, 2015;
originally announced December 2015.
-
The shrinking human protein coding complement: are there now fewer than 20,000 genes?
Authors:
Iakes Ezkurdia,
David Juan,
Jose Manuel Rodriguez,
Adam Frankish,
Mark Diekhans,
Jennifer Harrow,
Jesus Vazquez,
Alfonso Valencia,
Michael L. Tress
Abstract:
Determining the full complement of protein-coding genes is a key goal of genome annotation. The most powerful approach for confirming protein-coding potential is the detection of cellular protein expression through peptide mass spectrometry experiments. Here we map the peptides detected in 7 large-scale proteomics studies to almost 60% of the protein-coding genes in the GENCODE annotation of the human genome. We find that conservation across vertebrate species and the age of the gene family are key indicators of whether a peptide will be detected in proteomics experiments. We find peptides for most highly conserved genes and for practically all genes that evolved before Bilateria. At the same time there is almost no evidence of protein expression for genes that have appeared since primates, or for genes that do not have any protein-like features or cross-species conservation. We identify 19 non-protein-like features such as weak conservation, no protein features or ambiguous annotations in major databases that are indicators of low peptide detection rates. We use these features to describe a set of 2,001 genes that are potentially non-coding, and show that many of these genes behave more like non-coding genes than protein-coding genes. We detect peptides for just 3% of these genes. We suggest that many of these 2,001 genes do not code for proteins under normal circumstances and that they should not be included in the human protein-coding gene catalogue. These potential non-coding genes will be revised as part of the ongoing human genome annotation effort.
Submitted 11 February, 2014; v1 submitted 26 December, 2013;
originally announced December 2013.
-
Late-replicating CNVs as a source of new genes
Authors:
David Juan,
Daniel Rico,
Tomas Marques-Bonet,
Oscar Fernandez-Capetillo,
Alfonso Valencia
Abstract:
Asynchronous replication of the genome has been associated with different rates of point mutation and copy number variation (CNV) in human populations. Here, we explored whether the bias in the generation of CNV that is associated with DNA replication timing might have conditioned the birth of new protein-coding genes during evolution. We show that genes that were duplicated during primate evolution are more commonly found among the human genes located in late-replicating CNV regions. We traced the relationship between replication timing and the evolutionary age of duplicated genes. Strikingly, we found that there is a significant enrichment of evolutionarily younger duplicates in late-replicating regions of the human and mouse genomes. Indeed, the presence of duplicates in late-replicating regions gradually decreases as the evolutionary time since duplication extends. Our results suggest that the accumulation of recent duplications in late-replicating CNV regions is an active process influencing genome evolution.
Submitted 30 October, 2013; v1 submitted 31 July, 2013;
originally announced July 2013.
-
Accurate Demarcation of Protein Domain Linkers based on Structural Analysis of Linker Probable Region
Authors:
Vivekanand Samant,
Arvind Hulgeri,
Alfonso Valencia,
Ashish V. Tendulkar
Abstract:
In multi-domain proteins, the domains are connected by a flexible unstructured region called a protein domain linker. The accurate demarcation of these linkers is key to understanding their biochemical and evolutionary attributes. This knowledge helps in designing a suitable linker for engineering stable multi-domain chimeric proteins. Here we propose a novel method for the demarcation of the linker based on a three-dimensional protein structure and a domain definition. The proposed method is based on biological knowledge about the structural flexibility of linkers. We performed structural analysis on a linker probable region (LPR) around domain boundary points of known SCOP domains. The LPR was described using a set of overlapping peptide fragments of fixed size. Each peptide fragment was then described by geometric invariants (GIs) and subjected to a clustering process in which the fragments corresponding to actual linkers emerge as outliers. We then discover the actual linkers by finding the longest continuous stretch of outlier fragments from LPRs. This method was evaluated on a benchmark dataset of 51 continuous multi-domain proteins, where it achieves an F1 score of 0.745 (0.83 precision and 0.66 recall). When the method was applied to 725 continuous multi-domain proteins, it was able to identify novel linkers that were not reported previously. This method can be used in combination with supervised / sequence-based linker prediction methods for accurate linker demarcation.
Submitted 23 November, 2012;
originally announced November 2012.
-
Mirroring co-evolving trees in the light of their topologies
Authors:
Iman Hajirasouliha,
Alexander Schönhuth,
David Juan,
Alfonso Valencia,
S. Cenk Sahinalp
Abstract:
Determining the interaction partners among protein/domain families poses hard computational problems, in particular in the presence of paralogous proteins. Available approaches aim to identify interaction partners among protein/domain families through maximizing the similarity between trimmed versions of their phylogenetic trees. Since maximization of any natural similarity score is computationally difficult, many approaches employ heuristics to maximize the similarity between the distance matrices corresponding to the tree topologies in question. In this paper we devise an efficient deterministic algorithm which directly maximizes the similarity between two leaf-labeled trees with edge lengths, obtaining a score-optimal alignment of the two trees in question.
Our algorithm is significantly faster than those methods based on distance matrix comparison: 1 minute on a single processor vs. 730 hours on a supercomputer. Furthermore, it outperforms the current state-of-the-art heuristic search approach in terms of precision as well as a recently suggested overall performance measure for mirrortree approaches, while incurring only acceptable losses in recall.
A C implementation of the method demonstrated in this paper is available at http://compbio.cs.sfu.ca/mirrort.htm
Submitted 26 October, 2011;
originally announced October 2011.