Quantitative Biology
Showing new listings for Thursday, 23 October 2025
- [1] arXiv:2510.18929 [pdf, html, other]
-
Title: ECKO: Explainable Clinical Knowledge for Oncology
Authors: Marta Contreiras Silva, Daniel Faria, Laura Balbi, Susana Nunes, Ana Filipa Rodrigues, Aleksander Palkowski, Michal Waleron, Emilia Daghir-Wojtkowiak, Ashwin Adrian Kallor, Christophe Battail, Federico Maria Corazza, Manuel Fiorelli, Armando Stellato, Javier Antonio Alfaro, Fabio Massimo Zanzotto, Catia Pesquita
Comments: 18 pages, 9 figures
Subjects: Quantitative Methods (q-bio.QM)
Personalized oncology aims to tailor treatment strategies to the unique molecular and clinical profiles of individual patients, moving beyond the traditional paradigm of treating the disease not the patient. Achieving this vision requires the integration and interpretation of vast, heterogeneous biomedical data within a meaningful scientific framework. Knowledge graphs, structured according to biomedical ontologies, offer a powerful approach to contextualize and interconnect diverse datasets, enabling more precise and informed clinical decision-making.
We present ECKO (Explainable Clinical Knowledge for Oncology), a comprehensive knowledge graph that integrates 33 biomedical ontologies and aggregates data from multiple studies to create a unified resource optimized for data-driven clinical applications in oncology. Designed to support personalized drug recommendations, ECKO facilitates the identification of optimal therapeutic options by linking patient-specific molecular data to relevant pharmacological knowledge. It provides transparent, interpretable explanations for drug recommendations, fostering greater trust and understanding among clinicians and researchers. This resource represents a significant advancement toward explainable, scalable, and clinically actionable personalized medicine in oncology, with potential applications in biomarker discovery, treatment optimization, and translational research.
- [2] arXiv:2510.19294 [pdf, html, other]
-
Title: Integrative Analysis of Epigenetic, Transcriptomic, and Metabolomic Responses to Arsenic Exposure Using Coupled Matrix Factorization
Comments: 12 pages, 4 figures; presented at a conference; in preparation for submission to Briefings in Bioinformatics
Subjects: Genomics (q-bio.GN)
Arsenic (As), a widespread environmental toxin, poses major health risks due to its inorganic forms (iAs), which are linked to cancer, cardiovascular disease, and endocrine disruption. Although its toxic effects have been extensively studied, the molecular mechanisms underlying arsenic-induced perturbations remain incompletely understood. This complexity arises from its ability to reprogram epigenetic landscapes, alter gene expression, and disrupt metabolic balance through interconnected regulatory networks. Existing studies often analyze epigenomic, transcriptomic, and metabolomic datasets independently, overlooking their interdependence. Here, we present a coupled matrix factorization (CMF) framework based on the PARAFAC2-AOADMM model for joint integration of DNA methylation (RRBS), RNA-seq, and metabolomics data from mouse embryonic stem cells (ESCs) and epiblast-like cells (EpiLCs) exposed to arsenic. By jointly decomposing multi-omics matrices, our approach identifies shared and dataset-specific components that capture coordinated molecular responses to arsenic exposure. This integrative methodology demonstrates the potential of CMF-based models in computational toxicology and offers a generalizable framework for dissecting complex multi-layered biological perturbations.
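To make the idea of joint decomposition concrete, here is a minimal sketch of a coupled factorization in which several matrices share one factor along their common (sample) mode. It is not the PARAFAC2-AOADMM model named in the abstract: it omits the PARAFAC2 structure and any constraints, and fits plain least squares by alternating updates. The function name `coupled_factorization` and its parameters are hypothetical.

```python
import numpy as np


def coupled_factorization(matrices, rank, n_iter=50, seed=0):
    """Fit X_k ≈ A @ B_k.T with a single factor A shared by all matrices.

    matrices: list of 2-D arrays with the same number of rows (samples).
    Returns (A, Bs), where Bs[k] is the dataset-specific factor of matrices[k].
    """
    rng = np.random.default_rng(seed)
    n_rows = matrices[0].shape[0]
    A = rng.standard_normal((n_rows, rank))  # random start for the shared factor
    X_all = np.hstack(matrices)              # all datasets side by side
    for _ in range(n_iter):
        # With A fixed, each B_k solves the least-squares problem A @ B_k.T ≈ X_k.
        Bs = [np.linalg.lstsq(A, X, rcond=None)[0].T for X in matrices]
        # With every B_k fixed, A solves B_all @ A.T ≈ X_all.T.
        B_all = np.vstack(Bs)
        A = np.linalg.lstsq(B_all, X_all.T, rcond=None)[0].T
    return A, Bs
```

Without constraints this is equivalent to a low-rank factorization of the concatenated matrix; the shared and dataset-specific components described in the abstract require the additional structure of the full model.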
- [3] arXiv:2510.19484 [pdf, html, other]
-
Title: KnowMol: Advancing Molecular Large Language Models with Multi-Level Chemical Knowledge
Subjects: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Molecular large language models have garnered widespread attention due to their promising potential in molecular applications. However, current molecular large language models face significant limitations in understanding molecules due to inadequate textual descriptions and suboptimal molecular representation strategies during pretraining. To address these challenges, we introduce KnowMol-100K, a large-scale dataset with 100K fine-grained molecular annotations across multiple levels, bridging the gap between molecules and textual descriptions. Additionally, we propose a chemically informative molecular representation that addresses limitations in existing molecular representation strategies. Building upon these innovations, we develop KnowMol, a state-of-the-art multi-modal molecular large language model. Extensive experiments demonstrate that KnowMol achieves superior performance across molecular understanding and generation tasks.
GitHub: this https URL
Huggingface: this https URL
- [4] arXiv:2510.19499 [pdf, other]
-
Title: Interactive visualization of kidney micro-compartmental segmentations and associated pathomics on whole slide images
Authors: Mark S. Keller, Nicholas Lucarelli, Yijiang Chen, Samuel Border, Andrew Janowczyk, Jonathan Himmelfarb, Matthias Kretzler, Jeffrey Hodgin, Laura Barisoni, Dawit Demeke, Leal Herlitz, Gilbert Moeckel, Avi Z. Rosenberg, Yanli Ding (for the Kidney Precision Medicine Project, for the HuBMAP Consortium), Pinaki Sarder, Nils Gehlenborg
Subjects: Quantitative Methods (q-bio.QM); Human-Computer Interaction (cs.HC)
Application of machine learning techniques enables segmentation of functional tissue units in histology whole-slide images (WSIs). We built a pipeline to apply previously validated segmentation models of kidney structures and extract quantitative features from these structures. Such quantitative analysis also requires qualitative inspection of results for quality control, exploration, and communication. We extend the Vitessce web-based visualization tool to enable visualization of segmentations of multiple types of functional tissue units, such as glomeruli, tubules, and arteries/arterioles in the kidney. Moreover, we propose a standard representation for files containing multiple segmentation bitmasks, which we define polymorphically, such that existing formats including OME-TIFF, OME-NGFF, AnnData, MuData, and SpatialData can be used. We demonstrate that these methods enable researchers and the broader public to interactively explore datasets containing multiple segmented entities and associated features, including for exploration of renal morphometry of biopsies from the Kidney Precision Medicine Project (KPMP) and the Human Biomolecular Atlas Program (HuBMAP).
New submissions (showing 4 of 4 entries)
- [5] arXiv:2510.18889 (cross-list from physics.soc-ph) [pdf, other]
-
Title: Prejudice driven spite: A discontinuous phase transition in ultimatum game
Comments: Accepted in Physical Review E
Subjects: Physics and Society (physics.soc-ph); Theoretical Economics (econ.TH); Adaptation and Self-Organizing Systems (nlin.AO); Populations and Evolution (q-bio.PE)
In a mix of prejudiced and unprejudiced individuals engaged in strategic interactions, the individual intensity of prejudice is expected to affect the overall level of societal prejudice. A high level of prejudice should lead to discrimination that may manifest as unfairness and, perhaps, even spite. In this paper, we investigate this idea in the classical paradigm of the ultimatum game, which we theoretically modify to introduce prejudice at the level of players, terming its intensity prejudicity. The stochastic evolutionary game dynamics, in the regime of replication-selection, reveals the emergence of spiteful behaviour as a dominant behaviour via a first-order phase transition -- a discontinuous jump in the frequency of spiteful individuals at a threshold value of prejudicity. The phase transition is quite robust and becomes progressively more conspicuous in the limit of large population size, where deterministic evolutionary game dynamics, viz., replicator dynamics, approximates the system closely. The emergence of spite driven by prejudice is also found to persist when one considers long-term evolutionary dynamics in the mutation-selection dominated regime.
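For orientation, the replicator dynamics mentioned as the large-population limit has the following generic textbook form. The symbols are assumptions introduced here: $x_i$ is the frequency of strategy $i$ and $f_i(x)$ its expected payoff. The abstract does not give the specific payoffs of the modified ultimatum game, so none are shown.

```latex
\dot{x}_i = x_i \left[ f_i(x) - \bar{f}(x) \right],
\qquad
\bar{f}(x) = \sum_j x_j f_j(x)
```

A strategy grows in frequency exactly when its payoff exceeds the population average $\bar{f}(x)$.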
- [6] arXiv:2510.19021 (cross-list from cs.LG) [pdf, html, other]
-
Title: Category learning in deep neural networks: Information content and geometry of internal representations
Journal-ref: Physical Review E 2025
Subjects: Machine Learning (cs.LG); Information Theory (cs.IT); Neurons and Cognition (q-bio.NC)
In animals, category learning enhances discrimination between stimuli close to the category boundary. This phenomenon, called categorical perception, was also empirically observed in artificial neural networks trained on classification tasks. In previous modeling works based on neuroscience data, we show that this expansion/compression is a necessary outcome of efficient learning. Here we extend our theoretical framework to artificial networks. We show that minimizing the Bayes cost (mean of the cross-entropy loss) implies maximizing the mutual information between the set of categories and the neural activities prior to the decision layer. Considering structured data with an underlying feature space of small dimension, we show that maximizing the mutual information implies (i) finding an appropriate projection space, and, (ii) building a neural representation with the appropriate metric. The latter is based on a Fisher information matrix measuring the sensitivity of the neural activity to changes in the projection space. Optimal learning makes this neural Fisher information follow a category-specific Fisher information, measuring the sensitivity of the category membership. Category learning thus induces an expansion of neural space near decision boundaries. We characterize the properties of the categorical Fisher information, showing that its eigenvectors give the most discriminant directions at each point of the projection space. We find that, unexpectedly, its maxima are in general not exactly at, but near, the class boundaries. Considering toy models and the MNIST dataset, we numerically illustrate how after learning the two Fisher information matrices match, and essentially align with the category boundaries. Finally, we relate our approach to the Information Bottleneck one, and we exhibit a bias-variance decomposition of the Bayes cost, of interest on its own.
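The stated link between the Bayes cost and mutual information can be motivated by a standard identity. As a sketch, with notation assumed here rather than taken from the paper: let $Y$ be the category, $R$ the neural activity before the decision layer, and $q(y \mid r)$ the output distribution of the network.

```latex
\mathbb{E}_{p(r,y)}\!\left[ -\log q(y \mid r) \right]
  = H(Y \mid R)
  + \mathbb{E}_{p(r)}\!\left[ D_{\mathrm{KL}}\!\big( p(\cdot \mid r) \,\|\, q(\cdot \mid r) \big) \right]
  \;\ge\; H(Y \mid R) = H(Y) - I(Y; R)
```

Because $H(Y)$ is fixed by the data, the lowest achievable mean cross-entropy decreases as $I(Y; R)$ increases, and the bound is attained when $q(y \mid r) = p(y \mid r)$.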
- [7] arXiv:2510.19090 (cross-list from cond-mat.soft) [pdf, html, other]
-
Title: Learning noisy tissue dynamics across time scales
Comments: 15 pages, 6 figures
Subjects: Soft Condensed Matter (cond-mat.soft); Machine Learning (cs.LG); Biological Physics (physics.bio-ph); Quantitative Methods (q-bio.QM)
Tissue dynamics play a crucial role in biological processes ranging from wound healing to morphogenesis. However, these noisy multicellular dynamics are notoriously hard to predict. Here, we introduce a biomimetic machine learning framework capable of inferring noisy multicellular dynamics directly from experimental movies. This generative model combines graph neural networks, normalizing flows and WaveNet algorithms to represent tissues as neural stochastic differential equations where cells are edges of an evolving graph. This machine learning architecture reflects the architecture of the underlying biological tissues, substantially reducing the amount of data needed to train it compared to convolutional or fully-connected neural networks. Taking epithelial tissue experiments as a case study, we show that our model not only captures stochastic cell motion but also predicts the evolution of cell states in their division cycle. Finally, we demonstrate that our method can accurately generate the experimental dynamics of developmental systems, such as the fly wing, and cell signaling processes mediated by stochastic ERK waves, paving the way for its use as a digital twin in bioengineering and clinical contexts.
- [8] arXiv:2510.19455 (cross-list from eess.IV) [pdf, other]
-
Title: Automated Morphological Analysis of Neurons in Fluorescence Microscopy Using YOLOv8
Comments: 7 pages, 2 figures and 2 tables
Subjects: Image and Video Processing (eess.IV); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Accurate segmentation and precise morphological analysis of neuronal cells in fluorescence microscopy images are crucial steps in neuroscience and biomedical imaging applications. However, this process is labor-intensive and time-consuming, requiring significant manual effort and expertise to ensure reliable outcomes. This work presents a pipeline for neuron instance segmentation and measurement based on a high-resolution dataset of stem-cell-derived neurons. The proposed method uses YOLOv8, trained on manually annotated microscopy images. The model achieved high segmentation accuracy, exceeding 97%. In addition, the pipeline utilized both ground truth and predicted masks to extract biologically significant features, including cell length, width, area, and grayscale intensity values. The overall accuracy of the extracted morphological measurements reached 75.32%, further supporting the effectiveness of the proposed approach. This integrated framework offers a valuable tool for automated analysis in cell imaging and neuroscience research, reducing the need for manual annotation and enabling scalable, precise quantification of neuron morphology.
- [9] arXiv:2510.19532 (cross-list from cs.HC) [pdf, other]
-
Title: EasyVitessce: auto-magically adding interactivity to Scverse single-cell and spatial biology plots
Subjects: Human-Computer Interaction (cs.HC); Quantitative Methods (q-bio.QM)
EasyVitessce is a Python package that turns existing static Scanpy and SpatialData plots into interactive visualizations by adding a single line of Python code. The package uses Vitessce internally to render interactive plots, and abstracts away the technical details of configuring Vitessce. The resulting interactive plots can be viewed in computational notebook environments, or their configurations can be exported for use in other contexts such as web applications, enhancing the utility of popular Scverse Python plotting APIs. EasyVitessce is released under the MIT License and available on the Python Package Index (PyPI). The source code is publicly available on GitHub.
- [10] arXiv:2510.19660 (cross-list from cs.ET) [pdf, html, other]
-
Title: Machine Olfaction and Embedded AI Are Shaping the New Global Sensing Industry
Authors: Andreas Mershin, Nikolas Stefanou, Adan Rotteveel, Matthew Kung, George Kung, Alexandru Dan, Howard Kivell, Zoia Okulova, Zoi Kountouri, Paul Pu Liang
Comments: 23 pages, 116 citations, combination tech review/industry roadmap/white paper on the rise of machine olfaction as an essential AI modality
Subjects: Emerging Technologies (cs.ET); Biomolecules (q-bio.BM)
Machine olfaction is rapidly emerging as a transformative capability, with applications spanning non-invasive medical diagnostics, industrial monitoring, agriculture, and security and defense. Recent advances in stabilizing mammalian olfactory receptors and integrating them into biophotonic and bioelectronic systems have enabled detection at near single-molecule resolution, thus placing machines on par with trained detection dogs. As this technology converges with multimodal AI and distributed sensor networks imbued with embedded AI, it introduces a new, biochemical layer to a sensing ecosystem currently dominated by machine vision and audition. This review and industry roadmap surveys the scientific foundations, technological frontiers, and strategic applications of machine olfaction, making the case that we are currently witnessing the rise of a new industry that brings with it a global chemosensory infrastructure. We cover exemplary industrial, military and consumer applications and address some of the ethical and legal concerns that arise. We find that machine olfaction is poised to bring forth a planet-wide molecular awareness tech layer with the potential of spawning vast emerging markets in health, security, and environmental sensing via scent.
- [11] arXiv:2510.19749 (cross-list from cs.LG) [pdf, html, other]
-
Title: BATIS: Bayesian Approaches for Targeted Improvement of Species Distribution Models
Subjects: Machine Learning (cs.LG); Populations and Evolution (q-bio.PE); Quantitative Methods (q-bio.QM)
Species distribution models (SDMs), which aim to predict species occurrence based on environmental variables, are widely used to monitor and respond to biodiversity change. Recent deep learning advances for SDMs have been shown to perform well on complex and heterogeneous datasets, but their effectiveness remains limited by spatial biases in the data. In this paper, we revisit deep SDMs from a Bayesian perspective and introduce BATIS, a novel and practical framework wherein prior predictions are updated iteratively using limited observational data. Models must appropriately capture both aleatoric and epistemic uncertainty to effectively combine fine-grained local insights with broader ecological patterns. We benchmark an extensive set of uncertainty quantification approaches on a novel dataset including citizen science observations from the eBird platform. Our empirical study shows how Bayesian deep learning approaches can greatly improve the reliability of SDMs in data-scarce locations, which can contribute to ecological understanding and conservation efforts.
Cross submissions (showing 7 of 7 entries)
- [12] arXiv:1904.03236 (replaced) [pdf, html, other]
-
Title: Log-normal Superstatistics in the Confined Motion of Ants
Comments: 6 pages, 8 figures
Subjects: Populations and Evolution (q-bio.PE)
We report the emergence of Log-normal Superstatistics in the collective motion of ants confined in a quasi-2D arena and exposed to a panic-inducing stimulus. A data-driven superstatistical Langevin model accurately reproduces the transition from stationary behavior to an organized escape response, characterized by non-Gaussian velocity distributions and a fluctuating diffusion coefficient. Our findings show that danger information propagates via a memory-limited, cascade-like mechanism, resulting in a stable cluster formation despite individual memory constraints. These discoveries establish a crucial connection between Superstatistics formalisms and living active matter beyond a unicellular level, and provide a foundation for the understanding of the biological origin of Log-normal type diffusion in confined environments.
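As background, superstatistics describes a system whose local statistics are simple but whose intensive parameter fluctuates. A generic sketch for a single velocity component follows. The notation is assumed here and not taken from the paper: $\beta$ is the fluctuating inverse-temperature-like parameter, and $\mu$ and $s$ are the parameters of its log-normal distribution.

```latex
p(v) = \int_0^{\infty} f(\beta)\, \sqrt{\frac{\beta}{2\pi}}\; e^{-\beta v^2 / 2}\, \mathrm{d}\beta,
\qquad
f(\beta) = \frac{1}{\beta s \sqrt{2\pi}}
           \exp\!\left[ -\frac{(\ln \beta - \mu)^2}{2 s^2} \right]
```

Averaging the conditionally Gaussian velocity distribution over $f(\beta)$ produces the heavier-than-Gaussian tails associated with the non-Gaussian velocity distributions mentioned above.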
- [13] arXiv:2408.00782 (replaced) [pdf, other]
-
Title: Dynamic transitions of blind spots in the Hermann grid illusion
Comments: 10 pages, 11 figures
Subjects: Neurons and Cognition (q-bio.NC)
Hermann discovered the grid illusion in 1870, but its cause has remained a mystery for more than 150 years. In 1960, Baumgartner proposed a hypothesis for the illusion based on neural receptive fields, but Geier presented a counterexample in 2008. In 1995, Schrauf devised the scintillating grid illusion, an improvement on the Hermann grid illusion. I propose that a hypothesis involving blind spots (optic discs) can significantly contribute to unraveling the mystery of the grid illusion.
- [14] arXiv:2409.08303 (replaced) [pdf, html, other]
-
Title: Interpretable Features for the Assessment of Neurodegenerative Diseases through Handwriting Analysis
Authors: Thomas Thebaud, Anna Favaro, Casey Chen, Gabrielle Chavez, Laureano Moro-Velazquez, Ankur Butala, Najim Dehak
Comments: pages including references, accepted in IEEE JHBI
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (cs.LG)
Motor dysfunction is a common sign of neurodegenerative diseases (NDs) such as Parkinson's disease (PD) and Alzheimer's disease (AD), but may be difficult to detect, especially in the early stages. In this work, we examine the behavior of a wide array of interpretable features extracted from the handwriting signals of 113 subjects performing multiple tasks on a digital tablet, as part of the Neurological Signals dataset. The aim is to measure their effectiveness in characterizing NDs, including AD and PD. To this end, task-agnostic and task-specific features are extracted from 14 distinct tasks. Subsequently, through statistical analysis and a series of classification experiments, we investigate which features provide greater discriminative power between NDs and healthy controls and amongst different NDs. Preliminary results indicate that the tasks at hand can all be effectively leveraged to distinguish between the considered set of NDs, specifically by measuring the stability, the speed of writing, the time spent not writing, and the pressure variations between groups from our handcrafted interpretable features, which show statistically significant differences between groups across multiple tasks. Using various binary classification algorithms on the computed features, we obtain up to 87% accuracy for the discrimination between AD and healthy controls (CTL), and up to 69% for the discrimination between PD and CTL.
- [15] arXiv:2409.11183 (replaced) [pdf, html, other]
-
Title: Comorbid anxiety predicts lower odds of MDD improvement in a trial of smartphone-delivered interventions
Comments: Jessica M. Lipschitz and Omar Costilla-Reyes are co-senior authors
Journal-ref: Talbot, Morgan B., Jessica M. Lipschitz*, and Omar Costilla-Reyes*. "Comorbid anxiety predicts lower odds of MDD improvement in a trial of smartphone-delivered interventions." J. of Affective Disorders (2025): 120416. *Co-Senior Authors
Subjects: Quantitative Methods (q-bio.QM)
Comorbid anxiety disorders are common among patients with major depressive disorder (MDD), but their impact on outcomes of digital and smartphone-delivered interventions is not well understood. This study is a secondary analysis of a randomized controlled effectiveness trial (n=638) that assessed three smartphone-delivered interventions: Project EVO (a cognitive training app), iPST (a problem-solving therapy app), and Health Tips (an active control). We applied classical machine learning models (logistic regression, support vector machines, decision trees, random forests, and k-nearest-neighbors) to identify baseline predictors of MDD improvement at 4 weeks after trial enrollment. Our analysis produced a decision tree model indicating that a baseline GAD-7 questionnaire score of 11 or higher, a threshold consistent with at least moderate anxiety, strongly predicts lower odds of MDD improvement in this trial. Our exploratory findings suggest that depressed individuals with comorbid anxiety have reduced odds of substantial improvement in the context of smartphone-delivered interventions, as the association was observed across all three intervention groups. Our work highlights a methodology that can identify interpretable clinical thresholds, which, if validated, could predict symptom trajectories and inform treatment selection and intensity.
- [16] arXiv:2503.21681 (replaced) [pdf, html, other]
-
Title: A Comprehensive Benchmark for RNA 3D Structure-Function Modeling
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG); Machine Learning (stat.ML)
The relationship between RNA structure and function has recently attracted interest within the deep learning community, a trend expected to intensify as nucleic acid structure models advance. Despite this momentum, the lack of standardized, accessible benchmarks for applying deep learning to RNA 3D structures hinders progress. To this end, we introduce a collection of seven benchmarking datasets specifically designed to support RNA structure-function prediction. Built on top of the established Python package rnaglib, our library streamlines data distribution and encoding, provides tools for dataset splitting and evaluation, and offers a comprehensive, user-friendly environment for model comparison. The modular and reproducible design of our datasets encourages community contributions and enables rapid customization. To demonstrate the utility of our benchmarks, we report baseline results for all tasks using a relational graph neural network.
- [17] arXiv:2504.21818 (replaced) [pdf, html, other]
-
Title: Lineage topology, replication kinetics and cell cycle synchronization reveal regulated growth dynamics in human bone marrow stromal cell colonies
Authors: Alessandro Allegrezza, Riccardo Beschi, Domenico Caudo, Andrea Cavagna, Alessandro Corsi, Antonio Culla, Samantha Donsante, Giuseppe Giannicola, Irene Giardina, Giorgio Gosti, Tomas S. Grigera, Stefania Melillo, Biagio Palmisano, Leonardo Parisi, Lorena Postiglione, Mara Riminucci, Francesco Saverio Rotondi
Comments: Version 1 of this submission has been split into two manuscripts: the first one focuses on lineage topology vs. replication kinetics (this submission). The second manuscript, introducing inheritance entropy, can be found in arXiv:2510.18589. 33 pages, 9 figures, 1 table, 2 videos
Subjects: Cell Behavior (q-bio.CB); Statistical Mechanics (cond-mat.stat-mech); Biological Physics (physics.bio-ph)
Bone marrow stromal cells (BMSC) -- which include skeletal stem cells -- are a promising tool in regenerative medicine. However, their heterogeneous and unpredictable in vivo behaviour remains a critical barrier preventing the development of standardized therapeutic approaches for skeletal tissue regeneration. Several studies have attempted to identify in vitro features that could correlate with the in vivo differentiation properties, yet the mechanisms ruling BMSC heterogeneity remain poorly understood. Here, using time-lapse imaging, we lineage-trace 32 single-cell-derived BMSC colonies through seven generations. We observe significant inter-colony and intra-colony heterogeneity in lineage topology (determined by the number of senescent or apoptotic cells) and in replicative kinetics (measured from proliferating cells only). Interestingly, topology and kinetics are strongly correlated, suggesting the existence of regulatory factors linking the non-dividing/apoptotic subpopulations with proliferating cells. Furthermore, BMSCs display highly synchronized cell cycles during early generations, indicating stage-specific regulatory mechanisms through which cells influence each other. By employing a non-interacting population growth model, we demonstrate that the observed synchronisation cannot be explained by an uncorrelated branching process; instead, cell-to-cell correlation of division times must exist. Our findings reveal fundamental mechanisms governing BMSC heterogeneity and growth dynamics that may inform strategies to control their regenerative potential.
- [18] arXiv:2505.11309 (replaced) [pdf, html, other]
-
Title: Decomposing stimulus-specific sensory neural information via diffusion models
Comments: Steeve Laquitaine and Simone Azeglio have equal contributions; Ulisse Ferrari and Matthew Chalk have equal senior contributions
Subjects: Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
To understand sensory coding, we must ask not only how much information neurons encode, but also what that information is about. This requires decomposing mutual information into contributions from individual stimuli and stimulus features: a fundamentally ill-posed problem with infinitely many possible solutions. We address this by introducing three core axioms (additivity, positivity, and locality) that any meaningful stimulus-wise decomposition should satisfy. We then derive a decomposition that meets all three criteria and remains tractable for high-dimensional stimuli. Our decomposition can be efficiently estimated using diffusion models, allowing for scaling up to complex, structured and naturalistic stimuli. Applied to a model of visual neurons, our method quantifies how specific stimuli and features contribute to encoded information. Our approach provides a scalable, interpretable tool for probing representations in both biological and artificial neural systems.
- [19] arXiv:2507.09024 (replaced) [pdf, other]
-
Title: CNeuroMod-THINGS, a densely-sampled fMRI dataset for visual neuroscience
Authors: Marie St-Laurent, Basile Pinsard, Oliver Contier, Elizabeth DuPre, Katja Seeliger, Valentina Borghesani, Julie A. Boyle, Lune Bellec, Martin N. Hebart
Comments: 16 pages manuscript, 5 figures, 9 pages supplementary material
Subjects: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV)
Data-hungry neuro-AI modelling requires ever larger neuroimaging datasets. CNeuroMod-THINGS meets this need by capturing neural representations for a wide set of semantic concepts using well-characterized images in a new densely-sampled, large-scale fMRI dataset. Importantly, CNeuroMod-THINGS exploits synergies between two existing projects: the THINGS initiative (THINGS) and the Courtois Project on Neural Modelling (CNeuroMod). THINGS has developed a common set of thoroughly annotated images broadly sampling natural and man-made objects which is used to acquire a growing collection of large-scale multimodal neural responses. Meanwhile, CNeuroMod is acquiring hundreds of hours of fMRI data from a core set of participants during controlled and naturalistic tasks, including visual tasks like movie watching and videogame playing. For CNeuroMod-THINGS, four CNeuroMod participants each completed 33-36 sessions of a continuous recognition paradigm using approximately 4000 images from the THINGS stimulus set spanning 720 categories. We report behavioural and neuroimaging metrics that showcase the quality of the data. By bridging together large existing resources, CNeuroMod-THINGS expands our capacity to model broad slices of the human visual experience.
- [20] arXiv:2508.18404 (replaced) [pdf, html, other]
-
Title: Saccade crossing avoidance as a visual search strategy
Comments: Main text: 12 pages, 4 figures; Supplementary info: 13 pages, 9 figures
Subjects: Neurons and Cognition (q-bio.NC); Computer Vision and Pattern Recognition (cs.CV); Quantitative Methods (q-bio.QM)
Although visual search appears largely random, several oculomotor biases exist such that the likelihoods of saccade directions and lengths depend on the previous scan path. Compared to the most recent fixations, the impact of the longer path history is more difficult to quantify. Using the step-selection framework commonly used in movement ecology, and analyzing data from 45-second viewings of "Where's Waldo?", we report a new memory-dependent effect that also varies significantly between individuals, which we term self-crossing avoidance. This is a tendency for saccades to avoid crossing those earlier in the scan path, and is most evident when both have small amplitudes. We show this by comparing real data to synthetic data generated from a memoryless approximation of the spatial statistics (i.e. a Markovian nonparametric model with a matching distribution of saccade lengths over time). Maximum likelihood fitting indicates that this effect is strongest when including the last $\approx 7$ seconds of a scan path. The effect size is comparable to well-known forms of history dependence such as inhibition of return. A parametric probabilistic model including a self-crossing penalty term was able to reproduce joint statistics of saccade lengths and self-crossings. We also quantified individual strategic differences, and their consistency over the six images viewed per participant, using mixed-effect regressions. Participants with a higher tendency to avoid crossings displayed smaller saccade lengths and shorter fixation durations on average, but did not display more horizontal, vertical, forward or reverse saccades. Together, these results indicate that the avoidance of crossings is a local orienting strategy that facilitates and complements inhibition of return, and hence exploration of visual scenes.
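The notion of a saccade crossing an earlier one can be illustrated with a standard segment-intersection test. The sketch below treats a scan path as a list of fixation points and each saccade as the straight segment between consecutive fixations. The function names and the strict-crossing convention (touching endpoints and collinear overlaps do not count) are assumptions for illustration, not the paper's definitions, and the time window and penalty model are not reproduced.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def _orient(a: Point, b: Point, c: Point) -> float:
    """Twice the signed area of triangle abc (positive if counter-clockwise)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])


def segments_cross(p1: Point, p2: Point, q1: Point, q2: Point) -> bool:
    """True if segments p1-p2 and q1-q2 cross strictly at an interior point."""
    d1 = _orient(q1, q2, p1)
    d2 = _orient(q1, q2, p2)
    d3 = _orient(p1, p2, q1)
    d4 = _orient(p1, p2, q2)
    # Each segment's endpoints must lie strictly on opposite sides of the other.
    return d1 * d2 < 0 and d3 * d4 < 0


def count_self_crossings(fixations: List[Point]) -> int:
    """Count pairs of non-adjacent saccades in a scan path that cross."""
    saccades = list(zip(fixations[:-1], fixations[1:]))
    crossings = 0
    for i in range(len(saccades)):
        # Skip j = i + 1: adjacent saccades share a fixation and cannot cross strictly.
        for j in range(i + 2, len(saccades)):
            if segments_cross(*saccades[i], *saccades[j]):
                crossings += 1
    return crossings
```

For example, the path `[(0, 0), (2, 2), (2, 0), (0, 2)]` has one self-crossing, because its third saccade cuts across its first.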
- [21] arXiv:2509.11354 (replaced) [pdf, html, other]
-
Title: Intelligent Software System for Low-Cost, Brightfield Segmentation: Algorithmic Implementation for Cytometric Auto-Analysis
Subjects: Quantitative Methods (q-bio.QM); Computer Vision and Pattern Recognition (cs.CV); Image and Video Processing (eess.IV); Cell Behavior (q-bio.CB)
Bright-field microscopy, a cost-effective solution for live-cell culture, is often the only resource available, along with standard CPUs, for many low-budget labs. The inherent challenges of bright-field images - their noisiness, low contrast, and dynamic morphology - coupled with a lack of GPU resources and complex software interfaces, hinder the desired research output. This article presents a novel microscopy image analysis framework designed for low-budget labs equipped with a standard CPU desktop. The Python-based program enables cytometric analysis of live, unstained cells in culture through an advanced computer vision and machine learning pipeline. Crucially, the framework operates on label-free data, requiring no manually annotated training data or training phase. It is accessible via a user-friendly, cross-platform GUI that requires no programming skills, while also providing a scripting interface for programmatic control and integration by developers. The end-to-end workflow performs semantic and instance segmentation, feature extraction, analysis, evaluation, and automated report generation. Its modular architecture supports easy maintenance and flexible integration while supporting both single-image and batch processing. Validated on several unstained cell types from the public dataset of livecells, the framework demonstrates superior accuracy and reproducibility compared to contemporary tools like Cellpose and StarDist. Its competitive segmentation speed on a CPU-based platform highlights its significant potential for basic research and clinical applications - particularly in cell transplantation for personalised medicine and muscle regeneration therapies. Access to the application is provided for reproducibility.
- [22] arXiv:2509.17138 (replaced) [pdf, html, other]
-
Title: Analyzing Memory Effects in Large Language Models through the lens of Cognitive Psychology
Subjects: Neurons and Cognition (q-bio.NC)
Memory, a fundamental component of human cognition, exhibits adaptive yet fallible characteristics as illustrated by Schacter's memory "sins." These cognitive phenomena have been studied extensively in psychology and neuroscience, but the extent to which artificial systems, specifically Large Language Models (LLMs), emulate them remains underexplored. This study uses human memory research as a lens for understanding LLMs and systematically investigates human memory effects in state-of-the-art LLMs using paradigms drawn from psychological research. We evaluate seven key memory phenomena, comparing human behavior to LLM performance. Both people and models remember less when overloaded with information (list length effect) and remember better with repeated exposure (list strength effect). They also show similar difficulties when retrieving overlapping information, where storing too many similar facts leads to confusion (fan effect). Like humans, LLMs are susceptible to falsely "remembering" words that were never shown but are related to others (false memories), and they can apply prior learning to new, related situations (cross-domain generalization). However, LLMs differ in two key ways: they are less influenced by the order in which information is presented (positional bias) and more robust when processing random or meaningless material (nonsense effect). These results reveal both alignments and divergences in how LLMs and humans reconstruct memory. The findings help clarify how memory-like behavior in LLMs echoes core features of human cognition, while also highlighting the architectural differences that lead to distinct patterns of error and success.
- [23] arXiv:2306.10407 (replaced) [pdf, other]
-
Title: FP-IRL: Fokker-Planck Inverse Reinforcement Learning -- A Physics-Constrained Approach to Markov Decision Processes
Chengyang Huang, Siddhartha Srivastava, Kenneth K. Y. Ho, Kathy E. Luker, Gary D. Luker, Xun Huan, Krishna Garikipati
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biological Physics (physics.bio-ph); Cell Behavior (q-bio.CB)
Inverse reinforcement learning (IRL) is a powerful paradigm for uncovering the incentive structure that drives agent behavior, by inferring an unknown reward function from observed trajectories within a Markov decision process (MDP). However, most existing IRL methods require access to the transition function, either prescribed or estimated a priori, which poses significant challenges when the underlying dynamics are unknown, unobservable, or not easily sampled.
We propose Fokker-Planck inverse reinforcement learning (FP-IRL), a novel physics-constrained IRL framework tailored for systems governed by Fokker-Planck (FP) dynamics. FP-IRL simultaneously infers both the reward and transition functions directly from trajectory data, without requiring access to sampled transitions. Our method leverages a conjectured equivalence between MDPs and the FP equation, linking reward maximization in MDPs with free energy minimization in FP dynamics. This connection enables inference of the potential function using our approach of variational system identification, from which the full set of MDP components (reward, transition, and policy) can be recovered using analytic expressions.
We demonstrate the effectiveness of FP-IRL through experiments on synthetic benchmarks and a modified version of the Mountain Car problem. Our results show that FP-IRL achieves accurate recovery of agent incentives while preserving computational efficiency and physical interpretability.
- [24] arXiv:2309.16519 (replaced) [pdf, html, other]
-
Title: AtomSurf : Surface Representation for Learning on Protein Structures
Comments: Published as a conference paper at The Thirteenth International Conference on Learning Representations (ICLR 2025). The official open-access version is available at this https URL
Journal-ref: The Thirteenth International Conference on Learning Representations (ICLR), 2025
Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
While there has been significant progress in evaluating and comparing different representations for learning on protein data, the role of surface-based learning approaches remains poorly understood. In particular, there is a lack of direct and fair benchmark comparisons of the best available surface-based learning methods against alternative representations such as graphs. Moreover, the few existing surface-based approaches either use surface information in isolation or, at best, perform global pooling between surface and graph-based architectures.
In this work, we fill this gap by first adapting a state-of-the-art surface encoder for protein learning tasks. We then perform a direct and fair comparison of the resulting method against alternative approaches within the Atom3D benchmark, highlighting the limitations of pure surface-based learning. Finally, we propose an integrated approach, which allows learned feature sharing between graphs and surface representations on the level of nodes and vertices across all layers.
We demonstrate that the resulting architecture achieves state-of-the-art results on all tasks in the Atom3D benchmark, while adhering to the strict benchmark protocol, as well as more broadly on binding site identification and binding pocket classification. Furthermore, we use coarsened surfaces and optimize our approach for efficiency, making our tool competitive in training and inference time with existing techniques. Code can be found online: this https URL
- [25] arXiv:2406.18851 (replaced) [pdf, html, other]
-
Title: LICO: Large Language Models for In-Context Molecular Optimization
Comments: International Conference on Learning Representations (ICLR 2025)
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Chemical Physics (physics.chem-ph); Biomolecules (q-bio.BM); Quantitative Methods (q-bio.QM)
Optimizing black-box functions is a fundamental problem in science and engineering. To solve this problem, many approaches learn a surrogate function that estimates the underlying objective from limited historical evaluations. Large Language Models (LLMs), with their strong pattern-matching capabilities via pretraining on vast amounts of data, stand out as a potential candidate for surrogate modeling. However, directly prompting a pretrained language model to produce predictions is not feasible in many scientific domains due to the scarcity of domain-specific data in the pretraining corpora and the challenges of articulating complex problems in natural language. In this work, we introduce LICO, a general-purpose model that extends arbitrary base LLMs for black-box optimization, with a particular application to the molecular domain. To achieve this, we equip the language model with a separate embedding layer and prediction layer, and train the model to perform in-context predictions on a diverse set of functions defined over the domain. Once trained, LICO can generalize to unseen molecule properties simply via in-context prompting. LICO performs competitively on PMO, a challenging molecular optimization benchmark comprising 23 objective functions, and achieves state-of-the-art performance on its low-budget version PMO-1K.
- [26] arXiv:2502.02904 (replaced) [pdf, html, other]
-
Title: ScholaWrite: A Dataset of End-to-End Scholarly Writing Process
Comments: Equal contribution: Khanh Chi Le, Linghe Wang, Minhwa Lee | project page: this https URL
Subjects: Human-Computer Interaction (cs.HC); Computation and Language (cs.CL); Neurons and Cognition (q-bio.NC)
Writing is a cognitively demanding activity that requires constant decision-making, heavy reliance on working memory, and frequent shifts between tasks with different goals. To build writing assistants that truly align with writers' cognition, we must capture and decode the complete thought process behind how writers transform ideas into final texts. We present ScholaWrite, the first dataset of end-to-end scholarly writing, tracing the multi-month journey from initial drafts to final manuscripts. We contribute three key advances: (1) a Chrome extension that unobtrusively records keystrokes on Overleaf, enabling the collection of realistic, in-situ writing data; (2) a novel corpus of full scholarly manuscripts, enriched with fine-grained annotations of cognitive writing intentions, which includes LaTeX-based edits from five computer science preprints and captures nearly 62K text changes over four months; and (3) analyses and insights into the micro-dynamics of scholarly writing, highlighting gaps between human writing processes and the current capabilities of large language models (LLMs) in providing meaningful assistance. ScholaWrite underscores the value of capturing end-to-end writing data to develop future writing assistants that support, not replace, the cognitive work of scientists.
- [27] arXiv:2507.05101 (replaced) [pdf, html, other]
-
Title: PRING: Rethinking Protein-Protein Interaction Prediction from Pairs to Graphs
Xinzhe Zheng, Hao Du, Fanding Xu, Jinzhe Li, Zhiyuan Liu, Wenkang Wang, Tao Chen, Wanli Ouyang, Stan Z. Li, Yan Lu, Nanqing Dong, Yang Zhang
Subjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Biomolecules (q-bio.BM); Molecular Networks (q-bio.MN)
Deep learning-based computational methods have achieved promising results in predicting protein-protein interactions (PPIs). However, existing benchmarks predominantly focus on isolated pairwise evaluations, overlooking a model's capability to reconstruct biologically meaningful PPI networks, which is crucial for biology research. To address this gap, we introduce PRING, the first comprehensive benchmark that evaluates protein-protein interaction prediction from a graph-level perspective. PRING curates a high-quality, multi-species PPI network dataset comprising 21,484 proteins and 186,818 interactions, with well-designed strategies to address both data redundancy and leakage. Building on this gold-standard dataset, we establish two complementary evaluation paradigms: (1) topology-oriented tasks, which assess intra- and cross-species PPI network construction, and (2) function-oriented tasks, including protein complex pathway prediction, GO module analysis, and essential protein justification. These evaluations not only reflect the model's capability to understand the network topology but also facilitate protein function annotation, biological module detection, and even disease mechanism analysis. Extensive experiments on four representative model categories (sequence similarity-based, naive sequence-based, protein language model-based, and structure-based approaches) demonstrate that current PPI models have potential limitations in recovering both structural and functional properties of PPI networks, highlighting the gap in supporting real-world biological applications. We believe PRING provides a reliable platform to guide the development of more effective PPI prediction models for the community. The dataset and source code of PRING are available at this https URL.
- [28] arXiv:2510.18037 (replaced) [pdf, html, other]
-
Title: Benchmarking Probabilistic Time Series Forecasting Models on Neural Activity
Ziyu Lu, Anna J. Li, Alexander E. Ladd, Pascha Matveev, Aditya Deole, Eric Shea-Brown, J. Nathan Kutz, Nicholas A. Steinmetz
Comments: Accepted at the 39th Conference on Neural Information Processing Systems (NeurIPS 2025) Workshop: Data on the Brain & Mind
Subjects: Machine Learning (cs.LG); Neurons and Cognition (q-bio.NC); Machine Learning (stat.ML)
Neural activity forecasting is central to understanding neural systems and enabling closed-loop control. While deep learning has recently advanced the state-of-the-art in the time series forecasting literature, its application to neural activity forecasting remains limited. To bridge this gap, we systematically evaluated eight probabilistic deep learning models, including two foundation models, that have demonstrated strong performance on general forecasting benchmarks. We compared them against four classical statistical models and two baseline methods on spontaneous neural activity recorded from mouse cortex via widefield imaging. Across prediction horizons, several deep learning models consistently outperformed classical approaches, with the best model producing informative forecasts up to 1.5 seconds into the future. Our findings point toward future control applications and open new avenues for probing the intrinsic temporal structure of neural activity.