CaliciBoost: Performance-Driven Evaluation of Molecular Representations for Caco-2 Permeability Prediction

Authors: Huong Van Le, Weibin Ren, Junhong Kim, Yukyung Yun, Young Bin Park, Young Jun Kim, Bok Kyung Han, Inho Choi, Jong IL Park, Hwi-Yeol Yun, Jae-Mun Choi

Abstract: Caco-2 permeability serves as a critical in vitro indicator for predicting the oral absorption of drug candidates during early-stage drug discovery. To enhance the accuracy and efficiency of computational predictions, we systematically investigated the impact of eight molecular feature representation types including 2D/3D descriptors, structural fingerprints, and deep learning-based embeddings combined with automated machine learning techniques to predict Caco-2 permeability. Using two datasets of differing scale and diversity (TDC benchmark and curated OCHEM data), we assessed model performance across representations and identified PaDEL, Mordred, and RDKit descriptors as particularly effective for Caco-2 prediction. Notably, the AutoML-based model CaliciBoost achieved the best MAE performance. Furthermore, for both PaDEL and Mordred representations, the incorporation of 3D descriptors resulted in a 15.73% reduction in MAE compared to using 2D features alone, as confirmed by feature importance analysis. These findings highlight the effectiveness of AutoML approaches in ADMET modeling and offer practical guidance for feature selection in data-limited prediction tasks.

Submitted 9 June, 2025; originally announced June 2025.

Comments: 49 pages, 11 figures

arXiv:2505.22134 [pdf]

Infection dynamics for fluctuating infection or removal rates regarding the number of infected and susceptible individuals

Authors: Seong Jun Park, M. Y. Choi

Abstract: In general, the rates of infection and removal (whether through recovery or death) are nonlinear functions of the number of infected and susceptible individuals. One of the simplest models for the spread of infectious diseases is the SIR model, which categorizes individuals as susceptible, infectious, recovered or deceased. In this model, the infection rate, governing the transition from susceptible to infected individuals, is given by a linear function of both susceptible and infected populations. Similarly, the removal rate, representing the transition from infected to removed individuals, is a linear function of the number of infected individuals. However, existing research often overlooks the impact of nonlinear infection and removal rates in infection dynamics. This work presents an analytic expression for the number of infected individuals considering nonlinear infection and removal rates. In particular, we examine how the number of infected individuals varies as cases emerge and obtain the expression accounting for the number of infected individuals at each moment. This work paves the way for new quantitative approaches to understanding infection dynamics.

Submitted 28 May, 2025; originally announced May 2025.
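
For orientation, the classical SIR model that the abstract above uses as its linear baseline can be sketched as follows. This is a minimal forward-Euler illustration, not the paper's analytic result for nonlinear rates; the function name `simulate_sir`, the parameters `beta` and `gamma`, the normalization by total population, and the step size are all assumptions chosen for the example.

```python
def simulate_sir(s0, i0, r0, beta, gamma, dt=0.01, steps=10_000):
    """Forward-Euler integration of the classical (linear-rate) SIR model.

    Assumed form (not taken from the paper):
        dS/dt = -beta * S * I / N
        dI/dt =  beta * S * I / N - gamma * I
        dR/dt =  gamma * I
    """
    n = s0 + i0 + r0  # total population, conserved by the equations
    s, i, r = float(s0), float(i0), float(r0)
    history = [(s, i, r)]
    for _ in range(steps):
        # Infection rate: linear in both S and I (bilinear term).
        new_infections = beta * s * i / n * dt
        # Removal rate: linear in I.
        new_removals = gamma * i * dt
        s -= new_infections
        i += new_infections - new_removals
        r += new_removals
        history.append((s, i, r))
    return history


if __name__ == "__main__":
    trajectory = simulate_sir(s0=990, i0=10, r0=0, beta=0.3, gamma=0.1)
    peak_infected = max(i for _, i, _ in trajectory)
    print(f"peak infected: {peak_infected:.1f}")
```

Replacing the two rate expressions inside the loop with nonlinear functions of `s` and `i` would give a numerical counterpart to the setting the abstract studies, though the paper itself reports an analytic expression.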

arXiv:2504.03732 [pdf, other]

SAGe: A Lightweight Algorithm-Architecture Co-Design for Mitigating the Data Preparation Bottleneck in Large-Scale Genome Analysis

Authors: Nika Mansouri Ghiasi, Talu Güloglu, Harun Mustafa, Can Firtina, Konstantina Koliogeorgi, Konstantinos Kanellopoulos, Haiyu Mao, Rakesh Nadig, Mohammad Sadrosadati, Jisung Park, Onur Mutlu

Abstract: Given the exponentially growing volumes of genomic data, there are extensive efforts to accelerate genome analysis. We demonstrate a major bottleneck that greatly limits and diminishes the benefits of state-of-the-art genome analysis accelerators: the data preparation bottleneck, where genomic data is stored in compressed form and needs to be decompressed and formatted first before an accelerator can operate on it. To mitigate this bottleneck, we propose SAGe, an algorithm-architecture co-design for highly-compressed storage and high-performance access of large-scale genomic data. SAGe overcomes the challenges of mitigating the data preparation bottleneck while maintaining high compression ratios (comparable to genomic-specific compression algorithms) at low hardware cost. This is enabled by leveraging key features of genomic datasets to co-design (i) a new (de)compression algorithm, (ii) hardware, (iii) storage data layout, and (iv) interface commands to access storage. SAGe stores data in structures that can be rapidly interpreted and decompressed by efficient streaming accesses and lightweight hardware. To achieve high compression ratios using only these lightweight structures, SAGe exploits unique features of genomic data. We show that SAGe can be seamlessly integrated with a broad range of genome analysis hardware accelerators to mitigate their data preparation bottlenecks. Our results demonstrate that SAGe improves the average end-to-end performance and energy efficiency of two state-of-the-art genome analysis accelerators by 3.0x-32.1x and 18.8x-49.6x, respectively, compared to when the accelerators rely on state-of-the-art decompression tools.

Submitted 21 April, 2025; v1 submitted 31 March, 2025; originally announced April 2025.

arXiv:2503.20767 [pdf, ps, other]

Reliable algorithm selection for machine learning-guided design

Authors: Clara Fannjiang, Ji Won Park

Abstract: Algorithms for machine learning-guided design, or design algorithms, use machine learning-based predictions to propose novel objects with desired property values. Given a new design task -- for example, to design novel proteins with high binding affinity to a therapeutic target -- one must choose a design algorithm and specify any hyperparameters and predictive and/or generative models involved. How can these decisions be made such that the resulting designs are successful? This paper proposes a method for design algorithm selection, which aims to select design algorithms that will produce a distribution of design labels satisfying a user-specified success criterion -- for example, that at least ten percent of designs' labels exceed a threshold. It does so by combining designs' predicted property values with held-out labeled data to reliably forecast characteristics of the label distributions produced by different design algorithms, building upon techniques from prediction-powered inference. The method is guaranteed with high probability to return design algorithms that yield successful label distributions (or the null set if none exist), if the density ratios between the design and labeled data distributions are known. We demonstrate the method's effectiveness in simulated protein and RNA design tasks, in settings with either known or estimated density ratios.

Submitted 2 July, 2025; v1 submitted 26 March, 2025; originally announced March 2025.

Comments: ICML 2025

arXiv:2503.19924 [pdf, other]

EEG relative phase-based analysis unveils the complexity and universality of human brain dynamics: integrative insights from general anesthesia and ADHD

Authors: Athokpam Langlen Chanu, Youngjai Park, Younghwa Cha, UnCheol Lee, Joon-Young Moon, Jong-Min Park

Abstract: Understanding brain wave patterns is fundamental to uncovering neural information processing mechanisms, making quantifying complexity across brain states an important line of investigation. We present a comprehensive analysis of the complexity of electroencephalography (EEG) signals, integrating data from seven distinct states experienced by participants undergoing general anesthesia, and resting-state recordings from individuals with inattentive-type ADHD alongside healthy control groups. Departing from prior studies that primarily focus on EEG amplitude dynamics, we adopt a novel relative phase approach to extract patterns of information flow directionality based on EEG phase dynamics. We quantify the complexity of these relative phase directionality patterns across various states using permutation entropy (PE) and the statistical complexity measure within the ordinal pattern framework. Our analysis shows that: (i) PE is inversely correlated with the level of consciousness during general anesthesia, reflecting a dynamic interplay between anesthetic depth and shifts in directional information flow; (ii) healthy subjects consistently show higher PE than inADHD participants; (iii) when mapped onto the complexity-entropy causality plane, all brain states, regardless of condition or individual differences, align along a single curve, suggesting an underlying universal pattern in brain dynamics; and (iv) brain data consistently exhibit higher complexity than standard stochastic processes, likely due to greater multifractal scaling. These findings highlight that neural information propagation, as captured by EEG relative phase dynamics, is governed by self-organizing principles that are fundamentally more complex than stochastic processes. Our EEG relative phase-based characterization provides new insight into the complexity of neural information flow directionality.

Submitted 21 April, 2025; v1 submitted 12 March, 2025; originally announced March 2025.
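
As a rough illustration of the permutation entropy (PE) measure named in the abstract above, the sketch below computes the standard ordinal-pattern (Bandt-Pompe) entropy of a one-dimensional series. It is a generic textbook version, not the authors' relative-phase pipeline; the function name `permutation_entropy`, its `order` and `delay` parameters, and the tie-breaking rule are assumptions made for this example.

```python
import math
from collections import Counter


def permutation_entropy(series, order=3, delay=1, normalize=True):
    """Shannon entropy of the ordinal patterns found in `series`.

    Each window of `order` samples (spaced `delay` apart) is mapped to the
    permutation that sorts it; ties are broken by sample position because
    Python's sort is stable (an assumption of this sketch).
    """
    n_windows = len(series) - (order - 1) * delay
    if n_windows <= 0:
        raise ValueError("series too short for the chosen order and delay")

    patterns = Counter()
    for start in range(n_windows):
        window = [series[start + k * delay] for k in range(order)]
        pattern = tuple(sorted(range(order), key=window.__getitem__))
        patterns[pattern] += 1

    entropy = -sum(
        (count / n_windows) * math.log(count / n_windows)
        for count in patterns.values()
    )
    if normalize:
        # Divide by the maximum possible entropy, log(order!).
        entropy /= math.log(math.factorial(order))
    return entropy


if __name__ == "__main__":
    import random

    random.seed(0)
    noise = [random.random() for _ in range(2000)]
    ramp = list(range(2000))
    print("white noise:", round(permutation_entropy(noise), 3))
    print("monotone ramp:", round(permutation_entropy(ramp), 3))
```

A strictly monotone series produces a single ordinal pattern and therefore zero entropy, while an irregular series spreads its probability mass over many patterns and approaches the normalized maximum of 1.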

arXiv:2502.04892 [pdf, other]

A Foundational Brain Dynamics Model via Stochastic Optimal Control

Authors: Joonhyeong Park, Byoungwoo Park, Chang-Bae Bang, Jungwon Choi, Hyungjin Chung, Byung-Hoon Kim, Juho Lee

Abstract: We introduce a foundational model for brain dynamics that utilizes stochastic optimal control (SOC) and amortized inference. Our method features a continuous-discrete state space model (SSM) that can robustly handle the intricate and noisy nature of fMRI signals. To address computational limitations, we implement an approximation strategy grounded in the SOC framework. Additionally, we present a simulation-free latent dynamics approach that employs locally linear approximations, facilitating efficient and scalable inference. For effective representation learning, we derive an Evidence Lower Bound (ELBO) from the SOC formulation, which integrates smoothly with recent advancements in self-supervised learning (SSL), thereby promoting robust and transferable representations. Pre-trained on extensive datasets such as the UKB, our model attains state-of-the-art results across a variety of downstream tasks, including demographic prediction, trait analysis, disease diagnosis, and prognosis. Moreover, evaluating on external datasets such as HCP-A, ABIDE, and ADHD200 further validates its superior abilities and resilience across different demographic and clinical distributions. Our foundational model provides a scalable and efficient approach for deciphering brain dynamics, opening up numerous applications in neuroscience.

Submitted 7 February, 2025; originally announced February 2025.

Comments: The first two authors contributed equally

arXiv:2501.15208 [pdf]

Advancing Understanding of Long COVID Pathophysiology Through Quantum Walk-Based Network Analysis

Authors: Jaesub Park, Woochang Hwang, Seokjun Lee, Hyun Chang Lee, Méabh MacMahon, Matthias Zilbauer, Namshik Han

Abstract: Long COVID is a multisystem condition characterized by persistent symptoms such as fatigue, cognitive impairment, and systemic inflammation, following COVID-19 infection, yet its mechanisms remain poorly understood. In this study, we applied quantum walk (QW), a computational approach leveraging quantum interference, to explore large-scale SARS-CoV-2-induced protein (SIP) networks. Compared to the conventional random walk with restart (RWR) method, QW demonstrated superior capacity to traverse deeper regions of the network, uncovering proteins and pathways implicated in Long COVID. Key findings include mitochondrial dysfunction, thromboinflammatory responses, and neuronal inflammation as central mechanisms. QW uniquely identified the CDGSH iron-sulfur domain-containing protein family and VDAC1, a mitochondrial calcium transporter, as critical regulators of these processes. VDAC1 emerged as a potential biomarker and therapeutic target, supported by FDA-approved compounds such as cannabidiol. These findings highlight QW as a powerful tool for elucidating complex biological systems and identifying novel therapeutic targets for conditions like Long COVID.

Submitted 29 January, 2025; v1 submitted 25 January, 2025; originally announced January 2025.

Comments: 25 pages, 6 figures and 3 tables

arXiv:2501.14790 [pdf, other]

Towards Dynamic Neural Communication and Speech Neuroprosthesis Based on Viseme Decoding

Authors: Ji-Ha Park, Seo-Hyun Lee, Soowon Kim, Seong-Whan Lee

Abstract: Decoding text, speech, or images from human neural signals holds promising potential both as neuroprosthesis for patients and as innovative communication tools for general users. Although neural signals contain various information on speech intentions, movements, and phonetic details, generating informative outputs from them remains challenging, with most work focusing on decoding short intentions or producing fragmented outputs. In this study, we developed a diffusion model-based framework to decode visual speech intentions from speech-related non-invasive brain signals, to facilitate face-to-face neural communication. We designed an experiment to consolidate various phonemes to train visemes of each phoneme, aiming to learn the representation of corresponding lip formations from neural signals. By decoding visemes from both isolated trials and continuous sentences, we successfully reconstructed coherent lip movements, effectively bridging the gap between brain signals and dynamic visual interfaces. The results highlight the potential of viseme decoding and talking face reconstruction from human neural signals, marking a significant step toward dynamic neural communication systems and speech neuroprosthesis for patients.

Submitted 8 January, 2025; originally announced January 2025.

Comments: 5 pages, 5 figures, 1 table, Name of Conference: 2025 IEEE International Conference on Acoustics, Speech, and Signal Processing

arXiv:2411.00871 [pdf, other]

LLaMo: Large Language Model-based Molecular Graph Assistant

Authors: Jinyoung Park, Minseong Bae, Dohwan Ko, Hyunwoo J. Kim

Abstract: Large Language Models (LLMs) have demonstrated remarkable generalization and instruction-following capabilities with instruction tuning. The advancements in LLMs and instruction tuning have led to the development of Large Vision-Language Models (LVLMs). However, the competency of LLMs and instruction tuning has been less explored in the molecular domain. Thus, we propose LLaMo: Large Language Model-based Molecular graph assistant, which is an end-to-end trained large molecular graph-language model. To bridge the discrepancy between the language and graph modalities, we present the multi-level graph projector that transforms graph representations into graph tokens by abstracting the output representations of each GNN layer and motif representations with the cross-attention mechanism. We also introduce machine-generated molecular graph instruction data to instruction-tune the large molecular graph-language model for general-purpose molecule and language understanding. Our extensive experiments demonstrate that LLaMo shows the best performance on diverse tasks, such as molecular description generation, property prediction, and IUPAC name prediction. The code of LLaMo is available at https://github.com/mlvlab/LLaMo.

Submitted 30 October, 2024; originally announced November 2024.

Comments: NeurIPS 2024

arXiv:2410.20255 [pdf, other]

Equivariant Blurring Diffusion for Hierarchical Molecular Conformer Generation

Authors: Jiwoong Park, Yang Shen

Abstract: How can diffusion models process 3D geometries in a coarse-to-fine manner, akin to our multiscale view of the world? In this paper, we address the question by focusing on a fundamental biochemical problem of generating 3D molecular conformers conditioned on molecular graphs in a multiscale manner. Our approach consists of two hierarchical stages: i) generation of coarse-grained fragment-level 3D structure from the molecular graph, and ii) generation of fine atomic details from the coarse-grained approximated structure while allowing the latter to be adjusted simultaneously. For the challenging second stage, which demands preserving coarse-grained information while ensuring SE(3) equivariance, we introduce a novel generative model termed Equivariant Blurring Diffusion (EBD), which defines a forward process that moves towards the fragment-level coarse-grained structure by blurring the fine atomic details of conformers, and a reverse process that performs the opposite operation using equivariant networks. We demonstrate the effectiveness of EBD by geometric and chemical comparison to state-of-the-art denoising diffusion models on a benchmark of drug-like molecules. Ablation studies draw insights on the design of EBD by thoroughly analyzing its architecture, which includes the design of the loss function and the data corruption process. Code is released at https://github.com/Shen-Lab/EBD.

Submitted 26 October, 2024; originally announced October 2024.

Comments: NeurIPS 2024

arXiv:2410.17270 [pdf, other]

MOFFlow: Flow Matching for Structure Prediction of Metal-Organic Frameworks

Authors: Nayoung Kim, Seongsu Kim, Minsu Kim, Jinkyoo Park, Sungsoo Ahn

Abstract: Metal-organic frameworks (MOFs) are a class of crystalline materials with promising applications in many areas such as carbon capture and drug delivery. In this work, we introduce MOFFlow, the first deep generative model tailored for MOF structure prediction. Existing approaches, including ab initio calculations and even deep generative models, struggle with the complexity of MOF structures due to the large number of atoms in the unit cells. To address this limitation, we propose a novel Riemannian flow matching framework that reduces the dimensionality of the problem by treating the metal nodes and organic linkers as rigid bodies, capitalizing on the inherent modularity of MOFs. By operating in the $SE(3)$ space, MOFFlow effectively captures the roto-translational dynamics of these rigid components in a scalable way. Our experiment demonstrates that MOFFlow accurately predicts MOF structures containing several hundred atoms, significantly outperforming conventional methods and state-of-the-art machine learning baselines while being much faster.

Submitted 19 March, 2025; v1 submitted 7 October, 2024; originally announced October 2024.

Comments: 10 pages, 6 figures

Journal ref: International Conference on Learning Representations (ICLR) 2025

arXiv:2410.04542 [pdf, other]

Generative Flows on Synthetic Pathway for Drug Design

Authors: Seonghwan Seo, Minsu Kim, Tony Shen, Martin Ester, Jinkyoo Park, Sungsoo Ahn, Woo Youn Kim

Abstract: Generative models in drug discovery have recently gained attention as efficient alternatives to brute-force virtual screening. However, most existing models do not account for synthesizability, limiting their practical use in real-world scenarios. In this paper, we propose RxnFlow, which sequentially assembles molecules using predefined molecular building blocks and chemical reaction templates to constrain the synthetic chemical pathway. We then train on this sequential generating process with the objective of generative flow networks (GFlowNets) to generate both highly rewarded and diverse molecules. To mitigate the large action space of synthetic pathways in GFlowNets, we implement a novel action space subsampling method. This enables RxnFlow to learn generative flows over extensive action spaces comprising combinations of 1.2 million building blocks and 71 reaction templates without significant computational overhead. Additionally, RxnFlow can employ modified or expanded action spaces for generation without retraining, allowing for the introduction of additional objectives or the incorporation of newly discovered building blocks. We experimentally demonstrate that RxnFlow outperforms existing reaction-based and fragment-based models in pocket-specific optimization across various target pockets. Furthermore, RxnFlow achieves state-of-the-art performance on CrossDocked2020 for pocket-conditional generation, with an average Vina score of -8.85 kcal/mol and 34.8% synthesizability.

Submitted 6 March, 2025; v1 submitted 6 October, 2024; originally announced October 2024.

Comments: Accepted to ICLR 2025, 32 pages, 17 figures, code: https://github.com/SeonghwanSeo/RxnFlow

arXiv:2410.04461 [pdf, ps, other]

Improved Off-policy Reinforcement Learning in Biological Sequence Design

Authors: Hyeonah Kim, Minsu Kim, Taeyoung Yun, Sanghyeok Choi, Emmanuel Bengio, Alex Hernández-García, Jinkyoo Park

Abstract: Designing biological sequences with desired properties is challenging due to vast search spaces and limited evaluation budgets. Although reinforcement learning methods use proxy models for rapid reward evaluation, insufficient training data can cause proxy misspecification on out-of-distribution inputs. To address this, we propose a novel off-policy search, $δ$-Conservative Search, that enhances robustness by restricting policy exploration to reliable regions. Starting from high-score offline sequences, we inject noise by randomly masking tokens with probability $δ$, then denoise them using our policy. We further adapt $δ$ based on proxy uncertainty on each data point, aligning the level of conservativeness with model confidence. Experimental results show that our conservative search consistently enhances the off-policy training, outperforming existing machine learning methods in discovering high-score sequences across diverse tasks, including DNA, RNA, protein, and peptide design.

Submitted 16 June, 2025; v1 submitted 6 October, 2024; originally announced October 2024.

Comments: ICML 2025

arXiv:2409.05484 [pdf, other]

CRADLE-VAE: Enhancing Single-Cell Gene Perturbation Modeling with Counterfactual Reasoning-based Artifact Disentanglement

Authors: Seungheun Baek, Soyon Park, Yan Ting Chok, Junhyun Lee, Jueon Park, Mogan Gim, Jaewoo Kang

Abstract: Predicting cellular responses to various perturbations is a critical focus in drug discovery and personalized therapeutics, with deep learning models playing a significant role in this endeavor. Single-cell datasets contain technical artifacts that may hinder the predictability of such models, which poses quality control issues highly regarded in this area. To address this, we propose CRADLE-VAE, a causal generative framework tailored for single-cell gene perturbation modeling, enhanced with counterfactual reasoning-based artifact disentanglement. Throughout training, CRADLE-VAE models the underlying latent distribution of technical artifacts and perturbation effects present in single-cell datasets. It employs counterfactual reasoning to effectively disentangle such artifacts by modulating the latent basal spaces and learns robust features for generating cellular response data with improved quality. Experimental results demonstrate that this approach improves not only treatment effect estimation performance but also generative quality. The CRADLE-VAE codebase is publicly available at https://github.com/dmis-lab/CRADLE-VAE.

Submitted 9 September, 2024; v1 submitted 9 September, 2024; originally announced September 2024.

arXiv:2408.12907 [pdf, other]

Bundling instability of lophotrichous bacteria

Authors: Jeungeun Park, Yongsam Kim, Wanho Lee, Veronika Pfeifer, Valeriia Muraveva, Carsten Beta, Sookkyung Lim

Abstract: We present a mathematical model of lophotrichous bacteria, motivated by Pseudomonas putida, which swim through fluid by rotating a cluster of multiple flagella extended from near one pole of the cell body. Although the flagella rotate individually, they are typically bundled together, enabling the bacterium to exhibit three primary modes of motility: push, pull, and wrapping. One key determinant of these modes is the coordination between motor torque and rotational direction of motors. The computational variations in this coordination reveal a wide spectrum of dynamical motion regimes, which are modulated by hydrodynamic interactions between flagellar filaments. These dynamic modes can be categorized into two groups based on the collective behavior of flagella, i.e., bundled and unbundled configurations. For some of these configurations, experimental examples from fluorescence microscopy recordings of swimming P. putida cells are also presented. Furthermore, we analyze the characteristics of stable bundles, such as push and pull, and investigate the dependence of swimming behaviors on the elastic properties of the flagella.

Submitted 23 August, 2024; originally announced August 2024.

MSC Class: 92-10; 92-08; 76-10; 76Z10

arXiv:2407.21028 [pdf, other]

Antibody DomainBed: Out-of-Distribution Generalization in Therapeutic Protein Design

Authors: Nataša Tagasovska, Ji Won Park, Matthieu Kirchmeyer, Nathan C. Frey, Andrew Martin Watkins, Aya Abdelsalam Ismail, Arian Rokkum Jamasb, Edith Lee, Tyler Bryson, Stephen Ra, Kyunghyun Cho

Abstract: Machine learning (ML) has demonstrated significant promise in accelerating drug design. Active ML-guided optimization of therapeutic molecules typically relies on a surrogate model predicting the target property of interest. The model predictions are used to determine which designs to evaluate in the lab, and the model is updated on the new measurements to inform the next cycle of decisions. A key challenge is that the experimental feedback from each cycle inspires changes in the candidate proposal or experimental protocol for the next cycle, which lead to distribution shifts. To promote robustness to these shifts, we must account for them explicitly in the model training. We apply domain generalization (DG) methods to classify the stability of interactions between an antibody and antigen across five domains defined by design cycles. Our results suggest that foundational models and ensembling improve predictive performance on out-of-distribution domains. We publicly release our codebase extending the DG benchmark "DomainBed," and the associated dataset of antibody sequences and structures emulating distribution shifts across design cycles.

Submitted 15 July, 2024; originally announced July 2024.

arXiv:2406.19113 [pdf, other]

MegIS: High-Performance, Energy-Efficient, and Low-Cost Metagenomic Analysis with In-Storage Processing

Authors: Nika Mansouri Ghiasi, Mohammad Sadrosadati, Harun Mustafa, Arvid Gollwitzer, Can Firtina, Julien Eudine, Haiyu Mao, Joël Lindegger, Meryem Banu Cavlak, Mohammed Alser, Jisung Park, Onur Mutlu

Abstract: Metagenomics has led to significant advances in many fields. Metagenomic analysis commonly involves the key tasks of determining the species present in a sample and their relative abundances. These tasks require searching large metagenomic databases. Metagenomic analysis suffers from significant data movement overhead due to moving large amounts of low-reuse data from the storage system. In-storage processing can be a fundamental solution for reducing this overhead. However, designing an in-storage processing system for metagenomics is challenging because existing approaches to metagenomic analysis cannot be directly implemented in storage effectively due to the hardware limitations of modern SSDs. We propose MegIS, the first in-storage processing system designed to significantly reduce the data movement overhead of the end-to-end metagenomic analysis pipeline. MegIS is enabled by our lightweight design that effectively leverages and orchestrates processing inside and outside the storage system. We address in-storage processing challenges for metagenomics via specialized and efficient 1) task partitioning, 2) data/computation flow coordination, 3) storage technology-aware algorithmic optimizations, 4) data mapping, and 5) lightweight in-storage accelerators. MegIS's design is flexible, capable of supporting different types of metagenomic input datasets, and can be integrated into various metagenomic analysis pipelines.
Our evaluation shows that MegIS outperforms the state-of-the-art performance- and accuracy-optimized software metagenomic tools by 2.7$\times$-37.2$\times$ and 6.9$\times$-100.2$\times$, respectively, while matching the accuracy of the accuracy-optimized tool. MegIS achieves 1.5$\times$-5.1$\times$ speedup compared to the state-of-the-art metagenomic hardware-accelerated (using processing-in-memory) tool, while achieving significantly higher accuracy.

Submitted 27 June, 2024; originally announced June 2024.

Comments: To appear in ISCA 2024. arXiv admin note: substantial text overlap with arXiv:2311.12527

arXiv:2403.20109 [pdf, ps, other]

Mol-AIR: Molecular Reinforcement Learning with Adaptive Intrinsic Rewards for Goal-directed Molecular Generation

Authors: Jinyeong Park, Jaegyoon Ahn, Jonghwan Choi, Jibum Kim

Abstract: Optimizing techniques for discovering molecular structures with desired properties is crucial in artificial intelligence (AI)-based drug discovery. Combining deep generative models with reinforcement learning has emerged as an effective strategy for generating molecules with specific properties. Despite its potential, this approach is ineffective in exploring the vast chemical space and optimizing particular chemical properties. To overcome these limitations, we present Mol-AIR, a reinforcement learning-based framework using adaptive intrinsic rewards for effective goal-directed molecular generation. Mol-AIR leverages the strengths of both history-based and learning-based intrinsic rewards by exploiting random distillation network and counting-based strategies. In benchmark tests, Mol-AIR demonstrates superior performance over existing approaches in generating molecules with desired properties without any prior knowledge, including penalized LogP, QED, and celecoxib similarity. We believe that Mol-AIR represents a significant advancement in drug discovery, offering a more efficient path to discovering novel therapeutics.

Submitted 29 March, 2024; originally announced March 2024.

arXiv:2402.05982 [pdf, other]

Decoupled Sequence and Structure Generation for Realistic Antibody Design

Authors: Nayoung Kim, Minsu Kim, Sungsoo Ahn, Jinkyoo Park

Abstract: Recently, deep learning has made rapid progress in antibody design, which plays a key role in the advancement of therapeutics. A dominant paradigm is to train a model to jointly generate the antibody sequence and the structure as a candidate. However, the joint generation requires the model to generate both the discrete amino acid categories and the continuous 3D coordinates; this limits the space of possible architectures and may lead to suboptimal performance. In response, we propose an antibody sequence-structure decoupling (ASSD) framework, which separates sequence generation and structure prediction. Although our approach is simple, our idea allows the use of powerful neural architectures and demonstrates notable performance improvements. We also find that the widely used non-autoregressive generators promote sequences with overly repeating tokens. Such sequences are both out-of-distribution and prone to undesirable developability properties that can trigger harmful immune responses in patients. To resolve this, we introduce a composition-based objective that allows an efficient trade-off between high performance and low token repetition. ASSD shows improved performance in various antibody design experiments, while the composition-based objective successfully mitigates token repetition of non-autoregressive models.

Submitted 16 January, 2025; v1 submitted 8 February, 2024; originally announced February 2024.

Comments: 22 pages, 6 figures

Journal ref: Transactions on Machine Learning Research, 2025

arXiv:2402.05961 [pdf, other]

Genetic-guided GFlowNets for Sample Efficient Molecular Optimization

Authors: Hyeonah Kim, Minsu Kim, Sanghyeok Choi, Jinkyoo Park

Abstract: The challenge of discovering new molecules with desired properties is crucial in domains like drug discovery and material design. Recent advances in deep learning-based generative methods have shown promise but face the issue of sample efficiency due to the computational expense of evaluating the reward function. This paper proposes a novel algorithm for sample-efficient molecular optimization by distilling a powerful genetic algorithm into a deep generative policy using GFlowNets training, the off-policy method for amortized inference. This approach enables the deep generative policy to learn from domain knowledge, which has been explicitly integrated into the genetic algorithm. Our method achieves state-of-the-art performance in the official molecular optimization benchmark, significantly outperforming previous methods. It also demonstrates effectiveness in designing inhibitors against SARS-CoV-2 with substantially fewer reward calls.

Submitted 29 December, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

Comments: NeurIPS 2024

arXiv:2402.05953 [pdf, other]

doi 10.1109/MCG.2023.3345742

idMotif: An Interactive Motif Identification in Protein Sequences

Authors: Ji Hwan Park, Vikash Prasad, Sydney Newsom, Fares Najar, Rakhi Rajan

Abstract: This article introduces idMotif, a visual analytics framework designed to aid domain experts in the identification of motifs within protein sequences. Motifs, short sequences of amino acids, are critical for understanding the distinct functions of proteins. Identifying these motifs is pivotal for predicting diseases or infections. idMotif employs a deep learning-based method for the categorization of protein sequences, enabling the discovery of potential motif candidates within protein groups through local explanations of deep learning model decisions. It offers multiple interactive views for the analysis of protein clusters or groups and their sequences. A case study, complemented by expert feedback, illustrates idMotif's utility in facilitating the analysis and identification of protein sequences and motifs.

Submitted 4 February, 2024; originally announced February 2024.

Comments: IEEE CGA

Journal ref: "idMotif: An Interactive Motif Identification in Protein Sequences," in IEEE Computer Graphics and Applications, 2023

arXiv:2311.12527 [pdf, other]

MetaStore: High-Performance Metagenomic Analysis via In-Storage Computing

Authors: Nika Mansouri Ghiasi, Mohammad Sadrosadati, Harun Mustafa, Arvid Gollwitzer, Can Firtina, Julien Eudine, Haiyu Mao, Joël Lindegger, Meryem Banu Cavlak, Mohammed Alser, Jisung Park, Onur Mutlu

Abstract: Metagenomics has led to significant advancements in many fields. Metagenomic analysis commonly involves the key tasks of determining the species present in a sample and their relative abundances. These tasks require searching large metagenomic databases containing information on different species' genomes. Metagenomic analysis suffers from significant data movement overhead due to moving large amounts of low-reuse data from the storage system to the rest of the system. In-storage processing can be a fundamental solution for reducing data movement overhead. However, designing an in-storage processing system for metagenomics is challenging because none of the existing approaches can be directly implemented in storage effectively due to the hardware limitations of modern SSDs. We propose MetaStore, the first in-storage processing system designed to significantly reduce the data movement overhead of end-to-end metagenomic analysis. MetaStore is enabled by our lightweight and cooperative design that effectively leverages and orchestrates processing inside and outside the storage system. Through our detailed analysis of the end-to-end metagenomic analysis pipeline and careful hardware/software co-design, we address in-storage processing challenges for metagenomics via specialized and efficient 1) task partitioning, 2) data/computation flow coordination, 3) storage technology-aware algorithmic optimizations, 4) light-weight in-storage accelerators, and 5) data mapping.
Our evaluation shows that MetaStore outperforms the state-of-the-art performance- and accuracy-optimized software metagenomic tools by 2.7-37.2$\times$ and 6.9-100.2$\times$, respectively, while matching the accuracy of the accuracy-optimized tool. MetaStore achieves 1.5-5.1$\times$ speedup compared to the state-of-the-art metagenomic hardware-accelerated tool, while achieving significantly higher accuracy.

Submitted 21 November, 2023; originally announced November 2023.

arXiv:2311.04468 [pdf]

A human brain atlas of chi-separation for normative iron and myelin distributions

Authors: Kyeongseon Min, Beomseok Sohn, Woo Jung Kim, Chae Jung Park, Soohwa Song, Dong Hoon Shin, Kyung Won Chang, Na-Young Shin, Minjun Kim, Hyeong-Geol Shin, Phil Hyu Lee, Jongho Lee

Abstract: Iron and myelin are primary susceptibility sources in the human brain. These substances are essential for a healthy brain, and their abnormalities are often related to various neurological disorders. Recently, an advanced susceptibility mapping technique, which is referred to as chi-separation, has been proposed, successfully disentangling paramagnetic iron from diamagnetic myelin. This method opened up the potential for generating high-resolution iron and myelin maps in the brain. Utilizing this technique, this study constructs a normative chi-separation atlas from 106 healthy human brains. The resulting atlas provides detailed anatomical structures associated with the distributions of iron and myelin, clearly delineating subcortical nuclei, thalamic nuclei, and white matter fiber bundles. Additionally, susceptibility values in a number of regions of interest are reported along with age-dependent changes. This atlas may have direct applications such as localization of subcortical structures for deep brain stimulation or high-intensity focused ultrasound and also serve as a valuable resource for future research.

Submitted 2 April, 2024; v1 submitted 8 November, 2023; originally announced November 2023.

Comments: 19 pages, 9 figures

arXiv:2309.11438 [pdf, other]

doi 10.1073/pnas.2320242121

Brain-inspired computing with fluidic iontronic nanochannels

Authors: T. M. Kamsma, J. Kim, K. Kim, W. Q. Boon, C. Spitoni, J. Park, R. van Roij

Abstract: The brain's remarkable and efficient information processing capability is driving research into brain-inspired (neuromorphic) computing paradigms. Artificial aqueous ion channels are emerging as an exciting platform for neuromorphic computing, representing a departure from conventional solid-state devices by directly mimicking the brain's fluidic ion transport. Supported by a quantitative theoretical model, we present easy-to-fabricate tapered microchannels that embed a conducting network of fluidic nanochannels between a colloidal structure. Due to transient salt concentration polarisation, our devices are volatile memristors (memory resistors) that are remarkably stable. The voltage-driven net salt flux and accumulation that underpin the concentration polarisation surprisingly combine into a diffusionlike quadratic dependence of the memory retention time on the channel length, allowing channel design for a specific timescale. We implement our device as a synaptic element for neuromorphic reservoir computing. Individual channels distinguish various time series that together represent (handwritten) numbers, for subsequent in-silico classification with a simple readout function. Our results represent a significant step towards realising the promise of fluidic ion channels as a platform to emulate the rich aqueous dynamics of the brain.

Submitted 25 April, 2024; v1 submitted 20 September, 2023; originally announced September 2023.

Journal ref: Proceedings of the National Academy of Sciences (2024), Vol 121, Issue 18

arXiv:2309.05768 [pdf]

doi 10.1162/imag_a_00103

The Past, Present, and Future of the Brain Imaging Data Structure (BIDS)

Authors: Russell A. Poldrack, Christopher J. Markiewicz, Stefan Appelhoff, Yoni K. Ashar, Tibor Auer, Sylvain Baillet, Shashank Bansal, Leandro Beltrachini, Christian G. Benar, Giacomo Bertazzoli, Suyash Bhogawar, Ross W. Blair, Marta Bortoletto, Mathieu Boudreau, Teon L. Brooks, Vince D. Calhoun, Filippo Maria Castelli, Patricia Clement, Alexander L Cohen, Julien Cohen-Adad, Sasha D'Ambrosio, Gilles de Hollander, María de la iglesia-Vayá, Alejandro de la Vega, Arnaud Delorme, et al. (89 additional authors not shown)

Abstract: The Brain Imaging Data Structure (BIDS) is a community-driven standard for the organization of data and metadata from a growing range of neuroscience modalities. This paper is meant as a history of how the standard has developed and grown over time. We outline the principles behind the project, the mechanisms by which it has been extended, and some of the challenges being addressed as it evolves. We also discuss the lessons learned through the project, with the aim of enabling researchers in other domains to learn from the success of BIDS.

Submitted 8 January, 2024; v1 submitted 11 September, 2023; originally announced September 2023.

arXiv:2309.04423 [pdf, other]

doi 10.1109/VIS54172.2023.00030

Vis-SPLIT: Interactive Hierarchical Modeling for mRNA Expression Classification

Authors: Braden Roper, James C. Mathews, Saad Nadeem, Ji Hwan Park

Abstract: We propose an interactive visual analytics tool, Vis-SPLIT, for partitioning a population of individuals into groups with similar gene signatures. Vis-SPLIT allows users to interactively explore a dataset and exploit visual separations to build a classification model for specific cancers. The visualization components reveal gene expression and correlation to assist specific partitioning decisions, while also providing overviews for the decision model and clustered genetic signatures. We demonstrate the effectiveness of our framework through a case study and evaluate its usability with domain experts. Our results show that Vis-SPLIT can classify patients based on their genetic signatures to effectively gain insights into RNA sequencing data, as compared to an existing classification system.

Submitted 8 September, 2023; originally announced September 2023.

Comments: To be published in IEEE Visualization and Visual Analytics (VIS), 2023

arXiv:2309.01670 [pdf, other]

Blind Biological Sequence Denoising with Self-Supervised Set Learning

Authors: Nathan Ng, Ji Won Park, Jae Hyeon Lee, Ryan Lewis Kelly, Stephen Ra, Kyunghyun Cho

Abstract: Biological sequence analysis relies on the ability to denoise the imprecise output of sequencing platforms. We consider a common setting where a short sequence is read out repeatedly using a high-throughput long-read platform to generate multiple subreads, or noisy observations of the same sequence. Denoising these subreads with alignment-based approaches often fails when too few subreads are available or error rates are too high. In this paper, we propose a novel method for blindly denoising sets of sequences without directly observing clean source sequence labels. Our method, Self-Supervised Set Learning (SSSL), gathers subreads together in an embedding space and estimates a single set embedding as the midpoint of the subreads in both the latent and sequence spaces. This set embedding represents the "average" of the subreads and can be decoded into a prediction of the clean sequence. In experiments on simulated long-read DNA data, SSSL methods denoise small reads of $\leq 6$ subreads with 17% fewer errors and large reads of $>6$ subreads with 8% fewer errors compared to the best baseline. On a real dataset of antibody sequences, SSSL improves over baselines on two self-supervised metrics, with a significant improvement on difficult small reads that comprise over 60% of the test set. By accurately denoising these reads, SSSL promises to better realize the potential of high-throughput DNA sequencing data for downstream scientific applications.

Submitted 4 September, 2023; originally announced September 2023.

arXiv:2306.16085 [pdf, other]

Mass Spectra Prediction with Structural Motif-based Graph Neural Networks

Authors: Jiwon Park, Jeonghee Jo, Sungroh Yoon

Abstract: Mass spectra, which are agglomerations of ionized fragments from targeted molecules, play a crucial role across various fields for the identification of molecular structures. A prevalent analysis method involves spectral library searches, where unknown spectra are cross-referenced with a database. The effectiveness of such search-based approaches, however, is restricted by the scope of the existing mass spectra database, underscoring the need to expand the database via mass spectra prediction. In this research, we propose the Motif-based Mass Spectrum Prediction Network (MoMS-Net), a system that predicts mass spectra using the information derived from structural motifs and the implementation of Graph Neural Networks (GNNs). We have tested our model across diverse mass spectra and have observed its superiority over other existing models. MoMS-Net considers substructure at the graph level, which facilitates the incorporation of long-range dependencies while using less memory compared to the graph transformer model.

Submitted 28 June, 2023; originally announced June 2023.

Comments: 19 pages, 3 figures

arXiv:2306.03111 [pdf, other]

Bootstrapped Training of Score-Conditioned Generator for Offline Design of Biological Sequences

Authors: Minsu Kim, Federico Berto, Sungsoo Ahn, Jinkyoo Park

Abstract: We study the problem of optimizing biological sequences, e.g., proteins, DNA, and RNA, to maximize a black-box score function that is only evaluated in an offline dataset. We propose a novel solution, the bootstrapped training of score-conditioned generator (BootGen) algorithm. Our algorithm repeats a two-stage process. In the first stage, our algorithm trains the biological sequence generator with rank-based weights to enhance the accuracy of sequence generation based on high scores. The subsequent stage involves bootstrapping, which augments the training dataset with self-generated data labeled by a proxy score function. Our key idea is to align the score-based generation with a proxy score function, which distills the knowledge of the proxy score function to the generator. After training, we aggregate samples from multiple bootstrapped generators and proxies to produce a diverse design. Extensive experiments show that our method outperforms competitive baselines on biological sequential design tasks. We provide reproducible source code: https://github.com/kaist-silab/bootgen.

Submitted 22 March, 2024; v1 submitted 5 June, 2023; originally announced June 2023.

Comments: NeurIPS 2023, 19 pages, 5 figures

arXiv:2305.12341 [pdf, other]

Enhancing biodiversity through intraspecific suppression in large ecosystems

Authors: Seong-Gyu Yang, Hye Jin Park

Abstract: The competitive exclusion principle (CEP) is a fundamental concept in niche theory, which posits that the number of available resources constrains the coexistence of species. While the CEP offers an intuitive explanation of coexistence, it has been challenged by counterexamples observed in nature. One prominent counterexample is the phytoplankton community, known as the paradox of the plankton. Diverse phytoplankton species coexist in the ocean even though they demand a limited number of resources. To shed light on this remarkable biodiversity in large ecosystems quantitatively, we incorporate intraspecific suppression into the generalized MacArthur's consumer-resource model and study the relative diversity, the number ratio between coexisting consumers and resource kinds. By employing the cavity method and generating functional analysis, we demonstrate that, under intraspecific suppression, the number of consumer species can surpass the available resources. This phenomenon stems from the fact that intraspecific suppression prevents the emergence of dominant species, thereby fostering high biodiversity. Furthermore, our study highlights that the impact of this competition on biodiversity is contingent upon environmental conditions. Our work presents a comprehensive framework that encompasses the CEP and its counterexamples by introducing intraspecific suppression.

Submitted 1 April, 2024; v1 submitted 21 May, 2023; originally announced May 2023.

Comments: 40 pages (including Appendix), 25 figures (5 figures in main, 20 figures in Appendix)

arXiv:2304.10065 [pdf]

Machine learning traction force maps of cell monolayers

Authors: Changhao Li, Luyi Feng, Yang Jeong Park, Jian Yang, Ju Li, Sulin Zhang

Abstract: Cellular force transmission across a hierarchy of molecular switchers is central to mechanobiological responses. However, current cellular force microscopies suffer from low throughput and resolution. Here we introduce and train a generative adversarial network (GAN) to paint out traction force maps of cell monolayers with high fidelity to the experimental traction force microscopy (TFM). The GAN analyzes traction force maps as an image-to-image translation problem, where its generative and discriminative neural networks are simultaneously cross-trained by hybrid experimental and numerical datasets. In addition to capturing the colony-size and substrate-stiffness dependent traction force maps, the trained GAN predicts asymmetric traction force patterns for multicellular monolayers seeding on substrates with stiffness gradient, implicating collective durotaxis. Further, the neural network can extract the experimentally inaccessible, hidden relationship between substrate stiffness and cell contractility, which underlies cellular mechanotransduction. Trained solely on datasets for epithelial cells, the GAN can be extrapolated to other contractile cell types using only a single scaling factor. The digital TFM serves as a high-throughput tool for mapping out cellular forces of cell monolayers and paves the way toward data-driven discoveries in cell mechanobiology.

Submitted 19 April, 2023; originally announced April 2023.

arXiv:2301.00556 [pdf, ps, other]

doi 10.1016/j.chaos.2022.113004

Competition of alliances in a cyclically dominant eight-species population

Authors: Junpyo Park, Xiaojie Chen, Attila Szolnoki

Abstract: In a diverse population, where many species are present, competitors can fight for survival at individual and collective levels. In particular, species that would beat each other individually may form a specific alliance that ensures them stable coexistence against the invasion of an external species. Our principal goal is to identify those general features of a formation which determine its vitality. Therefore, we here study a traditional eight-species Lotka-Volterra model where two four-species cycles can fight for space. Besides these formations, there are other solutions which may emerge when invasion rates are varied. The complete range of parameters is explored and we find that in most cases the alliances that prevail are those formed by equally strong members. Interestingly, there are regions where the symmetry is broken and the system is dominated by a solution formed by seven species. Our work also highlights that serious finite-size effects may emerge which prevent observing the valid solution in a small system.

Submitted 2 January, 2023; originally announced January 2023.

Comments: 10 double-column pages, 11 figures

Journal ref: Chaos, Solitons and Fractals 166 (2023) 113004

arXiv:2212.05187 [pdf, other]

doi 10.1063/5.0142978

Invasion and Interaction Determine Population Composition in an Open Evolving System

Authors: Youngjai Park, Takashi Shimada, Seung-Woo Son, Hye Jin Park

Abstract: It is well known that interactions between species determine the population composition in an ecosystem. Conventional studies have focused on fixed population structures to reveal how interactions shape population compositions. However, interaction structures are not fixed, but change over time due to invasions. Thus, invasion and interaction play an important role in shaping communities. Despite its importance, however, the interplay between invasion and interaction has not been well explored. Here, we investigate how invasion affects the population composition with interactions in open evolving systems considering generalized Lotka-Volterra-type dynamics. Our results show that the system has two distinct regimes. One is characterized by low diversity with abrupt changes of dominant species in time, appearing when the interaction between species is strong and invasion occurs slowly. On the other hand, frequent invasions can induce higher diversity with slow changes in abundances despite strong interactions. This is because invasion happens before the system reaches its equilibrium, which drags the system from its equilibrium all the time. All species have similar abundances in this regime, which implies that fast invasion induces regime shift. Therefore, whether invasion or interaction dominates determines the population composition.

Submitted 9 December, 2022; originally announced December 2022.

Comments: 15 pages (including supplementary material), 8 figures (4 figures in main, 4 figures in SI)

arXiv:2210.04096 [pdf, other]

PropertyDAG: Multi-objective Bayesian optimization of partially ordered, mixed-variable properties for biological sequence design

Authors: Ji Won Park, Samuel Stanton, Saeed Saremi, Andrew Watkins, Henri Dwyer, Vladimir Gligorijevic, Richard Bonneau, Stephen Ra, Kyunghyun Cho

Abstract: Bayesian optimization offers a sample-efficient framework for navigating the exploration-exploitation trade-off in the vast design space of biological sequences. Whereas it is possible to optimize the various properties of interest jointly using a multi-objective acquisition function, such as the expected hypervolume improvement (EHVI), this approach does not account for objectives with a hierarchical dependency structure. We consider a common use case where some regions of the Pareto frontier are prioritized over others according to a specified $\textit{partial ordering}$ in the objectives. For instance, when designing antibodies, we would like to maximize the binding affinity to a target antigen only if it can be expressed in live cell culture -- modeling the experimental dependency in which affinity can only be measured for antibodies that can be expressed and thus produced in viable quantities. In general, we may want to confer a partial ordering to the properties such that each property is optimized conditioned on its parent properties satisfying some feasibility condition. To this end, we present PropertyDAG, a framework that operates on top of the traditional multi-objective BO to impose this desired ordering on the objectives, e.g. expression $\rightarrow$ affinity. We demonstrate its performance over multiple simulated active learning iterations on a penicillin production task, toy numerical problem, and a real-world antibody design task.

Submitted 8 October, 2022; originally announced October 2022.

Comments: 9 pages, 7 figures. Submitted to NeurIPS 2022 AI4Science Workshop

arXiv:2208.14959 [pdf]

Inference of Mixed Graphical Models for Dichotomous Phenotypes using Markov Random Field Model

Authors: Jaehyun Park, Sungho Won

Abstract: In this article, we propose a new method named fused mixed graphical model (FMGM), which can infer network structures for dichotomous phenotypes. We assumed that the interplay of different omics markers is associated with disease status and proposed an FMGM-based method to detect the associated omics marker network difference. The statistical models of the networks were based on a pairwise Markov random field model, and penalty functions were added to minimize the effect of sparseness in the networks. The fast proximal gradient method (PGM) was used to optimize the target function. Method validity was measured using synthetic datasets that simulate power-law network structures, and it was found that FMGM showed superior performance, especially in terms of F1 scores, compared with the previous method inferring the networks sequentially (0.392 and 0.546). FMGM performed better not only in identifying the differences (0.217 and 0.410) but also in identifying the networks (0.492 and 0.572). The proposed method was applied to multi-omics profiles of 6-month-old infants with and without atopic dermatitis (AD), and different correlations were found between the abundance of microbial genes related to carotenoid biosynthesis and RNA degradation according to disease status, suggesting the importance of metabolism related to oxidative stress and microbial RNA balance.

Submitted 31 August, 2022; originally announced August 2022.

Comments: 31 pages (excluding figures and tables), 4 figures, 3 tables, submitted to Biometrics

MSC Class: 92B15 (Primary) 62P10 62H10 62-08 (Secondary)

arXiv:2208.10661 [pdf, other]

Therapeutic algebra of immunomodulatory drug responses at single-cell resolution

Authors: Jialong Jiang, Sisi Chen, Tiffany Tsou, Christopher S. McGinnis, Tahmineh Khazaei, Qin Zhu, Jong H. Park, Paul Rivaud, Inna-Marie Strazhnik, Eric D. Chow, David A. Sivak, Zev J. Gartner, Matt Thomson

Abstract: Therapeutic modulation of immune states is central to the treatment of human disease. However, how drugs and drug combinations impact the diverse cell types in the human immune system remains poorly understood at the transcriptome scale. Here, we apply single-cell mRNA-seq to profile the response of human immune cells to 502 immunomodulatory drugs alone and in combination. We develop a unified mathematical model that quantitatively describes the transcriptome-scale response of myeloid and lymphoid cell types to individual drugs and drug combinations through a single inferred regulatory network. The mathematical model reveals how drug combinations generate novel macrophage and T-cell states by recruiting combinations of gene expression programs through both additive and non-additive drug interactions. A simplified drug response algebra allows us to predict the continuous modulation of immune cell populations between activated, resting and hyper-inhibited states through combinatorial drug dose titrations. Our results suggest that transcriptome-scale mathematical models could enable the design of therapeutic strategies for programming the human immune system using combinations of therapeutics.

Submitted 22 August, 2022; originally announced August 2022.

Comments: 19 pages, 5 figures

arXiv:2205.04259 [pdf, other]

Multi-segment preserving sampling for deep manifold sampler

Authors: Daniel Berenberg, Jae Hyeon Lee, Simon Kelow, Ji Won Park, Andrew Watkins, Vladimir Gligorijević, Richard Bonneau, Stephen Ra, Kyunghyun Cho

Abstract: Deep generative modeling for biological sequences presents a unique challenge in reconciling the bias-variance trade-off between explicit biological insight and model flexibility. The deep manifold sampler was recently proposed as a means to iteratively sample variable-length protein sequences by exploiting the gradients from a function predictor. We introduce an alternative approach to this guided sampling procedure, multi-segment preserving sampling, that enables the direct inclusion of domain-specific knowledge by designating preserved and non-preserved segments along the input sequence, thereby restricting variation to only select regions. We present its effectiveness in the context of antibody design by training two models: a deep manifold sampler and a GPT-2 language model on nearly six million heavy chain sequences annotated with the IGHV1-18 gene. During sampling, we restrict variation to only the complementarity-determining region 3 (CDR3) of the input. We obtain log probability scores from a GPT-2 model for each sampled CDR3 and demonstrate that multi-segment preserving sampling generates reasonable designs while maintaining the desired, preserved regions.

Submitted 9 May, 2022; originally announced May 2022.

arXiv:2204.03742 [pdf, other]

doi 10.1016/j.media.2022.102699

Mitosis domain generalization in histopathology images -- The MIDOG challenge

Authors: Marc Aubreville, Nikolas Stathonikos, Christof A. Bertram, Robert Klopleisch, Natalie ter Hoeve, Francesco Ciompi, Frauke Wilm, Christian Marzahl, Taryn A. Donovan, Andreas Maier, Jack Breen, Nishant Ravikumar, Youjin Chung, Jinah Park, Ramin Nateghi, Fattaneh Pourakpour, Rutger H. J. Fick, Saima Ben Hadj, Mostafa Jahanifar, Nasir Rajpoot, Jakob Dexl, Thomas Wittenberg, Satoshi Kondo, Maxime W. Lafarge, Viktor H. Koelzer , et al. (10 additional authors not shown)

Abstract: The density of mitotic figures within tumor tissue is known to be highly correlated with tumor proliferation and thus is an important marker in tumor grading. Recognition of mitotic figures by pathologists is known to be subject to a strong inter-rater bias, which limits the prognostic value. State-of-the-art deep learning methods can support the expert in this assessment but are known to strongly deteriorate when applied in a different clinical environment than was used for training. One decisive component in the underlying domain shift has been identified as the variability caused by using different whole slide scanners. The goal of the MICCAI MIDOG 2021 challenge has been to propose and evaluate methods that counter this domain shift and derive scanner-agnostic mitosis detection algorithms. The challenge used a training set of 200 cases, split across four scanning systems. As a test set, an additional 100 cases split across four scanning systems, including two previously unseen scanners, were given. The best approaches performed on an expert level, with the winning algorithm yielding an F_1 score of 0.748 (CI95: 0.704-0.781). In this paper, we evaluate and compare the approaches that were submitted to the challenge and identify methodological factors contributing to better performance.

Submitted 6 April, 2022; originally announced April 2022.

Comments: 19 pages, 9 figures, summary paper of the 2021 MICCAI MIDOG challenge

Journal ref: Medical Image Analysis 84 (2023) 102699

arXiv:2202.10400 [pdf, other]

GenStore: A High-Performance and Energy-Efficient In-Storage Computing System for Genome Sequence Analysis

Authors: Nika Mansouri Ghiasi, Jisung Park, Harun Mustafa, Jeremie Kim, Ataberk Olgun, Arvid Gollwitzer, Damla Senol Cali, Can Firtina, Haiyu Mao, Nour Almadhoun Alserr, Rachata Ausavarungnirun, Nandita Vijaykumar, Mohammed Alser, Onur Mutlu

Abstract: Read mapping is a fundamental, yet computationally expensive step in many genomics applications. It is used to identify potential matches and differences between fragments (called reads) of a sequenced genome and an already known genome (called a reference genome). To address the computational challenges in genome analysis, many prior works propose various approaches such as filters that select the reads that must undergo expensive computation, efficient heuristics, and hardware acceleration. While effective at reducing the computation overhead, all such approaches still require the costly movement of a large amount of data from storage to the rest of the system, which can significantly lower the end-to-end performance of read mapping in conventional and emerging genomics systems. We propose GenStore, the first in-storage processing system designed for genome sequence analysis that greatly reduces both data movement and computational overheads of genome sequence analysis by exploiting low-cost and accurate in-storage filters. GenStore leverages hardware/software co-design to address the challenges of in-storage processing, supporting reads with 1) different read lengths and error rates, and 2) different degrees of genetic variation. Through rigorous analysis of read mapping processes, we meticulously design low-cost hardware accelerators and data/computation flows inside a NAND flash-based SSD. Our evaluation using a wide range of real genomic datasets shows that GenStore, when implemented in three modern SSDs, significantly improves the read mapping performance of state-of-the-art software (hardware) baselines by 2.07-6.05$\times$ (1.52-3.32$\times$) for read sets with high similarity to the reference genome and 1.45-33.63$\times$ (2.70-19.2$\times$) for read sets with low similarity to the reference genome.

Submitted 6 April, 2023; v1 submitted 21 February, 2022; originally announced February 2022.

Comments: Published at ASPLOS 2022

arXiv:2112.08687 [pdf, other]

doi 10.1093/nargab/lqad004

BLEND: A Fast, Memory-Efficient, and Accurate Mechanism to Find Fuzzy Seed Matches in Genome Analysis

Authors: Can Firtina, Jisung Park, Mohammed Alser, Jeremie S. Kim, Damla Senol Cali, Taha Shahroodi, Nika Mansouri Ghiasi, Gagandeep Singh, Konstantinos Kanellopoulos, Can Alkan, Onur Mutlu

Abstract: Generating the hash values of short subsequences, called seeds, enables quickly identifying similarities between genomic sequences by matching seeds with a single lookup of their hash values. However, these hash values can be used only for finding exact-matching seeds as the conventional hashing methods assign distinct hash values for different seeds, including highly similar seeds. Finding only exact-matching seeds leads to either 1) increased use of costly sequence alignment or 2) limited sensitivity. We introduce BLEND, the first efficient and accurate mechanism that can identify both exact-matching and highly similar seeds, called fuzzy seed matches, with a single lookup of their hash values. BLEND 1) utilizes a technique called SimHash, which can generate the same hash value for similar sets, and 2) provides the proper mechanisms for using seeds as sets with the SimHash technique to find fuzzy seed matches efficiently. We show the benefits of BLEND when used in read overlapping and read mapping. For read overlapping, BLEND is faster by 2.4x - 83.9x (on average 19.3x), has a lower memory footprint by 0.9x - 14.1x (on average 3.8x), and finds higher quality overlaps leading to more accurate de novo assemblies than the state-of-the-art tool, minimap2. For read mapping, BLEND is faster by 0.8x - 4.1x (on average 1.7x) than minimap2. Source code is available at https://github.com/CMU-SAFARI/BLEND.

Submitted 23 May, 2023; v1 submitted 16 December, 2021; originally announced December 2021.

Comments: Published in NARGAB

Journal ref: NAR Genomics and Bioinformatics, vol. 5, no. 1, p. lqad004, Mar. 2023

arXiv:2112.05782 [pdf, ps, other]

Dynamical clustering of U.S. states reveals four distinct infection patterns that predict SARS-CoV-2 pandemic behavior

Authors: Joseph L. Natale, Varun Viswanath, Oscar Trujillo Acevedo, Sophia Pérez Giottonini, Sandy Ihuiyan Romero Hernández, Diana G. Cruz Millán, A. Montserrat Palacios-Puga, Ammar Mandvi, Brian M. Khan, Martin Lilik, Jay Park, Benjamin L. Smarr

Abstract: The SARS-CoV-2 pandemic has so far unfolded diversely across the fifty United States of America, reflected both in different time progressions of infection "waves" and in magnitudes of local infection rates. Despite a marked diversity of presentations, most U.S. states experienced their single greatest surge in daily new cases during the transition from Fall 2020 to Winter 2021. Popular media also cite additional similarities between states -- often despite disparities in governmental policies, reported mask-wearing compliance rates, and vaccination percentages. Here, we identify a set of robust, low-dimensional clusters that 1) summarize the timings and relative heights of four historical COVID-19 "wave opportunities" accessible to all 50 U.S. states, 2) correlate with geographical and intervention patterns associated with those groups of states they encompass, and 3) predict aspects of the "fifth wave" of new infections in the late Summer of 2021. In particular, we argue that clustering elucidates a negative relationship between vaccination rates and subsequent case-load variabilities within state groups. We advance the hypothesis that vaccination acts as a "seat belt," in effect constraining the likely range of new-case upticks, even in the context of the Summer 2021, variant-driven surge.

Submitted 10 December, 2021; originally announced December 2021.

Comments: 22 pages, 4 figures; submitted to PLOS ONE

arXiv:2106.13202 [pdf, other]

SALT: Sea lice Adaptive Lattice Tracking -- An Unsupervised Approach to Generate an Improved Ocean Model

Authors: Ju An Park, Vikram Voleti, Kathryn E. Thomas, Alexander Wong, Jason L. Deglint

Abstract: Warming oceans due to climate change are leading to increased numbers of ectoparasitic copepods, also known as sea lice, which can cause significant ecological loss to wild salmon populations and major economic loss to aquaculture sites. The main transport mechanism driving the spread of sea lice populations is near-surface ocean currents. Present strategies to estimate the distribution of sea lice larvae are computationally complex and limit full-scale analysis. Motivated to address this challenge, we propose SALT: Sea lice Adaptive Lattice Tracking approach for efficient estimation of sea lice dispersion and distribution in space and time. Specifically, an adaptive spatial mesh is generated by merging nodes in the lattice graph of the Ocean Model based on local ocean properties, thus enabling highly efficient graph representation. SALT demonstrates improved efficiency while maintaining consistent results with the standard method, using near-surface current data for Hardangerfjord, Norway. The proposed SALT technique shows promise for enhancing proactive aquaculture management through predictive modelling of sea lice infestation pressure maps in a changing climate.

Submitted 24 June, 2021; originally announced June 2021.

Comments: 5 pages, 3 figures, 3 tables

arXiv:2106.10627 [pdf, other]

Experimentally testable whole brain manifolds that recapitulate behavior

Authors: Gerald M Pao, Cameron Smith, Joseph Park, Keichi Takahashi, Wassapon Watanakeesuntorn, Hiroaki Natsukawa, Sreekanth H Chalasani, Tom Lorimer, Ryousei Takano, Nuttida Rungratsameetaweemana, George Sugihara

Abstract: We propose an algorithm grounded in dynamical systems theory that generalizes manifold learning from a global state representation to a network of local interacting manifolds termed a Generative Manifold Network (GMN). Manifolds are discovered using the convergent cross mapping (CCM) causal inference algorithm and are then compressed into a reduced redundancy network. The representation is a network of manifolds embedded from observational data where each orthogonal axis of a local manifold is an embedding of an individually identifiable neuron or brain area that has exact correspondence in the real world. As such these can be experimentally manipulated to test hypotheses derived from theory and data analysis. Here we demonstrate that this representation preserves the essential features of the brain of flies, larval zebrafish and humans. In addition to accurate near-term prediction, the GMN model can be used to synthesize realistic time series of whole brain neuronal activity and locomotion viewed over the long term. Thus, as a final validation of how well GMN captures essential dynamic information, we show that the artificially generated time series can be used as a training set to predict out-of-sample observed fly locomotion, as well as brain activity in out-of-sample withheld data not used in model building. Remarkably, the artificially generated time series show realistic novel behaviors that do not exist in the training data, but that do exist in the out-of-sample observational data. This suggests that GMN captures inherently emergent properties of the network. We suggest our approach may be a generic recipe for mapping time series observations of any complex nonlinear network into a model that is able to generate naturalistic system behaviors and that identifies variables that have real world correspondence and can be experimentally manipulated.

Submitted 20 June, 2021; originally announced June 2021.

Comments: 20 pages, 15 figures; corresponding author: Gerald Pao

arXiv:2011.13554 [pdf]

Towards decoding the coupled decision-making of metabolism and epithelial-mesenchymal transition in cancer

Authors: Dongya Jia, Jun Hyoung Park, Harsimran Kaur, Kwang Hwa Jung, Sukjin Yang, Shubham Tripathi, Madeline Galbraith, Youyuan Deng, Mohit Kumar Jolly, Benny Abraham Kaipparettu, Jose N. Onuchic, Herbert Levine

Abstract: Cancer cells have the plasticity to adjust their metabolic phenotypes for survival and metastasis. During metastasis, a developmental program known as the epithelial-mesenchymal transition (EMT) plays a critical role. There is extensive cross-talk between metabolism and EMT, but how this leads to coordinated physiological changes is still uncertain. The elusive connection between metabolism and EMT compromises the efficacy of metabolic therapies targeting metastasis. In this review, we aim to clarify the causation between metabolism and EMT based on recent experimental studies and propose integrated theoretical-experimental efforts to better understand the coupled decision-making of metabolism and EMT.

Submitted 26 November, 2020; originally announced November 2020.

Comments: 31 pages, 3 figures

arXiv:2011.11082 [pdf, other]

Massively Parallel Causal Inference of Whole Brain Dynamics at Single Neuron Resolution

Authors: Wassapon Watanakeesuntorn, Keichi Takahashi, Kohei Ichikawa, Joseph Park, George Sugihara, Ryousei Takano, Jason Haga, Gerald M. Pao

Abstract: Empirical Dynamic Modeling (EDM) is a nonlinear time series causal inference framework. The latest implementation of EDM, cppEDM, has only been used for small datasets due to computational cost. With the growth of data collection capabilities, there is a great need to identify causal relationships in large datasets. We present mpEDM, a parallel distributed implementation of EDM optimized for modern GPU-centric supercomputers. We improve the original algorithm to reduce redundant computation and optimize the implementation to fully utilize hardware resources such as GPUs and SIMD units. As a use case, we run mpEDM on AI Bridging Cloud Infrastructure (ABCI) using datasets of an entire animal brain sampled at single neuron resolution to identify dynamical causation patterns across the brain. mpEDM is 1,530x faster than cppEDM, and a dataset containing 101,729 neurons was analyzed in 199 seconds on 512 nodes. This is the largest EDM causal inference achieved to date.

Submitted 22 November, 2020; originally announced November 2020.

Comments: 10 pages, 10 figures, accepted at IEEE International Conference on Parallel and Distributed Systems (ICPADS) 2020, corresponding authors: Keichi Takahashi, Gerald M Pao

ACM Class: K.6.3; G.4; J.3

arXiv:2008.05377 [pdf]

Network reinforcement driven drug repurposing for COVID-19 by exploiting disease-gene-drug associations

Authors: Yonghyun Nam, Jae-Seung Yun, Seung Mi Lee, Ji Won Park, Ziqi Chen, Brian Lee, Anurag Verma, Xia Ning, Li Shen, Dokyoon Kim

Abstract: Currently, the number of patients with COVID-19 has significantly increased. Thus, there is an urgent need for developing treatments for COVID-19. Drug repurposing, which is the process of reusing already-approved drugs for new medical conditions, can be a good way to solve this problem quickly and broadly. Many clinical trials for COVID-19 patients using treatments for other diseases have already been in place or will be performed at clinical sites in the near future. Additionally, patients with comorbidities such as diabetes mellitus, obesity, liver cirrhosis, kidney diseases, hypertension, and asthma are at higher risk for severe illness from COVID-19. Thus, the relationship between comorbid diseases and COVID-19 may help to find repurposable drugs. To reduce trial and error in finding treatments for COVID-19, we propose building a network-based drug repurposing framework to prioritize repurposable drugs. First, we utilized knowledge of COVID-19 to construct a disease-gene-drug network (DGDr-Net) representing a COVID-19-centric interactome with components for diseases, genes, and drugs. DGDr-Net consisted of 592 diseases, 26,681 human genes and 2,173 drugs, and medical information for 18 common comorbidities. The DGDr-Net recommended candidate repurposable drugs for COVID-19 through network reinforcement driven scoring algorithms. The scoring algorithms determined the priority of recommendations by utilizing graph-based semi-supervised learning. From the predicted scores, we recommended 30 drugs, including dexamethasone, resveratrol, methotrexate, indomethacin, quercetin, etc., as repurposable drugs for COVID-19, and the results were verified with drugs that have been under clinical trials. The list of drugs via a data-driven computational approach could help reduce trial-and-error in finding treatment for COVID-19.

Submitted 12 August, 2020; originally announced August 2020.

Comments: 4 figures

arXiv:2006.00688 [pdf, other]

A Mathematical Description of Bacterial Chemotaxis in Response to Two Stimuli

Authors: Jeungeun Park, Zahra Aminzare

Abstract: Bacteria are often exposed to multiple stimuli in complex environments, and their efficient chemotactic decisions are critical to survive and grow in their native environments. Bacterial responses to the environmental stimuli depend on the ratio of their corresponding chemoreceptors. By incorporating the signaling machinery of individual cells, we analyze the collective motion of a population of Escherichia coli bacteria in response to two stimuli, mainly serine and methyl-aspartate (MeAsp), in a one-dimensional and a two-dimensional environment, which is inspired by experimental results in Y. Kalinin et al., J. Bacteriol. 192(7):1796-1800, 2010. Under suitable conditions, we show that if the ratio of the main chemoreceptors of individual cells, namely Tar/Tsr is less than a specific threshold, the bacteria move to the gradient of serine, and if the ratio is greater than the threshold, the group of bacteria move toward the gradient of MeAsp. Finally, we examine the theory with Monte-Carlo agent-based simulations, and verify that our results qualitatively agree well with the experimental results in Y. Kalinin et al. (2010).

Submitted 8 June, 2021; v1 submitted 31 May, 2020; originally announced June 2020.

MSC Class: 35Q92; 58J55; 60J75; 92B05; 92C17; 92D25

arXiv:2005.12425 [pdf]

doi 10.1242/jeb.224121

Absolute ethanol intake drives ethanol preference in Drosophila

Authors: Scarlet J. Park, William W. Ja

Abstract: Factors that mediate ethanol preference in Drosophila melanogaster are not well understood. A major confound has been the use of diverse methods to estimate ethanol consumption. We measured fly consumptive ethanol preference on base diets varying in nutrients, taste, and ethanol concentration. Both sexes showed ethanol preference that was abolished on high nutrient concentration diets. Additionally, manipulating total food intake without altering the nutritive value of the base diet or the ethanol concentration was sufficient to evoke or eliminate ethanol preference. Absolute ethanol intake and food volume consumed were stronger predictors of ethanol preference than caloric intake or the dietary caloric content. Our findings suggest that the effect of the base diet on ethanol preference is largely mediated by total consumption associated with the delivery medium, which ultimately determines the level of ethanol intake. We speculate that a physiologically relevant threshold for ethanol intake is essential for preferential ethanol consumption.

Submitted 25 May, 2020; originally announced May 2020.

Comments: 11 pages, 2 figures, 1 table. Complete raw data accessible from https://github.com/HungryFly/JaLab/raw/master/publications/ethanol_JEB/SI_dataset.xlsx This version of the manuscript is original submission before undergoing peer review process. Final accepted and published version of this manuscript is available from https://doi.org/10.1242/jeb.224121 J Exp Biol (2020)

arXiv:2002.02601 [pdf, other]

Bidimensional linked matrix factorization for pan-omics pan-cancer analysis

Authors: Eric F. Lock, Jun Young Park, Katherine A. Hoadley

Abstract: Several modern applications require the integration of multiple large data matrices that have shared rows and/or columns. For example, cancer studies that integrate multiple omics platforms across multiple types of cancer, pan-omics pan-cancer analysis, have extended our knowledge of molecular heterogeneity beyond what was observed in single tumor and single platform studies. However, these studies have been limited by available statistical methodology. We propose a flexible approach to the simultaneous factorization and decomposition of variation across such bidimensionally linked matrices, BIDIFAC+. This decomposes variation into a series of low-rank components that may be shared across any number of row sets (e.g., omics platforms) or column sets (e.g., cancer types). This builds on a growing literature for the factorization and decomposition of linked matrices, which has primarily focused on multiple matrices that are linked in one dimension (rows or columns) only. Our objective function extends nuclear norm penalization, is motivated by random matrix theory, gives an identifiable decomposition under relatively mild conditions, and can be shown to give the mode of a Bayesian posterior distribution. We apply BIDIFAC+ to pan-omics pan-cancer data from TCGA, identifying shared and specific modes of variability across 4 different omics platforms and 29 different cancer types.

Submitted 7 April, 2022; v1 submitted 6 February, 2020; originally announced February 2020.

Comments: 26 pages, 5 figures

Journal ref: Annals of Applied Statistics 2022, Vol. 16, No. 1, 193-215

arXiv:1909.03992 [pdf]

Acoustomicrofluidic separation of tardigrades from raw cultures for sample preparation

Authors: Muhammad Afzal, Jinsoo Park, Ghulam Destgeer, Husnain Ahmed, Syed Atif Iqrar, Sanghee Kim, Sunghyun Kang, Anas Alazzam, Tae-Sung Yoon, Hyung Jin Sung

Abstract: Tardigrades are microscopic animals widely known for their survival capabilities under extreme conditions. They are the focus of current research in the fields of taxonomy, biogeography, genomics, proteomics, development, space biology, evolution, and ecology. Tardigrades, such as Hypsibius exemplaris, are being advocated as a next-generation model organism for genomic and developmental studies. The raw culture of H. exemplaris usually contains the tardigrades themselves, their eggs, algal food, and feces. Experimentation with tardigrades often requires the demanding and laborious separation of tardigrades from raw samples to prepare pure and contamination-free tardigrade samples. In this paper, we propose a two-step acousto-microfluidic separation method to isolate tardigrades from raw samples. In the first step, a passive microfluidic filter composed of an array of traps is used to remove large algal clusters in the raw sample. In the second step, a surface acoustic wave-based active microfluidic separation device is used to continuously deflect tardigrades from their original streamlines inside the microchannel and thus selectively isolate them from algae and eggs. The experimental results demonstrated efficient tardigrade separation, with a recovery rate of 96% and an algae impurity of 4% on average, in a continuous, contactless, automated, rapid, and biocompatible manner.

Submitted 9 September, 2019; originally announced September 2019.

Showing 1–50 of 78 results for author: Park, J