Search | arXiv e-print repository

arXiv:2507.02025 [pdf, ps, other]

IntFold: A Controllable Foundation Model for General and Specialized Biomolecular Structure Prediction

Authors: The IntFold Team, Leon Qiao, Wayne Bai, He Yan, Gary Liu, Nova Xi, Xiang Zhang, Siqi Sun

Abstract: We introduce IntFold, a controllable foundation model for general and specialized biomolecular structure prediction. Utilizing a high-performance custom attention kernel, IntFold achieves accuracy comparable to the state-of-the-art AlphaFold 3 on a comprehensive benchmark of diverse biomolecular structures, while also significantly outperforming other leading all-atom prediction approaches. The model's key innovation is its controllability, enabling downstream applications critical for drug screening and design. Through specialized adapters, it can be precisely guided to predict complex allosteric states, apply user-defined structural constraints, and estimate binding affinity. Furthermore, we present a training-free, similarity-based method for ranking predictions that improves success rates in a model-agnostic manner. This report details these advancements and shares insights from the training and development of this large-scale model.

Submitted 4 July, 2025; v1 submitted 2 July, 2025; originally announced July 2025.

arXiv:2507.01485 [pdf, ps, other]

BioMARS: A Multi-Agent Robotic System for Autonomous Biological Experiments

Authors: Yibo Qiu, Zan Huang, Zhiyu Wang, Handi Liu, Yiling Qiao, Yifeng Hu, Shu'ang Sun, Hangke Peng, Ronald X Xu, Mingzhai Sun

Abstract: Large language models (LLMs) and vision-language models (VLMs) have the potential to transform biological research by enabling autonomous experimentation. Yet, their application remains constrained by rigid protocol design, limited adaptability to dynamic lab conditions, inadequate error handling, and high operational complexity. Here we introduce BioMARS (Biological Multi-Agent Robotic System), an intelligent platform that integrates LLMs, VLMs, and modular robotics to autonomously design, plan, and execute biological experiments. BioMARS uses a hierarchical architecture: the Biologist Agent synthesizes protocols via retrieval-augmented generation; the Technician Agent translates them into executable robotic pseudo-code; and the Inspector Agent ensures procedural integrity through multimodal perception and anomaly detection. The system autonomously conducts cell passaging and culture tasks, matching or exceeding manual performance in viability, consistency, and morphological integrity. It also supports context-aware optimization, outperforming conventional strategies in differentiating retinal pigment epithelial cells. A web interface enables real-time human-AI collaboration, while a modular backend allows scalable integration with laboratory hardware. These results highlight the feasibility of generalizable, AI-driven laboratory automation and the transformative role of language-based reasoning in biological research.

Submitted 2 July, 2025; originally announced July 2025.

arXiv:2506.14853 [pdf, ps, other]

DisProtEdit: Exploring Disentangled Representations for Multi-Attribute Protein Editing

Authors: Max Ku, Sun Sun, Hongyu Guo, Wenhu Chen

Abstract: We introduce DisProtEdit, a controllable protein editing framework that leverages dual-channel natural language supervision to learn disentangled representations of structural and functional properties. Unlike prior approaches that rely on joint holistic embeddings, DisProtEdit explicitly separates semantic factors, enabling modular and interpretable control. To support this, we construct SwissProtDis, a large-scale multimodal dataset where each protein sequence is paired with two textual descriptions, one for structure and one for function, automatically decomposed using a large language model. DisProtEdit aligns protein and text embeddings using alignment and uniformity objectives, while a disentanglement loss promotes independence between structural and functional semantics. At inference time, protein editing is performed by modifying one or both text inputs and decoding from the updated latent representation. Experiments on protein editing and representation learning benchmarks demonstrate that DisProtEdit performs competitively with existing methods while providing improved interpretability and controllability. On a newly constructed multi-attribute editing benchmark, the model achieves a both-hit success rate of up to 61.7%, highlighting its effectiveness in coordinating simultaneous structural and functional edits.

Submitted 17 June, 2025; originally announced June 2025.

Comments: Accepted to ICMLW (GenBio) 2025 and ICMLW (FM4LS) 2025

arXiv:2506.13485 [pdf, ps, other]

Curriculum Learning for Biological Sequence Prediction: The Case of De Novo Peptide Sequencing

Authors: Xiang Zhang, Jiaqi Wei, Zijie Qiu, Sheng Xu, Nanqing Dong, Zhiqiang Gao, Siqi Sun

Abstract: Peptide sequencing, the process of identifying amino acid sequences from mass spectrometry data, is a fundamental task in proteomics. Non-Autoregressive Transformers (NATs) have proven highly effective for this task, outperforming traditional methods. Unlike autoregressive models, which generate tokens sequentially, NATs predict all positions simultaneously, leveraging bidirectional context through unmasked self-attention. However, existing NAT approaches often rely on Connectionist Temporal Classification (CTC) loss, which presents significant optimization challenges due to CTC's complexity and increases the risk of training failures. To address these issues, we propose an improved non-autoregressive peptide sequencing model that incorporates a structured protein sequence curriculum learning strategy. This approach adjusts the protein learning difficulty based on the model's estimated protein generation capabilities through a sampling process, progressively learning peptide generation from simple to complex sequences. Additionally, we introduce a self-refining inference-time module that iteratively enhances predictions using learned NAT token embeddings, improving sequence accuracy at a fine-grained level. Our curriculum learning strategy reduces the frequency of NAT training failures by more than 90% based on sampled training over various data distributions. Evaluations on nine benchmark species demonstrate that our approach outperforms all previous methods across multiple metrics and species.

Submitted 16 June, 2025; originally announced June 2025.

arXiv:2502.15867 [pdf]

Strategic priorities for transformative progress in advancing biology with proteomics and artificial intelligence

Authors: Yingying Sun, Jun A, Zhiwei Liu, Rui Sun, Liujia Qian, Samuel H. Payne, Wout Bittremieux, Markus Ralser, Chen Li, Yi Chen, Zhen Dong, Yasset Perez-Riverol, Asif Khan, Chris Sander, Ruedi Aebersold, Juan Antonio Vizcaíno, Jonathan R Krieger, Jianhua Yao, Han Wen, Linfeng Zhang, Yunping Zhu, Yue Xuan, Benjamin Boyang Sun, Liang Qiao, Henning Hermjakob, et al. (37 additional authors not shown)

Abstract: Artificial intelligence (AI) is transforming scientific research, including proteomics. Advances in mass spectrometry (MS)-based proteomics data quality, diversity, and scale, combined with groundbreaking AI techniques, are unlocking new challenges and opportunities in biological discovery. Here, we highlight key areas where AI is driving innovation, from data analysis to new biological insights. These include developing an AI-friendly ecosystem for proteomics data generation, sharing, and analysis; improving peptide and protein identification and quantification; characterizing protein-protein interactions and protein complexes; advancing spatial and perturbation proteomics; integrating multi-omics data; and ultimately enabling AI-empowered virtual cells.

Submitted 21 February, 2025; originally announced February 2025.

Comments: 28 pages, 2 figures, perspective in AI proteomics

arXiv:2412.10538 [pdf, ps, other]

Predictive Modeling, Pattern Recognition, and Spatiotemporal Representations of Plant Growth in Simulated and Controlled Environments: A Comprehensive Review

Authors: Mohamed Debbagh, Shangpeng Sun, Mark Lefsrud

Abstract: Accurate predictions and representations of plant growth patterns in simulated and controlled environments are important for addressing various challenges in plant phenomics research. This review explores various works on state-of-the-art predictive pattern recognition techniques, focusing on the spatiotemporal modeling of plant traits and the integration of dynamic environmental interactions. We provide a comprehensive examination of deterministic, probabilistic, and generative modeling approaches, emphasizing their applications in high-throughput phenotyping and simulation-based plant growth forecasting. Key topics include regressions and neural network-based representation models for the task of forecasting, limitations of existing experiment-based deterministic approaches, and the need for dynamic frameworks that incorporate uncertainty and evolving environmental feedback. This review surveys advances in 2D and 3D structured data representations through functional-structural plant models and conditional generative models. We offer a perspective on opportunities for future works, emphasizing the integration of domain-specific knowledge to data-driven methods, improvements to available datasets, and the implementation of these techniques toward real-world applications.

Submitted 24 June, 2025; v1 submitted 13 December, 2024; originally announced December 2024.

arXiv:2412.10347 [pdf, other]

COMET: Benchmark for Comprehensive Biological Multi-omics Evaluation Tasks and Language Models

Authors: Yuchen Ren, Wenwei Han, Qianyuan Zhang, Yining Tang, Weiqiang Bai, Yuchen Cai, Lifeng Qiao, Hao Jiang, Dong Yuan, Tao Chen, Siqi Sun, Pan Tan, Wanli Ouyang, Nanqing Dong, Xinzhu Ma, Peng Ye

Abstract: As key elements within the central dogma, DNA, RNA, and proteins play crucial roles in maintaining life by guaranteeing accurate genetic expression and implementation. Although research on these molecules has profoundly impacted fields like medicine, agriculture, and industry, the diversity of machine learning approaches, from traditional statistical methods to deep learning models and large language models, poses challenges for researchers in choosing the most suitable models for specific tasks, especially for cross-omics and multi-omics tasks due to the lack of comprehensive benchmarks. To address this, we introduce the first comprehensive multi-omics benchmark COMET (Benchmark for Biological COmprehensive Multi-omics Evaluation Tasks and Language Models), designed to evaluate models across single-omics, cross-omics, and multi-omics tasks. First, we curate and develop a diverse collection of downstream tasks and datasets covering key structural and functional aspects in DNA, RNA, and proteins, including tasks that span multiple omics levels. Then, we evaluate existing foundational language models for DNA, RNA, and proteins, as well as the newly proposed multi-omics method, offering valuable insights into their performance in integrating and analyzing data from different biological modalities. This benchmark aims to define critical issues in multi-omics research and guide future directions, ultimately promoting advancements in understanding biological processes through integrated and different omics data analysis.

Submitted 13 December, 2024; originally announced December 2024.

arXiv:2412.03614 [pdf, other]

Deep Learning in Single-Cell and Spatial Transcriptomics Data Analysis: Advances and Challenges from a Data Science Perspective

Authors: Shuang Ge, Shuqing Sun, Huan Xu, Qiang Cheng, Zhixiang Ren

Abstract: The development of single-cell and spatial transcriptomics has revolutionized our capacity to investigate cellular properties, functions, and interactions in both cellular and spatial contexts. However, the analysis of single-cell and spatial omics data remains challenging. First, single-cell sequencing data are high-dimensional and sparse, often contaminated by noise and uncertainty, obscuring the underlying biological signals. Second, these data often encompass multiple modalities, including gene expression, epigenetic modifications, and spatial locations. Integrating these diverse data modalities is crucial for enhancing prediction accuracy and biological interpretability. Third, while the scale of single-cell sequencing has expanded to millions of cells, high-quality annotated datasets are still limited. Fourth, the complex correlations of biological tissues make it difficult to accurately reconstruct cellular states and spatial contexts. Traditional feature engineering-based analysis methods struggle to deal with the various challenges presented by intricate biological networks. Deep learning has emerged as a powerful tool capable of handling high-dimensional complex data and automatically identifying meaningful patterns, offering significant promise in addressing these challenges. This review systematically analyzes these challenges and discusses related deep learning approaches. Moreover, we have curated 21 datasets from 9 benchmarks, encompassing 58 computational methods, and evaluated their performance on the respective modeling tasks. Finally, we highlight three areas for future development from a technical, dataset, and application perspective. This work will serve as a valuable resource for understanding how deep learning can be effectively utilized in single-cell and spatial transcriptomics analyses, while inspiring novel approaches to address emerging challenges.

Submitted 5 December, 2024; v1 submitted 4 December, 2024; originally announced December 2024.

arXiv:2409.16339 [pdf]

Large-scale digital phenotyping: identifying depression and anxiety indicators in a general UK population with over 10,000 participants

Authors: Yuezhou Zhang, Callum Stewart, Yatharth Ranjan, Pauline Conde, Heet Sankesara, Zulqarnain Rashid, Shaoxiong Sun, Richard J B Dobson, Amos A Folarin

Abstract: Digital phenotyping offers a novel and cost-efficient approach for managing depression and anxiety. Previous studies, often limited to small-to-medium or specific populations, may lack generalizability. We conducted a cross-sectional analysis of data from 10,129 participants recruited from a UK-based general population between June 2020 and August 2022. Participants shared wearable (Fitbit) data and self-reported questionnaires on depression (PHQ-8), anxiety (GAD-7), and mood via a study app. We first examined the correlations between PHQ-8/GAD-7 scores and wearable-derived features, demographics, health data, and mood assessments. Subsequently, unsupervised clustering was used to identify behavioural patterns associated with depression or anxiety. Finally, we employed separate XGBoost models to predict depression and anxiety and compared the results using different subsets of features. We observed significant associations between the severity of depression and anxiety with several factors, including mood, age, gender, BMI, sleep patterns, physical activity, and heart rate. Clustering analysis revealed that participants simultaneously exhibiting lower physical activity levels and higher heart rates reported more severe symptoms. Prediction models incorporating all types of variables achieved the best performance ($R^2$=0.41, MAE=3.42 for depression; $R^2$=0.31, MAE=3.50 for anxiety) compared to those using subsets of variables. This study identified potential indicators for depression and anxiety, highlighting the utility of digital phenotyping and machine learning technologies for rapid screening of mental disorders in general populations. These findings provide robust real-world insights for future healthcare applications.

Submitted 24 September, 2024; originally announced September 2024.

arXiv:2406.10391 [pdf, other]

BEACON: Benchmark for Comprehensive RNA Tasks and Language Models

Authors: Yuchen Ren, Zhiyuan Chen, Lifeng Qiao, Hongtai Jing, Yuchen Cai, Sheng Xu, Peng Ye, Xinzhu Ma, Siqi Sun, Hongliang Yan, Dong Yuan, Wanli Ouyang, Xihui Liu

Abstract: RNA plays a pivotal role in translating genetic instructions into functional outcomes, underscoring its importance in biological processes and disease mechanisms. Despite the emergence of numerous deep learning approaches for RNA, particularly universal RNA language models, there remains a significant lack of standardized benchmarks to assess the effectiveness of these methods. In this study, we introduce the first comprehensive RNA benchmark BEACON (BEnchmArk for COmprehensive RNA Task and Language Models). First, BEACON comprises 13 distinct tasks derived from extensive previous work covering structural analysis, functional studies, and engineering applications, enabling a comprehensive assessment of the performance of methods on various RNA understanding tasks. Second, we examine a range of models, including traditional approaches like CNNs, as well as advanced RNA foundation models based on language models, offering valuable insights into the task-specific performances of these models. Third, we investigate the vital RNA language model components from the tokenizer and positional encoding aspects. Notably, our findings emphasize the superiority of single nucleotide tokenization and the effectiveness of Attention with Linear Biases (ALiBi) over traditional positional encoding methods. Based on these insights, a simple yet strong baseline called BEACON-B is proposed, which can achieve outstanding performance with limited data and computational resources. The datasets and source code of our benchmark are available at https://github.com/terry-r123/RNABenchmark.

Submitted 12 December, 2024; v1 submitted 14 June, 2024; originally announced June 2024.

Comments: Accepted by NeurIPS 2024 Dataset and Benchmark Track
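
The BEACON abstract above singles out Attention with Linear Biases (ALiBi) as the more effective positional scheme. As a rough illustration only, the sketch below builds the per-head slopes and the distance-proportional bias that ALiBi adds to attention scores. The function names are hypothetical, the slope formula assumes the number of heads is a power of two, and the symmetric `abs(i - j)` form is an assumption suited to bidirectional encoders (ALiBi was originally described for causal attention); none of this is taken from the BEACON code.

```python
def alibi_slopes(num_heads: int) -> list[float]:
    """Per-head slopes: a geometric sequence 2^(-8/n), 2^(-16/n), ..., 2^(-8).

    Assumes num_heads is a power of two (illustrative simplification).
    """
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(seq_len: int, slope: float) -> list[list[float]]:
    """Bias matrix added to raw attention scores before the softmax.

    Entry [i][j] penalizes attention from position i to position j in
    proportion to their distance. The symmetric |i - j| form is an
    assumption for bidirectional models.
    """
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]


if __name__ == "__main__":
    slopes = alibi_slopes(8)
    for row in alibi_bias(4, slopes[0]):
        print(row)
```

Because the bias depends only on relative distance, no learned position embeddings are needed, and the same function applies to any sequence length.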

arXiv:2405.14796 [pdf, ps, other]

doi 10.1007/978-3-031-71602-7_26

Generative Plant Growth Simulation from Sequence-Informed Environmental Conditions

Authors: Mohamed Debbagh, Yixue Liu, Zhouzhou Zheng, Xintong Jiang, Shangpeng Sun, Mark Lefsrud

Abstract: A plant growth simulation can be characterized as a reconstructed visual representation of a plant or plant system. The phenotypic characteristics and plant structures are controlled by the scene environment and other contextual attributes. Considering the temporal dependencies and compounding effects of various factors on growth trajectories, we formulate a probabilistic approach to the simulation task by solving a frame synthesis and pattern recognition problem. We introduce a sequence-informed plant growth simulation framework (SI-PGS) that employs a conditional generative model to implicitly learn a distribution of possible plant representations within a dynamic scene from a fusion of low-dimensional temporal sensor and context data. Methods such as controlled latent sampling and recurrent output connections are used to improve coherence in the plant structures between frames of prediction. In this work, we demonstrate that SI-PGS is able to capture temporal dependencies and continuously generate realistic frames of plant growth.

Submitted 9 July, 2024; v1 submitted 23 May, 2024; originally announced May 2024.

Journal ref: Artificial Neural Networks in Pattern Recognition. ANNPR 2024. Lecture Notes in Computer Science, vol. 15154, Springer, Cham, 2024, pp. 308-319

arXiv:2312.12094 [pdf, other]

CrossBind: Collaborative Cross-Modal Identification of Protein Nucleic-Acid-Binding Residues

Authors: Linglin Jing, Sheng Xu, Yifan Wang, Yuzhe Zhou, Tao Shen, Zhigang Ji, Hui Fang, Zhen Li, Siqi Sun

Abstract: Accurate identification of protein nucleic-acid-binding residues poses a significant challenge with important implications for various biological processes and drug design. Many typical computational methods for protein analysis rely on a single model that could ignore either the semantic context of the protein or the global 3D geometric information. Consequently, these approaches may result in incomplete or inaccurate protein analysis. To address the above issue, in this paper, we present CrossBind, a novel collaborative cross-modal approach for identifying binding residues by exploiting both protein geometric structure and its sequence prior knowledge extracted from a large-scale protein language model. Specifically, our multi-modal approach leverages a contrastive learning technique and atom-wise attention to capture the positional relationships between atoms and residues, thereby incorporating fine-grained local geometric knowledge, for better binding residue prediction. Extensive experimental results demonstrate that our approach outperforms the next best state-of-the-art methods, GraphSite and GraphBind, on DNA and RNA datasets by 10.8/17.3% in terms of the harmonic mean of precision and recall (F1-Score) and 11.9/24.8% in Matthews correlation coefficient (MCC), respectively. We release the code at https://github.com/BEAM-Labs/CrossBind.

Submitted 20 December, 2023; v1 submitted 19 December, 2023; originally announced December 2023.

Comments: Accepted to AAAI-24
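
The CrossBind abstract above reports gains in F1-Score (the harmonic mean of precision and recall) and in the Matthews correlation coefficient (MCC). For readers unfamiliar with the two metrics, here is a minimal sketch that computes both from binary confusion-matrix counts. The function name and the zero-division conventions are assumptions made for illustration; the paper's own evaluation code may differ.

```python
import math


def f1_and_mcc(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float]:
    """Return (F1, MCC) for a binary classifier from confusion-matrix counts.

    Degenerate cases (an empty row or column of the confusion matrix)
    return 0.0 by convention here; that choice is an assumption.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    # MCC uses all four cells, so it stays informative on imbalanced data
    # such as binding-residue labels, where positives are rare.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = ((tp * tn - fp * fn) / denom) if denom else 0.0
    return f1, mcc


if __name__ == "__main__":
    print(f1_and_mcc(tp=8, fp=2, fn=2, tn=8))
```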

arXiv:2312.11584 [pdf, other]

ContraNovo: A Contrastive Learning Approach to Enhance De Novo Peptide Sequencing

Authors: Zhi Jin, Sheng Xu, Xiang Zhang, Tianze Ling, Nanqing Dong, Wanli Ouyang, Zhiqiang Gao, Cheng Chang, Siqi Sun

Abstract: De novo peptide sequencing from mass spectrometry (MS) data is a critical task in proteomics research. Traditional de novo algorithms have encountered a bottleneck in accuracy due to the inherent complexity of proteomics data. While deep learning-based methods have shown progress, they reduce the problem to a translation task, potentially overlooking critical nuances between spectra and peptides. In our research, we present ContraNovo, a pioneering algorithm that leverages contrastive learning to extract the relationship between spectra and peptides and incorporates the mass information into peptide decoding, aiming to address these intricacies more efficiently. Through rigorous evaluations on two benchmark datasets, ContraNovo consistently outshines contemporary state-of-the-art solutions, underscoring its promising potential in enhancing de novo peptide sequencing. The source code is available at https://github.com/BEAM-Labs/ContraNovo.

Submitted 18 December, 2023; originally announced December 2023.

Comments: This paper has been accepted by AAAI 2024

arXiv:2312.07931 [pdf, other]

doi 10.1609/aaai.v38i14.29509

Levenshtein Distance Embedding with Poisson Regression for DNA Storage

Authors: Xiang Wei, Alan J. X. Guo, Sihan Sun, Mengyi Wei, Wei Yu

Abstract: Efficient computation or approximation of Levenshtein distance, a widely-used metric for evaluating sequence similarity, has attracted significant attention with the emergence of DNA storage and other biological applications. Sequence embedding, which maps Levenshtein distance to a conventional distance between embedding vectors, has emerged as a promising solution. In this paper, a novel neural network-based sequence embedding technique using Poisson regression is proposed. We first provide a theoretical analysis of the impact of embedding dimension on model performance and present a criterion for selecting an appropriate embedding dimension. Under this embedding dimension, the Poisson regression is introduced by assuming the Levenshtein distance between sequences of fixed length following a Poisson distribution, which naturally aligns with the definition of Levenshtein distance. Moreover, from the perspective of the distribution of embedding distances, Poisson regression approximates the negative log likelihood of the chi-squared distribution and offers advancements in removing the skewness. Through comprehensive experiments on real DNA storage data, we demonstrate the superior performance of the proposed method compared to state-of-the-art approaches.

Submitted 13 December, 2023; originally announced December 2023.

Journal ref: Proceedings of the AAAI Conference on Artificial Intelligence, (2024) 38(14), 15796-15804
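
The entry above concerns approximating Levenshtein distance with learned embeddings. For context, the exact distance that such embeddings aim to approximate can be computed with the classic dynamic program, sketched below. This is the textbook algorithm, not the paper's method, and the function name is chosen for illustration; its quadratic cost in sequence length is the usual motivation for faster approximations.

```python
def levenshtein(a: str, b: str) -> int:
    """Exact edit distance (insertions, deletions, substitutions).

    Uses the standard two-row dynamic program: O(len(a) * len(b)) time,
    O(min(len(a), len(b))) extra space.
    """
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting the first i characters of a
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("ACGTAC", "AGTTAC"))
```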

arXiv:2312.02953 [pdf]

Longitudinal Assessment of Seasonal Impacts and Depression Associations on Circadian Rhythm Using Multimodal Wearable Sensing

Authors: Yuezhou Zhang, Amos A Folarin, Shaoxiong Sun, Nicholas Cummins, Yatharth Ranjan, Zulqarnain Rashid, Callum Stewart, Pauline Conde, Heet Sankesara, Petroula Laiou, Faith Matcham, Katie M White, Carolin Oetzmann, Femke Lamers, Sara Siddi, Sara Simblett, Srinivasan Vairavan, Inez Myin-Germeys, David C. Mohr, Til Wykes, Josep Maria Haro, Peter Annas, Brenda WJH Penninx, Vaibhav A Narayan, Matthew Hotopf, et al. (2 additional authors not shown)

Abstract: Objective: This study aimed to explore the associations between depression severity and wearable-measured circadian rhythms, accounting for seasonal impacts and quantifying seasonal changes in circadian rhythms. Materials and Methods: Data used in this study came from a large longitudinal mobile health study. Depression severity (measured biweekly using the 8-item Patient Health Questionnaire [PHQ-8]) and behaviors (monitored by Fitbit) were tracked for up to two years. Twelve features were extracted from Fitbit recordings to approximate circadian rhythms. Three nested linear mixed-effects models were employed for each feature: (1) incorporating the PHQ-8 score as an independent variable; (2) adding the season variable; and (3) adding an interaction term between season and the PHQ-8 score. Results: This study analyzed 10,018 PHQ-8 records with Fitbit data from 543 participants. Upon adjusting for seasonal effects, higher PHQ-8 scores were associated with reduced activity, irregular behaviors, and delayed rhythms. Notably, the negative association with daily step counts was stronger in summer and spring than in winter, and the positive association with the onset of the most active continuous 10-hour period was significant only during summer. Furthermore, participants had shorter and later sleep, more activity, and delayed circadian rhythms in summer compared to winter. Discussion and Conclusions: Our findings underscore the significant seasonal impacts on human circadian rhythms and their associations with depression and indicate that wearable-measured circadian rhythms have the potential to be the digital biomarkers of depression.

Submitted 5 December, 2023; originally announced December 2023.

arXiv:2308.16713 [pdf, other]

Accurate Prediction of Antibody Function and Structure Using Bio-Inspired Antibody Language Model

Authors: Hongtai Jing, Zhengtao Gao, Sheng Xu, Tao Shen, Zhangzhi Peng, Shwai He, Tao You, Shuang Ye, Wei Lin, Siqi Sun

Abstract: In recent decades, antibodies have emerged as indispensable therapeutics for combating diseases, particularly viral infections. However, their development has been hindered by limited structural information and labor-intensive engineering processes. Fortunately, significant advancements in deep learning methods have facilitated the precise prediction of protein structure and function by leveraging co-evolution information from homologous proteins. Despite these advances, predicting the conformation of antibodies remains challenging due to their unique evolution and the high flexibility of their antigen-binding regions. Here, to address this challenge, we present the Bio-inspired Antibody Language Model (BALM). This model is trained on a vast dataset comprising 336 million 40% non-redundant unlabeled antibody sequences, capturing both unique and conserved properties specific to antibodies. Notably, BALM showcases exceptional performance across four antigen-binding prediction tasks. Moreover, we introduce BALMFold, an end-to-end method derived from BALM, capable of swiftly predicting full atomic antibody structures from individual sequences. Remarkably, BALMFold outperforms well-established methods such as AlphaFold2, IgFold, ESMFold, and OmegaFold in the antibody benchmark, demonstrating significant potential to advance innovative engineering and streamline therapeutic antibody development by reducing the need for unnecessary trials.

Submitted 31 August, 2023; originally announced August 2023.

arXiv:2308.11773 [pdf]

Identifying depression-related topics in smartphone-collected free-response speech recordings using an automatic speech recognition system and a deep learning topic model

Authors: Yuezhou Zhang, Amos A Folarin, Judith Dineley, Pauline Conde, Valeria de Angel, Shaoxiong Sun, Yatharth Ranjan, Zulqarnain Rashid, Callum Stewart, Petroula Laiou, Heet Sankesara, Linglong Qian, Faith Matcham, Katie M White, Carolin Oetzmann, Femke Lamers, Sara Siddi, Sara Simblett, Björn W. Schuller, Srinivasan Vairavan, Til Wykes, Josep Maria Haro, Brenda WJH Penninx, Vaibhav A Narayan, Matthew Hotopf, et al. (3 additional authors not shown)

Abstract: Language use has been shown to correlate with depression, but large-scale validation is needed. Traditional methods like clinic studies are expensive. Natural language processing has therefore been employed on social media to predict depression, but limitations remain: lack of validated labels, biased user samples, and no context. Our study identified 29 topics in 3919 smartphone-collected speech recordings from 265 participants using the Whisper tool and BERTopic model. Six topics with a median PHQ-8 greater than or equal to 10 were regarded as risk topics for depression: No Expectations, Sleep, Mental Therapy, Haircut, Studying, and Coursework. To elucidate the topic emergence and associations with depression, we compared behavioral (from wearables) and linguistic characteristics across identified topics. The correlation between topic shifts and changes in depression severity over time was also investigated, indicating the importance of longitudinally monitoring language use. We also tested the BERTopic model on a similar smaller dataset (356 speech recordings from 57 participants), obtaining some consistent results. In summary, our findings demonstrate specific speech topics may indicate depression severity. The presented data-driven workflow provides a practical approach to collecting and analyzing large-scale speech data from real-world settings for digital health research.

Submitted 5 September, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

arXiv:2306.01824 [pdf, other]

Enhancing the Protein Tertiary Structure Prediction by Multiple Sequence Alignment Generation

Authors: Le Zhang, Jiayang Chen, Tao Shen, Yu Li, Siqi Sun

Abstract: The field of protein folding research has been greatly advanced by deep learning methods, with AlphaFold2 (AF2) demonstrating exceptional performance and atomic-level precision. As co-evolution is integral to protein structure prediction, AF2's accuracy is significantly influenced by the depth of multiple sequence alignment (MSA), which requires extensive exploration of a large protein database for similar sequences. However, not all protein sequences possess abundant homologous families, and consequently, AF2's performance can degrade on such queries, at times failing to produce meaningful results. To address this, we introduce a novel generative language model, MSA-Augmenter, which leverages protein-specific attention mechanisms and large-scale MSAs to generate useful, novel protein sequences not currently found in databases. These sequences supplement shallow MSAs, enhancing the accuracy of structural property predictions. Our experiments on CASP14 demonstrate that MSA-Augmenter can generate de novo sequences that retain co-evolutionary information from inferior MSAs, thereby improving protein structure prediction quality on top of strong AF2.

Submitted 2 June, 2023; originally announced June 2023.

arXiv:2305.08929 [pdf, other]

doi 10.15212/AMM-2024-0047

AF2-Mutation: Adversarial Sequence Mutations against AlphaFold2 on Protein Tertiary Structure Prediction

Authors: Zhongju Yuan, Tao Shen, Sheng Xu, Leiye Yu, Ruobing Ren, Siqi Sun

Abstract: Deep learning-based approaches, such as AlphaFold2 (AF2), have significantly advanced protein tertiary structure prediction, achieving results comparable to real biological experimental methods. While AF2 has shown limitations in predicting the effects of mutations, its robustness against sequence mutations remains to be determined. Starting with the wild-type (WT) sequence, we investigate adversarial sequences generated via an evolutionary approach, which AF2 predicts to be substantially different from WT. Our experiments on CASP14 reveal that by modifying merely three residues in the protein sequence using a combination of replacement, deletion, and insertion strategies, the alteration in AF2's predictions, as measured by the Local Distance Difference Test (lDDT), reaches 46.61. Moreover, when applied to a specific protein, SPNS2, our proposed algorithm successfully identifies biologically meaningful residues critical to protein structure determination and potentially indicates alternative conformations, thus significantly expediting the experimental process.

Submitted 15 May, 2023; originally announced May 2023.

arXiv:2212.10540 [pdf]

doi 10.2196/45233

Challenges in Using mHealth Data From Smartphones and Wearable Devices to Predict Depression Symptom Severity: Retrospective Analysis

Authors: Shaoxiong Sun, Amos A. Folarin, Yuezhou Zhang, Nicholas Cummins, Rafael Garcia-Dias, Callum Stewart, Yatharth Ranjan, Zulqarnain Rashid, Pauline Conde, Petroula Laiou, Heet Sankesara, Faith Matcham, Daniel Leightley, Katie M. White, Carolin Oetzmann, Alina Ivan, Femke Lamers, Sara Siddi, Sara Simblett, Raluca Nica, Aki Rintala, David C. Mohr, Inez Myin-Germeys, Til Wykes, Josep Maria Haro, et al. (6 additional authors not shown)

Abstract: A number of challenges exist for the analysis of mHealth data: maintaining participant engagement over extended time periods and therefore understanding what constitutes an acceptable threshold of missing data; distinguishing between the cross-sectional and longitudinal relationships for different features to determine their utility in tracking within-individual longitudinal variation or screening individuals at high risk; and understanding the heterogeneity with which depression manifests itself in behavioral patterns quantified by the passive features. From 479 participants with MDD, we extracted 21 features capturing mobility, sleep, and smartphone use. We investigated the impact of the number of days of available data on feature quality using the intraclass correlation coefficient and Bland-Altman analysis. We then examined the nature of the correlation between the 8-item Patient Health Questionnaire (PHQ-8) depression scale (measured every 14 days) and the features using the individual-mean correlation, repeated measures correlation, and linear mixed effects model. Furthermore, we stratified the participants based on their behavioral difference, quantified by the features, between periods of high (depression) and low (no depression) PHQ-8 scores using the Gaussian mixture model. We demonstrated that at least 8 (range 2-12) days were needed for reliable calculation of most of the features in the 14-day time window.
We observed that features such as sleep onset time correlated better with PHQ-8 scores cross-sectionally than longitudinally, whereas features such as wakefulness after sleep onset correlated well with PHQ-8 longitudinally but worse cross-sectionally. Finally, we found that participants could be separated into 3 distinct clusters according to their behavioral difference between periods of depression and periods of no depression.

Submitted 14 August, 2023; v1 submitted 20 December, 2022; originally announced December 2022.

arXiv:2207.01586 [pdf, other]

doi 10.1038/s41592-024-02487-0

Accurate RNA 3D structure prediction using a language model-based deep learning approach

Authors: Tao Shen, Zhihang Hu, Siqi Sun, Di Liu, Felix Wong, Jiuming Wang, Jiayang Chen, Yixuan Wang, Liang Hong, Jin Xiao, Liangzhen Zheng, Tejas Krishnamoorthi, Irwin King, Sheng Wang, Peng Yin, James J. Collins, Yu Li

Abstract: Accurate prediction of RNA three-dimensional (3D) structure remains an unsolved challenge. Determining RNA 3D structures is crucial for understanding their functions and informing RNA-targeting drug development and synthetic biology design. The structural flexibility of RNA, which leads to scarcity of experimentally determined data, complicates computational prediction efforts. Here, we present RhoFold+, an RNA language model-based deep learning method that accurately predicts 3D structures of single-chain RNAs from sequences. By integrating an RNA language model pre-trained on ~23.7 million RNA sequences and leveraging techniques to address data scarcity, RhoFold+ offers a fully automated end-to-end pipeline for RNA 3D structure prediction. Retrospective evaluations on RNA-Puzzles and CASP15 natural RNA targets demonstrate RhoFold+'s superiority over existing methods, including human expert groups. Its efficacy and generalizability are further validated through cross-family and cross-type assessments, as well as time-censored benchmarks. Additionally, RhoFold+ predicts RNA secondary structures and inter-helical angles, providing empirically verifiable features that broaden its applicability to RNA structure and function studies.

Submitted 2 January, 2025; v1 submitted 4 July, 2022; originally announced July 2022.

Comments: 23 pages, 5 figures. A revised version is published in Nature Methods 21, 2287-2298 (2024). doi:10.1038/s41592-024-02487-0

Journal ref: Nature Methods 2024

arXiv:2204.00300 [pdf, other]

Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions

Authors: Jiayang Chen, Zhihang Hu, Siqi Sun, Qingxiong Tan, Yixuan Wang, Qinze Yu, Licheng Zong, Liang Hong, Jin Xiao, Tao Shen, Irwin King, Yu Li

Abstract: Non-coding RNA structure and function are essential to understanding various biological processes, such as cell signaling, gene expression, and post-transcriptional regulations. These are all among the core problems in the RNA field. With the rapid growth of sequencing technology, we have accumulated a massive amount of unannotated RNA sequences. On the other hand, expensive experimental observation yields only a limited amount of annotated data and 3D structures. Hence, it is still challenging to design computational methods for predicting their structures and functions. The lack of annotated data and systematic study causes inferior performance. To resolve the issue, we propose a novel RNA foundation model (RNA-FM) to take advantage of all the 23 million non-coding RNA sequences through self-supervised learning. Within this approach, we discover that the pre-trained RNA-FM could infer sequential and evolutionary information of non-coding RNAs without using any labels. Furthermore, we demonstrate RNA-FM's effectiveness by applying it to the downstream secondary/3D structure prediction, SARS-CoV-2 genome structure and evolution prediction, protein-RNA binding preference modeling, and gene expression regulation modeling. The comprehensive experiments show that the proposed method improves the RNA structural and functional modelling results significantly and consistently. Despite only being trained with unlabelled data, RNA-FM can serve as the foundational model for the field.

Submitted 7 August, 2022; v1 submitted 1 April, 2022; originally announced April 2022.

arXiv:2201.12644 [pdf]

doi 10.2196/40667

Associations between depression symptom severity and daily-life gait characteristics derived from long-term acceleration signals in real-world settings

Authors: Yuezhou Zhang, Amos A Folarin, Shaoxiong Sun, Nicholas Cummins, Srinivasan Vairavan, Linglong Qian, Yatharth Ranjan, Zulqarnain Rashid, Pauline Conde, Callum Stewart, Petroula Laiou, Heet Sankesara, Faith Matcham, Katie M White, Carolin Oetzmann, Alina Ivan, Femke Lamers, Sara Siddi, Sara Simblett, Aki Rintala, David C Mohr, Inez Myin-Germeys, Til Wykes, Josep Maria Haro, Brenda WJH Penninx, et al. (5 additional authors not shown)

Abstract: Gait is an essential manifestation of depression. Laboratory gait characteristics have been found to be closely associated with depression. However, the gait characteristics of daily walking in real-world scenarios and their relationships with depression are yet to be fully explored. This study aimed to explore associations between depression symptom severity and daily-life gait characteristics derived from acceleration signals in real-world settings. In this study, we used two ambulatory datasets: a public dataset with 71 older adults' 3-day acceleration signals collected by a wearable device, and a subset of an EU longitudinal depression study with 215 participants and their phone-collected acceleration signals (average 463 hours per participant). We detected participants' gait cycles and force from acceleration signals and extracted 20 statistics-based daily-life gait features to describe the distribution and variance of gait cadence and force over a long-term period corresponding to the self-reported depression score. The gait cadence of faster steps (75th percentile) over a long-term period has a significant negative association with the depression symptom severity of this period in both datasets. Daily-life gait features could significantly improve the goodness of fit of evaluating depression severity relative to laboratory gait patterns and demographics, which was assessed by likelihood-ratio tests in both datasets.
This study indicated that the significant links between daily-life walking characteristics and depression symptom severity could be captured by both wearable devices and mobile phones. The gait cadence of faster steps in daily-life walking has the potential to be a biomarker for evaluating depression severity, which may contribute to clinical tools to remotely monitor mental health in real-world settings.

Submitted 29 January, 2022; originally announced January 2022.

arXiv:2112.12853 [pdf]

Systolic blood pressure estimation using ECG and PPG in patients undergoing surgery

Authors: Shaoxiong Sun, Erik Bresch, Jens Muehlsteff, Lars Schmitt, Xi Long, Rick Bezemer, Igor Paulussen, Gerrit J. Noordergraaf, Ronald M. Aarts

Abstract: Background and Objectives: In a significant portion of surgeries, blood pressure (BP) is often measured non-invasively in an intermittent manner. This practice has a risk of missing clinically relevant BP changes between two adjacent intermittent BP measurements. This study proposes a method to non-invasively estimate systolic blood pressure (SBP) with high accuracy in patients undergoing surgery. Methods: Continuous arterial BP, electrocardiography (ECG), and photoplethysmography (PPG) signals were acquired from 29 patients undergoing surgery. After extracting 9 features from the PPG and ECG signals, we dynamically selected features upon each intermittent measurement (every 10 min) of SBP based on feature robustness and the principle of correlation-based feature selection. Finally, multiple linear regression models were built to combine these features to estimate SBP every 30 s. Results: Compared to the reference SBP, the proposed method achieved a mean of difference at 0.08 mmHg, a standard deviation of difference at 7.97 mmHg, and a correlation coefficient at 0.89 (p < 0.001). Conclusions: This study demonstrates the feasibility of non-invasively estimating SBP every 30 s with high accuracy during surgery by using ECG, PPG, and intermittent SBP measurements every 10 min, which meets the standard of the Association for the Advancement of Medical Instrumentation. The proposed method has the potential to enhance BP monitoring in the operating room, improving patient outcomes and experiences.

Submitted 19 August, 2022; v1 submitted 23 December, 2021; originally announced December 2021.

arXiv:2112.11903 [pdf]

The utility of wearable devices in assessing ambulatory impairments of people with multiple sclerosis in free-living conditions

Authors: Shaoxiong Sun, Amos A Folarin, Yuezhou Zhang, Nicholas Cummins, Shuo Liu, Callum Stewart, Yatharth Ranjan, Zulqarnain Rashid, Pauline Conde, Petroula Laiou, Heet Sankesara, Gloria Dalla Costa, Letizia Leocani, Per Soelberg Sørensen, Melinda Magyari, Ana Isabel Guerrero, Ana Zabalza, Srinivasan Vairavan, Raquel Bailon, Sara Simblett, Inez Myin-Germeys, Aki Rintala, Til Wykes, Vaibhav A Narayan, Matthew Hotopf, et al. (3 additional authors not shown)

Abstract: Multiple sclerosis (MS) is a progressive inflammatory and neurodegenerative disease of the central nervous system affecting over 2.5 million people globally. The in-clinic six-minute walk test (6MWT) is a widely used objective measure to evaluate the progression of MS. Yet, it has limitations such as the need for a clinical visit and a proper walkway. The widespread use of wearable devices capable of depicting patients' activity profiles has the potential to assess the level of MS-induced disability in free-living conditions. In this work, we extracted 96 activity features in different temporal granularities (from minute-level to day-level) and explored their utility in estimating 6MWT scores in a European (Italy, Spain, and Denmark) MS cohort of 337 participants over an average of 10-month duration. We combined these features with participant demographics using three regression models including elastic net, gradient boosted trees and random forest. In addition, we quantified the individual feature contribution using feature importance in these regression models, linear mixed-effects models, generalized estimating equations, and correlation-based feature selection (CFS). The results showed promising estimation performance with R² of 0.30, which was derived using random forest after CFS. This model was able to distinguish the participants with low disability from those with high disability.
Furthermore, we observed that the minute-level (no longer than 8 minutes) step count, particularly those capturing the upper end of the step count distribution, had a stronger association with 6MWT. The use of a walking aid was indicative of ambulatory function measured through 6MWT. This study provides a basis for future investigation into the clinical relevance and utility of wearables in assessing MS progression in free-living conditions.

Submitted 22 December, 2021; originally announced December 2021.

arXiv:2101.07764 [pdf]

doi 10.1038/s41467-021-25890-z

Growth and site-specific organization of micron-scale biomolecular devices on living mammalian cells

Authors: Sisi Jia, Siew Cheng Phua, Yuta Nihongaki, Yizeng Li, Michael Pacella, Yi Li, Abdul M. Mohammed, Sean Sun, Takanari Inoue, Rebecca Schulman

Abstract: Mesoscale molecular assemblies on the cell surface, such as cilia and filopodia, integrate information, control transport and amplify signals. Synthetic devices mimicking these structures could sensitively monitor these cellular functions and direct new ones. The challenges in creating such devices, however, are that they must be integrated with cells in a precise kinetically controlled process and a device's structure and its precisely structured cell interface must then be maintained during active cellular function. Here we report the ability to integrate synthetic micro-scale filaments, DNA nanotubes, into a cell's architecture by anchoring them by their ends to specific receptors on the surfaces of mammalian cells. These filaments can act as shear stress meters: how anchored nanotubes bend at the cell surface quantitatively indicates the magnitude of shear stresses between 0-2 dyn per cm², a regime important for cell signaling. Nanotubes can also grow while anchored to cells, thus acting as dynamic components of cells. This approach to cell surface engineering, in which synthetic biomolecular assemblies are organized within existing cellular architecture, could make it possible to build new types of sensors, machines and scaffolds that can interface with, control and measure properties of cells.

Submitted 19 January, 2021; originally announced January 2021.

Comments: 20 pages, 5 figures

arXiv:2010.00957 [pdf]

doi 10.1002/pst.2108

Estimands in Hematologic Oncology Trials

Authors: Steven Sun, Hans-Jochen Weber, Emily Butler, Kaspar Rufibach, Satrajit Roychoudhury

Abstract: The estimand framework included in the addendum to the ICH E9 guideline facilitates discussions to ensure alignment between the key question of interest, the analysis, and interpretation. Therapeutic knowledge and drug mechanism play a crucial role in determining the strategy and defining the estimand for clinical trial designs. Clinical trials in patients with hematological malignancies often present unique challenges for trial design due to complexity of treatment options and existence of potential curative but highly risky procedures, e.g. stem cell transplant or treatment sequence across different phases (induction, consolidation, maintenance). Here, we illustrate how to apply the estimand framework in hematological clinical trials and how the estimand framework can address potential difficulties in trial result interpretation. This paper is a result of a cross-industry collaboration to connect the International Conference on Harmonisation (ICH) E9 addendum concepts to applications. Three randomized phase 3 trials will be used to consider common challenges including intercurrent events in hematologic oncology trials to illustrate different scientific questions and the consequences of the estimand choice for trial design, data collection, analysis, and interpretation. Template language for describing estimand in both study protocols and statistical analysis plans is suggested for statisticians' reference.

Submitted 1 October, 2020; originally announced October 2020.

Comments: 5 tables, 1 figure

Journal ref: Pharm. Stat., 2021, 20, 793-805

arXiv:2009.12983 [pdf]

doi 10.2196/24604

The Relationship between Major Depression Symptom Severity and Sleep Collected Using a Wristband Wearable Device: Multi-centre Longitudinal Observational Study

Authors: Yuezhou Zhang, Amos A Folarin, Shaoxiong Sun, Nicholas Cummins, Rebecca Bendayan, Yatharth Ranjan, Zulqarnain Rashid, Pauline Conde, Callum Stewart, Petroula Laiou, Faith Matcham, Katie White, Femke Lamers, Sara Siddi, Sara Simblett, Inez Myin-Germeys, Aki Rintala, Til Wykes, Josep Maria Haro, Brenda WJH Pennix, Vaibhav A Narayan, Matthew Hotopf, Richard JB Dobson

Abstract: Research in mental health has implicated sleep pathologies with depression. However, the gold standard for sleep assessment, polysomnography, is not suitable for long-term, continuous, monitoring of daily sleep, and methods such as sleep diaries rely on subjective recall, which is qualitative and inaccurate. Wearable devices, on the other hand, provide a low-cost and convenient means to monitor sleep in home settings. The main aim of this study was to devise and extract sleep features, from data collected using a wearable device, and analyse their correlation with depressive symptom severity and sleep quality, as measured by the self-assessed Patient Health Questionnaire 8-item. Daily sleep data were collected passively by Fitbit wristband devices, and depressive symptom severity was self-reported every two weeks by the PHQ-8. The data used in this paper included 2,812 PHQ-8 records from 368 participants recruited from three study sites in the Netherlands, Spain, and the UK. We extracted 21 sleep features from Fitbit data which describe sleep in the following five aspects: sleep architecture, sleep stability, sleep quality, insomnia, and hypersomnia. Linear mixed regression models were used to explore associations between sleep features and depressive symptom severity. The z-test was used to evaluate the significance of the coefficient of each feature. We tested our models on the entire dataset and individually on the data of three different study sites.
We identified 16 sleep features that were significantly correlated with the PHQ-8 score on the entire dataset. Associations between sleep features and the PHQ-8 score varied across different sites, possibly due to the difference in the populations.

Submitted 27 September, 2020; originally announced September 2020.

arXiv:2009.09648 [pdf]

Measuring the effect of Non-Pharmaceutical Interventions (NPIs) on mobility during the COVID-19 pandemic using global mobility data

Authors: Berber T Snoeijer, Mariska Burger, Shaoxiong Sun, Richard JB Dobson, Amos A Folarin

Abstract: The implementation of governmental Non-Pharmaceutical Interventions (NPIs) has been the primary means of controlling the spread of the COVID-19 disease. The intended effect of these NPIs has been to reduce mobility. A strong reduction in mobility is believed to have a positive effect on the reduction of COVID-19 transmission by limiting the opportunity for the virus to spread in the population. Due to the huge costs of implementing these NPIs, it is essential to have a good understanding of their efficacy. Using global mobility data, released by Apple and Google, and ACAPS NPI data, we investigate the proportional contribution of NPIs to i) the size of the change (magnitude) of the transition between pre- and post-lockdown mobility levels and ii) the rate (gradient) of this transition. Using generalized linear models to find the best-fit model, we found similar results using Apple or Google data. NPIs found to impact the magnitude of the change in mobility were: lockdown measures (Apple, Google Retail and Recreation (RAR) and Google Transit and Stations (TS)), declaring a state of emergency (Apple, Google RAR and Google TS), closure of businesses and public services (Google RAR) and school closures (Apple). Using cluster analysis and chi-square tests, we found that closure of businesses and public services, school closures and limiting public gatherings as well as border closures and international flight suspensions were closely related. The implementation of lockdown measures and limiting public gatherings had the greatest effect on the rate of mobility change.
In conclusion, we were able to quantitatively assess the efficacy of NPIs in reducing mobility, which enables us to understand their fine-grained effects in a timely manner and therefore facilitate well-informed and cost-effective interventions.

Submitted 21 September, 2020; originally announced September 2020.

Comments: 16 pages, 6 figures

arXiv:2009.00133 [pdf]

Unsupervised and Supervised Structure Learning for Protein Contact Prediction

Authors: Siqi Sun

Abstract: Protein contacts provide key information for the understanding of protein structure and function, and therefore contact prediction from sequences is an important problem. Recent research shows that some correctly predicted long-range contacts could help topology-level structure modeling. Thus, contact prediction and contact-assisted protein folding also prove the importance of this problem. In this thesis, I will briefly introduce the extant related work, then show how to establish the contact prediction through unsupervised graphical models with topology constraints. Further, I will explain how to use supervised deep learning methods to further boost the accuracy of contact prediction. Finally, I will propose a scoring system called diversity score to measure the novelty of contact predictions, as well as an algorithm that predicts contacts with respect to the new scoring system.

Submitted 31 August, 2020; originally announced September 2020.

Comments: PhD Thesis

arXiv:2007.14585 [pdf]

On the Transcriptomic Signature and General Stress State Associated with Aneuploidy

Authors: Hung-Ji Tsai, Anjali R. Nelliat, Andrei Kucharavy, Mohammad Ikbal Choudhury, Sean X. Sun, Michael C. Schatz, Rong Li

Abstract: Whether aneuploid cells with diverse karyotypes have any properties in common has been a subject of intense interest. A recent study by Terhorst et al. (1) reinvestigated the common aneuploidy gene expression (CAGE), disputing the conclusion of our recent work (2). In this short article, which has been submitted to PNAS as a Letter to the Editor, we explain our major concerns about Terhorst et al. and why we believe that our previous conclusion remains valid.

Submitted 28 July, 2020; originally announced July 2020.

Comments: 1 page, no figure, with new analyses (a letter to PNAS Editor)

arXiv:2006.04480 [pdf]

doi 10.1080/19466315.2020.1785543

Assessing the Impact of COVID-19 on the Objective and Analysis of Oncology Clinical Trials -- Application of the Estimand Framework

Authors: Evgeny Degtyarev, Kaspar Rufibach, Yue Shentu, Godwin Yung, Michelle Casey, Stefan Englert, Feng Liu, Yi Liu, Oliver Sailer, Jonathan Siegel, Steven Sun, Rui Tang, Jiangxiu Zhou

Abstract: The COVID-19 outbreak has rapidly evolved into a global pandemic. The impact of COVID-19 on patient journeys in oncology represents a new risk to the interpretation of trial results and their broad applicability for future clinical practice. We identify key intercurrent events that may occur due to COVID-19 in oncology clinical trials with a focus on time-to-event endpoints and discuss considerations pertaining to the other estimand attributes introduced in the ICH E9 addendum. We propose strategies to handle COVID-19 related intercurrent events, depending on their relationship with malignancy and treatment and the interpretability of data after them. We argue that the clinical trial objective from a world without the COVID-19 pandemic remains valid. The estimand framework provides a common language to discuss the impact of COVID-19 in a structured and transparent manner. This demonstrates that the applicability of the framework may even go beyond what it was initially intended for.

Submitted 21 June, 2020; v1 submitted 8 June, 2020; originally announced June 2020.

Comments: Paper written on behalf of the industry working group on estimands in oncology (www.oncoestimand.org). Accepted for publication in a special issue of Statistics in Biopharmaceutical Research

Journal ref: Statistics in Biopharmaceutical Research, 2020, 12(4), 427-437

arXiv:2004.14331 [pdf]

doi 10.2196/19992

Using smartphones and wearable devices to monitor behavioural changes during COVID-19

Authors: Shaoxiong Sun, Amos Folarin, Yatharth Ranjan, Zulqarnain Rashid, Pauline Conde, Callum Stewart, Nicholas Cummins, Faith Matcham, Gloria Dalla Costa, Sara Simblett, Letizia Leocani, Per Soelberg Sørensen, Mathias Buron, Ana Isabel Guerrero, Ana Zabalza, Brenda WJH Penninx, Femke Lamers, Sara Siddi, Josep Maria Haro, Inez Myin-Germeys, Aki Rintala, Til Wykes, Vaibhav A. Narayan, Giancarlo Comi, Matthew Hotopf, et al. (1 additional author not shown)

Abstract: We aimed to explore the utility of the recently developed open-source mobile health platform RADAR-base as a toolbox to rapidly test the effect and response to NPIs aimed at limiting the spread of COVID-19. We analysed data extracted from smartphone and wearable devices and managed by RADAR-base from 1062 participants recruited in Italy, Spain, Denmark, the UK, and the Netherlands. We derived nine features on a daily basis including time spent at home, maximum distance travelled from home, maximum number of Bluetooth-enabled nearby devices (as a proxy for physical distancing), step count, average heart rate, sleep duration, bedtime, phone unlock duration, and social app use duration. We performed Kruskal-Wallis tests followed by post-hoc Dunn's tests to assess differences in these features among baseline, pre-, and during-lockdown periods. We also studied behavioural differences by age, gender, body mass index (BMI), and educational background. We were able to quantify expected changes in time spent at home, distance travelled, and the number of nearby Bluetooth-enabled devices between pre- and during-lockdown periods. We saw reduced sociality as measured through mobility features, and increased virtual sociality through phone usage. People were more active on their phones, spending more time using social media apps, particularly around major news events. Furthermore, participants had a lower heart rate, went to bed later, and slept more. We also found that young people had longer homestay than older people during lockdown and fewer daily steps.
Although there was no significant difference between the high and low BMI groups in time spent at home, the low BMI group walked more. RADAR-base can be used to rapidly quantify and provide a holistic view of behavioural changes in response to public health interventions as a result of infectious outbreaks such as COVID-19.

Submitted 22 July, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

arXiv:2003.12232 [pdf, other]

$α$-Satellite: An AI-driven System and Benchmark Datasets for Hierarchical Community-level Risk Assessment to Help Combat COVID-19

Authors: Yanfang Ye, Shifu Hou, Yujie Fan, Yiyue Qian, Yiming Zhang, Shiyu Sun, Qian Peng, Kenneth Laparo

Abstract: The novel coronavirus and its deadly outbreak have posed grand challenges to human society: as of March 26, 2020, there have been 85,377 confirmed cases and 1,293 reported deaths in the United States; and the World Health Organization (WHO) characterized coronavirus disease (COVID-19) - which has infected more than 531,000 people with more than 24,000 deaths in at least 171 countries - a global pandemic. A growing number of areas reporting local sub-national community transmission would represent a significant turn for the worse in the battle against the novel coronavirus, which points to an urgent need for expanded surveillance so we can better understand the spread of COVID-19 and thus better respond with actionable strategies for community mitigation. By advancing capabilities of artificial intelligence (AI) and leveraging the large-scale and real-time data generated from heterogeneous sources (e.g., disease-related data from official public health organizations, demographic data, mobility data, and user-generated data from social media), in this work, we propose and develop an AI-driven system (named $α$-Satellite), as an initial offering, to provide hierarchical community-level risk assessment to assist with the development of strategies for combating the fast-evolving COVID-19 pandemic.
More specifically, given a specific location (either user input or automatic positioning), the developed system will automatically provide risk indexes associated with it in a hierarchical manner (e.g., state, county, city, specific location) to enable individuals to select appropriate actions for protection while minimizing disruptions to daily life to the extent possible. The developed system and the generated benchmark datasets have been made publicly accessible through our website. The system description and disclaimer are also available on our website.

Submitted 27 March, 2020; originally announced March 2020.

arXiv:2003.05507 [pdf]

doi 10.18535/ijmsci/v7i08.06

Pathogen Infection Recovery Probability (PIRP) Versus Proinflammatory Anti-Pathogen Species (PIAPS) Levels: Modelling and Therapeutic Strategies

Authors: Sam-Shajing Sun

Abstract: The current COVID-19 pandemic is spreading rapidly worldwide, and it may become one of the largest pandemic events in modern history if it gets out of control. It appears that most deaths resulting from SARS-CoV-2 infection are mainly due to dysfunction or failure of the lung or of multiple organs that could be attributed to host immunodysfunctions, particularly hyperinflammatory-type disorders. In this brief review and study, a mathematical model is proposed to correlate the Pathogen Infection Recovery Probability (PIRP) with Proinflammatory Anti-Pathogen Species (PIAPS) levels within a host unit, where a maximum PIRP is exhibited when the PIAPS levels are equal to or around PIAPS equilibrium levels at the pathogen elimination or clearance onset. Based on this model, rational or effective therapeutic strategies at the right stages or timing, with the right type of agents (immuno-stimulators or immuno-suppressors), and the right dosages, may be designed and implemented that are expected to effectively achieve maximum PIRP or reduce mortality.

Submitted 5 April, 2020; v1 submitted 11 March, 2020; originally announced March 2020.

Comments: 8 pages, 2 figures, 1 equation

Journal ref: Int. J. Med. Sci. Clin. Inv., Vol. 7 No. 08 (2020) | Page No.: 4925-4930

arXiv:2002.09283 [pdf]

doi 10.1038/s41597-022-01211-x

MODMA dataset: a Multi-modal Open Dataset for Mental-disorder Analysis

Authors: Hanshu Cai, Yiwen Gao, Shuting Sun, Na Li, Fuze Tian, Han Xiao, Jianxiu Li, Zhengwu Yang, Xiaowei Li, Qinglin Zhao, Zhenyu Liu, Zhijun Yao, Minqiang Yang, Hong Peng, Jing Zhu, Xiaowei Zhang, Guoping Gao, Fang Zheng, Rui Li, Zhihua Guo, Rong Ma, Jing Yang, Lan Zhang, Xiping Hu, Yumin Li, et al. (1 additional author not shown)

Abstract: According to the World Health Organization, the number of mental disorder patients, especially depression patients, has grown rapidly and become a leading contributor to the global burden of disease. However, the present common practice of depression diagnosis is based on interviews and clinical scales carried out by doctors, which is not only labor-consuming but also time-consuming. One important reason is the lack of physiological indicators for mental disorders. With the rise of tools such as data mining and artificial intelligence, using physiological data to explore new possible physiological indicators of mental disorder and creating new applications for mental disorder diagnosis has become a new research hot topic. However, good quality physiological data for mental disorder patients are hard to acquire. We present a multi-modal open dataset for mental-disorder analysis. The dataset includes EEG and audio data from clinically depressed patients and matching normal controls. All our patients were carefully diagnosed and selected by professional psychiatrists in hospitals. The EEG dataset includes not only data collected using a traditional 128-electrode elastic cap, but also data from a novel wearable 3-electrode EEG collector for pervasive applications. The 128-electrode EEG signals of 53 subjects were recorded both in resting state and under stimulation; the 3-electrode EEG signals of 55 subjects were recorded in resting state; the audio data of 52 subjects were recorded during interviewing, reading, and picture description.
We encourage other researchers in the field to use it for testing their methods of mental-disorder analysis.

Submitted 4 March, 2020; v1 submitted 20 February, 2020; originally announced February 2020.

Journal ref: Sci Data 9, 178 (2022)

arXiv:2001.02844 [pdf, other]

doi 10.1126/sciadv.aba9636

Real-time nanodiamond thermometry probing in-vivo thermogenic responses

Authors: Masazumi Fujiwara, Simo Sun, Alexander Dohms, Yushi Nishimura, Ken Suto, Yuka Takezawa, Keisuke Oshimi, Li Zhao, Nikola Sadzak, Yumi Umehara, Yoshio Teki, Naoki Komatsu, Oliver Benson, Yutaka Shikano, Eriko Kage-Nakadai

Abstract: Real-time temperature monitoring inside living organisms provides a direct measure of their biological activities, such as homeostatic thermoregulation and energy metabolism. However, it is challenging to reduce the size of bio-compatible thermometers down to submicrometers despite their potential applications for the thermal imaging of subtissue structures with single-cell resolution. Light-emitting nanothermometers that remotely sense temperature via optical signals exhibit considerable potential in such in-vivo high-spatial-resolution thermometry. Here, using quantum nanothermometers based on optically accessible electron spins in nanodiamonds (NDs), we demonstrate in-vivo real-time temperature monitoring inside Caenorhabditis elegans (C. elegans) worms. We developed a thermometry system that can measure the temperatures of movable NDs inside live adult worms with a precision of $\pm 0.22^{\circ}{\rm C}$. Using this system, we determined the increase in temperature based on the thermogenic responses of the worms during the chemical stimuli of mitochondrial uncouplers. Our technique demonstrates sub-micrometer localization of real-time temperature information in living animals and direct identification of their pharmacological thermogenesis. The results obtained facilitate the development of a method to probe subcellular temperature variation inside living organisms and may allow for quantification of their biological activities based on their energy expenditures.

Submitted 16 January, 2020; v1 submitted 9 January, 2020; originally announced January 2020.

Comments: 9 + 10 pages, 4 + 11 figures, our submission is jointly with the paper arXiv:2001.02664

Journal ref: Science Advances 6, eaba9636 (2020)

arXiv:1906.11196 [pdf, other]

Seq-SetNet: Exploring Sequence Sets for Inferring Structures

Authors: Fusong Ju, Jianwei Zhu, Guozheng Wei, Qi Zhang, Shiwei Sun, Dongbo Bu

Abstract: A sequence set is a widely used type of data source in a large variety of fields. A typical example is protein structure prediction, which takes a multiple sequence alignment (MSA) as input and aims to infer structural information from it. Almost all of the existing approaches exploit MSAs in an indirect fashion, i.e., they transform MSAs into position-specific scoring matrices (PSSM) that represent the distribution of amino acid types at each column. PSSM could capture column-wise characteristics of MSA; however, the column-wise characteristics embedded in each individual component sequence were nearly totally neglected. The drawback of PSSM is rooted in the fact that an MSA is essentially an unordered sequence set rather than a matrix. Specifically, the interchange of any two sequences will not affect the whole MSA. In contrast, the pixels in an image essentially form a matrix since any two rows of pixels cannot be interchanged. Therefore, the traditional deep neural networks designed for image processing cannot be directly applied to sequence sets. Here, we propose a novel deep neural network framework (called Seq-SetNet) for sequence set processing. By employing a symmetric function module to integrate features calculated from preceding layers, Seq-SetNet is immune to the order of sequences in the input MSA. This advantage enables us to directly and fully exploit MSAs by considering each component protein individually. We evaluated Seq-SetNet by using it to extract structural information from MSA for protein secondary structure prediction.
Experimental results on popular benchmark sets suggest that Seq-SetNet outperforms the state-of-the-art approaches by 3.6% in precision. These results clearly suggest the advantages of Seq-SetNet in sequence set processing, and it can be readily used in a wide range of fields, such as natural language processing.

Submitted 6 June, 2019; originally announced June 2019.
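
The order-invariance argument in the Seq-SetNet abstract can be illustrated with a small sketch. It is not the authors' architecture: the one-hot encoder below is a hypothetical stand-in for the learned per-sequence layers, and mean pooling is assumed as one possible symmetric function. The names `encode_sequence` and `symmetric_pool` are illustrative only.

```python
from typing import List, Sequence

# 20 amino acids plus the gap symbol used in alignments.
AMINO = "ACDEFGHIKLMNPQRSTVWY-"


def encode_sequence(seq: str) -> List[List[float]]:
    """Stand-in per-sequence feature extractor: one-hot vector per column."""
    return [[1.0 if aa == ch else 0.0 for aa in AMINO] for ch in seq]


def symmetric_pool(msa: Sequence[str]) -> List[List[float]]:
    """Combine per-sequence features with a mean, which ignores sequence order."""
    encoded = [encode_sequence(s) for s in msa]
    n, length = len(encoded), len(encoded[0])
    return [
        [sum(e[col][k] for e in encoded) / n for k in range(len(AMINO))]
        for col in range(length)
    ]


# Toy alignment of three sequences with four columns (made-up data).
msa = ["ACD-", "ACE-", "GCDK"]
pooled = symmetric_pool(msa)
shuffled = symmetric_pool([msa[2], msa[0], msa[1]])
print(pooled == shuffled)
```

Because the mean is a symmetric function, reordering the sequences leaves the pooled features unchanged, which is the property the abstract attributes to the symmetric function module.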

arXiv:1903.06113 [pdf, other]

Who and When to Screen: Multi-Round Active Screening for Recurrent Infectious Diseases Under Uncertainty

Authors: Han-Ching Ou, Arunesh Sinha, Sze-Chuan Suen, Andrew Perrault, Milind Tambe

Abstract: Controlling recurrent infectious diseases is a vital yet complicated problem. In this paper, we propose a novel active screening model (ACTS) and algorithms to facilitate active screening for recurrent diseases (no permanent immunity) under infection uncertainty. Our contributions are: (1) A new approach to modeling multi-round network-based screening/contact tracing under uncertainty, which is a common real-life practice in a variety of diseases; (2) Two novel algorithms, Full- and Fast-REMEDY. Full-REMEDY considers the effect of future actions and finds a policy that provides high solution quality, while Fast-REMEDY scales linearly in the size of the network; (3) We evaluate Full- and Fast-REMEDY on several real-world datasets which emulate human contact and find that they control diseases better than the baselines. To the best of our knowledge, this is the first work on multi-round active screening with uncertainty for diseases with no permanent immunity.

Submitted 13 March, 2019; originally announced March 2019.

Comments: 11 pages

arXiv:1902.07787 [pdf]

Biophysics at the coffee shop: lessons learned working with George Oster

Authors: Oleg Igoshin, Jing Chen, Jianhua Xing, Jian Liu, Timothy C. Elston, Michael Grabe, Kenneth S. Kim, Jasmine Nirody, Padmini Rangamani, Sean Sun, Hongyun Wang, Charles Wolgemuth

Abstract: Over the past 50 years, the use of mathematical models, derived from physical reasoning, to describe molecular and cellular systems has evolved from an art of the few to a cornerstone of biological inquiry. George Oster stood out as a pioneer of this paradigm shift from descriptive to quantitative biology not only through his numerous research accomplishments, but also through the many students and postdocs he mentored over his long career. Those of us fortunate enough to have worked with George agree that his sharp intellect, physical intuition and passion for scientific inquiry not only inspired us as scientists but also greatly influenced the way we conduct research. We would like to share a few important lessons we learned from George in honor of his memory and with the hope that they may inspire future generations of scientists.

Submitted 28 March, 2019; v1 submitted 20 February, 2019; originally announced February 2019.

Comments: 22 pages, 3 figures, accepted in Molecular Biology of the Cell

arXiv:1902.03978 [pdf]

doi 10.1038/s41592-019-0591-8

A complete data processing workflow for CryoET and subtomogram averaging

Authors: Muyuan Chen, James M. Bell, Xiaodong Shi, Stella Y. Sun, Zhao Wang, Steven J. Ludtke

Abstract: Electron cryotomography (CryoET) is currently the only method capable of visualizing cells in 3D at nanometer resolutions. While modern instruments produce massive amounts of tomography data containing extremely rich structural information, the data processing is very labor intensive and results are often limited by the skills of the personnel rather than the data. We present an integrated workflow that covers the entire tomography data processing pipeline, from automated tilt series alignment to subnanometer resolution subtomogram averaging. This workflow greatly reduces human effort and increases throughput, and is capable of determining protein structures at state-of-the-art resolutions for both purified macromolecules and cells.

Submitted 11 February, 2019; originally announced February 2019.

Comments: 21 pages, 4+2 figures

Journal ref: Nature Methods 16 (2019) 1161-1168

arXiv:1809.00083 [pdf, other]

Predicting protein inter-residue contacts using composite likelihood maximization and deep learning

Authors: Haicang Zhang, Qi Zhang, Fusong Ju, Jianwei Zhu, Shiwei Sun, Yujuan Gao, Ziwei Xie, Minghua Deng, Shiwei Sun, Wei-Mou Zheng, Dongbo Bu

Abstract: Accurate prediction of inter-residue contacts of a protein is important for calculating its tertiary structure. Analysis of co-evolutionary events among residues has proved effective for inferring inter-residue contacts. The Markov random field (MRF) technique, although widely used for contact prediction, suffers from the following dilemma: the actual likelihood function of MRF is accurate but time-consuming to calculate; in contrast, approximations to the actual likelihood, such as pseudo-likelihood, are efficient to calculate but inaccurate. Thus, how to achieve both accuracy and efficiency simultaneously remains a challenge. In this study, we present such an approach (called clmDCA) for contact prediction. Unlike plmDCA, which uses pseudo-likelihood, i.e., the product of conditional probability of individual residues, our approach uses composite likelihood, i.e., the product of conditional probability of all residue pairs. Composite likelihood has been theoretically proved to be a better approximation to the actual likelihood function than pseudo-likelihood. Meanwhile, composite likelihood is still efficient to maximize, thus ensuring the efficiency of clmDCA. We present comprehensive experiments on popular benchmark datasets, including the PSICOV dataset and the CASP-11 dataset, to show that: i) clmDCA alone outperforms the existing MRF-based approaches in prediction accuracy.
ii) When equipped with a deep learning technique for refinement, the prediction accuracy of clmDCA was further significantly improved, suggesting the suitability of clmDCA for subsequent refinement procedures. We further present successful application of the predicted contacts to accurately build tertiary structures for proteins in the PSICOV dataset. Accessibility: The software clmDCA and a server are publicly accessible through http://protein.ict.ac.cn/clmDCA/.

Submitted 31 August, 2018; originally announced September 2018.
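
The contrast the clmDCA abstract draws between pseudo-likelihood and composite likelihood can be written out as follows. This is a sketch in the conventional textbook form, not an equation copied from the paper; the notation (a sequence $x = (x_1, \dots, x_L)$ of length $L$, MRF parameters $\theta$, and $x_{-i}$ for all positions except $i$) is assumed here for illustration.

```latex
% Pseudo-likelihood: product over individual residues, each conditioned on all others.
\mathcal{L}_{\mathrm{PL}}(\theta) = \prod_{i=1}^{L} P\left(x_i \mid x_{-i};\, \theta\right)

% Pairwise composite likelihood: product over residue pairs,
% each pair conditioned on all remaining positions.
\mathcal{L}_{\mathrm{CL}}(\theta) = \prod_{1 \le i < j \le L} P\left(x_i, x_j \mid x_{-\{i,j\}};\, \theta\right)
```

For an alignment of several sequences, each expression would additionally be multiplied over the sequences. Normalizing a pairwise conditional requires summing over pairs of residue types rather than over all full-length sequences, which is consistent with the abstract's claim that composite likelihood remains efficient to maximize.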

arXiv:1706.09478 [pdf, other]

doi 10.1088/1478-3975/aab0e6

Erg(r)odicity: Hidden Bias and the Growthrate Gain

Authors: Nash Rochman, Dan Popescu, Sean X. Sun

Abstract: Many single-cell observables are highly heterogeneous. A part of this heterogeneity stems from age-related phenomena: the fact that there is a nonuniform distribution of cells with different ages. This has led to a renewed interest in analytic methodologies including use of the "von Foerster equation" for predicting population growth and cell age distributions. Here we discuss how some of the most popular implementations of this machinery assume a strong condition on the ergodicity of the cell cycle duration ensemble. We show that one common definition for the term ergodicity, "a single individual observed over many generations recapitulates the behavior of the entire ensemble" is implied by the other, "the probability of observing any state is conserved across time and over all individuals" in an ensemble with a fixed number of individuals but that this is not true when the ensemble is growing. We further explore the impact of generational correlations between cell cycle durations on the population growth rate. Finally, we explore the "growth rate gain" - the phenomenon that variations in the cell cycle duration lead to an improved population-level growth rate - in this context. We highlight that, fundamentally, this effect is due to asymmetric division.

Submitted 17 April, 2018; v1 submitted 28 June, 2017; originally announced June 2017.

Comments: 17 pages, 4 figures

arXiv:1612.03134 [pdf, other]

"Inchworm Filaments": Motility and Pattern Formation

Authors: Nash Rochman, Sean X. Sun

Abstract: In a previous paper, we examined a class of possible conformations for helically patterned filaments in contact with a bonding surface. In particular, we investigated geometries where contact between the pattern and the surface was improved through a periodic twisting and lifting of the filament. A consequence of this lifting is that the total length of the filament projected onto the surface decreases after bonding. When the bonding character of the surface is actuated, this phenomenon can lead to both lifelike "inchworm" behavior of the filaments and ensemble movement. We illustrate, through simulation, how pattern formation may be achieved through this mechanism.

Submitted 9 December, 2016; originally announced December 2016.

Comments: 8 pages, 4 figures

arXiv:1609.00680 [pdf]

doi 10.1371/journal.pcbi.1005324

Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model

Authors: Sheng Wang, Siqi Sun, Zhen Li, Renyu Zhang, Jinbo Xu

Abstract: Recently exciting progress has been made on protein contact prediction, but the predicted contacts for proteins without many sequence homologs are still of low quality and not very useful for de novo structure prediction. This paper presents a new deep learning method that predicts contacts by integrating both evolutionary coupling (EC) and sequence conservation information through an ultra-deep neural network formed by two deep residual networks. This deep neural network allows us to model very complex sequence-contact relationship as well as long-range inter-contact correlation. Our method greatly outperforms existing contact prediction methods and leads to much more accurate contact-assisted protein folding. Tested on three datasets of 579 proteins, the average top L long-range prediction accuracy obtained by our method, the representative EC method CCMpred and the CASP11 winner MetaPSICOV is 0.47, 0.21 and 0.30, respectively; the average top L/10 long-range accuracy of our method, CCMpred and MetaPSICOV is 0.77, 0.47 and 0.59, respectively. Ab initio folding using our predicted contacts as restraints can yield correct folds (i.e., TMscore>0.6) for 203 test proteins, while that using MetaPSICOV- and CCMpred-predicted contacts can do so for only 79 and 62 proteins, respectively. Further, our contact-assisted models have much better quality than template-based models. Using our predicted contacts as restraints, we can (ab initio) fold 208 of the 398 membrane proteins with TMscore>0.5. By contrast, when the training proteins of our method are used as templates, homology modeling can only do so for 10 of them. One interesting finding is that even if we do not train our prediction models with any membrane proteins, our method works very well on membrane protein prediction. Finally, in the recent blind CAMEO benchmark our method successfully folded 5 test proteins with a novel fold.

Submitted 27 November, 2016; v1 submitted 2 September, 2016; originally announced September 2016.

Journal ref: PLoS Comput Biol 13(1): e1005324, 2017

arXiv:1603.01579 [pdf, other]

To Grow is Not Enough: Impact of Noise on Cell Environmental Response and Fitness

Authors: Nash Rochman, Fangwei Si, Sean X. Sun

Abstract: Quantitative single cell measurements have shown that cell cycle duration (the time between cell divisions) for diverse cell types is a noisy variable. The underlying distribution is mean scalable with a universal shape for many cell types in a variety of environments. Here we show through both experiment and theory that increasing the amount of noise in the regulation of the cell cycle negatively impacts the growth rate but positively correlates with improved cellular response to fluctuating environments. Our findings suggest that even non-cooperative cells in exponential growth phase do not optimize fitness through growth rate alone, but also optimize adaptability to changing conditions. In a manner similar to genetic evolution, increasing the noise in biochemical processes correlates with improved response of the system to environmental changes.

Submitted 4 March, 2016; originally announced March 2016.

Comments: 5 pages, 4 figures

arXiv:1511.09181 [pdf, other]

Predicting diverse M-best protein contact maps

Authors: Siqi Sun, Jianzhu Ma, Sheng Wang, Jinbo Xu

Abstract: Protein contacts contain important information for protein structure and functional study, but contact prediction from sequence information remains very challenging. Recently evolutionary coupling (EC) analysis, which predicts contacts by detecting co-evolved residues (or columns) in a multiple sequence alignment (MSA), has made good progress due to better statistical assessment techniques and high-throughput sequencing. Existing EC analysis methods predict only a single contact map for a given protein, which may have low accuracy especially when the protein under prediction does not have a large number of sequence homologs. Analogous to ab initio folding that usually predicts a few possible 3D models for a given protein sequence, this paper presents a novel structure learning method that can predict a set of diverse contact maps for a given protein sequence, in which the best solution usually has much better accuracy than the first one. Our experimental tests show that for many test proteins, the best out of 5 solutions generated by our method has accuracy at least 0.1 better than the first one when the top L/5 or L/10 (L is the sequence length) predicted long-range contacts are evaluated, especially for protein families with a small number of sequence homologs. Our best solutions also have better quality than those generated by the two popular EC methods Evfold and PSICOV.

Submitted 30 November, 2015; originally announced November 2015.

Comments: Accepted as oral presentation at Computational Structural Bioinformatics Workshop (In Conjunction With IEEE BIBM 2015)

Showing 1–47 of 47 results for author: Sun, S