-
Enhancing TCR-Peptide Interaction Prediction with Pretrained Language Models and Molecular Representations
Authors:
Cong Qi,
Hanzhang Fang,
Siqi Jiang,
Tianxing Hu,
Wei Zhi
Abstract:
Understanding the binding specificity between T-cell receptors (TCRs) and peptide-major histocompatibility complexes (pMHCs) is central to immunotherapy and vaccine development. However, current predictive models struggle with generalization, especially in data-scarce settings and when faced with novel epitopes. We present LANTERN (Large lAnguage model-powered TCR-Enhanced Recognition Network), a deep learning framework that combines large-scale protein language models with chemical representations of peptides. By encoding TCR β-chain sequences using ESM-1b and transforming peptide sequences into SMILES strings processed by MolFormer, LANTERN captures rich biological and chemical features critical for TCR-peptide recognition. Through extensive benchmarking against existing models such as ChemBERTa, TITAN, and NetTCR, LANTERN demonstrates superior performance, particularly in zero-shot and few-shot learning scenarios. Our model also benefits from a robust negative sampling strategy and shows significant clustering improvements via embedding analysis. These results highlight the potential of LANTERN to advance TCR-pMHC binding prediction and support the development of personalized immunotherapies.
Submitted 22 April, 2025;
originally announced May 2025.
-
Bidirectional Mamba for Single-Cell Data: Efficient Context Learning with Biological Fidelity
Authors:
Cong Qi,
Hanzhang Fang,
Tianxing Hu,
Siqi Jiang,
Wei Zhi
Abstract:
Single-cell RNA sequencing (scRNA-seq) enables high-resolution analysis of cellular heterogeneity, but its complexity, marked by high dimensionality, sparsity, and batch effects, poses major computational challenges. Transformer-based models have made significant advances in this domain but are often limited by their quadratic complexity and suboptimal handling of long-range dependencies. In this work, we introduce GeneMamba, a scalable and efficient foundation model for single-cell transcriptomics built on state space modeling. Leveraging the Bi-Mamba architecture, GeneMamba captures bidirectional gene context with linear-time complexity, offering substantial computational gains over transformer baselines. The model is pretrained on nearly 30 million cells and incorporates biologically informed objectives, including pathway-aware contrastive loss and rank-based gene encoding. We evaluate GeneMamba across diverse tasks, including multi-batch integration, cell type annotation, and gene-gene correlation, demonstrating strong performance, interpretability, and robustness. These results position GeneMamba as a practical and powerful alternative to transformer-based methods, advancing the development of biologically grounded, scalable tools for large-scale single-cell data analysis.
Submitted 22 April, 2025;
originally announced April 2025.
-
A neuromorphic camera for tracking passive and active matter with lower data throughput
Authors:
Gabriel Britto Monteiro,
Megan Lim,
Tiffany Cheow Yuen Tan,
Avinash Upadhya,
Zhuo Liang,
Benjamin Agnew,
Tomonori Hu,
Benjamin J. Eggleton,
Christopher Perrella,
Kylie Dunning,
Kishan Dholakia
Abstract:
We demonstrate the merits of using a neuromorphic, or event-based camera (EBC), for tracking of both passive and active matter. For passive matter, we tracked the Brownian motion of different micro-particles and estimated their diffusion coefficient. For active matter, we explored the case of tracking murine spermatozoa and extracted motility parameters from the motion of cells. This has applications in enhancing outcomes for clinical fertility treatments. Using the EBC, we obtain results equivalent to those from an sCMOS camera, yet achieve a reduction in file size of up to two orders of magnitude. This is important in the modern computer era, as it reduces data throughput, and is well-aligned with edge-computing applications. We believe the EBC is an excellent choice, particularly for long-term studies of active matter.
Submitted 14 January, 2025; v1 submitted 13 January, 2025;
originally announced January 2025.
-
Coronary CTA and Quantitative Cardiac CT Perfusion (CCTP) in Coronary Artery Disease
Authors:
Hao Wu,
Yingnan Song,
Ammar Hoori,
Ananya Subramaniam,
Juhwan Lee,
Justin Kim,
Tao Hu,
Sadeer Al-Kindi,
Wei-Ming Huang,
Chun-Ho Yun,
Chung-Lieh Hung,
Sanjay Rajagopalan,
David L. Wilson
Abstract:
We assessed the benefit of combining stress cardiac CT perfusion (CCTP) myocardial blood flow (MBF) with coronary CT angiography (CCTA) using our innovative CCTP software. By combining CCTA and CCTP, one can uniquely identify a flow-limiting stenosis (obstructive-lesion + low-MBF) versus MVD (no-obstructive-lesion + low-MBF). We retrospectively evaluated 104 patients with suspected CAD, including 18 with diabetes, who underwent CCTA+CCTP. Whole-heart and territorial MBF were assessed using our automated pipeline for CCTP analysis that included beam hardening correction; temporal scan registration; automated segmentation; fast, accurate, robust MBF estimation; and visualization. Stenosis severity was scored using the CCTA coronary-artery-disease-reporting-and-data-system (CAD-RADS), with obstructive stenosis deemed as CAD-RADS>=3. We established a threshold MBF (MBF=199-mL/min-100g) for normal perfusion. In patients with CAD-RADS>=3, 28/37 (76%) patients showed ischemia in the corresponding territory. Two patients with obstructive disease had normal perfusion, suggesting collaterals and/or a hemodynamically insignificant stenosis. Among diabetics, 10 of 18 (56%) demonstrated diffuse ischemia consistent with MVD. Among non-diabetics, only 6% had MVD. Sex-specific prevalence of MVD was 21%/24% (M/F). On a per-vessel basis (n=256), MBF showed a significant difference between territories with and without obstructive stenosis (165 +/- 61 mL/min-100g vs. 274 +/- 62 mL/min-100g, p<0.05). A significant and negative rank correlation (rho=-0.53, p<0.05) between territory MBF and CAD-RADS was seen. CCTA in conjunction with a new automated quantitative CCTP approach can augment the interpretation of CAD, enabling the distinction of ischemia due to obstructive lesions and MVD.
Submitted 30 January, 2024;
originally announced January 2024.
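The abstract above describes a two-factor rule: stenosis severity (CAD-RADS>=3 counts as obstructive) combined with territory MBF relative to the 199 mL/min-100g threshold. A minimal Python sketch of that rule follows. The function name, the label strings, and the choice of a strict `<` comparison at the threshold are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of the CCTA + CCTP rule summarized in the abstract.
# Only the two cut-offs come from the abstract; everything else is assumed.

MBF_NORMAL_THRESHOLD = 199.0   # mL/min-100g, threshold quoted in the abstract
OBSTRUCTIVE_CAD_RADS = 3       # CAD-RADS >= 3 is deemed obstructive


def classify_territory(cad_rads: int, mbf: float) -> str:
    """Label one coronary territory from its CAD-RADS score and MBF."""
    obstructive = cad_rads >= OBSTRUCTIVE_CAD_RADS
    low_flow = mbf < MBF_NORMAL_THRESHOLD  # strict '<' is an assumption

    if obstructive and low_flow:
        return "flow-limiting stenosis"
    if low_flow:
        return "low MBF without obstructive lesion (consistent with MVD)"
    if obstructive:
        return "obstructive lesion with normal perfusion"
    return "normal"


if __name__ == "__main__":
    # Example values chosen near the group means reported in the abstract.
    print(classify_territory(cad_rads=4, mbf=165.0))
    print(classify_territory(cad_rads=1, mbf=274.0))
```

The third branch corresponds to the abstract's two patients with obstructive disease but normal perfusion, which the authors attribute to collaterals and/or a hemodynamically insignificant stenosis.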
-
AI prediction of cardiovascular events using opportunistic epicardial adipose tissue assessments from CT calcium score
Authors:
Tao Hu,
Joshua Freeze,
Prerna Singh,
Justin Kim,
Yingnan Song,
Hao Wu,
Juhwan Lee,
Sadeer Al-Kindi,
Sanjay Rajagopalan,
David L. Wilson,
Ammar Hoori
Abstract:
Background: Recent studies have used basic epicardial adipose tissue (EAT) assessments (e.g., volume and mean HU) to predict risk of atherosclerosis-related, major adverse cardiovascular events (MACE). Objectives: Create novel, hand-crafted EAT features, 'fat-omics', to capture the pathophysiology of EAT and improve MACE prediction. Methods: We segmented EAT using a previously-validated deep learning method with optional manual correction. We extracted 148 radiomic features (morphological, spatial, and intensity) and used Cox elastic-net for feature reduction and prediction of MACE. Results: Traditional fat features gave marginal prediction (EAT-volume/EAT-mean-HU/BMI gave C-index 0.53/0.55/0.57, respectively). Significant improvement was obtained with 15 fat-omics features (C-index=0.69, test set). High-risk features included volume-of-voxels-having-elevated-HU-[-50, -30-HU] and HU-negative-skewness, both of which assess high HU, which has been implicated in fat inflammation. Other high-risk features include kurtosis-of-EAT-thickness, reflecting the heterogeneity of thicknesses, and EAT-volume-in-the-top-25%-of-the-heart, emphasizing adipose near the proximal coronary arteries. Kaplan-Meier plots of Cox-identified, high- and low-risk patients were well separated with the median of the fat-omics risk, with the high-risk group having an HR 2.4 times that of the low-risk group (P<0.001). Conclusion: Preliminary findings indicate an opportunity to use more finely tuned, explainable assessments on EAT for improved cardiovascular risk prediction.
Submitted 29 January, 2024;
originally announced January 2024.
-
Enhancing cardiovascular risk prediction through AI-enabled calcium-omics
Authors:
Ammar Hoori,
Sadeer Al-Kindi,
Tao Hu,
Yingnan Song,
Hao Wu,
Juhwan Lee,
Nour Tashtish,
Pingfu Fu,
Robert Gilkeson,
Sanjay Rajagopalan,
David L. Wilson
Abstract:
Background. Coronary artery calcium (CAC) is a powerful predictor of major adverse cardiovascular events (MACE). Traditional Agatston score simply sums the calcium, albeit in a non-linear way, leaving room for improved calcification assessments that will more fully capture the extent of disease.
Objective. To determine if AI methods using detailed calcification features (i.e., calcium-omics) can improve MACE prediction.
Methods. We investigated additional features of calcification including assessment of mass, volume, density, spatial distribution, territory, etc. We used a Cox model with elastic-net regularization on 2457 CT calcium score (CTCS) enriched for MACE events obtained from a large no-cost CLARIFY program (ClinicalTrials.gov Identifier: NCT04075162). We employed sampling techniques to enhance model training. We also investigated Cox models with selected features to identify explainable high-risk characteristics.
Results. Our proposed calcium-omics model with modified synthetic down sampling and up sampling gave C-index (80.5%/71.6%) and two-year AUC (82.4%/74.8%) for (80:20, training/testing), respectively (sampling was applied to the training set only). Results compared favorably to Agatston which gave C-index (71.3%/70.3%) and AUC (71.8%/68.8%), respectively. Among calcium-omics features, numbers of calcifications, LAD mass, and diffusivity (a measure of spatial distribution) were important determinants of increased risk, with dense calcification (>1000HU) associated with lower risk. The calcium-omics model reclassified 63% of MACE patients to the high risk group in a held-out test. The categorical net-reclassification index was NRI=0.153.
Conclusions. AI analysis of coronary calcification can lead to improved results as compared to Agatston scoring. Our findings suggest the utility of calcium-omics in improved prediction of risk.
Submitted 23 August, 2023;
originally announced August 2023.
-
Phenotype Search Trajectory Networks for Linear Genetic Programming
Authors:
Ting Hu,
Gabriela Ochoa,
Wolfgang Banzhaf
Abstract:
Genotype-to-phenotype mappings translate genotypic variations such as mutations into phenotypic changes. Neutrality is the observation that some mutations do not lead to phenotypic changes. Studying the search trajectories in genotypic and phenotypic spaces, especially through neutral mutations, helps us to better understand the progression of evolution and its algorithmic behaviour. In this study, we visualise the search trajectories of a genetic programming system as graph-based models, where nodes are genotypes/phenotypes and edges represent their mutational transitions. We also quantitatively measure the characteristics of phenotypes including their genotypic abundance (the requirement for neutrality) and Kolmogorov complexity. We connect these quantified metrics with search trajectory visualisations, and find that more complex phenotypes are under-represented by fewer genotypes and are harder for evolution to discover. Less complex phenotypes, on the other hand, are over-represented by genotypes, are easier to find, and frequently serve as stepping-stones for evolution.
Submitted 23 June, 2023; v1 submitted 15 November, 2022;
originally announced November 2022.
-
Exploring the impact of under-reported cases on the COVID-19 spatiotemporal distribution using healthcare worker infection data
Authors:
Peixiao Wang,
Tao Hu,
Hongqiang Liu,
Xinyan Zhu
Abstract:
A timely understanding of the spatiotemporal pattern and development trend of COVID-19 is critical for timely prevention and control. However, the under-reporting of cases is widespread in fields associated with public health. It is also possible to draw biased inferences and formulate inappropriate prevention and control policies if the phenomenon of under-reporting is not taken into account. Therefore, in this paper, a novel framework was proposed to explore the impact of under-reporting on COVID-19 spatiotemporal distributions, and empirical analysis was carried out using infection data of healthcare workers in Wuhan and Hubei (excluding Wuhan). The results show that (1) the lognormal distribution was the most suitable to describe the evolution of the epidemic with time; (2) the estimated peak infection time of the reported cases lagged the peak infection time of the healthcare worker cases, and the estimated infection time interval of the reported cases was smaller than that of the healthcare worker cases; (3) the impact of under-reporting cases on the early stages of the pandemic was greater than that on its later stages, and the impact on the early onset area was greater than that on the late onset area; and (4) although the number of reported cases was lower than the actual number of cases, a high spatial correlation existed between the cumulatively reported cases and healthcare worker cases. The proposed framework of this study is highly extensible, and relevant researchers can use data sources from other counties to carry out similar research.
Submitted 20 January, 2022; v1 submitted 9 November, 2020;
originally announced November 2020.
-
Taking the pulse of COVID-19: A spatiotemporal perspective
Authors:
Chaowei Yang,
Dexuan Sha,
Qian Liu,
Yun Li,
Hai Lan,
Weihe Wendy Guan,
Tao Hu,
Zhenlong Li,
Zhiran Zhang,
John Hoot Thompson,
Zifu Wang,
David Wong,
Shiyang Ruan,
Manzhu Yu,
Douglas Richardson,
Luyao Zhang,
Ruizhi Hou,
You Zhou,
Cheng Zhong,
Yifei Tian,
Fayez Beaini,
Kyla Carte,
Colin Flynn,
Wei Liu,
Dieter Pfoser
, et al. (10 additional authors not shown)
Abstract:
The sudden outbreak of the Coronavirus disease (COVID-19) swept across the world in early 2020, triggering the lockdowns of several billion people across many countries, including China, Spain, India, the U.K., Italy, France, Germany, and most states of the U.S. The transmission of the virus accelerated rapidly with the most confirmed cases in the U.S., and New York City became an epicenter of the pandemic by the end of March. In response to this national and global emergency, the NSF Spatiotemporal Innovation Center brought together a taskforce of international researchers and assembled implemented strategies to rapidly respond to this crisis, for supporting research, saving lives, and protecting the health of global citizens. This perspective paper presents our collective view on the global health emergency and our effort in collecting, analyzing, and sharing relevant data on global policy and government responses, geospatial indicators of the outbreak and evolving forecasts; in developing research capabilities and mitigation measures with global scientists, promoting collaborative research on outbreak dynamics, and reflecting on the dynamic responses from human societies.
Submitted 8 May, 2020;
originally announced May 2020.
-
Screening for REM Sleep Behaviour Disorder with Minimal Sensors
Authors:
Navin Cooray,
Fernando Andreotti,
Christine Lo,
Mkael Symmonds,
Michele T. M. Hu,
Maarten De Vos
Abstract:
Rapid-Eye-Movement (REM) sleep behaviour disorder (RBD) is an early predictor of Parkinson's disease, dementia with Lewy bodies, and multiple system atrophy. This study investigates a minimal set of sensors to achieve effective screening for RBD in the population, integrating automated sleep staging (three-state) followed by RBD detection without the need for cumbersome electroencephalogram (EEG) sensors. Polysomnography signals from 50 participants with RBD and 50 age-matched healthy controls were used to evaluate this study. Three-stage sleep classification was achieved using a Random Forest (RF) classifier and features derived from a combination of cost-effective and easy-to-use sensors, namely electrocardiogram (ECG), electrooculogram (EOG), and electromyogram (EMG) channels. Subsequently, RBD detection was achieved using established and new metrics derived from ECG and EMG metrics. The EOG and EMG combination provided the best minimalist fully automated performance, achieving $0.57\pm0.19$ kappa (3 stage) for sleep staging and an RBD detection accuracy of $0.90\pm0.11$ (sensitivity and specificity of $0.88\pm0.13$ and $0.92\pm0.098$, respectively). A single ECG sensor allowed three-state sleep staging with $0.28\pm0.06$ kappa and RBD detection accuracy of $0.62\pm0.10$. This study demonstrated the feasibility of using signals from a single EOG and EMG sensor to detect RBD using fully-automated techniques. This study proposes a cost-effective, practical, and simple RBD identification support tool using only two sensors (EMG and EOG), ideal for screening purposes.
Submitted 24 October, 2019;
originally announced October 2019.
-
Detection of REM Sleep Behaviour Disorder by Automated Polysomnography Analysis
Authors:
Navin Cooray,
Fernando Andreotti,
Christine Lo,
Mkael Symmonds,
Michele T. M. Hu,
Maarten De Vos
Abstract:
Evidence suggests Rapid-Eye-Movement (REM) Sleep Behaviour Disorder (RBD) is an early predictor of Parkinson's disease. This study proposes a fully-automated framework for RBD detection consisting of automated sleep staging followed by RBD identification. Analysis was assessed using a limited polysomnography montage from 53 participants with RBD and 53 age-matched healthy controls. Sleep stage classification was achieved using a Random Forest (RF) classifier and 156 features extracted from electroencephalogram (EEG), electrooculogram (EOG) and electromyogram (EMG) channels. For RBD detection, an RF classifier was trained combining established techniques to quantify muscle atonia with additional features that incorporate sleep architecture and the EMG fractal exponent. Automated multi-state sleep staging achieved a 0.62 Cohen's Kappa score. RBD detection accuracy improved by 10% to 96% (compared to individual established metrics) when using manually annotated sleep staging. Accuracy remained high (92%) when using automated sleep staging. This study outperforms established metrics and demonstrates that incorporating sleep architecture and sleep stage transitions can benefit RBD detection. This study also achieved automated sleep staging with a level of accuracy comparable to manual annotation. This study validates a tractable, fully-automated, and sensitive pipeline for RBD identification that could be translated to wearable take-home technology.
Submitted 12 November, 2018;
originally announced November 2018.
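The abstract above reports sleep-staging agreement as a Cohen's kappa score. As a reminder of what that number measures, here is a small Python sketch of the standard formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The function name and the toy stage labels are illustrative assumptions and are not data from the study.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa between two label sequences of equal, non-zero length."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equally long")

    n = len(labels_a)
    # Observed agreement: fraction of epochs where both annotations match.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of the two marginal label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_expected == 1.0:
        # Both sequences use a single identical label; agreement is trivial.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Toy example: manual vs. automated staging (W = wake, R = REM, N = NREM).
    manual = ["W", "W", "R", "N", "N", "N"]
    automated = ["W", "R", "R", "N", "N", "W"]
    print(round(cohens_kappa(manual, automated), 3))
```

Unlike raw accuracy, kappa discounts the agreement two annotators would reach by chance given how often each stage label occurs.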
-
A Hebbian/Anti-Hebbian Network for Online Sparse Dictionary Learning Derived from Symmetric Matrix Factorization
Authors:
Tao Hu,
Cengiz Pehlevan,
Dmitri B. Chklovskii
Abstract:
Olshausen and Field (OF) proposed that neural computations in the primary visual cortex (V1) can be partially modeled by sparse dictionary learning. By minimizing the regularized representation error they derived an online algorithm, which learns Gabor-filter receptive fields from a natural image ensemble in agreement with physiological experiments. Whereas the OF algorithm can be mapped onto the dynamics and synaptic plasticity in a single-layer neural network, the derived learning rule is nonlocal - the synaptic weight update depends on the activity of neurons other than just pre- and postsynaptic ones - and hence biologically implausible. Here, to overcome this problem, we derive sparse dictionary learning from a novel cost-function - a regularized error of the symmetric factorization of the input's similarity matrix. Our algorithm maps onto a neural network of the same architecture as OF but using only biologically plausible local learning rules. When trained on natural images our network learns Gabor-filter receptive fields and reproduces the correlation among synaptic weights hard-wired in the OF network. Therefore, online symmetric matrix factorization may serve as an algorithmic theory of neural computation.
Submitted 30 November, 2015; v1 submitted 2 March, 2015;
originally announced March 2015.
-
A Hebbian/Anti-Hebbian Neural Network for Linear Subspace Learning: A Derivation from Multidimensional Scaling of Streaming Data
Authors:
Cengiz Pehlevan,
Tao Hu,
Dmitri B. Chklovskii
Abstract:
Neural network models of early sensory processing typically reduce the dimensionality of streaming input data. Such networks learn the principal subspace, in the sense of principal component analysis (PCA), by adjusting synaptic weights according to activity-dependent learning rules. When derived from a principled cost function these rules are nonlocal and hence biologically implausible. At the same time, biologically plausible local rules have been postulated rather than derived from a principled cost function. Here, to bridge this gap, we derive a biologically plausible network for subspace learning on streaming data by minimizing a principled cost function. In a departure from previous work, where cost was quantified by the representation, or reconstruction, error, we adopt a multidimensional scaling (MDS) cost function for streaming data. The resulting algorithm relies only on biologically plausible Hebbian and anti-Hebbian local learning rules. In a stochastic setting, synaptic weights converge to a stationary state which projects the input data onto the principal subspace. If the data are generated by a nonstationary distribution, the network can track the principal subspace. Thus, our result makes a step towards an algorithmic theory of neural computation.
Submitted 2 March, 2015;
originally announced March 2015.
-
A genomic map of the effects of linked selection in Drosophila
Authors:
Eyal Elyashiv,
Shmuel Sattath,
Tina T. Hu,
Alon Strustovsky,
Graham McVicker,
Peter Andolfatto,
Graham Coop,
Guy Sella
Abstract:
Natural selection at one site shapes patterns of genetic variation at linked sites. Quantifying the effects of 'linked selection' on levels of genetic diversity is key to making reliable inference about demography, building a null model in scans for targets of adaptation, and learning about the dynamics of natural selection. Here, we introduce the first method that jointly infers parameters of distinct modes of linked selection, notably background selection and selective sweeps, from genome-wide diversity data, functional annotations and genetic maps. The central idea is to calculate the probability that a neutral site is polymorphic given local annotations, substitution patterns, and recombination rates. Information is then combined across sites and samples using composite likelihood in order to estimate genome-wide parameters of distinct modes of selection. In addition to parameter estimation, this approach yields a map of the expected neutral diversity levels along the genome. To illustrate the utility of our approach, we apply it to genome-wide resequencing data from 125 lines in Drosophila melanogaster and reliably predict diversity levels at the 1Mb scale. Our results corroborate estimates of a high fraction of beneficial substitutions in proteins and untranslated regions (UTRs). They allow us to distinguish between the contribution of sweeps and other modes of selection around amino acid substitutions and to uncover evidence for pervasive sweeps in UTRs. Our inference further suggests a substantial effect of linked selection from non-classic sweeps. More generally, we demonstrate that linked selection has had a larger effect in reducing diversity levels and increasing their variance in D. melanogaster than previously appreciated.
Submitted 23 August, 2016; v1 submitted 23 August, 2014;
originally announced August 2014.
-
Fast Genome-Wide QTL Analysis Using Mendel
Authors:
Hua Zhou,
Jin Zhou,
Tao Hu,
Eric M Sobel,
Kenneth Lange
Abstract:
Pedigree GWAS (Option 29) in the current version of the Mendel software is an optimized subroutine for performing large scale genome-wide QTL analysis. This analysis (a) works for random sample data, pedigree data, or a mix of both, (b) is highly efficient in both run time and memory requirement, (c) accommodates both univariate and multivariate traits, (d) works for autosomal and X-linked loci, (e) correctly deals with missing data in traits, covariates, and genotypes, (f) allows for covariate adjustment and constraints among parameters, (g) uses either theoretical or SNP-based empirical kinship matrix for additive polygenic effects, (h) allows extra variance components such as dominant polygenic effects and household effects, (i) detects and reports outlier individuals and pedigrees, and (j) allows for robust estimation via the $t$-distribution. The current paper assesses these capabilities on the Genetic Analysis Workshop 19 (GAW19) sequencing data. We analyzed simulated and real phenotypes for both family and random sample data sets. For instance, when jointly testing the 8 longitudinally measured systolic blood pressure (SBP) and diastolic blood pressure (DBP) traits, it takes Mendel 78 minutes on a standard laptop computer to read, quality check, and analyze a data set with 849 individuals and 8.3 million SNPs. Genome-wide eQTL analysis of 20,643 expression traits on 641 individuals with 8.3 million SNPs takes 30 hours using 20 parallel runs on a cluster. Mendel is freely available at http://www.genetics.ucla.edu/software.
Submitted 30 July, 2014;
originally announced July 2014.
-
A Neuron as a Signal Processing Device
Authors:
Tao Hu,
Zaid J. Towfic,
Cengiz Pehlevan,
Alex Genkin,
Dmitri B. Chklovskii
Abstract:
A neuron is a basic physiological and computational unit of the brain. While much is known about the physiological properties of a neuron, its computational role is poorly understood. Here we propose to view a neuron as a signal processing device that represents the incoming streaming data matrix as a sparse vector of synaptic weights scaled by an outgoing sparse activity vector. Formally, a neuron minimizes a cost function comprising a cumulative squared representation error and regularization terms. We derive an online algorithm that minimizes such a cost function by alternating between minimization with respect to activity and minimization with respect to synaptic weights. The steps of this algorithm reproduce well-known physiological properties of a neuron, such as weighted summation and leaky integration of synaptic inputs, as well as an Oja-like, but parameter-free, synaptic learning rule. Our theoretical framework makes several predictions, some of which can be verified by existing data, while others require further experiments. Such a framework should allow modeling the function of neuronal circuits without necessarily measuring all the microscopic biophysical parameters, as well as facilitate the design of neuromorphic electronics.
Submitted 12 May, 2014;
originally announced May 2014.
-
Tandem duplications and the limits of natural selection in Drosophila yakuba and Drosophila simulans
Authors:
Rebekah L Rogers,
Julie M Cridland,
Ling Shao,
Tina T Hu,
Peter Andolfatto,
Kevin R Thornton
Abstract:
Tandem duplications are an essential source of genetic novelty, and their variation in natural populations is expected to influence adaptive walks. Here, we describe evolutionary impacts of recently derived, segregating tandem duplications in Drosophila yakuba and Drosophila simulans. We observe an excess of duplicated genes involved in defense against pathogens, insecticide resistance, chorion development, cuticular peptides, and lipases or endopeptidases associated with the accessory glands, suggesting that duplications function in Red Queen dynamics and rapid evolution. We document evidence of widespread selection on the D. simulans X, suggesting adaptation through duplication is common on the X. Despite the evidence for positive selection, duplicates display an excess of low frequency variants consistent with largely detrimental impacts, limiting the variation that can effectively facilitate adaptation. Although we observe hundreds of gene duplications, we show that segregating variation is insufficient to provide duplicate copies of the entire genome, and the number of duplications in the population spans 13.4\% of major chromosome arms in D. yakuba and 9.7\% in D. simulans. Whole gene duplication rates are low at $1.17\times10^{-9}$ per gene per generation in D. yakuba and $6.03\times10^{-10}$ per gene per generation in D. simulans, suggesting long wait times for new mutations on the order of thousands of years for the establishment of sweeps. Hence, in cases where adaptation depends on individual tandem duplications, evolution will be severely limited by mutation. We observe low levels of parallel recruitment of the same duplicated gene in different species, suggesting that the span of standing variation will define evolutionary outcomes in spite of convergence across gene ontologies consistent with rapidly evolving phenotypes.
Submitted 26 August, 2014; v1 submitted 2 May, 2014;
originally announced May 2014.
-
Landscape of standing variation for tandem duplications in Drosophila yakuba and Drosophila simulans
Authors:
Rebekah L. Rogers,
Julie M. Cridland,
Ling Shao,
Tina T. Hu,
Peter Andolfatto,
Kevin R. Thornton
Abstract:
We have used whole genome paired-end Illumina sequence data to identify tandem duplications in 20 isofemale lines of D. yakuba, and 20 isofemale lines of D. simulans and performed genome wide validation with PacBio long molecule sequencing. We identify 1,415 tandem duplications that are segregating in D. yakuba as well as 975 duplications in D. simulans, indicating greater variation in D. yakuba. Additionally, we observe high rates of secondary deletions at duplicated sites, with 8% of duplicated sites in D. simulans and 17% of sites in D. yakuba modified with deletions. These secondary deletions are consistent with the action of the large loop mismatch repair system acting to remove polymorphic tandem duplication, resulting in rapid dynamics of gain and loss in duplicated alleles and a richer substrate of genetic novelty than has been previously reported. Most duplications are present in only single strains, suggesting deleterious impacts are common. D. simulans shows larger numbers of whole gene duplications in comparison to larger proportions of gene fragments in D. yakuba. D. simulans displays an excess of high frequency variants on the X chromosome, consistent with adaptive evolution through duplications on the D. simulans X or demographic forces driving duplicates to high frequency. We identify 78 chimeric genes in D. yakuba and 38 chimeric genes in D. simulans, as well as 143 cases of recruited non-coding sequence in D. yakuba and 96 in D. simulans, in agreement with rates of chimeric gene origination in D. melanogaster. Together, these results suggest that tandem duplications often result in complex variation beyond whole gene duplications that offers a rich substrate of standing variation that is likely to contribute both to detrimental phenotypes and disease, as well as to adaptive evolutionary change.
Submitted 22 April, 2014; v1 submitted 28 January, 2014;
originally announced January 2014.
-
Online computation of sparse representations of time varying stimuli using a biologically motivated neural network
Authors:
Tao Hu,
Dmitri B. Chklovskii
Abstract:
Natural stimuli are highly redundant, possessing significant spatial and temporal correlations. While sparse coding has been proposed as an efficient strategy employed by neural systems to encode sensory stimuli, the underlying mechanisms are still not well understood. Most previous approaches model the neural dynamics by the sparse representation dictionary itself and compute the representation coefficients offline. In reality, faced with the challenge of constantly changing stimuli, neurons must compute the sparse representations dynamically in an online fashion. Here, we describe a leaky linearized Bregman iteration (LLBI) algorithm which computes the time varying sparse representations using a biologically motivated network of leaky rectifying neurons. Compared to previous attempts at dynamic sparse coding, LLBI exploits the temporal correlation of stimuli and demonstrates better performance both in representation error and in the smoothness of the temporal evolution of sparse coefficients.
Submitted 13 October, 2012;
originally announced October 2012.
-
Reconstruction of Sparse Circuits Using Multi-neuronal Excitation (RESCUME)
Authors:
Tao Hu,
Dmitri B. Chklovskii
Abstract:
One of the central problems in neuroscience is reconstructing synaptic connectivity in neural circuits. Synapses onto a neuron can be probed by sequentially stimulating potentially pre-synaptic neurons while monitoring the membrane voltage of the post-synaptic neuron. Reconstructing a large neural circuit using such a "brute force" approach is rather time-consuming and inefficient because the connectivity in neural circuits is sparse. Instead, we propose to measure a post-synaptic neuron's voltage while stimulating sequentially random subsets of multiple potentially pre-synaptic neurons. To reconstruct these synaptic connections from the recorded voltage we apply a decoding algorithm recently developed for compressive sensing. Compared to the brute force approach, our method promises significant time savings that grow with the size of the circuit. We use computer simulations to find optimal stimulation parameters and explore the feasibility of our reconstruction method under realistic experimental conditions including noise and non-linear synaptic integration. Multineuronal stimulation allows reconstructing synaptic connectivity just from the spiking activity of post-synaptic neurons, even when sub-threshold voltage is unavailable. By using calcium indicators, voltage-sensitive dyes, or multi-electrode arrays one could monitor activity of multiple postsynaptic neurons simultaneously, thus mapping their synaptic inputs in parallel, potentially reconstructing a complete neural circuit.
Submitted 4 October, 2012;
originally announced October 2012.
-
A network of spiking neurons for computing sparse representations in an energy efficient way
Authors:
Tao Hu,
Alexander Genkin,
Dmitri B. Chklovskii
Abstract:
Computing sparse redundant representations is an important problem both in applied mathematics and neuroscience. In many applications, this problem must be solved in an energy efficient way. Here, we propose a hybrid distributed algorithm (HDA), which solves this problem on a network of simple nodes communicating via low-bandwidth channels. HDA nodes perform both gradient-descent-like steps on analog internal variables and coordinate-descent-like steps via quantized external variables communicated to each other. Interestingly, such operation is equivalent to a network of integrate-and-fire neurons, suggesting that HDA may serve as a model of neural computation. We show that the numerical performance of HDA is on par with existing algorithms. In the asymptotic regime the representation error of HDA decays with time, t, as 1/t. HDA is stable against time-varying noise, specifically, the representation error decays as 1/sqrt(t) for Gaussian white noise.
Submitted 4 October, 2012;
originally announced October 2012.
-
Super-resolution using Sparse Representations over Learned Dictionaries: Reconstruction of Brain Structure using Electron Microscopy
Authors:
Tao Hu,
Juan Nunez-Iglesias,
Shiv Vitaladevuni,
Lou Scheffer,
Shan Xu,
Mehdi Bolorizadeh,
Harald Hess,
Richard Fetter,
Dmitri Chklovskii
Abstract:
A central problem in neuroscience is reconstructing neuronal circuits on the synapse level. Due to a wide range of scales in brain architecture, such reconstruction requires imaging that is both high-resolution and high-throughput. Existing electron microscopy (EM) techniques possess the required resolution in the lateral plane and either high throughput or high depth resolution, but not both. Here, we exploit recent advances in unsupervised learning and signal processing to obtain high depth-resolution EM images computationally without sacrificing throughput. First, we show that the brain tissue can be represented as a sparse linear combination of localized basis functions that are learned using high-resolution datasets. We then develop compressive sensing-inspired techniques that can reconstruct the brain tissue from very few (typically 5) tomographic views of each section. This enables tracing of neuronal processes and, hence, high throughput reconstruction of neural circuits on the level of individual synapses.
Submitted 1 October, 2012;
originally announced October 2012.
-
Theory of DNA translocation through narrow ion channels and nanopores with charged walls
Authors:
Tao Hu,
B. I. Shklovskii
Abstract:
Translocation of a single stranded DNA through genetically engineered $α$-hemolysin channels with positively charged walls is studied. It is predicted that the transport properties of such channels are dramatically different from those of the neutral wild-type $α$-hemolysin channel. We assume that the wall charges compensate the fraction $x$ of the bare charge $q_{b}$ of the DNA piece residing in the channel. Our predictions are as follows: (i) At small concentration of salt the blocked ion current decreases with $x$. (ii) The effective charge $q$ of the DNA piece, which is very small at $x = 0$ (neutral channel), grows with $x$ and at $x=1$ reaches $q_{b}$. (iii) The rate of DNA capture by the channel grows exponentially with $x$. Our theory is also applicable to translocation of a double stranded DNA in narrow solid state nanopores with positively charged walls.
Submitted 19 June, 2008; v1 submitted 4 March, 2008;
originally announced March 2008.
-
How a protein searches for its specific site on DNA: the role of intersegment transfer
Authors:
Tao Hu,
B. I. Shklovskii
Abstract:
Proteins are known to locate their specific targets on DNA up to two orders of magnitude faster than predicted by the Smoluchowski three-dimensional diffusion rate. One of the mechanisms proposed to resolve this discrepancy is termed "intersegment transfer". Many proteins have two DNA binding sites and can transfer from one DNA segment to another without dissociation to water. We calculate the target search rate for such proteins in a dense globular DNA, taking into account intersegment transfer working in conjunction with DNA motion and protein sliding along DNA. We show that intersegment transfer plays a very important role in cases where the protein spends most of its time adsorbed on DNA.
Submitted 24 July, 2007;
originally announced July 2007.
-
Kinetics of viral self-assembly: the role of ss RNA antenna
Authors:
Tao Hu,
B. I. Shklovskii
Abstract:
A large class of viruses self-assemble from a large number of identical capsid proteins with long flexible N-terminal tails and ss RNA. We study the role of the strong Coulomb interaction of positive N-terminal tails with ss RNA in the kinetics of in vitro virus self-assembly. Capsid proteins stick to the unassembled chain of ss RNA (which we call "antenna") and slide on it towards the assembly site. We show that at an excess of capsid proteins such one-dimensional diffusion accelerates self-assembly more than ten times. On the other hand, at an excess of ss RNA, the antenna slows self-assembly down. Several experiments are proposed to verify the role of the ss RNA antenna.
Submitted 2 February, 2007; v1 submitted 20 November, 2006;
originally announced November 2006.
-
Electrostatic theory of viral self-assembly: a toy model
Authors:
Tao Hu,
Rui Zhang,
B. I. Shklovskii
Abstract:
Viruses self-assemble from identical capsid proteins and their genome consisting, for example, of a long single stranded (ss) RNA. For a large class of T = 3 viruses, capsid proteins have long positive N-terminal tails. We explore the role played by the Coulomb interaction between the brush of positive N-terminal tails rooted at the inner surface of the capsid and the negative ss RNA molecule. We show that viruses are most stable when the total contour length of ss RNA is close to the total length of the tails. For such a structure the absolute value of the total RNA charge is approximately twice the charge of the capsid. This conclusion agrees with structural data.
Submitted 2 February, 2007; v1 submitted 3 October, 2006;
originally announced October 2006.
-
How does a protein search for the specific site on DNA: the role of disorder
Authors:
Tao Hu,
B. I. Shklovskii
Abstract:
Proteins can locate their specific targets on DNA up to two orders of magnitude faster than the Smoluchowski three-dimensional diffusion rate. This happens due to non-specific adsorption of proteins to DNA and subsequent one-dimensional sliding along DNA. We call such a one-dimensional route towards the target an "antenna". We studied the role of the dispersion of nonspecific binding energies within the antenna due to the quasi-random sequence of natural DNA. A random energy profile for sliding proteins slows the search rate for the target. We show that this slowdown is different for macroscopic and mesoscopic antennas.
Submitted 24 April, 2006; v1 submitted 20 February, 2006;
originally announced February 2006.
-
How do proteins search for their specific sites on coiled or globular DNA
Authors:
Tao Hu,
A. Yu. Grosberg,
B. I. Shklovskii
Abstract:
It has been known since the early days of molecular biology that proteins locate their specific targets on DNA up to two orders of magnitude faster than the Smoluchowski 3D diffusion rate. It was Delbruck's idea that they are non-specifically adsorbed on DNA, and that sliding along DNA provides for the faster 1D search. Surprisingly, the role of DNA conformation was never considered in this context. In this article, we explicitly address the relative role of 3D diffusion and 1D sliding along coiled or globular DNA and the possibility of correlated re-adsorption of desorbed proteins. We have identified a wealth of new scaling regimes. We also found the maximal possible acceleration of the reaction due to sliding. We found that the maximum on the rate-versus-ionic strength curve is asymmetric, and that sliding can lead not only to acceleration but, in some regimes, to dramatic deceleration of the reaction.
Submitted 24 October, 2005;
originally announced October 2005.