-
Sequence Analysis Using the Bezier Curve
Authors:
Taslim Murad,
Sarwan Ali,
Murray Patterson
Abstract:
The analysis of sequences (e.g., protein, DNA, and SMILES string) is essential for disease diagnosis, biomaterial engineering, genetic engineering, and drug discovery domains. Conventional analytical methods focus on transforming sequences into numerical representations for applying machine learning/deep learning-based sequence characterization. However, their efficacy is constrained by the intrin…
▽ More
The analysis of sequences (e.g., protein, DNA, and SMILES string) is essential for disease diagnosis, biomaterial engineering, genetic engineering, and drug discovery domains. Conventional analytical methods focus on transforming sequences into numerical representations for applying machine learning/deep learning-based sequence characterization. However, their efficacy is constrained by the intrinsic nature of deep learning (DL) models, which tend to exhibit suboptimal performance when applied to tabular data. An alternative group of methodologies endeavors to convert biological sequences into image forms by applying the concept of Chaos Game Representation (CGR). However, a noteworthy drawback of these methods lies in their tendency to map individual elements of the sequence onto a relatively small subset of designated pixels within the generated image. The resulting sparse image representation may not adequately encapsulate the comprehensive sequence information, potentially resulting in suboptimal predictions. In this study, we introduce a novel approach to transform sequences into images using the Bézier curve concept for element mapping. Mapping the elements onto a curve enhances the sequence information representation in the respective images, hence yielding better DL-based classification performance. We employed different sequence datasets to validate our system by using different classification tasks, and the results illustrate that our Bézier curve method is able to achieve good performance for all the tasks.
△ Less
Submitted 18 March, 2025;
originally announced March 2025.
-
A data augmentation strategy for deep neural networks with application to epidemic modelling
Authors:
Muhammad Awais,
Abu Safyan Ali,
Giacomo Dimarco,
Federica Ferrarese,
Lorenzo Pareschi
Abstract:
In this work, we integrate the predictive capabilities of compartmental disease dynamics models with machine learning ability to analyze complex, high-dimensional data and uncover patterns that conventional models may overlook. Specifically, we present a proof of concept demonstrating the application of data-driven methods and deep neural networks to a recently introduced Susceptible-Infected-Reco…
▽ More
In this work, we integrate the predictive capabilities of compartmental disease dynamics models with machine learning ability to analyze complex, high-dimensional data and uncover patterns that conventional models may overlook. Specifically, we present a proof of concept demonstrating the application of data-driven methods and deep neural networks to a recently introduced Susceptible-Infected-Recovered type model with social features, including a saturated incidence rate, to improve epidemic prediction and forecasting. Our results show that a robust data augmentation strategy trough suitable data-driven models can improve the reliability of Feed-Forward Neural Networks and Nonlinear Autoregressive Networks, providing a complementary strategy to Physics-Informed Neural Networks, particularly in settings where data augmentation from mechanistic models can enhance learning. This approach enhances the ability to handle nonlinear dynamics and offers scalable, data-driven solutions for epidemic forecasting, prioritizing predictive accuracy over the constraints of physics-based models. Numerical simulations of the lockdown and post-lockdown phase of the COVID-19 epidemic in Italy and Spain validate our methodology.
△ Less
Submitted 27 May, 2025; v1 submitted 28 February, 2025;
originally announced February 2025.
-
Hilbert Curve Based Molecular Sequence Analysis
Authors:
Sarwan Ali,
Tamkanat E Ali,
Imdad Ullah Khan,
Murray Patterson
Abstract:
Accurate molecular sequence analysis is a key task in the field of bioinformatics. To apply molecular sequence classification algorithms, we first need to generate the appropriate representations of the sequences. Traditional numeric sequence representation techniques are mostly based on sequence alignment that faces limitations in the form of lack of accuracy. Although several alignment-free tech…
▽ More
Accurate molecular sequence analysis is a key task in the field of bioinformatics. To apply molecular sequence classification algorithms, we first need to generate the appropriate representations of the sequences. Traditional numeric sequence representation techniques are mostly based on sequence alignment that faces limitations in the form of lack of accuracy. Although several alignment-free techniques have also been introduced, their tabular data form results in low performance when used with Deep Learning (DL) models compared to the competitive performance observed in the case of image-based data. To find a solution to this problem and to make Deep Learning (DL) models function to their maximum potential while capturing the important spatial information in the sequence data, we propose a universal Hibert curve-based Chaos Game Representation (CGR) method. This method is a transformative function that involves a novel Alphabetic index mapping technique used in constructing Hilbert curve-based image representation from molecular sequences. Our method can be globally applied to any type of molecular sequence data. The Hilbert curve-based image representations can be used as input to sophisticated vision DL models for sequence classification. The proposed method shows promising results as it outperforms current state-of-the-art methods by achieving a high accuracy of $94.5$\% and an F1 score of $93.9\%$ when tested with the CNN model on the lung cancer dataset. This approach opens up a new horizon for exploring molecular sequence analysis using image classification methods.
△ Less
Submitted 29 December, 2024;
originally announced December 2024.
-
DANCE: Deep Learning-Assisted Analysis of Protein Sequences Using Chaos Enhanced Kaleidoscopic Images
Authors:
Taslim Murad,
Prakash Chourasia,
Sarwan Ali,
Imdad Ullah Khan,
Murray Patterson
Abstract:
Cancer is a complex disease characterized by uncontrolled cell growth. T cell receptors (TCRs), crucial proteins in the immune system, play a key role in recognizing antigens, including those associated with cancer. Recent advancements in sequencing technologies have facilitated comprehensive profiling of TCR repertoires, uncovering TCRs with potent anti-cancer activity and enabling TCR-based immu…
▽ More
Cancer is a complex disease characterized by uncontrolled cell growth. T cell receptors (TCRs), crucial proteins in the immune system, play a key role in recognizing antigens, including those associated with cancer. Recent advancements in sequencing technologies have facilitated comprehensive profiling of TCR repertoires, uncovering TCRs with potent anti-cancer activity and enabling TCR-based immunotherapies. However, analyzing these intricate biomolecules necessitates efficient representations that capture their structural and functional information. T-cell protein sequences pose unique challenges due to their relatively smaller lengths compared to other biomolecules. An image-based representation approach becomes a preferred choice for efficient embeddings, allowing for the preservation of essential details and enabling comprehensive analysis of T-cell protein sequences. In this paper, we propose to generate images from the protein sequences using the idea of Chaos Game Representation (CGR) using the Kaleidoscopic images approach. This Deep Learning Assisted Analysis of Protein Sequences Using Chaos Enhanced Kaleidoscopic Images (called DANCE) provides a unique way to visualize protein sequences by recursively applying chaos game rules around a central seed point. we perform the classification of the T cell receptors (TCRs) protein sequences in terms of their respective target cancer cells, as TCRs are known for their immune response against cancer disease. The TCR sequences are converted into images using the DANCE method. We employ deep-learning vision models to perform the classification to obtain insights into the relationship between the visual patterns observed in the generated kaleidoscopic images and the underlying protein properties. By combining CGR-based image generation with deep learning classification, this study opens novel possibilities in the protein analysis domain.
△ Less
Submitted 11 June, 2025; v1 submitted 10 September, 2024;
originally announced September 2024.
-
Nearest Neighbor CCP-Based Molecular Sequence Analysis
Authors:
Sarwan Ali,
Prakash Chourasia,
Bipin Koirala,
Murray Patterson
Abstract:
Molecular sequence analysis is crucial for comprehending several biological processes, including protein-protein interactions, functional annotation, and disease classification. The large number of sequences and the inherently complicated nature of protein structures make it challenging to analyze such data. Finding patterns and enhancing subsequent research requires the use of dimensionality redu…
▽ More
Molecular sequence analysis is crucial for comprehending several biological processes, including protein-protein interactions, functional annotation, and disease classification. The large number of sequences and the inherently complicated nature of protein structures make it challenging to analyze such data. Finding patterns and enhancing subsequent research requires the use of dimensionality reduction and feature selection approaches. Recently, a method called Correlated Clustering and Projection (CCP) has been proposed as an effective method for biological sequencing data. The CCP technique is still costly to compute even though it is effective for sequence visualization. Furthermore, its utility for classifying molecular sequences is still uncertain. To solve these two problems, we present a Nearest Neighbor Correlated Clustering and Projection (CCP-NN)-based technique for efficiently preprocessing molecular sequence data. To group related molecular sequences and produce representative supersequences, CCP makes use of sequence-to-sequence correlations. As opposed to conventional methods, CCP doesn't rely on matrix diagonalization, therefore it can be applied to a range of machine-learning problems. We estimate the density map and compute the correlation using a nearest-neighbor search technique. We performed molecular sequence classification using CCP and CCP-NN representations to assess the efficacy of our proposed approach. Our findings show that CCP-NN considerably improves classification task accuracy as well as significantly outperforms CCP in terms of computational runtime.
△ Less
Submitted 7 September, 2024;
originally announced September 2024.
-
Expanding Chemical Representation with k-mers and Fragment-based Fingerprints for Molecular Fingerprinting
Authors:
Sarwan Ali,
Prakash Chourasia,
Murray Patterson
Abstract:
This study introduces a novel approach, combining substruct counting, $k$-mers, and Daylight-like fingerprints, to expand the representation of chemical structures in SMILES strings. The integrated method generates comprehensive molecular embeddings that enhance discriminative power and information content. Experimental evaluations demonstrate its superiority over traditional Morgan fingerprinting…
▽ More
This study introduces a novel approach, combining substruct counting, $k$-mers, and Daylight-like fingerprints, to expand the representation of chemical structures in SMILES strings. The integrated method generates comprehensive molecular embeddings that enhance discriminative power and information content. Experimental evaluations demonstrate its superiority over traditional Morgan fingerprinting, MACCS, and Daylight fingerprint alone, improving chemoinformatics tasks such as drug classification. The proposed method offers a more informative representation of chemical structures, advancing molecular similarity analysis and facilitating applications in molecular design and drug discovery. It presents a promising avenue for molecular structure analysis and design, with significant potential for practical implementation.
△ Less
Submitted 28 March, 2024;
originally announced March 2024.
-
A Universal Non-Parametric Approach For Improved Molecular Sequence Analysis
Authors:
Sarwan Ali,
Tamkanat E Ali,
Prakash Chourasia,
Murray Patterson
Abstract:
In the field of biological research, it is essential to comprehend the characteristics and functions of molecular sequences. The classification of molecular sequences has seen widespread use of neural network-based techniques. Despite their astounding accuracy, these models often require a substantial number of parameters and more data collection. In this work, we present a novel approach based on…
▽ More
In the field of biological research, it is essential to comprehend the characteristics and functions of molecular sequences. The classification of molecular sequences has seen widespread use of neural network-based techniques. Despite their astounding accuracy, these models often require a substantial number of parameters and more data collection. In this work, we present a novel approach based on the compression-based Model, motivated from \cite{jiang2023low}, which combines the simplicity of basic compression algorithms like Gzip and Bz2, with Normalized Compression Distance (NCD) algorithm to achieve better performance on classification tasks without relying on handcrafted features or pre-trained models. Firstly, we compress the molecular sequence using well-known compression algorithms, such as Gzip and Bz2. By leveraging the latent structure encoded in compressed files, we compute the Normalized Compression Distance between each pair of molecular sequences, which is derived from the Kolmogorov complexity. This gives us a distance matrix, which is the input for generating a kernel matrix using a Gaussian kernel. Next, we employ kernel Principal Component Analysis (PCA) to get the vector representations for the corresponding molecular sequence, capturing important structural and functional information. The resulting vector representations provide an efficient yet effective solution for molecular sequence analysis and can be used in ML-based downstream tasks. The proposed approach eliminates the need for computationally intensive Deep Neural Networks (DNNs), with their large parameter counts and data requirements. Instead, it leverages a lightweight and universally accessible compression-based model.
△ Less
Submitted 12 February, 2024;
originally announced February 2024.
-
Sequence-Based Nanobody-Antigen Binding Prediction
Authors:
Usama Sardar,
Sarwan Ali,
Muhammad Sohaib Ayub,
Muhammad Shoaib,
Khurram Bashir,
Imdad Ullah Khan,
Murray Patterson
Abstract:
Nanobodies (Nb) are monomeric heavy-chain fragments derived from heavy-chain only antibodies naturally found in Camelids and Sharks. Their considerably small size (~3-4 nm; 13 kDa) and favorable biophysical properties make them attractive targets for recombinant production. Furthermore, their unique ability to bind selectively to specific antigens, such as toxins, chemicals, bacteria, and viruses,…
▽ More
Nanobodies (Nb) are monomeric heavy-chain fragments derived from heavy-chain only antibodies naturally found in Camelids and Sharks. Their considerably small size (~3-4 nm; 13 kDa) and favorable biophysical properties make them attractive targets for recombinant production. Furthermore, their unique ability to bind selectively to specific antigens, such as toxins, chemicals, bacteria, and viruses, makes them powerful tools in cell biology, structural biology, medical diagnostics, and future therapeutic agents in treating cancer and other serious illnesses. However, a critical challenge in nanobodies production is the unavailability of nanobodies for a majority of antigens. Although some computational methods have been proposed to screen potential nanobodies for given target antigens, their practical application is highly restricted due to their reliance on 3D structures. Moreover, predicting nanobodyantigen interactions (binding) is a time-consuming and labor-intensive task. This study aims to develop a machine-learning method to predict Nanobody-Antigen binding solely based on the sequence data. We curated a comprehensive dataset of Nanobody-Antigen binding and nonbinding data and devised an embedding method based on gapped k-mers to predict binding based only on sequences of nanobody and antigen. Our approach achieves up to 90% accuracy in binding prediction and is significantly more efficient compared to the widely-used computational docking technique.
△ Less
Submitted 14 July, 2023;
originally announced August 2023.
-
Robust Brain Age Estimation via Regression Models and MRI-derived Features
Authors:
Mansoor Ahmed,
Usama Sardar,
Sarwan Ali,
Shafiq Alam,
Murray Patterson,
Imdad Ullah Khan
Abstract:
The determination of biological brain age is a crucial biomarker in the assessment of neurological disorders and understanding of the morphological changes that occur during aging. Various machine learning models have been proposed for estimating brain age through Magnetic Resonance Imaging (MRI) of healthy controls. However, developing a robust brain age estimation (BAE) framework has been challe…
▽ More
The determination of biological brain age is a crucial biomarker in the assessment of neurological disorders and understanding of the morphological changes that occur during aging. Various machine learning models have been proposed for estimating brain age through Magnetic Resonance Imaging (MRI) of healthy controls. However, developing a robust brain age estimation (BAE) framework has been challenging due to the selection of appropriate MRI-derived features and the high cost of MRI acquisition. In this study, we present a novel BAE framework using the Open Big Healthy Brain (OpenBHB) dataset, which is a new multi-site and publicly available benchmark dataset that includes region-wise feature metrics derived from T1-weighted (T1-w) brain MRI scans of 3965 healthy controls aged between 6 to 86 years. Our approach integrates three different MRI-derived region-wise features and different regression models, resulting in a highly accurate brain age estimation with a Mean Absolute Error (MAE) of 3.25 years, demonstrating the framework's robustness. We also analyze our model's regression-based performance on gender-wise (male and female) healthy test groups. The proposed BAE framework provides a new approach for estimating brain age, which has important implications for the understanding of neurological disorders and age-related brain changes.
△ Less
Submitted 8 June, 2023;
originally announced June 2023.
-
T Cell Receptor Protein Sequences and Sparse Coding: A Novel Approach to Cancer Classification
Authors:
Zahra Tayebi,
Sarwan Ali,
Prakash Chourasia,
Taslim Murad,
Murray Patterson
Abstract:
Cancer is a complex disease characterized by uncontrolled cell growth and proliferation. T cell receptors (TCRs) are essential proteins for the adaptive immune system, and their specific recognition of antigens plays a crucial role in the immune response against diseases, including cancer. The diversity and specificity of TCRs make them ideal for targeting cancer cells, and recent advancements in…
▽ More
Cancer is a complex disease characterized by uncontrolled cell growth and proliferation. T cell receptors (TCRs) are essential proteins for the adaptive immune system, and their specific recognition of antigens plays a crucial role in the immune response against diseases, including cancer. The diversity and specificity of TCRs make them ideal for targeting cancer cells, and recent advancements in sequencing technologies have enabled the comprehensive profiling of TCR repertoires. This has led to the discovery of TCRs with potent anti-cancer activity and the development of TCR-based immunotherapies. In this study, we investigate the use of sparse coding for the multi-class classification of TCR protein sequences with cancer categories as target labels. Sparse coding is a popular technique in machine learning that enables the representation of data with a set of informative features and can capture complex relationships between amino acids and identify subtle patterns in the sequence that might be missed by low-dimensional methods. We first compute the k-mers from the TCR sequences and then apply sparse coding to capture the essential features of the data. To improve the predictive performance of the final embeddings, we integrate domain knowledge regarding different types of cancer properties. We then train different machine learning (linear and non-linear) classifiers on the embeddings of TCR sequences for the purpose of supervised analysis. Our proposed embedding method on a benchmark dataset of TCR sequences significantly outperforms the baselines in terms of predictive performance, achieving an accuracy of 99.8\%. Our study highlights the potential of sparse coding for the analysis of TCR protein sequences in cancer research and other related fields.
△ Less
Submitted 5 September, 2023; v1 submitted 25 April, 2023;
originally announced April 2023.
-
Virus2Vec: Viral Sequence Classification Using Machine Learning
Authors:
Sarwan Ali,
Babatunde Bello,
Prakash Chourasia,
Ria Thazhe Punathil,
Pin-Yu Chen,
Imdad Ullah Khan,
Murray Patterson
Abstract:
Understanding the host-specificity of different families of viruses sheds light on the origin of, e.g., SARS-CoV-2, rabies, and other such zoonotic pathogens in humans. It enables epidemiologists, medical professionals, and policymakers to curb existing epidemics and prevent future ones promptly. In the family Coronaviridae (of which SARS-CoV-2 is a member), it is well-known that the spike protein…
▽ More
Understanding the host-specificity of different families of viruses sheds light on the origin of, e.g., SARS-CoV-2, rabies, and other such zoonotic pathogens in humans. It enables epidemiologists, medical professionals, and policymakers to curb existing epidemics and prevent future ones promptly. In the family Coronaviridae (of which SARS-CoV-2 is a member), it is well-known that the spike protein is the point of contact between the virus and the host cell membrane. On the other hand, the two traditional mammalian orders, Carnivora (carnivores) and Chiroptera (bats) are recognized to be responsible for maintaining and spreading the Rabies Lyssavirus (RABV). We propose Virus2Vec, a feature-vector representation for viral (nucleotide or amino acid) sequences that enable vector-space-based machine learning models to identify viral hosts. Virus2Vec generates numerical feature vectors for unaligned sequences, allowing us to forego the computationally expensive sequence alignment step from the pipeline. Virus2Vec leverages the power of both the \emph{minimizer} and position weight matrix (PWM) to generate compact feature vectors. Using several classifiers, we empirically evaluate Virus2Vec on real-world spike sequences of Coronaviridae and rabies virus sequence data to predict the host (identifying the reservoirs of infection). Our results demonstrate that Virus2Vec outperforms the predictive accuracies of baseline and state-of-the-art methods.
△ Less
Submitted 24 April, 2023;
originally announced April 2023.
-
PCD2Vec: A Poisson Correction Distance-Based Approach for Viral Host Classification
Authors:
Sarwan Ali,
Taslim Murad,
Murray Patterson
Abstract:
Coronaviruses are membrane-enveloped, non-segmented positive-strand RNA viruses belonging to the Coronaviridae family. Various animal species, mainly mammalian and avian, are severely infected by various coronaviruses, causing serious concerns like the recent pandemic (COVID-19). Therefore, building a deeper understanding of these viruses is essential to devise prevention and mitigation mechanisms…
▽ More
Coronaviruses are membrane-enveloped, non-segmented positive-strand RNA viruses belonging to the Coronaviridae family. Various animal species, mainly mammalian and avian, are severely infected by various coronaviruses, causing serious concerns like the recent pandemic (COVID-19). Therefore, building a deeper understanding of these viruses is essential to devise prevention and mitigation mechanisms. In the Coronavirus genome, an essential structural region is the spike region, and it's responsible for attaching the virus to the host cell membrane. Therefore, the usage of only the spike protein, instead of the full genome, provides most of the essential information for performing analyses such as host classification. In this paper, we propose a novel method for predicting the host specificity of coronaviruses by analyzing spike protein sequences from different viral subgenera and species. Our method involves using the Poisson correction distance to generate a distance matrix, followed by using a radial basis function (RBF) kernel and kernel principal component analysis (PCA) to generate a low-dimensional embedding. Finally, we apply classification algorithms to the low-dimensional embedding to generate the resulting predictions of the host specificity of coronaviruses. We provide theoretical proofs for the non-negativity, symmetry, and triangle inequality properties of the Poisson correction distance metric, which are important properties in a machine-learning setting. By encoding the spike protein structure and sequences using this comprehensive approach, we aim to uncover hidden patterns in the biological sequences to make accurate predictions about host specificity. Finally, our classification results illustrate that our method can achieve higher predictive accuracy and improve performance over existing baselines.
△ Less
Submitted 12 April, 2023;
originally announced April 2023.
-
ViralVectors: Compact and Scalable Alignment-free Virome Feature Generation
Authors:
Sarwan Ali,
Prakash Chourasia,
Zahra Tayebi,
Babatunde Bello,
Murray Patterson
Abstract:
The amount of sequencing data for SARS-CoV-2 is several orders of magnitude larger than any virus. This will continue to grow geometrically for SARS-CoV-2, and other viruses, as many countries heavily finance genomic surveillance efforts. Hence, we need methods for processing large amounts of sequence data to allow for effective yet timely decision-making. Such data will come from heterogeneous so…
▽ More
The amount of sequencing data for SARS-CoV-2 is several orders of magnitude larger than any virus. This will continue to grow geometrically for SARS-CoV-2, and other viruses, as many countries heavily finance genomic surveillance efforts. Hence, we need methods for processing large amounts of sequence data to allow for effective yet timely decision-making. Such data will come from heterogeneous sources: aligned, unaligned, or even unassembled raw nucleotide or amino acid sequencing reads pertaining to the whole genome or regions (e.g., spike) of interest. In this work, we propose \emph{ViralVectors}, a compact feature vector generation from virome sequencing data that allows effective downstream analysis. Such generation is based on \emph{minimizers}, a type of lightweight "signature" of a sequence, used traditionally in assembly and read mapping -- to our knowledge, the first use minimizers in this way. We validate our approach on different types of sequencing data: (a) 2.5M SARS-CoV-2 spike sequences (to show scalability); (b) 3K Coronaviridae spike sequences (to show robustness to more genomic variability); and (c) 4K raw WGS reads sets taken from nasal-swab PCR tests (to show the ability to process unassembled reads). Our results show that ViralVectors outperforms current benchmarks in most classification and clustering tasks.
△ Less
Submitted 7 April, 2023; v1 submitted 6 April, 2023;
originally announced April 2023.
-
BioSequence2Vec: Efficient Embedding Generation For Biological Sequences
Authors:
Sarwan Ali,
Usama Sardar,
Murray Patterson,
Imdad Ullah Khan
Abstract:
Representation learning is an important step in the machine learning pipeline. Given the current biological sequencing data volume, learning an explicit representation is prohibitive due to the dimensionality of the resulting feature vectors. Kernel-based methods, e.g., SVM, are a proven efficient and useful alternative for several machine learning (ML) tasks such as sequence classification. Three…
▽ More
Representation learning is an important step in the machine learning pipeline. Given the current biological sequencing data volume, learning an explicit representation is prohibitive due to the dimensionality of the resulting feature vectors. Kernel-based methods, e.g., SVM, are a proven efficient and useful alternative for several machine learning (ML) tasks such as sequence classification. Three challenges with kernel methods are (i) the computation time, (ii) the memory usage (storing an $n\times n$ matrix), and (iii) the usage of kernel matrices limited to kernel-based ML methods (difficult to generalize on non-kernel classifiers). While (i) can be solved using approximate methods, challenge (ii) remains for typical kernel methods. Similarly, although non-kernel-based ML methods can be applied to kernel matrices by extracting principal components (kernel PCA), it may result in information loss, while being computationally expensive. In this paper, we propose a general-purpose representation learning approach that embodies kernel methods' qualities while avoiding computation, memory, and generalizability challenges. This involves computing a low-dimensional embedding of each sequence, using random projections of its $k$-mer frequency vectors, significantly reducing the computation needed to compute the dot product and the memory needed to store the resulting representation. Our proposed fast and alignment-free embedding method can be used as input to any distance (e.g., $k$ nearest neighbors) and non-distance (e.g., decision tree) based ML method for classification and clustering tasks. Using different forms of biological sequences as input, we perform a variety of real-world classification tasks, such as SARS-CoV-2 lineage and gene family classification, outperforming several state-of-the-art embedding and kernel methods in predictive performance.
△ Less
Submitted 1 April, 2023;
originally announced April 2023.
-
Reproduction number of SARS-CoV-2 Omicron variants, China, December 2022-January 2023
Authors:
Yuan Bai,
Zengyang Shao,
Xiao Zhang,
Ruohan Chen,
Lin Wang,
Sheikh Taslim Ali,
Tianmu Chen,
Eric H. Y. Lau,
Dong-Yan Jin,
Zhanwei Du
Abstract:
China adjusted the zero-COVID strategy in late 2022, triggering an unprecedented Omicron wave. We estimated the time-varying reproduction numbers of 32 provincial-level administrative divisions from December 2022 to January 2023. We found that the pooled estimate of initial reproduction numbers is 4.74 (95% CI: 4.41, 5.07).
China adjusted the zero-COVID strategy in late 2022, triggering an unprecedented Omicron wave. We estimated the time-varying reproduction numbers of 32 provincial-level administrative divisions from December 2022 to January 2023. We found that the pooled estimate of initial reproduction numbers is 4.74 (95% CI: 4.41, 5.07).
△ Less
Submitted 19 March, 2023;
originally announced March 2023.
-
Exploring The Potential Of GANs In Biological Sequence Analysis
Authors:
Taslim Murad,
Sarwan Ali,
Murray Patterson
Abstract:
Biological sequence analysis is an essential step toward building a deeper understanding of the underlying functions, structures, and behaviors of the sequences. It can help in identifying the characteristics of the associated organisms, like viruses, etc., and building prevention mechanisms to eradicate their spread and impact, as viruses are known to cause epidemics that can become pandemics glo…
▽ More
Biological sequence analysis is an essential step toward building a deeper understanding of the underlying functions, structures, and behaviors of the sequences. It can help in identifying the characteristics of the associated organisms, like viruses, etc., and building prevention mechanisms to eradicate their spread and impact, as viruses are known to cause epidemics that can become pandemics globally. New tools for biological sequence analysis are provided by machine learning (ML) technologies to effectively analyze the functions and structures of the sequences. However, these ML-based methods undergo challenges with data imbalance, generally associated with biological sequence datasets, which hinders their performance. Although various strategies are present to address this issue, like the SMOTE algorithm, which creates synthetic data, however, they focus on local information rather than the overall class distribution. In this work, we explore a novel approach to handle the data imbalance issue based on Generative Adversarial Networks (GANs) which use the overall data distribution. GANs are utilized to generate synthetic data that closely resembles the real one, thus this generated data can be employed to enhance the ML models' performance by eradicating the class imbalance problem for biological sequence analysis. We perform 3 distinct classification tasks by using 3 different sequence datasets (Influenza A Virus, PALMdb, VDjDB) and our results illustrate that GANs can improve the overall classification performance.
△ Less
Submitted 4 March, 2023;
originally announced March 2023.
-
Evaluating COVID-19 Sequence Data Using Nearest-Neighbors Based Network Model
Authors:
Sarwan Ali
Abstract:
The SARS-CoV-2 coronavirus is the cause of the COVID-19 disease in humans. Like many coronaviruses, it can adapt to different hosts and evolve into different lineages. It is well-known that the major SARS-CoV-2 lineages are characterized by mutations that happen predominantly in the spike protein. Understanding the spike protein structure and how it can be perturbed is vital for understanding and…
▽ More
The SARS-CoV-2 coronavirus is the cause of the COVID-19 disease in humans. Like many coronaviruses, it can adapt to different hosts and evolve into different lineages. It is well-known that the major SARS-CoV-2 lineages are characterized by mutations that happen predominantly in the spike protein. Understanding the spike protein structure and how it can be perturbed is vital for understanding and determining if a lineage is of concern. These are crucial to identifying and controlling current outbreaks and preventing future pandemics. Machine learning (ML) methods are a viable solution to this effort, given the volume of available sequencing data, much of which is unaligned or even unassembled. However, such ML methods require fixed-length numerical feature vectors in Euclidean space to be applicable. Similarly, euclidean space is not considered the best choice when working with the classification and clustering tasks for biological sequences. For this purpose, we design a method that converts the protein (spike) sequences into the sequence similarity network (SSN). We can then use SSN as an input for the classical algorithms from the graph mining domain for the typical tasks such as classification and clustering to understand the data. We show that the proposed alignment-free method is able to outperform the current SOTA method in terms of clustering results. Similarly, we are able to achieve higher classification accuracy using well-known Node2Vec-based embedding compared to other baseline embedding approaches.
△ Less
Submitted 22 November, 2022; v1 submitted 18 November, 2022;
originally announced November 2022.
-
Reads2Vec: Efficient Embedding of Raw High-Throughput Sequencing Reads Data
Authors:
Prakash Chourasia,
Sarwan Ali,
Simone Ciccolella,
Gianluca Della Vedova,
Murray Patterson
Abstract:
The massive amount of genomic data appearing for SARS-CoV-2 since the beginning of the COVID-19 pandemic has challenged traditional methods for studying its dynamics. As a result, new methods such as Pangolin, which can scale to the millions of samples of SARS-CoV-2 currently available, have appeared. Such a tool is tailored to take as input assembled, aligned and curated full-length sequences, su…
▽ More
The massive amount of genomic data appearing for SARS-CoV-2 since the beginning of the COVID-19 pandemic has challenged traditional methods for studying its dynamics. As a result, new methods such as Pangolin, which can scale to the millions of samples of SARS-CoV-2 currently available, have appeared. Such a tool is tailored to take as input assembled, aligned and curated full-length sequences, such as those found in the GISAID database. As high-throughput sequencing technologies continue to advance, such assembly, alignment and curation may become a bottleneck, creating a need for methods which can process raw sequencing reads directly.
In this paper, we propose Reads2Vec, an alignment-free embedding approach that can generate a fixed-length feature vector representation directly from the raw sequencing reads without requiring assembly. Furthermore, since such an embedding is a numerical representation, it may be applied to highly optimized classification and clustering algorithms. Experiments on simulated data show that our proposed embedding obtains better classification results and better clustering properties contrary to existing alignment-free baselines. In a study on real data, we show that alignment-free embeddings have better clustering properties than the Pangolin tool and that the spike region of the SARS-CoV-2 genome heavily informs the alignment-free clusterings, which is consistent with current biological knowledge of SARS-CoV-2.
△ Less
Submitted 15 November, 2022;
originally announced November 2022.
-
SUPRA: Superpixel Guided Loss for Improved Multi-modal Segmentation in Endoscopy
Authors:
Rafael Martinez-Garcia-Peña,
Mansoor Ali Teevno,
Gilberto Ochoa-Ruiz,
Sharib Ali
Abstract:
Domain shift is a well-known problem in the medical imaging community. In particular, for endoscopic image analysis where the data can have different modalities the performance of deep learning (DL) methods gets adversely affected. In other words, methods developed on one modality cannot be used for a different modality. However, in real clinical settings, endoscopists switch between modalities fo…
▽ More
Domain shift is a well-known problem in the medical imaging community. In particular, for endoscopic image analysis where the data can have different modalities the performance of deep learning (DL) methods gets adversely affected. In other words, methods developed on one modality cannot be used for a different modality. However, in real clinical settings, endoscopists switch between modalities for better mucosal visualisation. In this paper, we explore the domain generalisation technique to enable DL methods to be used in such scenarios. To this extend, we propose to use super pixels generated with Simple Linear Iterative Clustering (SLIC) which we refer to as "SUPRA" for SUPeRpixel Augmented method. SUPRA first generates a preliminary segmentation mask making use of our new loss "SLICLoss" that encourages both an accurate and color-consistent segmentation. We demonstrate that SLICLoss when combined with Binary Cross Entropy loss (BCE) can improve the model's generalisability with data that presents significant domain shift. We validate this novel compound loss on a vanilla U-Net using the EndoUDA dataset, which contains images for Barret's Esophagus and polyps from two modalities. We show that our method yields an improvement of nearly 20% in the target domain set compared to the baseline.
△ Less
Submitted 9 April, 2023; v1 submitted 8 November, 2022;
originally announced November 2022.
-
Efficient Approximate Kernel Based Spike Sequence Classification
Authors:
Sarwan Ali,
Bikram Sahoo,
Muhammad Asad Khan,
Alexander Zelikovsky,
Imdad Ullah Khan,
Murray Patterson
Abstract:
Machine learning (ML) models, such as SVM, for tasks like classification and clustering of sequences, require a definition of distance/similarity between pairs of sequences. Several methods have been proposed to compute the similarity between sequences, such as the exact approach that counts the number of matches between $k$-mers (sub-sequences of length $k$) and an approximate approach that estim…
▽ More
Machine learning (ML) models, such as SVM, for tasks like classification and clustering of sequences, require a definition of distance/similarity between pairs of sequences. Several methods have been proposed to compute the similarity between sequences, such as the exact approach that counts the number of matches between $k$-mers (sub-sequences of length $k$) and an approximate approach that estimates pairwise similarity scores. Although exact methods yield better classification performance, they pose high computational costs, limiting their applicability to a small number of sequences. The approximate algorithms are proven to be more scalable and perform comparably to (sometimes better than) the exact methods -- they are designed in a "general" way to deal with different types of sequences (e.g., music, protein, etc.). Although general applicability is a desired property of an algorithm, it is not the case in all scenarios. For example, in the current COVID-19 (coronavirus) pandemic, there is a need for an approach that can deal specifically with the coronavirus. To this end, we propose a series of ways to improve the performance of the approximate kernel (using minimizers and information gain) in order to enhance its predictive performance pm coronavirus sequences. More specifically, we improve the quality of the approximate kernel using domain knowledge (computed using information gain) and efficient preprocessing (using minimizers computation) to classify coronavirus spike protein sequences corresponding to different variants (e.g., Alpha, Beta, Gamma). We report results using different classification and clustering algorithms and evaluate their performance using multiple evaluation metrics. Using two datasets, we show that our proposed method helps improve the kernel's performance compared to the baseline and state-of-the-art approaches in the healthcare domain.
△ Less
Submitted 11 September, 2022;
originally announced September 2022.
-
Benchmarking Machine Learning Robustness in Covid-19 Genome Sequence Classification
Authors:
Sarwan Ali,
Bikram Sahoo,
Alexander Zelikovskiy,
Pin-Yu Chen,
Murray Patterson
Abstract:
The rapid spread of the COVID-19 pandemic has resulted in an unprecedented amount of sequence data of the SARS-CoV-2 genome -- millions of sequences and counting. This amount of data, while being orders of magnitude beyond the capacity of traditional approaches to understanding the diversity, dynamics, and evolution of viruses is nonetheless a rich resource for machine learning (ML) approaches as…
▽ More
The rapid spread of the COVID-19 pandemic has resulted in an unprecedented amount of sequence data of the SARS-CoV-2 genome -- millions of sequences and counting. This amount of data, while being orders of magnitude beyond the capacity of traditional approaches to understanding the diversity, dynamics, and evolution of viruses is nonetheless a rich resource for machine learning (ML) approaches as alternatives for extracting such important information from these data. It is of hence utmost importance to design a framework for testing and benchmarking the robustness of these ML models.
This paper makes the first effort (to our knowledge) to benchmark the robustness of ML models by simulating biological sequences with errors. In this paper, we introduce several ways to perturb SARS-CoV-2 genome sequences to mimic the error profiles of common sequencing platforms such as Illumina and PacBio. We show from experiments on a wide array of ML models that some simulation-based approaches are more robust (and accurate) than others for specific embedding methods to certain adversarial attacks to the input sequences. Our benchmarking framework may assist researchers in properly assessing different ML models and help them understand the behavior of the SARS-CoV-2 virus or avoid possible future pandemics.
△ Less
Submitted 18 July, 2022;
originally announced July 2022.
-
PWM2Vec: An Efficient Embedding Approach for Viral Host Specification from Coronavirus Spike Sequences
Authors:
Sarwan Ali,
Babatunde Bello,
Prakash Chourasia,
Ria Thazhe Punathil,
Yijing Zhou,
Murray Patterson
Abstract:
COVID-19 pandemic, is still unknown and is an important open question. There are speculations that bats are a possible origin. Likewise, there are many closely related (corona-) viruses, such as SARS, which was found to be transmitted through civets. The study of the different hosts which can be potential carriers and transmitters of deadly viruses to humans is crucial to understanding, mitigating…
▽ More
COVID-19 pandemic, is still unknown and is an important open question. There are speculations that bats are a possible origin. Likewise, there are many closely related (corona-) viruses, such as SARS, which was found to be transmitted through civets. The study of the different hosts which can be potential carriers and transmitters of deadly viruses to humans is crucial to understanding, mitigating and preventing current and future pandemics. In coronaviruses, the surface (S) protein, or spike protein, is an important part of determining host specificity since it is the point of contact between the virus and the host cell membrane. In this paper, we classify the hosts of over five thousand coronaviruses from their spike protein sequences, segregating them into clusters of distinct hosts among avians, bats, camels, swines, humans and weasels, to name a few. We propose a feature embedding based on the well-known position-weight matrix (PWM), which we call PWM2Vec, and use to generate feature vectors from the spike protein sequences of these coronaviruses. While our embedding is inspired by the success of PWMs in biological applications such as determining protein function, or identifying transcription factor binding sites, we are the first (to the best of our knowledge) to use PWMs in the context of host classification from viral sequences to generate a fixed-length feature vector representation. The results on the real world data show that in using PWM2Vec, we are able to perform comparably well as compared to baseline models. We also measure the importance of different amino acids using information gain to show the amino acids which are important for predicting the host of a given coronavirus.
△ Less
Submitted 6 January, 2022;
originally announced January 2022.
-
Robust Representation and Efficient Feature Selection Allows for Effective Clustering of SARS-CoV-2 Variants
Authors:
Zahra Tayebi,
Sarwan Ali,
Murray Patterson
Abstract:
The widespread availability of large amounts of genomic data on the SARS-CoV-2 virus, as a result of the COVID-19 pandemic, has created an opportunity for researchers to analyze the disease at a level of detail unlike any virus before it. One one had, this will help biologists, policy makers and other authorities to make timely and appropriate decisions to control the spread of the coronavirus. On…
▽ More
The widespread availability of large amounts of genomic data on the SARS-CoV-2 virus, as a result of the COVID-19 pandemic, has created an opportunity for researchers to analyze the disease at a level of detail unlike any virus before it. One one had, this will help biologists, policy makers and other authorities to make timely and appropriate decisions to control the spread of the coronavirus. On the other hand, such studies will help to more effectively deal with any possible future pandemic. Since the SARS-CoV-2 virus contains different variants, each of them having different mutations, performing any analysis on such data becomes a difficult task. It is well known that much of the variation in the SARS-CoV-2 genome happens disproportionately in the spike region of the genome sequence -- the relatively short region which codes for the spike protein(s). Hence, in this paper, we propose an approach to cluster spike protein sequences in order to study the behavior of different known variants that are increasing at very high rate throughout the world. We use a k-mers based approach to first generate a fixed-length feature vector representation for the spike sequences. We then show that with the appropriate feature selection, we can efficiently and effectively cluster the spike sequences based on the different variants. Using a publicly available set of SARS-CoV-2 spike sequences, we perform clustering of these sequences using both hard and soft clustering methods and show that with our feature selection methods, we can achieve higher F1 scores for the clusters.
△ Less
Submitted 18 October, 2021;
originally announced October 2021.
-
Characterizing SARS-CoV-2 Spike Sequences Based on Geographical Location
Authors:
Sarwan Ali,
Babatunde Bello,
Zahra Tayebi,
Murray Patterson
Abstract:
With the rapid spread of COVID-19 worldwide, viral genomic data is available in the order of millions of sequences on public databases such as GISAID. This Big Data creates a unique opportunity for analysis towards the research of effective vaccine development for current pandemics, and avoiding or mitigating future pandemics. One piece of information that comes with every such viral sequence is t…
▽ More
With the rapid spread of COVID-19 worldwide, viral genomic data is available in the order of millions of sequences on public databases such as GISAID. This Big Data creates a unique opportunity for analysis towards the research of effective vaccine development for current pandemics, and avoiding or mitigating future pandemics. One piece of information that comes with every such viral sequence is the geographical location where it was collected -- the patterns found between viral variants and geographical location surely being an important part of this analysis. One major challenge that researchers face is processing such huge, highly dimensional data to obtain useful insights as quickly as possible. Most of the existing methods face scalability issues when dealing with the magnitude of such data. In this paper, we propose an approach that first computes a numerical representation of the spike protein sequence of SARS-CoV-2 using $k$-mers (substrings) and then uses several machine learning models to classify the sequences based on geographical location. We show that our proposed model significantly outperforms the baselines. We also show the importance of different amino acids in the spike sequences by computing the information gain corresponding to the true class labels.
△ Less
Submitted 12 October, 2022; v1 submitted 2 October, 2021;
originally announced October 2021.
-
Spike2Vec: An Efficient and Scalable Embedding Approach for COVID-19 Spike Sequences
Authors:
Sarwan Ali,
Murray Patterson
Abstract:
With the rapid global spread of COVID-19, more and more data related to this virus is becoming available, including genomic sequence data. The total number of genomic sequences that are publicly available on platforms such as GISAID is currently several million, and is increasing with every day. The availability of such \emph{Big Data} creates a new opportunity for researchers to study this virus…
▽ More
With the rapid global spread of COVID-19, more and more data related to this virus is becoming available, including genomic sequence data. The total number of genomic sequences that are publicly available on platforms such as GISAID is currently several million, and is increasing with every day. The availability of such \emph{Big Data} creates a new opportunity for researchers to study this virus in detail. This is particularly important with all of the dynamics of the COVID-19 variants which emerge and circulate. This rich data source will give us insights on the best ways to perform genomic surveillance for this and future pandemic threats, with the ultimate goal of mitigating or eliminating such threats. Analyzing and processing the several million genomic sequences is a challenging task. Although traditional methods for sequence classification are proven to be effective, they are not designed to deal with these specific types of genomic sequences. Moreover, most of the existing methods also face the issue of scalability. Previous studies which were tailored to coronavirus genomic data proposed to use spike sequences (corresponding to a subsequence of the genome), rather than using the complete genomic sequence, to perform different machine learning (ML) tasks such as classification and clustering. However, those methods suffer from scalability issues. In this paper, we propose an approach called Spike2Vec, an efficient and scalable feature vector representation for each spike sequence that can be used for downstream ML tasks. Through experiments, we show that Spike2Vec is not only scalable on several million spike sequences, but also outperforms the baseline models in terms of prediction accuracy, F1 score, etc.
△ Less
Submitted 15 November, 2021; v1 submitted 11 September, 2021;
originally announced September 2021.
-
Effective and scalable clustering of SARS-CoV-2 sequences
Authors:
Sarwan Ali,
Tamkanat-E-Ali,
Muhammad Asad Khan,
Imdadullah Khan,
Murray Patterson
Abstract:
SARS-CoV-2, like any other virus, continues to mutate as it spreads, according to an evolutionary process. Unlike any other virus, the number of currently available sequences of SARS-CoV-2 in public databases such as GISAID is already several million. This amount of data has the potential to uncover the evolutionary dynamics of a virus like never before. However, a million is already several order…
▽ More
SARS-CoV-2, like any other virus, continues to mutate as it spreads, according to an evolutionary process. Unlike any other virus, the number of currently available sequences of SARS-CoV-2 in public databases such as GISAID is already several million. This amount of data has the potential to uncover the evolutionary dynamics of a virus like never before. However, a million is already several orders of magnitude beyond what can be processed by the traditional methods designed to reconstruct a virus's evolutionary history, such as those that build a phylogenetic tree. Hence, new and scalable methods will need to be devised in order to make use of the ever increasing number of viral sequences being collected.
Since identifying variants is an important part of understanding the evolution of a virus, in this paper, we propose an approach based on clustering sequences to identify the current major SARS-CoV-2 variants. Using a $k$-mer based feature vector generation and efficient feature selection methods, our approach is effective in identifying variants, as well as being efficient and scalable to millions of sequences. Such a clustering method allows us to show the relative proportion of each variant over time, giving the rate of spread of each variant in different locations -- something which is important for vaccine development and distribution. We also compute the importance of each amino acid position of the spike protein in identifying a given variant in terms of information gain. Positions of high variant-specific importance tend to agree with those reported by the USA's Centers for Disease Control and Prevention (CDC), further demonstrating our approach.
△ Less
Submitted 12 October, 2021; v1 submitted 18 August, 2021;
originally announced August 2021.
-
A k-mer Based Approach for SARS-CoV-2 Variant Identification
Authors:
Sarwan Ali,
Bikram Sahoo,
Naimat Ullah,
Alexander Zelikovskiy,
Murray Patterson,
Imdadullah Khan
Abstract:
With the rapid spread of the novel coronavirus (COVID-19) across the globe and its continuous mutation, it is of pivotal importance to design a system to identify different known (and unknown) variants of SARS-CoV-2. Identifying particular variants helps to understand and model their spread patterns, design effective mitigation strategies, and prevent future outbreaks. It also plays a crucial role…
▽ More
With the rapid spread of the novel coronavirus (COVID-19) across the globe and its continuous mutation, it is of pivotal importance to design a system to identify different known (and unknown) variants of SARS-CoV-2. Identifying particular variants helps to understand and model their spread patterns, design effective mitigation strategies, and prevent future outbreaks. It also plays a crucial role in studying the efficacy of known vaccines against each variant and modeling the likelihood of breakthrough infections. It is well known that the spike protein contains most of the information/variation pertaining to coronavirus variants.
In this paper, we use spike sequences to classify different variants of the coronavirus in humans. We show that preserving the order of the amino acids helps the underlying classifiers to achieve better performance. We also show that we can train our model to outperform the baseline algorithms using only a small number of training samples ($1\%$ of the data). Finally, we show the importance of the different amino acids which play a key role in identifying variants and how they coincide with those reported by the USA's Centers for Disease Control and Prevention (CDC).
△ Less
Submitted 12 October, 2021; v1 submitted 7 August, 2021;
originally announced August 2021.
-
Predicting Clinical Outcomes in COVID-19 using Radiomics and Deep Learning on Chest Radiographs: A Multi-Institutional Study
Authors:
Joseph Bae,
Saarthak Kapse,
Gagandeep Singh,
Rishabh Gattu,
Syed Ali,
Neal Shah,
Colin Marshall,
Jonathan Pierce,
Tej Phatak,
Amit Gupta,
Jeremy Green,
Nikhil Madan,
Prateek Prasanna
Abstract:
We predict mechanical ventilation requirement and mortality using computational modeling of chest radiographs (CXRs) for coronavirus disease 2019 (COVID-19) patients. This two-center, retrospective study analyzed 530 deidentified CXRs from 515 COVID-19 patients treated at Stony Brook University Hospital and Newark Beth Israel Medical Center between March and August 2020. DL and machine learning cl…
▽ More
We predict mechanical ventilation requirement and mortality using computational modeling of chest radiographs (CXRs) for coronavirus disease 2019 (COVID-19) patients. This two-center, retrospective study analyzed 530 deidentified CXRs from 515 COVID-19 patients treated at Stony Brook University Hospital and Newark Beth Israel Medical Center between March and August 2020. DL and machine learning classifiers to predict mechanical ventilation requirement and mortality were trained and evaluated using patient CXRs. A novel radiomic embedding framework was also explored for outcome prediction. All results are compared against radiologist grading of CXRs (zone-wise expert severity scores). Radiomic and DL classification models had mAUCs of 0.78+/-0.02 and 0.81+/-0.04, compared with expert scores mAUCs of 0.75+/-0.02 and 0.79+/-0.05 for mechanical ventilation requirement and mortality prediction, respectively. Combined classifiers using both radiomics and expert severity scores resulted in mAUCs of 0.79+/-0.04 and 0.83+/-0.04 for each prediction task, demonstrating improvement over either artificial intelligence or radiologist interpretation alone. Our results also suggest instances where inclusion of radiomic features in DL improves model predictions, something that might be explored in other pathologies. The models proposed in this study and the prognostic information they provide might aid physician decision making and resource allocation during the COVID-19 pandemic.
△ Less
Submitted 1 July, 2021; v1 submitted 15 July, 2020;
originally announced July 2020.
-
Self-Attentive Adversarial Stain Normalization
Authors:
Aman Shrivastava,
Will Adorno,
Yash Sharma,
Lubaina Ehsan,
S. Asad Ali,
Sean R. Moore,
Beatrice C. Amadi,
Paul Kelly,
Sana Syed,
Donald E. Brown
Abstract:
Hematoxylin and Eosin (H&E) stained Whole Slide Images (WSIs) are utilized for biopsy visualization-based diagnostic and prognostic assessment of diseases. Variation in the H&E staining process across different lab sites can lead to significant variations in biopsy image appearance. These variations introduce an undesirable bias when the slides are examined by pathologists or used for training dee…
▽ More
Hematoxylin and Eosin (H&E) stained Whole Slide Images (WSIs) are utilized for biopsy visualization-based diagnostic and prognostic assessment of diseases. Variation in the H&E staining process across different lab sites can lead to significant variations in biopsy image appearance. These variations introduce an undesirable bias when the slides are examined by pathologists or used for training deep learning models. To reduce this bias, slides need to be translated to a common domain of stain appearance before analysis. We propose a Self-Attentive Adversarial Stain Normalization (SAASN) approach for the normalization of multiple stain appearances to a common domain. This unsupervised generative adversarial approach includes self-attention mechanism for synthesizing images with finer detail while preserving the structural consistency of the biopsy features during translation. SAASN demonstrates consistent and superior performance compared to other popular stain normalization techniques on H&E stained duodenal biopsy image data.
△ Less
Submitted 22 November, 2020; v1 submitted 4 September, 2019;
originally announced September 2019.
-
Diagnosis of Celiac Disease and Environmental Enteropathy on Biopsy Images Using Color Balancing on Convolutional Neural Networks
Authors:
Kamran Kowsari,
Rasoul Sali,
Marium N. Khan,
William Adorno,
S. Asad Ali,
Sean R. Moore,
Beatrice C. Amadi,
Paul Kelly,
Sana Syed,
Donald E. Brown
Abstract:
Celiac Disease (CD) and Environmental Enteropathy (EE) are common causes of malnutrition and adversely impact normal childhood development. CD is an autoimmune disorder that is prevalent worldwide and is caused by an increased sensitivity to gluten. Gluten exposure destructs the small intestinal epithelial barrier, resulting in nutrient mal-absorption and childhood under-nutrition. EE also results…
▽ More
Celiac Disease (CD) and Environmental Enteropathy (EE) are common causes of malnutrition and adversely impact normal childhood development. CD is an autoimmune disorder that is prevalent worldwide and is caused by an increased sensitivity to gluten. Gluten exposure destructs the small intestinal epithelial barrier, resulting in nutrient mal-absorption and childhood under-nutrition. EE also results in barrier dysfunction but is thought to be caused by an increased vulnerability to infections. EE has been implicated as the predominant cause of under-nutrition, oral vaccine failure, and impaired cognitive development in low-and-middle-income countries. Both conditions require a tissue biopsy for diagnosis, and a major challenge of interpreting clinical biopsy images to differentiate between these gastrointestinal diseases is striking histopathologic overlap between them. In the current study, we propose a convolutional neural network (CNN) to classify duodenal biopsy images from subjects with CD, EE, and healthy controls. We evaluated the performance of our proposed model using a large cohort containing 1000 biopsy images. Our evaluations show that the proposed model achieves an area under ROC of 0.99, 1.00, and 0.97 for CD, EE, and healthy controls, respectively. These results demonstrate the discriminative power of the proposed model in duodenal biopsies classification.
△ Less
Submitted 9 October, 2019; v1 submitted 10 April, 2019;
originally announced April 2019.
-
Hair histology as a tool for forensic identification of some domestic animal species
Authors:
Yasser A. Ahmed,
Safwat Ali,
Ahmed Ghallab
Abstract:
Animal hair examination at a criminal scene may provide valuable information in forensic investigations. However, local reference databases for animal hair identification are rare. In the present study, we provide differential histological analysis of hair of some domestic animals in Upper Egypt. For this purpose, guard hair of large ruminants (buffalo, camel and cow), small ruminants (sheep and g…
▽ More
Animal hair examination at a criminal scene may provide valuable information in forensic investigations. However, local reference databases for animal hair identification are rare. In the present study, we provide differential histological analysis of hair of some domestic animals in Upper Egypt. For this purpose, guard hair of large ruminants (buffalo, camel and cow), small ruminants (sheep and goat), equine (horse and donkey) and canine (dog and cat) were collected and comparative analysis was performed by light microscopy. Based on the hair cuticle scale pattern, type and diameter of the medulla, and the pigmentation, characteristic differential features of each animal species were identified. The cuticle scale pattern was imbricate in all tested animals except in donkey, in which coronal scales were identified. The cuticle scale margin type, shape and the distance in between were characteristic for each animal species. The hair medulla was continuous in most of the tested animal species with the exception of sheep, in which fragmental medulla was detected. The diameter of the hair medulla and the margins differ according to the animal species. Hair shaft pigmentation were not detected in all tested animals with the exception of camel and buffalo, in which granules and streak-like pigmentation were detected. In conclusion, the present study provides a first-step towards preparation of a complete local reference database for animal hair identification that can be used in forensic investigations.
△ Less
Submitted 7 July, 2018;
originally announced July 2018.
-
Threshold Response to Stochasticity in Morphogenesis
Authors:
George Courcoubetis,
Sammi Ali,
Sergey V Nuzhdin,
Paul Marjoram,
Stephan Haas
Abstract:
During development of biological organisms, multiple complex structures are formed. In many instances, these structures need to exhibit a high degree of order to be functional, although many of their constituents are intrinsically stochastic. Hence, it has been suggested that biological robustness ultimately must rely on complex gene regulatory networks and clean-up mechanisms. Here we explore dev…
▽ More
During development of biological organisms, multiple complex structures are formed. In many instances, these structures need to exhibit a high degree of order to be functional, although many of their constituents are intrinsically stochastic. Hence, it has been suggested that biological robustness ultimately must rely on complex gene regulatory networks and clean-up mechanisms. Here we explore developmental processes that have evolved inherent robustness against stochasticity. In the context of the Drosophila eye disc, multiple optical units, ommatidia, develop into crystal-like patterns. During the larva-to-pupa stage of metamorphosis, the centers of the ommatidia are specified initially through the diffusion of morphogens, followed by the specification of R8 cells. Establishing the R8 cell is crucial in setting up the geometric, and functional, relationships of cells within an ommatidium and among neighboring ommatidia. Here we study a mathematical model of these spatio-temporal processes in the presence of stochasticity, defining and applying measures that quantify order within the resulting spatial patterns. We observe a universal sigmoidal response to increasing transcriptional noise. Ordered patterns persist up to a threshold noise level in the model parameters. As the noise is further increased past a threshold point of no return, these ordered patterns rapidly become disordered. Such robustness in development allows for the accumulation of genetic variation without any observable changes in phenotype. We argue that the observed sigmoidal dependence introduces robustness allowing for sizable amounts of genetic variation and transcriptional noise to be tolerated in natural populations without resulting in phenotype variation.
△ Less
Submitted 5 April, 2018;
originally announced April 2018.
-
Pharmacokinetics and Molecular Docking studies of Plant-Derived Natural Compounds to Exploring Potential Anti-Alzheimer Activity
Authors:
Aftab Alam,
Naaila Tamkeen,
Nikhat Imam,
Anam Farooqui,
Mohd Murshad Ahmed,
Shahnawaz Ali,
Md Zubbair Malik,
Romana Ishrat
Abstract:
Alzheimer disease (AD) is the leading cause of dementia, accounts for 60 to 80 percent cases. Two main factors called beta amyloid plaques and tangles are prime suspects in damaging and killing nerve cells. However, oxidative stress, the process which produces free radicals in cells, is believed to promote its progression to the extent that it may responsible for the cognitive and functional decli…
▽ More
Alzheimer disease (AD) is the leading cause of dementia, accounts for 60 to 80 percent cases. Two main factors called beta amyloid plaques and tangles are prime suspects in damaging and killing nerve cells. However, oxidative stress, the process which produces free radicals in cells, is believed to promote its progression to the extent that it may responsible for the cognitive and functional decline observed in AD. As of today there are few FDA approved drugs in the market for treatment, but their cholinergic adverse effect, potentially distressing toxicity and limited targets in AD pathology limits their use. Therefore, it is crucial to find an effective compounds to combat AD. We choose 45 plant derived natural compounds that have antioxidant properties to slow down disease progression by quenching free redicals or promoting endogenous antioxidant capacity. However, we performed molecular docking studies to investigate the binding interactions between natural compounds and 13 various anti Alzheimer drug targets. Three known Cholinesterase inhibitors (Donepezil, Galantamine and Rivastigmine) were taken as reference drugs over natural compounds for comparison and drug likeness studies. Few of these compounds showed good inhibitory activity besides anti oxidant activity. Most of these compounds followed pharmacokinetics properties that make them potentially promising drug candidates for the treatment of Alzheimer disease.
△ Less
Submitted 19 July, 2017;
originally announced September 2017.
-
Dynamics of p53 and Wnt cross talk
Authors:
Md. Zubbair Malik,
Shahnawaz Ali,
Md. Jahoor Alam,
Romana Ishrat,
R. K. Brojen Singh
Abstract:
We present the mechanism of interaction of Wnt network module, which is responsible for periodic sometogenesis, with p53 regulatory network, which is one of the main regulators of various cellular functions, and switching of various oscillating states by investigating p53-Wnt model. The variation in Nutlin concentration in p53 regulating network drives the Wnt network module to different states, s…
▽ More
We present the mechanism of interaction of Wnt network module, which is responsible for periodic sometogenesis, with p53 regulatory network, which is one of the main regulators of various cellular functions, and switching of various oscillating states by investigating p53-Wnt model. The variation in Nutlin concentration in p53 regulating network drives the Wnt network module to different states, stabilized, damped and sustain oscillation states, and even to cycle arrest. Similarly, the change in Axin concentration in Wnt could able to modulate the p53 dynamics at these states. We then solve the set of coupled ordinary differential equations of the model using quasi steady state approximation. We, further, demonstrate the change of p53 and Gsk3 interaction rate, due to hypothetical catalytic reaction or external stimuli, can able to regulate the dynamics of the two network modules, and even can control their dynamics to protect the system from cycle arrest (apoptosis).
△ Less
Submitted 16 March, 2015;
originally announced March 2015.
-
Dynamical System Modeling Of Immune Reconstitution Following Allogeneic Stem Cell Transplantation Identifies Patients At Risk For Adverse Outcomes
Authors:
Amir A. Toor,
Roy T. Sabo,
Catherine H. Roberts,
Bonny L. Moore,
Salman R. Salman,
Allison F. Scalora,
May T. Aziz,
Ali S. Shubar Ali,
Charles E. Hall,
Jeremy Meier,
Radhika M. Thorn,
Elaine Wang,
Shiyu Song,
Kristin Miller,
Kathryn Rizzo,
William B. Clark,
John M. McCarty,
Harold M. Chung,
Masoud H. Manjili,
Michael C. Neale
Abstract:
Systems that evolve over time and follow mathematical laws as they do so, are called dynamical systems. Lymphocyte recovery and clinical outcomes in 41 allograft recipients conditioned using anti-thymocyte globulin (ATG) and 4.5 Gray total-body-irradiation were studied to determine if immune reconstitution could be described as a dynamical system. Survival, relapse, and graft vs. host disease (GVH…
▽ More
Systems that evolve over time and follow mathematical laws as they do so, are called dynamical systems. Lymphocyte recovery and clinical outcomes in 41 allograft recipients conditioned using anti-thymocyte globulin (ATG) and 4.5 Gray total-body-irradiation were studied to determine if immune reconstitution could be described as a dynamical system. Survival, relapse, and graft vs. host disease (GVHD) were not significantly different in two cohorts of patients receiving different doses of ATG. However, donor-derived CD3+ (ddCD3) cell reconstitution was superior in the lower ATG dose cohort, and there were fewer instances of donor lymphocyte infusion (DLI). Lymphoid recovery was plotted in each individual over time and demonstrated one of three sigmoid growth patterns; Pattern A (n=15), had rapid growth with high lymphocyte counts, pattern B (n=14), slower growth with intermediate recovery and pattern C, poor lymphocyte reconstitution (n=10). There was a significant association between lymphocyte recovery patterns and both the rate of change of ddCD3 at day 30 post-SCT and the clinical outcomes. GVHD was observed more frequently with pattern A; relapse and DLI more so with pattern C, with a consequent survival advantage in patients with patterns A and B. We conclude that evaluating immune reconstitution following SCT as a dynamical system may differentiate patients at risk of adverse outcomes and allow early intervention to modulate that risk.
△ Less
Submitted 14 December, 2014;
originally announced December 2014.