Search | arXiv e-print repository

arXiv:2506.19106 [pdf, ps, other]

Staining normalization in histopathology: Method benchmarking using multicenter dataset

Authors: Umair Khan, Jouni Härkönen, Marjukka Friman, Leena Latonen, Teijo Kuopio, Pekka Ruusuvuori

Abstract: Hematoxylin and Eosin (H&E) has been the gold standard in tissue analysis for decades; however, tissue specimens stained in different laboratories vary, often significantly, in appearance. This variation poses a challenge for both pathologists' and AI-based downstream analysis. Minimizing stain variation computationally is an active area of research. To further investigate this problem, we collected a unique multi-center tissue image dataset, wherein tissue samples from colon, kidney, and skin tissue blocks were distributed to 66 different labs for routine H&E staining. To isolate staining variation, other factors affecting the tissue appearance were kept constant. Further, we used this tissue image dataset to compare the performance of eight different stain normalization methods, including four traditional methods, namely, histogram matching, Macenko, Vahadane, and Reinhard normalization, and two deep learning-based methods, namely CycleGAN and Pix2pix, both with two variants each. We used both quantitative and qualitative evaluation to assess the performance of these methods. The dataset's inter-laboratory staining variation could also guide strategies to improve model generalizability through varied training data.

Submitted 23 June, 2025; originally announced June 2025.

Comments: 18 pages, 9 figures

ACM Class: I.2.1; I.4.0

arXiv:2412.20616 [pdf, other]

Hilbert Curve Based Molecular Sequence Analysis

Authors: Sarwan Ali, Tamkanat E Ali, Imdad Ullah Khan, Murray Patterson

Abstract: Accurate molecular sequence analysis is a key task in the field of bioinformatics. To apply molecular sequence classification algorithms, we first need to generate the appropriate representations of the sequences. Traditional numeric sequence representation techniques are mostly based on sequence alignment that faces limitations in the form of lack of accuracy. Although several alignment-free techniques have also been introduced, their tabular data form results in low performance when used with Deep Learning (DL) models compared to the competitive performance observed in the case of image-based data. To find a solution to this problem and to make Deep Learning (DL) models function to their maximum potential while capturing the important spatial information in the sequence data, we propose a universal Hilbert curve-based Chaos Game Representation (CGR) method. This method is a transformative function that involves a novel Alphabetic index mapping technique used in constructing Hilbert curve-based image representation from molecular sequences. Our method can be globally applied to any type of molecular sequence data. The Hilbert curve-based image representations can be used as input to sophisticated vision DL models for sequence classification. The proposed method shows promising results as it outperforms current state-of-the-art methods by achieving a high accuracy of $94.5\%$ and an F1 score of $93.9\%$ when tested with the CNN model on the lung cancer dataset. This approach opens up a new horizon for exploring molecular sequence analysis using image classification methods.

Submitted 29 December, 2024; originally announced December 2024.
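
The abstract above describes laying a molecular sequence out along a Hilbert curve to form an image. The following is only a minimal sketch of that general idea, not the authors' method: `d2xy` is the standard distance-to-coordinate conversion for a Hilbert curve, while `sequence_to_hilbert_image` and its pixel encoding (position of the character in the alphabet, plus one) are hypothetical stand-ins for the paper's Alphabetic index mapping, whose details the abstract does not give.

```python
def d2xy(n, d):
    """Map distance d along a Hilbert curve to (x, y) on an n x n grid.

    n must be a power of two and 0 <= d < n * n.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        # Rotate/flip the quadrant so sub-curves join end to end.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def sequence_to_hilbert_image(seq, alphabet):
    """Place each character of seq at successive Hilbert-curve positions.

    Assumption (not from the paper): a pixel holds the 1-based index of the
    character in `alphabet`; unknown characters and padding are 0.
    """
    code = {ch: i + 1 for i, ch in enumerate(alphabet)}
    n = 1
    while n * n < len(seq):  # smallest power-of-two grid that fits seq
        n *= 2
    image = [[0] * n for _ in range(n)]
    for d, ch in enumerate(seq):
        x, y = d2xy(n, d)
        image[y][x] = code.get(ch, 0)
    return image


if __name__ == "__main__":
    for row in sequence_to_hilbert_image("ACGTTGCA", "ACGT"):
        print(row)
```

Because consecutive curve positions are always neighbouring pixels, residues that are adjacent in the sequence stay adjacent in the image, which is the locality property that makes this layout attractive as input to vision models.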

arXiv:2409.06694 [pdf, ps, other]

DANCE: Deep Learning-Assisted Analysis of Protein Sequences Using Chaos Enhanced Kaleidoscopic Images

Authors: Taslim Murad, Prakash Chourasia, Sarwan Ali, Imdad Ullah Khan, Murray Patterson

Abstract: Cancer is a complex disease characterized by uncontrolled cell growth. T cell receptors (TCRs), crucial proteins in the immune system, play a key role in recognizing antigens, including those associated with cancer. Recent advancements in sequencing technologies have facilitated comprehensive profiling of TCR repertoires, uncovering TCRs with potent anti-cancer activity and enabling TCR-based immunotherapies. However, analyzing these intricate biomolecules necessitates efficient representations that capture their structural and functional information. T-cell protein sequences pose unique challenges due to their relatively smaller lengths compared to other biomolecules. An image-based representation approach becomes a preferred choice for efficient embeddings, allowing for the preservation of essential details and enabling comprehensive analysis of T-cell protein sequences. In this paper, we propose to generate images from the protein sequences using the idea of Chaos Game Representation (CGR) using the Kaleidoscopic images approach. This Deep Learning Assisted Analysis of Protein Sequences Using Chaos Enhanced Kaleidoscopic Images (called DANCE) provides a unique way to visualize protein sequences by recursively applying chaos game rules around a central seed point. We perform the classification of the T cell receptors (TCRs) protein sequences in terms of their respective target cancer cells, as TCRs are known for their immune response against cancer disease. The TCR sequences are converted into images using the DANCE method. We employ deep-learning vision models to perform the classification to obtain insights into the relationship between the visual patterns observed in the generated kaleidoscopic images and the underlying protein properties. By combining CGR-based image generation with deep learning classification, this study opens novel possibilities in the protein analysis domain.

Submitted 11 June, 2025; v1 submitted 10 September, 2024; originally announced September 2024.

arXiv:2308.01920 [pdf, other]

Sequence-Based Nanobody-Antigen Binding Prediction

Authors: Usama Sardar, Sarwan Ali, Muhammad Sohaib Ayub, Muhammad Shoaib, Khurram Bashir, Imdad Ullah Khan, Murray Patterson

Abstract: Nanobodies (Nb) are monomeric heavy-chain fragments derived from heavy-chain only antibodies naturally found in Camelids and Sharks. Their considerably small size (~3-4 nm; 13 kDa) and favorable biophysical properties make them attractive targets for recombinant production. Furthermore, their unique ability to bind selectively to specific antigens, such as toxins, chemicals, bacteria, and viruses, makes them powerful tools in cell biology, structural biology, medical diagnostics, and future therapeutic agents in treating cancer and other serious illnesses. However, a critical challenge in nanobodies production is the unavailability of nanobodies for a majority of antigens. Although some computational methods have been proposed to screen potential nanobodies for given target antigens, their practical application is highly restricted due to their reliance on 3D structures. Moreover, predicting nanobody-antigen interactions (binding) is a time-consuming and labor-intensive task. This study aims to develop a machine-learning method to predict Nanobody-Antigen binding solely based on the sequence data. We curated a comprehensive dataset of Nanobody-Antigen binding and nonbinding data and devised an embedding method based on gapped k-mers to predict binding based only on sequences of nanobody and antigen. Our approach achieves up to 90% accuracy in binding prediction and is significantly more efficient compared to the widely-used computational docking technique.

Submitted 14 July, 2023; originally announced August 2023.

arXiv:2307.05519 [pdf]

Physical Color Calibration of Digital Pathology Scanners for Robust Artificial Intelligence Assisted Cancer Diagnosis

Authors: Xiaoyi Ji, Richard Salmon, Nita Mulliqi, Umair Khan, Yinxi Wang, Anders Blilie, Henrik Olsson, Bodil Ginnerup Pedersen, Karina Dalsgaard Sørensen, Benedicte Parm Ulhøi, Svein R Kjosavik, Emilius AM Janssen, Mattias Rantalainen, Lars Egevad, Pekka Ruusuvuori, Martin Eklund, Kimmo Kartasalo

Abstract: The potential of artificial intelligence (AI) in digital pathology is limited by technical inconsistencies in the production of whole slide images (WSIs), leading to degraded AI performance and posing a challenge for widespread clinical application as fine-tuning algorithms for each new site is impractical. Changes in the imaging workflow can also lead to compromised diagnoses and patient safety risks. We evaluated whether physical color calibration of scanners can standardize WSI appearance and enable robust AI performance. We employed a color calibration slide in four different laboratories and evaluated its impact on the performance of an AI system for prostate cancer diagnosis on 1,161 WSIs. Color standardization resulted in consistently improved AI model calibration and significant improvements in Gleason grading performance. The study demonstrates that physical color calibration provides a potential solution to the variation introduced by different scanners, making AI-based cancer diagnostics more reliable and applicable in clinical settings.

Submitted 7 July, 2023; originally announced July 2023.

arXiv:2306.05514 [pdf, other]

Robust Brain Age Estimation via Regression Models and MRI-derived Features

Authors: Mansoor Ahmed, Usama Sardar, Sarwan Ali, Shafiq Alam, Murray Patterson, Imdad Ullah Khan

Abstract: The determination of biological brain age is a crucial biomarker in the assessment of neurological disorders and understanding of the morphological changes that occur during aging. Various machine learning models have been proposed for estimating brain age through Magnetic Resonance Imaging (MRI) of healthy controls. However, developing a robust brain age estimation (BAE) framework has been challenging due to the selection of appropriate MRI-derived features and the high cost of MRI acquisition. In this study, we present a novel BAE framework using the Open Big Healthy Brain (OpenBHB) dataset, which is a new multi-site and publicly available benchmark dataset that includes region-wise feature metrics derived from T1-weighted (T1-w) brain MRI scans of 3965 healthy controls aged between 6 and 86 years. Our approach integrates three different MRI-derived region-wise features and different regression models, resulting in a highly accurate brain age estimation with a Mean Absolute Error (MAE) of 3.25 years, demonstrating the framework's robustness. We also analyze our model's regression-based performance on gender-wise (male and female) healthy test groups. The proposed BAE framework provides a new approach for estimating brain age, which has important implications for the understanding of neurological disorders and age-related brain changes.

Submitted 8 June, 2023; originally announced June 2023.

Comments: Published at the 15th International Conference on Computational Collective Intelligence

arXiv:2304.12328 [pdf, other]

Virus2Vec: Viral Sequence Classification Using Machine Learning

Authors: Sarwan Ali, Babatunde Bello, Prakash Chourasia, Ria Thazhe Punathil, Pin-Yu Chen, Imdad Ullah Khan, Murray Patterson

Abstract: Understanding the host-specificity of different families of viruses sheds light on the origin of, e.g., SARS-CoV-2, rabies, and other such zoonotic pathogens in humans. It enables epidemiologists, medical professionals, and policymakers to curb existing epidemics and prevent future ones promptly. In the family Coronaviridae (of which SARS-CoV-2 is a member), it is well-known that the spike protein is the point of contact between the virus and the host cell membrane. On the other hand, the two traditional mammalian orders, Carnivora (carnivores) and Chiroptera (bats) are recognized to be responsible for maintaining and spreading the Rabies Lyssavirus (RABV). We propose Virus2Vec, a feature-vector representation for viral (nucleotide or amino acid) sequences that enables vector-space-based machine learning models to identify viral hosts. Virus2Vec generates numerical feature vectors for unaligned sequences, allowing us to forego the computationally expensive sequence alignment step from the pipeline. Virus2Vec leverages the power of both the minimizer and position weight matrix (PWM) to generate compact feature vectors. Using several classifiers, we empirically evaluate Virus2Vec on real-world spike sequences of Coronaviridae and rabies virus sequence data to predict the host (identifying the reservoirs of infection). Our results demonstrate that Virus2Vec outperforms the predictive accuracies of baseline and state-of-the-art methods.

Submitted 24 April, 2023; originally announced April 2023.

Comments: 11 pages, 6 figures. Accepted at the Conference on Health, Inference, and Learning (CHIL) 2023
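
The Virus2Vec abstract mentions minimizers as one of its two building blocks. As a rough illustration only, the sketch below uses a common textbook definition (the lexicographically smallest m-mer inside each window of length k); the function names `minimizers` and `minimizer_counts` are hypothetical, the PWM step is not shown, and the paper may define or combine these differently.

```python
from collections import Counter


def minimizers(seq, k, m):
    """Return the minimizer of every length-k window of seq.

    Assumption: the minimizer of a window is its lexicographically
    smallest substring of length m (with 0 < m <= k).
    """
    if not 0 < m <= k:
        raise ValueError("need 0 < m <= k")
    result = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        result.append(min(window[j:j + m] for j in range(k - m + 1)))
    return result


def minimizer_counts(seq, k, m):
    """Count how often each minimizer occurs; a simple alignment-free feature."""
    return Counter(minimizers(seq, k, m))


if __name__ == "__main__":
    print(minimizers("ACGTAC", 4, 2))
    print(minimizer_counts("ACGTAC", 4, 2))
```

Adjacent windows often share the same minimizer, so the set of distinct minimizers is much smaller than the set of k-mers, which is one reason minimizer-based features can be compact.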

arXiv:2304.00291 [pdf, ps, other]

BioSequence2Vec: Efficient Embedding Generation For Biological Sequences

Authors: Sarwan Ali, Usama Sardar, Murray Patterson, Imdad Ullah Khan

Abstract: Representation learning is an important step in the machine learning pipeline. Given the current biological sequencing data volume, learning an explicit representation is prohibitive due to the dimensionality of the resulting feature vectors. Kernel-based methods, e.g., SVM, are a proven efficient and useful alternative for several machine learning (ML) tasks such as sequence classification. Three challenges with kernel methods are (i) the computation time, (ii) the memory usage (storing an $n\times n$ matrix), and (iii) the usage of kernel matrices limited to kernel-based ML methods (difficult to generalize on non-kernel classifiers). While (i) can be solved using approximate methods, challenge (ii) remains for typical kernel methods. Similarly, although non-kernel-based ML methods can be applied to kernel matrices by extracting principal components (kernel PCA), it may result in information loss, while being computationally expensive. In this paper, we propose a general-purpose representation learning approach that embodies kernel methods' qualities while avoiding computation, memory, and generalizability challenges. This involves computing a low-dimensional embedding of each sequence, using random projections of its $k$-mer frequency vectors, significantly reducing the computation needed to compute the dot product and the memory needed to store the resulting representation. Our proposed fast and alignment-free embedding method can be used as input to any distance (e.g., $k$ nearest neighbors) and non-distance (e.g., decision tree) based ML method for classification and clustering tasks. Using different forms of biological sequences as input, we perform a variety of real-world classification tasks, such as SARS-CoV-2 lineage and gene family classification, outperforming several state-of-the-art embedding and kernel methods in predictive performance.

Submitted 1 April, 2023; originally announced April 2023.

Comments: Accepted to PAKDD 2023
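
The BioSequence2Vec abstract describes embedding a sequence through random projections of its $k$-mer frequency vector. The sketch below shows one plain way this could look, under stated assumptions: a dense Gaussian projection matrix scaled by $1/\sqrt{d}$, and hypothetical helper names (`kmer_counts`, `random_projection_matrix`, `embed`). The paper's actual projection scheme is not specified in the abstract and may well differ.

```python
import itertools
import math
import random


def kmer_counts(seq, k, alphabet):
    """Count occurrences of every possible k-mer over `alphabet` in seq."""
    index = {"".join(p): i
             for i, p in enumerate(itertools.product(alphabet, repeat=k))}
    vec = [0] * len(index)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in index:  # skip k-mers containing unknown characters
            vec[index[kmer]] += 1
    return vec


def random_projection_matrix(in_dim, out_dim, seed=0):
    """Gaussian random matrix (assumed choice), scaled by 1/sqrt(out_dim)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def embed(seq, k, alphabet, matrix):
    """Project the k-mer count vector of seq to len(matrix) dimensions."""
    vec = kmer_counts(seq, k, alphabet)
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


if __name__ == "__main__":
    alphabet = "ACGT"
    k = 3
    matrix = random_projection_matrix(len(alphabet) ** k, 8, seed=42)
    print(embed("ACGTACGTTGCA", k, alphabet, matrix))
```

The same matrix must be reused for every sequence so that dot products between embeddings approximate dot products between the original count vectors; the resulting fixed-length vectors can then be passed to any classifier, kernel-based or not.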

arXiv:2211.08350 [pdf, other]

Motor imagery classification using EEG spectrograms

Authors: Saadat Ullah Khan, Muhammad Majid, Syed Muhammad Anwar

Abstract: The loss of limb motion arising from damage to the spinal cord is a disability that could affect people while performing their day-to-day activities. The restoration of limb movement would enable people with spinal cord injury to interact with their environment more naturally and this is where a brain-computer interface (BCI) system could be beneficial. The detection of limb movement imagination (MI) could be significant for such a BCI, where the detected MI can guide the computer system. Using MI detection through electroencephalography (EEG), we can recognize the imagination of movement in a user and translate this into a physical movement. In this paper, we utilize pre-trained deep learning (DL) algorithms for the classification of imagined upper limb movements. We use a publicly available EEG dataset with data representing seven classes of limb movements. We compute the spectrograms of the time series EEG signal and use them as an input to the DL model for MI classification. Our novel approach for the classification of upper limb movements using pre-trained DL algorithms and spectrograms has achieved significantly improved results for seven movement classes. When compared with the recently proposed state-of-the-art methods, our algorithm achieved a significant average accuracy of 84.9% for classifying seven movements.

Submitted 15 November, 2022; originally announced November 2022.

Comments: Submitted to ISBI 2023

arXiv:2210.14330 [pdf, other]

A single-cell gene expression language model

Authors: William Connell, Umair Khan, Michael J. Keiser

Abstract: Gene regulation is a dynamic process that connects genotype and phenotype. Given the difficulty of physically mapping mammalian gene circuitry, we require new computational methods to learn regulatory rules. Natural language is a valuable analogy to the communication of regulatory control. Machine learning systems model natural language by explicitly learning context dependencies between words. We propose a similar system applied to single-cell RNA expression profiles to learn context dependencies between genes. Our model, Exceiver, is trained across a diversity of cell types using a self-supervised task formulated for discrete count data, accounting for feature sparsity. We found agreement between the similarity profiles of latent sample representations and learned gene embeddings with respect to biological annotations. We evaluated Exceiver on a new dataset and a downstream prediction task and found that pretraining supports transfer learning. Our work provides a framework to model gene regulation on a single-cell level and transfer knowledge to downstream tasks.

Submitted 25 October, 2022; originally announced October 2022.

Comments: 10 pages, 5 figures, Accepted at Learning Meaningful Representations of Life Workshop, 36th Conference on Neural Information Processing Systems (NeurIPS 2022)

arXiv:2209.04952 [pdf, other]

Efficient Approximate Kernel Based Spike Sequence Classification

Authors: Sarwan Ali, Bikram Sahoo, Muhammad Asad Khan, Alexander Zelikovsky, Imdad Ullah Khan, Murray Patterson

Abstract: Machine learning (ML) models, such as SVM, for tasks like classification and clustering of sequences, require a definition of distance/similarity between pairs of sequences. Several methods have been proposed to compute the similarity between sequences, such as the exact approach that counts the number of matches between $k$-mers (sub-sequences of length $k$) and an approximate approach that estimates pairwise similarity scores. Although exact methods yield better classification performance, they pose high computational costs, limiting their applicability to a small number of sequences. The approximate algorithms are proven to be more scalable and perform comparably to (sometimes better than) the exact methods -- they are designed in a "general" way to deal with different types of sequences (e.g., music, protein, etc.). Although general applicability is a desired property of an algorithm, it is not the case in all scenarios. For example, in the current COVID-19 (coronavirus) pandemic, there is a need for an approach that can deal specifically with the coronavirus. To this end, we propose a series of ways to improve the performance of the approximate kernel (using minimizers and information gain) in order to enhance its predictive performance on coronavirus sequences. More specifically, we improve the quality of the approximate kernel using domain knowledge (computed using information gain) and efficient preprocessing (using minimizers computation) to classify coronavirus spike protein sequences corresponding to different variants (e.g., Alpha, Beta, Gamma). We report results using different classification and clustering algorithms and evaluate their performance using multiple evaluation metrics. Using two datasets, we show that our proposed method helps improve the kernel's performance compared to the baseline and state-of-the-art approaches in the healthcare domain.

Submitted 11 September, 2022; originally announced September 2022.

Comments: Accepted for publication at "IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)"

arXiv:2109.07846 [pdf, other]

Telehealthcare and Telepathology in Pandemic: A Noninvasive, Low-Cost Micro-Invasive and Multimodal Real-Time Online Application for Early Diagnosis of COVID-19 Infection

Authors: Abdullah Bin Shams, Md. Mohsin Sarker Raihan, Md. Mohi Uddin Khan, Ocean Monjur, Rahat Bin Preo

Abstract: To contain the spread of the virus and stop the overcrowding of hospitalized patients, the coronavirus pandemic crippled healthcare facilities, mandating lockdowns and promoting remote work. As a result, telehealth has become increasingly popular for offering low-risk care to patients. However, the difficulty of preventing the next potential waves of infection has been increased by constant virus mutation into new forms and a general lack of test kits, particularly in developing nations. In this research, a unique cloud-based application for the early identification of individuals who may have COVID-19 infection is proposed. The application provides five modes of diagnosis from possible symptoms (f1), cough sound (f2), specific blood biomarkers (f3), Raman spectral data of blood specimens (f4), and ECG signal paper-based image (f5). When a user selects an option and enters the information, the data is sent to the cloud server. The deployed machine learning (ML) and deep learning (DL) models classify the data in real time and inform the user of the likelihood of COVID-19 infection. Our deployed models can classify with an accuracy of 100%, 99.80%, 99.55%, 95.65%, and 77.59% from f3, f4, f5, f2, and f1 respectively. Moreover, the sensitivity for f2, f3, and f4 is 100%, which indicates the correct identification of COVID-positive patients. This is significant in limiting the spread of the virus. Additionally, another ML model, seen to offer 92% accuracy, serves to identify patients who, out of a large group of patients admitted to the hospital cohort, need immediate critical care support by estimating the mortality risk of patients from blood parameters. The instantaneous multimodal nature of our technique offers multiplex and accurate diagnostic methods, highlighting the effectiveness of telehealth as a simple, widely available, and low-cost diagnostic solution, even for future pandemics.

Submitted 15 October, 2022; v1 submitted 16 September, 2021; originally announced September 2021.

Comments: 32 Pages. This article has been submitted for review to a prestigious journal

arXiv:2108.05660 [pdf, other]

doi 10.1201/9781003256083

Development of a Risk-Free COVID-19 Screening Algorithm from Routine Blood Tests Using Ensemble Machine Learning

Authors: Md. Mohsin Sarker Raihan, Md. Mohi Uddin Khan, Laboni Akter, Abdullah Bin Shams

Abstract: The Reverse Transcription Polymerase Chain Reaction (RTPCR) test is the silver bullet diagnostic test to discern COVID infection. Rapid antigen detection is a screening test to identify COVID positive patients in as little as 15 minutes, but has a lower sensitivity than the PCR tests. Besides having multiple standardized test kits, many people are getting infected and either recovering or dying even before the test due to the shortage and cost of kits, lack of indispensable specialists and labs, time-consuming result compared to bulk population especially in developing and underdeveloped countries. Intrigued by the parametric deviations in immunological and hematological profile of a COVID patient, this research work leveraged the concept of COVID-19 detection by proposing a risk-free and highly accurate Stacked Ensemble Machine Learning model to identify a COVID patient from communally available-widespread-cheap routine blood tests which gives a promising accuracy, precision, recall and F1-score of 100%. Analysis from R-curve also shows the preciseness of the risk-free model to be implemented. The proposed method has the potential for large scale ubiquitous low-cost screening application. This can add an extra layer of protection in keeping the number of infected cases to a minimum and control the pandemic by identifying asymptomatic or pre-symptomatic people early.

Submitted 9 May, 2023; v1 submitted 12 August, 2021; originally announced August 2021.

Comments: Please read the (most updated) published version from here: https://doi.org/10.1201/9781003256083 and cite our article (Chapter-11). Video and BibTex citation format can be found in the description: https://youtu.be/Ci8dznDadJ4

Journal ref: Applied Intelligence for Industry 4.0. Chapman and Hall/CRC. 2023

arXiv:2101.03126 [pdf]

piSAAC: Extended notion of SAAC feature selection novel method for discrimination of Enzymes model using different machine learning algorithm

Authors: Zaheer Ullah Khan, Dechang Pi, Izhar Ahmed Khan, Asif Nawaz, Jamil Ahmad, Mushtaq Hussain

Abstract: Enzymes and proteins are live driven biochemicals, which have a dramatic impact over the environment in which they are active. Therefore, it is highly desirable to build a robust and highly accurate automatic computational model to accurately predict enzymes nature. In this study, a novel split amino acid composition model named piSAAC is proposed. In this model, protein sequence is discretized in equal and balanced terminus to fully evaluate the intrinsic correlation properties of the sequence. Several state-of-the-art algorithms have been employed to evaluate the proposed model. A 10-fold cross-validation evaluation is used for finding out the authenticity and robustness of the model using different statistical measures, e.g., accuracy, sensitivity, specificity, F-measure and area under ROC curve. The experimental results show that the probabilistic neural network algorithm with piSAAC feature extraction yields an accuracy of 98.01%, sensitivity of 97.12%, specificity of 95.87%, F-measure of 0.9812 and AUC 0.95812 over dataset S1, and an accuracy of 97.85%, sensitivity of 97.54%, specificity of 96.24%, F-measure of 0.9774 and AUC 0.9803 over dataset S2. Evident from these excellent empirical results, the proposed model would be a very useful tool for academic research and drug designing related application areas.

Submitted 15 December, 2020; originally announced January 2021.

Comments: 3 Figures, 5 Tables, 6 Pages

Showing 1–14 of 14 results for author: Khan, U