-
Efficient Rejection Sampling in the Entropy-Optimal Range
Authors:
Thomas L. Draper,
Feras A. Saad
Abstract:
The problem of generating a random variate $X$ from a finite discrete probability distribution $P$ using an entropy source of independent unbiased coin flips is considered. The Knuth and Yao complexity theory of nonuniform random number generation furnishes a family of "entropy-optimal" sampling algorithms that consume between $H(P)$ and $H(P)+2$ coin flips per generated output, where $H$ is the Shannon entropy function. However, the space complexity of entropy-optimal samplers scales exponentially with the number of bits required to encode $P$. This article introduces a family of efficient rejection samplers and characterizes their entropy, space, and time complexity. Within this family is a distinguished sampling algorithm that requires linearithmic space and preprocessing time, and whose expected entropy cost always falls in the entropy-optimal range $[H(P), H(P)+2)$. No previous sampler for discrete probability distributions is known to achieve these characteristics. Numerical experiments demonstrate performance improvements in runtime and entropy of the proposed algorithm compared to the celebrated alias method.
Submitted 5 April, 2025;
originally announced April 2025.
-
Gricean Norms as a Basis for Effective Collaboration
Authors:
Fardin Saad,
Pradeep K. Murukannaiah,
Munindar P. Singh
Abstract:
Effective human-AI collaboration hinges not only on the AI agent's ability to follow explicit instructions but also on its capacity to navigate ambiguity, incompleteness, invalidity, and irrelevance in communication. Gricean conversational and inference norms facilitate collaboration by aligning unclear instructions with cooperative principles. We propose a normative framework that integrates Gricean norms and cognitive frameworks -- common ground, relevance theory, and theory of mind -- into large language model (LLM) based agents. The normative framework adopts the Gricean maxims of quantity, quality, relation, and manner, along with inference, as Gricean norms to interpret unclear instructions, which are: ambiguous, incomplete, invalid, or irrelevant. Within this framework, we introduce Lamoids, GPT-4 powered agents designed to collaborate with humans. To assess the influence of Gricean norms in human-AI collaboration, we evaluate two versions of a Lamoid: one with norms and one without. In our experiments, a Lamoid collaborates with a human to achieve shared goals in a grid world (Doors, Keys, and Gems) by interpreting both clear and unclear natural language instructions. Our results reveal that the Lamoid with Gricean norms achieves higher task accuracy and generates clearer, more accurate, and contextually relevant responses than the Lamoid without norms. This improvement stems from the normative framework, which enhances the agent's pragmatic reasoning, fostering effective human-AI collaboration and enabling context-aware communication in LLM-based agents.
Submitted 18 March, 2025;
originally announced March 2025.
-
GenSQL: A Probabilistic Programming System for Querying Generative Models of Database Tables
Authors:
Mathieu Huot,
Matin Ghavami,
Alexander K. Lew,
Ulrich Schaechtle,
Cameron E. Freer,
Zane Shelby,
Martin C. Rinard,
Feras A. Saad,
Vikash K. Mansinghka
Abstract:
This article presents GenSQL, a probabilistic programming system for querying probabilistic generative models of database tables. By augmenting SQL with only a few key primitives for querying probabilistic models, GenSQL enables complex Bayesian inference workflows to be concisely implemented. GenSQL's query planner rests on a unified programmatic interface for interacting with probabilistic models of tabular data, which makes it possible to use models written in a variety of probabilistic programming languages that are tailored to specific workflows. Probabilistic models may be automatically learned via probabilistic program synthesis, hand-designed, or a combination of both. GenSQL is formalized using a novel type system and denotational semantics, which together enable us to establish proofs that precisely characterize its soundness guarantees. We evaluate our system on two real-world case studies -- anomaly detection in clinical trials and conditional synthetic data generation for a virtual wet lab -- and show that GenSQL more accurately captures the complexity of the data as compared to common baselines. We also show that the declarative syntax in GenSQL is more concise and less error-prone as compared to several alternatives. Finally, GenSQL delivers a 1.7-6.8x speedup compared to its closest competitor on a representative benchmark set and runs in comparable time to hand-written code, in part due to its reusable optimizations and code specialization.
Submitted 21 June, 2024;
originally announced June 2024.
-
Scalable Spatiotemporal Prediction with Bayesian Neural Fields
Authors:
Feras Saad,
Jacob Burnim,
Colin Carroll,
Brian Patton,
Urs Köster,
Rif A. Saurous,
Matthew Hoffman
Abstract:
Spatiotemporal datasets, which consist of spatially-referenced time series, are ubiquitous in diverse applications, such as air pollution monitoring, disease tracking, and cloud-demand forecasting. As the scale of modern datasets increases, there is a growing need for statistical methods that are flexible enough to capture complex spatiotemporal dynamics and scalable enough to handle many observations. This article introduces the Bayesian Neural Field (BayesNF), a domain-general statistical model that infers rich spatiotemporal probability distributions for data-analysis tasks including forecasting, interpolation, and variography. BayesNF integrates a deep neural network architecture for high-capacity function estimation with hierarchical Bayesian inference for robust predictive uncertainty quantification. Evaluations against prominent baselines show that BayesNF delivers improvements on prediction problems from climate and public health data containing tens to hundreds of thousands of measurements. Accompanying the paper is an open-source software package (https://github.com/google/bayesnf) that runs on GPU and TPU accelerators through the JAX machine learning platform.
Submitted 26 November, 2024; v1 submitted 12 March, 2024;
originally announced March 2024.
-
A Survey of Deep Learning and Foundation Models for Time Series Forecasting
Authors:
John A. Miller,
Mohammed Aldosari,
Farah Saeed,
Nasid Habib Barna,
Subas Rana,
I. Budak Arpinar,
Ninghao Liu
Abstract:
Deep Learning has been successfully applied to many application domains, yet its advantages have been slow to emerge for time series forecasting. For example, in the well-known Makridakis (M) Competitions, hybrids of traditional statistical or machine learning techniques have only recently become the top performers. With the recent architectural advances in deep learning being applied to time series forecasting (e.g., encoder-decoders with attention, transformers, and graph neural networks), deep learning has begun to show significant advantages. Still, in the area of pandemic prediction, there remain challenges for deep learning models: time series that are too short for effective training, unawareness of accumulated scientific knowledge, and interpretability of the model. To this end, the development of foundation models (large deep learning models with extensive pre-training) allows models to understand patterns and acquire knowledge that can be applied to new related problems before extensive training data becomes available. Furthermore, there is a vast amount of knowledge available that deep learning models can tap into, including Knowledge Graphs and Large Language Models fine-tuned with scientific domain knowledge. There is ongoing research examining how to utilize or inject such knowledge into deep learning models. In this survey, several state-of-the-art modeling techniques are reviewed, and suggestions for further work are provided.
Submitted 24 January, 2024;
originally announced January 2024.
-
Data-CASE: Grounding Data Regulations for Compliant Data Processing Systems
Authors:
Vishal Chakraborty,
Stacy Ann-Elvy,
Sharad Mehrotra,
Faisal Nawab,
Mohammad Sadoghi,
Shantanu Sharma,
Nalini Venkatsubhramanian,
Farhan Saeed
Abstract:
Data regulations, such as GDPR, are increasingly being adopted globally to protect against unsafe data management practices. Such regulations are often ambiguous (with multiple valid interpretations) when it comes to defining the expected dynamic behavior of data processing systems. This paper argues that it is possible to represent regulations such as GDPR formally as invariants using a small set of data processing concepts that capture system behavior. When such concepts are grounded, i.e., they are provided with a single unambiguous interpretation, systems can achieve compliance by demonstrating that the system-actions they implement maintain the invariants (representing the regulations). To illustrate our vision, we propose Data-CASE, a simple yet powerful model that (a) captures key data processing concepts and (b) provides a set of invariants that describe regulations in terms of these concepts. We further illustrate the concept of grounding using "deletion" as an example and highlight several ways in which end-users, companies, and software designers/engineers can use Data-CASE.
Submitted 14 August, 2023;
originally announced August 2023.
-
Sequential Monte Carlo Learning for Time Series Structure Discovery
Authors:
Feras A. Saad,
Brian J. Patton,
Matthew D. Hoffman,
Rif A. Saurous,
Vikash K. Mansinghka
Abstract:
This paper presents a new approach to automatically discovering accurate models of complex time series data. Working within a Bayesian nonparametric prior over a symbolic space of Gaussian process time series models, we present a novel structure learning algorithm that integrates sequential Monte Carlo (SMC) and involutive MCMC for highly effective posterior inference. Our method can be used both in "online" settings, where new data is incorporated sequentially in time, and in "offline" settings, by using nested subsets of historical data to anneal the posterior. Empirical measurements on real-world time series show that our method can deliver 10x--100x runtime speedups over previous MCMC and greedy-search structure learning algorithms targeting the same model family. We use our method to perform the first large-scale evaluation of Gaussian process time series structure learning on a prominent benchmark of 1,428 econometric datasets. The results show that our method discovers sensible models that deliver more accurate point forecasts and interval forecasts over multiple horizons as compared to widely used statistical and neural baselines that struggle on this challenging data.
Submitted 13 July, 2023;
originally announced July 2023.
-
Estimators of Entropy and Information via Inference in Probabilistic Models
Authors:
Feras A. Saad,
Marco Cusumano-Towner,
Vikash K. Mansinghka
Abstract:
Estimating information-theoretic quantities such as entropy and mutual information is central to many problems in statistics and machine learning, but challenging in high dimensions. This paper presents estimators of entropy via inference (EEVI), which deliver upper and lower bounds on many information quantities for arbitrary variables in a probabilistic generative model. These estimators use importance sampling with proposal distribution families that include amortized variational inference and sequential Monte Carlo, which can be tailored to the target model and used to squeeze true information values with high accuracy. We present several theoretical properties of EEVI and demonstrate scalability and efficacy on two problems from the medical domain: (i) in an expert system for diagnosing liver disorders, we rank medical tests according to how informative they are about latent diseases, given a pattern of observed symptoms and patient attributes; and (ii) in a differential equation model of carbohydrate metabolism, we find optimal times to take blood glucose measurements that maximize information about a diabetic patient's insulin sensitivity, given their meal and medication schedule.
Submitted 12 December, 2022; v1 submitted 24 February, 2022;
originally announced February 2022.
-
A Case Study on the Independence of Speech Emotion Recognition in Bangla and English Languages using Language-Independent Prosodic Features
Authors:
Fardin Saad,
Hasan Mahmud,
Mohammad Ridwan Kabir,
Md. Alamin Shaheen,
Paresha Farastu,
Md. Kamrul Hasan
Abstract:
A language-agnostic approach to recognizing emotions from speech remains an incomplete and challenging task. In this paper, we performed a step-by-step comparative analysis of Speech Emotion Recognition (SER) using Bangla and English languages to assess whether distinguishing emotions from speech is independent of language. Six emotions were categorized for this study: happy, angry, neutral, sad, disgust, and fear. We employed three Emotional Speech Sets (ESS), of which the first two were developed by native Bengali speakers in Bangla and English languages separately. The third was a subset of the Toronto Emotional Speech Set (TESS), which was developed by native English speakers from Canada. We carefully selected language-independent prosodic features, adopted a Support Vector Machine (SVM) model, and conducted three experiments to test our proposition. In the first experiment, we measured the performance of the three speech sets individually, followed by the second experiment, where different ESS pairs were integrated to analyze the impact on SER. Finally, we measured the recognition rate by training and testing the model with different speech sets in the third experiment. Although this study reveals that SER in Bangla and English languages is mostly language-independent, some disparities were observed while recognizing emotional states like disgust and fear in these two languages. Moreover, our investigations revealed that non-native speakers convey emotions through speech, much like expressing themselves in their native tongue.
Submitted 13 May, 2022; v1 submitted 21 November, 2021;
originally announced November 2021.
-
Designing the Architecture of a Convolutional Neural Network Automatically for Diabetic Retinopathy Diagnosis
Authors:
Fahman Saeed,
Muhammad Hussain,
Hatim A Aboalsamh,
Fadwa Al Adel,
Adi Mohammed Al Owaifeer
Abstract:
The prevalence of diabetic retinopathy (DR) has reached 34.6% worldwide and is a major cause of blindness among middle-aged diabetic patients. Regular DR screening using fundus photography helps detect its complications and prevent its progression to advanced levels. As manual screening is time-consuming and subjective, machine learning (ML) and deep learning (DL) have been employed to aid graders. However, the existing CNN-based methods use either pre-trained CNN models or a brute force approach to design new CNN models, which are not customized to the complexity of fundus images. To overcome this issue, we introduce an approach for custom design of CNN models, whose architectures are adapted to the structural patterns of fundus images and better represent the DR-relevant features. It leverages k-medoid clustering, principal component analysis (PCA), and inter-class and intra-class variations to automatically determine the depth and width of a CNN model. The designed models are lightweight, adapted to the internal structures of fundus images, and encode the discriminative patterns of DR lesions. The technique is validated on a local dataset from King Saud University Medical City, Saudi Arabia, and two challenging benchmark datasets from Kaggle: EyePACS and APTOS2019. The custom-designed models outperform well-known pre-trained CNN models such as ResNet152, DenseNet121, and ResNeSt50 with a significant decrease in the number of parameters and compete well with the state-of-the-art CNN-based DR screening methods. The proposed approach is helpful for DR screening under diverse clinical settings and for referring patients who may need further assessment and treatment to expert ophthalmologists.
Submitted 7 November, 2022; v1 submitted 7 October, 2021;
originally announced October 2021.
-
Hierarchical Infinite Relational Model
Authors:
Feras A. Saad,
Vikash K. Mansinghka
Abstract:
This paper describes the hierarchical infinite relational model (HIRM), a new probabilistic generative model for noisy, sparse, and heterogeneous relational data. Given a set of relations defined over a collection of domains, the model first infers multiple non-overlapping clusters of relations using a top-level Chinese restaurant process. Within each cluster of relations, a Dirichlet process mixture is then used to partition the domain entities and model the probability distribution of relation values. The HIRM generalizes the standard infinite relational model and can be used for a variety of data analysis tasks including dependence detection, clustering, and density estimation. We present new algorithms for fully Bayesian posterior inference via Gibbs sampling. We illustrate the efficacy of the method on a density estimation benchmark of twenty object-attribute datasets with up to 18 million cells and use it to discover relational structure in real-world datasets from politics and genomics.
Submitted 16 August, 2021;
originally announced August 2021.
-
Communication-avoiding micro-architecture to compute Xcorr scores for peptide identification
Authors:
Sumesh Kumar,
Fahad Saeed
Abstract:
Database algorithms play a crucial part in systems biology studies by identifying proteins from mass spectrometry data. Many of these database search algorithms incur huge computational costs by computing similarity scores for each pair of sparse experimental spectrum and candidate theoretical spectrum vectors. Modern MS instrumentation techniques, which are capable of generating high-resolution spectrometry data, require comparison against an enormous search space, further emphasizing the need for efficient accelerators. Recent research has shown that the overall cost of scoring and deducing peptides is dominated by the communication costs between different hierarchies of memory and processing units. However, these communication costs are seldom considered in accelerator-based architectures, leading to inefficient DRAM accesses and poor data utilization due to irregular memory access patterns. In this paper, we propose a novel communication-avoiding micro-architecture to compute cross-correlation based similarity scores by utilizing an efficient local cache and peptide pre-fetching to minimize DRAM accesses, and a custom-designed peptide broadcast bus to allow input reuse. An efficient bus arbitration scheme was designed and implemented to minimize synchronization cost and exploit parallelism of processing elements. Our simulation results show that the proposed micro-architecture performs on average 24x better than a CPU implementation running on a 3.6 GHz Intel i7-4970 processor with 16GB memory.
Submitted 5 August, 2021; v1 submitted 31 July, 2021;
originally announced August 2021.
-
Weight Initialization Techniques for Deep Learning Algorithms in Remote Sensing: Recent Trends and Future Perspectives
Authors:
Wadii Boulila,
Maha Driss,
Mohamed Al-Sarem,
Faisal Saeed,
Moez Krichen
Abstract:
During the last decade, several research works have focused on providing novel deep learning methods in many application fields. However, few of them have investigated the weight initialization process for deep learning, although it has been shown to be important for improving deep learning performance. This can be explained by the technical difficulties in proposing new techniques for this promising research field. In this paper, a survey of weight initialization techniques for deep learning algorithms in remote sensing is conducted. This survey will help practitioners to drive further research in this promising field. To the best of our knowledge, this paper constitutes the first survey focusing on weight initialization for deep learning models.
Submitted 13 February, 2021;
originally announced February 2021.
-
HiCOPS: High Performance Computing Framework for Tera-Scale Database Search of Mass Spectrometry based Omics Data
Authors:
Muhammad Haseeb,
Fahad Saeed
Abstract:
Database-search algorithms that deduce peptides from Mass Spectrometry (MS) data have tried to improve computational efficiency in order to accomplish larger and more complex systems biology studies. Existing serial and high-performance computing (HPC) search engines, otherwise highly successful, are known to scale poorly with the increasing size of the theoretical search space needed for the increased complexity of modern non-model, multi-species MS-based omics analysis. Consequently, the bottleneck for computational techniques is the communication cost of moving data between levels of the memory hierarchy or between processing units, and not the arithmetic operations. This post-Moore change in architecture and the demands of modern systems biology experiments have dampened the overall effectiveness of the existing HPC workflows. We present a novel, efficient parallel computational method, and its implementation on memory-distributed architectures for a peptide identification tool called HiCOPS, that enables more than 100-fold improvement in speed over most existing HPC proteome database search tools. HiCOPS empowers the supercomputing database search concept for comprehensive identification of peptides and all their modified forms within a reasonable time frame. We demonstrate this by searching Gigabytes of experimental MS data against Terabytes of databases, where HiCOPS completes peptide identification in a few minutes using 72 parallel nodes (1728 cores) compared to several weeks required by existing state-of-the-art tools using 1 node (24 cores); 100 minutes vs 5 weeks; 500x speedup. Finally, we formulate a theoretical framework for our overhead-avoiding strategy and report superior performance evaluation results for key metrics including execution time, CPU utilization, speedups, and I/O efficiency. The software will be made available at: hicops.github.io
Submitted 5 February, 2021; v1 submitted 3 February, 2021;
originally announced February 2021.
-
SPPL: Probabilistic Programming with Fast Exact Symbolic Inference
Authors:
Feras A. Saad,
Martin C. Rinard,
Vikash K. Mansinghka
Abstract:
We present the Sum-Product Probabilistic Language (SPPL), a new probabilistic programming language that automatically delivers exact solutions to a broad range of probabilistic inference queries. SPPL translates probabilistic programs into sum-product expressions, a new symbolic representation and associated semantic domain that extends standard sum-product networks to support mixed-type distributions, numeric transformations, logical formulas, and pointwise and set-valued constraints. We formalize SPPL via a novel translation strategy from probabilistic programs to sum-product expressions and give sound exact algorithms for conditioning on and computing probabilities of events. SPPL imposes a collection of restrictions on probabilistic programs to ensure they can be translated into sum-product expressions, which allow the system to leverage new techniques for improving the scalability of translation and inference by automatically exploiting probabilistic structure. We implement a prototype of SPPL with a modular architecture and evaluate it on benchmarks the system targets, showing that it obtains up to 3500x speedups over state-of-the-art symbolic systems on tasks such as verifying the fairness of decision tree classifiers, smoothing hidden Markov models, conditioning transformed random variables, and computing rare event probabilities.
Submitted 11 June, 2021; v1 submitted 7 October, 2020;
originally announced October 2020.
-
Communication Lower-Bounds for Distributed-Memory Computations for Mass Spectrometry based Omics Data
Authors:
Fahad Saeed,
Muhammad Haseeb,
SS Iyengar
Abstract:
Mass spectrometry (MS) based omics data analysis requires significant time and resources. To date, few parallel algorithms have been proposed for deducing peptides from mass spectrometry-based data. However, these parallel algorithms were designed and developed when the amount of data that needed to be processed was smaller in scale. In this paper, we prove that the communication bound that is reached by the existing parallel algorithms is $\Omega(mn+2r\frac{q}{p})$, where $m$ and $n$ are the dimensions of the theoretical database matrix, $q$ and $r$ are dimensions of spectra, and $p$ is the number of processors. We further prove that a communication-optimal strategy with fast-memory $\sqrt{M} = mn + \frac{2qr}{p}$ can achieve $\Omega(\frac{2mnq}{p})$, but this bound is not achieved by any existing parallel proteomics algorithm to date. To validate our claim, we performed a meta-analysis of published parallel algorithms and their performance results. We show that sub-optimal speedups with an increasing number of processors are a direct consequence of not achieving the communication lower bounds. We further validate our claim by performing experiments which demonstrate the communication bounds that are proved in this paper. Consequently, we assert that a next generation of provable and demonstrably superior parallel algorithms is urgently needed for MS based large systems-biology studies, especially for meta-proteomics, proteogenomics, microbiome, and proteomics for non-model organisms. Our hope is that this paper will excite the parallel computing community to further investigate parallel algorithms for highly influential MS based omics problems.
Submitted 11 August, 2021; v1 submitted 29 September, 2020;
originally announced September 2020.
-
Exploration of Interpretability Techniques for Deep COVID-19 Classification using Chest X-ray Images
Authors:
Soumick Chatterjee,
Fatima Saad,
Chompunuch Sarasaen,
Suhita Ghosh,
Valerie Krug,
Rupali Khatun,
Rahul Mishra,
Nirja Desai,
Petia Radeva,
Georg Rose,
Sebastian Stober,
Oliver Speck,
Andreas Nürnberger
Abstract:
The outbreak of COVID-19 has shocked the entire world with its fairly rapid spread and has challenged different sectors. One of the most effective ways to limit its spread is the early and accurate diagnosis of infected patients. Medical imaging, such as X-ray and Computed Tomography (CT), combined with the potential of Artificial Intelligence (AI), plays an essential role in supporting medical personnel in the diagnosis process. Thus, in this article, five different deep learning models (ResNet18, ResNet34, InceptionV3, InceptionResNetV2 and DenseNet161) and their ensemble, using majority voting, have been used to classify COVID-19, pneumoniæ and healthy subjects using chest X-ray images. Multilabel classification was performed to predict multiple pathologies for each patient, if present. Firstly, the interpretability of each of the networks was thoroughly studied using local interpretability methods (occlusion, saliency, input X gradient, guided backpropagation, integrated gradients, and DeepLIFT) and using a global technique (neuron activation profiles). The mean Micro-F1 score of the models for COVID-19 classifications ranges from 0.66 to 0.875, and is 0.89 for the ensemble of the network models. The qualitative results showed that the ResNets were the most interpretable models. This research demonstrates the importance of using interpretability methods to compare different models before making a decision regarding the best performing model.
Submitted 24 January, 2024; v1 submitted 3 June, 2020;
originally announced June 2020.
-
The Fast Loaded Dice Roller: A Near-Optimal Exact Sampler for Discrete Probability Distributions
Authors:
Feras A. Saad,
Cameron E. Freer,
Martin C. Rinard,
Vikash K. Mansinghka
Abstract:
This paper introduces a new algorithm for the fundamental problem of generating a random integer from a discrete probability distribution using a source of independent and unbiased random coin flips. We prove that this algorithm, which we call the Fast Loaded Dice Roller (FLDR), is highly efficient in both space and time: (i) the size of the sampler is guaranteed to be linear in the number of bits needed to encode the input distribution; and (ii) the expected number of bits of entropy it consumes per sample is at most 6 bits more than the information-theoretically optimal rate. We present fast implementations of the linear-time preprocessing and near-optimal sampling algorithms using unsigned integer arithmetic. Empirical evaluations on a broad set of probability distributions establish that FLDR is 2x-10x faster in both preprocessing and sampling than multiple baseline algorithms, including the widely-used alias and interval samplers. It also uses up to 10000x less space than the information-theoretically optimal sampler, at the expense of less than 1.5x runtime overhead.
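To make the problem setting concrete, the following is a minimal Python sketch of exact sampling from integer weights using only fair coin flips. It is a plain rejection sampler, not the FLDR algorithm described in the abstract, and it makes no claim about matching FLDR's entropy or space guarantees. The function name `sample_with_coin_flips` and the `flip` callback are hypothetical names chosen for illustration.

```python
import random
from typing import Callable, Sequence


def sample_with_coin_flips(weights: Sequence[int], flip: Callable[[], int]) -> int:
    """Return index i with probability weights[i] / sum(weights).

    Illustrative rejection sampler (an assumption of this sketch, not FLDR):
    draw a k-bit integer u from fair coin flips, reject it if u >= total,
    otherwise return the index whose weight interval contains u.
    """
    total = sum(weights)
    if total <= 0 or any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative integers with a positive sum")
    # Smallest k >= 1 such that 2**k >= total.
    k = max(1, (total - 1).bit_length())
    while True:
        # Assemble a uniform integer in [0, 2**k) from k coin flips.
        u = 0
        for _ in range(k):
            u = (u << 1) | flip()
        if u >= total:
            continue  # rejected: flip k fresh coins and try again
        # Walk the cumulative weights to find the interval containing u.
        for i, w in enumerate(weights):
            if u < w:
                return i
            u -= w


if __name__ == "__main__":
    rng = random.Random(0)
    counts = [0, 0, 0]
    for _ in range(10_000):
        counts[sample_with_coin_flips([1, 2, 1], lambda: rng.getrandbits(1))] += 1
    print(counts)  # roughly proportional to 1 : 2 : 1
```

Because a rejected draw discards all `k` flips, the expected entropy cost of this sketch can be well above the optimal rate, which is the kind of overhead the abstract's algorithm is designed to bound.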
Submitted 1 June, 2020; v1 submitted 8 March, 2020;
originally announced March 2020.
-
Explainable and Scalable Machine-Learning Algorithms for Detection of Autism Spectrum Disorder using fMRI Data
Authors:
Taban Eslami,
Joseph S. Raiker,
Fahad Saeed
Abstract:
Diagnosing Autism Spectrum Disorder (ASD) is a challenging problem: it is based purely on behavioral descriptions of symptomology (DSM-5/ICD-10) and requires informants to observe children with the disorder across different settings (e.g. home, school). Numerous limitations (e.g., informant discrepancies, lack of adherence to assessment guidelines, informant biases) to current diagnostic practices have the potential to result in over-, under-, or misdiagnosis of the disorder. Advances in neuroimaging technologies are providing a critical step towards a more objective assessment of the disorder. Prior research provides strong evidence that structural and functional magnetic resonance imaging (MRI) data collected from individuals with ASD exhibit distinguishing characteristics that differ in local and global spatial and temporal neural patterns of the brain. Our proposed deep-learning model ASD-DiagNet exhibits consistently high accuracy for classification of ASD brain scans from neurotypical scans. We have for the first time integrated traditional machine-learning and deep-learning techniques that allow us to isolate ASD biomarkers from MRI data sets. Our method, called Auto-ASD-Network, uses a combination of deep-learning and Support Vector Machines (SVM) to classify ASD scans from neurotypical scans. Such interpretable models would help explain the decisions made by deep-learning techniques, leading to knowledge discovery for neuroscientists and transparent analysis for clinicians.
Submitted 2 March, 2020;
originally announced March 2020.
-
Optimal Approximate Sampling from Discrete Probability Distributions
Authors:
Feras A. Saad,
Cameron E. Freer,
Martin C. Rinard,
Vikash K. Mansinghka
Abstract:
This paper addresses a fundamental problem in random variate generation: given access to a random source that emits a stream of independent fair bits, what is the most accurate and entropy-efficient algorithm for sampling from a discrete probability distribution $(p_1, \dots, p_n)$, where the probabilities of the output distribution $(\hat{p}_1, \dots, \hat{p}_n)$ of the sampling algorithm must be specified using at most $k$ bits of precision? We present a theoretical framework for formulating this problem and provide new techniques for finding sampling algorithms that are optimal both statistically (in the sense of sampling accuracy) and information-theoretically (in the sense of entropy consumption). We leverage these results to build a system that, for a broad family of measures of statistical accuracy, delivers a sampling algorithm whose expected entropy usage is minimal among those that induce the same distribution (i.e., is "entropy-optimal") and whose output distribution $(\hat{p}_1, \dots, \hat{p}_n)$ is a closest approximation to the target distribution $(p_1, \dots, p_n)$ among all entropy-optimal sampling algorithms that operate within the specified $k$-bit precision. This optimal approximate sampler is also a closer approximation than any (possibly entropy-suboptimal) sampler that consumes a bounded amount of entropy with the specified precision, a class which includes floating-point implementations of inversion sampling and related methods found in many software libraries. We evaluate the accuracy, entropy consumption, precision requirements, and wall-clock runtime of our optimal approximate sampling algorithms on a broad set of distributions, demonstrating the ways that they are superior to existing approximate samplers and establishing that they often consume significantly fewer resources than are needed by exact samplers.
Submitted 13 January, 2020;
originally announced January 2020.
-
Bayesian Synthesis of Probabilistic Programs for Automatic Data Modeling
Authors:
Feras A. Saad,
Marco F. Cusumano-Towner,
Ulrich Schaechtle,
Martin C. Rinard,
Vikash K. Mansinghka
Abstract:
We present new techniques for automatically constructing probabilistic programs for data analysis, interpretation, and prediction. These techniques work with probabilistic domain-specific data modeling languages that capture key properties of a broad class of data generating processes, using Bayesian inference to synthesize probabilistic programs in these modeling languages given observed data. We provide a precise formulation of Bayesian synthesis for automatic data modeling that identifies sufficient conditions for the resulting synthesis procedure to be sound. We also derive a general class of synthesis algorithms for domain-specific languages specified by probabilistic context-free grammars and establish the soundness of our approach for these languages. We apply the techniques to automatically synthesize probabilistic programs for time series data and multivariate tabular data. First, we show how to analyze the structure of the synthesized programs to compute, for key qualitative properties of interest, the probability that the underlying data generating process exhibits each of these properties. Second, we translate probabilistic programs in the domain-specific language into probabilistic programs in Venture, a general-purpose probabilistic programming system. The translated Venture programs are then executed to obtain predictions of new time series data and new multivariate data records. Experimental results show that our techniques can accurately infer qualitative structure in multiple real-world data sets and outperform standard data analysis methods in forecasting and predicting new data.
Submitted 14 July, 2019;
originally announced July 2019.
-
ASD-DiagNet: A hybrid learning approach for detection of Autism Spectrum Disorder using fMRI data
Authors:
Taban Eslami,
Vahid Mirjalili,
Alvis Fong,
Angela Laird,
Fahad Saeed
Abstract:
Mental disorders such as Autism Spectrum Disorders (ASD) are heterogeneous disorders that are notoriously difficult to diagnose, especially in children. The current psychiatric diagnostic process is based purely on the behavioural observation of symptomology (DSM-5/ICD-10) and may be prone to over-prescribing of drugs due to misdiagnosis. In order to move the field in a more quantitative direction, we need advanced and scalable machine learning infrastructure that will allow us to identify reliable biomarkers of mental health disorders. In this paper, we propose a framework called ASD-DiagNet for classifying subjects with ASD from healthy subjects by using only fMRI data. We designed and implemented a joint learning procedure using an autoencoder and a single layer perceptron, which results in improved quality of extracted features and optimized parameters for the model. Further, we designed and implemented a data augmentation strategy, based on linear interpolation on available feature vectors, that allows us to produce the synthetic datasets needed for training machine learning models. The proposed approach is evaluated on a public dataset provided by the Autism Brain Imaging Data Exchange, which includes 1035 subjects from 17 different brain imaging centers. Our machine learning model outperforms other state-of-the-art methods on data from 13 imaging centers, with an increase in classification accuracy of up to 20% and a maximum accuracy of 80%. The machine learning technique presented in this paper, in addition to yielding better quality, gives enormous advantages in terms of execution time (40 minutes vs. 6 hours on other methods). The implemented code is available under the GPL license on our lab's GitHub page (https://github.com/pcdslab/ASD-DiagNet).
Submitted 16 April, 2019;
originally announced April 2019.
-
A Family of Exact Goodness-of-Fit Tests for High-Dimensional Discrete Distributions
Authors:
Feras A. Saad,
Cameron E. Freer,
Nathanael L. Ackerman,
Vikash K. Mansinghka
Abstract:
The objective of goodness-of-fit testing is to assess whether a dataset of observations is likely to have been drawn from a candidate probability distribution. This paper presents a rank-based family of goodness-of-fit tests that is specialized to discrete distributions on high-dimensional domains. The test is readily implemented using a simulation-based, linear-time procedure. The testing procedure can be customized by the practitioner using knowledge of the underlying data domain. Unlike most existing test statistics, the proposed test statistic is distribution-free and its exact (non-asymptotic) sampling distribution is known in closed form. We establish consistency of the test against all alternatives by showing that the test statistic is distributed as a discrete uniform if and only if the samples were drawn from the candidate distribution. We illustrate its efficacy for assessing the sample quality of approximate sampling algorithms over combinatorially large spaces with intractable probabilities, including random partitions in Dirichlet process mixture models and random lattices in Ising models.
Submitted 26 February, 2019;
originally announced February 2019.
-
Temporally-Reweighted Chinese Restaurant Process Mixtures for Clustering, Imputing, and Forecasting Multivariate Time Series
Authors:
Feras A. Saad,
Vikash K. Mansinghka
Abstract:
This article proposes a Bayesian nonparametric method for forecasting, imputation, and clustering in sparsely observed, multivariate time series data. The method is appropriate for jointly modeling hundreds of time series with widely varying, non-stationary dynamics. Given a collection of $N$ time series, the Bayesian model first partitions them into independent clusters using a Chinese restaurant process prior. Within a cluster, all time series are modeled jointly using a novel "temporally-reweighted" extension of the Chinese restaurant process mixture. Markov chain Monte Carlo techniques are used to obtain samples from the posterior distribution, which are then used to form predictive inferences. We apply the technique to challenging forecasting and imputation tasks using seasonal flu data from the US Center for Disease Control and Prevention, demonstrating superior forecasting accuracy and competitive imputation accuracy as compared to multiple widely used baselines. We further show that the model discovers interpretable clusters in datasets with hundreds of time series, using macroeconomic data from the Gapminder Foundation.
Submitted 1 April, 2018; v1 submitted 18 October, 2017;
originally announced October 2017.
-
Probabilistic Search for Structured Data via Probabilistic Programming and Nonparametric Bayes
Authors:
Feras Saad,
Leonardo Casarsa,
Vikash Mansinghka
Abstract:
Databases are widespread, yet extracting relevant data can be difficult. Without substantial domain knowledge, multivariate search queries often return sparse or uninformative results. This paper introduces an approach for searching structured data based on probabilistic programming and nonparametric Bayes. Users specify queries in a probabilistic language that combines standard SQL database search operators with an information theoretic ranking function called predictive relevance. Predictive relevance can be calculated by a fast sparse matrix algorithm based on posterior samples from CrossCat, a nonparametric Bayesian model for high-dimensional, heterogeneously-typed data tables. The result is a flexible search technique that applies to a broad class of information retrieval problems, which we integrate into BayesDB, a probabilistic programming platform for probabilistic data analysis. This paper demonstrates applications to databases of US colleges, global macroeconomic indicators of public health, and classic cars. We found that human evaluators often prefer the results from probabilistic search to results from a standard baseline.
Submitted 4 April, 2017;
originally announced April 2017.
-
Detecting Dependencies in Sparse, Multivariate Databases Using Probabilistic Programming and Non-parametric Bayes
Authors:
Feras Saad,
Vikash Mansinghka
Abstract:
Datasets with hundreds of variables and many missing values are commonplace. In this setting, it is both statistically and computationally challenging to detect true predictive relationships between variables and also to suppress false positives. This paper proposes an approach that combines probabilistic programming, information theory, and non-parametric Bayes. It shows how to use Bayesian non-parametric modeling to (i) build an ensemble of joint probability models for all the variables; (ii) efficiently detect marginal independencies; and (iii) estimate the conditional mutual information between arbitrary subsets of variables, subject to a broad class of constraints. Users can access these capabilities using BayesDB, a probabilistic programming platform for probabilistic data analysis, by writing queries in a simple, SQL-like language. This paper demonstrates empirically that the method can (i) detect context-specific (in)dependencies on challenging synthetic problems and (ii) yield improved sensitivity and specificity over baselines from statistics and machine learning, on a real-world database of over 300 sparsely observed indicators of macroeconomic development and public health.
Submitted 26 March, 2017; v1 submitted 5 November, 2016;
originally announced November 2016.
-
Probabilistic Data Analysis with Probabilistic Programming
Authors:
Feras Saad,
Vikash Mansinghka
Abstract:
Probabilistic techniques are central to data analysis, but different approaches can be difficult to apply, combine, and compare. This paper introduces composable generative population models (CGPMs), a computational abstraction that extends directed graphical models and can be used to describe and compose a broad class of probabilistic data analysis techniques. Examples include hierarchical Bayesian models, multivariate kernel methods, discriminative machine learning, clustering algorithms, dimensionality reduction, and arbitrary probabilistic programs. We also demonstrate the integration of CGPMs into BayesDB, a probabilistic programming platform that can express data analysis tasks using a modeling language and a structured query language. The practical value is illustrated in two ways. First, CGPMs are used in an analysis that identifies satellite data records which probably violate Kepler's Third Law, by composing causal probabilistic programs with non-parametric Bayes in under 50 lines of probabilistic code. Second, for several representative data analysis tasks, we report on lines of code and accuracy measurements of various CGPMs, plus comparisons with standard baseline solutions from Python and MATLAB libraries.
Submitted 18 August, 2016;
originally announced August 2016.
-
An Efficient Algorithm for Clustering of Large-Scale Mass Spectrometry Data
Authors:
Fahad Saeed,
Trairak Pisitkun,
Mark A. Knepper,
Jason D. Hoffert
Abstract:
High-throughput spectrometers are capable of producing data sets containing thousands of spectra for a single biological sample. These data sets contain a substantial amount of redundancy from peptides that may get selected multiple times in an LC-MS/MS experiment. In this paper, we present an efficient algorithm, CAMS (Clustering Algorithm for Mass Spectra), for clustering mass spectrometry data, which increases both the sensitivity and confidence of spectral assignment. CAMS utilizes a novel metric, called F-set, that allows accurate identification of spectra that are similar. A graph-theoretic framework is defined that allows efficient use of the F-set metric for accurate cluster identification. The accuracy of the algorithm is tested on real HCD and CID data sets with varying amounts of peptides. Our experiments show that the proposed algorithm is able to cluster spectra with very high accuracy in a reasonable amount of time for large spectral data sets. Thus, the algorithm is able to decrease the computational time by compressing the data sets while increasing the throughput of the data by interpreting low S/N spectra.
Submitted 4 January, 2013;
originally announced January 2013.
-
Mining Temporal Patterns from iTRAQ Mass Spectrometry (LC-MS/MS) Data
Authors:
Fahad Saeed,
Trairak Pisitkun,
Mark A. Knepper,
Jason D. Hoffert
Abstract:
Large-scale proteomic analysis is emerging as a powerful technique in biology and relies heavily on data acquired by state-of-the-art mass spectrometers. As with any other field in Systems Biology, computational tools are required to deal with this ocean of data. iTRAQ (isobaric Tags for Relative and Absolute Quantification) is a technique that allows simultaneous quantification of proteins from multiple samples. Although iTRAQ data gives useful insights to the biologist, its multiplexed design makes it more complex to analyze and to draw biological conclusions from. One such problem is to find proteins that behave in a similar way (i.e. change in abundance) among various time points, since the temporal variations in the proteomics data reveal important biological information. Distance-based methods such as Euclidean distance or the Pearson coefficient, and clustering techniques such as k-means, are not able to take into account the temporal information of the series. In this paper, we present a linear-time algorithm for clustering similar patterns among various iTRAQ time course data irrespective of their absolute values. The algorithm, referred to as Temporal Pattern Mining (TPM), maps the data from a Cartesian plane to a discrete binary plane. After the mapping, a dynamic programming technique allows mining of similar data elements that are temporally closer to each other. The proposed algorithm clusters iTRAQ data that are temporally closer to each other with more than 99% accuracy. Experimental results for different problem sizes are analyzed in terms of quality of clusters, execution time and scalability for large data sets. An example from our proteomics data is provided at the end to demonstrate the performance of the algorithm and its ability to cluster temporal series irrespective of their distance from each other.
Submitted 28 April, 2011;
originally announced April 2011.
-
High Transmission Bit Rate of A thermal Arrayed Waveguide Grating (AWG) Module in Passive Optical Networks
Authors:
Abd El Naser A. Mohammed,
Ahmed Nabih Zaki Rashed,
Gaber E. S. M. El Abyad,
Abd El Fattah A. Saad
Abstract:
In the present paper, the high transmission bit rate of an athermal arrayed waveguide grating (AWG), which is composed of lithium niobate (LiNbO3)/polymethyl methacrylate (PMMA) hybrid materials on a silicon substrate, in Passive Optical Networks (PONs) has been parametrically analyzed and investigated over a wide range of the affecting parameters. We have theoretically investigated how the temperature-dependent wavelength shift of the AWG depends on the refractive indices of the materials and the size of the waveguide. Athermalization of the AWG can be realized by selecting proper values of the material and structural parameters of the device. Moreover, we have analyzed the data transmission bit rate of an athermal AWG in passive optical networks (PONs) based on the Maximum Time Division Multiplexing (MTDM) technique.
Submitted 19 June, 2009;
originally announced June 2009.
-
A Domain Decomposition Strategy for Alignment of Multiple Biological Sequences on Multiprocessor Platforms
Authors:
Fahad Saeed,
Ashfaq Khokhar
Abstract:
Multiple Sequences Alignment (MSA) of biological sequences is a fundamental problem in computational biology due to its critical significance in wide-ranging applications including haplotype reconstruction, sequence homology, phylogenetic analysis, and prediction of evolutionary origins. The MSA problem is considered NP-hard, and known heuristics for the problem do not scale well with an increasing number of sequences. On the other hand, with the advent of a new breed of fast sequencing techniques, it is now possible to generate thousands of sequences very quickly. For rapid sequence analysis, it is therefore desirable to develop fast MSA algorithms that scale well with the increase in the dataset size. In this paper, we present a novel domain decomposition based technique to solve the MSA problem on multiprocessing platforms. The domain decomposition based technique, in addition to yielding better quality, gives an enormous advantage in terms of execution time and memory requirements. The proposed strategy makes it possible to decrease the time complexity of any known heuristic of O(N)^x complexity by a factor of O(1/p)^x, where N is the number of sequences, x depends on the underlying heuristic approach, and p is the number of processing nodes. In particular, we propose a highly scalable algorithm, Sample-Align-D, for aligning biological sequences using the MUSCLE system as the underlying heuristic. The proposed algorithm has been implemented on a cluster of workstations using the MPI library. Experimental results for different problem sizes are analyzed in terms of quality of alignment, execution time and speed-up.
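As a rough reading of the stated complexity reduction, assume (this is an assumption, not something the abstract specifies) that the $N$ sequences are split evenly so that each of the $p$ nodes runs the underlying heuristic on $N/p$ sequences, and ignore partitioning and communication costs:

$$T_{\mathrm{seq}} = O(N^x), \qquad T_{\mathrm{node}} = O\!\left(\left(\frac{N}{p}\right)^{x}\right) = O\!\left(\frac{N^x}{p^x}\right).$$

Under these assumptions, a heuristic with $x = 2$ on $p = 16$ nodes would see its per-node work shrink by a factor of $16^2 = 256$.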
Submitted 11 May, 2009;
originally announced May 2009.
-
Pyro-Align: Sample-Align based Multiple Alignment system for Pyrosequencing Reads of Large Number
Authors:
Fahad Saeed
Abstract:
Pyro-Align is a multiple alignment program specifically designed for very large numbers of pyrosequencing reads. Multiple sequence alignment has been shown to be NP-hard, and heuristics are designed for approximate solutions. Multiple sequence alignment of pyrosequencing reads is complex mainly because of two factors. The first is the huge number of reads, which makes traditional heuristics, which scale very poorly to large numbers, unsuitable. The second is that the alignment cannot be performed arbitrarily, because the position of the reads with respect to the original genome is important and has to be taken into account. In this report we present a short description of the multiple alignment system for pyrosequencing reads.
Submitted 18 January, 2009;
originally announced January 2009.
-
An Overview of Multiple Sequence Alignment Systems
Authors:
Fahad Saeed,
Ashfaq Khokhar
Abstract:
An overview of current multiple alignment systems is given. The useful algorithms, the procedures adopted, and their limitations are presented. We also present the quality of the alignments obtained and the cases (kinds of alignments, kinds of sequences, etc.) in which the particular systems are useful.
Submitted 18 January, 2009;
originally announced January 2009.
-
Sample-Align-D: A High Performance Multiple Sequence Alignment System using Phylogenetic Sampling and Domain Decomposition
Authors:
Fahad Saeed,
Ashfaq Khokhar
Abstract:
Multiple Sequence Alignment (MSA) is one of the most computationally intensive tasks in computational biology. The best known existing solutions for multiple sequence alignment take several hours (in some cases days) of computation time to align, for example, 2000 homologous sequences of average length 300. Inspired by the Sample Sort approach in parallel processing, in this paper we propose a highly scalable multiprocessor solution for the MSA problem in phylogenetically diverse sequences. Our method employs an intelligent scheme to partition the set of sequences into smaller subsets using a k-mer count-based similarity index, referred to as k-mer rank. Each subset is then independently aligned in parallel using any sequential approach. Further fine-tuning of the local alignments is achieved using constraints derived from a global ancestor of the entire set. The proposed Sample-Align-D algorithm has been implemented on a cluster of workstations using the MPI message-passing library. The accuracy of the proposed solution has been tested on standard benchmarks such as PREFAB. The accuracy of the alignment produced by our method is comparable to that of well-known sequential MSA techniques. We were able to align 2000 randomly selected sequences from the Methanosarcina acetivorans genome in less than 10 minutes using Sample-Align-D on a 16-node cluster, compared to over 23 hours with the sequential MUSCLE system running on a single cluster node.
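The partitioning step described above can be sketched in Python as follows. This is a minimal illustration and not the authors' implementation: the abstract does not define k-mer rank precisely, so the `kmer_rank` function below (fraction of k-mers shared with a reference sequence), the choice of the first sequence as reference, and the names `kmer_counts`, `kmer_rank`, and `partition_by_rank` are all assumptions made for this example. The splitter selection follows the general Sample Sort idea of sorting a sample and cutting it into p ranges.

```python
from collections import Counter
import bisect
import random


def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_rank(seq, reference, k=3):
    """Hypothetical similarity index: fraction of seq's k-mers shared with reference."""
    a, b = kmer_counts(seq, k), kmer_counts(reference, k)
    total = sum(a.values())
    if total == 0:
        return 0.0
    shared = sum(min(count, b[kmer]) for kmer, count in a.items())
    return shared / total


def partition_by_rank(seqs, p, k=3, sample_size=None, seed=0):
    """Split seqs into p buckets by k-mer rank, Sample Sort style."""
    rng = random.Random(seed)
    reference = seqs[0]  # assumption: first sequence serves as the reference
    ranks = [kmer_rank(s, reference, k) for s in seqs]

    # Sort a random sample of ranks and pick p - 1 evenly spaced splitters.
    sample_size = min(len(seqs), sample_size or 4 * p)
    sample = sorted(rng.sample(ranks, sample_size))
    splitters = [sample[(i * sample_size) // p] for i in range(1, p)]

    # Each sequence goes to the bucket whose rank range contains its rank.
    buckets = [[] for _ in range(p)]
    for s, r in zip(seqs, ranks):
        buckets[bisect.bisect_right(splitters, r)].append(s)
    return buckets


if __name__ == "__main__":
    seqs = ["ACGTACGT", "ACGTTCGT", "TTTTGGGG", "GGGGTTTT", "ACGAACGT", "CCCCAAAA"]
    for i, bucket in enumerate(partition_by_rank(seqs, p=2)):
        print(i, bucket)
```

In a parallel setting, each bucket would then be handed to a separate node and aligned with a sequential tool; that step, and the later fine-tuning against a global ancestor, are outside the scope of this sketch.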
Submitted 18 January, 2009;
originally announced January 2009.