-
Information Consistent Pruning: How to Efficiently Search for Sparse Networks?
Authors:
Soheil Gharatappeh,
Salimeh Yasaei Sekeh
Abstract:
Iterative magnitude pruning methods (IMPs), proven to be successful in reducing the number of insignificant nodes in over-parameterized deep neural networks (DNNs), have been getting an enormous amount of attention with the rapid deployment of DNNs into cutting-edge technologies with computation and memory constraints. Despite IMPs' popularity in pruning networks, a fundamental limitation of existing IMP algorithms is the significant training time required for each pruning iteration. Our paper introduces a novel stopping criterion for IMPs that monitors information and gradient flows between network layers and minimizes the training time. Information Consistent Pruning eliminates the need to retrain the network to its original performance during intermediate steps while maintaining overall performance at the end of the pruning process. Through our experiments, we demonstrate that our algorithm is more efficient than current IMPs across multiple dataset-DNN combinations. We also provide theoretical insights into the core idea of our algorithm alongside mathematical explanations of flow-based IMP. Our code is available at https://github.com/Sekeh-Lab/InfCoP.
Submitted 26 January, 2025;
originally announced January 2025.
-
Ghost-Connect Net: A Generalization-Enhanced Guidance For Sparse Deep Networks Under Distribution Shifts
Authors:
Mary Isabelle Wisell,
Salimeh Yasaei Sekeh
Abstract:
Sparse deep neural networks (DNNs) excel in real-world applications like robotics and computer vision by reducing computational demands that hinder usability. Recent studies aim to boost DNN efficiency by trimming redundant neurons or filters based on task relevance, but neglect their adaptability to distribution shifts. We aim to enhance these existing techniques by introducing a companion network, Ghost Connect-Net (GC-Net), to monitor the connections in the original network with distribution generalization advantage. GC-Net's weights represent connectivity measurements between consecutive layers of the original network. After pruning GC-Net, the pruned locations are mapped back to the original network as pruned connections, allowing for the combination of magnitude and connectivity-based pruning methods. Experimental results using common DNN benchmarks, such as CIFAR-10, Fashion MNIST, and Tiny ImageNet, show promising results for hybridizing the method, using GC-Net guidance for later layers of a network and direct pruning on earlier layers. We provide theoretical foundations for GC-Net's approach to improving generalization under distribution shifts.
Submitted 14 November, 2024;
originally announced November 2024.
-
Robust Subgraph Learning by Monitoring Early Training Representations
Authors:
Sepideh Neshatfar,
Salimeh Yasaei Sekeh
Abstract:
Graph neural networks (GNNs) have attracted significant attention for their outstanding performance in graph learning and node classification tasks. However, their vulnerability to adversarial attacks, particularly through susceptible nodes, poses a challenge in decision-making. The need for robust graph summarization is evident in adversarial challenges resulting from the propagation of attacks throughout the entire graph. In this paper, we address both performance and adversarial robustness in graph input by introducing the novel technique SHERD (Subgraph Learning Hale through Early Training Representation Distances). SHERD leverages information from layers of a partially trained graph convolutional network (GCN) to detect susceptible nodes during adversarial attacks using standard distance metrics. The method identifies "vulnerable (bad)" nodes and removes such nodes to form a robust subgraph while maintaining node classification performance. Through our experiments, we demonstrate the increased performance of SHERD in enhancing robustness by comparing the network's performance on original and subgraph inputs against various baselines alongside existing adversarial attacks. Our experiments across multiple datasets, including citation datasets such as Cora, Citeseer, and Pubmed, as well as microanatomical tissue structures of cell graphs in the placenta, highlight that SHERD not only achieves substantial improvement in robust performance but also outperforms several baselines in terms of node classification accuracy and computational complexity.
Submitted 18 November, 2024; v1 submitted 14 March, 2024;
originally announced March 2024.
-
FogGuard: guarding YOLO against fog using perceptual loss
Authors:
Soheil Gharatappeh,
Sepideh Neshatfar,
Salimeh Yasaei Sekeh,
Vikas Dhiman
Abstract:
In this paper, we present FogGuard, a novel fog-aware object detection network designed to address the challenges posed by foggy weather conditions. Autonomous driving systems heavily rely on accurate object detection algorithms, but adverse weather conditions can significantly impact the reliability of deep neural networks (DNNs).
Existing approaches include image enhancement techniques like IA-YOLO and domain adaptation methods. While image enhancement aims to generate clear images from foggy ones, which is more challenging than object detection in foggy images, domain adaptation does not require labeled data in the target domain. Our approach involves fine-tuning on a specific dataset to address these challenges efficiently.
FogGuard compensates for foggy conditions in the scene, ensuring robust performance by incorporating YOLOv3 as the baseline algorithm and introducing a unique Teacher-Student Perceptual loss for accurate object detection in foggy environments. Through comprehensive evaluations on standard datasets like PASCAL VOC and RTTS, our network significantly improves performance, achieving a 69.43% mAP compared to YOLOv3's 57.78% on the RTTS dataset. Additionally, we demonstrate that while our training method slightly increases time complexity, it doesn't add overhead during inference compared to the regular YOLO network.
Submitted 11 October, 2024; v1 submitted 13 March, 2024;
originally announced March 2024.
-
Towards Explaining Deep Neural Network Compression Through a Probabilistic Latent Space
Authors:
Mahsa Mozafari-Nia,
Salimeh Yasaei Sekeh
Abstract:
Despite the impressive performance of deep neural networks (DNNs), their computational complexity and storage space consumption have led to the concept of network compression. While DNN compression techniques such as pruning and low-rank decomposition have been extensively studied, there has been insufficient attention paid to their theoretical explanation. In this paper, we propose a novel theoretical framework that leverages a probabilistic latent space of DNN weights and explains the optimal network sparsity by using the information-theoretic divergence measures. We introduce new analogous projected patterns (AP2) and analogous-in-probability projected patterns (AP3) notions for DNNs and prove that there exists a relationship between AP3/AP2 property of layers in the network and its performance. Further, we provide a theoretical analysis that explains the training process of the compressed network. The theoretical results are empirically validated through experiments conducted on standard pre-trained benchmarks, including AlexNet, ResNet50, and VGG16, using CIFAR10 and CIFAR100 datasets. Through our experiments, we highlight the relationship of AP3 and AP2 properties with fine-tuning pruned DNNs and sparsity levels.
Submitted 20 May, 2025; v1 submitted 29 February, 2024;
originally announced March 2024.
-
A Theoretical Perspective on Subnetwork Contributions to Adversarial Robustness
Authors:
Jovon Craig,
Josh Andle,
Theodore S. Nowak,
Salimeh Yasaei Sekeh
Abstract:
The robustness of deep neural networks (DNNs) against adversarial attacks has been studied extensively in hopes of both better understanding how deep learning models converge and in order to ensure the security of these models in safety-critical applications. Adversarial training is one approach to strengthening DNNs against adversarial attacks, and has been shown to offer a means for doing so at the cost of applying computationally expensive training methods to the entire model. To better understand these attacks and facilitate more efficient adversarial training, in this paper we develop a novel theoretical framework that investigates how the adversarial robustness of a subnetwork contributes to the robustness of the entire network. To do so we first introduce the concept of semirobustness, which is a measure of the adversarial robustness of a subnetwork. Building on this concept, we then provide a theoretical analysis to show that if a subnetwork is semirobust and there is a sufficient dependency between it and each subsequent layer in the network, then the remaining layers are also guaranteed to be robust. We validate these findings empirically across multiple DNN architectures, datasets, and adversarial attacks. Experiments show the ability of a robust subnetwork to promote full-network robustness, and investigate the layer-wise dependencies required for this full-network robustness to be achieved.
Submitted 7 July, 2023;
originally announced July 2023.
-
Promise and Limitations of Supervised Optimal Transport-Based Graph Summarization via Information Theoretic Measures
Authors:
Sepideh Neshatfar,
Abram Magner,
Salimeh Yasaei Sekeh
Abstract:
Graph summarization is the problem of producing smaller graph representations of an input graph dataset, in such a way that the smaller compressed graphs capture relevant structural information for downstream tasks. There is a recent graph summarization method that formulates an optimal transport-based framework that allows prior information about node, edge, and attribute importance (never defined in that work) to be incorporated into the graph summarization process. However, very little is known about the statistical properties of this framework. To elucidate this question, we consider the problem of supervised graph summarization, wherein by using information theoretic measures we seek to preserve relevant information about a class label. To gain a theoretical perspective on the supervised summarization problem itself, we first formulate it in terms of maximizing the Shannon mutual information between the summarized graph and the class label. We show an NP-hardness of approximation result for this problem, thereby constraining what one should expect from proposed solutions. We then propose a summarization method that incorporates mutual information estimates between random variables associated with sample graphs and class labels into the optimal transport compression framework. We empirically show performance improvements over previous works in terms of classification accuracy and time on synthetic and certain real datasets. We also theoretically explore the limitations of the optimal transport approach for the supervised summarization problem and we show that it fails to satisfy a certain desirable information monotonicity property.
Submitted 11 May, 2023;
originally announced May 2023.
-
Improving Hyperspectral Adversarial Robustness Under Multiple Attacks
Authors:
Nicholas Soucy,
Salimeh Yasaei Sekeh
Abstract:
Semantic segmentation models classifying hyperspectral images (HSI) are vulnerable to adversarial examples. Traditional approaches to adversarial robustness focus on training or retraining a single network on attacked data; however, in the presence of multiple attacks, these approaches decrease in performance compared to networks trained individually on each attack. To combat this issue, we propose an Adversarial Discriminator Ensemble Network (ADE-Net), which focuses on attack type detection and adversarial robustness under a unified model to preserve per data-type weight optimally while robustifying the overall network. In the proposed method, a discriminator network is used to separate data by attack type into their specific attack-expert ensemble network.
Submitted 11 May, 2023; v1 submitted 28 October, 2022;
originally announced October 2022.
-
Theoretical Understanding of the Information Flow on Continual Learning Performance
Authors:
Josh Andle,
Salimeh Yasaei Sekeh
Abstract:
Continual learning (CL) is a setting in which an agent has to learn from an incoming stream of data sequentially. CL performance evaluates the model's ability to continually learn and solve new problems with incrementally available information over time while retaining previous knowledge. Despite the numerous previous solutions to bypass the catastrophic forgetting (CF) of previously seen tasks during the learning process, most of them still suffer from significant forgetting, expensive memory cost, or a lack of theoretical understanding of neural networks' conduct while learning new tasks. While the issue that CL performance degrades under different training regimes has been extensively studied empirically, insufficient attention has been paid from a theoretical angle. In this paper, we establish a probabilistic framework to analyze information flow through layers in networks for task sequences and its impact on learning performance. Our objective is to optimize the information preservation between layers while learning new tasks to manage task-specific knowledge passing throughout the layers while maintaining model performance on previous tasks. In particular, we study CL performance's relationship with information flow in the network to answer the question "How can knowledge of information flow between layers be used to alleviate CF?". Our analysis provides novel insights into information adaptation within the layers during the incremental task learning process. Through our experiments, we provide empirical evidence and practically highlight the performance improvement across multiple tasks.
Submitted 2 May, 2022; v1 submitted 25 April, 2022;
originally announced April 2022.
-
Q-TART: Quickly Training for Adversarial Robustness and in-Transferability
Authors:
Madan Ravi Ganesh,
Salimeh Yasaei Sekeh,
Jason J. Corso
Abstract:
Raw deep neural network (DNN) performance is not enough; in real-world settings, computational load, training efficiency and adversarial security are just as or even more important. We propose to simultaneously tackle Performance, Efficiency, and Robustness, using our proposed algorithm Q-TART, Quickly Train for Adversarial Robustness and in-Transferability. Q-TART follows the intuition that samples highly susceptible to noise strongly affect the decision boundaries learned by DNNs, which in turn degrades their performance and adversarial susceptibility. By identifying and removing such samples, we demonstrate improved performance and adversarial robustness while using only a subset of the training data. Through our experiments we highlight Q-TART's high performance across multiple Dataset-DNN combinations, including ImageNet, and provide insights into the complementary behavior of Q-TART alongside existing adversarial training approaches to increase robustness by over 1.3% while using up to 17.9% less training time.
Submitted 14 April, 2022;
originally announced April 2022.
-
The Stanford Drone Dataset is More Complex than We Think: An Analysis of Key Characteristics
Authors:
Joshua Andle,
Nicholas Soucy,
Simon Socolow,
Salimeh Yasaei Sekeh
Abstract:
Several datasets exist which contain annotated information of individuals' trajectories. Such datasets are vital for many real-world applications, including trajectory prediction and autonomous navigation. One prominent dataset currently in use is the Stanford Drone Dataset (SDD). Despite its prominence, discussion surrounding the characteristics of this dataset is insufficient. We demonstrate how this insufficiency reduces the information available to users and can impact performance. Our contributions include the outlining of key characteristics in the SDD, employment of an information-theoretic measure and custom metric to clearly visualize those characteristics, the implementation of the PECNet and Y-Net trajectory prediction models to demonstrate the outlined characteristics' impact on predictive performance, and lastly a comparison between the SDD and Intersection Drone (inD) Dataset. Our analysis of the SDD's key characteristics is important because, without adequate information about available datasets, users' ability to select the most suitable dataset for their methods, to reproduce one another's results, and to interpret their own results is hindered. The observations we make through this analysis provide a readily accessible and interpretable source of information for those planning to use the SDD. Our intention is to increase the performance and reproducibility of methods applied to this dataset going forward, while also clearly detailing less obvious features of the dataset for new users.
Submitted 22 March, 2022;
originally announced March 2022.
-
CEU-Net: Ensemble Semantic Segmentation of Hyperspectral Images Using Clustering
Authors:
Nicholas Soucy,
Salimeh Yasaei Sekeh
Abstract:
Most semantic segmentation approaches of Hyperspectral images (HSIs) use and require preprocessing steps in the form of patching to accurately classify diversified land cover in remotely sensed images. These approaches use patching to incorporate the rich neighborhood information in images and exploit the simplicity and segmentability of the most common HSI datasets. In contrast, most landmasses in the world consist of overlapping and diffused classes, making neighborhood information weaker than what is seen in common HSI datasets. To combat this issue and generalize the segmentation models to more complex and diverse HSI datasets, in this work, we propose our novel flagship model: Clustering Ensemble U-Net (CEU-Net). CEU-Net uses the ensemble method to combine spectral information extracted from convolutional neural network (CNN) training on a cluster of landscape pixels. Our CEU-Net model outperforms existing state-of-the-art HSI semantic segmentation methods and gets competitive performance with and without patching when compared to baseline models. We highlight CEU-Net's high performance across Botswana, KSC, and Salinas datasets compared to HybridSN and AeroRIT methods.
Submitted 13 March, 2022; v1 submitted 9 March, 2022;
originally announced March 2022.
-
Adaptive County Level COVID-19 Forecast Models: Analysis and Improvement
Authors:
Stewart W Doe,
Tyler Russell Seekins,
David Fitzpatrick,
Dawsin Blanchard,
Salimeh Yasaei Sekeh
Abstract:
Accurately forecasting county level COVID-19 confirmed cases is crucial to optimizing medical resources. Forecasting emerging outbreaks poses a particular challenge because many existing forecasting techniques learn from historical seasonal trends. Recurrent neural networks (RNNs) with LSTM-based cells are a logical choice of model due to their ability to learn temporal dynamics. In this paper, we adapt the state and county level influenza model, TDEFSI-LONLY, proposed in Wang et al. [2020] to national and county level COVID-19 data. We show that this model poorly forecasts the current pandemic. We analyze the two-week-ahead forecasting capabilities of the TDEFSI-LONLY model with combinations of regularization techniques. Effective training of the TDEFSI-LONLY model requires data augmentation; to overcome this challenge, we utilize an SEIR model and present an inter-county mixing extension to this model to simulate sufficient training data. Further, we propose an alternate forecast model, County Level Epidemiological Inference Recurrent Network (CLEIR-Net), that trains an LSTM backbone on national confirmed cases to learn a low dimensional time pattern and utilizes a time distributed dense layer to learn individual county confirmed case changes each day for a two-week forecast. We show that the best, worst, and median state forecasts made using the CLEIR-Net model are respectively New York, South Carolina, and Montana.
Submitted 1 July, 2020; v1 submitted 16 June, 2020;
originally announced June 2020.
-
Slimming Neural Networks using Adaptive Connectivity Scores
Authors:
Madan Ravi Ganesh,
Dawsin Blanchard,
Jason J. Corso,
Salimeh Yasaei Sekeh
Abstract:
In general, deep neural network (DNN) pruning methods fall into two categories: 1) Weight-based deterministic constraints, and 2) Probabilistic frameworks. While each approach has its merits and limitations, there is a set of common practical issues, such as trial-and-error to analyze sensitivity and hyper-parameters to prune DNNs, which plague them both. In this work, we propose a new single-shot, fully automated pruning algorithm called Slimming Neural networks using Adaptive Connectivity Scores (SNACS). Our proposed approach combines a probabilistic pruning framework with constraints on the underlying weight matrices, via a novel connectivity measure, at multiple levels to capitalize on the strengths of both approaches while solving their deficiencies. In SNACS, we propose a fast hash-based estimator of Adaptive Conditional Mutual Information (ACMI), which uses a weight-based scaling criterion, to evaluate the connectivity between filters and prune unimportant ones. To automatically determine the limit up to which a layer can be pruned, we propose a set of operating constraints that jointly define the upper pruning percentage limits across all the layers in a deep network. Finally, we define a novel sensitivity criterion for filters that measures the strength of their contributions to the succeeding layer and highlights critical filters that need to be completely protected from pruning. Through our experimental validation we show that SNACS is over 17x faster than the nearest comparable method and is the state-of-the-art single-shot pruning method across three standard Dataset-DNN pruning benchmarks: CIFAR10-VGG16, CIFAR10-ResNet56 and ILSVRC2012-ResNet50.
Submitted 17 December, 2021; v1 submitted 22 June, 2020;
originally announced June 2020.
-
MINT: Deep Network Compression via Mutual Information-based Neuron Trimming
Authors:
Madan Ravi Ganesh,
Jason J. Corso,
Salimeh Yasaei Sekeh
Abstract:
Most approaches to deep neural network compression via pruning either evaluate a filter's importance using its weights or optimize an alternative objective function with sparsity constraints. While these methods offer a useful way to approximate contributions from similar filters, they often either ignore the dependency between layers or solve a more difficult optimization objective than standard cross-entropy. Our method, Mutual Information-based Neuron Trimming (MINT), approaches deep compression via pruning by enforcing sparsity based on the strength of the relationship between filters of adjacent layers, across every pair of layers. The relationship is calculated using conditional geometric mutual information which evaluates the amount of similar information exchanged between the filters using a graph-based criterion. When pruning a network, we ensure that retained filters contribute the majority of the information towards succeeding layers which ensures high performance. Our novel approach outperforms existing state-of-the-art compression-via-pruning methods on the standard benchmarks for this task: MNIST, CIFAR-10, and ILSVRC2012, across a variety of network architectures. In addition, we discuss our observations of a common denominator between our pruning methodology's response to adversarial attacks and calibration statistics when compared to the original network.
Submitted 18 March, 2020;
originally announced March 2020.
-
A Geometric Approach to Online Streaming Feature Selection
Authors:
Salimeh Yasaei Sekeh,
Madan Ravi Ganesh,
Shurjo Banerjee,
Jason J. Corso,
Alfred O. Hero
Abstract:
Online Streaming Feature Selection (OSFS) is a sequential learning problem where individual features across all samples are made available to algorithms in a streaming fashion. In this work, firstly, we assert that OSFS's main assumption of having data from all the samples available at runtime is unrealistic and introduce a new setting, called OSFS with Streaming Samples (OSFS-SS), where features and samples are streamed concurrently. Secondly, the primary OSFS method, SAOLA, utilizes an unbounded mutual information measure and requires multiple comparison steps between the stored and incoming feature sets to evaluate a feature's importance. We introduce Geometric Online Adaption (GOA), an algorithm that requires relatively fewer feature comparison steps and uses a bounded conditional geometric dependency measure. Our algorithm outperforms several OSFS baselines including SAOLA on a variety of datasets. We also extend SAOLA to work in the OSFS-SS setting and show that GOA continues to achieve the best results. Thirdly, the current paradigm of the OSFS algorithm comparison is flawed. Algorithms are measured by comparing the number of features used and the accuracy obtained by the learner, two properties that are fundamentally at odds with one another. Without fixing a limit on either of these properties, the qualities of features obtained by different algorithms are incomparable. We try to rectify this inconsistency by fixing the maximum number of features available to the learner and comparing algorithms in terms of their accuracy. Additionally, we characterize the behaviour of SAOLA and GOA on feature sets derived from popular deep convolutional featurizers.
Submitted 16 March, 2020; v1 submitted 2 October, 2019;
originally announced October 2019.
-
Geometric Estimation of Multivariate Dependency
Authors:
Salimeh Yasaei Sekeh,
Alfred O. Hero
Abstract:
This paper proposes a geometric estimator of dependency between a pair of multivariate samples. The proposed estimator of dependency is based on a randomly permuted geometric graph (the minimal spanning tree) over the two multivariate samples. This estimator converges to a quantity that we call the geometric mutual information (GMI), which is equivalent to the Henze-Penrose divergence [1] between the joint distribution of the multivariate samples and the product of the marginals. The GMI has many of the same properties as standard MI but can be estimated from empirical data without density estimation, making it scalable to large datasets. The proposed empirical estimator of GMI is simple to implement, involving the construction of an MST spanning over both the original data and a randomly permuted version of this data. We establish asymptotic convergence of the estimator and convergence rates of the bias and variance for smooth multivariate density functions belonging to a Hölder class. We demonstrate the advantages of our proposed geometric dependency estimator in a series of experiments.
Submitted 21 May, 2019;
originally announced May 2019.
-
Feature Selection for multi-labeled variables via Dependency Maximization
Authors:
Salimeh Yasaei Sekeh,
Alfred O. Hero
Abstract:
Feature selection and dimensionality reduction are essential steps in data analysis. In this work, we propose a new criterion for feature selection that is formulated as conditional information between features given the labeled variable. Instead of using the standard mutual information measure based on Kullback-Leibler divergence, we use our proposed criterion to filter out redundant features for the purpose of multiclass classification. This approach results in an efficient and fast non-parametric implementation of feature selection as it can be directly estimated using a geometric measure of dependency, the global Friedman-Rafsky (FR) multivariate run test statistic constructed by a global minimal spanning tree (MST). We demonstrate the advantages of our proposed feature selection approach through simulation. In addition, the proposed feature selection method is applied to the MNIST data set.
Submitted 16 May, 2019; v1 submitted 10 February, 2019;
originally announced February 2019.
-
Learning to Bound the Multi-class Bayes Error
Authors:
Salimeh Yasaei Sekeh,
Brandon Oselio,
Alfred O. Hero
Abstract:
In the context of supervised learning, meta learning uses features, metadata and other information to learn about the difficulty, behavior, or composition of the problem. Using this knowledge can be useful to contextualize classifier results or allow for targeted decisions about future data sampling. In this paper, we are specifically interested in learning the Bayes error rate (BER) based on a labeled data sample. Providing a tight bound on the BER that is also feasible to estimate has been a challenge. Previous work [1] has shown that a pairwise bound based on the sum of Henze-Penrose (HP) divergence over label pairs can be directly estimated using a sum of Friedman-Rafsky (FR) multivariate run test statistics. However, in situations in which the dataset and number of classes are large, this bound is computationally infeasible to calculate and may not be tight. Other multi-class bounds also suffer from computationally complex estimation procedures. In this paper, we present a generalized HP divergence measure that allows us to estimate the Bayes error rate with log-linear computation. We prove that the proposed bound is tighter than both the pairwise method and a bound proposed by Lin [2]. We also empirically show that these bounds are close to the BER. We illustrate the proposed method on the MNIST dataset, and show its utility for the evaluation of feature reduction strategies. We further demonstrate an approach for evaluation of deep learning architectures using the proposed bounds.
Submitted 27 April, 2020; v1 submitted 15 November, 2018;
originally announced November 2018.
-
Convergence Rates for Empirical Estimation of Binary Classification Bounds
Authors:
Salimeh Yasaei Sekeh,
Morteza Noshad,
Kevin R. Moon,
Alfred O. Hero
Abstract:
Bounding the best achievable error probability for binary classification problems is relevant to many applications including machine learning, signal processing, and information theory. Many bounds on the Bayes binary classification error rate depend on information divergences between the pair of class distributions. Recently, the Henze-Penrose (HP) divergence has been proposed for bounding classification error probability. We consider the problem of empirically estimating the HP-divergence from random samples. We derive a bound on the convergence rate for the Friedman-Rafsky (FR) estimator of the HP-divergence, which is related to a multivariate runs statistic for testing between two distributions. The FR estimator is derived from a multicolored Euclidean minimal spanning tree (MST) that spans the merged samples. We obtain a concentration inequality for the Friedman-Rafsky estimator of the Henze-Penrose divergence. We validate our results experimentally and illustrate their application to real datasets.
Submitted 1 October, 2018;
originally announced October 2018.
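As a rough illustration of the Friedman-Rafsky (FR) estimator described in the abstract above, the following Python sketch builds a Euclidean MST over the merged samples, counts the edges that join points from different samples, and plugs that count $R_{m,n}$ into the commonly cited form $1 - R_{m,n}(m+n)/(2mn)$. The function name `fr_hp_divergence`, the use of NumPy and SciPy, and the clipping of the estimate at zero are assumptions made for this example; they are not taken from the paper or its code.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import cdist


def fr_hp_divergence(x, y):
    """Sketch of an FR-type estimate of the HP-divergence (hypothetical helper).

    x: array of shape (m, d), sample from the first distribution.
    y: array of shape (n, d), sample from the second distribution.
    Returns (estimate, cross_edges).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m, n = len(x), len(y)

    # Merge the samples and remember which sample each point came from.
    z = np.vstack([x, y])
    labels = np.concatenate([np.zeros(m, dtype=int), np.ones(n, dtype=int)])

    # Dense pairwise Euclidean distances; O((m + n)^2) memory, so small data only.
    # Caveat: SciPy treats zero entries as missing edges, so exact duplicate
    # points are assumed not to occur (true almost surely for continuous data).
    dist = cdist(z, z)

    # Euclidean MST over the merged sample.
    mst = minimum_spanning_tree(dist).tocoo()

    # FR statistic: number of MST edges joining points with different labels.
    cross_edges = int(np.count_nonzero(labels[mst.row] != labels[mst.col]))

    # Plug-in estimate, clipped at zero because the raw value can be
    # slightly negative for finite samples (the clipping is an assumption here).
    estimate = max(0.0, 1.0 - cross_edges * (m + n) / (2.0 * m * n))
    return estimate, cross_edges


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.normal(0.0, 1.0, size=(200, 2))
    b = rng.normal(1.0, 1.0, size=(200, 2))
    print(fr_hp_divergence(a, b))
```

For two widely separated samples the MST needs only one edge to bridge them, so the estimate approaches 1; for samples drawn from the same distribution, roughly half of the MST edges are cross edges and the estimate is close to 0.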
-
A Dimension-Independent discriminant between distributions
Authors:
Salimeh Yasaei Sekeh,
Brandon Oselio,
Alfred O. Hero
Abstract:
Henze-Penrose divergence is a non-parametric divergence measure that can be used to estimate a bound on the Bayes error in a binary classification problem. In this paper, we show that a cross-match statistic based on optimal weighted matching can be used to directly estimate Henze-Penrose divergence. Unlike an earlier approach based on the Friedman-Rafsky minimal spanning tree statistic, the proposed method is dimension-independent. The new approach is evaluated using simulation and applied to real datasets to obtain Bayes error estimates.
Submitted 13 February, 2018;
originally announced February 2018.
-
Direct Estimation of Information Divergence Using Nearest Neighbor Ratios
Authors:
Morteza Noshad,
Kevin R. Moon,
Salimeh Yasaei Sekeh,
Alfred O. Hero III
Abstract:
We propose a direct estimation method for Rényi and f-divergence measures based on a new graph theoretical interpretation. Suppose that we are given two sample sets $X$ and $Y$, respectively with $N$ and $M$ samples, where $η:=M/N$ is a constant value. Considering the $k$-nearest neighbor ($k$-NN) graph of $Y$ in the joint data set $(X,Y)$, we show that the average powered ratio of the number of $X$ points to the number of $Y$ points among all $k$-NN points is proportional to Rényi divergence of $X$ and $Y$ densities. A similar method can also be used to estimate f-divergence measures. We derive bias and variance rates, and show that for the class of $γ$-Hölder smooth functions, the estimator achieves the MSE rate of $O(N^{-2γ/(γ+d)})$. Furthermore, by using a weighted ensemble estimation technique, for density functions with continuous and bounded derivatives of up to the order $d$, and some extra conditions at the support set boundary, we derive an ensemble estimator that achieves the parametric MSE rate of $O(1/N)$. Our estimators are more computationally tractable than other competing estimators, which makes them appealing in many practical applications.
Submitted 20 November, 2017; v1 submitted 16 February, 2017;
originally announced February 2017.
-
Information Theoretic Structure Learning with Confidence
Authors:
Kevin R. Moon,
Morteza Noshad,
Salimeh Yasaei Sekeh,
Alfred O. Hero III
Abstract:
Information theoretic measures (e.g. the Kullback-Leibler divergence and Shannon mutual information) have been used for exploring possibly nonlinear multivariate dependencies in high dimension. If these dependencies are assumed to follow a Markov factor graph model, this exploration process is called structure discovery. For discrete-valued samples, estimates of the information divergence over the parametric class of multinomial models lead to structure discovery methods whose mean squared error achieves parametric convergence rates as the sample size grows. However, a naive application of this method to continuous nonparametric multivariate models converges much more slowly. In this paper we introduce a new method for nonparametric structure discovery that uses weighted ensemble divergence estimators that achieve parametric convergence rates and obey an asymptotic central limit theorem that facilitates hypothesis testing and other types of statistical validation.
Submitted 13 September, 2016;
originally announced September 2016.
-
On weighted Fisher information matrix properties
Authors:
Mark Kelbert,
Yuri Suhov,
Salimeh Yasaei Sekeh
Abstract:
In this paper, we review properties of Fisher information matrices in the weighted version and discuss inequalities/bounds on them by using reduced weight functions. In particular, an extended form of the Fisher information inequality previously established in [6] is given. Further, along with a generalized de Bruijn identity, we provide a new interpretation of the concavity of the entropy power.
Submitted 28 January, 2016; v1 submitted 27 January, 2016;
originally announced January 2016.
-
Results on the solutions of maximum weighted Renyi entropy problems
Authors:
Salimeh Yasaei Sekeh
Abstract:
In this paper, following standard arguments, the maximum Renyi entropy problem for the weighted case is analyzed. We verify that under some constraints on the weight function, the Student-r and Student-t distributions maximize the weighted Renyi entropy. Furthermore, an extended version of the Hadamard inequality is derived.
Submitted 26 October, 2015;
originally announced October 2015.
-
Basic inequalities for weighted entropies
Authors:
Yuri Suhov,
Izabella Stuhl,
Salimeh Yasaei Sekeh,
Mark Kelbert
Abstract:
The concept of weighted entropy takes into account values of different outcomes, i.e., makes entropy context-dependent, through the weight function. In this paper, we establish a number of simple inequalities for the weighted entropies (general as well as specific), mirroring similar bounds on standard (Shannon) entropies and related quantities. The required assumptions are written in terms of various expectations of the weight functions. Examples are weighted Ky Fan and weighted Hadamard inequalities involving determinants of positive-definite matrices, and weighted Cramér-Rao inequalities involving the weighted Fisher information matrix.
Submitted 14 January, 2016; v1 submitted 7 October, 2015;
originally announced October 2015.
-
Extended inequalities for weighted Renyi entropy involving generalized Gaussian densities
Authors:
Salimeh Yasaei Sekeh
Abstract:
In this paper the author analyses the weighted Renyi entropy in order to derive several inequalities in the weighted case. Furthermore, using the proposed notions of $α$-th generalized derivation and ($α$; p)-th weighted Fisher information, extended versions of the moment-entropy, Fisher information and Cramer-Rao inequalities in terms of generalized Gaussian densities are given.
Submitted 15 October, 2015; v1 submitted 7 September, 2015;
originally announced September 2015.
-
A short note on estimation of WCRE and WCE
Authors:
Salimeh Yasaei Sekeh
Abstract:
In this note the author uses order statistics to estimate WCRE and WCE in terms of empirical and survival functions. An example is analyzed for both normal and exponential WFs.
Submitted 25 August, 2015; v1 submitted 19 August, 2015;
originally announced August 2015.
-
On double truncated (interval) WCRE and WCE
Authors:
Salimeh Yasaei Sekeh,
Gholamreza Mohtashami Borzadran,
Abdolhamid Rezaei Roknabadi
Abstract:
Measures of the weighted cumulative entropy about the predictability of the failure time of a system were introduced in [3]. Referring to properties of doubly truncated (interval) cumulative residual and past entropy, several bounds and assertions are proposed in the weighted version.
Submitted 27 August, 2015; v1 submitted 2 August, 2015;
originally announced August 2015.
-
Weighted cumulative entropies: An extension of CRE and CE
Authors:
Yuri Suhov,
Salimeh Yasaei Sekeh
Abstract:
We generalize the weighted cumulative entropies (WCRE and WCE), introduced in [5], for a system or component lifetime. Representing properties of cumulative entropies, several bounds and inequalities for the WCRE are proposed.
Submitted 24 July, 2015;
originally announced July 2015.
-
On relative weighted entropies with central moments weight functions
Authors:
Salimeh Yasaei Sekeh,
Adriano Polpo
Abstract:
Following [1], the aim of this paper is to analyze the relative weighted entropy involving the central moments weight functions. We compare the standard relative entropy with the weighted case in two particular forms of Gaussian distributions. As an application, the weighted deviance information criterion is proposed.
Submitted 22 June, 2015; v1 submitted 16 June, 2015;
originally announced June 2015.
-
Weighted Gaussian entropy and determinant inequalities
Authors:
Y. Suhov,
S. Yasaei Sekeh,
I. Stuhl
Abstract:
We produce a series of results extending information-theoretical inequalities (discussed by Dembo--Cover--Thomas in 1989-1991) to a weighted version of entropy. The resulting inequalities involve the Gaussian weighted entropy; they imply a number of new relations for determinants of positive-definite matrices.
Submitted 7 May, 2015;
originally announced May 2015.
-
An extension of the Ky Fan inequality
Authors:
Yuri Suhov,
Salimeh Yasaei Sekeh
Abstract:
The aim of this paper is to analyze the weighted Ky Fan inequality proposed in [11]. A number of numerical simulations involving the exponential weight function are given. We show that in several cases and types of examples one obtains an improvement of the standard Ky Fan inequality.
Submitted 5 April, 2015;
originally announced April 2015.
-
Entropy-power inequality for weighted entropy
Authors:
Yuri Suhov,
Salimeh Yasaei Sekeh,
Mark Kelbert
Abstract:
We analyse an analog of the entropy-power inequality for the weighted entropy.
Submitted 9 March, 2015; v1 submitted 7 February, 2015;
originally announced February 2015.
-
Simple inequalities for weighted entropies
Authors:
Yuri Suhov,
Salimeh Yasaei Sekeh
Abstract:
A number of inequalities for the weighted entropies are proposed, mirroring properties of the standard (Shannon) entropy and related quantities.
Submitted 9 October, 2015; v1 submitted 14 September, 2014;
originally announced September 2014.
-
Comparison results for Garch processes
Authors:
Fabio Bellini,
Franco Pellerey,
Carlo Sgarra,
Salimeh Yasaei Sekeh
Abstract:
We consider the problem of stochastic comparison of general Garch-like processes, for different parameters and different distributions of the innovations. We identify several stochastic orders that are propagated from the innovations to the Garch process itself, and discuss their interpretations. We focus on the convex order and show that, in the case of symmetric innovations, it is also propagated to the cumulated sums of the Garch process. More generally, we discuss multivariate comparison results related to the multivariate convex and supermodular order. Finally, we discuss ordering with respect to the parameters in the Garch(1,1) case. Key words: Garch, Convex Order, Peakedness, Kurtosis, Supermodularity.
Submitted 17 April, 2012;
originally announced April 2012.