Search | arXiv e-print repository

A Novel Cholesky Kernel based Support Vector Classifier

Authors: Satyajeet Sahoo, Jhareswar Maiti

Abstract: Support Vector Machine (SVM) is a popular supervised classification model that works by first finding the margin boundaries for the training data classes and then calculating the decision boundary, which is then used to classify the test data. This study demonstrates limitations of traditional support vector classification, which uses Cartesian coordinate geometry to find the margin and decision boundaries in an input space using only a few support vectors, without considering data variance and correlation. Subsequently, the study proposes a new Cholesky kernel that adjusts for the effects of the variance-covariance structure of the data in the decision boundary equation and margin calculations. The study demonstrates that the SVM model is valid only in the Euclidean space, and that the Cholesky kernel obtained by decomposing the covariance matrix acts as a transformation matrix which, when applied to the original data, transforms the data from the input space to the Euclidean space. The effectiveness of the Cholesky kernel based SVM classifier is demonstrated by classifying the Wisconsin Breast Cancer (Diagnostic) Dataset and comparing with traditional SVM approaches. The Cholesky kernel based SVM model shows marked improvement in the precision, recall and F1 scores compared to linear and other kernel SVMs.

Submitted 6 April, 2025; originally announced April 2025.
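
The transformation described in this abstract can be illustrated with a minimal sketch. It assumes the sample covariance matrix is factored as cov = L Lᵀ and that each centered sample x is mapped to L⁻¹x, which gives features with identity covariance. This is standard Cholesky whitening and not necessarily the authors' exact kernel; the function name `cholesky_whiten` and the synthetic data are hypothetical.

```python
import numpy as np

def cholesky_whiten(X):
    """Center X (n_samples, n_features) and map it through L^{-1}, where cov = L L^T.

    Assumption: the sample covariance is positive definite, so the
    Cholesky factor exists.
    """
    mean = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)           # sample covariance of the features
    L = np.linalg.cholesky(cov)             # lower-triangular factor
    Z = np.linalg.solve(L, (X - mean).T).T  # solve L z = (x - mean) for every row
    return Z, mean, L

# Synthetic correlated data (illustrative only).
rng = np.random.default_rng(0)
A = np.array([[2.0, 0.0], [1.5, 0.5]])
X = rng.normal(size=(500, 2)) @ A.T

Z, mean, L = cholesky_whiten(X)
whitened_cov = np.cov(Z, rowvar=False)      # identity up to rounding error
```

A linear classifier fitted on `Z` then operates on decorrelated, unit-variance features; new samples would be mapped with the stored `mean` and `L` before prediction.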

arXiv:2503.00307 [pdf, other]

Remasking Discrete Diffusion Models with Inference-Time Scaling

Authors: Guanghan Wang, Yair Schiff, Subham Sekhar Sahoo, Volodymyr Kuleshov

Abstract: Part of the success of diffusion models stems from their ability to perform iterative refinement, i.e., repeatedly correcting outputs during generation. However, modern masked discrete diffusion lacks this capability: when a token is generated, it cannot be updated again, even when it introduces an error. Here, we address this limitation by introducing the remasking diffusion model (ReMDM) sampler, a method that can be applied to pretrained masked diffusion models in a principled way and that is derived from a discrete diffusion model with a custom remasking backward process. Most interestingly, ReMDM endows discrete diffusion with a form of inference-time compute scaling. By increasing the number of sampling steps, ReMDM generates natural language outputs that approach the quality of autoregressive models, whereas when the computation budget is limited, ReMDM better maintains quality. ReMDM also improves sample quality of masked diffusion models for discretized images, and in scientific domains such as molecule design, ReMDM facilitates diffusion guidance and pushes the Pareto frontier of controllability relative to classical masking and uniform noise diffusion. We provide the code along with a blog post on the project page: https://remdm.github.io

Submitted 21 May, 2025; v1 submitted 28 February, 2025; originally announced March 2025.

Comments: Project page: https://remdm.github.io

arXiv:2502.02233 [pdf]

Variance-Adjusted Cosine Distance as Similarity Metric

Authors: Satyajeet Sahoo, Jhareswar Maiti

Abstract: Cosine similarity is a popular measure of the similarity between two vectors in the inner product space. It is widely used in many data classification algorithms such as K-Nearest Neighbors and clustering. This study demonstrates limitations in the application of cosine similarity. In particular, it demonstrates that the traditional cosine similarity metric is valid only in the Euclidean space, whereas the original data resides in a random variable space. When there is variance and correlation in the data, cosine distance is not a completely accurate measure of similarity. While new similarity and distance metrics have been developed to make up for the limitations of cosine similarity, these metrics are used as substitutes for cosine distance and do not modify cosine distance to overcome its limitations. Subsequently, we propose a modified cosine similarity metric, where cosine distance is adjusted by the variance-covariance of the data. Application of variance-adjusted cosine distance gives better similarity performance compared to traditional cosine distance. KNN modelling on the Wisconsin Breast Cancer Dataset is performed using both traditional and modified cosine similarity measures and compared. The modified formula shows 100% test accuracy on the data.

Submitted 4 February, 2025; originally announced February 2025.

Comments: 6 Pages
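
One way to picture a covariance-adjusted cosine measure is sketched below. The abstract does not state the formula, so the form used here is an assumption: cosine similarity computed under the inner product xᵀ cov⁻¹ y. The function name `variance_adjusted_cosine` and the example vectors are hypothetical and are not taken from the paper.

```python
import numpy as np

def variance_adjusted_cosine(x, y, cov):
    """Cosine similarity under the inner product <x, y> = x^T cov^{-1} y.

    Assumed form for illustration; the paper's exact adjustment may differ.
    """
    cov_inv = np.linalg.inv(cov)
    num = x @ cov_inv @ y
    den = np.sqrt((x @ cov_inv @ x) * (y @ cov_inv @ y))
    return num / den

# First feature has variance 4, second has variance 1, no correlation.
cov = np.array([[4.0, 0.0], [0.0, 1.0]])
x = np.array([2.0, 1.0])
y = np.array([2.0, -1.0])

plain = x @ y / (np.linalg.norm(x) * np.linalg.norm(y))  # ordinary cosine similarity
adjusted = variance_adjusted_cosine(x, y, cov)           # high-variance axis is down-weighted
```

In this toy case the ordinary cosine similarity is 0.6, while the adjusted value is 0 because the agreement along the high-variance first feature is down-weighted relative to the disagreement along the second.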

arXiv:2412.14527 [pdf, other]

Statistical Undersampling with Mutual Information and Support Points

Authors: Alex Mak, Shubham Sahoo, Shivani Pandey, Yidan Yue, Linglong Kong

Abstract: Class imbalance and distributional differences in large datasets present significant challenges for classification tasks in machine learning, often leading to biased models and poor predictive performance for minority classes. This work introduces two novel undersampling approaches: mutual information-based stratified simple random sampling and support points optimization. These methods prioritize representative data selection, effectively minimizing information loss. Empirical results across multiple classification tasks demonstrate that our methods outperform traditional undersampling techniques, achieving higher balanced classification accuracy. These findings highlight the potential of combining statistical concepts with machine learning to address class imbalance in practical applications.

Submitted 18 December, 2024; originally announced December 2024.

arXiv:2404.01468 [pdf, other]

Performance triggered adaptive model reduction for soil moisture estimation in precision irrigation

Authors: Sarupa Debnath, Bernard T. Agyeman, Soumya R. Sahoo, Xunyuan Yin, Jinfeng Liu

Abstract: Accurate soil moisture information is crucial for developing precise irrigation control strategies to enhance water use efficiency. Soil moisture estimation based on limited soil moisture sensors is crucial for obtaining comprehensive soil moisture information when dealing with large-scale agricultural fields. The major challenge in soil moisture estimation lies in the high dimensionality of the spatially discretized agro-hydrological models. In this work, we propose a performance-triggered adaptive model reduction approach to address this challenge. The proposed approach employs a trajectory-based unsupervised machine learning technique, and a prediction performance-based triggering scheme is designed to govern model updates adaptively in a way such that the prediction error between the reduced model and the original model over a prediction horizon is maintained below a predetermined threshold. An adaptive extended Kalman filter (EKF) is designed based on the reduced model for soil moisture estimation. The applicability and performance of the proposed approach are evaluated extensively through the application to a simulated large-scale agricultural field.

Submitted 1 April, 2024; originally announced April 2024.

arXiv:2206.06672 [pdf, other]

Semi-Autoregressive Energy Flows: Exploring Likelihood-Free Training of Normalizing Flows

Authors: Phillip Si, Zeyi Chen, Subham Sekhar Sahoo, Yair Schiff, Volodymyr Kuleshov

Abstract: Training normalizing flow generative models can be challenging due to the need to calculate computationally expensive determinants of Jacobians. This paper studies the likelihood-free training of flows and proposes the energy objective, an alternative sample-based loss based on proper scoring rules. The energy objective is determinant-free and supports flexible model architectures that are not easily compatible with maximum likelihood training, including semi-autoregressive energy flows, a novel model family that interpolates between fully autoregressive and non-autoregressive models. Energy flows feature competitive sample quality, posterior inference, and generation speed relative to likelihood-based flows; this performance is decorrelated from the quality of log-likelihood estimates, which are generally very poor. Our findings question the use of maximum likelihood as an objective or a metric, and contribute to a scientific study of its role in generative modeling.

Submitted 22 June, 2023; v1 submitted 14 June, 2022; originally announced June 2022.

Comments: 9 pages, 3 figures, 8 tables, 11 pages appendix

MSC Class: 68T37 (Primary) 68T07 (Secondary)

arXiv:2203.06548 [pdf, other]

doi 10.7910/DVN/QSJNFJ

Impact of sensor placement in soil water estimation: A real-case study

Authors: Erfan Orouskhani, Soumya R. Sahoo, Bernard T. Agyeman, Song Bo, Jinfeng Liu

Abstract: One of the essential elements in implementing a closed-loop irrigation system is soil moisture estimation based on a limited number of available sensors. One associated problem is the determination of the optimal locations to install the sensors such that good soil moisture estimation can be obtained. In our previous work, the modal degree of observability was employed to address the problem of optimal sensor placement for soil moisture estimation of agro-hydrological systems. It was demonstrated that the optimally placed sensors can improve the soil moisture estimation performance. However, it is unclear whether the optimal sensor placement can significantly improve the soil moisture estimation performance in actual applications. In this work, we investigate the impact of sensor placement in soil moisture estimation for an actual agricultural field in Lethbridge, Alberta, Canada. In an experiment on the studied field, 42 soil moisture sensors were installed at different depths to collect the soil moisture measurements for one growing season. A three-dimensional agro-hydrological model with heterogeneous soil parameters of the studied field is developed. The modal degree of observability is applied to the three-dimensional system to determine the optimal sensor locations. The extended Kalman filter (EKF) is chosen as the data assimilation tool to estimate the soil moisture content of the studied field. Soil moisture estimation results for different scenarios are obtained and analyzed to investigate the effects of sensor placement on the performance of soil moisture estimation in actual applications.

Submitted 12 March, 2022; originally announced March 2022.

arXiv:2010.03228 [pdf, other]

FairMixRep : Self-supervised Robust Representation Learning for Heterogeneous Data with Fairness constraints

Authors: Souradip Chakraborty, Ekansh Verma, Saswata Sahoo, Jyotishka Datta

Abstract: Representation Learning in a heterogeneous space with mixed variables of numerical and categorical types has interesting challenges due to its complex feature manifold. Moreover, feature learning in an unsupervised setup, without class labels and a suitable learning loss function, adds to the problem complexity. Further, the learned representation and subsequent predictions should not reflect discriminatory behavior towards certain sensitive groups or attributes. The proposed feature map should preserve maximum variations present in the data and needs to be fair with respect to the sensitive variables. We propose, in the first phase of our work, an efficient encoder-decoder framework to capture the mixed-domain information. The second phase of our work focuses on de-biasing the mixed space representations by adding relevant fairness constraints. This ensures minimal information loss between the representations before and after the fairness-preserving projections. Both the information content and the fairness aspect of the final learned representation have been validated through several metrics, on which it shows excellent performance. Our work (FairMixRep) addresses the problem of Mixed Space Fair Representation learning from an unsupervised perspective and learns a Universal representation that is timely, unique, and a novel research contribution.

Submitted 14 October, 2020; v1 submitted 7 October, 2020; originally announced October 2020.

Comments: This paper has been accepted at the ICDM'2020 DLC Workshop

arXiv:2009.09634 [pdf, other]

Learning Representation for Mixed Data Types with a Nonlinear Deep Encoder-Decoder Framework

Authors: Saswata Sahoo, Souradip Chakraborty

Abstract: Representation of data on mixed variables, of numerical and categorical types, to get a suitable feature map is a challenging task, as important information lies in a complex non-linear manifold. The feature transformation should be able to incorporate marginal information of the individual variables and the complex cross-dependence structure among the mixed type of variables simultaneously. In this work, we propose a novel nonlinear Deep Encoder-Decoder framework to capture the cross-domain information for mixed data types. The hidden layers of the network connect the two types of variables through various non-linear transformations to give latent feature maps. We encode the information on the numerical variables in a number of hidden nonlinear units. We use these units to recreate categorical variables through further nonlinear transformations. A separate and similar network is developed by switching the roles of the numerical and categorical variables. The hidden representational units are stacked next to one another and transformed into a common space using a locality preserving projection. The derived feature maps are used to explore the clusters in the data. Various standard datasets are investigated to show nearly state-of-the-art performance in clustering using the feature maps with simple K-means clustering.

Submitted 21 September, 2020; originally announced September 2020.

arXiv:2006.16322 [pdf, other]

Scaling Symbolic Methods using Gradients for Neural Model Explanation

Authors: Subham Sekhar Sahoo, Subhashini Venugopalan, Li Li, Rishabh Singh, Patrick Riley

Abstract: Symbolic techniques based on Satisfiability Modulo Theory (SMT) solvers have been proposed for analyzing and verifying neural network properties, but their usage has been fairly limited owing to their poor scalability with larger networks. In this work, we propose a technique for combining gradient-based methods with symbolic techniques to scale such analyses and demonstrate its application for model explanation. In particular, we apply this technique to identify minimal regions in an input that are most relevant for a neural network's prediction. Our approach uses gradient information (based on Integrated Gradients) to focus on a subset of neurons in the first layer, which allows our technique to scale to large networks. The corresponding SMT constraints encode the minimal input mask discovery problem such that after masking the input, the activations of the selected neurons are still above a threshold. After solving for the minimal masks, our approach scores the mask regions to generate a relative ordering of the features within the mask. This produces a saliency map which explains "where a model is looking" when making a prediction. We evaluate our technique on three datasets - MNIST, ImageNet, and Beer Reviews, and demonstrate both quantitatively and qualitatively that the regions generated by our approach are sparser and achieve higher saliency scores compared to the gradient-based methods alone. Code and examples are at - https://github.com/google-research/google-research/tree/master/smug_saliency

Submitted 5 May, 2021; v1 submitted 29 June, 2020; originally announced June 2020.

arXiv:2005.02817 [pdf, other]

Graph Spectral Feature Learning for Mixed Data of Categorical and Numerical Type

Authors: Saswata Sahoo, Souradip Chakraborty

Abstract: Feature learning in the presence of a mixed type of variables, numerical and categorical types, is an important issue for related modeling problems. For simple neighborhood queries under mixed data space, standard practice is to consider numerical and categorical variables separately and combine them based on some suitable distance functions. Alternatives, such as Kernel learning or Principal Component, do not explicitly consider the inter-dependence structure among the mixed type of variables. In this work, we propose a novel strategy to explicitly model the probabilistic dependence structure among the mixed type of variables by an undirected graph. Spectral decomposition of the graph Laplacian provides the desired feature transformation. The Eigen spectrum of the transformed feature space shows increased separability and more prominent clusterability among the observations. The main novelty of our paper lies in capturing interactions of the mixed feature type in an unsupervised framework using a graphical model. We numerically validate the implications of the feature learning strategy

Submitted 6 May, 2020; originally announced May 2020.

arXiv:1806.07259 [pdf, other]

Learning Equations for Extrapolation and Control

Authors: Subham S. Sahoo, Christoph H. Lampert, Georg Martius

Abstract: We present an approach to identify concise equations from data using a shallow neural network approach. In contrast to ordinary black-box regression, this approach allows understanding functional relations and generalizing them from observed data to unseen parts of the parameter space. We show how to extend the class of learnable equations for a recently proposed equation learning network to include divisions, and we improve the learning and model selection strategy to be useful for challenging real-world data. For systems governed by analytical expressions, our method can in many cases identify the true underlying equation and extrapolate to unseen domains. We demonstrate its effectiveness by experiments on a cart-pendulum system, where only 2 random rollouts are required to learn the forward dynamics and successfully achieve the swing-up task.

Submitted 19 June, 2018; originally announced June 2018.

Comments: 9 pages, 9 figures, ICML 2018

MSC Class: 68T05; 68T30; 68T40; 62M20; 62J02; 65D15; 70E60; 93C40 ACM Class: I.2.6; I.2.8

arXiv:1612.06738 [pdf, other]

Local Sparse Approximation for Image Restoration with Adaptive Block Size Selection

Authors: Sujit Kumar Sahoo

Abstract: In this paper, the problem of image restoration (denoising and inpainting) is approached using sparse approximation of local image blocks. The local image blocks are extracted by sliding square windows over the image. An adaptive block size selection procedure for local sparse approximation is proposed, which affects the global recovery of the underlying image. Ideally, the adaptive local block selection yields the minimum mean square error (MMSE) in the recovered image. This framework gives us a clustered image based on the selected block size; each cluster is then restored separately using sparse approximation. The results obtained using the proposed framework are comparable with those of recently proposed image restoration techniques.

Submitted 20 December, 2016; originally announced December 2016.

Showing 1–13 of 13 results for author: Sahoo, S