-
Characterization based Goodness-of-Fit for Generalized Pareto Distribution: A Blend of Stein's Identity and Dynamic Survival Extropy
Authors:
Gaurav Kandpal,
Nitin Gupta
Abstract:
This paper proposes a goodness-of-fit test for the generalized Pareto distribution (GPD). First, we provide two characterizations of the GPD based on Stein's identity and dynamic survival extropy. These characterizations are used to test for the GPD separately in the positive and negative shape parameter cases. A Monte Carlo simulation is conducted to provide the critical values and power of the proposed test against a good number of alternatives. Our test is simple to use, is asymptotically normal, and has relatively high power, which motivates proposing it. We also provide a procedure for handling right-censored data. A few real-life applications are also included.
Submitted 2 June, 2025;
originally announced June 2025.
-
Weighted Tail Random Variable: A Novel Framework with Stochastic Properties and Applications
Authors:
Sarikul Islam,
Nitin Gupta
Abstract:
This paper introduces a novel framework to construct the probability density function (PDF) of non-negative continuous random variables. The proposed framework uses two functions: one is the survival function (SF) of a non-negative continuous random variable, and the other is a weight function, which is an increasing and differentiable function satisfying some properties. The resulting random variable is referred to as the weighted tail random variable (WTRV) corresponding to the given random variable and the weight function. We investigate several reliability properties of the WTRV and establish various stochastic orderings between a random variable and its WTRV, as well as between two WTRVs. Using this framework, we construct a WTRV of the Kumaraswamy distribution. We conduct goodness-of-fit tests for two real-world datasets, applied to the Kumaraswamy distribution and its corresponding WTRV. The test results indicate that the WTRV offers a superior fit compared to the Kumaraswamy distribution, which demonstrates the utility of the proposed framework.
Submitted 26 May, 2025;
originally announced May 2025.
-
Inference of Half Logistic Geometric Distribution Based on Generalized Order Statistics
Authors:
Neetu Gupta,
S. K. Neogy,
Qazi J. Azhad,
Bhagwati Devi
Abstract:
As a unification of various models of ordered quantities, generalized order statistics, introduced in \cite{kamps1995concept}, offer a simple common framework. In the present study, expressions for the marginal and joint moment generating functions of the half logistic geometric distribution are presented within the generalized order statistics framework. We also consider the estimation problem for $θ$ and provide a Bayesian framework. Two widely used methods, Markov chain Monte Carlo and the Lindley approximation, are used to obtain the Bayes estimators. The results are derived under symmetric and asymmetric loss functions. An analysis of a special case of generalized order statistics, i.e., order statistics, is also presented. To give insight into the practical applicability of the proposed results, two real data sets, one from demography and the other from reliability, are analyzed.
Submitted 3 February, 2025;
originally announced February 2025.
-
A characterization of uniform distribution using varextropy with application in testing uniformity
Authors:
Santosh Kumar Chaudhary,
Nitin Gupta
Abstract:
In statistical analysis, quantifying uncertainties through measures such as entropy, extropy, varentropy, and varextropy is of fundamental importance for understanding distribution functions. This paper investigates several properties of varextropy and gives a new characterization of the uniform distribution using varextropy. Previously proposed estimators are used as test statistics. Building on this characterization, we give a uniformity test. The critical values and power of the test statistics are derived. The proposed test procedure is applied to a real-world dataset to assess its performance and effectiveness.
Submitted 12 January, 2025;
originally announced January 2025.
-
A goodness-of-fit test for testing exponentiality based on normalized dynamic survival extropy
Authors:
Gaurav Kandpal,
Nitin Gupta
Abstract:
The cumulative residual extropy (CRJ) is a measure of uncertainty that serves as an alternative to extropy. It replaces the probability density function with the survival function in the expression of extropy. This work introduces a new concept called normalized dynamic survival extropy (NDSE), a dynamic variation of CRJ. We observe that NDSE is equivalent to the CRJ of the random variable of interest $X_{[t]}$ in the age replacement model at a fixed time $t$. Additionally, we demonstrate that NDSE remains constant at all times only for the exponential distribution. We define two classes, INDSE and DNDSE, according to whether NDSE is increasing or decreasing. Next, we present a non-parametric test of exponentiality against the INDSE class. We derive the exact and asymptotic distributions of the test statistic $\widehat{Δ}^*$. Additionally, an asymptotic test is presented for right-censored data. Finally, we determine the critical values and power of our exact test through simulation. The simulation demonstrates that the suggested test is easy to compute and has significant statistical power, even with small sample sizes. We also conduct a power comparison with other tests, which shows better power for the proposed test against the alternatives considered in this paper. Some numerical real-life examples validating the test are also included.
Submitted 16 July, 2024;
originally announced July 2024.
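For reference, the two uncertainty measures named in the abstract above can be written out as follows. This is a sketch under stated assumptions: it uses the standard definition of extropy for a non-negative continuous random variable with density $f$ and survival function $\bar{F}$, and the symbol $\xi J$ for CRJ is a notational choice made here, not taken from the listing. The exponential case is worked as a check.

```latex
% Extropy and cumulative residual extropy (CRJ) of a non-negative X
J(X) = -\frac{1}{2}\int_{0}^{\infty} f^{2}(x)\,dx,
\qquad
\xi J(X) = -\frac{1}{2}\int_{0}^{\infty} \bar{F}^{2}(x)\,dx .

% Worked example: exponential distribution with rate \lambda,
% so that f(x) = \lambda e^{-\lambda x} and \bar{F}(x) = e^{-\lambda x}
J(X) = -\frac{1}{2}\int_{0}^{\infty} \lambda^{2} e^{-2\lambda x}\,dx = -\frac{\lambda}{4},
\qquad
\xi J(X) = -\frac{1}{2}\int_{0}^{\infty} e^{-2\lambda x}\,dx = -\frac{1}{4\lambda}.
```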
-
Stochastic comparison of series and parallel systems lifetime in Archimedean copula under random shock
Authors:
Sarikul Islam,
Nitin Gupta
Abstract:
In this paper, we study the stochastic ordering behavior of the lifetimes of series and parallel systems comprising dependent and heterogeneous components that experience random shocks and exhibit distinct dependency structures. To obtain the results, we establish certain conditions on the lifetimes of individual components, where the dependency among components is defined by Archimedean copulas, and on the impact of random shocks on the overall system lifetime. We consider components whose survival functions are either increasing log-concave or decreasing log-convex functions of the parameters involved. These conditions make it possible to compare the lifetimes of two systems using the usual stochastic order framework. Additionally, we provide examples and graphical representations to elucidate our theoretical findings.
Submitted 28 May, 2025; v1 submitted 9 June, 2024;
originally announced June 2024.
-
When Does Confidence-Based Cascade Deferral Suffice?
Authors:
Wittawat Jitkrittum,
Neha Gupta,
Aditya Krishna Menon,
Harikrishna Narasimhan,
Ankit Singh Rawat,
Sanjiv Kumar
Abstract:
Cascades are a classical strategy to enable inference cost to vary adaptively across samples, wherein a sequence of classifiers are invoked in turn. A deferral rule determines whether to invoke the next classifier in the sequence, or to terminate prediction. One simple deferral rule employs the confidence of the current classifier, e.g., based on the maximum predicted softmax probability. Despite being oblivious to the structure of the cascade -- e.g., not modelling the errors of downstream models -- such confidence-based deferral often works remarkably well in practice. In this paper, we seek to better understand the conditions under which confidence-based deferral may fail, and when alternate deferral strategies can perform better. We first present a theoretical characterisation of the optimal deferral rule, which precisely characterises settings under which confidence-based deferral may suffer. We then study post-hoc deferral mechanisms, and demonstrate they can significantly improve upon confidence-based deferral in settings where (i) downstream models are specialists that only work well on a subset of inputs, (ii) samples are subject to label noise, and (iii) there is distribution shift between the train and test set.
Submitted 23 January, 2024; v1 submitted 6 July, 2023;
originally announced July 2023.
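The confidence-based deferral rule described in the abstract above can be sketched for a two-model cascade as follows. This is a minimal illustration, not the authors' implementation: the function name `cascade_predict`, the callables `small_model` and `large_model`, and the default `threshold` are all hypothetical, and each model is assumed to return a list of class probabilities.

```python
from typing import Callable, List, Tuple

# Hypothetical type: a model maps an input to a list of class probabilities.
Model = Callable[[object], List[float]]


def cascade_predict(
    x: object,
    small_model: Model,
    large_model: Model,
    threshold: float = 0.8,
) -> Tuple[int, bool]:
    """Return (predicted class index, whether the input was deferred)."""
    probs = small_model(x)
    confidence = max(probs)  # maximum predicted softmax probability
    if confidence >= threshold:
        # The first classifier is confident enough: terminate here.
        return probs.index(confidence), False
    # Otherwise defer to the next classifier in the sequence.
    probs = large_model(x)
    return probs.index(max(probs)), True


# Toy stand-ins for real classifiers (assumptions for illustration only).
def toy_small(x: object) -> List[float]:
    return [0.9, 0.1] if x == "easy" else [0.55, 0.45]


def toy_large(x: object) -> List[float]:
    return [0.2, 0.8]
```

Note that the rule looks only at the current classifier's confidence and ignores how accurate the downstream model would be, which is the limitation the abstract sets out to analyze.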
-
Predicting blood pressure under circumstances of missing data: An analysis of missing data patterns and imputation methods using NHANES
Authors:
Harish Chauhan,
Nikunj Gupta,
Zoe Haskell-Craig
Abstract:
The World Health Organization defines cardio-vascular disease (CVD) as "a group of disorders of the heart and blood vessels," including coronary heart disease and stroke (WHO 21). CVD is affected by "intermediate risk factors" such as raised blood pressure, raised blood glucose, raised blood lipids, and obesity. These are predominantly influenced by lifestyle and behaviour, including physical inactivity, unhealthy diets, high intake of salt, and tobacco and alcohol use. However, genetics and social/environmental factors such as poverty, stress, and racism also play an important role. Researchers studying the behavioural and environmental factors associated with these "intermediate risk factors" need access to high quality and detailed information on diet and physical activity. However, missing data are a pervasive problem in clinical and public health research, affecting both randomized trials and observational studies. Reasons for missing data can vary substantially across studies because of loss to follow-up, missed study visits, refusal to answer survey questions, or an unrecorded measurement during an office visit. One method of handling missing values is to simply delete observations for which there is missingness (called Complete Case Analysis). This is rarely used, as deleting every data point that contains missing data (listwise deletion) results in a smaller number of samples and thus affects accuracy. Additional methods of handling missing data exist, such as summarizing each variable with its observed values (Available Case Analysis). Motivated by the pervasiveness of missing data in the NHANES dataset, we will conduct an analysis of imputation methods under different simulated patterns of missing data. We will then apply these imputation methods to create a complete dataset upon which we can use ordinary least squares to predict blood pressure from diet and physical activity.
Submitted 1 May, 2023;
originally announced May 2023.
-
Testing exponentiality using extropy of upper record values
Authors:
Santosh Kumar Chaudhary,
Nitin Gupta
Abstract:
We give a characterization of the exponential distribution using the extropy of the nth upper k-record value. We introduce test statistics based on the proposed characterization result that are used to test exponentiality. The critical value and power of the test have been calculated using Monte Carlo simulation. The test is applied to seven real-life data sets to verify its applicability in practice.
Submitted 6 January, 2023;
originally announced January 2023.
-
Testing of symmetry based on cumulative past and residual extropy of record values
Authors:
Santosh Kumar Chaudhary,
Nitin Gupta
Abstract:
In this paper, we test for symmetry in the distribution of data observed on a random variable. We propose test statistics using the cumulative past and residual extropy of record values, based on the characterization developed by Gupta and Chaudhary (2022) [5]. It is shown that the obtained estimator is consistent. Our proposed test has the advantage that we do not need to estimate the centre of symmetry. The empirical density, critical value and power of the proposed test statistics have been obtained. The test procedure has been implemented on six real-life data sets to verify its performance in identifying the symmetric nature. Simulations indicate that our test performs better than the competitor tests.
Submitted 14 September, 2022;
originally announced September 2022.
-
On General Weighted Extropy of Ranked Set Sampling
Authors:
Nitin Gupta,
Santosh Kumar Chaudhary
Abstract:
In the past six years, considerable attention has been given to the extropy measure proposed by Lad et al. (2015). The weighted extropy of ranked set sampling was studied and compared with simple random sampling by Qiu et al. (2022). The general weighted extropy and some results related to it are introduced in this paper. We provide the general weighted extropy of ranked set sampling. We also study characterization results, stochastic comparisons and monotone properties of the general weighted extropy.
Submitted 5 July, 2022;
originally announced July 2022.
-
Ensembling over Classifiers: a Bias-Variance Perspective
Authors:
Neha Gupta,
Jamie Smith,
Ben Adlam,
Zelda Mariet
Abstract:
Ensembles are a straightforward, remarkably effective method for improving the accuracy, calibration, and robustness of models on classification tasks; yet, the reasons that underlie their success remain an active area of research. We build upon the extension to the bias-variance decomposition by Pfau (2013) in order to gain crucial insights into the behavior of ensembles of classifiers. Introducing a dual reparameterization of the bias-variance tradeoff, we first derive generalized laws of total expectation and variance for nonsymmetric losses typical of classification tasks. Comparing conditional and bootstrap bias/variance estimates, we then show that conditional estimates necessarily incur an irreducible error. Next, we show that ensembling in dual space reduces the variance and leaves the bias unchanged, whereas standard ensembling can arbitrarily affect the bias. Empirically, standard ensembling reduces the bias, leading us to hypothesize that ensembles of classifiers may perform well in part because of this unexpected reduction. We conclude with an empirical analysis of recent deep learning methods that ensemble over hyperparameters, revealing that these techniques indeed favor bias reduction. This suggests that, contrary to classical wisdom, targeting bias reduction may be a promising direction for classifier ensembles.
Submitted 21 June, 2022;
originally announced June 2022.
-
Understanding the bias-variance tradeoff of Bregman divergences
Authors:
Ben Adlam,
Neha Gupta,
Zelda Mariet,
Jamie Smith
Abstract:
This paper builds upon the work of Pfau (2013), which generalized the bias variance tradeoff to any Bregman divergence loss function. Pfau (2013) showed that for Bregman divergences, the bias and variances are defined with respect to a central label, defined as the mean of the label variable, and a central prediction, of a more complex form. We show that, similarly to the label, the central prediction can be interpreted as the mean of a random variable, where the mean operates in a dual space defined by the loss function itself. Viewing the bias-variance tradeoff through operations taken in dual space, we subsequently derive several results of interest. In particular, (a) the variance terms satisfy a generalized law of total variance; (b) if a source of randomness cannot be controlled, its contribution to the bias and variance has a closed form; (c) there exist natural ensembling operations in the label and prediction spaces which reduce the variance and do not affect the bias.
Submitted 9 February, 2022; v1 submitted 8 February, 2022;
originally announced February 2022.
-
On Accelerating Distributed Convex Optimizations
Authors:
Kushal Chakrabarti,
Nirupam Gupta,
Nikhil Chopra
Abstract:
This paper studies a distributed multi-agent convex optimization problem. In this problem, the system comprises multiple agents, each with a set of local data points and an associated local cost function. The agents are connected to a server, and there is no inter-agent communication. The agents' goal is to learn a parameter vector that optimizes the aggregate of their local costs without revealing their local data points. In principle, the agents can solve this problem by collaborating with the server using the traditional distributed gradient-descent method. However, when the aggregate cost is ill-conditioned, the gradient-descent method (i) requires a large number of iterations to converge, and (ii) is highly unstable against process noise. We propose an iterative pre-conditioning technique to mitigate the deleterious effects of the cost function's conditioning on the convergence rate of distributed gradient-descent. Unlike the conventional pre-conditioning techniques, the pre-conditioner matrix in our proposed technique updates iteratively to facilitate implementation on the distributed network. In the distributed setting, we prove that the proposed algorithm converges linearly with a better rate of convergence than the traditional and adaptive gradient-descent methods. Additionally, for the special case when the minimizer of the aggregate cost is unique, our algorithm converges superlinearly. We demonstrate our algorithm's superior performance compared to prominent distributed algorithms for solving real logistic regression problems and emulating neural network training via a noisy quadratic model, thereby signifying the proposed algorithm's efficiency for distributively solving non-convex optimization problems. Moreover, we empirically show that the proposed algorithm results in faster training without compromising the generalization performance.
Submitted 19 August, 2021;
originally announced August 2021.
-
Robustness of Iteratively Pre-Conditioned Gradient-Descent Method: The Case of Distributed Linear Regression Problem
Authors:
Kushal Chakrabarti,
Nirupam Gupta,
Nikhil Chopra
Abstract:
This paper considers the problem of multi-agent distributed linear regression in the presence of system noises. In this problem, the system comprises multiple agents wherein each agent locally observes a set of data points, and the agents' goal is to compute a linear model that best fits the collective data points observed by all the agents. We consider a server-based distributed architecture where the agents interact with a common server to solve the problem; however, the server cannot access the agents' data points. We consider a practical scenario wherein the system either has observation noise, i.e., the data points observed by the agents are corrupted, or has process noise, i.e., the computations performed by the server and the agents are corrupted. In noise-free systems, the recently proposed distributed linear regression algorithm, named the Iteratively Pre-conditioned Gradient-descent (IPG) method, has been claimed to converge faster than related methods. In this paper, we study the robustness of the IPG method, against both the observation noise and the process noise. We empirically show that the robustness of the IPG method compares favorably to the state-of-the-art algorithms.
Submitted 26 January, 2021;
originally announced January 2021.
-
Accelerating Distributed SGD for Linear Regression using Iterative Pre-Conditioning
Authors:
Kushal Chakrabarti,
Nirupam Gupta,
Nikhil Chopra
Abstract:
This paper considers the multi-agent distributed linear least-squares problem. The system comprises multiple agents, each agent with a locally observed set of data points, and a common server with whom the agents can interact. The agents' goal is to compute a linear model that best fits the collective data points observed by all the agents. In the server-based distributed settings, the server cannot access the data points held by the agents. The recently proposed Iteratively Pre-conditioned Gradient-descent (IPG) method has been shown to converge faster than other existing distributed algorithms that solve this problem. In the IPG algorithm, the server and the agents perform numerous iterative computations. Each of these iterations relies on the entire batch of data points observed by the agents for updating the current estimate of the solution. Here, we extend the idea of iterative pre-conditioning to the stochastic settings, where the server updates the estimate and the iterative pre-conditioning matrix based on a single randomly selected data point at every iteration. We show that our proposed Iteratively Pre-conditioned Stochastic Gradient-descent (IPSG) method converges linearly in expectation to a proximity of the solution. Importantly, we empirically show that the proposed IPSG method's convergence rate compares favorably to prominent stochastic algorithms for solving the linear least-squares problem in server-based networks.
Submitted 28 November, 2020; v1 submitted 15 November, 2020;
originally announced November 2020.
-
Universal guarantees for decision tree induction via a higher-order splitting criterion
Authors:
Guy Blanc,
Neha Gupta,
Jane Lange,
Li-Yang Tan
Abstract:
We propose a simple extension of top-down decision tree learning heuristics such as ID3, C4.5, and CART. Our algorithm achieves provable guarantees for all target functions $f: \{-1,1\}^n \to \{-1,1\}$ with respect to the uniform distribution, circumventing impossibility results showing that existing heuristics fare poorly even for simple target functions. The crux of our extension is a new splitting criterion that takes into account the correlations between $f$ and small subsets of its attributes. The splitting criteria of existing heuristics (e.g. Gini impurity and information gain), in contrast, are based solely on the correlations between $f$ and its individual attributes.
Our algorithm satisfies the following guarantee: for all target functions $f : \{-1,1\}^n \to \{-1,1\}$, sizes $s\in \mathbb{N}$, and error parameters $ε$, it constructs a decision tree of size $s^{\tilde{O}((\log s)^2/ε^2)}$ that achieves error $\le O(\mathsf{opt}_s) + ε$, where $\mathsf{opt}_s$ denotes the error of the optimal size $s$ decision tree. A key technical notion that drives our analysis is the noise stability of $f$, a well-studied smoothness measure.
Submitted 16 October, 2020;
originally announced October 2020.
-
Active Local Learning
Authors:
Arturs Backurs,
Avrim Blum,
Neha Gupta
Abstract:
In this work we consider active local learning: given a query point $x$, and active access to an unlabeled training set $S$, output the prediction $h(x)$ of a near-optimal $h \in H$ using significantly fewer labels than would be needed to actually learn $h$ fully. In particular, the number of label queries should be independent of the complexity of $H$, and the function $h$ should be well-defined, independent of $x$. This immediately also implies an algorithm for distance estimation: estimating the value $opt(H)$ from many fewer labels than needed to actually learn a near-optimal $h \in H$, by running local learning on a few random query points and computing the average error.
For the hypothesis class consisting of functions supported on the interval $[0,1]$ with Lipschitz constant bounded by $L$, we present an algorithm that makes $O(({1 / ε^6}) \log(1/ε))$ label queries from an unlabeled pool of $O(({L / ε^4})\log(1/ε))$ samples. It estimates the distance to the best hypothesis in the class to an additive error of $ε$ for an arbitrary underlying distribution. We further generalize our algorithm to more than one dimension. We emphasize that the number of labels used is independent of the complexity of the hypothesis class, which depends on $L$. Furthermore, we give an algorithm to locally estimate the values of a near-optimal function at a few query points of interest with a number of labels independent of $L$.
We also consider the related problem of approximating the minimum error that can be achieved by the Nadaraya-Watson estimator under a linear diagonal transformation with eigenvalues coming from a small range. For a $d$-dimensional pointset of size $N$, our algorithm achieves an additive approximation of $ε$, makes $\tilde{O}({d}/{ε^2})$ queries and runs in $\tilde{O}({d^2}/{ε^{d+4}}+{dN}/{ε^2})$ time.
Submitted 3 September, 2020; v1 submitted 31 August, 2020;
originally announced August 2020.
-
Byzantine Fault-Tolerant Distributed Machine Learning Using Stochastic Gradient Descent (SGD) and Norm-Based Comparative Gradient Elimination (CGE)
Authors:
Nirupam Gupta,
Shuo Liu,
Nitin H. Vaidya
Abstract:
This paper considers the Byzantine fault-tolerance problem in the distributed stochastic gradient descent (D-SGD) method, a popular algorithm for distributed multi-agent machine learning. In this problem, each agent samples data points independently from a certain data-generating distribution. In the fault-free case, the D-SGD method allows all the agents to learn a mathematical model best fitting the data collectively sampled by all agents. We consider the case when a fraction of agents may be Byzantine faulty. Such faulty agents may not follow a prescribed algorithm correctly, and may render the traditional D-SGD method ineffective by sharing arbitrary incorrect stochastic gradients. We propose a norm-based gradient-filter, named comparative gradient elimination (CGE), that robustifies the D-SGD method against Byzantine agents. We show that the CGE gradient-filter guarantees fault-tolerance against a bounded fraction of Byzantine agents under standard stochastic assumptions, and is computationally simpler than many existing gradient-filters such as multi-KRUM, geometric median-of-means, and the spectral filters. We empirically show, by simulating distributed learning on neural networks, that the fault-tolerance of CGE is comparable to that of existing gradient-filters. We also empirically show that exponential averaging of stochastic gradients improves the fault-tolerance of a generic gradient-filter.
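To make the idea of a norm-based gradient filter concrete, here is a minimal sketch. It assumes the filter sorts the received gradients by Euclidean norm and discards the `num_faulty` largest before summing; the abstract only says the filter is norm-based, so this elimination rule, the function name `cge_aggregate`, and the toy data are illustrative assumptions rather than the authors' exact procedure.

```python
import numpy as np

def cge_aggregate(gradients, num_faulty):
    """Sum the gradients left after dropping the num_faulty largest-norm ones.

    Hypothetical helper: the elimination rule is an assumption for illustration.
    """
    grads = np.asarray(gradients, dtype=float)
    norms = np.linalg.norm(grads, axis=1)
    # Indices of the (n - num_faulty) gradients with the smallest norms.
    keep = np.argsort(norms)[: len(grads) - num_faulty]
    return grads[keep].sum(axis=0)

honest = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1]]
byzantine = [[100.0, -100.0]]  # an arbitrarily large faulty gradient
update = cge_aggregate(honest + byzantine, num_faulty=1)
```

The cost is dominated by computing the norms and one sort, which fits the abstract's claim that the filter is computationally simple. A faulty gradient with a small norm would pass through such a filter, so any guarantee has to rely on the assumptions stated in the paper.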
Submitted 17 April, 2021; v1 submitted 11 August, 2020;
originally announced August 2020.
-
Iterative Pre-Conditioning for Expediting the Gradient-Descent Method: The Distributed Linear Least-Squares Problem
Authors:
Kushal Chakrabarti,
Nirupam Gupta,
Nikhil Chopra
Abstract:
This paper considers the multi-agent linear least-squares problem in a server-agent network. In this problem, the system comprises multiple agents, each having a set of local data points, that are connected to a server. The goal for the agents is to compute a linear mathematical model that optimally fits the collective data points held by all the agents, without sharing their individual local data points. This goal can be achieved, in principle, using the server-agent variant of the traditional iterative gradient-descent method. The gradient-descent method converges linearly to a solution, and its rate of convergence is lower bounded by the conditioning of the agents' collective data points. If the data points are ill-conditioned, the gradient-descent method may require a large number of iterations to converge.
We propose an iterative pre-conditioning technique that mitigates the deleterious effect of the conditioning of data points on the rate of convergence of the gradient-descent method. We rigorously show that the resulting pre-conditioned gradient-descent method, with the proposed iterative pre-conditioning, achieves superlinear convergence when the least-squares problem has a unique solution. In general, the convergence is linear, with an improved rate of convergence in comparison to the traditional gradient-descent method and the state-of-the-art accelerated gradient-descent methods. We further illustrate the improved rate of convergence of our proposed algorithm through experiments on different real-world least-squares problems in both noise-free and noisy computation environments.
Submitted 6 August, 2021; v1 submitted 6 August, 2020;
originally announced August 2020.
-
Unsupervised Learning of KB Queries in Task-Oriented Dialogs
Authors:
Dinesh Raghu,
Nikhil Gupta,
Mausam
Abstract:
Task-oriented dialog (TOD) systems often need to formulate knowledge base (KB) queries corresponding to the user intent and use the query results to generate system responses. Existing approaches require dialog datasets to explicitly annotate these KB queries -- these annotations can be time-consuming and expensive. In response, we define the novel problems of predicting the KB query and training the dialog agent, without explicit KB query annotation. For query prediction, we propose a reinforcement learning (RL) baseline, which rewards the generation of those queries whose KB results cover the entities mentioned in subsequent dialog. Further analysis reveals that correlation among query attributes in the KB can significantly confuse memory augmented policy optimization (MAPO), an existing state-of-the-art RL agent. To address this, we improve the MAPO baseline with simple but important modifications suited to our task. To train the full TOD system for our setting, we propose a pipelined approach: it independently predicts when to make a KB query (query position predictor), then predicts a KB query at the predicted position (query predictor), and uses the results of the predicted query in subsequent dialog (next response predictor). Overall, our work proposes first solutions to our novel problem, and our analysis highlights the research challenges in training TOD systems without query annotation.
Submitted 3 June, 2021; v1 submitted 30 April, 2020;
originally announced May 2020.
-
Iterative Pre-Conditioning to Expedite the Gradient-Descent Method
Authors:
Kushal Chakrabarti,
Nirupam Gupta,
Nikhil Chopra
Abstract:
This paper considers the problem of multi-agent distributed optimization. In this problem, there are multiple agents in the system, and each agent only knows its local cost function. The objective for the agents is to collectively compute a common minimum of the aggregate of all their local cost functions. In principle, this problem is solvable using a distributed variant of the traditional gradient-descent method, which is an iterative method. However, the speed of convergence of the traditional gradient-descent method is highly influenced by the conditioning of the optimization problem being solved. Specifically, the method requires a large number of iterations to converge to a solution if the optimization problem is ill-conditioned.
In this paper, we propose an iterative pre-conditioning approach that can significantly attenuate the influence of the problem's conditioning on the convergence-speed of the gradient-descent method. The proposed pre-conditioning approach can be easily implemented in distributed systems and has minimal computation and communication overhead. For now, we only consider a specific distributed optimization problem wherein the individual local cost functions of the agents are quadratic. Besides the theoretical guarantees, the improved convergence speed of our approach is demonstrated through experiments on a real data-set.
Submitted 29 March, 2020; v1 submitted 13 March, 2020;
originally announced March 2020.
-
Extreme Regression for Dynamic Search Advertising
Authors:
Yashoteja Prabhu,
Aditya Kusupati,
Nilesh Gupta,
Manik Varma
Abstract:
This paper introduces a new learning paradigm called eXtreme Regression (XR) whose objective is to accurately predict the numerical degrees of relevance of an extremely large number of labels to a data point. XR can provide elegant solutions to many large-scale ranking and recommendation applications including Dynamic Search Advertising (DSA). XR can learn more accurate models than the recently popular extreme classifiers, which incorrectly assume strictly binary-valued label relevances. Traditional regression metrics which sum the errors over all the labels are unsuitable for XR problems since they could give extremely loose bounds for the label ranking quality. Also, the existing regression algorithms do not scale efficiently to millions of labels. This paper addresses these limitations through: (1) new evaluation metrics for XR which sum only the k largest regression errors; (2) a new algorithm called XReg which decomposes the XR task into a hierarchy of much smaller regression problems, thus leading to highly efficient training and prediction. This paper also introduces (3) a new labelwise prediction algorithm in XReg that is useful for DSA and other recommendation tasks. Experiments on benchmark datasets demonstrated that XReg can outperform the state-of-the-art extreme classifiers as well as large-scale regressors and rankers, with up to a 50% reduction in the new XR error metric, and improvements of up to 2% and 2.4% in the propensity-scored precision metric used in extreme classification and the click-through rate metric used in DSA, respectively. Deployment of XReg on DSA in Bing resulted in a relative gain of 27% in query coverage. XReg's source code can be downloaded from http://manikvarma.org/code/XReg/download.html.
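The first contribution listed above, a metric that sums only the k largest regression errors, can be sketched as follows. The use of absolute error and the name `top_k_error` are assumptions for illustration; the abstract does not say which per-label error or normalization the paper's metrics use.

```python
import numpy as np

def top_k_error(y_true, y_pred, k):
    """Sum of the k largest per-label errors (hypothetical helper).

    Absolute error is assumed as the per-label error.
    """
    errors = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    # Sort descending and keep only the k worst labels.
    return float(np.sort(errors)[::-1][:k].sum())

relevance = [1.0, 0.0, 0.5, 0.0]   # true relevance of four labels
predicted = [0.5, 0.0, 0.5, 0.25]  # model scores for the same labels
worst_two = top_k_error(relevance, predicted, k=2)
```

Restricting the sum to the k worst labels keeps the many near-zero errors on irrelevant labels from dominating the score, which is the concern the abstract raises about metrics that sum over all labels.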
Submitted 20 January, 2020; v1 submitted 15 January, 2020;
originally announced January 2020.
-
Enforcing Linearity in DNN succours Robustness and Adversarial Image Generation
Authors:
Anindya Sarkar,
Nikhil Kumar Gupta,
Raghu Iyengar
Abstract:
Recent studies on the adversarial vulnerability of neural networks have shown that models trained with the objective of minimizing an upper bound on the worst-case loss over all possible adversarial perturbations improve robustness against adversarial attacks. Besides exploiting the adversarial training framework, we show that enforcing a Deep Neural Network (DNN) to be linear in the transformed input and feature space improves robustness significantly. We also demonstrate that augmenting the objective function with a Local Lipschitz regularizer boosts the robustness of the model further. Our method outperforms most sophisticated adversarial training methods and achieves state-of-the-art adversarial accuracy on the MNIST, CIFAR10 and SVHN datasets. In this paper, we also propose a novel adversarial image generation method by leveraging Inverse Representation Learning and the linearity aspect of an adversarially trained deep neural network classifier.
Submitted 21 October, 2019; v1 submitted 17 October, 2019;
originally announced October 2019.
-
Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process
Authors:
Guy Blanc,
Neha Gupta,
Gregory Valiant,
Paul Valiant
Abstract:
We consider networks, trained via stochastic gradient descent to minimize $\ell_2$ loss, with the training labels perturbed by independent noise at each iteration. We characterize the behavior of the training dynamics near any parameter vector that achieves zero training error, in terms of an implicit regularization term corresponding to the sum over the data points of the squared $\ell_2$ norm of the gradient of the model with respect to the parameter vector, evaluated at each data point. This holds for networks of any connectivity, width, depth, and choice of activation function. We interpret this implicit regularization term for three simple settings: matrix sensing, two-layer ReLU networks trained on one-dimensional data, and two-layer networks with sigmoid activations trained on a single datapoint. For these settings, we show why this new and general implicit regularization effect drives the networks towards "simple" models.
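Written out, the regularization term described in words above would take roughly the following form. The symbols are assumptions introduced here for readability: $h(x;\theta)$ for the network output, $\theta$ for the parameter vector, and $x_1,\dots,x_n$ for the training points. Any constants or scaling factors used in the paper are omitted.

```latex
R(\theta) \;=\; \sum_{i=1}^{n} \bigl\| \nabla_{\theta}\, h(x_i;\theta) \bigr\|_2^2
```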
Submitted 22 July, 2020; v1 submitted 19 April, 2019;
originally announced April 2019.
-
Some ordering properties of highest and lowest order statistics with exponentiated Gumble type-II distributed components
Authors:
Surojit Biswas,
Nitin Gupta
Abstract:
In this paper, we study stochastic comparisons of the highest and lowest order statistics of the exponentiated Gumbel type-II distribution with three parameters. We compare both statistics using three different stochastic orderings. First, we consider a system with different scale and outer shape parameters and study the usual stochastic ordering of the lowest and highest order statistics in the sense of multivariate chain majorization. In addition, we construct two examples to support our results. Second, by using the vector majorization technique, we study the usual stochastic ordering, the reversed failure rate ordering and the likelihood ratio ordering with respect to different outer shape parameters. Next, by varying the inner shape parameter, we discuss the usual stochastic order of the lowest order statistics, and we show by an example that the highest order statistics are not comparable in the usual stochastic ordering.
Submitted 18 April, 2019;
originally announced April 2019.
-
Byzantine Fault Tolerant Distributed Linear Regression
Authors:
Nirupam Gupta,
Nitin H. Vaidya
Abstract:
This paper considers the problem of Byzantine fault tolerance in distributed linear regression in a multi-agent system. However, the proposed algorithms are given for a more general class of distributed optimization problems, of which distributed linear regression is a special case. The system comprises a server and multiple agents, where each agent holds a certain number of data points and responses that satisfy a (possibly noisy) linear relationship. The objective of the server is to determine this relationship, given that some of the agents in the system (up to a known number) are Byzantine faulty (a.k.a. actively adversarial). We show that the server can achieve this objective, in a deterministic manner, by robustifying the original distributed gradient descent method using norm-based filters, namely 'norm filtering' and 'norm-cap filtering', incurring an additional log-linear computation cost in each iteration. The proposed algorithms improve upon the existing methods on three levels: i) no assumptions are required on the probability distribution of data points, ii) the system can be partially asynchronous, and iii) the computational overhead (in order to handle Byzantine faulty agents) is log-linear in the number of agents and linear in the dimension of data points. The proposed algorithms differ from each other in the assumptions made for their correctness, and the gradient filter they use.
Submitted 4 April, 2019; v1 submitted 20 March, 2019;
originally announced March 2019.
-
Disentangling Language and Knowledge in Task-Oriented Dialogs
Authors:
Dinesh Raghu,
Nikhil Gupta,
Mausam
Abstract:
The Knowledge Base (KB) used for real-world applications, such as booking a movie or restaurant reservation, keeps changing over time. End-to-end neural networks trained for these task-oriented dialogs are expected to be immune to any changes in the KB. However, existing approaches break down when asked to handle such changes. We propose an encoder-decoder architecture (BoSsNet) with a novel Bag-of-Sequences (BoSs) memory, which facilitates the disentangled learning of the response's language model and its knowledge incorporation. Consequently, the KB can be modified with new knowledge without a drop in interpretability. We find that BoSsNet outperforms state-of-the-art models, with considerable improvements (> 10%) on bAbI OOV test sets and other human-human datasets. We also systematically modify existing datasets to measure disentanglement and show BoSsNet to be robust to KB modifications.
Submitted 5 April, 2019; v1 submitted 3 May, 2018;
originally announced May 2018.
-
A Grid Based Adversarial Clustering Algorithm
Authors:
Wutao Wei,
Nikhil Gupta,
Bowei Xi
Abstract:
Nowadays more and more data are gathered for detecting and preventing cyber attacks. In cyber security applications, data analytics techniques have to deal with active adversaries that try to deceive the data analytics models and avoid being detected. The existence of such adversarial behavior motivates the development of robust and resilient adversarial learning techniques for various tasks. Most of the previous work focused on adversarial classification techniques, which assumed the existence of a reasonably large amount of carefully labeled data instances. However, in practice, labeling the data instances often requires costly and time-consuming human expertise and becomes a significant bottleneck. Meanwhile, a large number of unlabeled instances can also be used to understand the adversaries' behavior. To address the above-mentioned challenges, in this paper, we develop a novel grid-based adversarial clustering algorithm. Our adversarial clustering algorithm is able to identify the core normal regions, and to draw defensive walls around the centers of the normal objects utilizing game-theoretic ideas. Our algorithm also identifies sub-clusters of attack objects, the overlapping areas within clusters, and outliers which may be potential anomalies.
Submitted 21 November, 2024; v1 submitted 13 April, 2018;
originally announced April 2018.
-
Intrinsic Geometric Analysis of the Network Reliability and Voltage Stability
Authors:
N. Gupta,
B. N. Tiwari,
S. Bellucci
Abstract:
This paper presents the intrinsic geometric model for the solution of power system planning and its operation. This problem is large-scale and nonlinear, in general. Thus, we have developed the intrinsic geometric model for the network reliability and voltage stability, and examined it for the IEEE 5 bus system. The robustness of the proposed model is illustrated by introducing variations of the network parameters. Exact analytical results show the accuracy as well as the efficiency of the proposed solution technique.
Submitted 12 November, 2010;
originally announced November 2010.
-
Geometric Design and Stability of Power Networks
Authors:
Neeraj Gupta,
Bhupendra Nath Tiwari,
Stefano Bellucci
Abstract:
From the perspective of network theory, the present work illustrates how the parametric intrinsic geometric description exhibits an exact set of pair correction functions and global correlation volume with and without the inclusion of the imaginary power flow. The Gaussian fluctuations about the equilibrium basis accomplish well-defined, non-degenerate, curved regular intrinsic Riemannian surfaces for the purely real and the purely imaginary power flows and their linear combinations. An explicit computation demonstrates that the underlying real and imaginary power correlations involve ordinary summations of the power factors, with and without their joint effects. The novel aspect of the intrinsic geometry constitutes a stable design for power systems.
Submitted 12 November, 2010;
originally announced November 2010.