-
Sequential Kernelized Independence Testing
Authors:
Aleksandr Podkopaev,
Patrick Blöbaum,
Shiva Prasad Kasiviswanathan,
Aaditya Ramdas
Abstract:
Independence testing is a classical statistical problem that has been extensively studied in the batch setting when one fixes the sample size before collecting data. However, practitioners often prefer procedures that adapt to the complexity of a problem at hand instead of setting sample size in advance. Ideally, such procedures should (a) stop earlier on easy tasks (and later on harder tasks), he…
▽ More
Independence testing is a classical statistical problem that has been extensively studied in the batch setting when one fixes the sample size before collecting data. However, practitioners often prefer procedures that adapt to the complexity of a problem at hand instead of setting sample size in advance. Ideally, such procedures should (a) stop earlier on easy tasks (and later on harder tasks), hence making better use of available resources, and (b) continuously monitor the data and efficiently incorporate statistical evidence after collecting new data, while controlling the false alarm rate. Classical batch tests are not tailored for streaming data: valid inference after data peeking requires correcting for multiple testing which results in low power. Following the principle of testing by betting, we design sequential kernelized independence tests that overcome such shortcomings. We exemplify our broad framework using bets inspired by kernelized dependence measures, e.g., the Hilbert-Schmidt independence criterion. Our test is also valid under non-i.i.d., time-varying settings. We demonstrate the power of our approaches on both simulated and real data.
△ Less
Submitted 20 May, 2025; v1 submitted 14 December, 2022;
originally announced December 2022.
-
Restricted Eigenvalue from Stable Rank with Applications to Sparse Linear Regression
Authors:
Shiva Prasad Kasiviswanathan,
Mark Rudelson
Abstract:
High-dimensional settings, where the data dimension ($d$) far exceeds the number of observations ($n$), are common in many statistical and machine learning applications. Methods based on $\ell_1$-relaxation, such as Lasso, are very popular for sparse recovery in these settings. Restricted Eigenvalue (RE) condition is among the weakest, and hence the most general, condition in literature imposed on…
▽ More
High-dimensional settings, where the data dimension ($d$) far exceeds the number of observations ($n$), are common in many statistical and machine learning applications. Methods based on $\ell_1$-relaxation, such as Lasso, are very popular for sparse recovery in these settings. Restricted Eigenvalue (RE) condition is among the weakest, and hence the most general, condition in literature imposed on the Gram matrix that guarantees nice statistical properties for the Lasso estimator. It is natural to ask: what families of matrices satisfy the RE condition? Following a line of work in this area, we construct a new broad ensemble of dependent random design matrices that have an explicit RE bound. Our construction starts with a fixed (deterministic) matrix $X \in \mathbb{R}^{n \times d}$ satisfying a simple stable rank condition, and we show that a matrix drawn from the distribution $X Φ^\top Φ$, where $Φ\in \mathbb{R}^{m \times d}$ is a subgaussian random matrix, with high probability, satisfies the RE condition. This construction allows incorporating a fixed matrix that has an easily {\em verifiable} condition into the design process, and allows for generation of {\em compressed} design matrices that have a lower storage requirement than a standard design matrix. We give two applications of this construction to sparse linear regression problems, including one to a compressed sparse regression setting where the regression algorithm only has access to a compressed representation of a fixed design matrix $X$.
△ Less
Submitted 17 February, 2018; v1 submitted 25 July, 2017;
originally announced July 2017.
-
The Power of Linear Reconstruction Attacks
Authors:
Shiva Prasad Kasiviswanathan,
Mark Rudelson,
Adam Smith
Abstract:
We consider the power of linear reconstruction attacks in statistical data privacy, showing that they can be applied to a much wider range of settings than previously understood. Linear attacks have been studied before (Dinur and Nissim PODS'03, Dwork, McSherry and Talwar STOC'07, Kasiviswanathan, Rudelson, Smith and Ullman STOC'10, De TCC'12, Muthukrishnan and Nikolov STOC'12) but have so far bee…
▽ More
We consider the power of linear reconstruction attacks in statistical data privacy, showing that they can be applied to a much wider range of settings than previously understood. Linear attacks have been studied before (Dinur and Nissim PODS'03, Dwork, McSherry and Talwar STOC'07, Kasiviswanathan, Rudelson, Smith and Ullman STOC'10, De TCC'12, Muthukrishnan and Nikolov STOC'12) but have so far been applied only in settings with releases that are obviously linear.
Consider a database curator who manages a database of sensitive information but wants to release statistics about how a sensitive attribute (say, disease) in the database relates to some nonsensitive attributes (e.g., postal code, age, gender, etc). We show one can mount linear reconstruction attacks based on any release that gives: a) the fraction of records that satisfy a given non-degenerate boolean function. Such releases include contingency tables (previously studied by Kasiviswanathan et al., STOC'10) as well as more complex outputs like the error rate of classifiers such as decision trees; b) any one of a large class of M-estimators (that is, the output of empirical risk minimization algorithms), including the standard estimators for linear and logistic regression.
We make two contributions: first, we show how these types of releases can be transformed into a linear format, making them amenable to existing polynomial-time reconstruction algorithms. This is already perhaps surprising, since many of the above releases (like M-estimators) are obtained by solving highly nonlinear formulations. Second, we show how to analyze the resulting attacks under various distributional assumptions on the data. Specifically, we consider a setting in which the same statistic (either a) or b) above) is released about how the sensitive attribute relates to all subsets of size k (out of a total of d) nonsensitive boolean attributes.
△ Less
Submitted 8 October, 2012;
originally announced October 2012.
-
The rigidity transition in random graphs
Authors:
Shiva Prasad Kasiviswanathan,
Cristopher Moore,
Louis Theran
Abstract:
As we add rigid bars between points in the plane, at what point is there a giant (linear-sized) rigid component, which can be rotated and translated, but which has no internal flexibility? If the points are generic, this depends only on the combinatorics of the graph formed by the bars. We show that if this graph is an Erdos-Renyi random graph G(n,c/n), then there exists a sharp threshold for a gi…
▽ More
As we add rigid bars between points in the plane, at what point is there a giant (linear-sized) rigid component, which can be rotated and translated, but which has no internal flexibility? If the points are generic, this depends only on the combinatorics of the graph formed by the bars. We show that if this graph is an Erdos-Renyi random graph G(n,c/n), then there exists a sharp threshold for a giant rigid component to emerge. For c < c_2, w.h.p. all rigid components span one, two, or three vertices, and when c > c_2, w.h.p. there is a giant rigid component. The constant c_2 \approx 3.588 is the threshold for 2-orientability, discovered independently by Fernholz and Ramachandran and Cain, Sanders, and Wormald in SODA'07. We also give quantitative bounds on the size of the giant rigid component when it emerges, proving that it spans a (1-o(1))-fraction of the vertices in the (3+2)-core. Informally, the (3+2)-core is maximal induced subgraph obtained by starting from the 3-core and then inductively adding vertices with 2 neighbors in the graph obtained so far.
△ Less
Submitted 26 November, 2010; v1 submitted 18 October, 2010;
originally announced October 2010.