-
Tunable correlation retention: A statistical method for generating synthetic data
Authors:
Nicklas Jävergård,
Rainey Lyons,
Adrian Muntean,
Jonas Forsman
Abstract:
We propose a method to generate statistically representative synthetic data from a given dataset. The main goal of our method is for the generated dataset to mimic the inter-feature correlations present in the original data, while also offering a tunable parameter to influence the privacy level. In particular, our method constructs a statistical map by using the empirical conditional distributions between the features of the original dataset. Part of the tunability is achieved by limiting the depth of the conditional distributions that are used. We describe in detail the algorithms used both to construct the statistical map and to generate synthetic observations from it. This approach is tested in three different ways: with a hand-calculated example; a manufactured dataset; and a real-world energy-related dataset of consumption/production of households in Madeira Island. We evaluate the method by comparing the datasets using the Pearson correlation matrix with different levels of resolution and depths of correlation. These two considerations are viewed as tunable parameters influencing the resulting dataset's fidelity and privacy. The proposed methodology is general in the sense that it does not rely on the particular test dataset used. We expect it to be applicable in a much broader context than indicated here.
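To make the idea of sampling from empirical conditional distributions concrete, here is a minimal sketch. It is not the authors' algorithm: it assumes the features are already discretized, fixes the conditioning depth at 1 (each feature depends only on the one before it), and the names `build_map` and `generate` are hypothetical.

```python
import random
from collections import defaultdict

def build_map(rows):
    """Depth-1 map (an assumption of this sketch, not the paper's construction).

    Returns the observed values of feature 0, plus one table per adjacent
    feature pair (j-1, j) listing the values of feature j seen alongside
    each value of feature j-1.
    """
    first = [r[0] for r in rows]
    cond = [defaultdict(list) for _ in range(len(rows[0]) - 1)]
    for r in rows:
        for j in range(1, len(r)):
            cond[j - 1][r[j - 1]].append(r[j])
    return first, cond

def generate(first, cond, rng):
    """Draw one synthetic row by chaining the empirical conditionals."""
    row = [rng.choice(first)]  # empirical marginal of feature 0
    for table in cond:
        # empirical distribution of the next feature given the previous value
        row.append(rng.choice(table[row[-1]]))
    return row

# Toy discretized dataset with three features (made-up values).
data = [(0, 1, 1), (0, 1, 2), (1, 2, 2), (1, 2, 3), (0, 0, 1)]
first, cond = build_map(data)
rng = random.Random(0)
synthetic = [generate(first, cond, rng) for _ in range(10)]
```

In this sketch every adjacent pair of values in a synthetic row was observed together in the original data, whereas full rows may be new combinations. A deeper conditioning chain would presumably retain more of the original structure, at a cost to privacy.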
Submitted 24 June, 2025; v1 submitted 3 March, 2024;
originally announced March 2024.
-
Lower bounds for trace reconstruction
Authors:
Nina Holden,
Russell Lyons
Abstract:
In the trace reconstruction problem, an unknown bit string ${\bf x}\in\{0,1 \}^n$ is sent through a deletion channel where each bit is deleted independently with some probability $q\in(0,1)$, yielding a contracted string $\widetilde{\bf x}$. How many i.i.d.\ samples of $\widetilde{\bf x}$ are needed to reconstruct $\bf x$ with high probability? We prove that there exist ${\bf x},{\bf y} \in\{0,1 \}^n$ such that at least $c\, n^{5/4}/\sqrt{\log n}$ traces are required to distinguish between ${\bf x}$ and ${\bf y}$ for some absolute constant $c$, improving the previous lower bound of $c\,n$. Furthermore, our result improves the previously known lower bound for reconstruction of random strings from $c \log^2 n$ to $c \log^{9/4}n/\sqrt{\log \log n} $.
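The deletion channel described above is easy to simulate. The following minimal sketch only generates traces and does not attempt reconstruction. The function names `deletion_channel` and `sample_traces` and the example string are illustrative assumptions, not taken from the paper.

```python
import random

def deletion_channel(x, q, rng):
    """Return one trace of x: each bit is deleted independently with probability q."""
    # rng.random() is uniform on [0, 1), so a bit is kept with probability 1 - q.
    return [bit for bit in x if rng.random() >= q]

def sample_traces(x, q, num_traces, seed=0):
    """Draw num_traces i.i.d. traces of x through the deletion channel."""
    rng = random.Random(seed)
    return [deletion_channel(x, q, rng) for _ in range(num_traces)]

# Example: an 8-bit string and deletion probability 0.3 (arbitrary choices).
x = [1, 0, 1, 1, 0, 0, 1, 0]
traces = sample_traces(x, q=0.3, num_traces=5)
```

Each trace is a subsequence of `x` with no record of which positions were deleted, which is what makes distinguishing two nearby strings from their traces difficult.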
Submitted 7 June, 2019; v1 submitted 4 August, 2018;
originally announced August 2018.
-
Sharp Bounds on Random Walk Eigenvalues via Spectral Embedding
Authors:
Russell Lyons,
Shayan Oveis Gharan
Abstract:
Spectral embedding of graphs uses the top k non-trivial eigenvectors of the random walk matrix to embed the graph into R^k. The primary use of this embedding has been for practical spectral clustering algorithms [SM00,NJW02]. Recently, spectral embedding was studied from a theoretical perspective to prove higher order variants of Cheeger's inequality [LOT12,LRTV12].
We use spectral embedding to provide a unifying framework for bounding all the eigenvalues of graphs. For example, we show that for any finite graph with n vertices and all k >= 2, the k-th largest eigenvalue is at most 1-Omega(k^3/n^3), which extends the only other such result known, which is for k=2 only and is due to [LO81]. This upper bound improves to 1-Omega(k^2/n^2) if the graph is regular. We generalize these results, and we provide sharp bounds on the spectral measure of various classes of graphs, including vertex-transitive graphs and infinite graphs, in terms of specific graph parameters like the volume growth.
As a consequence, using the entire spectrum, we provide (improved) upper bounds on the return probabilities and mixing time of random walks with considerably shorter and more direct proofs. Our work introduces spectral embedding as a new tool in analyzing reversible Markov chains. Furthermore, building on [Lyo05], we design a local algorithm to approximate the number of spanning trees of massive graphs.
Submitted 13 January, 2017; v1 submitted 2 November, 2012;
originally announced November 2012.
-
The Spread of Evidence-Poor Medicine via Flawed Social-Network Analysis
Authors:
Russell Lyons
Abstract:
The chronic widespread misuse of statistics is usually inadvertent, not intentional. We find cautionary examples in a series of recent papers by Christakis and Fowler that advance statistical arguments for the transmission via social networks of various personal characteristics, including obesity, smoking cessation, happiness, and loneliness. Those papers also assert that such influence extends to three degrees of separation in social networks. We shall show that these conclusions do not follow from Christakis and Fowler's statistical analyses. In fact, their studies even provide some evidence against the existence of such transmission. The errors that we expose arose, in part, because the assumptions behind the statistical procedures used were insufficiently examined, not only by the authors, but also by the reviewers. Our examples are instructive because the practitioners are highly reputed, their results have received enormous popular attention, and the journals that published their studies are among the most respected in the world. An educational bonus emerges from the difficulty we report in getting our critique published. We discuss the relevance of this episode to understanding statistical literacy and the role of scientific review, as well as to reforming statistics education.
Submitted 5 May, 2011; v1 submitted 16 July, 2010;
originally announced July 2010.