-
Poisson-Process Topic Model for Integrating Knowledge from Pre-trained Language Models
Authors:
Morgane Austern,
Yuanchuan Guo,
Zheng Tracy Ke,
Tianle Liu
Abstract:
Topic modeling is traditionally applied to word counts without accounting for the context in which words appear. Recent advancements in large language models (LLMs) offer contextualized word embeddings, which capture deeper meaning and relationships between words. We aim to leverage such embeddings to improve topic modeling.
We use a pre-trained LLM to convert each document into a sequence of wo…
▽ More
Topic modeling is traditionally applied to word counts without accounting for the context in which words appear. Recent advancements in large language models (LLMs) offer contextualized word embeddings, which capture deeper meaning and relationships between words. We aim to leverage such embeddings to improve topic modeling.
We use a pre-trained LLM to convert each document into a sequence of word embeddings. This sequence is then modeled as a Poisson point process, with its intensity measure expressed as a convex combination of $K$ base measures, each corresponding to a topic. To estimate these topics, we propose a flexible algorithm that integrates traditional topic modeling methods, enhanced by net-rounding applied before and kernel smoothing applied after. One advantage of this framework is that it treats the LLM as a black box, requiring no fine-tuning of its parameters. Another advantage is its ability to seamlessly integrate any traditional topic modeling approach as a plug-in module, without the need for modifications
Assuming each topic is a $β$-Hölder smooth intensity measure on the embedded space, we establish the rate of convergence of our method. We also provide a minimax lower bound and show that the rate of our method matches with the lower bound when $β\leq 1$. Additionally, we apply our method to several datasets, providing evidence that it offers an advantage over traditional topic modeling approaches.
△ Less
Submitted 22 March, 2025;
originally announced March 2025.
-
Perfect Recovery for Random Geometric Graph Matching with Shallow Graph Neural Networks
Authors:
Suqi Liu,
Morgane Austern
Abstract:
We study the graph matching problem in the presence of vertex feature information using shallow graph neural networks. Specifically, given two graphs that are independent perturbations of a single random geometric graph with sparse binary features, the task is to recover an unknown one-to-one mapping between the vertices of the two graphs. We show under certain conditions on the sparsity and noise…
▽ More
We study the graph matching problem in the presence of vertex feature information using shallow graph neural networks. Specifically, given two graphs that are independent perturbations of a single random geometric graph with sparse binary features, the task is to recover an unknown one-to-one mapping between the vertices of the two graphs. We show under certain conditions on the sparsity and noise level of the feature vectors, a carefully designed two-layer graph neural network can, with high probability, recover the correct mapping between the vertices with the help of the graph structure. Additionally, we prove that our condition on the noise parameter is tight up to logarithmic factors. Finally, we compare the performance of the graph neural network to directly solving an assignment problem using the noisy vertex features and demonstrate that when the noise level is at least constant, this direct matching fails to achieve perfect recovery, whereas the graph neural network can tolerate noise levels growing as fast as a power of the size of the graph. Our theoretical findings are further supported by numerical studies as well as real-world data experiments.
△ Less
Submitted 11 March, 2025; v1 submitted 11 February, 2024;
originally announced February 2024.
-
Statistical Guarantees for Link Prediction using Graph Neural Networks
Authors:
Alan Chung,
Amin Saberi,
Morgane Austern
Abstract:
This paper derives statistical guarantees for the performance of Graph Neural Networks (GNNs) in link prediction tasks on graphs generated by a graphon. We propose a linear GNN architecture (LG-GNN) that produces consistent estimators for the underlying edge probabilities. We establish a bound on the mean squared error and give guarantees on the ability of LG-GNN to detect high-probability edges.…
▽ More
This paper derives statistical guarantees for the performance of Graph Neural Networks (GNNs) in link prediction tasks on graphs generated by a graphon. We propose a linear GNN architecture (LG-GNN) that produces consistent estimators for the underlying edge probabilities. We establish a bound on the mean squared error and give guarantees on the ability of LG-GNN to detect high-probability edges. Our guarantees hold for both sparse and dense graphs. Finally, we demonstrate some of the shortcomings of the classical GCN architecture, as well as verify our results on real and synthetic datasets.
△ Less
Submitted 7 February, 2024; v1 submitted 4 February, 2024;
originally announced February 2024.
-
Inference on Optimal Dynamic Policies via Softmax Approximation
Authors:
Qizhao Chen,
Morgane Austern,
Vasilis Syrgkanis
Abstract:
Estimating optimal dynamic policies from offline data is a fundamental problem in dynamic decision making. In the context of causal inference, the problem is known as estimating the optimal dynamic treatment regime. Even though there exists a plethora of methods for estimation, constructing confidence intervals for the value of the optimal regime and structural parameters associated with it is inh…
▽ More
Estimating optimal dynamic policies from offline data is a fundamental problem in dynamic decision making. In the context of causal inference, the problem is known as estimating the optimal dynamic treatment regime. Even though there exists a plethora of methods for estimation, constructing confidence intervals for the value of the optimal regime and structural parameters associated with it is inherently harder, as it involves non-linear and non-differentiable functionals of unknown quantities that need to be estimated. Prior work resorted to sub-sample approaches that can deteriorate the quality of the estimate. We show that a simple soft-max approximation to the optimal treatment regime, for an appropriately fast growing temperature parameter, can achieve valid inference on the truly optimal regime. We illustrate our result for a two-period optimal dynamic regime, though our approach should directly extend to the finite horizon case. Our work combines techniques from semi-parametric inference and $g$-estimation, together with an appropriate triangular array central limit theorem, as well as a novel analysis of the asymptotic influence and asymptotic bias of softmax approximations.
△ Less
Submitted 13 December, 2023; v1 submitted 8 March, 2023;
originally announced March 2023.
-
Debiased Machine Learning without Sample-Splitting for Stable Estimators
Authors:
Qizhao Chen,
Vasilis Syrgkanis,
Morgane Austern
Abstract:
Estimation and inference on causal parameters is typically reduced to a generalized method of moments problem, which involves auxiliary functions that correspond to solutions to a regression or classification problem. Recent line of work on debiased machine learning shows how one can use generic machine learning estimators for these auxiliary problems, while maintaining asymptotic normality and ro…
▽ More
Estimation and inference on causal parameters is typically reduced to a generalized method of moments problem, which involves auxiliary functions that correspond to solutions to a regression or classification problem. Recent line of work on debiased machine learning shows how one can use generic machine learning estimators for these auxiliary problems, while maintaining asymptotic normality and root-$n$ consistency of the target parameter of interest, while only requiring mean-squared-error guarantees from the auxiliary estimation algorithms. The literature typically requires that these auxiliary problems are fitted on a separate sample or in a cross-fitting manner. We show that when these auxiliary estimation algorithms satisfy natural leave-one-out stability properties, then sample splitting is not required. This allows for sample re-use, which can be beneficial in moderately sized sample regimes. For instance, we show that the stability properties that we propose are satisfied for ensemble bagged estimators, built via sub-sampling without replacement, a popular technique in machine learning practice.
△ Less
Submitted 14 November, 2022; v1 submitted 3 June, 2022;
originally announced June 2022.
-
Gaussian and Non-Gaussian Universality of Data Augmentation
Authors:
Kevin Han Huang,
Peter Orbanz,
Morgane Austern
Abstract:
We provide universality results that quantify how data augmentation affects the variance and limiting distribution of estimates through simple surrogates, and analyze several specific models in detail. The results confirm some observations made in machine learning practice, but also lead to unexpected findings: Data augmentation may increase rather than decrease the uncertainty of estimates, such…
▽ More
We provide universality results that quantify how data augmentation affects the variance and limiting distribution of estimates through simple surrogates, and analyze several specific models in detail. The results confirm some observations made in machine learning practice, but also lead to unexpected findings: Data augmentation may increase rather than decrease the uncertainty of estimates, such as the empirical prediction risk. It can act as a regularizer, but fails to do so in certain high-dimensional problems, and it may shift the double-descent peak of an empirical risk. Overall, the analysis shows that several properties data augmentation has been attributed with are not either true or false, but rather depend on a combination of factors -- notably the data distribution, the properties of the estimator, and the interplay of sample size, number of augmentations, and dimension. As our main theoretical tool, we develop an adaptation of Lindeberg's technique for block dependence. The resulting universality regime may be Gaussian or non-Gaussian.
△ Less
Submitted 15 March, 2025; v1 submitted 18 February, 2022;
originally announced February 2022.
-
Asymptotics of Network Embeddings Learned via Subsampling
Authors:
Andrew Davison,
Morgane Austern
Abstract:
Network data are ubiquitous in modern machine learning, with tasks of interest including node classification, node clustering and link prediction. A frequent approach begins by learning an Euclidean embedding of the network, to which algorithms developed for vector-valued data are applied. For large networks, embeddings are learned using stochastic gradient methods where the sub-sampling scheme ca…
▽ More
Network data are ubiquitous in modern machine learning, with tasks of interest including node classification, node clustering and link prediction. A frequent approach begins by learning an Euclidean embedding of the network, to which algorithms developed for vector-valued data are applied. For large networks, embeddings are learned using stochastic gradient methods where the sub-sampling scheme can be freely chosen. Despite the strong empirical performance of such methods, they are not well understood theoretically. Our work encapsulates representation methods using a subsampling approach, such as node2vec, into a single unifying framework. We prove, under the assumption that the graph is exchangeable, that the distribution of the learned embedding vectors asymptotically decouples. Moreover, we characterize the asymptotic distribution and provided rates of convergence, in terms of the latent parameters, which includes the choice of loss function and the embedding dimension. This provides a theoretical foundation to understand what the embedding vectors represent and how well these methods perform on downstream tasks. Notably, we observe that typically used loss functions may lead to shortcomings, such as a lack of Fisher consistency.
△ Less
Submitted 17 May, 2023; v1 submitted 5 July, 2021;
originally announced July 2021.
-
Asymptotics of the Empirical Bootstrap Method Beyond Asymptotic Normality
Authors:
Morgane Austern,
Vasilis Syrgkanis
Abstract:
One of the most commonly used methods for forming confidence intervals for statistical inference is the empirical bootstrap, which is especially expedient when the limiting distribution of the estimator is unknown. However, despite its ubiquitous role, its theoretical properties are still not well understood for non-asymptotically normal estimators. In this paper, under stability conditions, we es…
▽ More
One of the most commonly used methods for forming confidence intervals for statistical inference is the empirical bootstrap, which is especially expedient when the limiting distribution of the estimator is unknown. However, despite its ubiquitous role, its theoretical properties are still not well understood for non-asymptotically normal estimators. In this paper, under stability conditions, we establish the limiting distribution of the empirical bootstrap estimator, derive tight conditions for it to be asymptotically consistent, and quantify the speed of convergence. Moreover, we propose three alternative ways to use the bootstrap method to build confidence intervals with coverage guarantees. Finally, we illustrate the generality and tightness of our results by a series of examples, including uniform confidence bands, two-sample kernel tests, minmax stochastic programs and the empirical risk of stacked estimators.
△ Less
Submitted 23 November, 2020;
originally announced November 2020.
-
Empirical Risk Minimization and Stochastic Gradient Descent for Relational Data
Authors:
Victor Veitch,
Morgane Austern,
Wenda Zhou,
David M. Blei,
Peter Orbanz
Abstract:
Empirical risk minimization is the main tool for prediction problems, but its extension to relational data remains unsolved. We solve this problem using recent ideas from graph sampling theory to (i) define an empirical risk for relational data and (ii) obtain stochastic gradients for this empirical risk that are automatically unbiased. This is achieved by considering the method by which data is s…
▽ More
Empirical risk minimization is the main tool for prediction problems, but its extension to relational data remains unsolved. We solve this problem using recent ideas from graph sampling theory to (i) define an empirical risk for relational data and (ii) obtain stochastic gradients for this empirical risk that are automatically unbiased. This is achieved by considering the method by which data is sampled from a graph as an explicit component of model design. By integrating fast implementations of graph sampling schemes with standard automatic differentiation tools, we provide an efficient turnkey solver for the risk minimization problem. We establish basic theoretical properties of the procedure. Finally, we demonstrate relational ERM with application to two non-standard problems: one-stage training for semi-supervised node classification, and learning embedding vectors for vertex attributes. Experiments confirm that the turnkey inference procedure is effective in practice, and that the sampling scheme used for model specification has a strong effect on model performance. Code is available at https://github.com/wooden-spoon/relational-ERM.
△ Less
Submitted 22 February, 2019; v1 submitted 27 June, 2018;
originally announced June 2018.
-
Non-Vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach
Authors:
Wenda Zhou,
Victor Veitch,
Morgane Austern,
Ryan P. Adams,
Peter Orbanz
Abstract:
Modern neural networks are highly overparameterized, with capacity to substantially overfit to training data. Nevertheless, these networks often generalize well in practice. It has also been observed that trained networks can often be "compressed" to much smaller representations. The purpose of this paper is to connect these two empirical observations. Our main technical result is a generalization…
▽ More
Modern neural networks are highly overparameterized, with capacity to substantially overfit to training data. Nevertheless, these networks often generalize well in practice. It has also been observed that trained networks can often be "compressed" to much smaller representations. The purpose of this paper is to connect these two empirical observations. Our main technical result is a generalization bound for compressed networks based on the compressed size. Combined with off-the-shelf compression algorithms, the bound leads to state of the art generalization guarantees; in particular, we provide the first non-vacuous generalization guarantees for realistic architectures applied to the ImageNet classification problem. As additional evidence connecting compression and generalization, we show that compressibility of models that tend to overfit is limited: We establish an absolute limit on expected compressibility as a function of expected generalization error, where the expectations are over the random choice of training examples. The bounds are complemented by empirical results that show an increase in overfitting implies an increase in the number of bits required to describe a trained network.
△ Less
Submitted 24 February, 2019; v1 submitted 16 April, 2018;
originally announced April 2018.
-
On the Gaussianity of Kolmogorov Complexity of Mixing Sequences
Authors:
Morgane Austern,
Arian Maleki
Abstract:
Let $ K(X_1, \ldots, X_n)$ and $H(X_n | X_{n-1}, \ldots, X_1)$ denote the Kolmogorov complexity and Shannon's entropy rate of a stationary and ergodic process $\{X_i\}_{i=-\infty}^\infty$. It has been proved that \[ \frac{K(X_1, \ldots, X_n)}{n} - H(X_n | X_{n-1}, \ldots, X_1) \rightarrow 0, \] almost surely. This paper studies the convergence rate of this asymptotic result. In particular, we show…
▽ More
Let $ K(X_1, \ldots, X_n)$ and $H(X_n | X_{n-1}, \ldots, X_1)$ denote the Kolmogorov complexity and Shannon's entropy rate of a stationary and ergodic process $\{X_i\}_{i=-\infty}^\infty$. It has been proved that \[ \frac{K(X_1, \ldots, X_n)}{n} - H(X_n | X_{n-1}, \ldots, X_1) \rightarrow 0, \] almost surely. This paper studies the convergence rate of this asymptotic result. In particular, we show that if the process satisfies certain mixing conditions, then there exists $σ<\infty$ such that $$\sqrt{n}\left(\frac{K(X_{1:n})}{n}- H(X_0|X_1,\dots,X_{-\infty})\right) \rightarrow_d N(0,σ^2).$$ Furthermore, we show that under slightly stronger mixing conditions one may obtain non-asymptotic concentration bounds for the Kolmogorov complexity.
△ Less
Submitted 4 February, 2017;
originally announced February 2017.