Search | arXiv e-print repository

arXiv:2505.19470 [pdf, ps, other]

Information-theoretic Generalization Analysis for VQ-VAEs: A Role of Latent Variables

Authors: Futoshi Futami, Masahiro Fujisawa

Abstract: Latent variables (LVs) play a crucial role in encoder-decoder models by enabling effective data compression, prediction, and generation. Although their theoretical properties, such as generalization, have been extensively studied in supervised learning, similar analyses for unsupervised models such as variational autoencoders (VAEs) remain insufficiently underexplored. In this work, we extend info… ▽ More Latent variables (LVs) play a crucial role in encoder-decoder models by enabling effective data compression, prediction, and generation. Although their theoretical properties, such as generalization, have been extensively studied in supervised learning, similar analyses for unsupervised models such as variational autoencoders (VAEs) remain insufficiently underexplored. In this work, we extend information-theoretic generalization analysis to vector-quantized (VQ) VAEs with discrete latent spaces, introducing a novel data-dependent prior to rigorously analyze the relationship among LVs, generalization, and data generation. We derive a novel generalization error bound of the reconstruction loss of VQ-VAEs, which depends solely on the complexity of LVs and the encoder, independent of the decoder. Additionally, we provide the upper bound of the 2-Wasserstein distance between the distributions of the true data and the generated data, explaining how the regularization of the LVs contributes to the data generation performance. △ Less

Submitted 25 May, 2025; originally announced May 2025.

arXiv:2505.17859 [pdf, ps, other]

Scalable Valuation of Human Feedback through Provably Robust Model Alignment

Authors: Masahiro Fujisawa, Masaki Adachi, Michael A. Osborne

Abstract: Despite the importance of aligning language models with human preferences, crowd-sourced human feedback is often noisy -- for example, preferring less desirable responses -- posing a fundamental challenge to alignment. A truly robust alignment objective should yield identical model parameters even under severe label noise, a property known as redescending. We prove that no existing alignment metho… ▽ More Despite the importance of aligning language models with human preferences, crowd-sourced human feedback is often noisy -- for example, preferring less desirable responses -- posing a fundamental challenge to alignment. A truly robust alignment objective should yield identical model parameters even under severe label noise, a property known as redescending. We prove that no existing alignment methods satisfy this property. To address this, we propose Hölder-DPO, the first principled alignment loss with a provable redescending property, enabling estimation of the clean data distribution from noisy feedback. The aligned model estimates the likelihood of clean data, providing a theoretically grounded metric for dataset valuation that identifies the location and fraction of mislabels. This metric is gradient-free, enabling scalable and automated human feedback valuation without costly manual verification or clean validation dataset. Hölder-DPO achieves state-of-the-art robust alignment performance while accurately detecting mislabels in controlled datasets. Finally, we apply Hölder-DPO to widely used alignment datasets, revealing substantial noise levels and demonstrating that removing these mislabels significantly improves alignment performance across methods. △ Less

Submitted 23 May, 2025; originally announced May 2025.

Comments: 38 pages, 7 figures

arXiv:2503.06079 [pdf, ps, other]

Fixing the Pitfalls of Probabilistic Time-Series Forecasting Evaluation by Kernel Quadrature

Authors: Masaki Adachi, Masahiro Fujisawa, Michael A Osborne

Abstract: Despite the significance of probabilistic time-series forecasting models, their evaluation metrics often involve intractable integrations. The most widely used metric, the continuous ranked probability score (CRPS), is a strictly proper scoring function; however, its computation requires approximation. We found that popular CRPS estimators--specifically, the quantile-based estimator implemented in… ▽ More Despite the significance of probabilistic time-series forecasting models, their evaluation metrics often involve intractable integrations. The most widely used metric, the continuous ranked probability score (CRPS), is a strictly proper scoring function; however, its computation requires approximation. We found that popular CRPS estimators--specifically, the quantile-based estimator implemented in the widely used GluonTS library and the probability-weighted moment approximation--both exhibit inherent estimation biases. These biases lead to crude approximations, resulting in improper rankings of forecasting model performance when CRPS values are close. To address this issue, we introduced a kernel quadrature approach that leverages an unbiased CRPS estimator and employs cubature construction for scalable computation. Empirically, our approach consistently outperforms the two widely used CRPS estimators. △ Less

Submitted 23 July, 2025; v1 submitted 8 March, 2025; originally announced March 2025.

Comments: 11 pages, 6 figures

MSC Class: 62C10; 62F15

arXiv:2406.06227 [pdf, ps, other]

PAC-Bayes Analysis for Recalibration in Classification

Authors: Masahiro Fujisawa, Futoshi Futami

Abstract: Nonparametric estimation using uniform-width binning is a standard approach for evaluating the calibration performance of machine learning models. However, existing theoretical analyses of the bias induced by binning are limited to binary classification, creating a significant gap with practical applications such as multiclass classification. Additionally, many parametric recalibration algorithms… ▽ More Nonparametric estimation using uniform-width binning is a standard approach for evaluating the calibration performance of machine learning models. However, existing theoretical analyses of the bias induced by binning are limited to binary classification, creating a significant gap with practical applications such as multiclass classification. Additionally, many parametric recalibration algorithms lack theoretical guarantees for their generalization performance. To address these issues, we conduct a generalization analysis of calibration error using the probably approximately correct Bayes framework. This approach enables us to derive the first optimizable upper bound for generalization error in the calibration context. On the basis of our theory, we propose a generalization-aware recalibration algorithm. Numerical experiments show that our algorithm enhances the performance of Gaussian process-based recalibration across various benchmark datasets and models. △ Less

Submitted 10 July, 2025; v1 submitted 10 June, 2024; originally announced June 2024.

Comments: Accepted by the 42nd International Conference on Machine Learning (ICML2025), 38 pages, 8 figures

arXiv:2405.15709 [pdf, other]

Information-theoretic Generalization Analysis for Expected Calibration Error

Authors: Futoshi Futami, Masahiro Fujisawa

Abstract: While the expected calibration error (ECE), which employs binning, is widely adopted to evaluate the calibration performance of machine learning models, theoretical understanding of its estimation bias is limited. In this paper, we present the first comprehensive analysis of the estimation bias in the two common binning strategies, uniform mass and uniform width binning. Our analysis establishes u… ▽ More While the expected calibration error (ECE), which employs binning, is widely adopted to evaluate the calibration performance of machine learning models, theoretical understanding of its estimation bias is limited. In this paper, we present the first comprehensive analysis of the estimation bias in the two common binning strategies, uniform mass and uniform width binning. Our analysis establishes upper bounds on the bias, achieving an improved convergence rate. Moreover, our bounds reveal, for the first time, the optimal number of bins to minimize the estimation bias. We further extend our bias analysis to generalization error analysis based on the information-theoretic approach, deriving upper bounds that enable the numerical evaluation of how small the ECE is for unknown data. Experiments using deep learning models show that our bounds are nonvacuous thanks to this information-theoretic generalization analysis approach. △ Less

Submitted 26 May, 2025; v1 submitted 24 May, 2024; originally announced May 2024.

Comments: Accepted by the 38th Conference on Neural Information Processing Systems (NeurIPS2024), 52 pages, 6 figures

arXiv:2311.01046 [pdf, ps, other]

Time-Independent Information-Theoretic Generalization Bounds for SGLD

Authors: Futoshi Futami, Masahiro Fujisawa

Abstract: We provide novel information-theoretic generalization bounds for stochastic gradient Langevin dynamics (SGLD) under the assumptions of smoothness and dissipativity, which are widely used in sampling and non-convex optimization studies. Our bounds are time-independent and decay to zero as the sample size increases, regardless of the number of iterations and whether the step size is fixed. Unlike pr… ▽ More We provide novel information-theoretic generalization bounds for stochastic gradient Langevin dynamics (SGLD) under the assumptions of smoothness and dissipativity, which are widely used in sampling and non-convex optimization studies. Our bounds are time-independent and decay to zero as the sample size increases, regardless of the number of iterations and whether the step size is fixed. Unlike previous studies, we derive the generalization error bounds by focusing on the time evolution of the Kullback--Leibler divergence, which is related to the stability of datasets and is the upper bound of the mutual information between output parameters and an input dataset. Additionally, we establish the first information-theoretic generalization bound when the training and test loss are the same by showing that a loss function of SGLD is sub-exponential. This bound is also time-independent and removes the problematic step size dependence in existing work, leading to an improved excess risk bound by combining our analysis with the existing non-convex optimization error bounds. △ Less

Submitted 2 November, 2023; originally announced November 2023.

Comments: Accepted by the Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS2023), 29 pages

arXiv:2006.07571 [pdf, other]

$γ$-ABC: Outlier-Robust Approximate Bayesian Computation Based on a Robust Divergence Estimator

Authors: Masahiro Fujisawa, Takeshi Teshima, Issei Sato, Masashi Sugiyama

Abstract: Approximate Bayesian computation (ABC) is a likelihood-free inference method that has been employed in various applications. However, ABC can be sensitive to outliers if a data discrepancy measure is chosen inappropriately. In this paper, we propose to use a nearest-neighbor-based $γ$-divergence estimator as a data discrepancy measure. We show that our estimator possesses a suitable theoretical ro… ▽ More Approximate Bayesian computation (ABC) is a likelihood-free inference method that has been employed in various applications. However, ABC can be sensitive to outliers if a data discrepancy measure is chosen inappropriately. In this paper, we propose to use a nearest-neighbor-based $γ$-divergence estimator as a data discrepancy measure. We show that our estimator possesses a suitable theoretical robustness property called the redescending property. In addition, our estimator enjoys various desirable properties such as high flexibility, asymptotic unbiasedness, almost sure convergence, and linear-time computational complexity. Through experiments, we demonstrate that our method achieves significantly higher robustness than existing discrepancy measures. △ Less

Submitted 5 March, 2021; v1 submitted 13 June, 2020; originally announced June 2020.

Comments: The 24th International Conference on Artificial Intelligence and Statistics (AISTATS 2021); 48 pages, 22 figures

arXiv:1902.00468 [pdf, other]

Multilevel Monte Carlo Variational Inference

Authors: Masahiro Fujisawa, Issei Sato

Abstract: We propose a variance reduction framework for variational inference using the Multilevel Monte Carlo (MLMC) method. Our framework is built on reparameterized gradient estimators and "recycles" parameters obtained from past update history in optimization. In addition, our framework provides a new optimization algorithm based on stochastic gradient descent (SGD) that adaptively estimates the sample… ▽ More We propose a variance reduction framework for variational inference using the Multilevel Monte Carlo (MLMC) method. Our framework is built on reparameterized gradient estimators and "recycles" parameters obtained from past update history in optimization. In addition, our framework provides a new optimization algorithm based on stochastic gradient descent (SGD) that adaptively estimates the sample size used for gradient estimation according to the ratio of the gradient variance. We theoretically show that, with our method, the variance of the gradient estimator decreases as optimization proceeds and that a learning rate scheduler function helps improve the convergence. We also show that, in terms of the \textit{signal-to-noise} ratio, our method can improve the quality of gradient estimation by the learning rate scheduler function without increasing the initial sample size. Finally, we confirm that our method achieves faster convergence and reduces the variance of the gradient estimator compared with other methods through experimental comparisons with baseline methods using several benchmark datasets. △ Less

Submitted 2 December, 2021; v1 submitted 1 February, 2019; originally announced February 2019.

Comments: 44pages, 10 figures; Journal of Machine Learning Research (JMLR)

Showing 1–8 of 8 results for author: Fujisawa, M