Search | arXiv e-print repository

Alignment as Distribution Learning: Your Preference Model is Explicitly a Language Model

Authors: Jihun Yun, Juno Kim, Jongho Park, Junhyuck Kim, Jongha Jon Ryu, Jaewoong Cho, Kwang-Sung Jun

Abstract: Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm for controlling the quality of outputs from large language models (LLMs). However, when viewed as `loss + regularization,' the standard RLHF objective lacks theoretical justification and incentivizes degenerate, deterministic solutions, an issue that variants such as Direct Policy Optimization (DPO) al… ▽ More Alignment via reinforcement learning from human feedback (RLHF) has become the dominant paradigm for controlling the quality of outputs from large language models (LLMs). However, when viewed as `loss + regularization,' the standard RLHF objective lacks theoretical justification and incentivizes degenerate, deterministic solutions, an issue that variants such as Direct Policy Optimization (DPO) also inherit. In this paper, we rethink alignment by framing it as \emph{distribution learning} from pairwise preference feedback by explicitly modeling how information about the target language model bleeds through the preference data. This explicit modeling leads us to propose three principled learning objectives: preference maximum likelihood estimation, preference distillation, and reverse KL minimization. We theoretically show that all three approaches enjoy strong non-asymptotic $O(1/n)$ convergence to the target language model, naturally avoiding degeneracy and reward overfitting. Finally, we empirically demonstrate that our distribution learning framework, especially preference distillation, consistently outperforms or matches the performances of RLHF and DPO across various tasks and models. △ Less

Submitted 2 June, 2025; originally announced June 2025.

Comments: 26 pages, 7 tables

arXiv:2502.10826 [pdf, other]

Improved Offline Contextual Bandits with Second-Order Bounds: Betting and Freezing

Authors: J. Jon Ryu, Jeongyeol Kwon, Benjamin Koppe, Kwang-Sung Jun

Abstract: We consider the off-policy selection and learning in contextual bandits where the learner aims to select or train a reward-maximizing policy using data collected by a fixed behavior policy. Our contribution is two-fold. First, we propose a novel off-policy selection method that leverages a new betting-based confidence bound applied to an inverse propensity weight sequence. Our theoretical analysis… ▽ More We consider the off-policy selection and learning in contextual bandits where the learner aims to select or train a reward-maximizing policy using data collected by a fixed behavior policy. Our contribution is two-fold. First, we propose a novel off-policy selection method that leverages a new betting-based confidence bound applied to an inverse propensity weight sequence. Our theoretical analysis reveals that our method achieves a significantly better, variance-adaptive guarantee upon prior art. Second, we propose a novel and generic condition on the optimization objective for off-policy learning that strikes a difference balance in bias and variance. One special case that we call freezing tends to induce small variance, which is preferred in small-data regimes. Our analysis shows that they match the best existing guarantee. In our empirical study, our selection method outperforms existing methods, and freezing exhibits improved performance in small-sample regimes. △ Less

Submitted 15 February, 2025; originally announced February 2025.

Comments: 36 pages, 8 figures

arXiv:2502.09609 [pdf, other]

Score-of-Mixture Training: Training One-Step Generative Models Made Simple via Score Estimation of Mixture Distributions

Authors: Tejas Jayashankar, J. Jon Ryu, Gregory Wornell

Abstract: We propose Score-of-Mixture Training (SMT), a novel framework for training one-step generative models by minimizing a class of divergences called the $α$-skew Jensen-Shannon divergence. At its core, SMT estimates the score of mixture distributions between real and fake samples across multiple noise levels. Similar to consistency models, our approach supports both training from scratch (SMT) and di… ▽ More We propose Score-of-Mixture Training (SMT), a novel framework for training one-step generative models by minimizing a class of divergences called the $α$-skew Jensen-Shannon divergence. At its core, SMT estimates the score of mixture distributions between real and fake samples across multiple noise levels. Similar to consistency models, our approach supports both training from scratch (SMT) and distillation using a pretrained diffusion model, which we call Score-of-Mixture Distillation (SMD). It is simple to implement, requires minimal hyperparameter tuning, and ensures stable training. Experiments on CIFAR-10 and ImageNet 64x64 show that SMT/SMD are competitive with and can even outperform existing methods. △ Less

Submitted 13 February, 2025; v1 submitted 13 February, 2025; originally announced February 2025.

Comments: 27 pages, 9 figures. Title updated to match the title of the manuscript, otherwise identical to v1

arXiv:2409.18209 [pdf, ps, other]

A Unified View on Learning Unnormalized Distributions via Noise-Contrastive Estimation

Authors: J. Jon Ryu, Abhin Shah, Gregory W. Wornell

Abstract: This paper studies a family of estimators based on noise-contrastive estimation (NCE) for learning unnormalized distributions. The main contribution of this work is to provide a unified perspective on various methods for learning unnormalized distributions, which have been independently proposed and studied in separate research communities, through the lens of NCE. This unified view offers new ins… ▽ More This paper studies a family of estimators based on noise-contrastive estimation (NCE) for learning unnormalized distributions. The main contribution of this work is to provide a unified perspective on various methods for learning unnormalized distributions, which have been independently proposed and studied in separate research communities, through the lens of NCE. This unified view offers new insights into existing estimators. Specifically, for exponential families, we establish the finite-sample convergence rates of the proposed estimators under a set of regularity assumptions, most of which are new. △ Less

Submitted 26 September, 2024; originally announced September 2024.

Comments: 35 pages

arXiv:2402.06160 [pdf, other]

Are Uncertainty Quantification Capabilities of Evidential Deep Learning a Mirage?

Authors: Maohao Shen, J. Jon Ryu, Soumya Ghosh, Yuheng Bu, Prasanna Sattigeri, Subhro Das, Gregory W. Wornell

Abstract: This paper questions the effectiveness of a modern predictive uncertainty quantification approach, called \emph{evidential deep learning} (EDL), in which a single neural network model is trained to learn a meta distribution over the predictive distribution by minimizing a specific objective function. Despite their perceived strong empirical performance on downstream tasks, a line of recent studies… ▽ More This paper questions the effectiveness of a modern predictive uncertainty quantification approach, called \emph{evidential deep learning} (EDL), in which a single neural network model is trained to learn a meta distribution over the predictive distribution by minimizing a specific objective function. Despite their perceived strong empirical performance on downstream tasks, a line of recent studies by Bengs et al. identify limitations of the existing methods to conclude their learned epistemic uncertainties are unreliable, e.g., in that they are non-vanishing even with infinite data. Building on and sharpening such analysis, we 1) provide a sharper understanding of the asymptotic behavior of a wide class of EDL methods by unifying various objective functions; 2) reveal that the EDL methods can be better interpreted as an out-of-distribution detection algorithm based on energy-based-models; and 3) conduct extensive ablation studies to better assess their empirical effectiveness with real-world datasets. Through all these analyses, we conclude that even when EDL methods are empirically effective on downstream tasks, this occurs despite their poor uncertainty quantification capabilities. Our investigation suggests that incorporating model uncertainty can help EDL methods faithfully quantify uncertainties and further improve performance on representative downstream tasks, albeit at the cost of additional computational complexity. △ Less

Submitted 31 October, 2024; v1 submitted 8 February, 2024; originally announced February 2024.

Comments: 35 pages, 14 figures. NeurIPS 2024

arXiv:2402.03683 [pdf, other]

Gambling-Based Confidence Sequences for Bounded Random Vectors

Authors: J. Jon Ryu, Gregory W. Wornell

Abstract: A confidence sequence (CS) is a sequence of confidence sets that contains a target parameter of an underlying stochastic process at any time step with high probability. This paper proposes a new approach to constructing CSs for means of bounded multivariate stochastic processes using a general gambling framework, extending the recently established coin toss framework for bounded random processes.… ▽ More A confidence sequence (CS) is a sequence of confidence sets that contains a target parameter of an underlying stochastic process at any time step with high probability. This paper proposes a new approach to constructing CSs for means of bounded multivariate stochastic processes using a general gambling framework, extending the recently established coin toss framework for bounded random processes. The proposed gambling framework provides a general recipe for constructing CSs for categorical and probability-vector-valued observations, as well as for general bounded multidimensional observations through a simple reduction. This paper specifically explores the use of the mixture portfolio, akin to Cover's universal portfolio, in the proposed framework and investigates the properties of the resulting CSs. Simulations demonstrate the tightness of these confidence sequences compared to existing methods. When applied to the sampling without-replacement setting for finite categorical data, it is shown that the resulting CS based on a universal gambling strategy is provably tighter than that of the posterior-prior ratio martingale proposed by Waudby-Smith and Ramdas. △ Less

Submitted 21 August, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

Comments: 14 pages, 3 figures. ICML 2024

arXiv:2402.03655 [pdf, other]

Operator SVD with Neural Networks via Nested Low-Rank Approximation

Authors: J. Jon Ryu, Xiangxiang Xu, H. S. Melihcan Erol, Yuheng Bu, Lizhong Zheng, Gregory W. Wornell

Abstract: Computing eigenvalue decomposition (EVD) of a given linear operator, or finding its leading eigenvalues and eigenfunctions, is a fundamental task in many machine learning and scientific computing problems. For high-dimensional eigenvalue problems, training neural networks to parameterize the eigenfunctions is considered as a promising alternative to the classical numerical linear algebra technique… ▽ More Computing eigenvalue decomposition (EVD) of a given linear operator, or finding its leading eigenvalues and eigenfunctions, is a fundamental task in many machine learning and scientific computing problems. For high-dimensional eigenvalue problems, training neural networks to parameterize the eigenfunctions is considered as a promising alternative to the classical numerical linear algebra techniques. This paper proposes a new optimization framework based on the low-rank approximation characterization of a truncated singular value decomposition, accompanied by new techniques called \emph{nesting} for learning the top-$L$ singular values and singular functions in the correct order. The proposed method promotes the desired orthogonality in the learned functions implicitly and efficiently via an unconstrained optimization formulation, which is easy to solve with off-the-shelf gradient-based optimization algorithms. We demonstrate the effectiveness of the proposed optimization framework for use cases in computational physics and machine learning. △ Less

Submitted 21 August, 2024; v1 submitted 5 February, 2024; originally announced February 2024.

Comments: 36 pages, 7 figures. ICML 2024. Almost identical to the conference version, except a few updates for fixing typos and mistakes

arXiv:2302.08077 [pdf, other]

Group Fairness with Uncertainty in Sensitive Attributes

Authors: Abhin Shah, Maohao Shen, Jongha Jon Ryu, Subhro Das, Prasanna Sattigeri, Yuheng Bu, Gregory W. Wornell

Abstract: Learning a fair predictive model is crucial to mitigate biased decisions against minority groups in high-stakes applications. A common approach to learn such a model involves solving an optimization problem that maximizes the predictive power of the model under an appropriate group fairness constraint. However, in practice, sensitive attributes are often missing or noisy resulting in uncertainty.… ▽ More Learning a fair predictive model is crucial to mitigate biased decisions against minority groups in high-stakes applications. A common approach to learn such a model involves solving an optimization problem that maximizes the predictive power of the model under an appropriate group fairness constraint. However, in practice, sensitive attributes are often missing or noisy resulting in uncertainty. We demonstrate that solely enforcing fairness constraints on uncertain sensitive attributes can fall significantly short in achieving the level of fairness of models trained without uncertainty. To overcome this limitation, we propose a bootstrap-based algorithm that achieves the target level of fairness despite the uncertainty in sensitive attributes. The algorithm is guided by a Gaussian analysis for the independence notion of fairness where we propose a robust quadratically constrained quadratic problem to ensure a strict fairness guarantee with uncertain sensitive attributes. Our algorithm is applicable to both discrete and continuous sensitive attributes and is effective in real-world classification and regression tasks for various group fairness notions, e.g., independence and separation. △ Less

Submitted 7 June, 2023; v1 submitted 15 February, 2023; originally announced February 2023.

arXiv:2207.12382 [pdf, other]

doi 10.1109/TIT.2024.3448461

On Confidence Sequences for Bounded Random Processes via Universal Gambling Strategies

Authors: J. Jon Ryu, Alankrita Bhatt

Abstract: This paper considers the problem of constructing a confidence sequence, which is a sequence of confidence intervals that hold uniformly over time, for estimating the mean of bounded real-valued random processes. This paper revisits the gambling-based approach established in the recent literature from a natural \emph{two-horse race} perspective, and demonstrates new properties of the resulting algo… ▽ More This paper considers the problem of constructing a confidence sequence, which is a sequence of confidence intervals that hold uniformly over time, for estimating the mean of bounded real-valued random processes. This paper revisits the gambling-based approach established in the recent literature from a natural \emph{two-horse race} perspective, and demonstrates new properties of the resulting algorithm induced by Cover (1991)'s universal portfolio. The main result of this paper is a new algorithm based on a mixture of lower bounds, which closely approximates the performance of Cover's universal portfolio with constant per-round time complexity. A higher-order generalization of a lower bound on a logarithmic function in (Fan et al., 2015), which is developed as a key technique for the proposed algorithm, may be of independent interest. △ Less

Submitted 21 August, 2024; v1 submitted 25 July, 2022; originally announced July 2022.

Comments: 20 pages, 3 figures. IEEE Transactions on Information Theory (to appear)

arXiv:2202.06005 [pdf, ps, other]

An Information-Theoretic Proof of the Kac--Bernstein Theorem

Authors: J. Jon Ryu, Young-Han Kim

Abstract: A short, information-theoretic proof of the Kac--Bernstein theorem, which is stated as follows, is presented: For any independent random variables $X$ and $Y$, if $X+Y$ and $X-Y$ are independent, then $X$ and $Y$ are normally distributed. A short, information-theoretic proof of the Kac--Bernstein theorem, which is stated as follows, is presented: For any independent random variables $X$ and $Y$, if $X+Y$ and $X-Y$ are independent, then $X$ and $Y$ are normally distributed. △ Less

Submitted 21 February, 2022; v1 submitted 12 February, 2022; originally announced February 2022.

Comments: 4 pages

arXiv:2202.02464 [pdf, other]

Minimax Optimal Algorithms with Fixed-$k$-Nearest Neighbors

Authors: J. Jon Ryu, Young-Han Kim

Abstract: This paper presents how to perform minimax optimal classification, regression, and density estimation based on fixed-$k$ nearest neighbor (NN) searches. We consider a distributed learning scenario, in which a massive dataset is split into smaller groups, where the $k$-NNs are found for a query point with respect to each subset of data. We propose \emph{optimal} rules to aggregate the fixed-$k$-NN… ▽ More This paper presents how to perform minimax optimal classification, regression, and density estimation based on fixed-$k$ nearest neighbor (NN) searches. We consider a distributed learning scenario, in which a massive dataset is split into smaller groups, where the $k$-NNs are found for a query point with respect to each subset of data. We propose \emph{optimal} rules to aggregate the fixed-$k$-NN information for classification, regression, and density estimation that achieve minimax optimal rates for the respective problems. We show that the distributed algorithm with a fixed $k$ over a sufficiently large number of groups attains a minimax optimal error rate up to a multiplicative logarithmic factor under some regularity conditions. Roughly speaking, distributed $k$-NN rules with $M$ groups has a performance comparable to the standard $Θ(kM)$-NN rules even for fixed $k$. △ Less

Submitted 6 September, 2024; v1 submitted 4 February, 2022; originally announced February 2022.

Comments: 65 pages, 5 figures. The manuscript has been revised from scratch compared to the previous version. Notable differences include (1) updated statements and corrected proofs for classification and regression, (2) explicit statements and proofs for distance-selective rules, and (3) new analogous estimators for density estimation

arXiv:2202.02431 [pdf, ps, other]

On Universal Portfolios with Continuous Side Information

Authors: Alankrita Bhatt, J. Jon Ryu, Young-Han Kim

Abstract: A new portfolio selection strategy that adapts to a continuous side-information sequence is presented, with a universal wealth guarantee against a class of state-constant rebalanced portfolios with respect to a state function that maps each side-information symbol to a finite set of states. In particular, given that a state function belongs to a collection of functions of finite Natarajan dimensio… ▽ More A new portfolio selection strategy that adapts to a continuous side-information sequence is presented, with a universal wealth guarantee against a class of state-constant rebalanced portfolios with respect to a state function that maps each side-information symbol to a finite set of states. In particular, given that a state function belongs to a collection of functions of finite Natarajan dimension, the proposed strategy is shown to achieve, asymptotically to first order in the exponent, the same wealth as the best state-constant rebalanced portfolio with respect to the best state function, chosen in hindsight from observed market. This result can be viewed as an extension of the seminal work of Cover and Ordentlich (1996) that assumes a single state function. △ Less

Submitted 4 February, 2022; originally announced February 2022.

arXiv:2202.02406 [pdf, other]

Parameter-free Online Linear Optimization with Side Information via Universal Coin Betting

Authors: J. Jon Ryu, Alankrita Bhatt, Young-Han Kim

Abstract: A class of parameter-free online linear optimization algorithms is proposed that harnesses the structure of an adversarial sequence by adapting to some side information. These algorithms combine the reduction technique of Orabona and P{á}l (2016) for adapting coin betting algorithms for online linear optimization with universal compression techniques in information theory for incorporating sequent… ▽ More A class of parameter-free online linear optimization algorithms is proposed that harnesses the structure of an adversarial sequence by adapting to some side information. These algorithms combine the reduction technique of Orabona and P{á}l (2016) for adapting coin betting algorithms for online linear optimization with universal compression techniques in information theory for incorporating sequential side information to coin betting. Concrete examples are studied in which the side information has a tree structure and consists of quantized values of the previous symbols of the adversarial sequence, including fixed-order and variable-order Markov cases. By modifying the context-tree weighting technique of Willems, Shtarkov, and Tjalkens (1995), the proposed algorithm is further refined to achieve the best performance over all adaptive algorithms with tree-structured side information of a given maximum order in a computationally efficient manner. △ Less

Submitted 4 February, 2022; originally announced February 2022.

Comments: 23 pages, 5 figures, to appear at AISTATS 2022

arXiv:1911.04018 [pdf, other]

Feedback Recurrent AutoEncoder

Authors: Yang Yang, Guillaume Sautière, J. Jon Ryu, Taco S Cohen

Abstract: In this work, we propose a new recurrent autoencoder architecture, termed Feedback Recurrent AutoEncoder (FRAE), for online compression of sequential data with temporal dependency. The recurrent structure of FRAE is designed to efficiently extract the redundancy along the time dimension and allows a compact discrete representation of the data to be learned. We demonstrate its effectiveness in spee… ▽ More In this work, we propose a new recurrent autoencoder architecture, termed Feedback Recurrent AutoEncoder (FRAE), for online compression of sequential data with temporal dependency. The recurrent structure of FRAE is designed to efficiently extract the redundancy along the time dimension and allows a compact discrete representation of the data to be learned. We demonstrate its effectiveness in speech spectrogram compression. Specifically, we show that the FRAE, paired with a powerful neural vocoder, can produce high-quality speech waveforms at a low, fixed bitrate. We further show that by adding a learned prior for the latent space and using an entropy coder, we can achieve an even lower variable bitrate. △ Less

Submitted 17 February, 2020; v1 submitted 10 November, 2019; originally announced November 2019.

Journal ref: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

arXiv:1905.10945 [pdf, other]

Learning with Succinct Common Representation Based on Wyner's Common Information

Authors: J. Jon Ryu, Yoojin Choi, Young-Han Kim, Mostafa El-Khamy, Jungwon Lee

Abstract: A new bimodal generative model is proposed for generating conditional and joint samples, accompanied with a training method with learning a succinct bottleneck representation. The proposed model, dubbed as the variational Wyner model, is designed based on two classical problems in network information theory -- distributed simulation and channel synthesis -- in which Wyner's common information aris… ▽ More A new bimodal generative model is proposed for generating conditional and joint samples, accompanied with a training method with learning a succinct bottleneck representation. The proposed model, dubbed as the variational Wyner model, is designed based on two classical problems in network information theory -- distributed simulation and channel synthesis -- in which Wyner's common information arises as the fundamental limit on the succinctness of the common representation. The model is trained by minimizing the symmetric Kullback--Leibler divergence between variational and model distributions with regularization terms for common information, reconstruction consistency, and latent space matching terms, which is carried out via an adversarial density ratio estimation technique. The utility of the proposed approach is demonstrated through experiments for joint and conditional generation with synthetic and real-world datasets, as well as a challenging zero-shot image retrieval task. △ Less

Submitted 27 July, 2022; v1 submitted 26 May, 2019; originally announced May 2019.

Comments: 20 pages, 7 figures

arXiv:1805.08342 [pdf, other]

doi 10.1109/TIT.2022.3151231

Nearest neighbor density functional estimation from inverse Laplace transform

Authors: J. Jon Ryu, Shouvik Ganguly, Young-Han Kim, Yung-Kyun Noh, Daniel D. Lee

Abstract: A new approach to $L_2$-consistent estimation of a general density functional using $k$-nearest neighbor distances is proposed, where the functional under consideration is in the form of the expectation of some function $f$ of the densities at each point. The estimator is designed to be asymptotically unbiased, using the convergence of the normalized volume of a $k$-nearest neighbor ball to a Gamm… ▽ More A new approach to $L_2$-consistent estimation of a general density functional using $k$-nearest neighbor distances is proposed, where the functional under consideration is in the form of the expectation of some function $f$ of the densities at each point. The estimator is designed to be asymptotically unbiased, using the convergence of the normalized volume of a $k$-nearest neighbor ball to a Gamma distribution in the large-sample limit, and naturally involves the inverse Laplace transform of a scaled version of the function $f.$ Some instantiations of the proposed estimator recover existing $k$-nearest neighbor based estimators of Shannon and Rényi entropies and Kullback--Leibler and Rényi divergences, and discover new consistent estimators for many other functionals such as logarithmic entropies and divergences. The $L_2$-consistency of the proposed estimator is established for a broad class of densities for general functionals, and the convergence rate in mean squared error is established as a function of the sample size for smooth, bounded densities. △ Less

Submitted 4 February, 2022; v1 submitted 21 May, 2018; originally announced May 2018.

Comments: 43 pages, 4 figures. IEEE Transactions on Information Theory (to appear)

Showing 1–16 of 16 results for author: Ryu, J J