-
Improving LoRA with Variational Learning
Authors:
Bai Cong,
Nico Daheim,
Yuesong Shen,
Rio Yokota,
Mohammad Emtiyaz Khan,
Thomas Möllenhoff
Abstract:
Bayesian methods have recently been used to improve LoRA finetuning and, although they improve calibration, their effect on other metrics (such as accuracy) is marginal and can sometimes even be detrimental. Moreover, Bayesian methods also increase computational overheads and require additional tricks for them to work well. Here, we fix these issues by using a recently proposed variational algorithm called IVON. We show that IVON is easy to implement and has similar costs to AdamW, and yet it can also drastically improve many metrics by using a simple posterior pruning technique. We present extensive results on billion-scale LLMs (Llama and Qwen series) going way beyond the scale of existing applications of IVON. For example, we finetune a Llama-3.2-3B model on a set of commonsense reasoning tasks and improve accuracy over AdamW by 1.3% and reduce ECE by 5.4%, outperforming AdamW and other recent Bayesian methods like Laplace-LoRA and BLoB. Overall, our results show that variational learning with IVON can effectively improve LoRA finetuning.
Submitted 17 June, 2025;
originally announced June 2025.
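The abstract above reports reductions in ECE (expected calibration error). As a reminder of what that metric measures, here is a minimal sketch of the usual equal-width binning estimate. The function name, the bin count, and the half-open bin convention are assumptions made for illustration, not the evaluation code used in the paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average of |accuracy - mean confidence| per bin.

    confidences: predicted probability of the chosen class, each in [0, 1]
    correct:     booleans, True where the prediction was right
    n_bins:      number of equal-width bins (illustrative default)
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        # Each bin contributes in proportion to the share of samples it holds.
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece
```

A model that is 90% confident but only 75% accurate on the same predictions has a gap of 0.15 in that bin, which is the quantity a better-calibrated finetuning method would shrink.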
-
Federated ADMM from Bayesian Duality
Authors:
Thomas Möllenhoff,
Siddharth Swaroop,
Finale Doshi-Velez,
Mohammad Emtiyaz Khan
Abstract:
ADMM is a popular method for federated deep learning which originated in the 1970s and, even though many new variants of it have been proposed since then, its core algorithmic structure has remained unchanged. Here, we take a major departure from the old structure and present a fundamentally new way to derive and extend federated ADMM. We propose to use a structure called Bayesian Duality which exploits a duality of the posterior distributions obtained by solving a variational-Bayesian reformulation of the original problem. We show that this naturally recovers the original ADMM when isotropic Gaussian posteriors are used, and yields non-trivial extensions for other posterior forms. For instance, full-covariance Gaussians lead to Newton-like variants of ADMM, while diagonal covariances result in a cheap Adam-like variant. This is especially useful to handle heterogeneity in federated deep learning, giving up to 7% accuracy improvements over recent baselines. Our work opens a new Bayesian path to improve primal-dual methods.
Submitted 16 June, 2025;
originally announced June 2025.
-
Variational Learning Finds Flatter Solutions at the Edge of Stability
Authors:
Avrajit Ghosh,
Bai Cong,
Rio Yokota,
Saiprasad Ravishankar,
Rongrong Wang,
Molei Tao,
Mohammad Emtiyaz Khan,
Thomas Möllenhoff
Abstract:
Variational Learning (VL) has recently gained popularity for training deep neural networks and is competitive with standard learning methods. Part of its empirical success can be explained by theories such as PAC-Bayes bounds, minimum description length and marginal likelihood, but there are few tools to unravel the implicit regularization at play. Here, we analyze the implicit regularization of VL through the Edge of Stability (EoS) framework. EoS has previously been used to show that gradient descent can find flat solutions and we extend this result to VL to show that it can find even flatter solutions. This is obtained by controlling the posterior covariance and the number of Monte Carlo samples from the posterior. These results are derived in a similar fashion as the standard EoS literature for deep learning, by first deriving a result for a quadratic problem and then extending it to deep neural networks. We empirically validate these findings on a wide variety of large networks, such as ResNet and ViT, to find that the theoretical results closely match the empirical ones. Ours is the first work to analyze the EoS dynamics in VL.
Submitted 15 June, 2025;
originally announced June 2025.
-
Uncertainty-Aware Decoding with Minimum Bayes Risk
Authors:
Nico Daheim,
Clara Meister,
Thomas Möllenhoff,
Iryna Gurevych
Abstract:
Despite their outstanding performance in the majority of scenarios, contemporary language models still occasionally generate undesirable outputs, for example, hallucinated text. While such behaviors have previously been linked to uncertainty, there is a notable lack of methods that actively consider uncertainty during text generation. In this work, we show how Minimum Bayes Risk (MBR) decoding, which selects model generations according to an expected risk, can be generalized into a principled uncertainty-aware decoding method. In short, we account for model uncertainty during decoding by incorporating a posterior over model parameters into MBR's computation of expected risk. We show that this modified expected risk is useful for both choosing outputs and deciding when to abstain from generation and can provide improvements without incurring overhead. We benchmark different methods for learning posteriors and show that performance improves with prediction diversity. We release our code publicly.
Submitted 7 March, 2025;
originally announced March 2025.
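The abstract above builds on Minimum Bayes Risk (MBR) decoding, which picks the generation with the best expected utility against a set of sampled outputs. A minimal sketch of plain sample-based MBR follows. The helper names and the unigram-overlap utility are assumptions chosen to keep the example self-contained; the paper's uncertainty-aware variant additionally averages over a posterior on model parameters, which this sketch does not do.

```python
def unigram_overlap(hyp, ref):
    """Toy utility: F1-style overlap between the word sets of two strings."""
    h, r = set(hyp.split()), set(ref.split())
    if not h or not r:
        return 0.0
    return 2 * len(h & r) / (len(h) + len(r))


def mbr_select(candidates, samples=None, utility=unigram_overlap):
    """Return the candidate with the highest average utility.

    candidates: list of generated strings to choose from
    samples:    pseudo-references used to estimate expected utility;
                defaults to the candidates themselves
    utility:    function (hypothesis, reference) -> float, higher is better
    """
    refs = candidates if samples is None else samples

    def expected_utility(candidate):
        return sum(utility(candidate, ref) for ref in refs) / len(refs)

    return max(candidates, key=expected_utility)
```

With the candidates `"the cat sat"`, `"the cat sat down"`, `"the cat"` and `"a dog ran"`, the selection favors the output that agrees most with the others rather than an outlier.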
-
Natural Variational Annealing for Multimodal Optimization
Authors:
Tâm Le Minh,
Julyan Arbel,
Thomas Möllenhoff,
Mohammad Emtiyaz Khan,
Florence Forbes
Abstract:
We introduce a new multimodal optimization approach called Natural Variational Annealing (NVA) that combines the strengths of three foundational concepts to simultaneously search for multiple global and local modes of black-box nonconvex objectives. First, it implements a simultaneous search by using variational posteriors, such as mixtures of Gaussians. Second, it applies annealing to gradually trade off exploration for exploitation. Finally, it learns the variational search distribution using natural-gradient learning where updates resemble well-known and easy-to-implement algorithms. The three concepts come together in NVA giving rise to new algorithms and also allowing us to incorporate "fitness shaping", a core concept from evolutionary algorithms. We assess the quality of search on simulations and compare them to methods using gradient descent and evolution strategies. We also provide an application to a real-world inverse problem in planetary science.
Submitted 11 February, 2025; v1 submitted 8 January, 2025;
originally announced January 2025.
-
How to Weight Multitask Finetuning? Fast Previews via Bayesian Model-Merging
Authors:
Hugo Monzón Maldonado,
Thomas Möllenhoff,
Nico Daheim,
Iryna Gurevych,
Mohammad Emtiyaz Khan
Abstract:
When finetuning multiple tasks together, it is important to carefully weigh them to get good performance, but searching for good weights can be difficult and costly. Here, we propose to aid the search with fast previews to quickly get a rough idea of different reweighting options. We use model merging to create previews by simply reusing and averaging parameters of models trained on each task separately (no retraining required). To improve the quality of previews, we propose a Bayesian approach to design new merging strategies by using more flexible posteriors. We validate our findings on vision and natural-language transformers. Our work shows the benefits of model merging via Bayes to improve multitask finetuning.
Submitted 11 December, 2024;
originally announced December 2024.
-
Variational Low-Rank Adaptation Using IVON
Authors:
Bai Cong,
Nico Daheim,
Yuesong Shen,
Daniel Cremers,
Rio Yokota,
Mohammad Emtiyaz Khan,
Thomas Möllenhoff
Abstract:
We show that variational learning can significantly improve the accuracy and calibration of Low-Rank Adaptation (LoRA) without a substantial increase in the cost. We replace AdamW by the Improved Variational Online Newton (IVON) algorithm to finetune large language models. For Llama-2 with 7 billion parameters, IVON improves the accuracy over AdamW by 2.8% and expected calibration error by 4.6%. The accuracy is also better than the other Bayesian alternatives, yet the cost is lower and the implementation is easier. Our work provides additional evidence for the effectiveness of IVON for large language models. The code is available at https://github.com/team-approx-bayes/ivon-lora.
Submitted 9 November, 2024; v1 submitted 6 November, 2024;
originally announced November 2024.
-
Conformal Prediction via Regression-as-Classification
Authors:
Etash Guha,
Shlok Natarajan,
Thomas Möllenhoff,
Mohammad Emtiyaz Khan,
Eugene Ndiaye
Abstract:
Conformal prediction (CP) for regression can be challenging, especially when the output distribution is heteroscedastic, multimodal, or skewed. Some of the issues can be addressed by estimating a distribution over the output, but in reality, such approaches can be sensitive to estimation error and yield unstable intervals. Here, we circumvent the challenges by converting regression to a classification problem and then use CP for classification to obtain CP sets for regression. To preserve the ordering of the continuous-output space, we design a new loss function and make necessary modifications to the CP classification techniques. Empirical results on many benchmarks show that this simple approach gives surprisingly good results on many practical problems.
Submitted 11 April, 2024;
originally announced April 2024.
-
Variational Learning is Effective for Large Deep Networks
Authors:
Yuesong Shen,
Nico Daheim,
Bai Cong,
Peter Nickl,
Gian Maria Marconi,
Clement Bazan,
Rio Yokota,
Iryna Gurevych,
Daniel Cremers,
Mohammad Emtiyaz Khan,
Thomas Möllenhoff
Abstract:
We give extensive empirical evidence against the common belief that variational learning is ineffective for large neural networks. We show that an optimizer called Improved Variational Online Newton (IVON) consistently matches or outperforms Adam for training large networks such as GPT-2 and ResNets from scratch. IVON's computational costs are nearly identical to Adam but its predictive uncertainty is better. We show several new use cases of IVON where we improve finetuning and model merging in Large Language Models, accurately predict generalization error, and faithfully estimate sensitivity to data. We find overwhelming evidence that variational learning is effective.
Submitted 6 June, 2024; v1 submitted 27 February, 2024;
originally announced February 2024.
-
The Memory Perturbation Equation: Understanding Model's Sensitivity to Data
Authors:
Peter Nickl,
Lu Xu,
Dharmesh Tailor,
Thomas Möllenhoff,
Mohammad Emtiyaz Khan
Abstract:
Understanding a model's sensitivity to its training data is crucial but can also be challenging and costly, especially during training. To simplify such issues, we present the Memory-Perturbation Equation (MPE) which relates a model's sensitivity to perturbations in its training data. Derived using Bayesian principles, the MPE unifies existing sensitivity measures, generalizes them to a wide variety of models and algorithms, and unravels useful properties regarding sensitivities. Our empirical results show that sensitivity estimates obtained during training can be used to faithfully predict generalization on unseen test data. The proposed equation is expected to be useful for future research on robust and adaptive learning.
Submitted 16 January, 2024; v1 submitted 30 October, 2023;
originally announced October 2023.
-
Model Merging by Uncertainty-Based Gradient Matching
Authors:
Nico Daheim,
Thomas Möllenhoff,
Edoardo Maria Ponti,
Iryna Gurevych,
Mohammad Emtiyaz Khan
Abstract:
Models trained on different datasets can be merged by a weighted-averaging of their parameters, but why does it work and when can it fail? Here, we connect the inaccuracy of weighted-averaging to mismatches in the gradients and propose a new uncertainty-based scheme to improve the performance by reducing the mismatch. The connection also reveals implicit assumptions in other schemes such as averaging, task arithmetic, and Fisher-weighted averaging. Our new method gives consistent improvements for large language models and vision transformers, both in terms of performance and robustness to hyperparameters. Code available here.
Submitted 23 August, 2024; v1 submitted 19 October, 2023;
originally announced October 2023.
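The abstract above starts from the baseline of merging models by a weighted average of their parameters. A minimal sketch of that baseline is given below, assuming each model is represented as a dictionary from parameter name to a flat list of floats. The function name and this representation are illustrative assumptions, and the code shows only plain weighted averaging, not the uncertainty-based gradient-matching scheme the paper proposes.

```python
def merge_weighted(models, weights):
    """Merge models by a normalized weighted average of their parameters.

    models:  list of dicts mapping parameter name -> list of floats;
             all models are assumed to share the same names and shapes
    weights: one non-negative scalar per model (normalized internally)
    """
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    merged = {}
    for name in models[0]:
        length = len(models[0][name])
        merged[name] = [
            sum(w * m[name][i] for m, w in zip(models, weights)) / total
            for i in range(length)
        ]
    return merged
```

Using equal weights gives simple averaging; schemes such as Fisher-weighted averaging replace the scalar weight with a per-parameter weight, which is the direction the abstract's comparison points to.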
-
The Lie-Group Bayesian Learning Rule
Authors:
Eren Mehmet Kıral,
Thomas Möllenhoff,
Mohammad Emtiyaz Khan
Abstract:
The Bayesian Learning Rule provides a framework for generic algorithm design but can be difficult to use for three reasons. First, it requires a specific parameterization of the exponential family. Second, it uses gradients which can be difficult to compute. Third, its update may not always stay on the manifold. We address these difficulties by proposing an extension based on Lie-groups where posteriors are parametrized through transformations of an arbitrary base distribution and updated via the group's exponential map. This simplifies all three difficulties for many cases, providing flexible parametrizations through the group's action, simple gradient computation through reparameterization, and updates that always stay on the manifold. We use the new learning rule to derive a new algorithm for deep learning with desirable biologically-plausible attributes to learn sparse features. Our work opens a new frontier for the design of new algorithms by exploiting Lie-group structures.
Submitted 8 March, 2023;
originally announced March 2023.
-
SAM as an Optimal Relaxation of Bayes
Authors:
Thomas Möllenhoff,
Mohammad Emtiyaz Khan
Abstract:
Sharpness-aware minimization (SAM) and related adversarial deep-learning methods can drastically improve generalization, but their underlying mechanisms are not yet fully understood. Here, we establish SAM as a relaxation of the Bayes objective where the expected negative-loss is replaced by the optimal convex lower bound, obtained by using the so-called Fenchel biconjugate. The connection enables a new Adam-like extension of SAM to automatically obtain reasonable uncertainty estimates, while sometimes also improving its accuracy. By connecting adversarial and Bayesian methods, our work opens a new path to robustness.
Submitted 10 December, 2023; v1 submitted 4 October, 2022;
originally announced October 2022.
-
Lifting the Convex Conjugate in Lagrangian Relaxations: A Tractable Approach for Continuous Markov Random Fields
Authors:
Hartmut Bauermeister,
Emanuel Laude,
Thomas Möllenhoff,
Michael Moeller,
Daniel Cremers
Abstract:
Dual decomposition approaches in nonconvex optimization may suffer from a duality gap. This poses a challenge when applying them directly to nonconvex problems such as MAP-inference in a Markov random field (MRF) with continuous state spaces. To eliminate such gaps, this paper considers a reformulation of the original nonconvex task in the space of measures. This infinite-dimensional reformulation is then approximated by a semi-infinite one, which is obtained via a piecewise polynomial discretization in the dual. We provide a geometric intuition behind the primal problem induced by the dual discretization and draw connections to optimization over moment spaces. In contrast to existing discretizations which suffer from a grid bias, we show that a piecewise polynomial discretization better preserves the continuous nature of our problem. Invoking results from optimal transport theory and convex algebraic geometry we reduce the semi-infinite program to a finite one and provide a practical implementation based on semidefinite programming. We show, experimentally and in theory, that the approach successfully reduces the duality gap. To showcase the scalability of our approach, we apply it to the stereo matching problem between two images.
Submitted 16 May, 2022; v1 submitted 13 July, 2021;
originally announced July 2021.
-
Optimization of Graph Total Variation via Active-Set-based Combinatorial Reconditioning
Authors:
Zhenzhang Ye,
Thomas Möllenhoff,
Tao Wu,
Daniel Cremers
Abstract:
Structured convex optimization on weighted graphs finds numerous applications in machine learning and computer vision. In this work, we propose a novel adaptive preconditioning strategy for proximal algorithms on this problem class. Our preconditioner is driven by a sharp analysis of the local linear convergence rate depending on the "active set" at the current iterate. We show that nested-forest decomposition of the inactive edges yields a guaranteed local linear convergence rate. Further, we propose a practical greedy heuristic which realizes such nested decompositions and show in several numerical experiments that our reconditioning strategy, when applied to proximal gradient or primal-dual hybrid gradient algorithm, achieves competitive performances. Our results suggest that local convergence analysis can serve as a guideline for selecting variable metrics in proximal algorithms.
Submitted 27 February, 2020;
originally announced February 2020.
-
Informative GANs via Structured Regularization of Optimal Transport
Authors:
Pierre Bréchet,
Tao Wu,
Thomas Möllenhoff,
Daniel Cremers
Abstract:
We tackle the challenge of disentangled representation learning in generative adversarial networks (GANs) from the perspective of regularized optimal transport (OT). Specifically, a smoothed OT loss gives rise to an implicit transportation plan between the latent space and the data space. Based on this theoretical observation, we exploit a structured regularization on the transportation plan to encourage a prescribed latent subspace to be informative. This yields the formulation of a novel informative OT-based GAN. By convex duality, we obtain the equivalent view that this leads to perturbed ground costs favoring sparsity in the informative latent dimensions. Practically, we devise a stable training algorithm for the proposed informative GAN. Our experiments support the hypothesis that such regularizations effectively yield the discovery of disentangled and interpretable latent representations. Our work showcases the potential power of a regularized OT framework in the context of generative modeling through its access to the transport plan. Further challenges are addressed in this line.
Submitted 4 December, 2019;
originally announced December 2019.
-
Flat Metric Minimization with Applications in Generative Modeling
Authors:
Thomas Möllenhoff,
Daniel Cremers
Abstract:
We take the novel perspective to view data not as a probability distribution but rather as a current. Primarily studied in the field of geometric measure theory, $k$-currents are continuous linear functionals acting on compactly supported smooth differential forms and can be understood as a generalized notion of oriented $k$-dimensional manifold. By moving from distributions (which are $0$-currents) to $k$-currents, we can explicitly orient the data by attaching a $k$-dimensional tangent plane to each sample point. Based on the flat metric which is a fundamental distance between currents, we derive FlatGAN, a formulation in the spirit of generative adversarial networks but generalized to $k$-currents. In our theoretical contribution we prove that the flat metric between a parametrized current and a reference current is Lipschitz continuous in the parameters. In experiments, we show that the proposed shift to $k>0$ leads to interpretable and disentangled latent representations which behave equivariantly to the specified oriented tangent planes.
Submitted 12 May, 2019;
originally announced May 2019.
-
Lifting Vectorial Variational Problems: A Natural Formulation based on Geometric Measure Theory and Discrete Exterior Calculus
Authors:
Thomas Möllenhoff,
Daniel Cremers
Abstract:
Numerous tasks in imaging and vision can be formulated as variational problems over vector-valued maps. We approach the relaxation and convexification of such vectorial variational problems via a lifting to the space of currents. To that end, we recall that functionals with polyconvex Lagrangians can be reparametrized as convex one-homogeneous functionals on the graph of the function. This leads to an equivalent shape optimization problem over oriented surfaces in the product space of domain and codomain. A convex formulation is then obtained by relaxing the search space from oriented surfaces to more general currents. We propose a discretization of the resulting infinite-dimensional optimization problem using Whitney forms, which also generalizes recent "sublabel-accurate" multilabeling approaches.
Submitted 2 May, 2019;
originally announced May 2019.
-
Controlling Neural Networks via Energy Dissipation
Authors:
Michael Moeller,
Thomas Möllenhoff,
Daniel Cremers
Abstract:
The last decade has shown a tremendous success in solving various computer vision problems with the help of deep learning techniques. Lately, many works have demonstrated that learning-based approaches with suitable network architectures even exhibit superior performance for the solution of (ill-posed) image reconstruction problems such as deblurring, super-resolution, or medical image reconstruction. The drawback of purely learning-based methods, however, is that they cannot provide provable guarantees for the trained network to follow a given data formation process during inference. In this work we propose energy dissipating networks that iteratively compute a descent direction with respect to a given cost function or energy at the currently estimated reconstruction. Therefore, an adaptive step size rule such as a line-search, along with a suitable number of iterations can guarantee the reconstruction to follow a given data formation model encoded in the energy to arbitrary precision, and hence control the model's behavior even during test time. We prove that under standard assumptions, descent using the direction predicted by the network converges (linearly) to the global minimum of the energy. We illustrate the effectiveness of the proposed approach in experiments on single image super resolution and computed tomography (CT) reconstruction, and further illustrate extensions to convex feasibility problems.
Submitted 20 August, 2019; v1 submitted 5 April, 2019;
originally announced April 2019.
-
Combinatorial Preconditioners for Proximal Algorithms on Graphs
Authors:
Thomas Möllenhoff,
Zhenzhang Ye,
Tao Wu,
Daniel Cremers
Abstract:
We present a novel preconditioning technique for proximal optimization methods that relies on graph algorithms to construct effective preconditioners. Such combinatorial preconditioners arise from partitioning the graph into forests. We prove that certain decompositions lead to a theoretically optimal condition number. We also show how ideal decompositions can be realized using matroid partitioning and propose efficient greedy variants thereof for large-scale problems. Coupled with specialized solvers for the resulting scaled proximal subproblems, the preconditioned algorithm achieves competitive performance in machine learning and vision applications.
Submitted 21 February, 2018; v1 submitted 16 January, 2018;
originally announced January 2018.
-
Proximal Backpropagation
Authors:
Thomas Frerix,
Thomas Möllenhoff,
Michael Moeller,
Daniel Cremers
Abstract:
We propose proximal backpropagation (ProxProp) as a novel algorithm that takes implicit instead of explicit gradient steps to update the network parameters during neural network training. Our algorithm is motivated by the step size limitation of explicit gradient descent, which poses an impediment for optimization. ProxProp is developed from a general point of view on the backpropagation algorithm, currently the most common technique to train neural networks via stochastic gradient descent and variants thereof. Specifically, we show that backpropagation of a prediction error is equivalent to sequential gradient descent steps on a quadratic penalty energy, which comprises the network activations as variables of the optimization. We further analyze theoretical properties of ProxProp and in particular prove that the algorithm yields a descent direction in parameter space and can therefore be combined with a wide variety of convergent algorithms. Finally, we devise an efficient numerical implementation that integrates well with popular deep learning frameworks. We conclude by demonstrating promising numerical results and show that ProxProp can be effectively combined with common first order optimizers such as Adam.
Submitted 20 February, 2018; v1 submitted 14 June, 2017;
originally announced June 2017.
-
Sublabel-Accurate Discretization of Nonconvex Free-Discontinuity Problems
Authors:
Thomas Möllenhoff,
Daniel Cremers
Abstract:
In this work we show how sublabel-accurate multilabeling approaches can be derived by approximating a classical label-continuous convex relaxation of nonconvex free-discontinuity problems. This insight allows us to extend these sublabel-accurate approaches from total variation to general convex and nonconvex regularizations. Furthermore, it leads to a systematic approach to the discretization of continuous convex relaxations. We study the relationship to existing discretizations and to discrete-continuous MRFs. Finally, we apply the proposed approach to obtain a sublabel-accurate and convex solution to the vectorial Mumford-Shah functional and show in several experiments that it leads to more precise solutions using fewer labels.
Submitted 5 August, 2017; v1 submitted 21 November, 2016;
originally announced November 2016.
-
Sublabel-Accurate Convex Relaxation of Vectorial Multilabel Energies
Authors:
Emanuel Laude,
Thomas Möllenhoff,
Michael Moeller,
Jan Lellmann,
Daniel Cremers
Abstract:
Convex relaxations of nonconvex multilabel problems have been demonstrated to produce superior (provably optimal or near-optimal) solutions to a variety of classical computer vision problems. Yet, they are of limited practical use as they require a fine discretization of the label space, entailing a huge demand in memory and runtime. In this work, we propose the first sublabel-accurate convex relaxation for vectorial multilabel problems. The key idea is that we approximate the data term of the vectorial labeling problem in a piecewise convex (rather than piecewise linear) manner. As a result, we have a more faithful approximation of the original cost function that provides a meaningful interpretation for the fractional solutions of the relaxed convex problem. In numerous experiments on large-displacement optical flow estimation and on color image denoising we demonstrate that the computed solutions have superior quality while requiring much lower memory and runtime.
Submitted 10 October, 2016; v1 submitted 7 April, 2016;
originally announced April 2016.
-
Sublabel-Accurate Relaxation of Nonconvex Energies
Authors:
Thomas Möllenhoff,
Emanuel Laude,
Michael Moeller,
Jan Lellmann,
Daniel Cremers
Abstract:
We propose a novel spatially continuous framework for convex relaxations based on functional lifting. Our method can be interpreted as a sublabel-accurate solution to multilabel problems. We show that previously proposed functional lifting methods optimize an energy which is linear between two labels and hence require (often infinitely) many labels for a faithful approximation. In contrast, the proposed formulation is based on a piecewise convex approximation and therefore needs far fewer labels. In comparison to recent MRF-based approaches, our method is formulated in a spatially continuous setting and shows less grid bias. Moreover, in a local sense, our formulation is the tightest possible convex relaxation. It is easy to implement and allows an efficient primal-dual optimization on GPUs. We show the effectiveness of our approach on several computer vision problems.
Submitted 4 December, 2015;
originally announced December 2015.
-
The Primal-Dual Hybrid Gradient Method for Semiconvex Splittings
Authors:
Thomas Möllenhoff,
Evgeny Strekalovskiy,
Michael Moeller,
Daniel Cremers
Abstract:
This paper deals with the analysis of a recent reformulation of the primal-dual hybrid gradient method [Zhu and Chan 2008, Pock, Cremers, Bischof and Chambolle 2009, Esser, Zhang and Chan 2010, Chambolle and Pock 2011], which allows it to be applied to nonconvex regularizers, as first proposed for truncated quadratic penalization in [Strekalovskiy and Cremers 2014]. In particular, it investigates variational problems for which the energy to be minimized can be written as $G(u) + F(Ku)$, where $G$ is convex, $F$ is semiconvex, and $K$ is a linear operator. We study the method and prove convergence in the case where the nonconvexity of $F$ is compensated by the strong convexity of $G$. The convergence proof yields an interesting requirement for the choice of algorithm parameters, which we show to be not only sufficient but also necessary. Additionally, we show boundedness of the iterates under much weaker conditions. Finally, we demonstrate effectiveness and convergence of the algorithm beyond the theoretical guarantees in several numerical experiments.
Submitted 7 July, 2014;
originally announced July 2014.