Search | arXiv e-print repository

Simulating Fokker-Planck equations via mean field control of score-based normalizing flows

Authors: Mo Zhou, Stanley Osher, Wuchen Li

Abstract: The Fokker-Planck (FP) equation governs the evolution of densities for stochastic dynamics of physical systems, such as the Langevin dynamics and the Lorenz system. This work simulates FP equations through a mean field control (MFC) problem. We first formulate the FP equation as a continuity equation, where the velocity field consists of the drift function and the score function, i.e., the gradient of the logarithm of the density function. Next, we design an MFC problem that matches the velocity fields in a continuity equation with the ones in the FP equation. The score functions along deterministic trajectories are computed efficiently through the score-based normalizing flow, which relies only on the derivatives of the parameterized velocity fields. A convergence analysis is conducted for our algorithm on the FP equation of Ornstein-Uhlenbeck processes. Numerical results, including Langevin dynamics, underdamped Langevin dynamics, and various chaotic systems, validate the effectiveness of our proposed algorithms.

Submitted 6 June, 2025; originally announced June 2025.

MSC Class: 34K35; 37N35; 58E25; 93C15; 93C20 ACM Class: G.1.7
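
The reformulation described in the abstract above can be written out as a short worked identity. This is a sketch under stated assumptions rather than the paper's own notation: it assumes the SDE $dX_t = f(X_t)\,dt + \sqrt{2\varepsilon}\,dW_t$ with a constant scalar diffusion coefficient $\varepsilon > 0$, and the symbols $f$, $\varepsilon$, $\rho$, and $v$ are chosen here for illustration.

```latex
\begin{aligned}
\partial_t \rho + \nabla\cdot(\rho f) &= \varepsilon\,\Delta\rho
  && \text{(Fokker-Planck equation)} \\
\varepsilon\,\Delta\rho &= \nabla\cdot\big(\rho\,\varepsilon\,\nabla\log\rho\big)
  && \text{(since } \rho\,\nabla\log\rho = \nabla\rho\text{)} \\
\partial_t \rho + \nabla\cdot(\rho\, v) &= 0,
  \qquad v(x,t) = f(x) - \varepsilon\,\nabla\log\rho(x,t)
  && \text{(continuity equation)}
\end{aligned}
```

Under these assumptions, particles moved deterministically by $\dot{x} = v(x,t)$ have the same density evolution as the stochastic dynamics, which is why the score $\nabla\log\rho$ is the quantity that has to be computed along trajectories.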

arXiv:2506.00674 [pdf, ps, other]

Thinking Out of the Box: Hybrid SAT Solving by Unconstrained Continuous Optimization

Authors: Zhiwei Zhang, Samy Wu Fung, Anastasios Kyrillidis, Stanley Osher, Moshe Y. Vardi

Abstract: The Boolean satisfiability (SAT) problem lies at the core of many applications in combinatorial optimization, software verification, cryptography, and machine learning. While state-of-the-art solvers have demonstrated high efficiency in handling conjunctive normal form (CNF) formulas, numerous applications require non-CNF (hybrid) constraints, such as XOR, cardinality, and Not-All-Equal constraints. Recent work leverages polynomial representations to represent such hybrid constraints, but it relies on box constraints that can limit the use of powerful unconstrained optimizers. In this paper, we propose unconstrained continuous optimization formulations for hybrid SAT solving by penalty terms. We provide theoretical insights into when these penalty terms are necessary and demonstrate empirically that unconstrained optimizers (e.g., Adam) can enhance SAT solving on hybrid benchmarks. Our results highlight the potential of combining continuous optimization and machine-learning-based methods for effective hybrid SAT solving.

Submitted 31 May, 2025; originally announced June 2025.

arXiv:2505.13499 [pdf, ps, other]

Optimal Control for Transformer Architectures: Enhancing Generalization, Robustness and Efficiency

Authors: Kelvin Kan, Xingjian Li, Benjamin J. Zhang, Tuhin Sahai, Stanley Osher, Markos A. Katsoulakis

Abstract: We study Transformers through the perspective of optimal control theory, using tools from continuous-time formulations to derive actionable insights into training and architecture design. This framework improves the performance of existing Transformer models while providing desirable theoretical guarantees, including generalization and robustness. Our framework is designed to be plug-and-play, enabling seamless integration with established Transformer models and requiring only slight changes to the implementation. We conduct seven extensive experiments on tasks motivated by text generation, sentiment analysis, image classification, and point cloud classification. Experimental results show that the framework improves the test performance of the baselines, while being more parameter-efficient. On character-level text generation with nanoGPT, our framework achieves a 46% reduction in final test loss while using 42% fewer parameters. On GPT-2, our framework achieves a 5.6% reduction in final test loss, demonstrating scalability to larger models. To the best of our knowledge, this is the first work that applies optimal control theory to both the training and architecture of Transformers. It offers a new foundation for systematic, theory-driven improvements and moves beyond costly trial-and-error approaches.

Submitted 15 May, 2025; originally announced May 2025.

arXiv:2502.20833 [pdf, other]

Recent Advances in Numerical Solutions for Hamilton-Jacobi PDEs

Authors: Tingwei Meng, Siting Liu, Samy Wu Fung, Stanley Osher

Abstract: Hamilton-Jacobi partial differential equations (HJ PDEs) play a central role in many applications such as economics, physics, and engineering. These equations describe the evolution of a value function which encodes valuable information about the system, such as action, cost, or level sets of a dynamic process. Their importance lies in their ability to model diverse phenomena, ranging from the propagation of fronts in computational physics to optimal decision-making in control systems. This paper provides a review of some recent advances in numerical methods to address challenges such as high-dimensionality, nonlinearity, and computational efficiency. By examining these developments, this paper sheds light on important techniques and emerging directions in the numerical solution of HJ PDEs.

Submitted 28 February, 2025; originally announced February 2025.

arXiv:2502.16773 [pdf, ps, other]

Splitting Regularized Wasserstein Proximal Algorithms for Nonsmooth Sampling Problems

Authors: Fuqun Han, Stanley Osher, Wuchen Li

Abstract: Sampling from nonsmooth target probability distributions is essential in various applications, including the Bayesian Lasso. We propose a splitting-based sampling algorithm for the time-implicit discretization of the probability flow for the Fokker-Planck equation, where the score function, defined as the gradient of the logarithm of the current probability density function, is approximated by the regularized Wasserstein proximal. When the prior distribution is the Laplace prior, our algorithm is explicitly formulated as a deterministic interacting particle system, incorporating softmax operators and shrinkage operations to efficiently compute the gradient drift vector field and the score function. The proposed formulation introduces a particular class of attention layers in transformer structures, which can sample sparse target distributions. We verify the convergence towards target distributions regarding Rényi divergences under suitable conditions. Numerical experiments in high-dimensional nonsmooth sampling problems, such as sampling from mixed Gaussian and Laplace distributions, logistic regressions, image restoration with L1-TV regularization, and Bayesian neural networks, demonstrate the efficiency and robust performance of the proposed method.

Submitted 23 February, 2025; originally announced February 2025.

arXiv:2502.06026 [pdf, other]

A Multimodal PDE Foundation Model for Prediction and Scientific Text Descriptions

Authors: Elisa Negrini, Yuxuan Liu, Liu Yang, Stanley J. Osher, Hayden Schaeffer

Abstract: Neural networks are one tool for approximating non-linear differential equations used in scientific computing tasks such as surrogate modeling, real-time predictions, and optimal control. PDE foundation models utilize neural networks to train approximations to multiple differential equations simultaneously and are thus a general purpose solver that can be adapted to downstream tasks. Current PDE foundation models focus on either learning general solution operators and/or the governing system of equations, and thus only handle numerical or symbolic modalities. However, real-world applications may require more flexible data modalities, e.g. text analysis or descriptive outputs. To address this gap, we propose a novel multimodal deep learning approach that leverages a transformer-based architecture to approximate solution operators for a wide variety of ODEs and PDEs. Our method integrates numerical inputs, such as equation parameters and initial conditions, with text descriptions of physical processes or system dynamics. This enables our model to handle settings where symbolic representations may be incomplete or unavailable. In addition to providing accurate numerical predictions, our approach generates interpretable scientific text descriptions, offering deeper insights into the underlying dynamics and solution properties.
The numerical experiments show that our model provides accurate solutions for in-distribution data (with average relative error less than 3.3%) and out-of-distribution data (average relative error less than 7.8%) together with precise text descriptions (with correct descriptions generated 100% of the time). In certain tests, the model is also shown to be capable of extrapolating solutions in time.

Submitted 9 February, 2025; originally announced February 2025.

arXiv:2501.19351 [pdf, other]

Neural Implicit Solution Formula for Efficiently Solving Hamilton-Jacobi Equations

Authors: Yesom Park, Stanley Osher

Abstract: This paper presents an implicit solution formula for the Hamilton-Jacobi partial differential equation (HJ PDE). The formula is derived using the method of characteristics and is shown to coincide with the Hopf and Lax formulas in the case where either the Hamiltonian or the initial function is convex. It provides a simple and efficient numerical approach for computing the viscosity solution of HJ PDEs, bypassing the need for the Legendre transform of the Hamiltonian or the initial condition, and the explicit computation of individual characteristic trajectories. A deep learning-based methodology is proposed to learn this implicit solution formula, leveraging the mesh-free nature of deep learning to ensure scalability for high-dimensional problems. Building upon this framework, an algorithm is developed that approximates the characteristic curves piecewise linearly for state-dependent Hamiltonians. Extensive experimental results demonstrate that the proposed method delivers highly accurate solutions, even for nonconvex Hamiltonians, and exhibits remarkable scalability, achieving computational efficiency for problems up to 40 dimensions.

Submitted 31 January, 2025; originally announced January 2025.
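
For readers unfamiliar with the two classical formulas named in the abstract above, they can be stated as follows. This is a reference sketch, not the paper's implicit formula: it assumes a state-independent Hamiltonian, and the symbols $u$, $H$, $g$, and the Legendre transforms $H^*$, $g^*$ are notation chosen here.

```latex
\begin{aligned}
&\partial_t u + H(\nabla u) = 0, \qquad u(x,0) = g(x), \\[4pt]
&\text{Lax formula ($H$ convex):} \quad
  u(x,t) = \min_{y}\Big\{\, g(y) + t\,H^{*}\!\Big(\frac{x-y}{t}\Big) \Big\}, \\[4pt]
&\text{Hopf formula ($g$ convex):} \quad
  u(x,t) = \sup_{p}\Big\{\, \langle p, x\rangle - g^{*}(p) - t\,H(p) \Big\}.
\end{aligned}
```

Both formulas require a Legendre transform ($H^*$ in the first, $g^*$ in the second), which is the step the abstract says the implicit formula avoids.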

arXiv:2501.18793 [pdf, other]

OT-Transformer: A Continuous-time Transformer Architecture with Optimal Transport Regularization

Authors: Kelvin Kan, Xingjian Li, Stanley Osher

Abstract: Transformers have achieved state-of-the-art performance in numerous tasks. In this paper, we propose a continuous-time formulation of transformers. Specifically, we consider a dynamical system whose governing equation is parametrized by transformer blocks. We leverage optimal transport theory to regularize the training problem, which enhances stability in training and improves generalization of the resulting model. Moreover, we demonstrate in theory that this regularization is necessary as it promotes uniqueness and regularity of solutions. Our model is flexible in that almost any existing transformer architecture can be adopted to construct the dynamical system with only slight modifications to the existing code. We perform extensive numerical experiments on tasks motivated by natural language processing, image classification, and point cloud classification. Our experimental results show that the proposed method improves the performance of its discrete counterpart and outperforms relevant comparison models.

Submitted 30 January, 2025; originally announced January 2025.

arXiv:2501.15106 [pdf, other]

In-Context Operator Learning for Linear Propagator Models

Authors: Tingwei Meng, Moritz Voß, Nils Detering, Giulio Farolfi, Stanley Osher, Georg Menz

Abstract: We study operator learning in the context of linear propagator models for optimal order execution problems with transient price impact à la Bouchaud et al. (2004) and Gatheral (2010). Transient price impact persists and decays over time according to some propagator kernel. Specifically, we propose to use In-Context Operator Networks (ICON), a novel transformer-based neural network architecture introduced by Yang et al. (2023), which facilitates data-driven learning of operators by merging offline pre-training with an online few-shot prompting inference. First, we train ICON to learn the operator from various propagator models that maps the trading rate to the induced transient price impact. The inference step is then based on in-context prediction, where ICON is presented only with a few examples. We illustrate that ICON is capable of accurately inferring the underlying price impact model from the data prompts, even with propagator kernels not seen in the training data. In a second step, we employ the pre-trained ICON model provided with context as a surrogate operator in solving an optimal order execution problem via a neural network control policy, and demonstrate that the exact optimal execution strategies from Abi Jaber and Neuman (2022) for the models generating the context are correctly retrieved.
Our introduced methodology is very general, offering a new approach to solving optimal stochastic control problems with unknown state dynamics, inferred data-efficiently from a limited number of examples by leveraging the few-shot and transfer learning capabilities of transformer networks.

Submitted 25 January, 2025; originally announced January 2025.

Comments: 25 pages, 10 figures

MSC Class: 93E20; 91G60; 68T07

arXiv:2412.11485 [pdf, ps, other]

Inexact Proximal Point Algorithms for Zeroth-Order Global Optimization

Authors: Minxin Zhang, Fuqun Han, Yat Tin Chow, Stanley Osher, Hayden Schaeffer

Abstract: This work concerns the zeroth-order global minimization of continuous nonconvex functions with a unique global minimizer and possibly multiple local minimizers. We formulate a theoretical framework for inexact proximal point (IPP) methods for global optimization, establishing convergence guarantees under mild assumptions when either deterministic or stochastic estimates of proximal operators are used. The quadratic regularization in the proximal operator and the scaling effect of a parameter $δ>0$ create a concentrated landscape of an associated Gibbs measure that is practically effective for sampling. The convergence of the expectation under the Gibbs measure as $δ\to 0^+$ is established, and the convergence rate of $\mathcal O(δ)$ is derived under additional assumptions. These results provide a theoretical foundation for evaluating proximal operators inexactly using sampling-based methods such as Monte Carlo (MC) integration. In addition, we propose a new approach based on tensor train (TT) approximation. This approach employs a randomized TT cross algorithm to efficiently construct a low-rank TT approximation of a discretized function using a small number of function evaluations, and we provide an error analysis for the TT-based estimation. We then propose two practical IPP algorithms, TT-IPP and MC-IPP. The TT-IPP algorithm leverages TT estimates of the proximal operators, while the MC-IPP algorithm employs MC integration to estimate the proximal operators.
Both algorithms are designed to adaptively balance efficiency and accuracy in inexact evaluations of proximal operators. The effectiveness of the two algorithms is demonstrated through experiments on diverse benchmark functions and various applications.

Submitted 2 June, 2025; v1 submitted 16 December, 2024; originally announced December 2024.

MSC Class: 49M37; 65K05; 90C26; 90C56
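
The idea in the abstract above of estimating a proximal operator as an expectation under a Gibbs measure can be illustrated with a small Monte Carlo sketch. It is not the paper's MC-IPP algorithm. It assumes a one-dimensional problem and a Gibbs density proportional to exp(-(f(y) + (y - x)^2 / (2t)) / delta), and the function name `mc_prox` and its parameters are hypothetical. Only function values of `f` are used, in keeping with the zeroth-order setting.

```python
import math
import random


def mc_prox(f, x, t, delta, n_samples, rng):
    """Monte Carlo estimate of prox_{t f}(x) in 1D (illustrative assumption).

    The estimate is the mean of the Gibbs measure with density proportional
    to exp(-(f(y) + (y - x)**2 / (2 * t)) / delta).
    """
    # The quadratic term alone is the Gaussian N(x, delta * t), so draw the
    # samples from it and reweight each one by exp(-f(y) / delta).
    std = math.sqrt(delta * t)
    ys = [rng.gauss(x, std) for _ in range(n_samples)]
    log_w = [-f(y) / delta for y in ys]
    # Subtract the largest log-weight before exponentiating to avoid underflow.
    shift = max(log_w)
    w = [math.exp(lw - shift) for lw in log_w]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)


# Check on f(y) = y^2 / 2, whose exact proximal point is x / (1 + t).
rng = random.Random(0)
estimate = mc_prox(lambda y: 0.5 * y * y, x=2.0, t=1.0, delta=0.5,
                   n_samples=20000, rng=rng)
exact = 2.0 / (1.0 + 1.0)
```

For this quadratic test function the Gibbs measure is itself Gaussian with mean x / (1 + t), so the estimate should be close to `exact` for any delta. For nonconvex functions the mean only approaches the proximal point as delta shrinks, which is the regime the abstract's convergence results address.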

arXiv:2411.16063 [pdf, other]

VICON: Vision In-Context Operator Networks for Multi-Physics Fluid Dynamics Prediction

Authors: Yadi Cao, Yuxuan Liu, Liu Yang, Rose Yu, Hayden Schaeffer, Stanley Osher

Abstract: In-Context Operator Networks (ICONs) have demonstrated the ability to learn operators across diverse partial differential equations using few-shot, in-context learning. However, existing ICONs process each spatial point as an individual token, severely limiting computational efficiency when handling dense data in higher spatial dimensions. We propose Vision In-Context Operator Networks (VICON), which integrates vision transformer architectures to efficiently process 2D data through patch-wise operations while preserving ICON's adaptability to multiphysics systems and varying timesteps. Evaluated across three fluid dynamics benchmarks, VICON significantly outperforms state-of-the-art baselines: DPOT and MPP, reducing the averaged last-step rollout error by 37.9% compared to DPOT and 44.7% compared to MPP, while requiring only 72.5% and 34.8% of their respective inference times. VICON naturally supports flexible rollout strategies with varying timestep strides, enabling immediate deployment in imperfect measurement systems where sampling frequencies may differ or frames might be dropped - common challenges in real-world settings - without requiring retraining or interpolation. In these realistic scenarios, VICON exhibits remarkable robustness, experiencing only 24.41% relative performance degradation compared to 71.37%-74.49% degradation in baseline methods, demonstrating its versatility for deploying in realistic applications. Our scripts for processing datasets and code are publicly available at https://github.com/Eydcao/VICON.

Submitted 19 May, 2025; v1 submitted 24 November, 2024; originally announced November 2024.

Comments: Added one more baseline and one more experiment setup (performance for temporal measurements with dropped frames); updated to NeurIPS format. Refined writing and presentation

arXiv:2411.06278 [pdf, ps, other]

A Natural Primal-Dual Hybrid Gradient Method for Adversarial Neural Network Training on Solving Partial Differential Equations

Authors: Shu Liu, Stanley Osher, Wuchen Li

Abstract: We propose a scalable preconditioned primal-dual hybrid gradient algorithm for solving partial differential equations (PDEs). We multiply the PDE with a dual test function to obtain an inf-sup problem whose loss functional involves lower-order differential operators. The Primal-Dual Hybrid Gradient (PDHG) algorithm is then leveraged for this saddle point problem. By introducing suitable precondition operators to the proximal steps in the PDHG algorithm, we obtain an alternative natural gradient ascent-descent optimization scheme for updating the neural network parameters. We apply the Krylov subspace method (MINRES) to evaluate the natural gradients efficiently. Such treatment readily handles the inversion of precondition matrices via matrix-vector multiplication. A posterior convergence analysis is established for the time-continuous version of the proposed method. The algorithm is tested on various types of PDEs with dimensions ranging from $1$ to $50$, including linear and nonlinear elliptic equations, reaction-diffusion equations, and Monge-Ampère equations stemming from the $L^2$ optimal transport problems. We compare the performance of the proposed method with several commonly used deep learning algorithms such as physics-informed neural networks (PINNs), the DeepRitz method, weak adversarial networks (WANs), etc., for solving PDEs using the Adam and L-BFGS optimizers. The numerical results suggest that the proposed method performs efficiently and robustly and converges more stably.

Submitted 24 December, 2024; v1 submitted 9 November, 2024; originally announced November 2024.

Comments: Several typos have been corrected. We welcome your comments and suggestions

arXiv:2411.02890 [pdf, other]

doi 10.1117/12.917234

Fried deconvolution

Authors: Jerome Gilles, Stanley Osher

Abstract: In this paper we present a new approach to deblur the effect of atmospheric turbulence in the case of long range imaging. Our method is based on an analytical formulation, the Fried kernel, of the atmosphere modulation transfer function (MTF) and a framelet based deconvolution algorithm. An important parameter is the refractive index structure which requires specific measurements to be known. Then we propose a method which provides a good estimation of this parameter from the input blurred image. The final algorithms are very easy to implement and show very good results on both simulated blur and real images.

Submitted 5 November, 2024; originally announced November 2024.

Journal ref: SPIE Defense, Security and Sensing conference, Baltimore, USA, Proceedings Volume 8355, Infrared Imaging Systems: Design, Analysis, Modeling, and Testing XXIII; 83550G, April 2012

arXiv:2410.23533 [pdf, other]

doi 10.1137/130923774

2D Empirical Transforms. Wavelets, Ridgelets and Curvelets revisited

Authors: Jerome Gilles, Giang Tran, Stanley Osher

Abstract: A recently developed approach, called "Empirical Wavelet Transform", aims to build 1D adaptive wavelet frames according to the analyzed signal. In this paper, we present several extensions of this approach to 2D signals (images). We revisit some well-known transforms (tensor wavelets, Littlewood-Paley wavelets, ridgelets and curvelets) and show that it is possible to build their empirical counterpart. We prove that such constructions lead to different adaptive frames which show some promising properties for image analysis and processing.

Submitted 30 October, 2024; originally announced October 2024.

Journal ref: SIAM Journal on Imaging Sciences, Vol.7, No.1, 157--186, January 2014

arXiv:2410.22802 [pdf, other]

doi 10.1117/1.JEI.25.3.033003

Wavelet Burst Accumulation for turbulence mitigation

Authors: Jerome Gilles, Stanley Osher

Abstract: In this paper, we investigate the extension of the recently proposed weighted Fourier burst accumulation (FBA) method into the wavelet domain. The purpose of FBA is to reconstruct a clean and sharp image from a sequence of blurred frames. This concept lies in the construction of weights to amplify dominant frequencies in the Fourier spectrum of each frame. The reconstructed image is then obtained by taking the inverse Fourier transform of the average of all processed spectra. In this paper, we first suggest replacing the rigid registration step used in the original algorithm by a non-rigid registration in order to be able to process sequences acquired through atmospheric turbulence. Second, we propose to work in a wavelet domain instead of the Fourier one. This leads us to the construction of two types of algorithms. Finally, we propose an alternative approach to replace the weighting idea by an approach promoting the sparsity in the used space. Several experiments are provided to illustrate the efficiency of the proposed methods.

Submitted 30 October, 2024; originally announced October 2024.

Journal ref: Journal of Electronic Imaging, Vol.25, No.3, 033003-1--033003-9, May 2016

arXiv:2410.22777 [pdf, ps, other]

Bregman implementation of Meyer's $G$-norm for cartoon + textures decomposition

Authors: Jerome Gilles, Stanley Osher

Abstract: In this paper, we design a very simple algorithm based on Split Bregman iterations to numerically solve the cartoon + textures decomposition model of Meyer. This results in a significant gain in speed compared to Chambolle's nonlinear projectors.

Submitted 30 October, 2024; originally announced October 2024.

arXiv:2410.08987 [pdf, other]

Gradient-adjusted underdamped Langevin dynamics for sampling

Authors: Xinzhe Zuo, Stanley Osher, Wuchen Li

Abstract: Sampling from a target distribution is a fundamental problem. Traditional Markov chain Monte Carlo (MCMC) algorithms, such as the unadjusted Langevin algorithm (ULA), derived from the overdamped Langevin dynamics, have been extensively studied. From an optimization perspective, the Kolmogorov forward equation of the overdamped Langevin dynamics can be treated as the gradient flow of the relative entropy in the space of probability densities embedded with Wasserstein-2 metrics. Several efforts have also been devoted to including momentum-based methods, such as underdamped Langevin dynamics for faster convergence of sampling algorithms. Recent advances in optimization have demonstrated the effectiveness of primal-dual damping and Hessian-driven damping dynamics for achieving faster convergence in solving optimization problems. Motivated by these developments, we introduce a class of stochastic differential equations (SDEs) called gradient-adjusted underdamped Langevin dynamics (GAUL), which add stochastic perturbations in primal-dual damping dynamics and Hessian-driven damping dynamics from optimization. We prove that GAUL admits the correct stationary distribution, whose marginal is the target distribution. The proposed method outperforms overdamped and underdamped Langevin dynamics regarding convergence speed in the total variation distance for Gaussian target distributions.
Moreover, using the Euler-Maruyama discretization, we show that the mixing time towards a biased target distribution only depends on the square root of the condition number of the target covariance matrix. Numerical experiments for non-Gaussian target distributions, such as Bayesian regression problems and Bayesian neural networks, further illustrate the advantages of our approach.

Submitted 26 October, 2024; v1 submitted 11 October, 2024; originally announced October 2024.

Comments: added references, discussion on preconditioner
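
For readers unfamiliar with the baseline named in this abstract, the following is a minimal sketch of the unadjusted Langevin algorithm (ULA) in one dimension, using only the Python standard library. It is not the GAUL method and not the authors' code. The function names (`ula_step`, `run_ula`, `sample_variance`), the step size, and the standard Gaussian target are all illustrative assumptions. For the potential V(x) = x^2/2, the ULA iterates have stationary variance 1/(1 - h/2) rather than 1, which is one concrete instance of the "biased target distribution" the abstract refers to.

```python
import math
import random


def ula_step(x, grad_potential, step, noise):
    """One ULA update: x - h * grad V(x) + sqrt(2h) * noise."""
    return x - step * grad_potential(x) + math.sqrt(2.0 * step) * noise


def run_ula(grad_potential, step, n_steps, n_particles, seed=0):
    """Evolve independent particles, all started at 0, for n_steps ULA updates."""
    rng = random.Random(seed)
    particles = [0.0] * n_particles
    for _ in range(n_steps):
        particles = [
            ula_step(x, grad_potential, step, rng.gauss(0.0, 1.0))
            for x in particles
        ]
    return particles


def sample_variance(xs):
    """Unbiased sample variance of a list of floats."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)


if __name__ == "__main__":
    h = 0.1  # illustrative step size (assumption)
    # Target: standard Gaussian, V(x) = x^2 / 2, so grad V(x) = x.
    samples = run_ula(lambda x: x, h, n_steps=200, n_particles=2000)
    print("empirical variance:", sample_variance(samples))
    print("ULA stationary variance 1/(1 - h/2):", 1.0 / (1.0 - h / 2.0))
```

The gap between the printed stationary variance and 1 shrinks as the step size h decreases, at the price of slower mixing per step.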

arXiv:2409.16471 [pdf, other]

Score-based Neural Ordinary Differential Equations for Computing Mean Field Control Problems

Authors: Mo Zhou, Stanley Osher, Wuchen Li

Abstract: Classical neural ordinary differential equations (ODEs) are powerful tools for approximating the log-density functions in high-dimensional spaces along trajectories, where neural networks parameterize the velocity fields. This paper proposes a system of neural differential equations representing first- and second-order score functions along trajectories based on deep neural networks. We reformulate the mean field control (MFC) problem with individual noises into an unconstrained optimization problem framed by the proposed neural ODE system. Additionally, we introduce a novel regularization term to enforce characteristics of viscous Hamilton--Jacobi--Bellman (HJB) equations to be satisfied based on the evolution of the second-order score function. Examples include regularized Wasserstein proximal operators (RWPOs), probability flow matching of Fokker--Planck (FP) equations, and linear quadratic (LQ) MFC problems, which demonstrate the effectiveness and accuracy of the proposed method.

Submitted 29 January, 2025; v1 submitted 24 September, 2024; originally announced September 2024.

MSC Class: 34H05 ACM Class: G.1.7

arXiv:2409.01567 [pdf, ps, other]

Convergence of Noise-Free Sampling Algorithms with Regularized Wasserstein Proximals

Authors: Fuqun Han, Stanley Osher, Wuchen Li

Abstract: In this work, we investigate the convergence properties of the backward regularized Wasserstein proximal (BRWP) method for sampling a target distribution. The BRWP approach can be shown as a semi-implicit time discretization for a probability flow ODE with the score function whose density satisfies the Fokker-Planck equation of the overdamped Langevin dynamics. Specifically, the evolution of the score function is computed using a kernel formula derived from the regularized Wasserstein proximal operator. By applying the Laplace method to obtain the asymptotic expansion of this kernel formula, we establish guaranteed convergence in terms of the Kullback-Leibler divergence for the BRWP method towards a strongly log-concave target distribution. Our analysis also identifies the optimal and maximum step sizes for convergence. Furthermore, we demonstrate that the deterministic and semi-implicit BRWP scheme outperforms many classical Langevin Monte Carlo methods, such as the Unadjusted Langevin Algorithm (ULA), by offering faster convergence and reduced bias. Numerical experiments further validate the convergence analysis of the BRWP method.

Submitted 2 September, 2024; originally announced September 2024.

arXiv:2408.03532 [pdf, other]

Fast Partial Fourier Transforms for Large-Scale Ptychography

Authors: Ricardo Parada, Samy Wu Fung, Stanley Osher

Abstract: Ptychography is a popular imaging technique that combines diffractive imaging with scanning microscopy. The technique consists of a coherent beam that is scanned across an object in a series of overlapping positions, leading to reliable and improved reconstructions. Ptychographic microscopes allow for large fields to be imaged at high resolution at additional computational expense. In this work, we explore the use of the fast Partial Fourier Transforms (PFTs), which efficiently compute Fourier coefficients corresponding to low frequencies. The core idea is to use the PFT in a plug-and-play manner to warm-start existing ptychography algorithms such as the ptychographic iterative engine (PIE). This approach reduces the computational budget required to solve the ptychography problem. Our numerical results show that our scheme accelerates the convergence of traditional solvers without sacrificing quality of reconstruction.

Submitted 7 August, 2024; originally announced August 2024.

arXiv:2406.13781 [pdf, other]

A Primal-Dual Framework for Transformers and Neural Networks

Authors: Tan M. Nguyen, Tam Nguyen, Nhat Ho, Andrea L. Bertozzi, Richard G. Baraniuk, Stanley J. Osher

Abstract: Self-attention is key to the remarkable success of transformers in sequence modeling tasks including many applications in natural language processing and computer vision. Like neural network layers, these attention mechanisms are often developed by heuristics and experience. To provide a principled framework for constructing attention layers in transformers, we show that the self-attention corresponds to the support vector expansion derived from a support vector regression problem, whose primal formulation has the form of a neural network layer. Using our framework, we derive popular attention layers used in practice and propose two new attentions: 1) the Batch Normalized Attention (Attention-BN) derived from the batch normalization layer and 2) the Attention with Scaled Head (Attention-SH) derived from using less training data to fit the SVR model. We empirically demonstrate the advantages of the Attention-BN and Attention-SH in reducing head redundancy, increasing the model's accuracy, and improving the model's efficiency in a variety of practical applications including image and time-series classification.

Submitted 19 June, 2024; originally announced June 2024.

Comments: Accepted to ICLR 2023, 26 pages, 4 figures, 14 tables

arXiv:2406.02003 [pdf, other]

Laplace Meets Moreau: Smooth Approximation to Infimal Convolutions Using Laplace's Method

Authors: Ryan J. Tibshirani, Samy Wu Fung, Howard Heaton, Stanley Osher

Abstract: We study approximations to the Moreau envelope -- and infimal convolutions more broadly -- based on Laplace's method, a classical tool in analysis which ties certain integrals to suprema of their integrands. We believe the connection between Laplace's method and infimal convolutions is generally deserving of more attention in the study of optimization and partial differential equations, since it bears numerous potentially important applications, from proximal-type algorithms to solving Hamilton-Jacobi equations.

Submitted 4 June, 2024; originally announced June 2024.
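
The idea in this abstract, that an integral can stand in for a minimum, can be illustrated with a small one-dimensional sketch in standard-library Python. It is not taken from the paper: the function names, the grid bounds, and in particular the Gaussian normalization constant sqrt(2*pi*t*delta) are assumptions made for this illustration. The Moreau envelope of f with parameter t is min over y of f(y) + (x - y)^2 / (2t); the sketch replaces the minimum with -delta times the log of a normalized integral of exp(-(f(y) + (x - y)^2 / (2t)) / delta), and compares against the exact envelope of f(y) = |y|, which is the Huber function.

```python
import math


def moreau_envelope_laplace(f, x, t, delta, lo=-10.0, hi=10.0, n=20001):
    """Soft-min approximation of the Moreau envelope via a Riemann sum.

    The grid [lo, hi] and the normalization sqrt(2*pi*t*delta) are
    illustrative assumptions, not a prescription from the paper.
    """
    dy = (hi - lo) / (n - 1)
    exponents = []
    for i in range(n):
        y = lo + i * dy
        exponents.append(-(f(y) + (x - y) ** 2 / (2.0 * t)) / delta)
    # Log-sum-exp shift keeps the exponentials from underflowing to zero.
    m = max(exponents)
    log_integral = m + math.log(sum(math.exp(e - m) for e in exponents) * dy)
    log_norm = 0.5 * math.log(2.0 * math.pi * t * delta)
    return -delta * (log_integral - log_norm)


def moreau_envelope_abs(x, t):
    """Exact Moreau envelope of f(y) = |y| (the Huber function)."""
    return x * x / (2.0 * t) if abs(x) <= t else abs(x) - t / 2.0


if __name__ == "__main__":
    for x in (2.0, 0.5):
        approx = moreau_envelope_laplace(abs, x, t=1.0, delta=0.01)
        print(x, approx, moreau_envelope_abs(x, 1.0))
```

In this sketch the approximation is very accurate where the minimizer sits at a smooth point of the objective (x = 2) and visibly less accurate where it sits at the kink of |y| (x = 0.5); shrinking delta reduces that gap.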

arXiv:2405.10922 [pdf, other]

Kernel Expansions for High-Dimensional Mean-Field Control with Non-local Interactions

Authors: Alexander Vidal, Samy Wu Fung, Stanley Osher, Luis Tenorio, Levon Nurbekyan

Abstract: Mean-field control (MFC) problems aim to find the optimal policy to control massive populations of interacting agents. These problems are crucial in areas such as economics, physics, and biology. We consider the non-local setting, where the interactions between agents are governed by a suitable kernel. For $N$ agents, the interaction cost has $\mathcal{O}(N^2)$ complexity, which can be prohibitively slow to evaluate and differentiate when $N$ is large. To this end, we propose an efficient primal-dual algorithm that utilizes basis expansions of the kernels. The basis expansions reduce the cost of computing the interactions, while the primal-dual methodology decouples the agents at the expense of solving for a moderate number of dual variables. We also demonstrate that our approach can further be structured in a multi-resolution manner, where we estimate optimal dual variables using a moderate $N$ and solve decoupled trajectory optimization problems for large $N$. We illustrate the effectiveness of our method on an optimal control of 5000 interacting quadrotors.

Submitted 17 May, 2024; originally announced May 2024.
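
The cost reduction from a basis expansion described in this abstract can be seen in a toy example. The sketch below is not the paper's algorithm: the kernel K(x, y) = cos(x - y), the two-term basis {cos, sin}, and the function names are assumptions chosen because the expansion cos(x - y) = cos x cos y + sin x sin y is exact. With it, the double sum over all agent pairs collapses to two single sums.

```python
import math


def interaction_direct(xs):
    """O(N^2) double sum of K(x_i, x_j) = cos(x_i - x_j) over all pairs."""
    return sum(math.cos(xi - xj) for xi in xs for xj in xs)


def interaction_expanded(xs):
    """O(N) evaluation of the same sum through the basis {cos, sin}.

    sum_{i,j} cos(x_i - x_j) = (sum_i cos x_i)^2 + (sum_i sin x_i)^2.
    """
    c = sum(math.cos(x) for x in xs)
    s = sum(math.sin(x) for x in xs)
    return c * c + s * s


if __name__ == "__main__":
    agents = [0.1 * i for i in range(200)]  # hypothetical 1D agent positions
    print(interaction_direct(agents))
    print(interaction_expanded(agents))
```

For a general kernel the expansion is only approximate and uses more basis functions, so the cost scales with N times the number of basis terms instead of N squared.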

arXiv:2404.01586 [pdf, other]

Efficient Computation of Mean field Control based Barycenters from Reaction-Diffusion Systems

Authors: Arjun Vijaywargiya, Guosheng Fu, Stanley Osher, Wuchen Li

Abstract: We develop a class of barycenter problems based on mean field control problems in three dimensions with associated reactive-diffusion systems of unnormalized multi-species densities. This problem is the generalization of the Wasserstein barycenter problem for single probability density functions. The primary objective is to present a comprehensive framework for efficiently computing the proposed variational problem: generalized Benamou-Brenier formulas with multiple input density vectors as boundary conditions. Our approach involves the utilization of high-order finite element discretizations of the spacetime domain to achieve improved accuracy. The discrete optimization problem is then solved using the primal-dual hybrid gradient (PDHG) algorithm, a first-order optimization method for effectively addressing a wide range of constrained optimization problems. The efficacy and robustness of our proposed framework are illustrated through several numerical examples in three dimensions, such as the computation of the barycenter of multi-density systems consisting of Gaussian distributions and reactive-diffusive multi-density systems involving 3D voxel densities. Additional examples highlighting computations on 2D embedded surfaces are also provided.

Submitted 1 April, 2024; originally announced April 2024.

arXiv:2403.02468 [pdf, other]

A Primal-dual hybrid gradient method for solving optimal control problems and the corresponding Hamilton-Jacobi PDEs

Authors: Tingwei Meng, Siting Liu, Wuchen Li, Stanley Osher

Abstract: Optimal control problems are crucial in various domains, including path planning, robotics, and humanoid control, demonstrating their broad applicability. The connection between optimal control and Hamilton-Jacobi (HJ) partial differential equations (PDEs) underscores the need for solving HJ PDEs to address these control problems effectively. While numerous numerical methods exist for tackling HJ PDEs across different dimensions, this paper introduces an innovative optimization-based approach that reformulates optimal control problems and HJ PDEs into a saddle point problem using a Lagrange multiplier. Our method, based on the preconditioned primal-dual hybrid gradient (PDHG) method, offers a solution to HJ PDEs with first-order accuracy and numerical unconditional stability, enabling larger time steps and avoiding the limitations of explicit time discretization methods. Our approach can handle a wide variety of Hamiltonian functions, including those that are non-smooth and dependent on time and space, through a simplified saddle point formulation that facilitates easy and parallelizable updates. Furthermore, our framework extends to viscous HJ PDEs and stochastic optimal control problems, showcasing its versatility. Through a series of numerical examples, we demonstrate the method's effectiveness in managing diverse Hamiltonians and achieving efficient parallel computation, highlighting its potential for wide-ranging applications in optimal control and beyond.

Submitted 4 March, 2024; originally announced March 2024.

arXiv:2402.17745 [pdf, other]

Low-light phase retrieval with implicit generative priors

Authors: Raunak Manekar, Elisa Negrini, Minh Pham, Daniel Jacobs, Jaideep Srivastava, Stanley J. Osher, Jianwei Miao

Abstract: Phase retrieval (PR) is fundamentally important in scientific imaging and is crucial for nanoscale techniques like coherent diffractive imaging (CDI). Low radiation dose imaging is essential for applications involving radiation-sensitive samples. However, most PR methods struggle in low-dose scenarios due to high shot noise. Recent advancements in optical data acquisition setups, such as in-situ CDI, have shown promise for low-dose imaging, but they rely on a time series of measurements, making them unsuitable for single-image applications. Similarly, data-driven phase retrieval techniques are not easily adaptable to data-scarce situations. Zero-shot deep learning methods based on pre-trained and implicit generative priors have been effective in various imaging tasks but have shown limited success in PR. In this work, we propose low-dose deep image prior (LoDIP), which combines in-situ CDI with the power of implicit generative priors to address single-image low-dose phase retrieval. Quantitative evaluations demonstrate LoDIP's superior performance in this task and its applicability to real experimental scenarios.

Submitted 23 August, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

MSC Class: 68T10 68T07 78A46

arXiv:2402.16821 [pdf, other]

Numerical Analysis on Neural Network Projected Schemes for Approximating One Dimensional Wasserstein Gradient Flows

Authors: Xinzhe Zuo, Jiaxi Zhao, Shu Liu, Stanley Osher, Wuchen Li

Abstract: We provide a numerical analysis and computation of neural network projected schemes for approximating one dimensional Wasserstein gradient flows. We approximate the Lagrangian mapping functions of gradient flows by the class of two-layer neural network functions with ReLU (rectified linear unit) activation functions. The numerical scheme is based on a projected gradient method, namely the Wasserstein natural gradient, where the projection is constructed from the $L^2$ mapping spaces onto the neural network parameterized mapping space. We establish theoretical guarantees for the performance of the neural projected dynamics. We derive a closed-form update for the scheme with well-posedness and explicit consistency guarantee for a particular choice of network structure. General truncation error analysis is also established on the basis of the projective nature of the dynamics. Numerical examples, including gradient drift Fokker-Planck equations, porous medium equations, and Keller-Segel models, verify the accuracy and effectiveness of the proposed neural projected algorithm.

Submitted 26 February, 2024; originally announced February 2024.

arXiv:2402.06162 [pdf, other]

Wasserstein proximal operators describe score-based generative models and resolve memorization

Authors: Benjamin J. Zhang, Siting Liu, Wuchen Li, Markos A. Katsoulakis, Stanley J. Osher

Abstract: We focus on the fundamental mathematical structure of score-based generative models (SGMs). We first formulate SGMs in terms of the Wasserstein proximal operator (WPO) and demonstrate that, via mean-field games (MFGs), the WPO formulation reveals mathematical structure that describes the inductive bias of diffusion and score-based models. In particular, MFGs yield optimality conditions in the form of a pair of coupled partial differential equations: a forward-controlled Fokker-Planck (FP) equation, and a backward Hamilton-Jacobi-Bellman (HJB) equation. Via a Cole-Hopf transformation and taking advantage of the fact that the cross-entropy can be related to a linear functional of the density, we show that the HJB equation is an uncontrolled FP equation. Second, with the mathematical structure at hand, we present an interpretable kernel-based model for the score function which dramatically improves the performance of SGMs in terms of training samples and training time. In addition, the WPO-informed kernel model is explicitly constructed to avoid the recently studied memorization effects of score-based generative models. The mathematical form of the new kernel-based models in combination with the use of the terminal condition of the MFG reveals new explanations for the manifold learning and generalization properties of SGMs, and provides a resolution to their memorization effects. Finally, our mathematically informed, interpretable kernel-based model suggests new scalable bespoke neural network architectures for high-dimensional applications.

Submitted 8 February, 2024; originally announced February 2024.

arXiv:2401.14602 [pdf, other]

Numerical analysis of a first-order computational algorithm for reaction-diffusion equations via the primal-dual hybrid gradient method

Authors: Shu Liu, Xinzhe Zuo, Stanley Osher, Wuchen Li

Abstract: In arXiv:2305.03945 [math.NA], a first-order optimization algorithm has been introduced to solve time-implicit schemes of reaction-diffusion equations. In this research, we conduct theoretical studies on this first-order algorithm equipped with a quadratic regularization term. We provide sufficient conditions under which the proposed algorithm and its time-continuous limit converge exponentially fast to a desired time-implicit numerical solution. We show both theoretically and numerically that the convergence rate is independent of the grid size, which makes our method suitable for large-scale problems. The efficiency of our algorithm has been verified via a series of numerical examples conducted on various types of reaction-diffusion equations. The choice of optimal hyperparameters as well as comparisons with some classical root-finding algorithms are also discussed in the numerical section.

Submitted 28 March, 2025; v1 submitted 25 January, 2024; originally announced January 2024.

Comments: Revised version, comments and suggestions are welcome

arXiv:2401.13125 [pdf, ps, other]

Tensor train based sampling algorithms for approximating regularized Wasserstein proximal operators

Authors: Fuqun Han, Stanley Osher, Wuchen Li

Abstract: We present a tensor train (TT) based algorithm designed for sampling from a target distribution and employ TT approximation to capture the high-dimensional probability density evolution of overdamped Langevin dynamics. This involves utilizing the regularized Wasserstein proximal operator, which exhibits a simple kernel integration formulation, i.e., the softmax formula of the traditional proximal operator. The integration, performed in $\mathbb{R}^d$, poses a challenge in practical scenarios, making the algorithm practically implementable only with the aid of TT approximation. In the specific context of Gaussian distributions, we rigorously establish the unbiasedness and linear convergence of our sampling algorithm towards the target distribution. To assess the effectiveness of our proposed methods, we apply them to various scenarios, including Gaussian families, Gaussian mixtures, bimodal distributions, and Bayesian inverse problems in numerical examples. The sampling algorithm exhibits superior accuracy and faster convergence when compared to classical Langevin dynamics-type sampling algorithms.

Submitted 12 March, 2025; v1 submitted 23 January, 2024; originally announced January 2024.

Comments: Revised version

arXiv:2401.09547 [pdf, other]

A deep learning algorithm for computing mean field control problems via forward-backward score dynamics

Authors: Mo Zhou, Stanley Osher, Wuchen Li

Abstract: We propose a deep learning approach to compute mean field control problems with individual noises. The problem consists of the Fokker-Planck (FP) equation and the Hamilton-Jacobi-Bellman (HJB) equation. Using the differential of the entropy, namely the score function, we first formulate the deterministic forward-backward characteristics for the mean field control system, which is different from the classical forward-backward stochastic differential equations (FBSDEs). We further apply the neural network approximation to fit the proposed deterministic characteristic lines. Numerical examples, including the control problem with entropy potential energy, the linear quadratic regulator, and the systemic risks, demonstrate the effectiveness of the proposed method.

Submitted 17 May, 2025; v1 submitted 17 January, 2024; originally announced January 2024.

MSC Class: 49N80 (Primary) 35Q89 (Secondary) ACM Class: G.1.6; G.1.8

arXiv:2401.07364 [pdf, other]

PDE Generalization of In-Context Operator Networks: A Study on 1D Scalar Nonlinear Conservation Laws

Authors: Liu Yang, Stanley J. Osher

Abstract: Can we build a single large model for a wide range of PDE-related scientific learning tasks? Can this model generalize to new PDEs, even of new forms, without any fine-tuning? In-context operator learning and the corresponding model In-Context Operator Networks (ICON) represent an initial exploration of these questions. The capability of ICON regarding the first question has been demonstrated previously. In this paper, we present a detailed methodology for solving PDE problems with ICON, and show how a single ICON model can make forward and reverse predictions for different equations with different strides, provided with appropriately designed data prompts. We show positive evidence for the second question, i.e., ICON can generalize well to some PDEs with new forms without any fine-tuning. This is exemplified through a study on 1D scalar nonlinear conservation laws, a family of PDEs with temporal evolution. We also show how to broaden the range of problems that an ICON model can address, by transforming functions and equations to ICON's capability scope. We believe that the progress in this paper is a significant step towards the goal of training a foundation model for PDE-related tasks under the in-context operator learning framework.

Submitted 21 January, 2024; v1 submitted 14 January, 2024; originally announced January 2024.

arXiv:2310.12513 [pdf, other]

Real space iterative reconstruction for vector tomography (RESIRE-V)

Authors: Minh Pham, Xingyuan Lu, Arjun Rana, Stanley Osher, Jianwei Miao

Abstract: Tomography has had an important impact on the physical, biological, and medical sciences. To date, most tomographic applications have been focused on 3D scalar reconstructions. However, in some crucial applications, vector tomography is required to reconstruct 3D vector fields such as the electric and magnetic fields. Over the years, several vector tomography methods have been developed. Here, we present the mathematical foundation and algorithmic implementation of REal Space Iterative REconstruction for Vector tomography, termed RESIRE-V. RESIRE-V uses multiple tilt series of projections and iterates between the projections and a 3D reconstruction. Each iteration consists of a forward step using the Radon transform and a backward step using its transpose, then updates the object via gradient descent. Incorporating a 3D support constraint, the algorithm iteratively minimizes an error metric, defined as the difference between the measured and calculated projections. The algorithm can also be used to refine the tilt angles and further improve the 3D reconstruction. To validate RESIRE-V, we first apply it to a simulated data set of the 3D magnetization vector field, consisting of two orthogonal tilt series, each with a missing wedge. Our quantitative analysis shows that the three components of the reconstructed magnetization vector field agree well with the ground-truth counterparts. We then use RESIRE-V to reconstruct the 3D magnetization vector field of a ferromagnetic meta-lattice consisting of three tilt series. Our 3D vector reconstruction reveals the existence of topological magnetic defects with positive and negative charges. We expect that RESIRE-V can be incorporated into different imaging modalities as a general vector tomography method.

Submitted 19 October, 2023; originally announced October 2023.

arXiv:2310.01605 [pdf, other]

Primal-dual hybrid gradient algorithms for computing time-implicit Hamilton-Jacobi equations

Authors: Tingwei Meng, Wenbo Hao, Siting Liu, Stanley J. Osher, Wuchen Li

Abstract: Hamilton-Jacobi (HJ) partial differential equations (PDEs) have diverse applications spanning physics, optimal control, game theory, and imaging sciences. This research introduces a first-order optimization-based technique for HJ PDEs, which formulates the time-implicit update of HJ PDEs as saddle point problems. We remark that the saddle point formulation for HJ equations is aligned with the primal-dual formulation of optimal transport and potential mean-field games (MFGs). This connection enables us to extend MFG techniques and design numerical schemes for solving HJ PDEs. We employ the primal-dual hybrid gradient (PDHG) method to solve the saddle point problems, benefiting from the simple structures that enable fast computations in updates. Remarkably, the method caters to a broader range of Hamiltonians, encompassing non-smooth and spatiotemporally dependent cases. The approach's effectiveness is verified through various numerical examples in both one and two dimensions, such as quadratic and $L^1$ Hamiltonians with spatial and time dependence.

Submitted 2 October, 2023; originally announced October 2023.

arXiv:2308.14945 [pdf, other]

Noise-Free Sampling Algorithms via Regularized Wasserstein Proximals

Authors: Hong Ye Tan, Stanley Osher, Wuchen Li

Abstract: We consider the problem of sampling from a distribution governed by a potential function. This work proposes an explicit score based MCMC method that is deterministic, resulting in a deterministic evolution for particles rather than a stochastic differential equation evolution. The score term is given in closed form by a regularized Wasserstein proximal, using a kernel convolution that is approximated by sampling. We demonstrate fast convergence on various problems and show improved dimensional dependence of mixing time bounds for the case of Gaussian distributions compared to the unadjusted Langevin algorithm (ULA) and the Metropolis-adjusted Langevin algorithm (MALA). We additionally derive closed form expressions for the distributions at each iterate for quadratic potential functions, characterizing the variance reduction. Empirical results demonstrate that the particles behave in an organized manner, lying on level set contours of the potential. Moreover, the posterior mean estimator of the proposed method is shown to be closer to the maximum a-posteriori estimator compared to ULA and MALA in the context of Bayesian logistic regression. Additional examples demonstrate competitive performance for Bayesian neural network training.

Submitted 2 October, 2023; v1 submitted 28 August, 2023; originally announced August 2023.

MSC Class: 65C05; 62G07

arXiv:2308.05061 [pdf, other]

Fine-Tune Language Models as Multi-Modal Differential Equation Solvers

Authors: Liu Yang, Siting Liu, Stanley J. Osher

Abstract: In the growing domain of scientific machine learning, in-context operator learning has shown notable potential in building foundation models, as in this framework the model is trained to learn operators and solve differential equations using prompted data, during the inference stage without weight updates. However, the current model's overdependence on function data overlooks the invaluable human insight into the operator. To address this, we present a transformation of in-context operator learning into a multi-modal paradigm. In particular, we take inspiration from the recent success of large language models, and propose using "captions" to integrate human knowledge about the operator, expressed through natural language descriptions and equations. Also, we introduce a novel approach to train a language-model-like architecture, or directly fine-tune existing language models, for in-context operator learning. We beat the baseline on single-modal learning tasks, and also demonstrated the effectiveness of multi-modal learning in enhancing performance and reducing function data requirements. The proposed method not only significantly enhanced the development of the in-context operator learning paradigm, but also created a new path for the application of language models.

Submitted 1 February, 2024; v1 submitted 9 August, 2023; originally announced August 2023.

arXiv:2306.11283 [pdf]

Computational Microscopy beyond Perfect Lenses

Authors: Xingyuan Lu, Minh Pham, Elisa Negrini, Damek Davis, Stanley J. Osher, Jianwei Miao

Abstract: We demonstrate that in situ coherent diffractive imaging (CDI), which harnesses the coherent interference between a strong and a weak beam illuminating a static and dynamic structure, can be a very dose-efficient imaging method. At low doses, in situ CDI can achieve higher resolution than perfect lenses with the point spread function as a delta function. Both our numerical simulation and experimental results show that the combination of in situ CDI and ptychography can reduce the dose by two orders of magnitude over ptychography. We expect that computational microscopy based on in situ CDI can be implemented in different imaging modalities with photons and electrons for low-dose imaging of radiation-sensitive materials and biological samples.

Submitted 3 May, 2024; v1 submitted 20 June, 2023; originally announced June 2023.

arXiv:2306.06287 [pdf, other]

Generalized optimal transport and mean field control problems for reaction-diffusion systems with high-order finite element computation

Authors: Guosheng Fu, Stanley Osher, Will Pazner, Wuchen Li

Abstract: We design and compute a class of optimal control problems for reaction-diffusion systems. They form mean field control problems related to multi-density reaction-diffusion systems. To solve proposed optimal control problems numerically, we first apply high-order finite element methods to discretize the space-time domain and then solve the optimal control problem using augmented Lagrangian methods (ALG2). Numerical examples, including generalized optimal transport and mean field control problems between Gaussian distributions and image densities, demonstrate the effectiveness of the proposed modeling and computational methods for mean field control problems involving reaction-diffusion equations/systems.

Submitted 9 June, 2023; originally announced June 2023.

Comments: 40 pages, 12 figures

arXiv:2305.03945 [pdf, other]

A first-order computational algorithm for reaction-diffusion type equations via primal-dual hybrid gradient method

Authors: Shu Liu, Siting Liu, Stanley Osher, Wuchen Li

Abstract: We propose an easy-to-implement iterative method for resolving the implicit (or semi-implicit) schemes arising in solving reaction-diffusion (RD) type equations. We formulate the nonlinear time implicit scheme as a min-max saddle point problem and then apply the primal-dual hybrid gradient (PDHG) method. Suitable precondition matrices are applied to the PDHG method to accelerate the convergence of algorithms under different circumstances. Furthermore, our method is applicable to various discrete numerical schemes with high flexibility. From various numerical examples tested in this paper, the proposed method converges properly and can efficiently produce numerical solutions with sufficient accuracy.

Submitted 6 May, 2023; originally announced May 2023.

Comments: Any feedback or comments are welcome

arXiv:2304.14574 [pdf, other]

doi 10.4310/AMSA.2024.v9.n2.a7

Primal-Dual Damping algorithms for optimization

Authors: X. Zuo, S. Osher, W. Li

Abstract: We propose an unconstrained optimization method based on the well-known primal-dual hybrid gradient (PDHG) algorithm. We first formulate the optimality condition of the unconstrained optimization problem as a saddle point problem. We then compute the minimizer by applying generalized primal-dual hybrid gradient algorithms. Theoretically, we demonstrate the continuous-time limit of the proposed algorithm forms a class of second-order differential equations, which contains and extends the heavy ball ODEs and Hessian-driven damping dynamics. Following the Lyapunov analysis of the ODE system, we prove the linear convergence of the algorithm for strongly convex functions. Experimentally, we showcase the advantage of algorithms on several convex and non-convex optimization problems by comparing the performance with other well-known algorithms, such as Nesterov's accelerated gradient methods. In particular, we demonstrate that our algorithm is efficient in training two-layer and convolution neural networks in supervised learning problems.

Submitted 8 May, 2023; v1 submitted 27 April, 2023; originally announced April 2023.

Comments: fixed typo in eq 2.6(b)

Journal ref: Annals of Mathematical Sciences and Applications, vol. 9, no. 2, pp. 467-504, 2024

arXiv:2304.07993 [pdf, other]

doi 10.1073/pnas.2310142120

In-Context Operator Learning with Data Prompts for Differential Equation Problems

Authors: Liu Yang, Siting Liu, Tingwei Meng, Stanley J. Osher

Abstract: This paper introduces a new neural-network-based approach, namely In-Context Operator Networks (ICON), to simultaneously learn operators from the prompted data and apply it to new questions during the inference stage, without any weight update. Existing methods are limited to using a neural network to approximate a specific equation solution or a specific operator, requiring retraining when switching to a new problem with different equations. By training a single neural network as an operator learner, we can not only get rid of retraining (even fine-tuning) the neural network for new problems, but also leverage the commonalities shared across operators so that only a few demos in the prompt are needed when learning a new operator. Our numerical results show the neural network's capability as a few-shot operator learner for a diversified type of differential equation problems, including forward and inverse problems of ordinary differential equations (ODEs), partial differential equations (PDEs), and mean-field control (MFC) problems, and also show that it can generalize its learning capability to operators beyond the training distribution.

Submitted 19 September, 2023; v1 submitted 17 April, 2023; originally announced April 2023.

Comments: The second and third authors contributed equally. This is an outdated preprint. Please refer to the updated version published in PNAS: www.pnas.org/doi/10.1073/pnas.2310142120 See code in https://github.com/LiuYangMage/in-context-operator-networks

arXiv:2303.08950 [pdf, other]

doi 10.1016/j.jcp.2023.112375

High order spatial discretization for variational time implicit schemes: Wasserstein gradient flows and reaction-diffusion systems

Authors: Guosheng Fu, Stanley Osher, Wuchen Li

Abstract: We design and compute first-order implicit-in-time variational schemes with high-order spatial discretization for initial value gradient flows in generalized optimal transport metric spaces. We first review some examples of gradient flows in generalized optimal transport spaces from the Onsager principle. We then use a one-step time relaxation optimization problem for time-implicit schemes, namely generalized Jordan-Kinderlehrer-Otto schemes. Their minimizing systems satisfy implicit-in-time schemes for initial value gradient flows with first-order time accuracy. We adopt the first-order optimization scheme ALG2 (Augmented Lagrangian method) and high-order finite element methods in spatial discretization to compute the one-step optimization problem. This allows us to derive the implicit-in-time update of initial value gradient flows iteratively. We remark that the iteration in ALG2 has a simple-to-implement point-wise update based on optimal transport and Onsager's activation functions. The proposed method is unconditionally stable for convex cases. Numerical examples are presented to demonstrate the effectiveness of the methods in two-dimensional PDEs, including Wasserstein gradient flows, Fisher-Kolmogorov-Petrovskii-Piskunov equation, and two and four species reversible reaction-diffusion systems.

Submitted 15 March, 2023; originally announced March 2023.

arXiv:2302.02308 [pdf, other]

doi 10.1016/j.jcp.2023.112346

High order computation of optimal transport, mean field planning, and mean field games

Authors: Guosheng Fu, Siting Liu, Stanley Osher, Wuchen Li

Abstract: Mean-field games (MFGs) have shown strong modeling capabilities for large systems in various fields, driving growth in computational methods for mean-field game problems. However, high order methods have not been thoroughly investigated. In this work, we explore applying general high-order numerical schemes with finite element methods in the space-time domain for computing the optimal transport (OT), mean-field planning (MFP), and MFG problems. We conduct several experiments to validate the convergence rate of the high order method numerically. Those numerical experiments also demonstrate the efficiency and effectiveness of our approach.

Submitted 5 February, 2023; originally announced February 2023.

MSC Class: 65M60

arXiv:2301.10301 [pdf, other]

A kernel formula for regularized Wasserstein proximal operators

Authors: Wuchen Li, Siting Liu, Stanley Osher

Abstract: We study a class of regularized proximal operators in Wasserstein-2 space. We derive their solutions by kernel integration formulas. We obtain the Wasserstein proximal operator using a pair of forward-backward partial differential equations consisting of a continuity equation and a Hamilton-Jacobi equation with a terminal time potential function and an initial time density function. We regularize the PDE pair by adding forward and backward Laplacian operators. We apply Hopf-Cole type transformations to rewrite these regularized PDE pairs into forward-backward heat equations. We then use the fundamental solution of the heat equation to represent the regularized Wasserstein proximal with kernel integral formulas. Numerical examples show the effectiveness of kernel formulas in approximating the Wasserstein proximal operator.

Submitted 24 January, 2023; originally announced January 2023.

arXiv:2301.00437 [pdf, other]

Neural Collapse in Deep Linear Networks: From Balanced to Imbalanced Data

Authors: Hien Dang, Tho Tran, Stanley Osher, Hung Tran-The, Nhat Ho, Tan Nguyen

Abstract: Modern deep neural networks have achieved impressive performance on tasks from image classification to natural language processing. Surprisingly, these complex systems with massive amounts of parameters exhibit the same structural properties in their last-layer features and classifiers across canonical datasets when training until convergence. In particular, it has been observed that the last-layer features collapse to their class-means, and those class-means are the vertices of a simplex Equiangular Tight Frame (ETF). This phenomenon is known as Neural Collapse (NC). Recent papers have theoretically shown that NC emerges in the global minimizers of training problems with the simplified "unconstrained feature model". In this context, we take a step further and prove the NC occurrences in deep linear networks for the popular mean squared error (MSE) and cross entropy (CE) losses, showing that global solutions exhibit NC properties across the linear layers. Furthermore, we extend our study to imbalanced data for MSE loss and present the first geometric analysis of NC under bias-free setting. Our results demonstrate the convergence of the last-layer features and classifiers to a geometry consisting of orthogonal vectors, whose lengths depend on the amount of data in their corresponding classes. Finally, we empirically validate our theoretical analyses on synthetic and practical network architectures with both balanced and imbalanced scenarios.

Submitted 18 June, 2023; v1 submitted 1 January, 2023; originally announced January 2023.

Comments: 75 pages, 20 figures, 4 tables. Hien Dang and Tho Tran contributed equally to this work
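
As an illustration of the simplex Equiangular Tight Frame (ETF) geometry mentioned in the abstract above, the following is a minimal sketch rather than code from the paper. It assumes the standard construction in which the K vertices are the rows of sqrt(K/(K-1)) * (I - (1/K) 1 1^T); the function names `simplex_etf` and `dot` are hypothetical helpers introduced only for this example. Under that assumption every vertex has unit norm and every pair of distinct vertices has inner product -1/(K-1).

```python
import math


def simplex_etf(k):
    """Rows of sqrt(k/(k-1)) * (I - (1/k) * ones), a standard simplex ETF (assumed form)."""
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


K = 4  # number of classes, chosen arbitrarily for the illustration
vertices = simplex_etf(K)

# Each vertex should have unit length.
norms = [math.sqrt(dot(v, v)) for v in vertices]

# Every pair of distinct vertices should share the same inner product, -1/(K-1).
cosines = [
    dot(vertices[i], vertices[j])
    for i in range(K)
    for j in range(K)
    if i != j
]

print(norms)
print(cosines)
```

The equal pairwise inner products are what "equiangular" refers to; the imbalanced-data result described in the abstract replaces this geometry with orthogonal vectors of class-dependent lengths, which this sketch does not cover.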

arXiv:2211.16757 [pdf, other]

Taming Hyperparameter Tuning in Continuous Normalizing Flows Using the JKO Scheme

Authors: Alexander Vidal, Samy Wu Fung, Luis Tenorio, Stanley Osher, Levon Nurbekyan

Abstract: A normalizing flow (NF) is a mapping that transforms a chosen probability distribution to a normal distribution. Such flows are a common technique used for data generation and density estimation in machine learning and data science. The density estimate obtained with a NF requires a change of variables formula that involves the computation of the Jacobian determinant of the NF transformation. In order to tractably compute this determinant, continuous normalizing flows (CNF) estimate the mapping and its Jacobian determinant using a neural ODE. Optimal transport (OT) theory has been successfully used to assist in finding CNFs by formulating them as OT problems with a soft penalty for enforcing the standard normal distribution as a target measure. A drawback of OT-based CNFs is the addition of a hyperparameter, $α$, that controls the strength of the soft penalty and requires significant tuning. We present JKO-Flow, an algorithm to solve OT-based CNF without the need of tuning $α$. This is achieved by integrating the OT CNF framework into a Wasserstein gradient flow framework, also known as the JKO scheme. Instead of tuning $α$, we repeatedly solve the optimization problem for a fixed $α$ effectively performing a JKO update with a time-step $α$. Hence we obtain a "divide and conquer" algorithm by repeatedly solving simpler problems instead of solving a potentially harder problem with large $α$.

Submitted 30 November, 2022; originally announced November 2022.

arXiv:2211.15779 [pdf, other]

Revisiting Over-smoothing and Over-squashing Using Ollivier-Ricci Curvature

Authors: Khang Nguyen, Hieu Nong, Vinh Nguyen, Nhat Ho, Stanley Osher, Tan Nguyen

Abstract: Graph Neural Networks (GNNs) have been demonstrated to be inherently susceptible to the problems of over-smoothing and over-squashing. These issues prohibit the ability of GNNs to model complex graph interactions by limiting their effectiveness in taking into account distant information. Our study reveals the key connection between the local graph geometry and the occurrence of both of these issues, thereby providing a unified framework for studying them at a local scale using the Ollivier-Ricci curvature. Specifically, we demonstrate that over-smoothing is linked to positive graph curvature while over-squashing is linked to negative graph curvature. Based on our theory, we propose the Batch Ollivier-Ricci Flow, a novel rewiring algorithm capable of simultaneously addressing both over-smoothing and over-squashing.

Submitted 31 May, 2023; v1 submitted 28 November, 2022; originally announced November 2022.

Comments: Accepted at ICML 2023; 24 pages, 4 figures

arXiv:2211.12997 [pdf, other]

doi 10.1073/pnas.2220469120

A Hamilton-Jacobi-based Proximal Operator

Authors: Stanley Osher, Howard Heaton, Samy Wu Fung

Abstract: First-order optimization algorithms are widely used today. Two standard building blocks in these algorithms are proximal operators (proximals) and gradients. Although gradients can be computed for a wide array of functions, explicit proximal formulas are only known for limited classes of functions. We provide an algorithm, HJ-Prox, for accurately approximating such proximals. This is derived from a collection of relations between proximals, Moreau envelopes, Hamilton-Jacobi (HJ) equations, heat equations, and Monte Carlo sampling. In particular, HJ-Prox smoothly approximates the Moreau envelope and its gradient. The smoothness can be adjusted to act as a denoiser. Our approach applies even when functions are only accessible by (possibly noisy) blackbox samples. We show HJ-Prox is effective numerically via several examples.

Submitted 28 May, 2023; v1 submitted 23 November, 2022; originally announced November 2022.
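
To make the Monte Carlo idea in the HJ-Prox abstract above concrete, here is a minimal one-dimensional sketch, not the authors' implementation. It assumes the estimator takes the softmax-weighted form prox_{tf}(x) ≈ E[y exp(-f(y)/δ)] / E[exp(-f(y)/δ)] with y drawn from a normal distribution with mean x and variance δt; the function name `hj_prox_1d`, the parameter names, and the default values are all hypothetical choices made for this example. It needs only blackbox evaluations of f, and for f(y) = y²/2 the exact proximal x/(1+t) is available for comparison.

```python
import math
import random


def hj_prox_1d(f, x, t=1.0, delta=0.5, n_samples=100_000, seed=0):
    """Monte Carlo proximal estimate in 1D (assumed softmax-weighted form).

    Samples y ~ N(x, delta * t) and returns the average of y weighted by
    exp(-f(y) / delta). Only function evaluations of f are required.
    """
    rng = random.Random(seed)
    sigma = math.sqrt(delta * t)
    samples = [rng.gauss(x, sigma) for _ in range(n_samples)]

    # Shift the exponents by their maximum so that exp() cannot overflow.
    scores = [-f(y) / delta for y in samples]
    shift = max(scores)
    weights = [math.exp(s - shift) for s in scores]

    return sum(w * y for w, y in zip(weights, samples)) / sum(weights)


def quadratic(y):
    """Test function f(y) = y^2 / 2, whose proximal is known in closed form."""
    return 0.5 * y * y


x0, t0 = 2.0, 1.0
approx = hj_prox_1d(quadratic, x0, t=t0)
exact = x0 / (1.0 + t0)  # prox_{t f}(x) = x / (1 + t) for f(y) = y^2 / 2

print(approx, exact)
```

In this sketch δ plays the role of the smoothing parameter: smaller values track the proximal more closely for non-quadratic f but concentrate the weights on fewer samples, so more samples are needed for a stable estimate.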

arXiv:2209.15092 [pdf, other]

Improving Generative Flow Networks with Path Regularization

Authors: Anh Do, Duy Dinh, Tan Nguyen, Khuong Nguyen, Stanley Osher, Nhat Ho

Abstract: Generative Flow Networks (GFlowNets) are recently proposed models for learning stochastic policies that generate compositional objects by sequences of actions with the probability proportional to a given reward function. The central problem of GFlowNets is to improve their exploration and generalization. In this work, we propose a novel path regularization method based on optimal transport theory that places prior constraints on the underlying structure of the GFlowNets. The prior is designed to help the GFlowNets better discover the latent structure of the target distribution or enhance its ability to explore the environment in the context of active learning. The path regularization controls the flow in GFlowNets to generate more diverse and novel candidates via maximizing the optimal transport distances between two forward policies or to improve the generalization via minimizing the optimal transport distances. In addition, we derive an efficient implementation of the regularization by finding its closed form solutions in specific cases and a meaningful upper bound that can be used as an approximation to minimize the regularization term. We empirically demonstrate the advantage of our path regularization on a wide range of tasks, including synthetic hypergrid environment modeling, discrete probabilistic modeling, and biological sequence design.

Submitted 29 September, 2022; originally announced September 2022.

Comments: 28 pages, 2 figures, 5 tables. Anh Do, Duy Dinh, and Tan Nguyen contributed equally to this work

arXiv:2208.00579 [pdf, other]

Momentum Transformer: Closing the Performance Gap Between Self-attention and Its Linearization

Authors: Tan Nguyen, Richard G. Baraniuk, Robert M. Kirby, Stanley J. Osher, Bao Wang

Abstract: Transformers have achieved remarkable success in sequence modeling and beyond but suffer from quadratic computational and memory complexities with respect to the length of the input sequence. Leveraging techniques including sparse and linear attention and hashing tricks, efficient transformers have been proposed to reduce the quadratic complexity of transformers but significantly degrade the accuracy. In response, we first interpret the linear attention and residual connections in computing the attention map as gradient descent steps. We then introduce momentum into these components and propose the momentum transformer, which utilizes momentum to improve the accuracy of linear transformers while maintaining linear memory and computational complexities. Furthermore, we develop an adaptive strategy to compute the momentum value for our model based on the optimal momentum for quadratic optimization. This adaptive momentum eliminates the need to search for the optimal momentum value and further enhances the performance of the momentum transformer. A range of experiments on both autoregressive and non-autoregressive tasks, including image generation and machine translation, demonstrate that the momentum transformer outperforms popular linear transformers in training efficiency and accuracy.

Submitted 31 July, 2022; originally announced August 2022.

Comments: 22 pages, 5 figures. arXiv admin note: substantial text overlap with arXiv:2110.07034

MSC Class: 65Pxx

Showing 1–50 of 145 results for author: Osher, S