Convergence of Langevin-simulated annealing algorithms with multiplicative noise
- by Pierre Bras and Gilles Pagès
- Math. Comp. 93 (2024), 1761-1803
- DOI: https://doi.org/10.1090/mcom/3899
- Published electronically: March 15, 2024
Abstract:
We study the convergence of Langevin-simulated annealing type algorithms with multiplicative noise, i.e., for $V : \mathbb{R}^d \to \mathbb{R}$ a potential function to minimize, we consider the stochastic differential equation $dY_t = -\sigma\sigma^\top \nabla V(Y_t)\,dt + a(t)\sigma(Y_t)\,dW_t + a(t)^2\Upsilon(Y_t)\,dt$, where $(W_t)$ is a Brownian motion, $\sigma : \mathbb{R}^d \to \mathcal{M}_d(\mathbb{R})$ is an adaptive (multiplicative) noise, $a : \mathbb{R}^+ \to \mathbb{R}^+$ is a function decreasing to $0$, and $\Upsilon$ is a correction term. This setting can be applied to optimization problems arising in machine learning; allowing $\sigma$ to depend on the position brings faster convergence in comparison with the classical Langevin equation $dY_t = -\nabla V(Y_t)\,dt + \sigma\,dW_t$. The case where $\sigma$ is a constant matrix has been extensively studied; however, little attention has been paid to the general case. We prove the convergence, for the $L^1$-Wasserstein distance, of $Y_t$ and of the associated Euler scheme $\bar{Y}_t$ to a measure $\nu^\star$ supported on $\operatorname{argmin}(V)$, and we give rates of convergence to the instantaneous Gibbs measure $\nu_{a(t)}$ of density $\propto \exp(-2V(x)/a(t)^2)$. To do so, we first consider the case where $a$ is a piecewise constant function, recovering the classical schedule $a(t) = A\log^{-1/2}(t)$. We then prove the convergence in the general case by bounding the Wasserstein distance to the piecewise constant case using ergodicity properties.
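To make the scheme concrete, the following is a minimal numerical sketch of the Euler discretization of the above SDE. It is not the authors' code: the toy potential, the diagonal noise map `sigma`, the finite-difference correction `upsilon` (using the convention $\Upsilon_i = \tfrac12\sum_j \partial_{x_j}(\sigma\sigma^\top)_{ij}$, whose constants may differ from the paper's), the step size, and the offset in the cooling schedule are all illustrative assumptions.

```python
import numpy as np

# Sketch of the Euler scheme for
#   dY_t = -sigma sigma^T grad V(Y_t) dt + a(t) sigma(Y_t) dW_t + a(t)^2 Upsilon(Y_t) dt
# with the logarithmic cooling schedule a(t) = A / sqrt(log t) (offset added to avoid t = 0).

d = 2      # dimension
A = 1.0    # cooling constant

def grad_V(y):
    # Toy potential V(y) = ||y||^4 / 4, so grad V(y) = ||y||^2 * y and argmin(V) = {0}.
    return np.dot(y, y) * y

def sigma(y):
    # Illustrative position-dependent (multiplicative) diagonal noise.
    return np.diag((1.0 + 0.5 / (1.0 + np.dot(y, y))) * np.ones(d))

def upsilon(y, eps=1e-5):
    # Correction term Upsilon_i = (1/2) sum_j d_{x_j} (sigma sigma^T)_{ij},
    # approximated by central finite differences.
    ups = np.zeros(d)
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        dSS = (sigma(y + e) @ sigma(y + e).T - sigma(y - e) @ sigma(y - e).T) / (2 * eps)
        ups += 0.5 * dSS[:, j]
    return ups

def a(t):
    # Classical schedule a(t) = A log^{-1/2}(t), shifted so that a(0) is finite.
    return A / np.sqrt(np.log(t + np.e))

def euler_langevin_sa(y0, h=1e-2, n_steps=100_000, seed=0):
    rng = np.random.default_rng(seed)
    y, t = np.array(y0, dtype=float), 0.0
    for _ in range(n_steps):
        s = sigma(y)
        drift = -s @ s.T @ grad_V(y) + a(t) ** 2 * upsilon(y)
        y = y + h * drift + a(t) * np.sqrt(h) * (s @ rng.normal(size=d))
        t += h
    return y

if __name__ == "__main__":
    # The final iterate should be close to the minimizer 0 of V.
    print("final iterate:", euler_langevin_sa(np.array([2.0, -1.5])))
```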
Bibliographic Information
- Pierre Bras
- Affiliation: Sorbonne Université, Laboratoire de Probabilités, Statistique et Modélisation, UMR 8001, case 158, 4 pl. Jussieu, F-75252 Paris Cedex 5, France
- MR Author ID: 1523514
- ORCID: 0000-0001-9577-687X
- Email: pierre.bras@sorbonne-universite.fr
- Gilles Pagès
- Affiliation: Sorbonne Université, Laboratoire de Probabilités, Statistique et Modélisation, UMR 8001, case 158, 4 pl. Jussieu, F-75252 Paris Cedex 5, France
- ORCID: 0000-0001-6487-3079
- Email: gilles.pages@sorbonne-universite.fr
- Received by editor(s): February 16, 2022
- Received by editor(s) in revised form: February 6, 2023
- Published electronically: March 15, 2024
- Additional Notes: The first author is the corresponding author
- © Copyright 2024 American Mathematical Society
- Journal: Math. Comp. 93 (2024), 1761-1803
- MSC (2020): Primary 62L20; Secondary 65C30, 60H35
- DOI: https://doi.org/10.1090/mcom/3899
- MathSciNet review: 4730248