Mathematics of Computation


Gradient descent for convex and smooth noisy optimization

by Feifei Hu and Mathieu Gerber
Math. Comp.
DOI: https://doi.org/10.1090/mcom/4103
Published electronically: June 9, 2025

Abstract:

We study gradient descent with backtracking line search (GD-BLS) to solve the noisy optimization problem $\theta_\star := \operatorname{argmin}_{\theta \in \mathbb{R}^d} \mathbb{E}[f(\theta,Z)]$, imposing that the objective function $F(\theta) := \mathbb{E}[f(\theta,Z)]$ is strictly convex but not necessarily $L$-smooth. Assuming that $\mathbb{E}[\|\nabla_\theta f(\theta_\star,Z)\|^2] < \infty$, we first prove that sample average approximation based on GD-BLS allows us to estimate $\theta_\star$ with an error of size $\mathcal{O}_{\mathbb{P}}(B^{-0.25})$, where $B$ is the available computational budget. We then show that this rate can be improved by stopping the optimization process earlier, once the gradient of the objective function is sufficiently close to zero, and using the residual computational budget to optimize, again with GD-BLS, a finer approximation of $F$. By iteratively applying this strategy $J$ times, we establish that $\theta_\star$ can be estimated with an error of size $\mathcal{O}_{\mathbb{P}}(B^{-\frac{1}{2}(1-\delta^{J})})$, where $\delta \in (1/2,1)$ is a user-specified parameter. More generally, we show that if $\mathbb{E}[\|\nabla_\theta f(\theta_\star,Z)\|^{1+\alpha}] < \infty$ for some known $\alpha \in (0,1]$, then this approach, which can be seen as a retrospective approximation algorithm with a fixed computational budget, allows us to learn $\theta_\star$ with an error of size $\mathcal{O}_{\mathbb{P}}(B^{-\frac{\alpha}{1+\alpha}(1-\delta^{J})})$, where $\delta \in (2\alpha/(1+3\alpha),1)$ is a tuning parameter. Beyond knowing $\alpha$, achieving the aforementioned convergence rates does not require tuning the algorithms' parameters to the specific functions $F$ and $f$ at hand, and we exhibit a simple noisy optimization problem for which stochastic gradient descent is not guaranteed to converge while the algorithms discussed in this work are.
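
To make the procedure concrete, the following is a minimal, purely illustrative Python sketch of the strategy described above: run GD-BLS on a sample average approximation of $F$, stop once the gradient of that approximation is close to zero, and spend the remaining budget on a finer approximation built from a larger fresh sample. The function names (gd_bls, retrospective_gd_bls), the geometric budget split, batch sizes, and stopping tolerances are assumptions made for illustration only; they are not the schedules analyzed in the paper, which are what yield the stated $\mathcal{O}_{\mathbb{P}}$ rates.

```python
# Illustrative sketch only (not the authors' implementation) of GD-BLS combined
# with progressively finer sample average approximations under a fixed budget.

import numpy as np


def gd_bls(grad, obj, theta, max_iters, tol, step0=1.0, beta=0.5, c=1e-4):
    """Gradient descent with backtracking (Armijo) line search on a fixed objective."""
    for _ in range(max_iters):
        g = grad(theta)
        if np.linalg.norm(g) <= tol:  # early stopping: gradient close to zero
            return theta
        step, f0 = step0, obj(theta)
        # Backtracking: shrink the step until the Armijo sufficient-decrease
        # condition holds (capped to avoid an endless loop in degenerate cases).
        for _ in range(50):
            if obj(theta - step * g) <= f0 - c * step * float(np.dot(g, g)):
                break
            step *= beta
        theta = theta - step * g
    return theta


def retrospective_gd_bls(f, grad_f, sample_z, theta0, budget, J=3, delta=0.75):
    """Estimate theta_star = argmin_theta E[f(theta, Z)] with a fixed total budget.

    Each of the J stages draws a fresh batch, optimizes the resulting sample
    average approximation of F with GD-BLS, and passes its output to the next
    stage, which uses a larger batch (a finer approximation) and a tighter
    gradient tolerance. The geometric budget allocation below is illustrative.
    """
    weights = np.array([delta ** (J - j) for j in range(1, J + 1)])
    stage_budgets = np.maximum(1, (budget * weights / weights.sum()).astype(int))
    theta = np.asarray(theta0, dtype=float)
    for b in stage_budgets:
        n = max(1, int(np.sqrt(b)))          # batch size for this stage (illustrative)
        iters = max(1, int(b) // n)          # iteration cap: roughly b gradient evaluations
        zs = [sample_z() for _ in range(n)]  # fresh data defining the approximation of F
        obj = lambda th, zs=zs: float(np.mean([f(th, z) for z in zs]))
        grad = lambda th, zs=zs: np.mean([grad_f(th, z) for z in zs], axis=0)
        theta = gd_bls(grad, obj, theta, max_iters=iters, tol=1.0 / np.sqrt(n))
    return theta


if __name__ == "__main__":
    # Toy usage: recover the mean of Z (here 3.0) by minimizing E[(theta - Z)^2].
    rng = np.random.default_rng(0)
    f = lambda th, z: float((th[0] - z) ** 2)
    grad_f = lambda th, z: np.array([2.0 * (th[0] - z)])
    sample_z = lambda: rng.normal(loc=3.0, scale=1.0)
    print(retrospective_gd_bls(f, grad_f, sample_z, np.zeros(1), budget=10_000))
```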
Bibliographic Information
  • Feifei Hu
  • Affiliation: School of Mathematics, University of Bristol, Fry Building, Woodland Road, Bristol, BS8 1UG, UK
  • Mathieu Gerber
  • Affiliation: School of Mathematics, University of Bristol, Fry Building, Woodland Road, Bristol, BS8 1UG, UK
  • MR Author ID: 1110126
  • ORCID: 0000-0001-6774-2330
  • Received by editor(s): November 21, 2024
  • Received by editor(s) in revised form: April 1, 2025
  • Published electronically: June 9, 2025
  • © Copyright 2025 American Mathematical Society
  • Journal: Math. Comp.
  • MSC (2020): Primary 65K10, 65C99
  • DOI: https://doi.org/10.1090/mcom/4103