Bulletin of the American Mathematical Society

The Bulletin publishes expository articles on contemporary mathematical research, written in a way that gives insight to mathematicians who may not be experts in the particular topic. The Bulletin also publishes reviews of selected books in mathematics and short articles in the Mathematical Perspectives section, both by invitation only.

ISSN 1088-9485 (online) ISSN 0273-0979 (print)

The 2024 MCQ for Bulletin of the American Mathematical Society is 0.84.

What is MCQ? The Mathematical Citation Quotient (MCQ) measures journal impact by looking at citations over a five-year period.


A mathematical perspective on transformers

by Borjan Geshkovski, Cyril Letrouit, Yury Polyanskiy and Philippe Rigollet;
Bull. Amer. Math. Soc. 62 (2025), 427-479
DOI: https://doi.org/10.1090/bull/1863
Published electronically: April 29, 2025

Abstract:

Transformers play a central role in the inner workings of large language models. We develop a mathematical framework for analyzing transformers based on their interpretation as interacting particle systems, with a particular emphasis on long-time clustering behavior. Our study explores the underlying theory and offers new perspectives for mathematicians as well as computer scientists.
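The interacting-particle interpretation mentioned in the abstract can be illustrated numerically. The sketch below is not the authors' code; it is a minimal, assumed discretization of self-attention dynamics for tokens constrained to the unit sphere, where each point moves toward a softmax-weighted average of the others and is projected back onto the sphere. The parameters (number of particles, dimension, inverse temperature `beta`, step size `h`, step count) are illustrative choices, not values taken from the paper.

```python
import numpy as np

def attention_step(X, beta, h):
    """One Euler step of projected self-attention dynamics on the sphere.

    Each row of X is a unit vector x_i; it moves along the tangential part of
    sum_j softmax_j(beta <x_i, x_j>) x_j and is renormalized afterward.
    """
    logits = beta * X @ X.T                                 # scaled pairwise inner products
    W = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable softmax
    W /= W.sum(axis=1, keepdims=True)
    V = W @ X                                               # attention output per particle
    V -= np.sum(V * X, axis=1, keepdims=True) * X           # project onto tangent space at x_i
    X = X + h * V
    return X / np.linalg.norm(X, axis=1, keepdims=True)     # retract back to the sphere

def interaction_energy(X, beta):
    """Pairwise interaction energy sum_{i,j} exp(beta <x_i, x_j>)."""
    return np.exp(beta * X @ X.T).sum()

# Random initial configuration of 8 particles on the unit sphere in R^3.
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)

E0 = interaction_energy(X, beta=4.0)
for _ in range(2000):
    X = attention_step(X, beta=4.0, h=0.05)
E1 = interaction_energy(X, beta=4.0)
# As the particles cluster, pairwise inner products grow and the energy increases.
```

Running this and inspecting the pairwise inner products `X @ X.T` shows them drifting toward 1 — the long-time clustering behavior the article analyzes.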
Bibliographic Information
  • Borjan Geshkovski
  • Affiliation: Department of Mathematics, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, Massachusetts 02139
  • MR Author ID: 1408216
  • Email: borjan@mit.edu
  • Cyril Letrouit
  • Affiliation: CNRS & Université Paris-Saclay, Laboratoire de mathématiques d’Orsay, 307 rue Michel Magat, Bâtiment 307, 91400 Orsay, France
  • MR Author ID: 1189697
  • Email: cyril.letrouit@universite-paris-saclay.fr
  • Yury Polyanskiy
  • Affiliation: Department of EECS, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, Massachusetts 02139
  • MR Author ID: 910504
  • ORCID: 0000-0002-2109-0979
  • Email: yp@mit.edu
  • Philippe Rigollet
  • Affiliation: Department of Mathematics, Massachusetts Institute of Technology, 77 Massachusetts Ave, Cambridge, Massachusetts 02139
  • MR Author ID: 751289
  • Email: rigollet@math.mit.edu
  • Published electronically: April 29, 2025
  • © Copyright 2025 American Mathematical Society
  • Journal: Bull. Amer. Math. Soc. 62 (2025), 427-479
  • MSC (2020): Primary 34D05, 34D06, 35Q83; Secondary 52C17
  • DOI: https://doi.org/10.1090/bull/1863
  • MathSciNet review: 4926874